IFilterShop PDF+ IFilter Server Edition Release 3.1 README

CONTENT OF README FILE


General Information

PDF+ IFilter is an enhanced IFilter for Adobe PDF files. It extends Adobe PDF IFilter to extract both text and XMP metadata from PDF files. It can also work without Adobe PDF IFilter, in which case only XMP metadata is indexed. PDF+ IFilter supports the Dublin Core, XMP Basic and Adobe PDF schemas as well as custom XMP schemas. It can be extended to support other XMP core schemas such as Rights Management or Media Management. If your metadata needs are not covered by the core schemas, you may add custom schemas as extensions; refer to the "Support for custom XMP schemas" section for more information. For better integration with Microsoft applications, PDF+ IFilter also outputs common office document properties such as 'DocAuthor' and 'DocKeywords'. See the "Office Document Properties" section below for more information.

PDF+ IFilter supports Indexing Service, SharePoint Portal Server, SQL Server Full-Text Search, Windows Search Service and all other products based on Microsoft Search technology.


System Requirements

PDF+ IFilter supports the following Microsoft server operating systems:

PDF+ IFilter supports the following Microsoft desktop operating systems:

PDF+ IFilter supports the following Microsoft Search products


Information Retrieval

PDF+ IFilter extends Adobe PDF IFilter to extract text and XMP metadata from PDF files. It may also work without Adobe PDF IFilter, in which case only XMP metadata will be indexed.

Dublin Core Schema support

The Dublin Core Schema provides a set of commonly used properties.

Support for Dublin Core Schema is optional and enabled by default. To disable Dublin Core Schema support:

Please note. When Dublin Core Schema support is disabled, PDF+ IFilter does not output certain office document properties. See the "Office Document Properties" section below for more information.

PDF+ IFilter extracts the following XMP Dublin Core metadata:

XMP Dublin Core Metadata  Property Name  Property Type  Description
dc:contributor            contributor    VT_LPWSTR      Contributors to the resource (other than the authors)
dc:coverage               coverage       VT_LPWSTR      The extent or scope of the resource
dc:creator                creator        VT_LPWSTR      The authors of the resource (listed in order of precedence, if significant)
dc:date                   date           VT_FILETIME    Date(s) that something interesting happened to the resource
dc:description            description    VT_LPWSTR      A textual description of the content of the resource
dc:format                 format         VT_LPWSTR      The file format used when saving the resource
dc:identifier             identifier     VT_LPWSTR      Unique identifier of the resource
dc:language               language       VT_LPWSTR      Language of the document
dc:publisher              publisher      VT_LPWSTR      Publishers
dc:relation               relation       VT_LPWSTR      How the content relates to other resources
dc:rights                 rights         VT_LPWSTR      Informal rights statement
dc:source                 source         VT_LPWSTR      Unique identifier of the work from which this resource was derived
dc:subject                subject        VT_LPWSTR      An unordered array of descriptive phrases or keywords that specify the topic of the content of the resource
dc:title                  title          VT_LPWSTR      The title of the document, or the name given to the resource
dc:type                   type           VT_LPWSTR      A document type; for example, novel, poem, or working paper

In accordance with the Microsoft IFilter specification, PDF+ IFilter identifies each metadata property by a combination of Property Set and Property Name. All XMP Dublin Core metadata belong to the Property Set with GUID {DC099694-64F5-4371-9AA9-868846A5657E}.
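
As a rough illustration of addressing a property by Property Set GUID plus Property Name, the Python sketch below models that pair. The `PropertyKey` type and its `display` helper are hypothetical names used only for this example; they are not part of PDF+ IFilter or of any Microsoft API.

```python
import uuid
from typing import NamedTuple

# Property Set GUID for XMP Dublin Core metadata, as listed in this README.
DUBLIN_CORE_PROPSET = uuid.UUID("{DC099694-64F5-4371-9AA9-868846A5657E}")


class PropertyKey(NamedTuple):
    """Hypothetical model of a (Property Set, Property Name) pair."""
    propset: uuid.UUID
    name: str

    def display(self) -> str:
        # Render the GUID in the braces-and-uppercase style used in this README.
        return "{%s} %s" % (str(self.propset).upper(), self.name)


title_key = PropertyKey(DUBLIN_CORE_PROPSET, "title")
print(title_key.display())
```

Two properties with the same name but different Property Set GUIDs are distinct keys, which is why each schema in this document has its own GUID.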


XMP Basic Schema support

The XMP Basic Schema contains properties that provide basic descriptive information.

Support for XMP Basic Schema is optional and enabled by default. To disable XMP Basic Schema support:

Please note. When XMP Basic Schema support is disabled, PDF+ IFilter does not output certain office document properties. See the "Office Document Properties" section below for more information.

PDF+ IFilter extracts the following XMP Basic metadata:

XMP Basic Metadata  Property Name  Property Type  Description
xap:Advisory        Advisory       VT_LPWSTR      An unordered array specifying properties that were edited outside the authoring application
xap:BaseURL         BaseURL        VT_LPWSTR      The base URL for relative URLs in the document content
xap:CreateDate      CreateDate     VT_FILETIME    The date and time the resource was originally created
xap:CreatorTool     CreatorTool    VT_LPWSTR      The name of the first known tool used to create the resource
xap:Identifier      Identifier     VT_LPWSTR      An unordered array of text strings that unambiguously identify the resource within a given context
xap:MetadataDate    MetadataDate   VT_FILETIME    The date and time that any metadata for this resource was last changed. It should be the same as or more recent than xap:ModifyDate
xap:ModifyDate      ModifyDate     VT_FILETIME    The date and time the resource was last modified
xap:Nickname        Nickname       VT_LPWSTR      A short informal name for the resource

All XMP Basic metadata belong to the Property Set with GUID {BA64F93D-FBA6-4b75-8F7F-37FC8B493176}.


Adobe PDF Schema support

Adobe PDF Schema specifies properties used with Adobe PDF files.

Support for Adobe PDF Schema is optional and enabled by default. To disable Adobe PDF Schema support:

PDF+ IFilter extracts the following XMP Adobe PDF metadata:

XMP Adobe PDF Metadata  Property Name  Property Type  Description
pdf:Keywords            Keywords       VT_LPWSTR      External Keywords
pdf:PDFVersion          PDFVersion     VT_LPWSTR      PDF file version
pdf:Producer            Producer       VT_LPWSTR      Name of tool that created PDF document

All XMP Adobe PDF metadata belong to the Property Set with GUID {A2BAC514-218A-43E8-A3EF-7598A66B19BE}.


Support for custom XMP schemas

PDF+ IFilter is easily configurable for additional XMP core schemas and custom XMP schemas. To make your custom XMP schema searchable:

  1. Open registry key "HKEY_LOCAL_MACHINE\SOFTWARE\IFilterShop\PdfPlusFilter\CustomSchemas"
  2. Create a new key with custom XMP schema name. For example, for PDFx Schema the new entry can be:
    "HKEY_LOCAL_MACHINE\SOFTWARE\IFilterShop\PdfPlusFilter\CustomSchemas\PDFx Schema"
  3. Under the registry key created add the following String values:
    Registry value  Description                                                              Example for PDFx Schema
    NameSpace       URI for custom XMP schema                                                http://ns.adobe.com/pdfx/1.0/
    GUID            Property Set GUID that will be used by Indexing Service *                {2C443B1E-F1E2-404F-974D-E21FEF8E72AA}
    FileName        Full path to the text file with custom XMP schema properties mapping **  C:\IFilterShop\PdfPlusFilter\PDFxSchema.txt

    * GUID shall be a newly generated GUID

    ** FileName value is optional. If this value is missing then all properties within the schema will be indexed

    Each line in the text file referred to by the FileName value shall have the following structure:
    <XMP Metadata>;<Property Name>;<Property Type>, where

    For example, PDFx Schema property setup file can be defined as:
    CustomProp1;ProjectName
    CustomProp2;ProjetNum;VT_INT
    CustomProp3;ProjetStartDate;VT_FILETIME
    

  4. Close registry editor and restart all appropriate Search services
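
To check a mapping file before pointing the FileName value at it, a small validator can help. The Python sketch below parses lines of the form shown in step 3. It assumes the third field is optional (as the first example line suggests) and that an omitted type means VT_LPWSTR; that default is an assumption, not something this README states. The names `Mapping` and `parse_mapping` are hypothetical.

```python
from typing import List, NamedTuple

DEFAULT_TYPE = "VT_LPWSTR"  # assumption: type used when the third field is omitted


class Mapping(NamedTuple):
    xmp_name: str
    property_name: str
    property_type: str


def parse_mapping(text: str) -> List[Mapping]:
    """Parse '<XMP Metadata>;<Property Name>;<Property Type>' lines."""
    result = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        fields = [f.strip() for f in line.split(";")]
        # Accept two or three non-empty fields; reject anything else.
        if len(fields) < 2 or len(fields) > 3 or not all(fields):
            raise ValueError("malformed mapping line: %r" % raw)
        prop_type = fields[2] if len(fields) == 3 else DEFAULT_TYPE
        result.append(Mapping(fields[0], fields[1], prop_type))
    return result


# The example setup file from step 3, reproduced as written.
sample = """CustomProp1;ProjectName
CustomProp2;ProjetNum;VT_INT
CustomProp3;ProjetStartDate;VT_FILETIME
"""
for m in parse_mapping(sample):
    print(m)
```

Running the validator over a file before restarting the Search services catches lines with missing or extra fields early.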

Indexing of XMP sidecar files

A sidecar file is an alternative to storing metadata directly in the PDF file: the metadata is kept in a separate .xmp file with the same base name as the PDF file. Sidecars are typically used when the PDF file should not be edited directly.

PDF+ IFilter supports indexing of XMP sidecar files. When loaded for a PDF file, PDF+ IFilter first tries to locate an .xmp file with the same base name in the same location as the PDF file. If the .xmp file is found, PDF+ IFilter extracts XMP metadata from it. Otherwise, PDF+ IFilter extracts XMP metadata from the PDF file itself.
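
The lookup order described above can be sketched in a few lines of Python. This is only an illustration of the rule (sidecar first, then the PDF itself); the function name `metadata_source` is hypothetical and the actual filter is a COM component, not Python.

```python
import tempfile
from pathlib import Path


def metadata_source(pdf_path: Path) -> Path:
    """Return the sidecar .xmp file if present, otherwise the PDF itself."""
    sidecar = pdf_path.with_suffix(".xmp")  # same folder, same base name
    return sidecar if sidecar.is_file() else pdf_path


with tempfile.TemporaryDirectory() as tmp:
    pdf = Path(tmp) / "report.pdf"
    pdf.write_bytes(b"%PDF-1.4\n")
    print(metadata_source(pdf).name)  # no sidecar yet, so the PDF is used
    (Path(tmp) / "report.xmp").write_text("<x:xmpmeta/>")
    print(metadata_source(pdf).name)  # the sidecar now takes precedence
```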


Office Document Properties

PDF+ IFilter outputs the following standard Indexing Service properties as duplicates of certain XMP Dublin Core and XMP Basic properties when support for the Dublin Core Schema and the XMP Basic Schema is enabled.

Property Friendly Name  Property Set GUID                       Property Name  Description                   XMP Metadata
DocAuthor               {F29F85E0-4FF9-1068-AB91-08002B27B3D9}  4              Author of the document        dc:creator
DocCreatedTm            {F29F85E0-4FF9-1068-AB91-08002B27B3D9}  12             Time document was created     xap:CreateDate
DocKeywords             {F29F85E0-4FF9-1068-AB91-08002B27B3D9}  5              Keywords for the document     dc:subject
DocLastSavedTm          {F29F85E0-4FF9-1068-AB91-08002B27B3D9}  13             Time document was last saved  xap:ModifyDate
DocSubject              {F29F85E0-4FF9-1068-AB91-08002B27B3D9}  3              Subject of the document       dc:description
DocTitle                {F29F85E0-4FF9-1068-AB91-08002B27B3D9}  2              Title of the document         dc:title


Installation Instructions

The setup file is a self-extracting archive that must be downloaded and opened on the machine where you wish to use PDF+ IFilter.

  1. Stop all appropriate Search services.
  2. Uninstall any previous version of PDF+ IFilter.
  3. Start setup file and follow the on-screen instructions.
  4. Start all appropriate Search services.
  5. Re-index catalogs containing PDF files.


Text Extractor Setup

PDF+ IFilter uses a PDF text extractor to index the text of PDF documents. When installed standalone, PDF+ IFilter indexes only the XMP metadata embedded in a PDF document; a PDF text extractor must be installed on the machine for PDF+ IFilter to index PDF text. By default, PDF+ IFilter integrates with Adobe IFilter 9. To integrate with another PDF text extractor, follow the instructions below:

  1. Stop all appropriate Search services.
  2. Open registry key "HKEY_LOCAL_MACHINE\SOFTWARE\IFilterShop\PdfPlusFilter".
  3. Locate String value named "TextEngineCLSID".
  4. Change the value of "TextEngineCLSID" to the GUID of the PDF content indexer COM component. For example:
    PDF text extractor                     GUID
    Adobe PDF IFilter (ver. 5.0 or 6.0)    {4C904448-74A9-11d0-AF6E-00C04FD8DC02}
    Adobe PDF IFilter (ver. 8.x or later)  {E8978DA6-047F-4E3D-9C78-CDBE46041603}
  5. Start all appropriate Search services.
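
If the same change has to be made on several machines, the registry edit in steps 2 to 4 can be captured as a .reg file instead of being typed by hand. The Python sketch below only builds the text of such a file; it does not touch the registry. The dictionary labels ("adobe-5-6", "adobe-8+") and the function name are hypothetical, while the key path, value name and GUIDs are taken from the steps above.

```python
KEY = r"HKEY_LOCAL_MACHINE\SOFTWARE\IFilterShop\PdfPlusFilter"

# GUIDs copied from the table in step 4; the labels are made up for this sketch.
TEXT_EXTRACTORS = {
    "adobe-5-6": "{4C904448-74A9-11d0-AF6E-00C04FD8DC02}",
    "adobe-8+": "{E8978DA6-047F-4E3D-9C78-CDBE46041603}",
}


def text_engine_reg(extractor: str) -> str:
    """Build .reg file text that sets TextEngineCLSID for the chosen extractor."""
    clsid = TEXT_EXTRACTORS[extractor]
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        "[%s]" % KEY,
        '"TextEngineCLSID"="%s"' % clsid,
        "",
    ]
    return "\r\n".join(lines)


print(text_engine_reg("adobe-8+"))
```

Save the output to a file with a .reg extension and import it while the Search services are stopped, as in steps 1 and 5.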


Multiple Properties Output

By default, PDF+ IFilter outputs multiple instances of a property as multiple properties. Some products, such as SharePoint Portal Server 2003, can index only one instance of a property. PDF+ IFilter can be configured to output multiple instances of a property as a single-value property. To enable this:

  1. Stop all appropriate Search services.
  2. Open registry key "HKEY_LOCAL_MACHINE\SOFTWARE\IFilterShop\PdfPlusFilter"
  3. Change the "MultipleInstancesMode" registry value to "1". If this value is set to "0" or is missing, PDF+ IFilter outputs multiple instances of a property as multiple properties.
  4. Start all appropriate Search services.
  5. Re-index catalogs containing PDF files.
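
The difference between the two modes can be pictured with a short Python sketch. It is a conceptual model only: this README does not say how the values are combined when "MultipleInstancesMode" is "1", so the "; " separator and the `emit` function below are assumptions made for illustration.

```python
from typing import List, Tuple

SEPARATOR = "; "  # assumption: the real separator is not documented here


def emit(name: str, values: List[str], mode: int) -> List[Tuple[str, str]]:
    """Return the (property, value) pairs an indexer would receive."""
    if not values:
        return []
    if mode == 1:
        # Single-value mode: all instances collapsed into one property.
        return [(name, SEPARATOR.join(values))]
    # Default mode: one property per instance.
    return [(name, v) for v in values]


authors = ["Jane Doe", "John Roe"]
print(emit("creator", authors, 0))
print(emit("creator", authors, 1))
```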


Additional Setup Steps

Some Microsoft Search products require additional setup steps as described below:

SharePoint Portal Server 2003:

  1. Open "Site Settings" web page
  2. In the "Search Settings and Indexed Content" section click on "Configure search and indexing"
  3. Click on "Include file types"
  4. Make sure that ".pdf" file type is included

Office SharePoint Server 2007:

  1. Open Shared Services Provider Admin Site
  2. In the "Search" section click on "Search settings"
  3. Click on "File type inclusions"
  4. Make sure that ".pdf" file type is included

Windows SharePoint Services 3.0:

  1. Open registry key "HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Shared Tools\Web Server Extensions\12.0\Search\Applications\<WSS Server Name>\Gather\Search\Extensions\ExtensionList"
  2. Add ".pdf" extension to the list of indexable file types
  3. Restart Windows SharePoint Services Search

SharePoint Server 2010:

  1. In SharePoint Central Administration go to "General Application Settings" page
  2. In the "Search" section click on "Farm-Wide Search Administration"
  3. Click on the "Search Service Application" link
  4. On the left side menu select "File Types"
  5. Make sure that ".pdf" file type is included


How to Uninstall

If you ever have to uninstall PDF+ IFilter application you can easily do this using any of the following methods:


Known Issues

PDF+ IFilter extracts only metadata or only content of PDF files

PDF+ IFilter uses PDF text extractor to index text of PDF files. When installed standalone, PDF+ IFilter extracts XMP metadata only. You have to install Adobe PDF IFilter or other PDF text indexer in addition to PDF+ IFilter in order to search both metadata and content of PDF files. Please refer to "Text Extractor Setup" section for more information.

PLEASE NOTE (for Indexing Service only). When both IFilters are installed Indexing Service relies on registration order to choose which one to use. Each time you start Indexing Service it looks at the list of DLLs in the "HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\ContentIndex\DLLsToRegister" registry value and registers each of the DLLs in that order. To resolve this issue, move the registration for PDF+ IFilter DLL (PdfPlusFilter.dll) to the end of the list that is maintained by Indexing Service:
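
The reordering described above amounts to a simple list operation on the DLLsToRegister entries. The Python sketch below shows that operation on a plain list of strings; the sample paths are invented, and the function name `move_to_end` is hypothetical. Reading and writing the actual multi-string value (for example with Python's `winreg.QueryValueEx` and `winreg.SetValueEx` on Windows) is left out so the sketch runs anywhere.

```python
from typing import List


def move_to_end(dlls: List[str], dll_name: str = "PdfPlusFilter.dll") -> List[str]:
    """Return a copy of the list with entries for dll_name moved to the end."""

    def is_target(entry: str) -> bool:
        # Entries may be bare names or full paths; compare the last component.
        last = entry.replace("/", "\\").rsplit("\\", 1)[-1]
        return last.lower() == dll_name.lower()

    others = [e for e in dlls if not is_target(e)]
    targets = [e for e in dlls if is_target(e)]
    return others + targets


# Invented sample data for illustration only.
current = [
    r"C:\Example\PdfPlusFilter\PdfPlusFilter.dll",
    "offfilt.dll",
    r"C:\Example\Adobe\pdffilt.dll",
]
print(move_to_end(current))
```

The relative order of the other entries is preserved, so only the position of the PDF+ IFilter DLL changes.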


PDF+ IFilter does not extract text of PDF files when integrated with Adobe PDF IFilter versions 5.0 or 6.0.

Adobe PDF IFilter (ver. 5.0 and 6.0) is an apartment threaded IFilter. Apartment threaded IFilters behave abnormally on some server platforms where indexing process is multithreaded. Follow the steps below to make Adobe PDF IFilter work on these platforms.

  1. Change the registry value "HKEY_CLASSES_ROOT\CLSID\{4C904448-74A9-11d0-AF6E-00C04FD8DC02}\InprocServer32\ThreadingModel" from Apartment to Both
  2. Double-click the registry value "HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\ContentIndex\DLLsToRegister", and then click Modify
  3. Remove the entry corresponding to pdffilt.dll from the list. This will prevent Adobe PDF IFilter from re-registering itself each time you restart Indexing Service (in case Indexing Service is not disabled)
  4. Restart all appropriate Search services
  5. Perform a full rescan
If you ever want to reverse these changes, locate pdffilt.dll and re-register it with regsvr32.exe.


Additional Information

What is Adobe XMP?

Adobe eXtensible Metadata Platform (XMP) adds open-standards metadata to various types of content. It works by embedding metadata packets into the binary data file. XMP metadata can currently be embedded into various image files (GIF, PNG, JPEG, TIFF) and document files such as PDF, PostScript, Adobe Illustrator and Adobe FrameMaker. Metadata packets are specifically designed to preserve the consistency of the file, so that other applications are not affected. XMP metadata is rich and suits a wide variety of tasks. More information about Adobe XMP can be found at http://www.adobe.com/products/xmp.
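
To make the idea of an embedded packet concrete: an XMP packet is wrapped in `<?xpacket begin=... ?>` and `<?xpacket end=... ?>` processing instructions, so an uncompressed packet can be located by scanning the file bytes. The Python sketch below does that on an invented sample. It is not how PDF+ IFilter works internally, and it will not find packets stored in compressed or encrypted PDF streams.

```python
import re
from typing import Optional

# An XMP packet is delimited by <?xpacket begin=...?> and <?xpacket end=...?>.
PACKET_RE = re.compile(
    rb"<\?xpacket begin=.*?\?>(.*?)<\?xpacket end=.*?\?>", re.DOTALL
)


def find_xmp_packet(data: bytes) -> Optional[bytes]:
    """Return the body of the first uncompressed XMP packet, or None."""
    match = PACKET_RE.search(data)
    return match.group(1).strip() if match else None


# Invented sample: a few PDF-like bytes around a minimal packet.
sample = (
    b"%PDF-1.4\n...binary...\n"
    b'<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>'
    b'<x:xmpmeta xmlns:x="adobe:ns:meta/"></x:xmpmeta>'
    b'<?xpacket end="w"?>\n%%EOF'
)
print(find_xmp_packet(sample))
```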

What is Dublin Core?

Dublin Core is an initiative to create digital library metadata for the Web. Dublin Core is made up of 15 metadata (data that describes data) elements that offer expanded cataloging information and improved document indexing for search engine programs. Two forms of Dublin Core exist: Simple Dublin Core and Qualified Dublin Core. Simple Dublin Core expresses elements as attribute-value pairs using just the 15 metadata elements from the Dublin Core Metadata Element Set. Qualified Dublin Core increases the specificity of metadata by adding information about encoding schemes, enumerated lists of values, or other processing clues. While enabling searches to be more specific, qualifiers are also more complex and can pose challenges to interoperability. More information about Dublin Core may be found at http://www.dublincore.org.
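
As an illustration of Simple Dublin Core as attribute-value pairs, the Python sketch below reads the Dublin Core elements out of a small, invented XMP/RDF fragment using the standard library XML parser. The function name `dublin_core_pairs` and the sample values are hypothetical; the namespace URIs are the standard Dublin Core 1.1 and RDF ones.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

DC = "http://purl.org/dc/elements/1.1/"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def dublin_core_pairs(xml_text: str) -> Dict[str, List[str]]:
    """Collect Dublin Core elements as name -> list of values."""
    pairs: Dict[str, List[str]] = {}
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        if not elem.tag.startswith("{%s}" % DC):
            continue
        name = elem.tag.split("}", 1)[1]
        # Array-valued properties keep their items in rdf:li children.
        items = [li.text or "" for li in elem.iter("{%s}li" % RDF)]
        values = items if items else [(elem.text or "").strip()]
        pairs.setdefault(name, []).extend(values)
    return pairs


# Invented sample fragment for illustration only.
sample_xmp = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="">
    <dc:format>application/pdf</dc:format>
    <dc:creator>
      <rdf:Seq><rdf:li>Jane Doe</rdf:li><rdf:li>John Roe</rdf:li></rdf:Seq>
    </dc:creator>
  </rdf:Description>
</rdf:RDF>
"""
print(dublin_core_pairs(sample_xmp))
```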

PLEASE NOTE. Adobe XMP and PDF+ IFilter use Dublin Core version 1.1. The Dublin Core Metadata Initiative has since extended the set with more elements, making the previous specification obsolete. This should not affect Adobe XMP or PDF+ IFilter, because the current specification is fully backwards compatible with version 1.1.


What's new in this version

Version 3.1

Version 3.0

Version 2.2

Version 2.1

Version 2.0


Contact Information

WWW:
http://www.ifiltershop.com
E-mail:
support@ifiltershop.com