Last Reviewed: March 9, 2009
Article: DTS0108
Applies to: dtSearch 6, dtSearch 7
If a PDF file has a security password, dtSearch may not be able to open it to extract the text for indexing.
A PDF file may have a security password even if no password is needed to open it in Adobe Reader. For example, the password may prevent printing the document, changing the document, adding annotations, etc.
To see if a PDF file has a security password, open it in Adobe Reader and click File > Properties > Security. A dialog box will appear that will tell you if the file has a password. For information on indexing PDF files that have security, please see "Security passwords on PDF files."
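When there are too many files to check by hand in Adobe Reader, a rough screening pass can be scripted. The sketch below is not part of dtSearch. It assumes that a PDF with security settings names an /Encrypt entry in its trailer, and it simply searches the raw bytes for that key. The function name looks_encrypted is hypothetical. The result is a hint only: a file can contain the key inside unrelated data, so confirm any match in Adobe Reader.

```python
import re

def looks_encrypted(pdf_bytes):
    """Heuristic (assumption): report True if the raw PDF bytes contain
    an /Encrypt key, which a secured PDF lists in its trailer."""
    # \b keeps longer names such as /EncryptMetadata from matching.
    return re.search(rb"/Encrypt\b", pdf_bytes) is not None

def file_looks_encrypted(path):
    """Read a file from disk and apply the heuristic above."""
    with open(path, "rb") as f:
        return looks_encrypted(f.read())
```

Files flagged by this check are candidates for the File > Properties > Security check described above.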
Adobe Reader and Adobe Acrobat will automatically fix some file corruption problems in PDF files when a PDF file is opened.
To fix a single PDF file, open it in Adobe Acrobat and save it using File > Save As. This will usually fix any problems in the file and will also optimize the file for faster viewing. After saving the PDF file in Adobe Acrobat, try to index it again in dtSearch.
To fix a large number of PDF files at once,
(1) Start Adobe Acrobat Professional or later
(2) Click Advanced > Batch Processing...
(3) Select "Fast Web View"
(4) Click "Run Sequence"
(5) Select the files to process
These steps will repair and optimize all of the PDF files selected.
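Before running a batch repair, it can help to find which files show obvious signs of damage. The following is a minimal sketch, not a dtSearch or Adobe feature. It assumes that a well-formed PDF has a %PDF- header near the start and a startxref entry and %%EOF marker near the end, and it checks only for those. The function name structure_problems and the 1024-byte windows are choices made for this illustration. A file that passes can still be corrupt in other ways.

```python
def structure_problems(pdf_bytes):
    """Return a list of basic structural problems found in raw PDF bytes.
    An empty list means only that these few markers are present."""
    problems = []
    # The header is expected at or near the very start of the file.
    if b"%PDF-" not in pdf_bytes[:1024]:
        problems.append("missing %PDF- header")
    # The end-of-file marker and startxref are expected near the end.
    tail = pdf_bytes[-1024:]
    if b"%%EOF" not in tail:
        problems.append("missing %%EOF marker")
    if b"startxref" not in tail:
        problems.append("missing startxref")
    return problems
```

Files that report problems are good candidates for the Save As or batch steps above.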
Some PDF files contain either pure image data or text but no encoding information. In either of these cases, there is no text in the PDF file that can be indexed, and OCR is needed to add text to the PDF file. For information on OCR tools that can add text to a PDF file, see How to use dtSearch or dtSearch Web with OCR.
Steps to check whether a PDF file contains text
Some PDF files look like they have text in them but do not. Because the PDF file format is a graphical format, a PDF file can contain a picture of text with no words. To see if a PDF file contains text,
1. Open the file in Adobe Reader
2. Select the Text Select Tool (press "V" or click on the icon in the toolbar)
3. Try to select some text.
If you are unable to select text with the Text Select Tool, the file probably has no text in it. You will need to use an Optical Character Recognition (OCR) tool to convert the image to text. See How to use dtSearch or dtSearch Web with OCR for more information.
Steps to check whether a PDF file contains valid encoding information
Some PDF files contain text but use an encoding that is meaningless outside of the PDF file. For each character, the PDF file contains embedded font information that describes how to draw the character, but the characters do not correspond to an encoding that can be used to extract text from the file. As a result, the PDF file looks like a normal document but there is no meaningful text in the file. For more information, see the "PDF" section in Unicode support in dtSearch.
To see whether a PDF has valid encoding information,
1. Open the file in Adobe Reader
2. Select the Text Select Tool (press "V" or click on the icon in the toolbar)
3. Select a block of text
4. Click Edit > Copy
5. Open Notepad, Microsoft Word, or another program that can accept pasted text.
6. Click Edit > Paste
If you see what looks like random letters instead of the text you copied from the PDF file, the PDF file lacks encoding information. If a PDF file lacks encoding information, there is no way to index it with dtSearch.
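The "random letters" judgment in the paste test can be approximated in a script when many samples of pasted text need to be reviewed. This is a rough sketch, not something dtSearch provides. It assumes the documents are in English and measures what share of the words in the text belong to a short list of very common English words. Normal English prose usually scores well above zero, while text from a PDF without usable encoding tends to score near zero. The word list, the function name common_word_ratio, and any threshold you pick are assumptions for illustration.

```python
import re

# A small, hand-picked list of very common English words (assumption).
COMMON_WORDS = {
    "the", "of", "and", "to", "a", "in", "is", "that", "it", "for",
    "as", "with", "was", "on", "be", "by", "this", "are", "or", "an",
    "from", "at", "not", "have", "if",
}

def common_word_ratio(text):
    """Return the fraction of alphabetic words in text that appear in
    COMMON_WORDS. Returns 0.0 when the text contains no words."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in COMMON_WORDS)
    return hits / len(words)
```

For example, a paragraph copied from a normal PDF might score 0.3 or higher, while a score close to 0 on a long sample suggests the text should be inspected by eye as described above.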