Index or OCR for PDFs and Word?

Hotspur71 · October 30, 2009, 2:17am

Hi,

I’m fairly new to DT and thought I’d quiz some of the experts here as to their suggestions. Here’s the situation:

I have about 1,000 PDFs currently in Sente (the bibliography manager). I also have a couple of thousand old Word documents that I’d like to have scanned in so I can do searches to find stuff I’d completely forgotten about.
They are not OCRd.
The new version of Sente will apparently not allow DT to index PDF contents. That means I have to make a decision as to whether to move the PDFs out of Sente and get that data into DT to take advantage of its AI.
What’s confusing me is the difference between OCRing the PDFs (using DT Pro Office or Evernote) and Indexing them.
Just now I indexed about 25 PDFs into DT, which took about 20 seconds. I can search and use the See Also AI virtually as if they were OCRd – OR SO IT SEEMS. I have no idea why DT indexing is so fast.
If I can just index my 1,000 PDFs – which would take about, say, two hours – why should I spend days and days laboriously dragging PDFs into DT and OCRing them?
No doubt there must be some major difference I’m overlooking. I understand that with OCRing, the PDFs will be in DT while indexed ones will stay in their original folder. I also understand that with OCR I can copy text and paste it in another document. But the key question is: Is the search and See Also AI quality the same or was it just my imagination?

Anyway, hope you can help.

Thanks.

Greg_Jones · October 30, 2009, 11:30am

To clarify, it may help to discuss just what the differences are between OCR (and the files you may need to run OCR on), indexing, and importing. You can index documents in DTPO (leaving the files in their original location) or you can import documents into DTPO (adding a copy of the file to the DTPO database). These two functions are most comparable to each other in functionality.

With DTPO, you also have the option to OCR documents when importing them, while you cannot (directly; more on this later) OCR documents that are indexed. The important point is that OCR probably is not necessary for the large majority of the files that you are working with. OCR is needed when you physically scan hard copies of source material, or if the original electronic document was an image scan of a hard copy, as might be the case with an old book. Digital documents such as all your Word documents and the majority of PDF files (which are PDF + text) do not need to have OCR performed on them to be searchable and/or to have DTP’s AI work with them. That’s why you see the same search and See Also functionality with files that are indexed: they already contain text information.
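If you want a rough idea of which PDFs are image-only before touching DT at all, something like the Python sketch below may help. It is only a heuristic and not how DEVONthink itself decides: it assumes that a PDF with a text layer references at least one font (`/Font`), and that an image-only scan does not. The function name `looks_like_text_pdf` is made up for this example.

```python
import re
import zlib

# Matches the raw bytes between "stream" and "endstream" in a PDF file.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def looks_like_text_pdf(data: bytes) -> bool:
    """Heuristic (an assumption, not a guarantee): font references
    usually mean the PDF carries real text rather than only page images."""
    if b"/Font" in data:
        return True
    # Newer PDFs can keep their dictionaries inside Flate-compressed
    # streams, so try to inflate each stream and look again.
    for match in STREAM_RE.finditer(data):
        try:
            inflated = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not Flate data (for example a JPEG page image)
        if b"/Font" in inflated:
            return True
    return False
```

You would call it as `looks_like_text_pdf(open(path, "rb").read())`. Expect some misjudged files (encrypted PDFs, unusual filters), so treat a `False` as "check this one by hand" rather than as proof that OCR is needed.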

If you do have an image-only PDF document that you want to index rather than import into DTPO, you could always import it using ‘File>Import>Images (with OCR…)’, let DTPO do the OCR conversion, then export (and delete within DTPO) the document, replace the existing version of the document on your computer, and then index it back into DTPO.

If you are importing a large number of documents into DTPO, do not use the ‘File>Import>Images (with OCR…)’ method as your default; it will take forever and, as I stated, is unnecessary for the majority of the documents. If you import, say, 1,000 documents at once and there are a few that do not have text information, this will be reported in the Log window. You can then use the ‘Data>Convert>To Searchable PDF’ command to perform OCR on those documents.
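The same "import everything, OCR only the exceptions" idea can be sketched outside DT as a folder triage. This is just an illustration: the `triage` and `crude_has_text` names are hypothetical, the default check is a crude assumption (a PDF that mentions `/Font` in its raw bytes probably has text), and DT's Log window remains the real authority on which files lack text.

```python
from pathlib import Path
from typing import Callable, Dict, List


def crude_has_text(pdf: Path) -> bool:
    # Assumption: a font reference implies a text layer. PDFs that
    # compress their dictionaries may be wrongly flagged for OCR.
    return b"/Font" in pdf.read_bytes()


def triage(
    folder: Path,
    has_text: Callable[[Path], bool] = crude_has_text,
) -> Dict[str, List[Path]]:
    """Split a folder into files that can be indexed or imported as-is
    and PDFs that probably need OCR first."""
    result: Dict[str, List[Path]] = {"index_as_is": [], "needs_ocr": []}
    for path in sorted(folder.rglob("*")):
        if not path.is_file():
            continue
        suffix = path.suffix.lower()
        if suffix in {".doc", ".docx"}:
            # Word documents already contain text; no OCR required.
            result["index_as_is"].append(path)
        elif suffix == ".pdf":
            key = "index_as_is" if has_text(path) else "needs_ocr"
            result[key].append(path)
    return result
```

Passing your own `has_text` function lets you swap in a stricter check later without changing the loop. Only the short `needs_ocr` list would then go through the slow OCR step.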

Hotspur71 · October 30, 2009, 12:58pm

Greg, you just saved me a huge amount of tedious work. I had a scout around my PDF database and, as you predicted, about 90% of them were PDF + Text. The rest I’ll OCR in due course. Issue Resolved, Question Answered. Many thanks.