OCRing PDFs in DT Pro versus Acrobat - Strange Behavior

I’m a bit mystified by the workings of the OCR system in DT Pro Office. This is the problem I’m having.

  1. If I download a file from Google Books, it’s not PDF+Text, so I OCR it in DT Pro Office. The result is that a PDF of, say, 7MB balloons to 40MB.

  2. However, if I OCR that same PDF in Acrobat Professional, it becomes about 9MB. As far as I can tell, there is no diminution in sensitivity or accuracy: I can search text just as well by either method.

  3. More confusingly, when I drag that Acrobat-OCRd PDF into DT Pro, search barely works: it finds only occasional instances of a keyword or misses words altogether.

  4. If I open up that document in Preview, there’s no problem at all. It will find 26 uses of “Artillery” while DT Pro will find maybe 6 – or none at all.

  5. But if I use the command File>Import>Files&Folders to import that OCRd PDF into DT Pro Office, it searches it perfectly: no skipped words, lost text, etc.

This is extremely confusing. Is there something I’m doing wrong here?


a) Why is DT Pro OCR making PDFs into huge unwieldy files while Acrobat manages to keep them to a minimum – yet accuracy (at least to me) seems to be identical?

b) Is dragging an OCRd PDF into DT Pro NOT the same as doing it via File>Import>Files&Folders?

Is there some trick I’m missing here? I hope you can advise on this as the inconsistencies are playing havoc with my finding important stuff in my database. I have quite a few PDFs that were dragged in to DT Pro Office and I’m only now discovering that Acrobat seems to be doing a much more effective job of OCRing – though it’s a pain to have to File>Import each one rather than dragging. I should add that I’m not indexing PDFs, I’m keeping them in DT Pro itself.

By default DEVONthink searches for complete words, not substrings of them. But you can change this with either the asterisk wildcard or the ~ substring operator, e.g. *word* or ~word should find the same occurrences as Preview.
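
To illustrate why whole-word and substring matching can give different counts for the same term, here is a minimal Python sketch. It is not how DEVONthink or Preview actually implement search; the function names and the sample text are made up for illustration, and the fused word "Fieldartillery" only stands in for the kind of run-together text an OCR layer may contain.

```python
import re

def count_whole_word(text: str, term: str) -> int:
    """Count case-insensitive matches where the term stands alone as a word."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

def count_substring(text: str, term: str) -> int:
    """Count case-insensitive matches anywhere, including inside longer words."""
    return len(re.findall(re.escape(term), text, flags=re.IGNORECASE))

# Hypothetical text layer: one standalone hit, two hits buried in longer words.
sample = "Artillery fired. The artillerymen reloaded. Fieldartillery advanced."

whole = count_whole_word(sample, "artillery")
anywhere = count_substring(sample, "artillery")
print(f"whole-word matches: {whole}, substring matches: {anywhere}")
```

On this sample the whole-word count is lower than the substring count, which is the same kind of gap as 6 hits in one application versus 26 in another.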