Discrepancy between word list and PDF's OCR?

yooj · October 23, 2014, 2:56pm

A perplexing anomaly: I search an OCRed (PDF + text) file for a word using the find function in DTPro Office’s PDF reader, and the word is not found. (Other PDF readers, such as Skim, do not find it either.) But DTPro Office’s word list, in its drawer, correctly shows each of the multiple instances of the word. Clicking the word’s entry in the word list drawer highlights the word as expected.

How can the word list display a word (multiple instances of it) that the find function cannot? Are different indexes used? I assumed that both used the PDF’s OCR layer. Can a PDF have more than one text layer?

Also, I made sure the PDF find was not case sensitive.

Thanks

yooj · October 23, 2014, 4:17pm

The best explanation I can come up with is that the word list always and automatically uses a fuzzy search, so it would recognize words that were mis-OCRed.
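To make the fuzzy-search hypothesis concrete, here is a minimal Python sketch of how an exact find can miss a mis-OCRed word that a fuzzy match still catches. This is only an illustration of the idea, not how DEVONthink's word list actually works; the sample text, the function names, and the 0.7 similarity cutoff are all assumptions made up for this example.

```python
import difflib
import re


def tokenize(text):
    """Split text into lowercase alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def exact_find(query, text):
    """Return the words that match the query exactly (case-insensitive)."""
    q = query.lower()
    return [w for w in tokenize(text) if w == q]


def fuzzy_find(query, text, cutoff=0.7):
    """Return distinct words similar to the query (cutoff is an assumed value)."""
    words = sorted(set(tokenize(text)))
    return difflib.get_close_matches(query.lower(), words, n=5, cutoff=cutoff)


# Hypothetical OCR output in which "modern" was misread as "rnodern".
ocr_text = "The rnodern library catalog"

print(exact_find("modern", ocr_text))  # exact matching finds nothing here
print(fuzzy_find("modern", ocr_text))  # fuzzy matching can still surface the misread word
```

If something like this were going on, the word list would show the word while a literal find over the OCR text would come up empty.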

cgrunenberg · October 24, 2014, 2:03pm

Older releases of DEVONthink sometimes used Spotlight’s mdimporter to index PDF documents (as its results were quite often a lot better than those of the PDFKit framework), but this could cause such inconsistencies. Therefore PDFKit is now always used for indexing, finding, and displaying. Does reimporting the document fix the issue?
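As a rough illustration of how such an inconsistency can arise, the Python sketch below builds a word list from one extractor's text and runs a find over another extractor's text for the same page. Both text strings and both function names are hypothetical stand-ins; this does not call Spotlight's mdimporter or PDFKit, and it is not DEVONthink's code.

```python
import re
from collections import Counter

# Hypothetical text layers that two different extractors might produce
# for the same page (the second one splits a word with a stray space).
IMPORTER_TEXT = "A discrepancy was noted. The discrepancy remains."
VIEWER_TEXT = "A discre pancy was noted. The discre pancy remains."


def build_word_list(text):
    """Count each lowercase word, as a word-list index might."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def find(query, text):
    """Return the start offsets of every case-insensitive literal match."""
    pattern = re.escape(query.lower())
    return [m.start() for m in re.finditer(pattern, text.lower())]


word_list = build_word_list(IMPORTER_TEXT)

print(word_list["discrepancy"])          # occurrences known to the index
print(find("discrepancy", VIEWER_TEXT))  # matches the viewer's find sees
```

When the index and the find operate on different extractions, the word list can report instances that a literal find never sees. Using one framework for everything removes that mismatch.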

yooj · October 24, 2014, 2:32pm

I duplicated the file in the DEVONthink directory and then imported the duplicate. The search behavior did not change. I do not have the original document to reimport.

Thank you for the possible explanation. When did DEVONthink switch indexing technologies? I’ve been running the most current DEVONthink for many years, and the documents at issue were all originally imported within the last eighteen months or so.

Thanks again

cgrunenberg · October 25, 2014, 12:35pm

Could you please send the document to cgrunenberg - at - devon-technologies.com? Which word did you search for, and which version of OS X do you use?