I have a small database with 100 or so PDF files, all OCRed and showing up as PDF+Text. When I look at the “words” panel that shows what words are in each PDF, all is good. If I then search for these words, not all the files containing them show up. Case in point: I have 2 files, both car registration documents. They have the make of the car and my last name on them. If I search for both words, only one of the files shows up. Am I missing something? Thanks.
Could you send the documents to cgrunenberg - at - devon-technologies.com plus a screenshot of the Search window (showing the options and the search term)? Thanks!
Yes, I would like to send the files, but they are personal financial docs. I’ll try to find other docs that are not sensitive. I guess my real concern is this: if I can search each document individually and find the word/phrase, why does it not work when I search using DT?
There’s a possibility of an OCR error that makes the search fail on one of those documents, especially if you searched DEVONthink for a phrase as you see it in the image layer. The search runs against the text layer that OCR produced, which may not match what the image shows.
Use Data > Convert > to rich text on both PDFs. That will result in two new RTF documents that contain the text only.
Move those two RTF documents into a new group and search only in that group. Replicate the search that you had originally used. What happens? Repeat the search with only one of the terms from the phrase, and look at the occurrences in each document.
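The per-term check described above can also be sketched outside DEVONthink. This is only an illustration, not how DEVONthink implements its search: it assumes a plain whole-word, all-terms match, and the file names, sample text, and the missing_terms helper are all hypothetical. The second sample contains a typical OCR misread ("rn" in place of "m"), which is enough to drop that document from a search for both words.

```python
import re

def missing_terms(text, terms):
    """Return the search terms that do not occur as whole words in text."""
    # Case-insensitive whole-word comparison (an assumption for this sketch).
    words = set(re.findall(r"\w+", text.lower()))
    return [t for t in terms if t.lower() not in words]

# Hypothetical text layers, as they might look after converting to rich text.
docs = {
    "registration_a.rtf": "Vehicle registration Honda Accord owner Smith",
    "registration_b.rtf": "Vehicle registration Honda Accord owner Srnith",
}
terms = ["Honda", "Smith"]

# For each document, list the terms that an all-terms search would not find.
report = {name: missing_terms(text, terms) for name, text in docs.items()}

for name, missing in report.items():
    status = "matches all terms" if not missing else "missing: " + ", ".join(missing)
    print(f"{name}: {status}")
```

A document that reports a missing term here is one where the text layer does not contain the word as typed, even though the word is clearly readable in the image.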
OCR errors can result from a low-resolution image, proximity of graphics to text, a pencil mark in text, a coffee stain, etc.
If one of those PDFs was OCRed before the release of public beta 5, there might have been text dropouts. Make a copy of that PDF and use Data > Convert > to Searchable PDF on the copy. You will probably get better OCR accuracy (set OCR to high accuracy in Preferences), even though the resolution of the copy is not optimal.