pdf vs. pdf+text problem

nanosnack · March 14, 2017, 2:09am

I’ve added my books (as PDFs) to DEVONthink Pro Office, in hopes of being able to search them.

While some are labelled pdf, others are labelled pdf+text. However, OCR has been applied to all files beforehand. They are completely searchable.

When I do a search with DEVONthink, it returns all the hits (pdf and pdf+text), but doesn’t highlight the results in the pdfs like it does for the pdf+text files. Why can it find a word but can’t show me where in the book it is? Using Adobe, all the text of the pdfs is searchable, highlightable, copyable, etc. But DEVONthink treats these same files like images. Which I would understand, except it IS searching them and finding correct results.

I have tried to OCR or Convert the offending pdfs within DEVONthink to solve the problem, but the log says “This record has no image data for OCR!”

Could you help me understand the problem?

BLUEFROG · March 14, 2017, 3:21pm

Adobe is doing their own thing. We use Apple’s PDFKit. You should see the same behavior in Preview (though technically Apple could also allow Preview to do something we can’t).

nanosnack · March 14, 2017, 5:43pm

Yes I can search the pdf in Preview and the results are navigable and highlighted.

I’m test driving this software and trying to understand it before I buy it. I’m a researcher, not a PDF or OCR expert. I understand Adobe is a different product. Brass tacks: is there a way for DEVONthink to search the pdfs and show where the results are, or no? Because right now, it’s a no.

yooj · July 1, 2017, 6:32am

Same problem. I have fully OCRed files that do not appear as “PDF + text” in the Kind column, but just appear as PDF. These PDFs can be searched by DT, but do not open in the DEVONthink viewer; they only open externally.

My theory is that the files are PDF archives or some special sort of PDF.

BLUEFROG · July 1, 2017, 5:20pm

Post one such PDF, please. Thanks

Dirk · August 11, 2017, 8:00am

Did you find a solution to this? I’ve got the same problem with 1000+ new scanned, OCR’ed files (ScanSnap > DTPO).

Even with old indexed files (which previously worked fine) I now run into this problem after changing anything about the file: filename, location, rotation of single pages … after any change DTPO classifies the files as “pdf” instead of “PDF+Text”.

BLUEFROG · August 11, 2017, 3:49pm

@dirk: Hold the Option key and choose Help > Report Bug to start a Support Ticket.

Dirk · August 11, 2017, 3:53pm

Thank you, I already got an answer from Christian in the German-language forum. The new beta version seems to solve the problem, but I have to import and OCR the scanned documents again (sadly, just indexing them again didn’t help).