Tip: Force DT to recognise a PDF as OCRed when it says it has no text

rfog · April 17, 2021, 11:00am

There is an uncommon issue in DT where a PDF is marked as having no text although it clearly has some. It tends to happen during batch PDF imports or when many indexed PDFs change at once, but not always, and it is not reproducible. It simply happens.

I have this Smart Group in each database and a global one:

Then, every now and then, I get something like this:

It really is a PDF with text (I’m completely sure, as I scanned it from my books and OCRed it myself with ABBYY on Windows). But DT says it is a PDF without text.

The normal way to resolve this is:

1. Copy the PDF somewhere outside DT or the indexed folder.
2. Delete the PDF.
3. Empty the Bin.
4. Drop the saved PDF into DT, or move it back into the indexed folder.
5. Normally, DT should then recognise it as PDF+Text.
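
Since the risky part of this route is deleting the only copy, here is a minimal Python sketch of a checked copy for the first step. It is only an illustration: `safe_copy` and `sha256_of` are hypothetical helper names, not part of DT, and the sketch verifies file contents only. It makes no attempt to preserve Finder tags or comments.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def safe_copy(pdf_path, backup_dir):
    """Copy a PDF into backup_dir and confirm the copy is byte-identical.

    Hypothetical helper: refuses to overwrite an existing backup and
    raises if the copy does not match the original.
    """
    source = Path(pdf_path)
    target_dir = Path(backup_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / source.name
    if target.exists():
        raise FileExistsError(f"backup already exists: {target}")
    shutil.copy2(source, target)
    if sha256_of(source) != sha256_of(target):
        target.unlink()
        raise OSError(f"copy of {source} does not match the original")
    return target
```

Only once `safe_copy` has returned without an error would it make sense to delete the original and empty the Bin.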

However, I’ve found a faster and less dangerous trick:

1. Highlight something in the PDF.
2. Save, or wait until DT saves the modified file by itself.
3. After some time, it is recognised as PDF+Text.
4. Remove the highlight.

That's it. No need to risk losing the file, or losing external tags/comments.

Now the PDF is no longer in the No OCR section:

cgrunenberg · April 19, 2021, 9:54am

Is anything logged to Windows > Log when this happens?

rfog · April 19, 2021, 10:49am

No.

(I think it could be a synchronization issue when a file is being “touched” by different DT threads and/or macOS. For example, it happens a lot with some files if you index different items/folders in two databases at the same time, or if you update more than one folder of indexed PDFs. I try to avoid that, but sometimes it simply happens during normal daily work. All my indexed files are either in Dropbox or in iCloud Drive, always local, never placeholders.)

Perhaps a script that re-reads the PDFs in a selected folder would make the issue less relevant.
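
A minimal sketch of what such a script might look like, in Python. It rests on an untested assumption: that updating the modification time of an indexed PDF is enough to make DT read it again, in the same way the highlight-and-save trick changes the file on disk. `touch_pdfs` is a hypothetical helper name, and nothing here uses DT's own scripting interface.

```python
import os
import time
from pathlib import Path


def touch_pdfs(folder, now=None):
    """Set the modification time of every PDF under folder to `now`.

    Assumption (not confirmed for DT): a newer modification time
    prompts the indexer to re-read the file. File contents are
    never changed. Returns the sorted list of touched paths.
    """
    stamp = time.time() if now is None else now
    touched = []
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file() and path.suffix.lower() == ".pdf":
            # Keep the access time, change only the modification time.
            access_time = path.stat().st_atime
            os.utime(path, (access_time, stamp))
            touched.append(path)
    return touched
```

It would be called with the indexed folder, for example `touch_pdfs("/path/to/indexed/folder")` (a placeholder path), and walks subfolders as well.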

BLUEFROG · April 19, 2021, 1:27pm

How big is this file, file size and number of pages?

rfog · April 19, 2021, 2:52pm

It is a random thing. Sometimes they are web-scraped files (2 or 3 pages printed from Safari), other times they are books/magazines I scanned myself (from 2 to 100 MB, 100 to 600 pages)… It happens completely randomly. Not frequently, but it happens.

BLUEFROG · April 19, 2021, 3:51pm

If you import a PDF of 100+ pages, have you waited until DEVONthink has finished indexing the document to see if the type changes to PDF+Text?

rfog · April 19, 2021, 4:47pm

Of course.

Jim, this is an issue I’ve raised every now and then, and I think there is an answer from @cgrunenberg saying that it could not be solved.

That is the reason I searched for a workaround.

cgrunenberg · April 20, 2021, 10:09am

As long as indexing of the document doesn’t freeze (e.g. due to PDFKit issues or PDF corruption), this should always work (and index such long documents at least partially).

rfog · April 20, 2021, 10:54am

Yes, every now and then I got some partially indexed PDFs.

And that was the reason I published this tip: if a PDF is marked as having no text (for whatever reason) but really does have text, this trick “forces” DT to recognise it.

BTW, I didn’t know that text extraction was done by macOS itself, and now I understand why it fails, as each new version of macOS is buggier than the previous one.

BTW2, it happens a lot less on M1 than on Intel.

rfog · April 25, 2021, 1:21pm

It seems it has been resolved in DT version 3.7.

Great!!!