What’s Wrong With My Dt-4 Non-OCRed PDF Smart Folder Operation?

Sherrell · April 24, 2025, 5:23pm

I have a Dt4 (ported from Dt3) document database consisting of almost 2000 historic documents – mostly PDFs. All of the documents were OCRed upon import to Dt or were OCRed with PDFPen Pro before they were imported to Dt.

I have a “Non-OCRed PDFs” Smart Folder set up with the following parameters:
Exclude Subgroups box NOT checked
Ignore Diacritics, Fuzzy Word Comparison, and Highlight Occurrences boxes NOT checked.
All of the following are true
Word Count is 0
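
As a rough illustration of what these criteria should select, here is a minimal Python sketch of an "all of the following are true" filter with a single "Word Count is 0" condition. The `Record` class, its fields, and the sample file names are hypothetical stand-ins for illustration, not DEVONthink's API or real data.

```python
from dataclasses import dataclass


@dataclass
class Record:
    # Hypothetical stand-in for a database item; not a DEVONthink type.
    name: str
    kind: str
    word_count: int


def smart_folder(records, predicates):
    """Return the records for which every predicate is true."""
    return [r for r in records if all(p(r) for p in predicates)]


# Illustrative sample data only.
records = [
    Record("deed-1887.pdf", "PDF+Text", 412),
    Record("scan-0042.pdf", "PDF", 0),
]

# "Word Count is 0" expressed as a single predicate.
non_ocred = smart_folder(records, [lambda r: r.word_count == 0])
print([r.name for r in non_ocred])
```

Under this reading, a document with a non-empty word list should never match, which is why the behavior described below looks wrong.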

This folder is populated with over 500 files – mostly PDFs that HAVE been OCRed.
If I randomly select one of the files by double-clicking in the smart folder list, I can see in the Inspector Window that the file has an extensive word list – though it’s clear in many cases there are many errors in the OCR text layer. I can also search for words within the document.

So what’s going on? The word count for each of these documents is clearly not zero. Yet there they are in the Smart Folder list…

Is something fishy going on, or am I just misunderstanding something?

Thanks to everyone for tutoring me!

Ray

BLUEFROG · April 24, 2025, 5:27pm

Were they OCRed in DEVONthink 3?
By what means, e.g., a smart rule?

Sherrell · April 24, 2025, 5:34pm

Yes, in Dt3.
I recall I simply looked at the “Kind” attribute in DT3’s file list and manually triggered an OCR if the “Kind” attribute didn’t show “PDF+Text”.
Most of the PDFs were OCRed before I imported them to Dt3.

Thanks so much for your quick response!

BLUEFROG · April 24, 2025, 5:42pm

You’re welcome.
Drag a few documents you think should not be matched as needing OCR out to the Finder. Select them in the Finder, Control-click them, and compress them. Then open a support ticket and attach the ZIP file.

silop · April 24, 2025, 5:52pm

In DT3, I had a smart rule that applied OCR to PDFs without embedded text. However, in DT4, any PDF imported into the database is immediately recognized as a PDF+Text file, even if it originally had no text layer. As a result, the smart rule no longer works as intended.

Is this something to do with the “Transcription” feature in DT4 or am I missing a detail?

Searchable Text: This is similar to Apple’s Live Text feature in that a text layer isn’t added to the document, but instead is stored in the database’s index and associated with the file.

Does it mean there’d be a text layer, or only DT+ will recognize the text?

Thank you!

BLUEFROG · April 24, 2025, 6:05pm

“Does it mean there’d be a text layer”

As noted… “a text layer isn’t added to the document”.
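
To make the distinction concrete, here is a toy Python sketch of searchable text that lives in a database index rather than in the file. Every name in it (`PdfFile`, `Database`, `import_file`, `search`) is hypothetical, and it is not a description of how DEVONthink is actually implemented; it only models the quoted statement that the text is stored in the index and associated with the file.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PdfFile:
    # Hypothetical model of a file on disk; text_layer is None for an image-only scan.
    name: str
    text_layer: Optional[str] = None


@dataclass
class Database:
    # Hypothetical index mapping file name -> searchable text held by the database.
    index: dict = field(default_factory=dict)

    def import_file(self, pdf, recognized_text):
        # Store text in the index only; the file object itself is not modified.
        self.index[pdf.name] = pdf.text_layer or recognized_text

    def search(self, word):
        # Whole-word, case-insensitive match against the indexed text.
        return sorted(
            name for name, text in self.index.items()
            if word.lower() in text.lower().split()
        )


scan = PdfFile("scan-0042.pdf")
db = Database()
db.import_file(scan, recognized_text="Deed of sale 1887")

print(db.search("deed"))   # the scan is found through the index
print(scan.text_layer)     # the file still has no text layer
```

In this model the document is searchable inside the database, yet the file exported to the Finder would still be an image-only PDF.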

“Is this something to do with the “Transcription” feature in DT4 or am I missing a detail?”

No, it’s not a transcription. That is for imported images and media files.

“As a result, the smart rule no longer works as intended.”

We are looking into this.

silop · April 24, 2025, 6:14pm

Thanks for the clarification.

BLUEFROG · April 24, 2025, 6:30pm

You’re welcome.

silop · April 24, 2025, 6:40pm

Side note:

Changing the setting for image/audio transcription also changes the behavior for imported PDF files. If transcription is set to annotation, a separate annotation text file associated with the PDF is created.

BLUEFROG · April 24, 2025, 7:14pm

That would be incorrect as a PDF is not an image. Transcription is specifically for images and media files.

If you have enabled Files > Import > Recognition > Transcribe text & notes in images, you would see an annotation file produced. But again, they are not the same thing and the recognition command is not necessary for DEVONthink to index a PDF with no text layer.

silop · April 25, 2025, 5:59am

Here are my settings:

For any non-OCRed PDF imported, the file is converted to PDF+Text and an annotation file is created…

BLUEFROG · April 25, 2025, 6:15am

Which is exactly what I said would happen… and that it’s not necessary. Read my response again.

silop · April 25, 2025, 6:58am

Yes, I got it. I just wanted to clarify the issue. Thanks a lot.