DT4: Imported PDF files in the global inbox are automatically converted to type PDF+Text

jandubois · April 21, 2025, 5:13am

Even if I set “Convert incoming scans” to “No action”, the files will still automatically be “converted” to “PDF+Text”. The status window in the lower left corner says it is “Indexing Items”.

I put “converted” in quotes because the file seems unchanged. It just seems to have been indexed by the macOS “Live Text” feature, with no text layer actually added.

This is a bit confusing, as I can no longer easily see if the file has been processed by OCR or not. It also makes it difficult to run OCR because DEVONthink warns that the document has already been processed, even though it hasn’t.

I also don’t see how I could ever perform the “Convert Incoming Scans” action. At least for me it doesn’t trigger once the files have been “indexed”.

So far I feel like the explicit OCR is more accurate than the Live Text OCR performed by macOS, but maybe it is good enough; I need to experiment more.

Sorry for the rambling, but I was really getting confused by this.

BLUEFROG · April 21, 2025, 5:19am

It is not using macOS’ Live Text specifically, as can be easily proven by disabling it in the System Settings. But yes, DEVONthink 4 has a comparable feature that adds searchable text for most PDFs added to a database, without adding an explicit text layer. OCR is still recommended if you’ll be using the documents outside DEVONthink or sharing them outside your Mac.

From the version history…

Also, the OCR > Convert Incoming Scans setting specifically relates to exactly that: scans, e.g., those coming from ScanSnap scanners, etc. It does not apply to importing or indexing PDFs or images.

jandubois · April 21, 2025, 5:25am

Thanks for the super-quick response!

These are PDF files from a Doxie-Q scanner, saved by the Doxie.app as a PDF file and copied into the DEVONthink Inbox. What would make a scan a scan? Does it only apply to ScanSnap scanners?

BLUEFROG · April 21, 2025, 5:35am

You’re welcome

Saving and copying the PDF into the Global Inbox would not constitute it being a scan, just an imported file. And no, the setting doesn’t only apply to ScanSnap scanners.

I can’t speak to the Doxie scanners specifically, but go into the Doxie software’s Settings (or Preferences) > Local Apps and add DEVONthink. According to what I’m reading on their site, you should be able to scan then use the Send button and send the scan to DEVONthink. However, you would need to look into this further.
The only potential issue is I don’t know if the id for the Doxie software is recognized by DEVONthink, so you’d need to run some tests. If it is recognized, DEVONthink should run OCR once it receives the scanned PDF from the Doxie software.

PS: It is now 1:36am here and I am crashing. Feel free to report back and I’ll check in on this thread when I wake up

jandubois · April 21, 2025, 5:51am

Thank you!

Using the “Send” button works to send the scanned document directly into the global Inbox, but it shows up immediately as a PDF+Text document and doesn’t get processed by OCR (or auto-rotated to fix orientation).

Anyways, this is not urgent at all, now that I (think I) understand what is happening. I can just manually request OCR with an extra confirmation step. Might still be a bit confusing for others when they run into it the first time.

galsom · April 21, 2025, 9:18am

I had a smart rule in DT3 that looked in a subfolder of the Inbox for documents with a Word Count less than or equal to 1 so that it could automatically OCR them using Abbyy.

In DT4, those documents no longer get detected because they are already processed by Apple Vision. How can I modify this smart rule to still find them and automatically OCR them using Abbyy?
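
To make the problem concrete, here is a minimal Python sketch of the selection logic. It is not DEVONthink's smart-rule engine: the `Doc` record and its `word_count` and `has_text_layer` fields are hypothetical names for illustration, and nothing in this thread says a text-layer property is exposed to smart rules.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    name: str
    word_count: int       # words indexed for the item
    has_text_layer: bool  # hypothetical: True only if the PDF itself embeds text


def dt3_style_rule(docs):
    """Select items the way the DT3 rule did: word count <= 1."""
    return [d.name for d in docs if d.word_count <= 1]


def text_layer_rule(docs):
    """Hypothetical alternative: select items lacking an embedded text layer."""
    return [d.name for d in docs if not d.has_text_layer]


# A raw scan as DT3 saw it: nothing indexed, no text layer.
dt3_inbox = [Doc("scan.pdf", 0, False), Doc("invoice.pdf", 250, True)]
# The same scan in DT4: indexed on import, but still without a text layer.
dt4_inbox = [Doc("scan.pdf", 180, False), Doc("invoice.pdf", 250, True)]

print(dt3_style_rule(dt3_inbox))   # ['scan.pdf']
print(dt3_style_rule(dt4_inbox))   # []
print(text_layer_rule(dt4_inbox))  # ['scan.pdf']
```

The word-count test stops matching once the scan has been indexed on import, so a rule of this kind would need some other signal, such as whether an embedded text layer exists.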

cgrunenberg · April 21, 2025, 2:39pm

That’s not possible currently; we will probably add a new condition for this case.

galsom · April 21, 2025, 10:32pm

Thank you for the response @cgrunenberg.

Currently as a workaround I add all scanned documents into a separate subfolder of the Inbox and just OCR them using Abbyy, regardless of the word count.

This way, PDF files produced by e.g. utility providers, not requiring OCR, are separate from those I scan myself and only the latter ones are OCRed.

CAE · April 23, 2025, 5:46am

I don’t do it as a general rule but I also “re-OCR” PDFs on a fairly regular basis. Seems to me files I get have pages with text layers combined with pages without. I’ve never seen an issue after running OCR again, though YMMV of course.

cgrunenberg · April 23, 2025, 6:13am

DEVONthink 4 automatically indexes such pages (or pages having a corrupted text layer) too using Apple Vision when importing/indexing such files.
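
As a rough illustration of the mixed-page case, the following Python sketch flags pages whose embedded text is empty. It assumes the per-page text has already been extracted into a list of strings by some other tool; `pages_needing_ocr` and `min_chars` are hypothetical names, and this is not how DEVONthink or Apple Vision decide which pages to index. Detecting a corrupted text layer would need a stronger check than this.

```python
def pages_needing_ocr(page_texts, min_chars=1):
    """Return 1-based page numbers whose embedded text is missing or blank."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if len(text.strip()) < min_chars
    ]


# Pages 2 and 3 carry no usable text; pages 1 and 4 do.
pages = ["Invoice 2025-04", "", "   ", "Total due: 42.00"]
print(pages_needing_ocr(pages))  # [2, 3]
```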

CGDaveMac · April 26, 2025, 4:28pm

Thank you. Hopefully I will see the new condition when it’s introduced and update my rules too.

jandubois · May 8, 2025, 1:44am

This has been fixed in DT4b2 and works great now. Thank you!

BLUEFROG · May 8, 2025, 5:10am

Excellent and thanks for the follow-up. Cheers!

jandubois · May 8, 2025, 4:23pm

One thing that confused me initially: there is an option to move the originals to the trash. I expected this to mean the DEVONthink trash, next to the inbox, and not the macOS trash.

That said, I do prefer the current behaviour; the originals would just clutter up the internal trash. But maybe the docs could clarify that?

And totally unrelated: I noticed that the “Rename annotations automatically” option on Files|General is not documented. I guess the fact that I looked in the help means it isn’t clear to me what exactly it is supposed to do.

BLUEFROG · May 8, 2025, 4:27pm

That does mean moving the document to the database’s Trash when doing OCR within DEVONthink, e.g., with internally scanned documents. The behavior could be different with externally received scans and could be controlled by the scanner software.

I noticed that the “Rename annotations automatically” option on Files|General is not documented.

Noted for the next release.

jandubois · May 8, 2025, 6:09pm

How would that be controlled by the scanner software? After I “Send” the documents to DEVONthink, it depends on the “Convert incoming scans” setting if they will be simply added to the Inbox, or if they are processed by OCR. Isn’t it out of control of the scanner software once the file has been received by DEVONthink?

FWIW, I seem to end up with 2 copies of the document in the Trash, one from before, and one after OCR, and a copy of the latter also being in the Inbox.

Anyways, this is just FYI; I’m happy with the originals landing in the macOS trash, even though it is a little inconsistent with how other deletes are handled by DEVONthink.

jandubois · May 8, 2025, 6:15pm

I just realized that this would make a difference for an encrypted database, where the copies now end up in the unencrypted system trash. Moving them to the database trash would keep them encrypted.

bailob · June 4, 2025, 9:58am

I really need an option to turn off this “auto convert” feature, as it confuses me when working with OCRed and non-OCRed PDF+text files.

Moreover, I have a script that imports files and automatically copies the item link. In DT3, the script runs very quickly, but now when I import a large scanned document, the DEVONthink window tends to freeze for quite a while, and I have to wait for it to finish before I can get the item link.

cgrunenberg · June 4, 2025, 10:07am

Since beta 3, the type of such files is again PDF document, and only PDFs with a text layer have the type PDF+Text.
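
The change in type labelling described in this thread can be modelled with a small Python sketch. The function and parameter names are hypothetical, and the "before" behaviour is inferred from the reports earlier in the thread rather than from any DEVONthink documentation.

```python
def kind_before_beta3(has_text_layer, has_indexed_text):
    """Earlier betas, as reported above: indexed text alone gave PDF+Text."""
    return "PDF+Text" if (has_text_layer or has_indexed_text) else "PDF document"


def kind_since_beta3(has_text_layer, has_indexed_text):
    """Since beta 3: only an actual text layer gives PDF+Text."""
    return "PDF+Text" if has_text_layer else "PDF document"


# A scan that was indexed on import but has no embedded text layer.
print(kind_before_beta3(False, True))  # PDF+Text
print(kind_since_beta3(False, True))   # PDF document
```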

bailob · June 5, 2025, 2:37am

Thank you for your response! Based on that premise, does it mean that the import behavior now automatically performs OCR on all PDF files? I see two scenarios:

1. If I drag and drop a PDF into DT, there will be no window freeze, and the PDF document will convert to PDF+Text in seconds while processing in the background.
2. If I use a script to import a file, the window will freeze until the process is complete.

Regardless of the strategy, I believe that automatic OCR may not be the best approach, and it would be preferable if this feature could be disabled.