Question about DT Pro OCR

Hey everyone, I’m new here and come with a question:

I am currently thinking about upgrading to DT Pro. I currently use Acrobat Clearscan OCR, and that is a complete nightmare on macOS: PDFkit completely messes these files up if I am not careful and edit them in anything built on PDFkit, including, unfortunately, Devonthink. I have talked to their support, but they seem reluctant to change the internal PDF engine to something that does not completely destroy the files (and instead opted to make them no longer editable).

So my question is: How does OCR in DT Pro work? I have a bunch of files that have TOCs, proper page numbers, and links in them. Does DT Pro OCR the files and add its layer to them, or does it make a copy that discards the table of contents and annotations?

I really want to unify my library, but I am not willing to put in all the work of creating TOCs and page numbers again, or to lose my annotations.

Thanks!

Welcome @johnnytravels

DEVONthink uses the ABBYY FineReader OCR engine (though not the same version that is offered to consumers).

There is currently a bug in the engine that can strip the table of contents when producing the new file. We have of course reported this to them but have not received a response.

Thank you @BLUEFROG
I have tried the OCR engine and it seems to work well. I will use it for smaller documents going forward. I have also noticed that TOCs disappear. This will of course prevent me from using it to unify my library, but at least it will help me read and annotate single articles from within DT. Too bad I had to pay an upgrade fee to get a feature that I wouldn’t have needed if DT had used a more reliable PDF engine in the first place.

I find it a bit cumbersome that you are still betting on macOS PDFkit when even Apple has abandoned it for iOS. There are faster and more reliable engines than PDFkit, and for an app like Devonthink that is supposed to help users unify their files, having users constantly worry about the code a particular file comes in (i.e. which OCR was used, whether PDFkit will destroy the OCR irreparably, whether it will bloat the file beyond measure, etc.) is not a good look.

Here’s a quote from Reddit user ‘rfog-rfog’ in reply to my question over there:

There is no good OCR solution in macOS. I have a lot of issues with PDFKit as well. In macOS, I use PDF Expert to read/annotate/edit (even inside DT – I open externally into PDF Expert). Another cheaper solution is PDF Reader from PSPDFkit, but don’t use integrated PDF framework, that is a crap. DT PDF OCR engine is ABBYY one, but it is not better than other solutions. The only advantage is it is integrated into DT and you can easily OCR a PDF or a image. I have a separated laptop running Windows and almost exclusively ABBYY Pro to do optimized OCR (I use to scan old books and generate facsimile PDF with OCR, but I cannot change or annotate inside macOS or a PDF of, say 50 MB becomes 1GB or similar).

I really hope that you can transition to another more reliable rendering engine quickly. If there’s additional cost involved, make it an extension that involves an upgrade fee. I for one would happily pay for that (as opposed to grudgingly have to pay for DT’s OCR feature).

You can already go with another OCR engine for a fee. ABBYY was cheap just recently, and then there’s PDFpen, too. Probably others. So the

is already there :wink:

I am referring to a PDF rendering engine, not an OCR engine.

Then I misunderstood that sentence apparently given that the thread seems to be about OCR…

Then there’s this. What feature did you pay for that you’d not have needed with another PDF engine?

The internal OCR engine - so I am able to edit and annotate files in Devonthink.
I have Acrobat and used to use its OCR but the resulting files are somewhat incompatible* with the macOS PDFkit that Devonthink uses.

*The OCR layer gets completely and irreparably destroyed if these files are edited with anything building on PDFkit and they are a nightmare to even select text from.

I use PDFpen sometimes with no negative effects that I can see in DEVONthink (or maybe I don’t notice). Perhaps they have a trial version.

Yes. PDFpen has its own engine, as do PDF Expert and PDF Viewer (and, of course, Acrobat). macOS comes with an engine that it uses system-wide for some of its PDF services; it’s called PDFkit and it’s horrendous. Devs can use it for their apps if they don’t want to shell out money for licensing fees in order to provide a better experience. Devontechnologies do that with Devonthink.

ok. works for me.

Great.
You would only notice the problems if you were to edit PDFs in Devonthink that were OCR’ed with Acrobat Clearscan technology. PDFkit, which DT uses, scrambles the OCR layer. It has to do with the fact that PDFkit does not support fonts that do not exist system-wide (those that you have in Font Book). Clearscan, however, works by analysing the visible font in the text and creating custom fonts out of it. It basically reprints the text in its own font, which in most cases greatly reduces file size.
However, since PDFkit cannot handle these fonts, it replaces them with random glyphs from other fonts. The result is a completely garbled text. And since Acrobat Pro only recognises the file as already OCR’ed, you cannot run OCR on it again and restore it. The file is forever ruined.

Did you try to remove the OCR layer in Acrobat and afterwards OCR again in DEVONthink? Never done it as I don’t use Acrobat so not sure it works.

How to remove OCR from a PDF? - Super User

Actually, if you re-OCR a document in DEVONthink, you’ll get a warning but it will create a per-page image and OCR each page.

That’s correct, but that entails that I purchase Pro, which is an additional cost incurred to alleviate an issue I would not have if Devonthink did not use PDFkit in the first place. Also, I may well lose some of the file’s metadata like the table of contents, and the OCR is only really accurate if I haven’t marked up my document on my Remarkable beforehand, because it does not discriminate between the text and graphical content (like lines, exclamation marks etc).
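
For anyone who wants to check whether a file kept its table of contents after OCR, here is a rough Python sketch. It is not how DEVONthink or ABBYY work internally; it only scans the raw bytes of the PDF for an `/Outlines` entry, which is the catalog key that points at the bookmark tree. The function names `has_outline_marker` and `compare` are made up for this example, and the check assumes the document catalog is stored uncompressed, so a negative result on the original file means "unknown" rather than "no TOC".

```python
import re
from pathlib import Path

# Matches an /Outlines key followed by an indirect reference such as "12 0 R".
# That is the catalog entry pointing at the bookmark tree (the TOC).
OUTLINES_REF = re.compile(rb"/Outlines\s+\d+\s+\d+\s+R")


def has_outline_marker(pdf_bytes: bytes) -> bool:
    """Heuristic: True if the raw bytes contain an /Outlines reference.

    Assumption: the catalog is not inside a compressed object stream.
    If it is, this returns False even when bookmarks exist.
    """
    return OUTLINES_REF.search(pdf_bytes) is not None


def compare(before: Path, after: Path) -> str:
    """Compare the original PDF with the OCR'ed copy (hypothetical helper)."""
    had = has_outline_marker(before.read_bytes())
    has = has_outline_marker(after.read_bytes())
    if had and not has:
        return "outline marker found before OCR but not after"
    if had and has:
        return "outline marker present in both files"
    return "no outline marker found in the original (inconclusive)"
```

Calling `compare(Path("original.pdf"), Path("ocred.pdf"))` with your own file names gives a quick first indication; opening both files in a PDF viewer and looking at the sidebar remains the reliable check.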