Problems with text layer corruption in DT Pro3?

annfran · February 7, 2020, 8:37pm

I am considering upgrading from DT Pro Office 2 (which I haven’t used in some time) to DT Pro 3, but I’d like some reassurance that the OCRing works and is safe. I am a Bookends user, and I have had endless problems with the text layer of my OCRd PDFs becoming corrupted. As soon as I highlight text from within Bookends, the text layer turns into a garbled mess. At this point, my only option is to remove the highlights, save the file as JPGs, and then reassemble it as a PDF and re-OCR it. This is, to understate things, very frustrating. My posts to the Bookends forum and various Google searches have revealed that this is likely a problem with Apple’s PDFKit and thus theoretically affects all programs that use it. I’m baffled by the fact that others aren’t complaining more, because as things stand my OCRd PDFs are in constant danger of corruption. One slip of the mouse from within Bookends and my text layer is gone. Before my upgrade to Catalina, I had a workaround: I could use FineReader to re-OCR the affected files, which were then protected from corruption within Bookends. (Before this, I had used Adobe Acrobat for OCRing.) But this no longer works, and documents OCRd in FineReader are equally susceptible to corruption. If annotating and/or OCRing from within DT Pro 3 would avoid this risk, I’d happily pay for it, but as far as I can tell DT is also based on PDFKit and FineReader. Can anyone offer any guidance?

annfran · February 8, 2020, 5:05am

I think I already answered my own question. I downloaded a trial version of DEVONthink 3.0 (there doesn’t seem to be a trial version of DT Pro 3.0) and imported an OCR’d document. As soon as I highlighted text in it, the OCR layer was corrupted. This is very depressing. I can’t possibly be the only one to have noticed this.

BLUEFROG · February 8, 2020, 5:13am

Welcome @annfran

This is indeed a problem with Apple’s PDFKit and there’s no known workaround at this time.

In DEVONthink, could you hold the Option key, select Help > Report Bug, and attach an unaltered PDF that would exhibit the issue?
Thanks.

annfran · February 8, 2020, 5:24pm

Thanks for confirming! It is at least nice to know that I’m not crazy. I’ll send the bug report right away. I’m sure that I’ll eventually still upgrade to DT3 Pro (which looks fabulous), but I’ll wait until Apple gets its act together so that I can get full functionality. Do let users know if there is some way we can exert some pressure in this direction.

BLUEFROG · February 8, 2020, 5:27pm

“Do let users know if there is some way we can exert some pressure in this direction.”

Don’t we wish it worked like this.
Unfortunately, Apple listens to the developer community far less than they used to. But we do still file radars, in the hope that something will get fixed.

carlito · March 22, 2021, 10:07pm

I’m having the same problem and have ended up with quite a few corrupted PDFs. Now, this is (apparently) not specific to DT, which I like very much. It also happens when annotating PDFs in Preview. I have been playing around with it and found that it only corrupts PDFs which are PDF/A compliant. The others (standard PDFs) seem to be fine!
Searching around, I found some information on Apple’s discussion site (sorry, can’t post the link here). You may look for thread 7899335 on discussions dot apple dot com …

It basically gives some explanation of Apple’s outdated PDF engine, which is not capable of handling PDF/A-compliant files and overwrites the text layer with gibberish.

So I wonder whether it would be possible for DT to check if a PDF is PDF/A compliant and then either warn the user or prevent any(!) alteration. That would be great.

Many thanks!

BLUEFROG · March 22, 2021, 11:52pm

Welcome @carlito

Could you open a support ticket and send us a few problematic PDFs to test? Thanks.

carlito · March 23, 2021, 8:44am

Hi Bluefrog,

thanks for your quick response. I have prepared a number of PDFs and tried to open a support ticket. Unfortunately, I can’t upload them due to file size. Is there any other way to get them over to you? I do not want to alter the originals, to protect them from corruption…

In any case, here’s what I’ve done (you may try this on any other scanned file):

I took a file created by a Konica-Minolta scanner (which only OCRs the first page) and fed it through ABBYY FineReader for full OCR. I saved two versions of the final result: a standard (non-PDF/A-compliant) PDF and a PDF/A-compliant version (this can be selected in the “advanced section” of the ABBYY Save As dialog). When opening the resulting PDFs and copying text from them into any text editor, everything looks fine.

Then I changed the PDFs in the following way: I highlighted a few sections and deleted a few pages using Preview, then saved them (or rather, let them be autosaved). After reopening the files, the standard PDF still looks fine, and text can be copied.

The PDF/A version is corrupted.

Please note that the corruption does not happen immediately. It took some time (e.g. one or two minutes) after the documents had been saved. There may be some caching behind this (which is really bad, like a time-delayed booby trap). Please let me know if there’s a way of getting those files over to you.

Rgds Carlito

chrillek · March 23, 2021, 9:10am

I think that might not be possible: PDF/A is a set of rules a PDF has to conform to. It is not simply a flag saying “Hey, I pretend to be PDF/A”. There’s a host of PDF/A validators out there you can throw your files at. DT would have to

either use one of those (costs? time?), or
write their own PDF interpreter (or use Ghostscript, which is free) to get at the content of the PDF and to be able to determine if it is PDF/A.
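
To illustrate the gap between a claim and real validation: a PDF/A file does declare the level it claims in its XMP metadata, through the `pdfaid:part` and `pdfaid:conformance` properties. The sketch below only reads that claim and does not validate anything. It assumes the XMP packet is stored uncompressed in the file, so a plain byte scan can find it, and it could be fooled by a file that merely contains those strings. The function names are hypothetical, not part of DT or of any library.

```python
import re
from typing import Optional

# XMP may carry the identifier as an element (<pdfaid:part>1</pdfaid:part>)
# or as an attribute (pdfaid:part="1"); both forms are matched here.
_PART = re.compile(rb'pdfaid:part\s*(?:=\s*["\']|>)\s*([1-4])')
_CONF = re.compile(rb'pdfaid:conformance\s*(?:=\s*["\']|>)\s*([A-Za-z])')


def claimed_pdfa_level(data: bytes) -> Optional[str]:
    """Return the PDF/A level the bytes claim (e.g. "1b"), or None.

    This reports a claim only. It is NOT a conformance check, and it
    assumes the XMP metadata is readable without decompression.
    """
    part = _PART.search(data)
    if part is None:
        return None
    level = part.group(1).decode("ascii")
    conformance = _CONF.search(data)
    if conformance is not None:
        level += conformance.group(1).decode("ascii").lower()
    return level


def claimed_pdfa_level_of_file(path: str) -> Optional[str]:
    """Hypothetical convenience wrapper: read a file and inspect its claim."""
    with open(path, "rb") as handle:
        return claimed_pdfa_level(handle.read())
```

A check like this might be enough to show a "this file claims to be PDF/A, annotate a copy instead" warning, but deciding whether the file really conforms would still need one of the validators mentioned above.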

There’s another thread which connects problems with destroyed text layers to the OCR engine used: “Preview in Big Sur Destroying PDFs Again” on Michael Tsai’s blog. That seems to match your experience.

BLUEFROG · March 23, 2021, 2:25pm

You could upload the files to a cloud service like Dropbox and send me a link in the ticket. Just ZIP them first.