OCR Layer disappearing

jbp · November 15, 2019, 3:13am

Every now and then I seem to lose the text layer from certain PDFs. I’ve searched about for this, but the closest thing I can find is this thread. My issue appears different, however: I can’t trace the lost OCR layer to any particular crash.

I come across them when reviewing annotations: the underlines remain but the text is missing. Here is an example.

Any ideas?

cgrunenberg · November 15, 2019, 8:54am

This is most likely a PDFKit issue; PDFKit is not compatible with all PDF versions and features. Which version of macOS do you use, and what’s the source of the document? Are you able to reproduce this by editing a certain PDF document? If so, a copy of the document would be great. Thanks in advance.

jbp · November 19, 2019, 12:34am

Sorry, I really should have that drilled in by now. I’m on macOS Catalina 10.15.1. The PDFs are almost always retail versions of academic books, secured via our academic library. How they are actually produced, I couldn’t say.

I haven’t been able to reproduce it yet, but I will post again once I do.

cgrunenberg · November 19, 2019, 10:30am

This would be really helpful, thanks in advance.

jbp · February 8, 2020, 1:17am

I all but forgot this, more’s the pity. I mostly use the Highlights app for Markdown annotations, and it seems the problem documents have all been opened in Highlights. However, there is some other strange behaviour. Not only is the OCR layer destroyed, but somehow DEVONthink is still able to scan the document and recognise a duplicate.

In testing I tried to re-import a copy with a different name, and even though DEVONthink thinks the PDF has no text layer (it says the type is ‘PDF document’, not ‘PDF+text’), it must still pick up the text. Also, I can still underline text via DEVONthink in these documents, but it does not select any text. See image attached.

They always seem to be books, so documents with a large number of pages. I would like to send you one, but I’ll be honest, I’m not sure of the status of sharing these PDFs with the library copyright.

BLUEFROG · February 8, 2020, 3:39am

I would like to send you one, but I’ll be honest, I’m not sure of the status of sharing these PDFs with the library copyright.

Generally speaking, sending in such data for troubleshooting purposes is considered fair game.

Where did these “book PDFs” come from? Calibre?

jbp · February 10, 2020, 3:03am

OK, good to know.

Where did these “book PDFs” come from? Calibre?

Nothing mysterious about them. Most of them are the ebook versions offered in PDF format by academic publishers. Those are downloaded via our university library proxy. However, there are also a couple that have been scanned. To be honest, I was clutching at straws there.

None of them have come via Calibre. I’m aware of the weirdness that comes from producing PDFs in Calibre, I had a problem with the OCR layer being duplicated and the file sizes becoming huge. I have given up those heathen ways!

I will email you a PDF at the support address tonight.

BLUEFROG · February 10, 2020, 2:25pm

Thanks. I’ll be looking for it.

timj · May 6, 2020, 3:25pm

Has this been resolved?

I have what appears to be the same problem on Catalina (10.15.4) with the latest DT3 and DTTG. The direct import of some PDF books into DT3 results in a PDF in which everything works until it is annotated in DT3.

This is reproducible with PDFs created in the latest version of Calibre (4.15) and is related to this thread. The evidence seems to suggest that something about the way DT3 is handling the OCR layer is breaking the PDF. Note: this may well be due to Apple’s PDFKit issues, but it seems many of us need a way to address this problem just the same.

Drag-and-drop the PDF into DT3 – the Info pane initially shows the “Kind” as “PDF” but then it flashes to “PDF+Text.” Note the Kind and Size in this initial screenshot:

The TOC works as expected, searching the PDF works as expected:

In DT3’s interface, highlight a portion of text – the highlight displays the text in the Annotations pane as expected:

Navigate away from the PDF and then back to the PDF – the annotation is garbled:

External viewers (PDF Expert, Preview) confirm the corruption of the OCR layer:

Searching no longer works and the filesize has increased (from ~3MB to ~8MB in this example):

What do you suggest? I can send you the PDF for troubleshooting purposes if that would help. Thanks in advance!
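For anyone trying to reproduce the steps above, it can help to record a file's size and checksum before and after annotating it. The following is a minimal sketch using only the Python standard library; the function names `fingerprint` and `describe_change` are made up for illustration. Note that any saved annotation changes the checksum, so a changed hash does not by itself prove corruption; a large jump in size (such as the ~3MB to ~8MB seen here) is the more telling sign.

```python
import hashlib
from pathlib import Path


def fingerprint(pdf_path):
    """Return (size_in_bytes, sha256_hex) for the file at pdf_path."""
    data = Path(pdf_path).read_bytes()
    return len(data), hashlib.sha256(data).hexdigest()


def describe_change(before, after):
    """Summarise how a file changed between two fingerprint() results."""
    size_before, hash_before = before
    size_after, hash_after = after
    if hash_before == hash_after:
        return "unchanged"
    delta = size_after - size_before
    return f"rewritten, size changed by {delta:+d} bytes"
```

Typical use would be to call `fingerprint()` on an exported copy of the PDF, annotate it, call `fingerprint()` again, and pass both results to `describe_change()`.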

cgrunenberg · May 7, 2020, 8:32am

This is a PDFKit issue which unfortunately only Apple can fix (and yes, we filed a bug report and sent several example documents many months ago).

timj · May 7, 2020, 9:20am

I see. Do you know where I can learn more about the technical aspects of the problem? Given that it only happens to some PDFs, I’d like to understand the problem better so that I can attempt to find workarounds for these.

cgrunenberg · May 7, 2020, 10:32am

It’s related to the internals of the PDF document, e.g. which languages and fonts are used. But it’s hard to tell exactly when this bug will occur (otherwise Apple could probably fix it easily). PDF documents created on macOS should be fine though.

timj · May 8, 2020, 4:10pm

@cgrunenberg My assumption is that the macOS Preview app uses PDFKit, is that right? The reason I ask is that I discovered something I can’t explain and am wondering if you can help me. Here is the process I followed (all of it on macOS Catalina 10.15.4 and DT3.0.4):

1. Export a PDF from Calibre (4.15) and import it directly into DT3 – It displays correctly as “PDF+Text” and the TOC and search functionality is perfect.
2. Open the PDF in PDF Expert (directly from DT3) and add a highlight – The PDF structure and search functionality is unchanged, and the Annotations pane shows the highlight as expected.
3. Copy the PDF to a directory in Finder, then open it in Preview – (Preview does not seem to be able to make highlights to files stored in DT3 and offers to make a copy.) Make a highlight in the PDF using Preview, save and close the file.
4. Reopen the file in Preview, all highlights display correctly and search functionality is accurate – the OCR layer is fine, not corrupted by Preview.
5. Reimport the file into DT3 – everything works as expected, including highlights showing the correct text in the annotations pane, and searching working correctly.

So far, everything seems to work just fine. It is only when I annotate the file in DT3 that the OCR layer becomes corrupted.

Highlight the PDF using DT3 – After closing and restarting DT3, the OCR layer in the PDF is corrupt, highlights display garbled text in the annotations pane, and searching does not work.

I have no idea how the technical aspects of annotations work in DT3, but if Preview also uses PDFKit and can highlight the PDF without corrupting the OCR layer, then might it be possible that DT3’s implementation of annotations is introducing a bug into the PDF’s OCR layer?

I sure hope you can help! I rely on DT3 (and highlight PDFs constantly directly from DT3) every day. If there is anything I can do to help, please let me know. Thanks in advance!

timj · May 9, 2020, 7:56pm

@cgrunenberg and @BLUEFROG, thank you for all your hard work and for helping us figure out how to resolve this. I appreciate any help or suggestions you might have, as my entire PDF library is in DT3 and is now at risk…

Can you give us an idea of what options might be available in the days ahead to resolve this? For example, would it be possible to include a setting in DT3 that effectively disables any editing of PDFs in DT3, at least until this issue is resolved? As long as DT3 is only used to view these PDFs, they do not become corrupt. But it seems that if DT3 annotates them in any way, the OCR layer becomes corrupted. It is admittedly a kludge, but having an option to make all PDFs effectively read-only in DT3 would prevent users from accidentally clobbering them. Editing them externally (e.g., in PDF Expert) is far from ideal, but at least maintains the usability of the PDFs. Would this be possible?
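In the meantime, a rough approximation of the read-only idea at the file-system level might look like the sketch below. This is an assumption-laden illustration, not a DEVONthink feature: `make_pdfs_read_only` is a hypothetical helper, it only makes sense for PDFs kept in an ordinary Finder folder (not for files inside a database package), and whether DT3 or PDFKit would respect the permission bits has not been confirmed in this thread.

```python
import os
import stat
from pathlib import Path


def make_pdfs_read_only(folder):
    """Clear the write bits on every .pdf under folder; return the paths changed."""
    write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    changed = []
    for path in sorted(Path(folder).rglob("*.pdf")):
        mode = stat.S_IMODE(path.stat().st_mode)  # permission bits only
        read_only = mode & ~write_bits
        if read_only != mode:
            os.chmod(path, read_only)
            changed.append(path)
    return changed
```

Running it a second time on the same folder returns an empty list, since nothing is left to change. The pattern `*.pdf` is matched as written, so files with other extensions are not touched.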

Also, what is the best way to contact Apple about fixing PDFKit? I am guessing it is via this page, is that right?

Given that this issue with PDFKit is a known problem and apparently has been since High Sierra, I am guessing that you have considered rewriting the PDF engine in DT3 so that it does not depend on PDFKit. Since the ability to work flawlessly with PDFs is one of DT’s greatest strengths, is that something that might be forthcoming?

Thanks again!

BLUEFROG · May 10, 2020, 4:17pm

Yes, that link can be used by a developer to file bug reports. We have filed bug reports with Apple on many occasions, so filing doesn’t guarantee fixing.

I am guessing that you have considered rewriting the PDF engine in DT3 so that it does not depend on PDFKit. Since the ability to work flawlessly with PDFs is one of DT’s greatest strengths, is that something that might be forthcoming?

This would not be a trivial thing to do and no, there are no current plans to rewrite it. Some discussion of other frameworks has been had, but switching frameworks is also not a trivial thing to do. Additionally, changing frameworks could result in additional costs passed along to the consumer.

timj · May 11, 2020, 2:17pm

Additionally, changing frameworks could result in additional costs passed along to the consumer.

That is certainly understandable. I expect there are many of us that would be willing to pay for a version of DT that always handles all PDF files without corruption. Please consider this an enthusiastic vote for resolving this problem!

One other thought: All my databases are synced via Dropbox, but, because the databases are encrypted, I don’t know which files on Dropbox web correspond to the files in my database. Thus, if I annotate a PDF in DT3 and it corrupts the file, I could restore it to the previous version via Dropbox web… but I don’t know which file it is.

So I am wondering if it would be possible, for databases synced via Dropbox, to expose the location of the file in the sync store on Dropbox web, maybe in the context menu for the file. If I had an option in the menu “Show this file on Dropbox” I could (perhaps?) restore the previous version from the history of that file on Dropbox. (Alternatively, if I knew the location of the file in the local sync store (not in the database), I could use Finder to locate the file on Dropbox web.)

Is this possible? What do you suggest? Thanks in advance.

BLUEFROG · May 11, 2020, 2:22pm

because the databases are encrypted, I don’t know which files on Dropbox web correspond to the files in my database. Thus, if I annotate a PDF in DT3 and it corrupts the file, I could restore it to the previous version via Dropbox web… but I don’t know which file it is.

Actually, being encrypted doesn’t matter in this case. As noted in our Help > Documentation > In & Out > Sync > Q & A, DEVONthink’s sync doesn’t merely copy files to a sync location. It is raw, chunked, and optionally encrypted DEVONthink-specific data for use only with DEVONthink or DEVONthink To Go. You cannot go find your file in the sync data.
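Since the sync data cannot serve as a per-file version history, one alternative is to keep your own copy of a PDF before annotating it. Here is a minimal sketch using only the Python standard library; `backup_before_editing` and the backup folder are hypothetical names, and the sketch assumes you run it on a copy of the PDF exported to a regular folder.

```python
import shutil
import time
from pathlib import Path


def backup_before_editing(pdf_path, backup_dir):
    """Copy pdf_path into backup_dir under a timestamped name; return the copy's path."""
    source = Path(pdf_path)
    target_dir = Path(backup_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = target_dir / f"{source.stem}.{stamp}{source.suffix}"
    shutil.copy2(source, target)  # copy2 also preserves the modification time
    return target
```

Two backups of the same file made within the same second share a name, so the later one overwrites the earlier one; for occasional manual use that should rarely matter.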