DTPO/PDFkit corrupting/obliterating OCR layer?

scottlougheed · March 21, 2018, 1:14pm

I know that PDFkit has been a mess, and I’ve lost track of whether DTPO is, or is not, relying on PDFkit by default these days.

The issue occurred on macOS 10.13.4 with DTPO 2.9.17. The PDF in question was a scan that had been made and OCR’d by someone other than me (I don’t know the provenance).

I had a PDF+Text open in DTPO. I had confirmed that the OCR layer was present and accurate (using PDF Expert, Safari, and finally, in DTPO, but NEVER Preview). I began reading the document in DTPO. After making the first highlight, the application hung (beachball), then the Log window appeared with a row indicating “No Text” for the presently-open file.

Indeed, I confirmed using all of the above applications that the OCR layer was now toast. The document could not be searched, anything selected was just blank characters, etc.

I had to restore the document from a backup to retrieve an un-corrupted version.

I can’t trust DTPO as a PDF viewer if it, or the frameworks it relies upon, can corrupt OCR layers! Thankfully opening files in an external application is easy, but it’s all too easy to, on a whim, quickly open a PDF, modify it in some way that causes it to auto-save, and BAM, OCR gone.

Is this just part of the ongoing PDFkit issues? Is this a new problem?

cgrunenberg · March 21, 2018, 1:22pm

This is a rare but not a new PDFKit issue. It depends on the contents of the text layer; for example, it might happen with Eastern European languages. The only workaround is to edit/annotate such documents in applications that use their own PDF engine.

scottlougheed · March 21, 2018, 1:31pm

Good to know. The image of text is in English, and the OCR layer is, ostensibly, in English as well. Indeed this has not happened with every PDF I’ve opened, so it is clearly somewhat contingent on something to do with the OCR encoding.

Either way, this makes DTPO a bit of a minefield for viewing PDFs for me. I think I’ll have to abandon viewing PDFs in DTPO for the time being.

cgrunenberg · March 21, 2018, 1:43pm

Viewing PDF documents doesn’t cause any issues, only editing/annotating does. In addition, you’ll get a warning if this happens again (unlike, e.g., Preview or other apps using PDFKit).

BLUEFROG · March 21, 2018, 2:11pm

@scott: Did you confirm the contents of the image layer by converting the PDF to plain text or just visually deduce it was fine by selecting text?

scottlougheed · March 21, 2018, 2:29pm

Indeed – this will only happen if something is done to cause the file to be saved. However, I didn’t receive a pre-emptive warning. The only indication I received that this had already occurred was the Log popping up, and the concordance icon being greyed out (as well as the See Also results becoming totally whacko).

Either way, for now, I’m just going to avoid viewing (and modifying) PDFs in DTPO.

scottlougheed · March 21, 2018, 2:36pm

I confirmed that the OCR layer was initially intact by:

searching for text, which returned results that indicated accurate OCR
copying and pasting text, which produced pasted text that was accurate to the image of the text
using macOS’s “Lookup” feature, which correctly selected both the image and the text layer and produced appropriate results

(These are all more or less typical aspects of my day-to-day PDF workflow, so they were not initially done as a troubleshooting thing. They were initially done in pursuit of my usual daily tasks.)

After I made an annotation that resulted in DTPO saving the file, in addition to the Log warning:

Searching the document for any string returned no results (reproduced in several PDF applications)
Copying and pasting did not produce any text, or produced results that were garbled.
Lookup was unsuccessful because: a) text selection through force-click/three-finger tapping/double-clicking on a word failed because there was no text that could be intelligently identified as a word; b) manually selecting text by clicking and dragging clearly indicated that there was no separation between words, and using “Lookup” from the contextual menu produced no results because it was passed either a blank string or gibberish.
DTPO greyed out the concordance and other word-based buttons in the PDF viewer. See Also and Classify also displayed results that didn’t make sense (e.g., clearly did not relate to the visible contents of the file, and also differed from the initial results when I originally used See Also and Classify to classify it).

The observed changes lead me to believe that there was an accurate OCR text layer initially, and then a corrupt OCR text layer subsequently.

Restoring a version of the file from before DTPO auto-saved it brought back the original behaviour: search worked, text selection was predictable, Lookup worked, copy/paste produced proper and accurate text output, etc.
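
The before/after comparison described above (accurate text first, then blank or garbled text after a save) can be roughly automated. The following is only a minimal sketch, not anything DTPO or PDFKit provides: it assumes you have already extracted the text layer to a plain string by some means of your own (for example by converting the PDF to plain text), and the function names `text_layer_report` and `layer_degraded` are hypothetical. It also assumes English text, so it only counts ASCII-letter tokens as words.

```python
import re

# Hypothetical helper: a token is "word-like" if it consists of ASCII letters,
# optionally joined by an apostrophe or hyphen, with optional trailing punctuation.
# This is an assumption suited to English text only.
WORD_RE = re.compile(r"[A-Za-z]+(?:['’-][A-Za-z]+)*[.,;:!?]?")


def text_layer_report(text: str, min_word_ratio: float = 0.5) -> dict:
    """Summarize whether text extracted from a PDF looks like real words."""
    tokens = text.split()
    word_like = [t for t in tokens if WORD_RE.fullmatch(t)]
    ratio = len(word_like) / len(tokens) if tokens else 0.0
    return {
        "tokens": len(tokens),
        "word_ratio": ratio,
        # An empty layer, or one that is mostly non-words, is treated as suspect.
        "looks_ok": bool(tokens) and ratio >= min_word_ratio,
    }


def layer_degraded(before: str, after: str) -> bool:
    """True if the text looked fine before a save but not afterwards."""
    return text_layer_report(before)["looks_ok"] and not text_layer_report(after)["looks_ok"]
```

The 0.5 threshold is an arbitrary starting point; a scan with many numbers or tables would need a lower one.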

BLUEFROG · March 21, 2018, 2:54pm

Do you have secondary languages set in OCR?
What’s your primary OCR language set to?

scottlougheed · March 21, 2018, 4:50pm

My own ocr settings are English and only English.

However, this was not a file created by me and I did not perform the OCR on this specific file.

BLUEFROG · March 21, 2018, 4:54pm

Is it a file you can share in a Support Ticket?

scottlougheed · March 21, 2018, 4:56pm

Certainly, incoming shortly.

Ticket # 512522

ws8 · March 25, 2021, 2:47am

I’m not sure if I’m having the same problem as @scottlougheed, but from time to time, I have OCR’d text disappear from a document, leaving me with a blank page that still holds annotations (e.g. highlights). What’s odd is that the text is “there” in that it can be searched or copy/pasted, but it otherwise remains invisible.

I looked around other threads, but I didn’t come across any with this exact issue—happy to be pointed to another forum if this issue has already been addressed.

I also can’t determine what triggers this. It happened this morning when I woke my laptop up from sleep and brought the DT3 window to the front, but in the past it has happened simply when the PDF was opened in DT3. It also happens seldom enough that I kept ignoring the issue, getting a new copy of the pdf, etc.

Image one shows the blank page, with the other pages in the PDF still intact, visible in the thumbnails. Image two shows the search function still working, with the yellow highlight marking the “found” term.

cgrunenberg · March 25, 2021, 7:47am

Are Preview or Acrobat still able to display the PDF document? Which version of macOS do you use?

ws8 · March 25, 2021, 3:40pm

Opening the PDF in Preview or Acrobat results in the same thing—blank page, highlights and image remain, text is still searchable.

I’m using
macOS 11.2.3
DT3 3.6.2

cgrunenberg · March 25, 2021, 3:51pm

Do you still have an older copy of the document without the corruption?

ws8 · March 25, 2021, 3:56pm

Yes—I have the original, before it was OCR’d by DT

cgrunenberg · March 25, 2021, 3:58pm

Are you able to reproduce the issue by using a copy of the original? If so, the original file and some instructions on how to reproduce this would be great, thanks!

ws8 · March 25, 2021, 4:01pm

I don’t think I can reproduce it, unfortunately. The issue happens seemingly at random—this particular instance of the problem occurred at least six months after I OCR’d the PDF originally. The blank page simply appeared in the open PDF when I woke my laptop from sleep. I don’t think the act of OCRing is the immediate cause.

Maybe I’m misunderstanding your question, tho?

cgrunenberg · March 25, 2021, 4:03pm

Yes, it’s more likely caused by editing (e.g. highlighting) the PDF due to PDFkit issues.

ws8 · March 25, 2021, 4:31pm

Got it. Are there any behaviors/actions you would recommend avoiding to decrease the risk of this issue?
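
One precaution implied by the earlier posts, where an intact copy was recovered from a backup, is to snapshot a PDF before annotating it. This is only a sketch under stated assumptions: it assumes the PDF is an ordinary file you can reach on disk, the function name `snapshot_before_editing` and the backup folder are hypothetical, and it is not a feature of DEVONthink or PDFKit.

```python
import shutil
import time
from pathlib import Path


def snapshot_before_editing(pdf_path: str, backup_dir: str) -> Path:
    """Copy a PDF to a timestamped backup before it is annotated (hypothetical helper)."""
    src = Path(pdf_path)
    dest_dir = Path(backup_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    # A timestamp in the name keeps earlier snapshots from being overwritten,
    # as long as snapshots are taken at least a second apart.
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = dest_dir / f"{src.stem}.{stamp}{src.suffix}"
    # copy2 also preserves file metadata such as modification time where possible.
    shutil.copy2(src, dest)
    return dest
```

If the text layer turns out to be damaged after an edit, the snapshot can be copied back over the working file.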