DTPO/PDFKit corrupting/obliterating OCR layer?

I know that PDFKit has been a mess, and I’ve lost track of whether DTPO is, or is not, relying on PDFKit by default these days.

The issue occurred on macOS 10.13.4 with DTPO 2.9.17. The PDF in question was a scan that had been made and OCR’d by someone other than me (I don’t know the provenance).

I had a PDF+Text open in DTPO. I had confirmed that the OCR layer was present and accurate (using PDF Expert, Safari, and finally DTPO, but NEVER Preview). I began reading the document in DTPO. After I made the first highlight, the application hung (beachball), then the Log window appeared with a row indicating “No Text” for the presently open file.

Indeed, I confirmed using all of the above applications that the OCR layer was now toast. The document could not be searched, anything selected was just blank characters, etc.

I had to restore the document from a backup to retrieve an un-corrupted version.

I can’t trust DTPO as a PDF viewer if it, or the frameworks it relies upon, can corrupt OCR layers! Thankfully opening files in an external application is easy, but it’s all too easy to, on a whim, quickly open a PDF, modify it in some way that causes it to auto-save, and BAM, OCR gone.

Is this just part of the ongoing PDFKit issues? Is this a new problem?

This is a rare but not new PDFKit issue that depends on the contents of the text layer; e.g., it might happen with Eastern European languages. The only workaround is to edit/annotate such documents in applications that use their own PDF engine.

Good to know. The image of text is in English, and the OCR layer is, ostensibly, in English as well. Indeed this has not happened with every PDF I’ve opened, so it is clearly contingent on something to do with the OCR encoding.

Either way, this makes DTPO a bit of a minefield for viewing PDFs for me. I think I’ll have to abandon viewing PDFs in DTPO for the time being.

Viewing PDF documents doesn’t cause any issues; only editing/annotating does. In addition, you’ll get a warning if this should happen again (unlike e.g. Preview or other apps using PDFKit).

@scott: Did you confirm the contents of the text layer by converting the PDF to plain text, or did you just visually deduce it was fine by selecting text?
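
For anyone who wants to do the plain-text check outside of a PDF viewer, here is a minimal sketch. It assumes Poppler's pdftotext command-line tool is installed, which nobody in this thread has mentioned using; the function names and the min_words threshold are made up for illustration.

```python
import shutil
import subprocess


def extract_text(pdf_path):
    """Return the text layer of pdf_path via Poppler's pdftotext (assumed installed)."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    # Passing "-" as the output file makes pdftotext write to stdout.
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def has_text_layer(text, min_words=5):
    """Hypothetical rule of thumb: fewer than min_words words counts as 'no text'."""
    return len(text.split()) >= min_words
```

Running extract_text on a copy of the file before and after annotating, and comparing the output, would show whether the text layer itself changed rather than just how a viewer renders it.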

Indeed – this will only happen if something is done to cause the file to be saved. However, I didn’t receive a pre-emptive warning. The only indication I received that this had already occurred was the Log popping up, and the concordance icon being greyed out (as well as the See Also results becoming totally whacko).

Either way, for now, I’m just going to avoid viewing (and modifying) PDFs in DTPO.

I confirmed that the OCR layer was initially intact by:

  1. searching for text, which returned results that indicated accurate OCR
  2. copying and pasting text which produced pasted text that was accurate to the image of the text
  3. using macOS’s “Lookup” feature, which correctly selected both the image and the text layer and produced appropriate results

(These are all more or less typical aspects of my day-to-day PDF workflow, so they were not initially done as a troubleshooting thing. They were initially done in pursuit of my usual daily tasks.)

After I made an annotation that resulted in DTPO saving the file, in addition to the Log warning:

  1. Searching the document for any string returned no results (reproduced in several PDF applications)
  2. Copying and pasting did not produce any text, or produced results that were garbled.
  3. Lookup was unsuccessful because: a) text selection through force-click/three-finger tapping/double clicking on a word failed because there was no text that could be intelligently identified as a word. b) manually selecting text by clicking and dragging clearly indicated that there was no separation between words, and using “Lookup” from the contextual menu produced no results because it was passed either a blank string or gibberish.
  4. DTPO greyed out the concordance and other word-based buttons in the PDF viewer. See Also and Classify also displayed results that didn’t make sense (e.g., clearly did not relate to the visible contents of the file, and also differed from the initial results when I originally used See Also and Classify to classify it).

The observed changes lead me to believe that there was an accurate OCR text layer initially, and then a corrupt OCR text layer subsequently.

Restoring a copy of the file from before DTPO auto-saved brought back the original behaviour: search worked, text selection was predictable, Lookup worked, copy/paste produced proper and accurate text output, etc.
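
The before/after comparison above could be roughly automated along these lines. This is only a sketch: it assumes you already have the extracted text of both versions as strings, it uses a crude English-only notion of "word-like", and the names and the 0.5 threshold are hypothetical rather than anything DTPO itself does.

```python
import re

# A token counts as word-like if it is letters (plus apostrophes or hyphens),
# optionally followed by one punctuation mark. English-only, by assumption.
WORDLIKE = re.compile(r"[A-Za-z][A-Za-z'’-]*[.,;:!?]?")


def count_wordlike(text):
    """Count whitespace-separated tokens that look like English words."""
    return sum(1 for token in text.split() if WORDLIKE.fullmatch(token))


def layer_degraded(before, after, drop=0.5):
    """True if 'after' keeps fewer than drop * the word-like tokens of 'before'."""
    return count_wordlike(after) < count_wordlike(before) * drop
```

A blank or gibberish text layer, as described in the list above, yields few or no word-like tokens, so the restored copy and the corrupted copy should land on opposite sides of the threshold.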

Do you have secondary languages set in OCR?
What’s your primary OCR language set to?

My own OCR settings are English and only English.

However, this was not a file created by me and I did not perform the OCR on this specific file.

Is it a file you can share in a Support Ticket?

Certainly, incoming shortly.

Ticket # 512522

I’m not sure if I’m having the same problem as @scottlougheed , but from time to time, I have OCR’d text disappear from a document, leaving me with a blank page that still holds annotations (e.g. highlights). What’s odd is that the text is “there” in that it can be searched or copy/pasted, but it otherwise remains invisible.

I looked around other threads, but I didn’t come across any with this exact issue—happy to be pointed to another forum if this issue has already been addressed.

I also can’t determine what triggers this. It happened this morning when I woke my laptop up from sleep and brought the DT3 window to the front, but in the past it has happened simply when the PDF was opened in DT3. It also happens seldom enough that I kept ignoring the issue, getting a new copy of the pdf, etc.

Image one shows the blank page, with the other pages in the PDF still intact, visible in the thumbnails. Image two shows the search function still working, with the yellow highlight marking the “found” term.

Are Preview or Acrobat still able to display the PDF document? Which version of macOS do you use?

Opening the PDF in Preview or Acrobat results in the same thing—blank page, highlights and image remain, text is still searchable.

I’m using
macOS 11.2.3
DT3 3.6.2

Do you still have an older copy of the document without the corruption?

Yes—I have the original, before it was OCR’d by DT

Are you able to reproduce the issue by using a copy of the original? If so, the original file and some instructions on how to reproduce this would be great, thanks!

I don’t think I can reproduce it, unfortunately. The issue happens seemingly at random—this particular instance of the problem occurred at least six months after I OCR’d the PDF originally. The blank page simply appeared in the open PDF when I woke my laptop from sleep. I don’t think the act of OCRing is the immediate cause.

Maybe I’m misunderstanding your question, tho?

Yes, it’s more likely caused by editing (e.g. highlighting) the PDF, due to PDFKit issues.

Got it. Is there any behavior/actions you would recommend against to decrease the risk of this issue?