DT no longer recognises text in PDF as text

Apostol · August 30, 2021, 10:05am

Hi there,

I have been using DEVONthink for a few months, so I am still relatively new to it. For the second or third time now, I am experiencing the following problem – text in a book PDF that I could copy and paste into the annotations box yesterday is today not recognised as text at all by DT. When I try to copy and paste bits of text, all I get in the annotations box is a bunch of question marks (see screenshot below).

It also turns out that the text in this particular PDF isn’t even searchable any more (this used to work fine before). The other PDFs in my library don’t have this problem, so far as I can tell - but it is still frustrating, especially since, as I said, it’s not the first time it’s happened.
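
For anyone wanting to check several files for this symptom, here is a minimal sketch in Python. It assumes the text has already been copied or exported from the PDF into a string. The function names and the 0.3 threshold are made up for illustration and are not part of DEVONthink or PDFKit.

```python
import unicodedata


def suspicious_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that look like extraction failures."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0

    def is_suspicious(c: str) -> bool:
        # '?' and U+FFFD commonly stand in for glyphs that could not be mapped.
        # Private-use (Co), unassigned (Cn) and control (Cc) characters are
        # rare in ordinary prose, so they are counted as well.
        return c in "?\ufffd" or unicodedata.category(c) in ("Co", "Cn", "Cc")

    return sum(is_suspicious(c) for c in chars) / len(chars)


def looks_garbled(text: str, threshold: float = 0.3) -> bool:
    """Heuristic only: True when a large share of the text looks unmappable."""
    return suspicious_ratio(text) >= threshold
```

A pasted run of question marks would be flagged, while ordinary prose containing the odd question mark would not. The threshold is a guess and would need tuning on real documents.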

Any advice? Thanks in advance.

chrillek · August 30, 2021, 10:15am

Did you open the PDF in another application between your first and second attempts to open it in DT?
Also, did you search the forum for related threads? I seem to remember that this kind of problem arose before.

cgrunenberg · August 30, 2021, 10:19am

Which version of macOS do you use? Depending on the PDF document and its internal structure & encoding, the PDFKit of macOS might sometimes corrupt the text layer, unfortunately.

Apostol · August 30, 2021, 10:21am

I didn’t open the PDF in another application before the problem arose, no. I now tried to open it in Acrobat Reader, and when I try to copy and paste text into MS Word, the same problem occurs. I suppose, then, that the problem is with this particular PDF - but it is still bizarre that it used to work fine before?

I did search the forum for related threads but did not find anything immediately relevant - will search again.

Apostol · August 30, 2021, 10:23am

It’s Big Sur 11.5.2. This sounds like it could well be the root of the problem - since, as I just found out, other software such as Acrobat Reader also doesn’t seem to identify text in the PDF. Shame!

stephenjw · August 30, 2021, 10:51am

I’ve had this problem intermittently throughout the years, too - it’s a pain. As far as I can tell from my experience, it happens with other apps, too, not just DT. The only solution I’ve found is to export the PDF to an image file and then re-do OCR. You lose your highlights and bookmarks/table of contents but get the text layer back.

cgrunenberg · August 30, 2021, 12:43pm

Applying OCR again to the document should also fix this and retain at least the annotations.

rfog · August 30, 2021, 6:31pm

That corruption is caused by the “fancy”, “super-duper”, “fantastic” Apple macOS PDF framework.

stephenjw · August 30, 2021, 10:44pm

Hi Cris

Re-applying OCR has never worked for me (at least in Acrobat). There are some anecdotal articles online to similar effect, e.g. this one. My guess is that it’s something to do with the fonts in the OCR layer. SW

stephenjw · August 30, 2021, 11:17pm

Hi @rfog

I’ve started seeing this sort of error in a few PDFs in DT:

[Screenshot: Screen Shot 2021-08-31 at 09.14.10]

It’s not something I had seen until a few weeks ago. Would it be the macOS PDF framework issue you mentioned?

SW

BLUEFROG · August 30, 2021, 11:25pm

Yes, PDFKit is Apple’s PDF framework.

If you’re running the Pro or Server edition of DEVONthink, have you tried running OCR on the file in DEVONthink?

stephenjw · August 31, 2021, 12:40am

Hi Jim

I tried running OCR in DT, which interestingly did change the text layer but didn’t solve the problem, I guess because the data has been corrupted.

FWIW, this was the process:

1 Original PDF as received:

(a) Image:

[Screenshot: Screen Shot 2021-08-31 at 09.58.40]

(b) OCR Text:

model number CC-CL-50W-203 (Type A Light)
el number NK-DLW1-50W29 (Type F Light).

2 PDF after problem appeared (outline/bookmarks created in Acrobat Pro DC, imported to DT 3.7.2, highlighting applied in DT before the font went skewy):

(a) Image:

[Screenshot: Screen Shot 2021-08-31 at 09.59.14]

(b) OCR text:

model number CC-CL-50W-2T0y3p(e A Light)
el number NK-DLW1-50W2T9y(pe F Light).

3 After OCR in DT 3.7.2:

(a) Image:

[Screenshot: Screen Shot 2021-08-31 at 10.24.36]

(b) OCR text:

model number CC-CL-50W-2Q$£ A Light)
el number NK-DLW1-50W^e F Light).

So OCR in DT is changing the text layer (& maybe it might fix the OP’s issue - I’ll try when I next get the OCR problem the OP was referring to with the question-mark-in-box characters).
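
To see exactly where the three text layers diverge, the first line of each stage can be compared mechanically. This is only a sketch using Python's standard `difflib`. The strings are copied from the steps above, and `changed_spans` is a hypothetical helper name.

```python
import difflib

# First OCR line from each stage, copied from the post above.
original = "model number CC-CL-50W-203 (Type A Light)"
corrupted = "model number CC-CL-50W-2T0y3p(e A Light)"
re_ocr = "model number CC-CL-50W-2Q$£ A Light)"


def changed_spans(before: str, after: str):
    """Return (operation, old_text, new_text) for every span that differs."""
    matcher = difflib.SequenceMatcher(None, before, after, autojunk=False)
    return [
        (tag, before[i1:i2], after[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


for label, text in (("after corruption", corrupted), ("after re-OCR", re_ocr)):
    print(label)
    for tag, old, new in changed_spans(original, text):
        print(f"  {tag}: {old!r} -> {new!r}")
```

Running it on whole pages rather than single lines should show whether the damage is confined to particular runs of text or spread across the document.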

I’ve noticed some other occasional problems with particular PDFs recently - the first page of a PDF going completely black, and PDF pages where I have highlighted text using DT not rendering in DT (they look white, with no text).

I suspect at least some of the problems are caused by bad PDFs and the rest are a product of the macOS PDFKit issue.

Apologies for the long rambling post. Thank you for your attention to the forums, as always. I feel bad taking up your time as I’m certain it is not a problem with DT (which BTW I use 8+ hrs a day and as you know is fantastic).

BLUEFROG · August 31, 2021, 3:21am

No worries and no wasted time.

Hold the Option key and choose Help > Report bug to start a support ticket and please attach the PDF for us to inspect. Thanks!

stephenjw · August 31, 2021, 5:16am

Thanks very much, Jim.

Unfortunately I can’t send this particular document due to client confidentiality restrictions. As soon as I can replicate the issue, I’ll send it through with a support ticket.

Thanks again,

SW

rfog · August 31, 2021, 7:15am

As @BLUEFROG said, yes, it is one of the Apple PDFKit issues. I have perfectly valid PDFs that I can see in Windows, iOS and macOS with 3rd party apps (like PDF Expert, GoodReader or PDF Viewer from PSPDFKit) but not with the native macOS Preview or any other application that uses that framework.

Your last resort could be to install a Ghostscript-like virtual printer, print the PDF as a full-page image and then OCR it with DT or another tool. I don’t know any virtual printer for macOS, but you could use the native “Print to PDF” in Windows 10 to do it if you have access to a Windows 10 machine (virtual or real).
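
The same idea can be scripted without a virtual printer. The sketch below assumes the Ghostscript (`gs`) and Tesseract (`tesseract`) command-line tools are installed and on the PATH. Neither is part of DEVONthink, and the helper names and file patterns are hypothetical. It renders each page to a PNG and asks Tesseract for a searchable one-page PDF per image.

```python
import subprocess
from pathlib import Path
from typing import List


def ghostscript_command(pdf: str, out_pattern: str, dpi: int = 300) -> List[str]:
    """Build the Ghostscript call that renders every page to a 24-bit PNG."""
    return [
        "gs", "-dNOPAUSE", "-dBATCH", "-dSAFER",
        "-sDEVICE=png16m", f"-r{dpi}",
        f"-sOutputFile={out_pattern}",
        pdf,
    ]


def tesseract_command(image: str, out_base: str) -> List[str]:
    """Build the Tesseract call; 'pdf' requests a searchable <out_base>.pdf."""
    return ["tesseract", image, out_base, "pdf"]


def rasterize_and_ocr(pdf: str, workdir: str) -> List[Path]:
    """Render the PDF to images, OCR each page, return the per-page PDFs."""
    work = Path(workdir)
    work.mkdir(parents=True, exist_ok=True)
    subprocess.run(ghostscript_command(pdf, str(work / "page-%03d.png")), check=True)
    results = []
    for image in sorted(work.glob("page-*.png")):
        out_base = image.with_suffix("")  # page-001.png -> page-001
        subprocess.run(tesseract_command(str(image), str(out_base)), check=True)
        results.append(out_base.with_suffix(".pdf"))
    return results
```

As with the export-to-image route mentioned earlier in the thread, highlights and bookmarks do not survive this, and merging the per-page PDFs back into one file is left out of the sketch.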

stephenjw · August 31, 2021, 10:29am

Thanks for the suggestion, rfog.

It sounds similar to what I do with exporting to a TIFF file then printing back to PDF.

Frustrating when it happens. Hopefully Apple will fix it.

chrillek · August 31, 2021, 12:09pm

Don’t hold your breath

stephenjw · August 31, 2021, 11:30pm

Ha ha - good advice.

mmbr · September 6, 2021, 7:29am

For short clips, try TextSniper. I’ve been using it only a short time, but no problems yet.
Edgar