I have been using DEVONthink for a few months, so I am still relatively new to it. For the second or third time now, I am experiencing the following problem: text in a book PDF that I could copy and paste into the annotations box yesterday is today not recognised as text at all by DT. When I try to copy and paste bits of text, all I get in the annotations box is a bunch of question marks (see screenshot below).
It also turns out that the text in this particular PDF isn’t even searchable any more (this used to work fine before). The other PDFs in my library don’t have this problem, so far as I can tell, but it is still frustrating, especially since, as I said, it’s not the first time it’s happened.
Did you open the PDF in another application between your first and second attempts to open it in DT?
Also, did you search the forum for related threads? I seem to remember that this kind of problem arose before.
I didn’t open the PDF in another application before the problem arose, no. I now tried to open it in Acrobat Reader, and when I try to copy and paste text into MS Word, the same problem occurs. I suppose, then, that the problem is with this particular PDF - but it is still bizarre that it used to work fine before?
I did search the forum for related threads but did not find anything immediately relevant - will search again.
I’ve had this problem intermittently throughout the years, too - it’s a pain. As far as I can tell from my experience, it happens with other apps, too, not just DT. The only solution I’ve found is to export the PDF to an image file and then re-do OCR. You lose your highlights and bookmarks/ table of contents but get the text layer back.
Re-applying OCR has never worked for me (at least in Acrobat). There are some anecdotal articles online to similar effect, e.g. this one. My guess is that it’s something to do with the fonts in the OCR layer. SW
I tried running OCR in DT, which interestingly did change the text layer but didn’t solve the problem, I guess because the data has been corrupted.
FWIW, this was the process:
1 Original PDF as received:
(b) OCR Text:
model number CC-CL-50W-203 (Type A Light)
el number NK-DLW1-50W29 (Type F Light).
2 PDF after problem appeared (outline/ bookmarks created in Acrobat Pro DC, imported to DT3.7.2, highlighting applied in DT before font went skewy):
(b) OCR text:
model number CC-CL-50W-2T0y3p(e A Light)
el number NK-DLW1-50W2T9y(pe F Light).
3 After OCR in DT3.7.2:
(b) OCR text:
model number CC-CL-50W-2Q$£ A Light)
el number NK-DLW1-50W^e F Light).
So OCR in DT is changing the text layer (& maybe it might fix the OP’s issue - I’ll try when I next get the OCR problem the OP was referring to with the question-mark-in-box characters).
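In case it helps anyone who wants to check text copied out of several PDFs for the question-mark symptom, here is a minimal sketch in Python. It only looks at a string of text you have already copied or extracted; it does not read PDFs itself. The function names and the 0.3 threshold are my own assumptions, not anything from DT. It would not catch the interleaved-character corruption shown in step 2 above, only text that has collapsed into question marks, replacement characters or private-use glyphs.

```python
import unicodedata


def suspicious_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that look like text-layer debris."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0

    def is_suspicious(c: str) -> bool:
        # '?' and U+FFFD are what a failed copy/paste typically shows.
        if c in "?\ufffd":
            return True
        # Private-use (Co), unassigned (Cn) and control (Cc) characters
        # usually mean the font mapping in the text layer is broken.
        return unicodedata.category(c) in {"Co", "Cn", "Cc"}

    return sum(is_suspicious(c) for c in chars) / len(chars)


def looks_garbled(text: str, threshold: float = 0.3) -> bool:
    """Hypothetical helper: True if the text is mostly debris (threshold is a guess)."""
    return suspicious_ratio(text) >= threshold
```

Ordinary prose with the odd question mark stays well under the threshold, so it should only flag text that is mostly unreadable.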
I’ve noticed some other occasional problems with particular PDFs recently: the first page of a PDF going completely black, and PDF pages where I have highlighted text using DT not rendering in DT (they look white, with no text).
I suspect at least some of the problems are caused by bad PDFs and the rest are a product of the macOS PDFkit issue.
Apologies for the long rambling post. Thank you for your attention to the forums, as always. I feel bad taking up your time as I’m certain it is not a problem with DT (which BTW I use 8+ hrs a day and as you know is fantastic).
As @BLUEFROG said, yes, it is one of the Apple PDFKit issues. I have perfectly valid PDFs that I can view in Windows, iOS and macOS with third-party apps (like PDF Expert, GoodReader or PDF Viewer from PSPDFKit) but not with the native macOS Preview or any other application that uses that framework.
Your last resort could be to install a Ghostscript-like virtual printer, print the PDF as full-page images and then OCR it with DT or another tool. I don’t know of any virtual printer for macOS, but you could use the native Windows 10 “Print to PDF” if you have access to a Windows 10 machine (virtual or real).
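If a virtual printer isn't available, one alternative (my suggestion, not something tested in this thread) is to call the Ghostscript command-line tool directly to render each page to an image, then OCR the images. A rough Python sketch, assuming Ghostscript is installed and available as `gs` on your PATH; the function names, output file pattern and 300 dpi default are illustrative choices:

```python
import shutil
import subprocess
from pathlib import Path


def build_rasterize_command(pdf_path, out_dir, dpi=300):
    """Build a Ghostscript command that renders each page to a PNG file."""
    out_pattern = str(Path(out_dir) / "page-%03d.png")  # page-001.png, page-002.png, ...
    return [
        "gs",
        "-dNOPAUSE", "-dBATCH",   # run without prompting and exit when done
        "-dSAFER",                # restrict file access while rendering
        "-sDEVICE=png16m",        # 24-bit colour PNG output
        f"-r{dpi}",               # render resolution in dpi
        f"-sOutputFile={out_pattern}",
        str(pdf_path),
    ]


def rasterize(pdf_path, out_dir, dpi=300):
    """Run Ghostscript; assumes 'gs' is installed (not verified for every setup)."""
    if shutil.which("gs") is None:
        raise RuntimeError("Ghostscript ('gs') not found on PATH")
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_rasterize_command(pdf_path, out_dir, dpi), check=True)
```

Calling `rasterize("book.pdf", "pages")` should leave one PNG per page in the `pages` folder, which can then be imported and OCR'd. As noted earlier in the thread, going via images loses highlights and bookmarks.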