Text from PDF not readable

vw-devonthink · April 28, 2022, 3:48pm

This is not specifically a DevonThink issue but DevonThink is equally affected.

I have access to electronic publications published by a legal publisher. It is possible to download extracts of the books as PDFs.

However, in any PDF viewing app on the Mac (including DevonThink, Adobe Acrobat, Foxit, PDFpen Pro, etc.), copying text from the resulting PDF produces garbage. It is also not possible to search for text (as the search only sees garbage).

For example:

If the text selected and copied is as follows:

“The objective test In deciding whether the parties have reached agreement, the courts normally apply the objective test,6 which is further discussed at para.2-003 below. Under this test, once the parties have to all outward appearances agreed in the same terms on the same subject-matter,7 then neither can, generally,8 rely on some unexpressed qualiﬁcation or reservation to show that he had not in fact agreed to the terms to which he had appeared to agree. Such subjective reservations of one party therefore do not prevent the formation of a contract.9”

This is what is produced to the clipboard:

"“7KHREMHFWLYH WHVW ,QGHFLGLQJZKHWKHU WKHSDUWLHVKDYH UHDFKHGDJUHHPHQW
WKHFRXUWVQRUPDOO\DSSO\WKHREMHFWLYHWHVWZKLFKLVIXUWKHUGLVFXVVHGDWSDUD
EHORZ 8QGHU WKLV WHVW RQFH WKH SDUWLHV KDYH WR DOO RXWZDUG DSSHDUDQFHV
DJUHHG LQ WKH VDPH WHUPV RQ WKH VDPH VXEMHFWPDWWHU WKHQ QHLWKHU FDQ
JHQHUDOO\e UHO\RQVRPHXQH[SUHVVHGTXDOLೳ FDWLRQRU UHVHUYDWLRQ WRVKRZ WKDWKH
KDG QRW LQ IDFW DJUHHG WR WKH WHUPV WR ZKLFK KH KDG DSSHDUHG WR DJUHH 6XFK
VXEMHFWLYH UHVHUYDWLRQV RI RQH SDUW\ WKHUHIRUH GR QRW SUHYHQW WKH IRUPDWLRQ RI D
FRQWUDFW”

Exporting the PDF to Word (using PDFpen Pro) does not produce readable text either.

But opening the same PDF in a PDF viewing app for Microsoft Windows does allow me to copy the selected passage as text. (For this I used PDFX-Change running under CrossOver).

A page from the downloaded file is attached.

Test page.pdf (127.6 KB)

Any help from DevonThink or another user on identifying the problem would be greatly appreciated.
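Comparing the two passages quoted above suggests a pattern: each garbled character sits a constant 29 code points below the intended one ("7KH" maps to "The", "WHVW" to "test"). One plausible reading, which is an assumption and not something confirmed in this thread, is that the font's internal character codes are offset from ASCII and the viewers copy those raw codes because the PDF lacks a usable text mapping. The following Python sketch only reverses the shift on copied text. The function name `unshift` and the constant `SHIFT` are made up for illustration, and the offset is inferred from this single sample.

```python
# Offset inferred from the sample in this thread ("7KH" + 29 -> "The").
# This is an assumption; other PDFs may use a different mapping or none at all.
SHIFT = 29


def unshift(text: str, shift: int = SHIFT) -> str:
    """Add `shift` to every visible ASCII character that stays within ASCII.

    Spaces, line breaks, and anything outside that range are left unchanged.
    """
    out = []
    for ch in text:
        code = ord(ch)
        if 0x21 <= code <= 0x7E - shift:
            out.append(chr(code + shift))
        else:
            out.append(ch)
    return "".join(out)


print(unshift("WKH VDPH WHUPV"))  # prints "the same terms"
```

This cannot restore characters that were dropped during copying. In the sample the original spaces, commas, and full stops are missing, presumably because they shifted into non-printable codes, so the recovered text still needs manual clean-up.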

Stephen_C · April 28, 2022, 4:16pm

I’m not sure this will help very much but:

I downloaded your test page and imported it into DEVONthink Pro (where it showed correctly as PDF+Text).
I selected text in the DEVONthink view window and copied and pasted that into CotEditor (effectively my default text editor).

The result was that the text pasted perfectly. I was also able to search for terms in that page within DEVONthink.

Stephen

rmschne · April 28, 2022, 4:42pm

I had a different result from @Stephen_C …

I clicked on the link and the file downloaded and showed just fine in Preview. I “moved” it to the Global Inbox and DEVONthink imported it as a PDF (no OCR). When viewed in DEVONthink, it showed only white space.

I “moved” from Preview to the desktop, then dragged and dropped into a DEVONthink folder and it’s viewable in DEVONthink. No OCR.

Copy/pasting from Preview and DEVONthink gives gobbledygook characters in Word.

With DEVONthink, I OCR’ed the version that looked OK, and after that step, the content could copy/paste into Word just fine. I could also paste it into my text editor (BBEdit) as plain text.

I then OCR’ed the version that was blank white space, and that made no change to viewing. Still just white when viewed in DEVONthink or any PDF viewer I tried (PDFPen and Preview). OCR did not do anything.

I opened the downloaded version stored on the Desktop (which I had previously dragged into DEVONthink) with PDFPen. I tried “clear OCR Layer on page”, and nothing happened. That was unexpected, as I was then going to re-OCR it with that tool, but as there apparently is an OCR layer there, I could not do that.

Not sure what to make of all that, frankly. I’ve forgotten more than I knew about the vagaries of PDFs.

I think it’s just a “funny” PDF of some sort.

BLUEFROG · April 28, 2022, 4:42pm

CotEditor (effectively my default text editor).

Yayyy!! I love CotEditor.

Stephen_C · April 28, 2022, 5:04pm

All your fault—after reading this.

Stephen

BLUEFROG · April 28, 2022, 5:10pm

It’s so good I wish I coded it.
I still use TextEdit for some support stuff, but I use CotEditor most often.
BBEdit too, but I never compose in it. I use it as a can opener, e.g., for inspecting the raw code of PDFs, etc.

PS: However, for composition in Markdown, etc., DEVONthink is still my default.

troejgaard · October 29, 2024, 6:02pm

Since I’m still on Ventura, I missed that it had a 5.0 release (min. requirement is Sonoma) with a new sidebar for Folder Navigation, among other things. I don’t think it’s enough by itself to make me upgrade, but it does move the needle…

(Edit: sorry for reviving the thread. I saw some activity and thought this was a recent reply.)

kewms · October 29, 2024, 6:57pm

My guess would be DRM on the publisher’s side. Certainly if you see the same behavior in every viewer, the publisher would be the source of the problem.