Text from PDF not readable

This is not specifically a DEVONthink issue, but DEVONthink is equally affected.

I have access to electronic publications published by a legal publisher. It is possible to download extracts of the books as PDFs.

However, in any PDF viewing app on the Mac (DEVONthink as well as Adobe Acrobat, Foxit, PDFpen Pro, etc.), copying text from the resulting PDF produces garbage. Searching for text does not work either, as the search only sees the garbage.

For example:

If the text selected and copied is as follows:

“The objective test In deciding whether the parties have reached agreement, the courts normally apply the objective test,6 which is further discussed at para.2-003 below. Under this test, once the parties have to all outward appearances agreed in the same terms on the same subject-matter,7 then neither can, generally,8 rely on some unexpressed qualification or reservation to show that he had not in fact agreed to the terms to which he had appeared to agree. Such subjective reservations of one party therefore do not prevent the formation of a contract.9”

This is what is produced to the clipboard:

"“7KHREMHFWLYH WHVW ,QGHFLGLQJZKHWKHU WKHSDUWLHVKDYH UHDFKHGDJUHHPHQW
WKHFRXUWVQRUPDOO\DSSO\WKHREMHFWLYHWHVWZKLFKLVIXUWKHUGLVFXVVHGDWSDUD
 EHORZ 8QGHU WKLV WHVW RQFH WKH SDUWLHV KDYH WR DOO RXWZDUG DSSHDUDQFHV
DJUHHG LQ WKH VDPH WHUPV RQ WKH VDPH VXEMHFWPDWWHU WKHQ QHLWKHU FDQ
JHQHUDOO\e UHO\RQVRPHXQH[SUHVVHGTXDOLą³³ FDWLRQRU UHVHUYDWLRQ WRVKRZ WKDWKH
KDG QRW LQ IDFW DJUHHG WR WKH WHUPV WR ZKLFK KH KDG DSSHDUHG WR DJUHH 6XFK
VXEMHFWLYH UHVHUYDWLRQV RI RQH SDUW\ WKHUHIRUH GR QRW SUHYHQW WKH IRUPDWLRQ RI D
FRQWUDFW”

Exporting the PDF to Word (using PDFpen Pro) does not produce readable text either.

But opening the same PDF in a PDF viewing app for Microsoft Windows does allow me to copy the selected passage as text. (For this I used PDF-XChange running under CrossOver.)
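
One observation on the clipboard sample above, offered as a guess rather than a diagnosis: the garbage is not random. Each letter appears to be the real character with its code lowered by a constant 29 ("7" is 55 and "T" is 84, "K" is 75 and "h" is 104). That pattern would be consistent with an embedded subset font whose glyph codes are copied out directly because the PDF does not tell the viewer how to map them back to Unicode, though that cause is an assumption. The missing word spaces fit the same picture, since a space (32) shifted down by 29 lands on a control character. A minimal Python sketch of reversing the shift follows; the names `SHIFT` and `unshift` are hypothetical, and the offset is inferred only from this one sample.

```python
# Assumption: a constant offset of 29, inferred from the pasted sample only.
SHIFT = 29


def unshift(garbled: str, shift: int = SHIFT) -> str:
    """Undo a constant character-code offset in text copied from the PDF."""
    out = []
    for ch in garbled:
        code = ord(ch) + shift
        # Leave whitespace alone, and keep any character that would not land
        # on printable ASCII (it probably came from a different font).
        if ch.isspace() or not 33 <= code <= 126:
            out.append(ch)
        else:
            out.append(chr(code))
    return "".join(out)


print(unshift("7KHREMHFWLYH WHVW"))  # Theobjective test
print(unshift("WKHFRXUWVQRUPDOO\\DSSO\\WKHREMHFWLYHWHVW"))
```

This only recovers the letters and punctuation. Spaces that were dropped during copying cannot be restored this way, so OCR is probably still the more practical route for making the file searchable.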

A page from the downloaded file is attached.

Test page.pdf (127.6 KB)

Any help from DEVONthink or another user on identifying the problem would be greatly appreciated.

I’m not sure this will help very much but:

  1. I downloaded your test page and imported it into DEVONthink Pro (where it showed correctly as PDF+Text).

  2. I selected text in the DEVONthink view window and copied and pasted that into CotEditor (effectively my default text editor).

The result was that the text pasted perfectly. I was also able to search for terms in that page within DEVONthink.

Stephen


I had a different result from @Stephen_C …

I clicked on the link and the file downloaded and showed just fine in Preview. I “moved” it to the Global Inbox and DEVONthink imported it as a PDF (no OCR). When viewed in DEVONthink, it showed only white space.

I “moved” it from Preview to the desktop, then dragged and dropped it into a DEVONthink folder, and it’s viewable in DEVONthink. No OCR.

Copying and pasting from Preview and DEVONthink gives gobbledygook characters in Word.

With DEVONthink, I OCR’ed the version that looked OK, and after that step the content could be copied and pasted into Word just fine. I could also paste it into my text editor (BBEdit) as plain text.

I then OCR’ed the version that was blank white space, and that made no change to viewing. It was still just white when viewed in DEVONthink or any PDF viewer I tried (PDFpen and Preview). OCR did not do anything.

I opened the downloaded version stored on the desktop (which I had previously dragged into DEVONthink) with PDFpen. I tried “clear OCR Layer on page”, and nothing happened. That was unexpected, as I was then going to re-OCR it with that tool, but since there apparently is an OCR layer there, I could not do that.

Not sure what to make of all that, frankly. I’ve forgotten more than I knew about the vagaries of PDFs.

I think it’s just a “funny” PDF of some sort.

CotEditor (effectively my default text editor).

Yayyy!! I :heart: CotEditor. :slight_smile:

All your fault, after reading this. :grin:

Stephen


It’s so good I wish I coded it.
I still use TextEdit for some support stuff, but I use CotEditor most often.
BBEdit too, but I never compose in it. I use it as a can opener, e.g., for inspecting the raw code of PDFs, etc.

PS: However, for composition in Markdown, etc., DEVONthink is still my default.
