pdf has text layer but is not searchable

Prion · February 11, 2008, 11:44am

Dear all

upon importing a number of pdfs from science journals, I stumbled across quite a few that DevonThink Pro complained about (No Text). I checked the offending pdfs and was shocked to find that the pdfs did in fact contain a text layer but were not searchable. The pdfs had indeed come straight from the journal homepages so they never underwent OCRing but were produced electronically from a file.
Until now I had only checked whether I could highlight some text in Preview.app and, if I could, concluded that the pdf would be searchable.
However, neither Preview nor Acrobat can find a single word in these files. The search runs and completes but never finds anything. Copying text and pasting it into TextEdit leaves a garbled mess of strange characters that bear no resemblance to the article.

I am rather taken aback that pdf might not be the trustworthy document format I thought it was. I am prepared for problems that arise from less than perfect OCRing, but we are talking about straightforward pdfs generated from files.

Two questions:

How can I check rigorously if a pdf is searchable, i.e. has a text layer that actually contains useful information? DTP seems to do a good job upon import, but can I run a check on what is already inside the database, or on what lives outside the DTP database and is merely indexed?
How can I convert the pdf into something more useful? Acrobat 7 will refuse OCRing these files stating that they already contain a text layer.

I would appreciate your help because my research hinges on the references. Having an unknown number of pdfs inside the database that are not searchable defeats the purpose.

Prion

eboehnisch · February 12, 2008, 6:47pm

Prion, could you email us such a PDF so that we can have a look? You can find our general email address (info@…) on our Contact page.

Prion · February 12, 2008, 8:51pm

I just sent a message with one of the offending pdfs. Hopefully you can solve this mystery, many thanks for looking into it.

Prion

Prion · February 13, 2008, 4:28pm

Just so that everybody with a similar problem knows:
I had a chat with Eric, DEVONtechnologies celebrity and president, and going through the pdf in question we were able to narrow it down to the following.
Acrobat 7 could not search the pdf at all, either on Eric’s machine or on mine.
Preview.app could on Eric’s machine but not on mine. The same applies to DEVONthink, which uses webkit just like Preview.

The difference seems to lie in various upgrades to Webkit in Leopard that make pdfs more searchable. My Powerbook (still running Tiger and thus using an older version of webkit) could not make use of the text layer inside the pdf, while the machine running Leopard could. Leopard was not perfect, though, but usable.
Some words contain characters outside the ASCII code (or whatever was used), which disrupts the search string. Such a word will still not turn up in a search, but this seemed to be the exception. Eric suggested using the fuzzy search in DEVONthink to sort of circumvent the problem. Great idea, it had not occurred to me. It does not solve the indexing problem or the less than perfect searchability of certain pdfs, but it has to be kept in mind that the culprit is not DEVONthink but the process that created the pdf.
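
To illustrate why a fuzzy search can still find a word in which one glyph has been replaced, here is a minimal Python sketch using the standard library's difflib. It is not how DEVONthink's fuzzy search works internally; the function name fuzzy_find, the cutoff of 0.7 and the sample sentence are assumptions made up for this example.

```python
import difflib

def fuzzy_find(query, text, cutoff=0.7):
    """Return words in text that approximately match query.

    Hypothetical helper for illustration only; cutoff is a
    similarity threshold between 0 and 1 (higher = stricter).
    """
    words = text.split()
    return difflib.get_close_matches(query, words, n=5, cutoff=cutoff)

# "profile" with the "fi" pair replaced by one private-use character,
# as described above for the problematic pdfs.
sample = "the pro\ue000le of the protein"
matches = fuzzy_find("profile", sample)
print(matches)
```

An exact search for "profile" would miss the damaged word, while the approximate match tolerates the one substituted character.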

Still, I need to upgrade to Leopard quickly, it seems.

Hope this helps, and two thumbs up for customer service, Devonthink. These guys know what they are talking about.

Prion

eboehnisch · February 13, 2008, 4:49pm

Thanks, Prion

One addition from my side: The PDFs were produced by the 3B2 newspaper typesetting system and are just messed up internally. Seems some newspapers should update their systems every 30 years at least.

bangersandmash · February 13, 2008, 5:02pm

I have come across academic journals that have intentionally inserted garbage into the text layer as a naive attempt at copy protection. While you may be able to highlight text in the pdf, you are only highlighting the garbage text (ascii characters). Acrobat 7 (which I use) will not allow you to re-OCR a document that has a text layer, so you’ll need to get rid of the text layer first. I cannot figure out how to do this in Acrobat 7, but according to http://blogs.adobe.com/acrolaw/2006/12/acrobat_8_new_e.html Acrobat 8 can remove this layer of garbage text for you. This will allow you to re-OCR the document and make it searchable in DEVONthink.

eboehnisch · February 13, 2008, 5:17pm

These documents are not OCR-able as they contain true machine-readable text, not an image of the page. But in the text some characters are replaced by special characters for use by the typesetter. On screen they appear properly but from the Unicode standpoint they’re just garbage.

Bill_DeVille · February 13, 2008, 6:25pm

At Eric’s request, Prion attached a copy of a PDF to a message to Support. It was a journal publication of one of his papers.

I checked out the text layer of the PDF. It contained some garbage characters that substituted for normal characters in some words, making those words non-searchable.

At high magnification one of those garbage characters contained the label “PRIVATE USE”. Unfortunately, that reduces the usefulness of the PDFs to scholars who wish to search their reference collections – perhaps they should complain to the journals using such techniques.
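
For anyone who wants to screen text that has already been extracted from a PDF, here is a minimal Python sketch of one possible check. It assumes you have obtained the text layer as a string by some other means (extraction itself is not shown), and it relies on the fact that Unicode assigns Private Use characters the general category "Co". The function names and the 1% threshold are assumptions chosen for illustration, not anything DEVONthink does.

```python
import unicodedata

def private_use_chars(text):
    """Return the Private Use characters found in text, in order."""
    return [ch for ch in text if unicodedata.category(ch) == "Co"]

def looks_searchable(text, max_ratio=0.01):
    """Heuristic: True if the text is non-empty and at most max_ratio
    of its non-whitespace characters are Private Use characters.

    max_ratio is an arbitrary threshold picked for this sketch.
    """
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return False  # no text layer at all
    bad = sum(1 for ch in visible if unicodedata.category(ch) == "Co")
    return bad / len(visible) <= max_ratio

print(private_use_chars("pro\ue001le"))
print(looks_searchable("an ordinary sentence"))
```

A check like this only flags Private Use code points; a text layer that is garbled in some other way (for example, wrong but ordinary letters) would pass it.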

I was able to use DT Pro Office’s File > Import > Images (with OCR) command to import that PDF with a new text layer, with no inserted garbage characters. For such PDFs that have already been imported into the database the command Data > Convert > to Searchable PDF will create a new PDF copy with a new text layer for the selected PDF.

Prion · February 14, 2008, 9:16am

Thanks Bill

as I said above, this pdf is one of close to 200 pdfs that share the same problem. I downloaded it straight from the journal’s website again to make sure that the problem was not caused by any tinkering on my part in the meantime.

The “private use” label, as I understand it, does not refer to the intended (or allowed) usage of the pdf as such. It stands for characters that fine-tune the position of certain characters relative to one another in non-monospaced fonts in the pdf. Depending on which font is used, two characters may simply sit next to one another (say, “i” and “f” as in the word “if”), but in reverse order the same characters are set much more closely (as in “fine”, where the “i” is positioned almost underneath the “f”). Some of these combinations simply have no equivalent in ASCII or UTF and are replaced by some strange, font-specific character that prints fine but makes no sense as pure text, hence the term “private use” (in this font). These are the characters that break the search string and account for the hiccups even under Leopard.
Somebody with a better understanding of what is going on behind the scenes please correct me if I am wrong. I don’t claim to be an expert but that is my understanding of what causes the problems.
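
As a rough illustration of the idea, here is a minimal Python sketch that turns such combined glyphs back into plain letters before searching. Standard ligature code points such as U+FB01 ("ﬁ") are expanded by Unicode NFKC normalization; font-specific private-use code points have no standard meaning, so they need a per-font lookup table. The table PUA_TO_TEXT below is purely hypothetical and would have to be worked out for each font by inspection.

```python
import unicodedata

# Hypothetical mapping for one particular font; real pdfs will
# use different private-use code points for different glyphs.
PUA_TO_TEXT = {"\ue000": "fi", "\ue001": "fl"}

def normalize_for_search(text, pua_map=PUA_TO_TEXT):
    """Expand standard ligatures via NFKC, then replace any
    private-use characters listed in pua_map."""
    text = unicodedata.normalize("NFKC", text)
    return "".join(pua_map.get(ch, ch) for ch in text)

print(normalize_for_search("\ufb01ne"))      # standard "fi" ligature
print(normalize_for_search("de\ue000ne"))    # private-use stand-in
```

Private-use characters missing from the table are left untouched, so they would still break an exact search.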

Bill, can I run the demo of DTPO alongside DTP (which I own)? I do not want to compromise DTP, which holds a lot of data that is important to me.

Prion

Bill_DeVille · February 14, 2008, 1:49pm

Prion, you can demo DT Pro Office using your current databases. DT Pro and DT Pro Office are entirely compatible.

Of course, I recommend running Scripts > Export > Backup Archive before doing anything new or different. First, that assures that the database will be in good condition before the change; second, you will have a current external backup archive file that you can save to another computer as a precaution against a hard drive failure or other catastrophe.

We recommend against running two versions of a DEVONthink application simultaneously, as that can confuse Services or scripts that send data to a database. A database that’s in use by one DT application cannot be opened by a different DT application while it’s in use.