Non-searchable searchable PDFs

Abelard · September 13, 2016, 8:40pm

And there is no way to know if a PDF is searchable or not? There was a Tuesday tip (now defunct, as the linked page containing the script is no longer available) which suggested a method of finding those PDFs that had no OCR layer (blog.devontechnologies.com/2007/ … ocr-layer/), but would this not simply find those PDFs that are not PDF+Text?

Bill_DeVille · September 13, 2016, 9:11pm

DEVONthink displays the Kind of an image-only PDF as PDF, and of a PDF with a searchable text layer as PDF+Text.

Your example contained one page of searchable text, and the following pages were image-only. The Kind will be displayed as PDF+Text. Note that you can select text on that first page, but you can’t select a word of text on the subsequent pages.

Abelard · September 14, 2016, 4:29am

In both the examples here, the problem has been identical. I’ve been unable to OCR a PDF+Text within my database, i.e. the Convert option has not worked. DTPO couldn’t distinguish between an OCR layer of text on a doc and a layer of irrelevant text. There is no option within DTPO to tell it to treat the PDF as an image, regardless of whether it’s a PDF or a PDF+Text, in the same way you can for incoming docs, i.e. Import/Images (with OCR). This also has the associated problem of not knowing which PDFs within DTPO are searchable or not (that is, the actual text of the PDF). The ‘text’ in the second doc was on all the pages of the PDF, although there was more on the first page. In effect it was a copyright sheet (as I believe it might have been for the first one), stating the doc details on the first sheet and then ‘Reproduced with permission of the copyright owner. Further reproduction prohibited without permission’ on subsequent pages. I imagine that Antiquity is not the only journal that does this, although this is the first time I’ve come across it for an academic journal article.

There is the option within Preferences/OCR to convert incoming scans to searchable PDFs, but this will not help in this scenario because it does not apply to PDFs added directly to the database.

So at this juncture I have no way of knowing which PDFs in my database need to be deleted and then imported via Import/Images (with OCR), or in other words how searchable my database is. I can certainly see why DTPO cannot distinguish between a layer of copyright text and an OCR layer (they’re both searchable text), but bearing in mind that being able to search everything within DTPO is one of the big draws of the software (certainly for me), not being able to either identify or fix the problem seems to be an issue.

Maybe I’ve got this completely wrong. I hope so. But has anyone been able to OCR a similar doc i.e. a PDF with an irrelevant text layer within DTPO, or indeed identify one other than by chance?

cgrunenberg · September 14, 2016, 7:12am

Could you please send an example to cgrunenberg - at - devon-technologies.com? Thanks a lot!

Abelard · September 14, 2016, 3:18pm

Just to update (and finally close) the thread, my ability to successfully OCR a PDF+Text doc in DTPO was restored (THANK YOU Chris)! Somehow, a preference somewhere was changed that got rid of the alerts. This was cleared by using the Terminal command ‘defaults delete com.devon-technologies.thinkpro2’, although the next version (2.9.5) will obviate this by including a button in the preferences to reset the alerts.

Bill_DeVille · September 14, 2016, 3:31pm

Aaha! That’s a good example of the kind of problem that can result from turning off an alert that has been built into the software (and forgetting that this had been done). Personally, although sometimes I might become irritated when an alert pops up frequently, I value its presence.

Abelard · September 14, 2016, 3:52pm

I know right? I have absolutely no memory of doing that, although I wouldn’t be surprised as alerts can be annoying. I’m guessing I assumed I was just stopping the alert rather than the whole process, otherwise I doubt I would have done it.

korm · September 14, 2016, 3:53pm

It might be interesting to tell support at the journal(s) where the PDF comes from what you are looking to do. They might be modifying downloaded files to prevent text extraction – to preserve their rights in the content.

Abelard · September 14, 2016, 4:05pm

Having just downloaded an article from their latest issue (a normal searchable PDF), it appears this only applies to older issues, which have presumably been scanned from an actual paper copy of the journal and stored as a PDF with that overlay giving the article details and the copyright blurb. I’d also hazard a guess that this would apply to lots of other journals out there that have made their archives available the same way. Luckily, they normally look quite different, so fingers crossed I’ll spot the buggers!

Kinsey · September 14, 2016, 4:24pm

Out of curiosity, are you getting the article from JSTOR? I have a lot of articles downloaded from there, which are scans of paper copies, but have had no issues searching the articles either online or within DTPO.

Abelard · September 14, 2016, 4:39pm

I actually got it through ProQuest (via my uni). My personal feeling would be that it would depend on the journal, but that’s just a guess. Hmmm, how to spot those unsearchable PDF+texts?

BLUEFROG · September 14, 2016, 5:44pm

Create a Smart Group in your database with criteria of…
Kind is PS/PDF
Word Count is 0

This should do it.

korm · September 14, 2016, 6:24pm

If you’re not getting expected search results from a PDF+Text you can select the document and use Data > Convert > to Plain Text. You’ll get a text file with just the text layer, which you can inspect and compare to the PDF.
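Once you have that plain text, one rough way to judge it is to compare the number of words against the number of pages. The following is only a sketch in Python, not anything built into DTPO: the function names and the 50-words-per-page threshold are assumptions for illustration, and you would supply the exported text and the page count yourself.

```python
def words_per_page(text: str, page_count: int) -> float:
    """Average number of whitespace-separated words per page."""
    if page_count <= 0:
        raise ValueError("page_count must be positive")
    return len(text.split()) / page_count


def looks_sparse(text: str, page_count: int,
                 min_words_per_page: float = 50.0) -> bool:
    """True when the text layer seems too thin to be a full OCR of the pages.

    The default threshold is an arbitrary assumption; tune it for your documents.
    """
    return words_per_page(text, page_count) < min_words_per_page
```

A ten-page article whose text layer holds only a short copyright line per page would come out at a dozen or so words per page and be flagged, whereas a properly OCR'd article would normally sit in the hundreds.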

Abelard · September 17, 2016, 9:52am

This works, but only for those without any text layer i.e. PDFs rather than PDF+Text, although this is handy too to spot those PDFs with nothing at all. It’s hard to see how to get DTPO to tell a text layer that isn’t OCR from an OCR one.
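One heuristic that might help outside DTPO, working on the plain text exported via Data > Convert > to Plain Text: a copyright stamp tends to repeat the same line on every page, whereas genuine OCR text rarely repeats whole lines. The Python sketch below measures that. The function name is hypothetical, and it assumes the exported text keeps each stamp on its own line, which may not hold for every PDF.

```python
from collections import Counter


def repeated_line_fraction(text: str) -> float:
    """Fraction of non-empty lines whose exact text occurs more than once."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    counts = Counter(lines)
    # Count every occurrence of any line that appears at least twice.
    repeated = sum(n for n in counts.values() if n > 1)
    return repeated / len(lines)
```

A value close to 1.0 suggests the text layer is mostly a repeated stamp and the document is worth re-importing with OCR; a value near 0.0 is what you would expect from real body text. Where to draw the line in between is a judgment call.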