Yes, I’m sorry for the confusing title but let me explain. I’ve recently imported some PDFs downloaded from ‘The Times’ Digital Archive relevant to my studies, found using their search facility. Anyway, after importing them into DTPO, and seeing as they are listed as ‘PDF+Text’, I assumed they were searchable, but a quick test search indicates that they’re not. Converting to ‘Searchable PDF’ has no effect. What have I missed?
Either the PDF is faulty or your search terms / scope are incorrect. Please start a Support Ticket, ZIP and send a PDF, and include the terms you were searching with.
The scanning on newspaper articles can be very hit and miss, I’ve found, so what you see on the page very often isn’t what the OCR returns.
Have you tried converting the PDF+Text to Rich Text (or Plain Text), using the Data > Convert menu? Then you’ll be able to see the actual words in the text layer of the PDF, so you can test your search terms. At least you’ll know whether it’s a problem with the original OCR process or a deeper issue.
Thanks guys. Yeah, I think it’s a deeper issue, as converting to plain text/RTF only gives me the cover sheet; it completely ignores the actual newspaper text. But here’s the thing – the only reason I found them is because they are searchable on the Times Digital Archive, so there must be a way they can be made searchable. I’ll go back and see if there are other formats I can download them in from there. In the meantime, I’ve put in a Support Ticket with the PDFs attached so the DTPO experts can have a look.
The search on the web doesn’t necessarily mean it’s searching the actual PDFs. (And the PDFs you sent were NOT searchable, so they couldn’t be found if it was searching the PDFs themselves.)
@Abelard: If you are downloading PDFs from archives of old newspapers, it’s not uncommon that only the cover page has been OCRed and contains text, and the article pages themselves are only images. DEVONthink will show the PDF’s Kind as PDF+Text, but the article isn’t searchable.
If so, you can try to OCR the PDF. The success of that, especially text recognition accuracy, will depend on the resolution of the images.
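To diagnose this kind of file yourself, here’s a minimal sketch of the check described above. The helper below is hypothetical; the per-page text would come from whatever extraction you use (DEVONthink’s Convert, or a library such as pypdf), but the decision logic itself needs nothing beyond the standard library:

```python
def searchable_pages(page_texts, min_chars=40):
    """Given the extracted text of each page, return the indices of
    pages whose text layer has enough characters to be usefully
    searchable. Cover-sheet-only PDFs typically yield just [0]."""
    return [i for i, text in enumerate(page_texts)
            if len((text or "").strip()) >= min_chars]

# Example: only the cover sheet (page 0) carries real text; the
# article pages are bare images with an empty or missing text layer.
pages = ["The Times Digital Archive - copyright notice ...", "", None]
print(searchable_pages(pages))  # -> [0]
```

If only page 0 comes back, you have exactly the cover-sheet-only situation described here, even though the file’s Kind says PDF+Text.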
I’m going to paste here the reply given by Jim Neumann so that if anyone else has this query, there’s an answer without bothering the lovely people at DEVONtechnologies:
A file will be marked as PDF+Text if there’s a text layer. That doesn’t mean the text layer has any data on it. However, in your case, there is text on it. If you convert the PDF to plain text you will see the page folios and the ownership / copyright information.
Note: Converting a file with a text layer may not yield acceptable results on already processed text. The quality of the original is also a huge factor in the process. If the original is fuzzy and lo-res, it will not convert well.
However, I set Preferences > OCR to 300dpi and 80% quality and converted the PDFs.
Abelard: And now it works a treat! My thanks for such a speedy and efficient reply, much appreciated.
Hi, sorry to bother you on this again, but how do I strip a PDF of a text layer and OCR it? I have a PDF downloaded from a journal, but I can’t convert it, so I’m in a similar situation to before. Converting at the settings above has no effect, and converting to text just gives me the text layer. How do I get DTPO to actually OCR the image?
You don’t need to first “strip” the text layer of a PDF+Text document, in order to apply OCR.
However, OCR will fail if the image quality of the PDF is low, especially if the resolution of the image was low. I’ve seen old photo archives of newspapers, e.g., that were scanned at a resolution less than 100 dpi. The text images don’t allow reasonable text recognition accuracy by the OCR software.
Thanks for posting. The image is of quite good quality, as one would expect from a well-respected archaeology journal (Antiquity). Using Convert PDF had no effect that I can see. My impression is that because it’s already defined by DTPO as PDF+Text, the conversion fails. Happy to send you a copy if you like.
Based on what values? A 72dpi image generally looks good onscreen but is insufficient for OCR purposes.
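For what it’s worth, the effective resolution can be estimated from the image’s pixel width and the width at which it is placed on the page (PDF measures pages in points, 72 per inch). A hypothetical helper, just to illustrate the arithmetic:

```python
def effective_dpi(image_px_width, placed_width_pts):
    """Estimate the effective resolution of a page image: pixels
    divided by the printed width in inches (72 PDF points = 1 inch)."""
    return image_px_width / (placed_width_pts / 72.0)

# An 850-pixel-wide scan placed across a US Letter page (612 pt = 8.5 in)
# works out to 100 dpi -- marginal for OCR, which generally wants 300+.
print(effective_dpi(850, 612))  # -> 100.0
```

(If you have Poppler installed, `pdfimages -list some.pdf` reports the actual pixel dimensions and ppi of every embedded image directly.)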
Just based on the fact that the writing is quite clear. My point is that when I use Data > Convert > to Searchable PDF, nothing happens. It doesn’t even try, and that’s what’s confusing me. Is there no way of converting a PDF+Text doc already within the database, or does DTPO just assume that such PDFs are already readable because of that text layer?
It should convert them after a warning.
I agree, that’s what’s odd. I must be missing something obvious here, so I’ve done a very short screen video (40 secs), so you can see what I’m trying to convert and what (doesn’t) happen when I try:
Some alerts include a checkbox to disable them, maybe that’s why no alert is displayed?
Perhaps, but I have no idea where that might be. The central question I have is why won’t DTPO try to OCR this doc, which is already within its database? Is that because it sees PDF+Text?
No. DEVONthink Pro Office Data > Convert > to searchable PDF will attempt to perform OCR on a selected PDF+Text document (leaving aside the issue of the normal warning message). But in some cases, even though you can see text in the image, the image is below the threshold image resolution or quality for the OCR software to properly “see” the text, and the attempt will fail or the result will be text recognition and conversion so poor as to be unusable for searches.
That’s not an uncommon problem with images captured from some archives of old newspaper or journal issues.
I’ve managed to do this now, but it raises another question. What I did was download another copy of the PDF and then used the File > Import > Image with OCR. It then scanned it no problem and now it’s a fully searchable PDF.
My question is why couldn’t I do this within DTPO with Convert > to Searchable PDF?
Stuff happens. PDFs have been produced in a wide variety of “flavors” depending on the creating app. Perhaps you had directly captured the original file to DEVONthink, and it was flaky in the OS X environment. Your second attempt resulted in a copy of the original file, that had been saved under OS X and so might have removed the original’s flakiness. Or perhaps the original file had been corrupted in download?
It’s not worth spending time speculating about a single instance. But if you find that the issue is common in files from the same source, the tip of working on a copy that had been saved under OS X could be useful.