Webclipper not running OCR

joeg3 · October 6, 2021, 1:38pm

Hi, for a family history project, I’m trying to archive some family obituaries from the web by web clipping with the Firefox plugin in PDF uncluttered format so any photos come down too. The newly created document’s ‘Kind’ is PDF+Text, which I thought meant it was OCR’d. But neither searching DT nor searching within the document for a unique text string gives any results.

But maybe I’m off base and OCR isn’t a thing for web clipping?

Thanks
Joe

cgrunenberg · October 6, 2021, 2:06pm

OCR is usually only necessary for scans/photos, but it might also be necessary for poorly created PDF documents. Does converting the PDF document to plain/rich text produce the expected results?

joeg3 · October 6, 2021, 3:56pm

I clipped the same page as rich text. Now I can search and find the document within DT based on a unique word in the document, but doing a Cmd+F to search within the document doesn’t return results for that word in the document.

So I’m unclear how to get a webpage into DT that is searchable both within DT and within the document.

And I was under the assumption that a document type of “PDF+Text” was an OCR’d document, but maybe that’s not always true?

BLUEFROG · October 6, 2021, 4:16pm

I was under the assumption that a document type of “PDF+Text” was an OCR’d document, but maybe that’s not always true?

PDF+Text means there’s a detected text layer in the document, whether that’s produced by OCR or not.

but doing a Cmd+F to search within the document doesn’t return results for that word in the document.

A screen capture could be helpful.

joeg3 · October 6, 2021, 4:28pm

This is the rich text webclip. The word “bowhunter” is in the second paragraph, but DT doesn’t see it when searching within the document

BLUEFROG · October 6, 2021, 4:47pm

Thanks for the screen capture!

You have Enable Operators & Wildcards enabled but haven’t provided a wildcard.
Searching for bowh* or the entire word will find it…
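
To illustrate the difference, here is a minimal Python sketch of whole-word matching with optional wildcards. It is not how DEVONthink implements its search; the function name find_words and the sample sentence are made up for the example, and it only assumes that a term without a wildcard has to match a whole word.

```python
from fnmatch import fnmatchcase

def find_words(text, query):
    """Return the words in text that match query.

    Hypothetical helper: '*' and '?' in query act as wildcards;
    without them, the query has to equal the whole word.
    """
    # Split on whitespace and drop surrounding punctuation.
    words = [w.strip('.,;:!?"()').lower() for w in text.split()]
    return [w for w in words if fnmatchcase(w, query.lower())]

# Made-up sample sentence, not taken from the clipped page.
sample = "He was an avid bowhunter and fisherman."

print(find_words(sample, "bowh"))       # partial term, no wildcard
print(find_words(sample, "bowh*"))      # partial term with trailing wildcard
print(find_words(sample, "bowhunter"))  # entire word
```

Under that assumption, the bare partial term matches nothing, while the wildcard pattern and the full word both find "bowhunter".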

joeg3 · October 6, 2021, 5:09pm

Good to know a wildcard character is required if wildcard checkbox is checked - thanks!

So here is the top of the page that wasn’t in the earlier screenshot of the rich text webclip. The PDF webclip is much better because the clipped document looks just like the webpage. But the PDF webclip isn’t searchable in DT. I think I found a workaround. If I right-click the PDF webclip and select OCR > to searchable PDF, I get a popup asking if I want to convert this searchable PDF again. If I click Convert, the PDF webclip is now searchable.

Am I missing some setting where the webclipper would perform OCR so I don’t have to do it manually?

BLUEFROG · October 6, 2021, 5:15pm

No, the browser extension doesn’t need to do OCR. OCR is for image conversion and PDFs without a text layer.

There is something unusual about the page causing parts of the PDF text to be lost.

You could use your workaround if it suits you, but this shouldn’t be necessary in general.