Capture a website as image and OCR it

halloleo · March 30, 2021, 1:31am

Coming from the thread “Webarchive doesn’t show static copy of web page”, I understand that webarchives and PDFs are, for different reasons, not as great for archiving most websites as I had hoped.

As suggested by @rmschne, I think a reasonably true way of archiving could be a browser image which is then saved into a PDF and OCRed.

Can DEVONthink’s Importer or another part of DEVONthink do this?

rmschne · March 30, 2021, 6:18am

I think converting a website to an image is completely impractical, and if I suggested that, I was wrong. I can see it being “possible” for a single web page (as opposed to a whole website) where there is a critical need to have that image, but other than giving a pretty representation of the content, it’s not as good as a PDF as a way to save the content of a web page.

DEVONthink can now convert image files to PDF, and these PDFs can be OCRed, but I’ve not tested how good it is. I am OK with saving web pages as I described (most of the time, I save the Reader view into a PDF).

chrillek · March 30, 2021, 6:49am

What exactly is it that you want to achieve? If you’re interested in the content (as implied by OCR’ing it), you could simply save a PDF of the HTML document. I don’t really understand what your goal is here. Archiving websites is already done on a global scale by web.archive.org anyway.
But if you take a screenshot of an HTML document, you:

- miss out on anything below the fold (which can be a lot nowadays)
- get poor resolution (100 dpi as opposed to 300 dpi with even an old laser printer)
- have the added disadvantage of advertisements in your image (in which you’re probably not really interested)
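
To put rough numbers on the resolution point, here is a minimal Python sketch. The 800 px capture width and the 8 inch print width are assumptions picked only so the result lands on the 100 dpi figure above, and the function names are hypothetical. It shows the effective dpi of a capture scaled to a page width, and how many pixels a capture would need to match a 300 dpi printer.

```python
import math


def effective_dpi(pixel_width: int, print_width_in: float) -> float:
    """Dots per inch when an image of pixel_width is scaled to print_width_in inches."""
    if print_width_in <= 0:
        raise ValueError("print width must be positive")
    return pixel_width / print_width_in


def pixels_needed(target_dpi: float, print_width_in: float) -> int:
    """Smallest pixel width that reaches target_dpi at the given print width."""
    if print_width_in <= 0:
        raise ValueError("print width must be positive")
    return math.ceil(target_dpi * print_width_in)


# Assumed example values, not measurements from any particular browser or extension.
CAPTURE_WIDTH_PX = 800   # hypothetical screenshot width at 1x scale
PAGE_WIDTH_IN = 8.0      # hypothetical printable width of a page

standard = effective_dpi(CAPTURE_WIDTH_PX, PAGE_WIDTH_IN)    # 1x capture
doubled = effective_dpi(CAPTURE_WIDTH_PX * 2, PAGE_WIDTH_IN)  # 2x ("Retina") capture
required = pixels_needed(300, PAGE_WIDTH_IN)                  # width for 300 dpi

if __name__ == "__main__":
    print(f"1x capture: {standard:.0f} dpi")
    print(f"2x capture: {doubled:.0f} dpi")
    print(f"pixels needed for 300 dpi: {required}")
```

Under these assumptions a 2x capture doubles the effective resolution but still falls short of 300 dpi. Whether that matters for OCR quality depends on the font sizes on the page.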

I think, before talking about the technique to use, one should figure out to what ends they want to archive something. In the case of HTML, hoping to freeze a document in time and keep all its layout properties etc. in another format is probably not going to work. Just think of a scrollbar – how would you capture that on paper?

chrk · March 30, 2021, 1:04pm

For times when Devon’s own PDF clipping options, as well as a printed PDF from Safari or Safari Reader view, fail to deliver satisfactory results (due to cookie notices with Devon’s clipper or other formatting issues, such as the print display looking off or there being no Reader view), I use a Safari extension that saves the browser image you mentioned as a PDF in Retina resolution. This extension is basically like printing the page to PDF, but the layout will be exactly as you see it on the page. That is, if you use an ad blocker, there won’t be any ads in the resulting PDF, etc.
There are several extensions like this.

chrillek · March 30, 2021, 1:50pm

So does it save the HTML document’s text in the PDF, too? Or is the PDF just an encapsulation of a JPEG or TIFF? In other words: do you have to run OCR after capturing to have a searchable PDF?
Update: Forget the question. The plugin description explains that: It is an image embedded in a PDF.
So basically all links in the document are gone, too. That’s fine if you do not need them, of course.

chrk · March 30, 2021, 2:01pm

Yeah, it happens infrequently that I need to use it, so it’s ok for some documents in my case. There is one extension like that, which does the same thing, but preserves links. I just didn’t like that it opened an install notification webpage after every Safari restart, so I use the other one.

BLUEFROG · March 30, 2021, 2:08pm

Back in the day, I used this…

https://tastyapps.com/websnapperpro.html

And though it’s in perpetual beta, I use Paparazzi at times…

rfog · March 30, 2021, 2:09pm

Just “purchased” it (it was free in my store), and the PDF is the encapsulated image set as a one-page PDF.

rfog · March 30, 2021, 2:12pm

Wow! This one seems very complete.

BLUEFROG · March 30, 2021, 2:22pm

WebSnapper Pro?

rfog · March 30, 2021, 2:34pm

No, no, the app you recommended, Paparazzi.

Edit: now I see you recommended two apps!!! I only checked Paparazzi!!!

rfog · March 30, 2021, 2:41pm

Ok, now my definitive comment.

Yes, @BLUEFROG, I was talking about WebSnapper.

(Sorry, doing 3 things at a time and I’m a man).

BLUEFROG · March 30, 2021, 3:06pm

No worries!

I haven’t used WebSnapper in years, but it was a great help at the time.
Nowadays, I only use Paparazzi occasionally for a specific in-house function.

chrk · March 30, 2021, 3:18pm

They seem like nice options. Having taken a quick look, however, it seems that neither saves exactly what you see in the browser (1), so they are closer to DEVON’s own PDF clipping features than to those other extensions.

(1) unless “Allow JavaScript from Apple Events” is enabled in Safari’s Develop menu while using WebSnapper’s Safari extension “to properly resolve dynamic modifications on a page (opening menus, closing cookie notes, etc.)”

kewms · March 30, 2021, 4:10pm

I don’t use it often, so I don’t know how well it would work for large-scale data collection, but I like the Copyfish browser extension. It’s most useful for things that don’t already have a text layer, like graphic novels and photos of signs. (I use it for foreign language study.)
https://ocr.space/copyfish

halloleo · March 31, 2021, 1:04am

Good question. My dream is:

A method to capture a static but searchable snapshot of the website as I saw it while I scrolled through.

I know this is a tall order, but this is what I would like for nicely designed blog sites and similar. But not always: sometimes the website experience is pretty bad due to ads and self-promotion (e.g. news sites!), and then I don’t want the site as is, but a cleaned-up version.

For the latter case, DEVONthink’s declutter option can help. But for the first case, DEVONthink’s PDF offering only sometimes looks close to what I see on the page.

halloleo · March 31, 2021, 1:06am

@chrk @BLUEFROG @kewms Thanks for all the suggestions. I will try them out.

So far I have tried FireShot and it looks pretty good.