Storing webpages in DT

Over the past weeks I have looked into the options for storing web content in DT. They all have their merits, but they do not work on all, or even most, webpages.

I tested the webarchive and PDF options on different webpages.

I found a way that works for me: using Firefox with the FireShot extension. The extension takes images of a webpage while scrolling through it, combines the images into one PDF, and stores it in a local folder. I then import the PDF into DTPO and OCR the image PDF into a PDF and Text file. I know it is a bit long-winded to convert text into an image and then back into text, but the benefit is that it works on all webpages I have encountered so far.

My question / feature request would be:
Add a feature for photographing webpages to DT. This would future-proof getting web content into DT.

Happy to hear your thoughts.

As stated in a different post, I really love this product and I am impressed with all the features it offers.

What is the issue with clipping directly to PDF with our browser extension?

For some webpages, content is missing. I assume these are JavaScript issues. For some webpages I have to scroll through the whole page before I can clip it, to make sure all images are loaded. It also happens that banners show up on clipped pages even though they were not visible at the time of clipping.

As I said, I have a workaround for these issues with the FireShot option. The post was meant to show a way to clip webpages without the above issues.

Thanks for replying to my post.

What you described is perfectly normal nowadays: images are lazy-loaded to save time and get good points with Google.

I would still suggest that you find another way to clip content: taking a screenshot and then OCRing it is not the most reliable way to capture text (I'm wondering how that works with multi-column pages, BTW). There should be browser-specific plugins available that extract the text with JavaScript, for example.
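
To illustrate the difference between extracting text and OCRing a screenshot, here is a minimal sketch. It is written in Python with only the standard library, not as a JavaScript browser plugin, and it assumes the page's HTML is already available as a string. The names `VisibleTextExtractor` and `extract_text` are made up for this example and are not part of DT or any existing plugin. It only sees text present in the HTML it is given, so content that a page loads later via JavaScript would still be missing.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes, skipping script/style content."""

    SKIPPED_TAGS = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []      # text fragments in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html: str) -> str:
    """Return the text fragments of an HTML string, one per line."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return "\n".join(parser.chunks)


if __name__ == "__main__":
    sample = "<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>"
    print(extract_text(sample))
```

Because the text is read from the markup rather than recognized from pixels, there are no OCR misreads, although the reading order simply follows the HTML source and not the visual layout.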


Thanks, you are right that converting text to an image and OCRing it back is not wonderful. But I had already stated that in my post. This solution is working for me.

The post is meant as an idea / feature request.