Clip to DEVONthink / Save the webpage as-is without cookie confirmation

Ugur · February 5, 2021, 10:09am

I use the web clipper (in Firefox) a lot to capture web pages. What annoys me is that a page I see in Firefox with cookies already confirmed lands in DEVONthink in a state where I have to re-confirm cookies every time I view it as a document.

I used to use Evernote to clip webpages, and it would capture a static view of a webpage at that moment.

What I’d like to accomplish in DEVONthink is this:

capture a static view of that site
preferably not in PDF format, but rather HTML
the way I see it in the browser
no reload in DEVONthink whenever I view the document

I could capture as PDF, but even then I see a big banner overlapping the document saying “confirm cookies”.

What am I missing?

Edit

Apparently I had turned off accepting cookies in DEVONthink

Settings > Web > Accept Cookies > Never

But thanks for all the answers!

chrillek · February 5, 2021, 10:36am

Off the top of my head: Cookies are not part of the web page. They are stored by the browser (and sent by the server). So if you save an HTML page as is, the cookies are not part of it.

Regardless of the technicalities: DT says that it is saving the HTML data, and it even shows the size of the page. But it really seems to reload it from the original server, which leads to the behavior that you describe. And which is not desirable.

I think that “HTML page” should mean exactly that: the page itself at the time it was saved. Not something the server sends when DT tries to display the page again. Also, the documentation does not seem to describe the behavior shown by DT.

cgrunenberg · February 5, 2021, 10:41am

The captured HTML page might contain JavaScript and this is supported by DEVONthink (see Preferences > Web) and might cause this. The best options are to use a different file format or clutter-free layout.

Ugur · February 5, 2021, 10:44am

What I have observed is that cookies I confirm in Chrome browser (on a Mac) need no re-confirmation in DEVONthink. But I’d rather not use Chrome browser but stick to Firefox.

So how is it possible that confirmed cookies in Chrome need no re-confirmation but those from Firefox do?

Also, when I capture a website at that very moment, I need to rely on the information captured. That a captured HTML page can change in DEVONthink makes it unacceptable for my work.

But maybe I am missing something, and there is a switch that makes it possible to create a static copy of the webpage.

Ugur · February 5, 2021, 10:44am

Unfortunately, in html format there is no way to check “clutter-free” format.

chrillek · February 5, 2021, 10:51am

As @cgrunenberg said: the HTML page might contain JavaScript. Which in turn might build part of the page by requesting data from the server. Imagine a shop system… No static HTML pages there.

Then you should use PDF or MD. HTML has long been a dynamic format, thanks to JavaScript. And not only that: even simple links in the page or images are dynamic elements that are loaded from a server (<img src="http://example.com/image.png">). There’s no way to make them static in an HTML document that you save, regardless of the software you use for that. If http://example.com/image.png points to a car today and to a motorcycle tomorrow, you have to use a different format than HTML if you want to be sure that your file always shows the car.
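
To see how many such references a saved page still carries, a small script can list them. This is only a rough sketch using Python's standard html.parser; the names ReferenceCollector, REFERENCE_ATTRS and external_references are made up for illustration, and the tag list is deliberately incomplete (it ignores CSS url(...) references, srcset, and anything JavaScript loads at runtime).

```python
from html.parser import HTMLParser

# Tags and the attribute that usually points at a separately loaded resource.
# Illustrative only, not an exhaustive list.
REFERENCE_ATTRS = {
    "img": "src",
    "script": "src",
    "link": "href",
    "iframe": "src",
}


class ReferenceCollector(HTMLParser):
    """Collects the URLs a saved HTML page would still request when opened."""

    def __init__(self):
        super().__init__()
        self.references = []

    def handle_starttag(self, tag, attrs):
        wanted = REFERENCE_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                self.references.append((tag, value))


def external_references(html):
    """Return a list of (tag, url) pairs found in the given HTML text."""
    collector = ReferenceCollector()
    collector.feed(html)
    return collector.references


sample = (
    '<html><body><img src="http://example.com/image.png">'
    '<script src="app.js"></script></body></html>'
)
for tag, url in external_references(sample):
    print(tag, url)
```

Every entry in that list is something the viewer has to fetch again later, so each one is a place where the "saved" page can change or break.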

cgrunenberg · February 5, 2021, 10:55am

You could use formatted notes which are based on HTML too.

Ugur · February 5, 2021, 10:56am

The hint with JavaScript was very helpful.

Turned off JavaScript in

Settings > Web > "Enable JavaScript"

and now I am able to see the captured pages (from within Firefox) without re-confirming cookies in DEVONthink.

Thank you very much!

chrillek · February 5, 2021, 2:13pm

But the pages can still change over time. Or rather their representation.

Ugur · February 5, 2021, 3:03pm

So I should rather save / clip in Webarchive format instead of HTML?

This comment here suggests so:

chrillek · February 5, 2021, 3:39pm

Webarchive is deprecated (Apple deprecates WebArchives - what does this mean for DEVONthink?) and not widely supported. Meaning that apparently the only browser you can use to display them is Safari.

If you want to capture an HTML page frozen at the moment you saw it, my best bet would be a portable format like PDF. Especially if (from your quote) images in Webarchives are downloaded again when the computer you’re viewing them on is online (what a weird concept of “archive”). Which would go against the whole idea of having a frozen copy of the page.

BLUEFROG · February 5, 2021, 6:03pm

Ahh, the joys of dynamic content delivery… sigh.

suavito · February 6, 2021, 12:02pm

Not necessarily just dynamic content. Markdown webpages clipped by DEVONthink, whether clutter-free or not, do load from their respective servers too. Images, that is. Which is the reason they do not qualify for archiving either.

chrillek · February 6, 2021, 3:17pm

Good point. As long as references to files (aka URLs) are saved, regardless of the format, the result is not an archive. Only if the current contents of the referenced files are embedded in the final result does one have a real archive. Less complicated: PDF is the best choice here.
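
For illustration, "embedding" in HTML can mean replacing an image URL with a data: URI that carries the image bytes inside the document itself. The following is a minimal sketch, not what DEVONthink does: embed_images and the assets mapping are hypothetical, the image bytes are assumed to have been fetched already, and the regular expression only handles double-quoted src attributes on img tags.

```python
import base64
import re


def embed_images(html, assets):
    """Replace <img src="URL"> with an inline data: URI for every known URL.

    `assets` maps URL -> (mime_type, raw_bytes). Fetching the bytes is
    out of scope for this sketch.
    """
    def replace(match):
        url = match.group(2)
        if url not in assets:
            return match.group(0)  # unknown reference: leave it untouched
        mime, data = assets[url]
        encoded = base64.b64encode(data).decode("ascii")
        return f"{match.group(1)}data:{mime};base64,{encoded}{match.group(3)}"

    # Simplified pattern: double-quoted src attributes on <img> tags only.
    pattern = re.compile(r'(<img\b[^>]*?\ssrc=")([^"]*)(")', re.IGNORECASE)
    return pattern.sub(replace, html)


page = '<p>Car:</p><img src="http://example.com/image.png">'
assets = {"http://example.com/image.png": ("image/png", b"\x89PNG")}
print(embed_images(page, assets))
```

After this step the image no longer depends on the server, at the cost of a larger file and a document that is no longer byte-for-byte what the server sent.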

lunedi_hax · February 8, 2021, 4:06pm

I use .pdf to get a pretty good representation of the page/s. I used to choose unpaginated from the clipper drop down but I found recently that DT now saves one long strip which it then tries to print all on one A4 sheet (it renders fine on screen so I hadn’t noticed). Now I have changed to paginated pdf which prints OK. This happens on both Safari and Firefox.
Some newspapers prevent clipping to DT; I have found that sometimes it works to export the page as pdf in Safari (File/Export as pdf) and then drag it into DT.

suavito · February 9, 2021, 6:03am

That’s at least one step too many: Just print the webpage and pick Save PDF to DEVONthink 3. This method respects the activation status of Reader View too.

darwin · February 12, 2021, 1:58pm

I save on iPad-DTTG as Webarchive to one database which is synced to my Desktop-Mac. There a smart rule converts it to a PDF, which is synced back to my iPad-DTTG (the webarchive is deleted). Most importantly: All weblinks are saved.

toffy · November 4, 2022, 8:00am

Archiving web-pages in DT has also been bugging me.

This doesn’t seem to work universally. On my test page, turning off javascript led to an almost empty page being displayed. Cookie-warning was gone, but also the entire page… So:

In many cases, yes, but the problem is that anything not visible on the page while printing will be lost (e.g. folded content). And some things will not be included despite being visible (e.g. thumbnails from embedded youtube videos).

Another disadvantage with pdf is that if you print the webpage to pdf you often need to make sure that the page format is set to landscape, because if it is in portrait mode, chances are you will get a mobile version of the page (or something that looks different to what you see in your browser).

If you use the DEVONthink browser plugin and select pdf there, the result is better, but even when I select “PDF (one page)”, I sometimes get two pages (the first one being an empty one).

So there doesn’t seem to be one good solution. You need to turn on your brain when archiving and think about what you want to preserve.

But I still don’t quite understand what actually gets archived when using the html option: do I understand correctly that there is a risk that parts of the page (e.g. images) are not actually stored in DT but loaded from the web, which means that once they disappear from the web, they disappear “from your archive”? And there is no way of stopping this? How does Evernote do it?

chrillek · November 4, 2022, 8:38am

What is printed from a web document to a PDF depends on the “print” style sheet used (or not used) by this document. Therefore, one cannot say if it works or doesn’t in general.

You will always get something that looks different from what you see in your browser. Simply because a browser has no pages, while PDF very much has (printing to “single-page PDF” aside). When I said, “PDF is the best choice here”, I meant that you have the best chances to get a printed version of a document as the author of this document wants it to appear on paper. For example, someone even slightly aware of CSS would exclude the navigation bar from the PDF – it makes no sense there. Also, they’d hopefully print out the link targets, since you can’t click on a printed PDF. And so on.

Of course. There never (or only in very rare cases) “is” an image in a web document. There are just references to it. And if the reference goes away, the image will no longer be displayed. Which is better than someone changing the file the document references, in which case you’ll see a different image than before. The same is true, btw, for scripts and fonts.

There is. But then you change the document because you have to modify the references to images, scripts, and fonts. Which goes a bit against the idea of keeping everything as it is. You can use a command line tool like wget, which can download all assets to your local disk and modify the references in the original document accordingly (curl on its own only fetches the URLs you give it).
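
The "modify the references" step could look roughly like the sketch below. It only plans the rewrite: localize_references and the assets directory name are invented for the example, nothing is actually downloaded, name collisions between assets are ignored, and only double-quoted src attributes on img and script tags are handled.

```python
import posixpath
import re
from urllib.parse import urljoin, urlparse


def localize_references(html, base_url, asset_dir="assets"):
    """Rewrite asset URLs to local paths.

    Returns (rewritten_html, downloads), where downloads maps each absolute
    URL to the local path that should receive its content.
    """
    downloads = {}

    def replace(match):
        # Resolve relative references against the page's own URL.
        absolute = urljoin(base_url, match.group(2))
        name = posixpath.basename(urlparse(absolute).path) or "index"
        local = f"{asset_dir}/{name}"
        downloads[absolute] = local
        return f"{match.group(1)}{local}{match.group(3)}"

    pattern = re.compile(
        r'(<(?:img|script)\b[^>]*?\ssrc=")([^"]*)(")', re.IGNORECASE
    )
    return pattern.sub(replace, html), downloads


html = (
    '<img src="/pics/car.png">'
    '<script src="http://cdn.example.com/app.js"></script>'
)
rewritten, downloads = localize_references(html, "http://example.com/page.html")
print(rewritten)
for url, path in downloads.items():
    print(url, "->", path)
```

The rewritten document is exactly the "changed" document mentioned above: it shows the same content, but its references now point at local copies instead of the original servers.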

The main decision here is this: Do you want to “archive” the document as it appears at this moment? Or do you want to see changes like removed or modified images etc.?

Anyway, HTML was not conceived for archiving (simply think about stale links). If you want to save it for future use, you have to live with drawbacks and shortcomings.

joe.lafferty · November 15, 2022, 8:17pm

Some very helpful comments here, thanks!

I want to capture information for teaching purposes, and have preferred capture as paginated PDF, as this gets most of the information on the website into the PDF. Even if the layout is at times a bit skew-whiff, I can live with that. Sometimes I capture PDF ‘clutter free’ but that sometimes cuts off vital parts of the webpage. I have also at times captured both PDF and MD, but again, MD seems at times to cut off early. Another problem with MD files is that if I then move the files from a temp / holding folder, I can lose access to the ‘assets’ as they remain in a sub folder in the temp / catchall location.

I too have been having problems with the dreaded cookie pop-up, which annoyingly greys out the website. I’ve tried switching off JavaScript in DT3, and have also downloaded an extension for Chrome called Consent-O-Matic, which stops the cookie pop-up every time - thank goodness! Like others have shared, I too prefer not to use Chrome (Google knows enough about me!). The extension does not work on Safari, but it does work on the new Arc Browser, which seems to be Chrome-based.

So, I have two questions. First, is there an easy way to move MD files from a temp location and keep the links to images intact? And, secondly, have there been any further developments that might help get rid of the dreaded cookie pop-up in imported PDFs?