Clip to DEVONthink / Save the webpage as-is without cookie confirmation

What I have observed is that cookies I confirm in Chrome browser (on a Mac) need no re-confirmation in DEVONthink. But I’d rather not use Chrome browser but stick to Firefox.

So how is it possible that confirmed cookies in Chrome need no re-confirmation but those from Firefox do?

Also, when I capture a website at a given moment, I need to be able to rely on the information captured. That a captured HTML page can change in DEVONthink makes it unacceptable for my work.

But maybe I am missing something, and there is a switch that makes it possible to create a static copy of the webpage.

Unfortunately, in html format there is no way to check “clutter-free” format.

As @cgrunenberg said: the HTML page might contain JavaScript. Which in turn might build part of the page by requesting data from the server. Imagine a shop system… No static HTML pages there.

Then you should use PDF or MD. HTML has been a dynamic format for a long time already, thanks to JavaScript. And not only that: even simple links in the page or images are dynamic elements that are loaded from a server (<img src="http://example.com/image.png">). There’s no way to make them static in an HTML document that you save, regardless of the software you use for that. If http://example.com/image.png points to a car today and to a motorcycle tomorrow, you have to use a different format than HTML if you want to be sure that your file always shows the car.
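To see how much of a saved HTML file still depends on a server, one can list the references a browser would fetch when displaying it. The following is only a rough sketch using Python's standard library; the names `ExternalRefFinder` and `find_external_refs` are made up for illustration, and the set of tags checked is an assumption that is far from complete (CSS `url(...)` references, for example, are ignored).

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class ExternalRefFinder(HTMLParser):
    """Collect absolute http(s) URLs that are fetched when the page is displayed."""

    # (tag, attribute) pairs that typically trigger a request on display.
    # This list is an illustrative assumption, not an exhaustive one.
    FETCHING_ATTRS = {("img", "src"), ("script", "src"),
                      ("link", "href"), ("iframe", "src")}

    def __init__(self):
        super().__init__()
        self.refs = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if (tag, name) in self.FETCHING_ATTRS and value:
                # Only absolute http(s) URLs are counted; relative paths
                # and protocol-relative URLs are skipped in this sketch.
                if urlparse(value).scheme in ("http", "https"):
                    self.refs.append((tag, value))


def find_external_refs(html_text):
    finder = ExternalRefFinder()
    finder.feed(html_text)
    return finder.refs


saved_page = ('<p>My car:</p><img src="http://example.com/image.png">'
              '<img src="local.png">')
print(find_external_refs(saved_page))
```

Every entry in the returned list is something that can change or vanish after the page was saved.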

You could use formatted notes which are based on HTML too.

The hint with JavaScript was very helpful.

Turned off JavaScript in

Settings > Web > "Enable JavaScript"

and now I am able to see the captured pages (from within Firefox) without re-confirming cookies in DEVONthink.

Thank you very much!

But the pages can still change over time. Or rather their representation.

So I should rather save / clip in Webarchive format instead of HTML?

This comment here suggests so:

Webarchive is deprecated (Apple deprecates WebArchives - what does this mean for DEVONthink?) and not widely supported. Meaning that apparently the only browser you can use to display them is Safari.

If you want to capture an HTML page frozen at the moment you saw it, my best bet would be a portable format like PDF. Especially if (from your quote) images in Webarchives are downloaded again when the computer you’re viewing them on is online (what a weird concept of “archive”). Which would go against the whole idea of having a frozen copy of the page.

Ahh, the joys of dynamic content delivery… sigh.
:roll_eyes:

Not necessarily just dynamic content. Markdown webpages clipped by DEVONthink, whether clutter-free or not, load their images from the respective servers too. Which is the reason they do not qualify for archiving either.

Good point. As long as references to files (aka URLs) are saved, regardless of the format, the result is not an archive. Only if the current versions of the referenced files are embedded in the final result does one have a real archive. Less complicated: PDF is the best choice here.
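To illustrate what "embedded" means for HTML: an image reference can be replaced by a `data:` URI that carries the image bytes inside the document itself. The sketch below is a minimal illustration, not how DEVONthink or any clipper works. The function name `embed_images` is hypothetical, the image bytes are assumed to have been downloaded already, every image is assumed to be a PNG, and the regular expression only handles double-quoted `src` attributes.

```python
import base64
import re


def embed_images(html_text, image_bytes_by_url, mime_type="image/png"):
    """Replace <img src="URL"> by a base64 data URI for every known URL."""
    # Matches the opening of an <img> tag up to src=", the URL, and the closing quote.
    pattern = re.compile(r'(<img\b[^>]*?\ssrc=")([^"]*)(")', re.IGNORECASE)

    def replace_src(match):
        data = image_bytes_by_url.get(match.group(2))
        if data is None:
            return match.group(0)  # unknown reference: leave it untouched
        encoded = base64.b64encode(data).decode("ascii")
        return f"{match.group(1)}data:{mime_type};base64,{encoded}{match.group(3)}"

    return pattern.sub(replace_src, html_text)


page = '<img src="http://example.com/image.png">'
# b"car" stands in for the real image bytes downloaded today.
print(embed_images(page, {"http://example.com/image.png": b"car"}))
```

The resulting file no longer depends on example.com, but it is also no longer the document the server delivered, and it can become very large.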

I use .pdf to get a pretty good representation of the page/s. I used to choose unpaginated from the clipper drop down but I found recently that DT now saves one long strip which it then tries to print all on one A4 sheet (it renders fine on screen so I hadn’t noticed). Now I have changed to paginated pdf which prints OK. This happens on both Safari and Firefox.
Some newspapers prevent clipping to DT; I have found that sometimes it works to export the page as pdf in Safari (File/Export as pdf) and then drag it into DT.

That’s at least one step too many: Just print the webpage and pick Save PDF to DEVONthink 3. This method respects the activation status of Reader View too.

I save on iPad-DTTG as Webarchive to one database which is synced to my Desktop-Mac. There a smart rule converts it to a PDF, which is synced back to my iPad-DTTG (the webarchive is deleted). Most importantly: All weblinks are saved.

Archiving web-pages in DT has also been bugging me.

This doesn’t seem to work universally. On my test page, turning off JavaScript led to an almost empty page being displayed. The cookie warning was gone, but so was the entire page… So:

In many cases, yes, but the problem is that anything not visible on the page while printing will be lost (e.g. folded content). And some things will not be included despite being visible (e.g. thumbnails from embedded youtube videos).

Another disadvantage with PDF is that if you print the webpage to PDF you often need to make sure that the page format is set to landscape, because if it is in portrait mode, chances are you will get a mobile version of the page (or something that looks different from what you see in your browser).

If you use the DEVONthink browser plugin and select PDF there, the result is better, but even when I select “PDF (one page)”, I sometimes get two pages (the first one being an empty one).

So there doesn’t seem to be one good solution. You need to turn on your brain when archiving :wink: and think about what you want to preserve.

But I still don’t quite understand what actually gets archived when using the HTML option: do I understand correctly that there is a risk that parts of the page (e.g. images) are not actually stored in DT but loaded from the web, which means that once they disappear from the web, they disappear “from your archive”? And there is no way of stopping this? How does Evernote do it?

What is printed from a web document to a PDF depends on the “print” style sheet used (or not used) by this document. Therefore, one cannot say if it works or doesn’t in general.

You will always get something that looks different from what you see in your browser. Simply because a browser has no pages, while PDF very much has (printing to “single-page PDF” aside). When I said, “PDF is the best choice here”, I meant that you have the best chances to get a printed version of a document as the author of this document wants it to appear on paper. For example, someone even slightly aware of CSS would exclude the navigation bar from the PDF – it makes no sense there. Also, they’d hopefully print out the link targets, since you can’t click on a printed PDF. And so on.

Of course. There never (or only in very rare cases) “is” an image in a web document. There are just references to it. And if the reference goes away, the image will no longer be displayed. Which is better than someone changing the file the document references, in which case you’ll see a different image than before. The same is true, btw, for scripts and fonts.

There is. But then you change the document because you have to modify the references to images, scripts, and fonts. Which goes a bit against the idea of keeping everything as it is. You can use a command line tool like wget or curl that downloads all assets to your local disk and modifies the references in the original document accordingly.
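For wget, the relevant options are `--page-requisites` (download images, scripts and stylesheets along with the page) and `--convert-links` (rewrite the references to point to the local copies). What such a tool does can be sketched roughly as follows. This is an illustration only, not wget's implementation: `localize_assets` and `fetch_over_http` are hypothetical names, only double-quoted `src` attributes with absolute http(s) URLs are handled, and fonts or stylesheets referenced from CSS are ignored.

```python
import os
import re
from urllib.parse import urlparse
from urllib.request import urlopen


def fetch_over_http(url):
    """Default downloader; any callable returning bytes can be used instead."""
    with urlopen(url) as response:
        return response.read()


def localize_assets(html_text, asset_dir, fetch=fetch_over_http):
    """Download every src="http(s)://..." target and point the HTML at the copy."""
    os.makedirs(asset_dir, exist_ok=True)
    folder = os.path.basename(os.path.normpath(asset_dir))
    local_names = {}  # URL -> file name inside asset_dir

    def replace(match):
        url = match.group(2)
        if url not in local_names:
            base = os.path.basename(urlparse(url).path) or "asset"
            # A running number as prefix avoids clashes between equal file names.
            name = f"{len(local_names)}_{base}"
            with open(os.path.join(asset_dir, name), "wb") as handle:
                handle.write(fetch(url))
            local_names[url] = name
        # The path is relative, so the HTML must be saved next to asset_dir.
        return f"{match.group(1)}{folder}/{local_names[url]}{match.group(3)}"

    pattern = re.compile(r'(\ssrc=")(https?://[^"]+)(")', re.IGNORECASE)
    return pattern.sub(replace, html_text)
```

Calling `localize_assets(html, "assets")` returns the modified HTML; it has to be written to a file in the folder that contains `assets` for the relative references to resolve.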

The main decision here is this: Do you want to “archive” the document as it appears at this moment? Or do you want to see changes like removed or modified images etc.?

Anyway, HTML was not conceived for archiving (simply think about stale links). If you want to save it for future use, you have to live with drawbacks and shortcomings.

Some very helpful comments here, thanks!

I want to capture information for teaching purposes, and have preferred capturing as paginated PDF as this gets most of the information on the website into the PDF. Even if the layout is at times a bit skew-whiff, I can live with that. Sometimes I capture PDF ‘clutter free’ but that sometimes cuts off vital parts of the webpage. I have also at times captured both PDF and MD, but again, MD seems at times to cut off early. Another problem with MD files is that if I then move the files from a temp / holding folder, I can lose access to the ‘assets’ as they remain in a sub folder in the temp / catchall location.

I too have been having problems with the dreaded cookie pop up which annoyingly greys out the website. I’ve tried switching off JavaScript in DT3, and have also downloaded an extension for Chrome called Consent-O-Matic, which stops the cookie pop up every time - thank goodness! Like others have shared, I too prefer not to use Chrome (Google knows enough about me!). The extension does not work on Safari, but it does work on the new Arc Browser, which seems to be Chrome based.

So, I have two questions. First, is there an easy way to move MD files from a temp location and keep the links to images intact? And, secondly, have there been any further developments that might help with getting rid of the dreaded cookie pop up in imported PDFs?
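Regarding the first question, the general idea is to move the referenced image files together with the Markdown file, so that the relative links still resolve. Below is a minimal sketch of that idea. It assumes plain files in the file system with relative image links like `![alt](assets/pic.png)`. It says nothing about how DEVONthink stores or moves assets inside a database, and `move_markdown_with_assets` is a made-up name.

```python
import re
import shutil
from pathlib import Path

# Captures the target of a Markdown image link: ![alt](target ...)
IMAGE_LINK = re.compile(r'!\[[^\]]*\]\(([^)\s]+)')


def move_markdown_with_assets(md_path, dest_dir):
    """Move a Markdown file and the local images it links to, keeping relative paths."""
    md_path = Path(md_path)
    dest_dir = Path(dest_dir)
    text = md_path.read_text(encoding="utf-8")

    for target in IMAGE_LINK.findall(text):
        if "://" in target or target.startswith("/"):
            continue  # remote or absolute reference: nothing to move
        source = md_path.parent / target
        if source.is_file():
            destination = dest_dir / target
            destination.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(source), str(destination))

    dest_dir.mkdir(parents=True, exist_ok=True)
    new_path = dest_dir / md_path.name
    shutil.move(str(md_path), str(new_path))
    return new_path
```

Because the images keep their position relative to the Markdown file, the text of the file itself does not need to be changed. Images shared by several Markdown files would need copying instead of moving.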

We are working to try and minimize cookie banners when clipping.

  • Will it be perfect? Not with changing approaches employed by web devs.

  • Then will it be better? From what we’ve seen, yes though we know there will be exceptions within the estimated 1.7 billion websites on the Internet :flushed:

thanks Jim,
might have guessed you’d be working on it!
keep up the great work - DT3 is my second brain, and the main reason I moved to Mac in 2007. I have not regretted it, as you and the team have kept improving DT and I appreciate it.

We’re glad to have your continued support :slight_smile: