There’s the following page which seems particularly difficult to “clip”/capture into DEVONthink:
I tried capturing it with the clipping tool, formatted as a Formatted Note. I also tried PDF (not paginated). Neither was able to fully capture this page: the latter half of the images (from the middle of the webpage down to the bottom) isn’t captured.
About the only way I’ve found to capture this is to do an old-fashioned Copy, and then Paste into a Formatted Note document in DT.
Am I missing something? I’m using the latest version of DT on Big Sur (Big Sur was a recent upgrade from today; I moved from Catalina). Thanks!
FYI, while others may offer some trick or method in DEVONthink to overcome this, sometimes web sites simply make it difficult to get a capture. They may do it deliberately, in the spirit of making “pretty” web sites, or by using technology that streamlines their operations with lots of automation, which of course trickles down to the server as yet more automation. Gone are the days when web sites were simple-enough HTML.
Thanks for the suggestion. I already do something similar to your 2nd section, the one you labelled “most useful way”.
I haven’t tried printing Safari’s Reading View, but that’s an option to keep in mind. I usually don’t like to save webpages in PDF format. That said, I wish there were a way that DT could capture Safari’s Reading View without the user doing a manual Copy and Paste.
Also, I totally forgot about “Print Friendly”! Good reminder! I have used it so rarely in the past couple of years that it just slipped my mind.
You’re right that web pages are becoming more and more difficult to capture in just one step. More often than not, after capturing a page I go into DT to remove all the extraneous elements: ads interspersed throughout the document, unnecessary (and HUGE!) social media icons, etc.
I collect more web pages than I probably need. That being said, my normal practice is to show the page in Reading View in Safari and check whether the rendering is good enough (often the pasted Twitter images a page includes don’t show, and sometimes other images are missing too). If I’m OK with the rendering, I “print” into PDFpen and there delete any extra pages. Most web sites work, but a very small number simply don’t. I then create an Optimised PDF, shrinking all images to 75 DPI, and save that into DEVONthink. That often saves MBs. If all that doesn’t work for a web site, frankly, I normally just give up, as at this point I fail to save 99% of those bad pages. I’m working on a small DEVONthink Rule to process incoming PDFs with Ghostscript to shrink images to 75 dpi and make my process more automated, but I’ve not gotten there yet. Frankly, that’s just for fun, as I don’t do enough saving to make it worth spending too many hours automating it with a bullet-proof AppleScript.
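The Ghostscript step I have in mind looks roughly like this. This is just a sketch in Python rather than an actual DEVONthink Rule; the file names are placeholders, and the downsampling options are standard Ghostscript `pdfwrite` parameters:

```python
# Sketch: downsample all images in a PDF to 75 dpi via Ghostscript.
# "capture.pdf" / "capture-75dpi.pdf" are placeholder names.
import shutil
import subprocess

def build_gs_command(src: str, dst: str, dpi: int = 75) -> list:
    """Build a Ghostscript command that rewrites src to dst,
    downsampling colour, grey, and mono images to the given dpi."""
    return [
        "gs",
        "-sDEVICE=pdfwrite",
        "-dNOPAUSE", "-dBATCH", "-dQUIET",
        "-dDownsampleColorImages=true",
        f"-dColorImageResolution={dpi}",
        "-dDownsampleGrayImages=true",
        f"-dGrayImageResolution={dpi}",
        "-dDownsampleMonoImages=true",
        f"-dMonoImageResolution={dpi}",
        f"-sOutputFile={dst}",
        src,
    ]

cmd = build_gs_command("capture.pdf", "capture-75dpi.pdf")
if shutil.which("gs"):  # only run if Ghostscript is actually installed
    subprocess.run(cmd, check=True)
```

In a DEVONthink Smart Rule this would be wrapped in an “Execute Script” action pointed at the incoming PDF instead of the placeholder paths.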
I used to like the PDF format, but have shied away from it (if the source is not PDF itself) because while they may visually look good, text in PDF can be difficult to copy properly depending on how PDF engine flows the text, especially across pages.
Looks like you’re having fun creating an automated way to handle your PDF captures
If you’re talking about copying via the clipboard, I’ve never seen that be a problem.
I use PDF because it’s easy to highlight, OCR works, and the format will survive for a very long time (it’s an ISO standard).
Edit: creators of PDFs can turn on a feature that prevents readers from using the clipboard to copy. It’s not something I do when creating PDFs unless I have a reason to. Perhaps that is what’s interfering with your attempts to use the clipboard on PDFs created by others.
You’re loading an AMP page here. That’s a format developed by Google to (allegedly) make web pages load faster on mobile devices. Apart from the fact that Google kind of forces this format on everybody even though it has never been standardised, it will probably ensure that images “below the fold” (i.e. not currently visible) are not loaded by the browser. Only when you scroll to them does the browser request them from the server. This might be one reason for the problems you’re experiencing.
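If you want to check whether a saved page relies on deferred image loading, here’s a quick stdlib-only sketch. Note that `<amp-img>`, `loading="lazy"`, and `data-src` are common lazy-loading conventions, not something guaranteed for any particular site:

```python
# Rough diagnostic: count <img>-like tags in saved HTML that look lazy-loaded.
from html.parser import HTMLParser

class LazyImageCounter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.lazy = 0
        self.total = 0

    def handle_starttag(self, tag, attrs):
        if tag not in ("img", "amp-img"):
            return
        self.total += 1
        a = dict(attrs)
        # AMP's own image element, or common lazy-load markers
        if tag == "amp-img" or a.get("loading") == "lazy" or "data-src" in a:
            self.lazy += 1

# Made-up sample markup, just to show the idea:
sample = ('<img src="a.jpg">'
          '<img data-src="b.jpg" loading="lazy">'
          '<amp-img src="c.jpg"></amp-img>')
counter = LazyImageCounter()
counter.feed(sample)
print(counter.lazy, "of", counter.total, "images look lazy-loaded")
# -> 2 of 3 images look lazy-loaded
```

If most of a page’s images are flagged this way, a straight capture will only get the ones that were on screen, which matches the “middle to bottom missing” symptom above.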
As a sample, I’m attaching your page. How did I do it? First I captured it as a Formatted Note, then I went back into Safari, selected Reading View, and selected and copied all the text. Back in the Formatted Note, I deleted all the content and pasted. After that I deleted some backgrounds.
Google gave AMP-enabled pages higher search rankings, but they’ve announced they aren’t going to do that anymore, so the whole AMP thing might slowly go away.
Yes, I do mean copying with the clipboard. It’s been a long time since I’ve used the PDF clipping function, but depending on how the PDF’s text flow works, copying body text across pages can pick up header and footer information, if the text flow doesn’t distinguish the header and footer as elements separate from the body text. I’ve seen that problem before.
And sometimes, when initial caps are used on a website, copying text that includes them may also pick up weird extra spaces, stray ads/graphics, etc.
For my purposes in DT, I typically just want to capture the text and any relevant graphics, and care less about the format or visual style. (That, however, isn’t the case for other things I collect.)
For now, I use either Formatted Note or Markdown. Their plain-text underpinnings should make for a long-term archivable (and easily retrievable) format.
I think you’re probably right. It would explain why, even when capturing in PDF format, the graphics from the middle of the webpage to the bottom are blank in the PDF!
Wow! I had no idea you could do that. I’ll keep that trick, removing /platform/amp, in my bag of tricks. And yes, it did work for that page! I could capture it as a Formatted Note and then remove the extraneous icons, etc. directly in DT. Thanks!
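For anyone else landing here, the trick amounts to a one-line URL rewrite. The example URL below is made up, since the actual page wasn’t named in the thread:

```python
# Drop the "/platform/amp" path segment to get back to the non-AMP page.
def strip_amp(url: str) -> str:
    return url.replace("/platform/amp", "")

print(strip_amp("https://example.com/platform/amp/some-article"))
# -> https://example.com/some-article
```

URLs without that segment pass through unchanged, so it’s safe to run on any link before clipping.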