I don’t think that there’s any programming involved here – just a simple smart rule like this (not tested!)
I’ve found when converting WebArchives to Paginated PDFs that mathematical formulas with exponents are rendered incorrectly: the exponent comes down off its perch. Any thoughts on correcting this?
First, sorry for the long delay in replying to this thread; things were busy IRL.
Unfortunately, I have come across many pages where capturing as PDF cuts text in the middle of lines, e.g. https://www.keensoft.es/en/alfresco-devcon-2019/ . At the moment, capturing web content is a bit of a mixed bag with DEVONthink – I would welcome a new, independent approach beyond PDF and WebArchives. However, I also understand how tricky it is to even decide on a format in this context. Let’s hope some alternative will present itself.
No problem at all!
And we are always working on something so you never know
Damn man, now I can’t wait for something.
Looking into webarchives I found this thread again. Worth mentioning:
The WebArchive class is deprecated, not the file format webarchive.
There’s a new method, from the WWDC20 notes:
WKWebView has learned to create Web Archives with createWebArchiveData(completionHandler:)
Everything’s fine
As long as you don’t plan on ever using these archives outside of the Appleverse.
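For what it’s worth, a .webarchive file is a binary property list, so its contents can at least be inspected outside Apple’s tools. Below is a minimal Python sketch (not an official API). It assumes the key names that Safari-written archives appear to use (WebMainResource, WebSubresources, WebResourceURL, WebResourceMIMEType, WebResourceData); the function name is made up for illustration, and nested frame archives are ignored.

```python
import plistlib


def list_webarchive_resources(data: bytes):
    """Return (url, mime_type, size_in_bytes) for each resource stored in a webarchive.

    Assumption: the archive is a property list with a "WebMainResource" dict and an
    optional "WebSubresources" list, as seen in Safari-created files.
    Nested frame archives are not handled in this sketch.
    """
    archive = plistlib.loads(data)
    resources = [archive["WebMainResource"]]
    resources.extend(archive.get("WebSubresources", []))
    return [
        (
            res.get("WebResourceURL", ""),
            res.get("WebResourceMIMEType", ""),
            len(res.get("WebResourceData", b"")),
        )
        for res in resources
    ]
```

Usage would be something like reading the file in binary mode and passing the bytes in, e.g. `list_webarchive_resources(open("page.webarchive", "rb").read())`, where the file name is a placeholder.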
I simply love WebArchives!!!
Cannot imagine living without them.
What I noticed is that DTTG (and probably DT too) seems to refer to the remote site!
So, most of the time when I do something with WebArchives, I disable Internet access for the iPad and then handle the WebArchives. Otherwise, DTTG seems to try to access the original website.
This can be annoying, of course.
So a way to disable active internet access for WebArchives would be GREAT.
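In the meantime, one way to get a rough idea of why an archive still reaches out to the network is to compare the URLs its main HTML references with the resources actually stored in the file. The Python sketch below is only a heuristic. It assumes the same property-list keys that Safari-written archives appear to use, looks only at src/href attributes on a few tags, and will miss anything loaded by scripts or from CSS; the function and class names are made up for illustration.

```python
import plistlib
from html.parser import HTMLParser
from urllib.parse import urljoin


class _ReferenceCollector(HTMLParser):
    """Collects absolute URLs from src/href attributes of resource-loading tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = set()

    def handle_starttag(self, tag, attrs):
        if tag not in ("img", "script", "link", "iframe"):
            return
        for name, value in attrs:
            if name in ("src", "href") and value:
                self.urls.add(urljoin(self.base_url, value))


def remote_references(data: bytes):
    """Return a sorted list of http(s) URLs referenced by the main page but not embedded."""
    archive = plistlib.loads(data)
    main = archive["WebMainResource"]
    encoding = main.get("WebResourceTextEncodingName") or "utf-8"
    raw = main.get("WebResourceData", b"")
    try:
        html = raw.decode(encoding, errors="replace")
    except LookupError:
        # Unknown encoding name: fall back to UTF-8 for this sketch.
        html = raw.decode("utf-8", errors="replace")
    collector = _ReferenceCollector(main.get("WebResourceURL", ""))
    collector.feed(html)
    embedded = {res.get("WebResourceURL") for res in archive.get("WebSubresources", [])}
    return sorted(
        url
        for url in collector.urls
        if url.startswith(("http://", "https://")) and url not in embedded
    )
```

An empty result does not prove the archive is fully self-contained; it only means this simple check found nothing obviously missing.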
Alternatively, clipper can capture webpages with SingleFile. It produces a self-contained HTML file with all the images, styles, and scripts. The files can be viewed with a regular browser so don’t require anything special but WebKit.
Thanks a lot for this reference! SingleFile looks like a very interesting project; unfortunately, due to “real life” interfering, I won’t have the time to test it for a while. Perhaps someone else is also interested and could look into DEVONthink / Safari integration of this tool, possibly via a script?
I’m curious to see if SingleFile works with this page Golang project structuring — Ben Johnson way | by vignesh dharuman | SellerApp | Medium. This page contains code snippets hosted at Github; I have tried every format the DT clips to, including web archive, and cannot get the code to be embedded in the output.
Scroll down and wait until the GitHub parts are loaded. Afterwards two ways work over here in Safari.
Webarchive (via dragging)
- Select the part that you want to clip
- Drag it onto DEVONthink’s icon in the dock
PDF (via printing)
- Press ⌘+P
- In the left corner, under PDF, select Save PDF to DEVONthink 3
Instead of using this menu with the mouse, you can create a shortcut; search the forum for how.
Thank you for the suggestions @pete31. I’m just not having much luck capturing the entirety of the document. When I print to PDF, the file is subject to the whims of a paper page size, which proves to be too narrow to contain the code samples.
I also tried selecting all text in the article and dragging to the DT dock icon, but it too provided spotty results.
I’m including the files created in DT for reference:
PDF printing.pdf (383.0 KB)
select and drag to DT icon.webarchive.zip (279.9 KB)
Too narrow?
The width of the printed PDF is controlled by the Page Setup in the printing application.
True, @BLUEFROG. I was able to get a much higher percentage of the code parts to render by going from Portrait to Landscape, but it did not get everything. Just for kicks I created a page that was 50x50 inches and it appears to have captured all the information.
monstrous page pdf.pdf (377.5 KB)
It is obviously too wide, but having to tweak and proofread kind of obviates quick and easy clipping.
Umm, I now checked the records I captured carefully and they are also missing parts (PDF) or do not wrap (Webarchive). So that was a bad suggestion (was in a hurry …)
I’m not sure if clipping this content easily is possible. The content isn’t local to the page - and quite slow loading as well.
Also, the HTML is a mess: “div soup” would be a very friendly term to describe it. The code block is actually delivered as a table – are they still living in 2000?
Indeed!
I’m not a fan of the Medium site, in general. That just adds to it.
Agreed @BLUEFROG, I’m no fan either. Alas, that is where the content lives.