I need to capture more than 500 pages from a website as webarchive files.
I have all the 500+ bookmarks ready, and I can use either the extra script provided by DEVONthink or my custom one (that saves as a PDF at the same time and does a few other useful things such as saving an image from the webpage).
However, some information on the pages I’m trying to capture is hidden (deliberately, by the website) unless I’m logged in. I thought “no problem, I’ll just open one of the links in the internal DEVONthink browser and log in from there”. Well, not exactly.
After I’m logged in, if I keep opening bookmarks from the same website, DEVONthink keeps the same session ID (it uses the same cookie for each query to the web server) so everything looks fine.
However, once I try to capture a webarchive or PDF of the same URL, I get the HTML content of a guest user. Apparently the process that does such captures does not retain the cookie and session ID.
Is there a way to keep my browsing session when capturing a webarchive in a script?
PS: Before anyone suggests it, I have already thought of using curl to send a POST request to log in (in the current case, or a credentials cookie in other cases). However, to use the same session for the download, I would also have to use curl instead of DEVONthink, and curl cannot output webarchive files. So that doesn’t seem like a possible solution.
Does anyone have an idea at DEVONtech?
I can’t be the only one who wants to save web content as webarchives or PDFs from a website that is accessible to subscribers only.
I’ve had a similar problem when attempting to clip content from websites that operate a paywall. Clipping as a webarchive, PDF or any of the other formatting options available via the DTPO webclipper doesn’t work - the clipped page replaces the content I wish to view with the standard ‘subscribe please’ content. I have yet to find a way around this, but I suspect that it may not be possible for the developers to do this due to protective measures at the website end.
Hello Kinsey, this is expected, or at least it can be explained. The web clipper extension does not send the content of the page you are currently browsing to DEVONthink, only the URL and the output type you want. DEVONthink then reloads the URL internally, without any cookie or session credentials. That is impractical, of course, and the DEVONthink developers could decide to send the whole page content from the browser to their application instead. It is probably more complicated than it sounds, though: web content is not only the HTML but also the images and scripts linked from the source, which must be loaded as well and may themselves require a valid session ID.
A possible workaround would be to connect to the website through the internal DEVONthink browser and then capture from within DEVONthink. This should work, and I seem to remember doing exactly that, so it must have worked at some point. It no longer does, at least on the website I need to capture.
With this method, it would be much less complicated for DEVONthink to share the internal browser’s cookie with the process that saves the webarchive or PDF files than to completely change the web clipper extension, so I’m still waiting for their answer.
The latest versions of WebKit are quite restrictive (to improve security) and therefore no longer share cookies. Possible workarounds:
1. Add a bookmark to DEVONthink and, after opening the bookmark, capture the web archive/PDF document (see menu Data > Capture, the contextual menus, or the action menu of the navigation bar)
2. Print a PDF to DEVONthink
3. Save a web archive from Safari to DEVONthink’s global inbox folder
4. Use DEVONagent Pro and add the rendered web page in the desired format to DEVONthink
Hello Christian,
Thank you for your detailed answer.
Workaround 1: Inside DT’s internal browser, I can display the webpage contents while logged in to the site. Clicking capture webarchive or PDF in the internal browser’s toolbar does capture the logged-in-only content. However, I can’t do that 542 times, once for each link I have to capture, and my AppleScript code that runs the same command only gets the logged-out content:
set this_webarchive to (create web document from this_URL agent my_user_agent in current group)
set this_PDF to (create PDF document from this_URL agent my_user_agent in current group width my_browser_width without pagination)
If this works manually from within DT, why does the AppleScript equivalent not have the same effect? I can’t see how Safari’s security limitations could be the cause, because the script only automates what DT can already do. After all, I can get new logged-in content by following links in DT’s browser, so the context is kept while I navigate within the window, or even when I select other bookmarks of the same website. The session is kept inside DT all the time, even for new pages, except when scripting. I am tempted to call that a bug: the scripted command should produce the same output as the button in the UI.
Workarounds 2 to 4: Of course printing to PDF from Safari will work (as will saving webarchives from Safari or DEVONagent), but I cannot repeat it 542 times, so it would have to be automated outside of DT. Can you suggest a script that would open Safari, print the PDF and import it back into DT?
Both Clip to DEVONthink and AppleScript use a background task to capture contents so that WebKit bugs don’t crash DEVONthink.
Safari’s AppleScript support is very limited but this script should work (if you’re logged in in DEVONthink’s browser):
tell application id "DNtp"
    set theSelection to the selection
    if theSelection is not {} then
        -- One browser window is reused for every selected bookmark.
        set theWindow to missing value
        try
            activate
            show progress indicator "Converting..." steps (count of theSelection) with cancel button
            repeat with theRecord in theSelection
                set theName to name of theRecord
                step progress indicator theName
                set theURL to URL of theRecord
                if theURL begins with "http:" or theURL begins with "https:" then
                    try
                        -- Load the page in DEVONthink's own browser (the logged-in session).
                        if theWindow is missing value then
                            set theWindow to open window for record theRecord
                        else
                            set URL of theWindow to theURL
                        end if
                        -- Wait until the page has finished loading.
                        repeat while loading of theWindow
                            delay 0.1
                        end repeat
                        -- Take the PDF from the window and store it next to the bookmark.
                        set theData to paginated PDF of theWindow
                        set theGroup to parent 1 of theRecord
                        set theConvertedRecord to create record with {name:theName, type:PDF document, URL:theURL} in theGroup
                        set data of theConvertedRecord to theData
                    end try
                end if
                if cancelled progress then exit repeat
            end repeat
            hide progress indicator
        on error error_message number error_number
            hide progress indicator
            -- -128 means the user cancelled, so no alert is needed.
            if the error_number is not -128 then display alert "DEVONthink Pro" message error_message as warning
        end try
        if theWindow is not missing value then close theWindow
    end if
end tell
Getting the data directly from the current window is a great idea, and it works perfectly. The only change I had to make was to close theWindow only if count of theSelection > 1, because when only one bookmark is selected, DT uses the current preview tab (for views that have a preview) instead of opening a new window.
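For anyone who wants to apply the same change, here is a minimal sketch of it. It is a condensed version of the script above (progress indicator and error handling left out for brevity, which is my simplification), and the only line that differs in behaviour is the final close. It assumes, as described above, that with a single selected bookmark DT reuses the preview instead of opening a separate window.

```applescript
-- Condensed sketch of the script above with the conditional close.
-- The progress indicator and error handling are omitted here for brevity.
tell application id "DNtp"
    set theSelection to the selection
    set theWindow to missing value
    repeat with theRecord in theSelection
        set theURL to URL of theRecord
        if theURL begins with "http:" or theURL begins with "https:" then
            -- Reuse one window so the logged-in session is kept.
            if theWindow is missing value then
                set theWindow to open window for record theRecord
            else
                set URL of theWindow to theURL
            end if
            repeat while loading of theWindow
                delay 0.1
            end repeat
            set theData to paginated PDF of theWindow
            set theConvertedRecord to create record with {name:(name of theRecord), type:PDF document, URL:theURL} in (parent 1 of theRecord)
            set data of theConvertedRecord to theData
        end if
    end repeat
    -- Changed line: only close the window when several bookmarks were
    -- selected. With a single bookmark, the "window" may be the current
    -- preview, which should stay open.
    if theWindow is not missing value and (count of theSelection) > 1 then close theWindow
end tell
```

With one bookmark selected the window is left alone; with several, the reused window is closed at the end as before.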