webarchive and pdf problems

dionysius2 · April 13, 2010, 1:48am

OK, here’s a problem for all you Gods and Goddesses out there. I have had this same concern with other software, so it would be nice to understand this once and for all.

If I save a page as an archive, and the page is WITHIN a website that I had to log into, the archived page is just the log-in page, when I go to look at it later – it’s not the page I wanted at all.

Why is it that an archive cannot be an HTML version of THAT PARTICULAR PAGE and be kept on my computer forever more? It’s not really an HTML version, it’s more of a link, really, since it also seems to update itself if the page itself is updated. This isn’t what I think of as an ‘archive’. This to me is just the same as saving the link. Am I wrong? I remember having this same problem with Webstractor.

What you need to do is save the page as a PDF to have a real copy on your hard drive forever more. But in fact, using DT’s bookmarklets, the PDF buttons make a PDF of … the log-in page! That’s so weird. What you have to do, it seems, is PRINT the page, and choose PRINT PDF TO DT. And then it usually works.

At the extreme end, even printing to PDF sometimes gives me a PDF of the log-in page. For instance, I made a payment today via Western Union and I was on the receipt page of WU, with my payment information and all that before my eyes, and I tried printing it to PDF to save it as a receipt and it printed the Western Union home page! I had to actually take a SCREENSHOT of the page in order to keep a record of it. Perhaps that page is ultra-secure somehow.

Anyway, this is important, because I’m saving lots of webarchives for research, and I’m going to be worried if, when I go to get them later, they are all archives of updated pages, which no longer contain the information I need.

Thanks.

Bill_DeVille · April 13, 2010, 2:05am

Many sites, including my bank’s online Web site, prohibit a second download of a displayed page unless a second login is made by the user. What you have discovered is that attempting to capture such a page by means of a script or bookmarklet results in a re-download of the page, and so all you capture is the login page.

Of course, you can select the text/image content of such a page and capture it as a rich text note.

Usually, to preserve formatting when I’m saving a bank transaction to a database I ‘print’ the page as PDF. The Print command is allowed by my bank, and the page doesn’t have to be re-downloaded, so that always works.
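The difference between the two capture routes described above can be sketched with a toy model. This is only an illustration under stated assumptions: the site, URL, session value, and function names are all hypothetical, and a missing session cookie stands in for whatever mechanism a real site uses to refuse a second download.

```python
# Toy model; every name, URL, and value here is hypothetical.
LOGIN_PAGE = "<html><body>Please log in</body></html>"
RECEIPT_PAGE = "<html><body>Receipt: payment confirmed</body></html>"

VALID_SESSIONS = {"abc123"}  # sessions the pretend site currently accepts


def fetch(url, cookies=None):
    """Stand-in for an HTTP GET against a login-protected site."""
    session = (cookies or {}).get("session")
    if session in VALID_SESSIONS:
        return RECEIPT_PAGE
    # No valid session: the site answers with its login page instead.
    return LOGIN_PAGE


def capture_by_url(url):
    """Script/bookmarklet-style capture: it is handed only the URL and
    downloads the page again, without the browser's session."""
    return fetch(url)


def capture_rendered(page_in_browser):
    """Print-to-PDF-style capture: it works from the page already on
    screen, so no new request is made."""
    return page_in_browser


url = "https://bank.example/receipt"
displayed = fetch(url, cookies={"session": "abc123"})  # what the user sees

recaptured = capture_by_url(url)       # the login page
printed = capture_rendered(displayed)  # the receipt
```

In this model the URL-based capture can only ever return what the server sends to a fresh request, which is why printing the page already on screen is the more reliable route.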

dionysius2 · April 13, 2010, 2:19am

Thanks Bill, that’s really useful about the Rich Text – I didn’t know that, though I did plan to experiment with each button.

sjk · April 13, 2010, 6:59pm

Any recommendation for saving PDFs of this forum’s threads so it preserves formatting? Looks fugly when I’ve tried (with the prosilver board style).

cturner · April 13, 2010, 9:19pm

How about “Print View” and tinker with the style sheet some?

korm · April 13, 2010, 9:50pm

I use DTPO to browse this forum. Open the forum in its own window. Use “Capture PDF” in the contextual menu. This is what I get. Formatting ok?

Bill_DeVille · April 13, 2010, 10:00pm

Comment: If I’m capturing a thread that contains an embedded text box, e.g., so that one can copy a script, I’ll choose to capture as WebArchive.

sjk · April 14, 2010, 2:37am

I don’t know what/where that is, but am curious.

Indeed you do.

Yep, that formatting is quite satisfactory for my infrequent usage purposes. And mentioning Capture PDF led me to discover that I can get the same result using the Add PDF Document to DEVONthink script directly from Safari. Problem solved – thanks!

cturner · April 14, 2010, 9:21am

Korm’s captures are pretty nice, but if you’re still curious: “View Source” on a print view page and search for “.css”.

You can download it and muck with it as long as (check the path, though) you keep it in the same directory as the print view HTML.

Enjoy! Charles
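For anyone who would rather not hunt through "View Source" by eye, the search for ".css" can be sketched in a few lines of Python. This is a rough sketch, not a tested tool for this forum: the sample HTML and the stylesheet path in it are made up, and a real print view page may reference its styles differently.

```python
from html.parser import HTMLParser


class StylesheetFinder(HTMLParser):
    """Collects the href of every <link rel="stylesheet"> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower().split()
        if "stylesheet" in rel and attr.get("href"):
            self.hrefs.append(attr["href"])


def find_stylesheets(html_text):
    """Return the stylesheet paths referenced by a saved HTML page."""
    finder = StylesheetFinder()
    finder.feed(html_text)
    return finder.hrefs


# Hypothetical stand-in for a saved print view page.
sample = """<html><head>
<link rel="stylesheet" href="./styles/print.css" type="text/css">
<link rel="icon" href="favicon.ico">
</head><body>thread text</body></html>"""

stylesheets = find_stylesheets(sample)
print(stylesheets)
```

The paths it lists are the ones to check when placing an edited copy of the style sheet next to the saved HTML.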

sjk · April 14, 2010, 5:12pm

I’m still dense – what’s a print view page?

cturner · April 14, 2010, 6:21pm

sjk · April 14, 2010, 11:59pm

D’oh! Thanks, Charles. All this time I’ve never paid attention to those icons (and wish I had).

I actually prefer that printable view, without any CSS adjustments, to the capture method for creating PDFs of full threads. But maybe I’ll find reasons to use the capture method, too. Problem re-solved.

cturner · April 15, 2010, 12:53am

Yeah, it’s pretty good. Would be nice if the font size difference between body, quotes, code, etc. was a bit smaller, but that’s a pretty trivial issue.

Best, C

dionysius2 · April 15, 2010, 10:50pm

OK I can see the perfection of korm’s formatting if you want to surf in DT. I don’t know where Add PDF Document to DEVONthink is – if it’s a script, and you’re in Safari, do you need to put the script into Safari’s scripts folder, sjk?

What I have noticed now is that there is a contextual menu in Safari itself which does exactly what I’m talking about – it saves the page as an archive rather than as a PDF and it doesn’t update when the page is updated, and you know why? Because the address for the page is on my hard drive rather than on the web. The HTML archives I save in DEVONthink all have addresses on the web. Doesn’t that – I’m asking this again – technically make them links rather than archives?

Meanwhile I can always save pages as archives in Safari itself and drag them into DT… although I don’t suppose this is going to be so easy if I’m saving a bunch of webpages from DEVONagent into DT.

Of all these questions, my main one is: can somebody who made DT tell me why archives have a web address rather than a hard-drive address? This is me learning. Cheers.