Web archives, Firefox Scrapbook, set data, encoding - issues

I ran into a bit of a problem while trying to migrate from Firefox’s quite nice Scrapbook plugin to the hopefully even more excellent DEVONthink Pro as my potentially sole pile for digital written things collected here and there. My AppleScript attempts produced some results that I can’t explain and which don’t seem to be documented either. As far as I can tell, there are some hiccups with DEVONthink’s “create record” and/or “set data” commands.

The moving source…
Scrapbook’s data management is relatively easy:

  • a root folder
  • --folders for each captured page (named with a timestamp)
  • ----the files of the webpage in that folder (index.html as the main page, .jpgs, .swfs …)
  • ----index.dat

The index.dat has exactly the information you wouldn’t want to miss in DEVONthink:

  • date of creation
  • URL
  • folders
  • title
  • comments
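
Judging by the regular expression used in the import script further down, each line of index.dat is a field name, a tab, and then the value. A minimal sketch of splitting such text into field/value pairs with plain AppleScript follows. The handler name parse_index_dat and the sample text are made up for illustration; this is not how the script in this thread does it (it uses TextCommands).

```applescript
-- Sketch: split the text of an index.dat into {field, value} pairs.
-- parse_index_dat is a hypothetical helper, not part of the original script.
on parse_index_dat(fileText)
	set fieldPairs to {}
	set oldDelims to AppleScript's text item delimiters
	set AppleScript's text item delimiters to {tab}
	try
		repeat with aLine in paragraphs of fileText
			set lineText to contents of aLine
			-- Lines without a tab carry no field/value pair and are skipped
			if lineText contains tab then
				set fieldName to text item 1 of lineText
				-- Rejoin the remainder with tabs, because a value (e.g. the folder list) may itself contain tabs
				set fieldValue to (text items 2 thru -1 of lineText) as text
				set end of fieldPairs to {fieldName, fieldValue}
			end if
		end repeat
	on error errText number errNum
		-- Restore the delimiters before passing the error on
		set AppleScript's text item delimiters to oldDelims
		error errText number errNum
	end try
	set AppleScript's text item delimiters to oldDelims
	return fieldPairs
end parse_index_dat

-- Made-up sample input, not a real capture
set sampleText to "id" & tab & "20050310200743" & linefeed & "title" & tab & "Das Leben als Chance zur Krise" & linefeed & "source" & tab & "http://www.test.de/order/blub.html"
parse_index_dat(sampleText)
```

Restoring the text item delimiters in both the normal and the error path keeps the rest of a longer script from inheriting the tab delimiter.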

Slightly unfortunately, the Scrapbook developer(s) orphaned the folder property in the index.dat a few months ago, probably last October. The folder data is now kept exclusively in a kind of central directory file called scrapbook.rdf, which I learned the morning after my MacBook ran my AppleScript for a decent part of the night. (As I’m not too familiar with scripting these XML-like files, I preferred to manually move a few hundred pages into the correct DT groups. DT’s categorisation support has been a great help.)

…and a stubborn target
As to the role of DT in that migration procedure, the important lines of code are the ones Christian has given several times as an example of how to import webpages into web archives stored in DT:


     set theArchive to create record with {name:theName, URL:theURL, type:html} in theGroup
     set data of theArchive to download web archive from theURL

The script processed my four-digit number of Scrapbook pages almost flawlessly. Everything looked very nice in DT: nice pages with nice metadata in the address bar and the info box, put nicely into groups mirroring the Scrapbook folders (apart from those pages captured since October). Only a few minor issues were obvious:
a) DT’s address bar still shows something like “file://pathtoScrapbookEtc” instead of “http://thiswasmyhome.org” even though the URL field has been set to the latter by AppleScript,
b) clicking on the at-sign next to the path field in the Information window does nothing,
c) thumbnails of web archives imported by a script only appear once you click into and edit the web archive (is there a script for placebo-editing lots of pages?).
d) (Off-topic: I had a look at DT’s export features, just in case. I still have tons of stuff in old files from askSam SurfSaver, a once nice, now doomed Windows software. I think your “DEVONtech_storage” file is a bit too proprietary and too tight-lipped.)

…with an issue
However, when I “disconnect” the source from DEVONthink, either by renaming the source folder in Finder (in the case of my Scrapbook files with their “file://” URLs) or by switching off AirPort (in the case of pages with “http” URLs like something.com), DEVONthink is no longer able to display the page. It just displays nothing. Strangely, however, the DEVONthink address bar (?) still displays reasonable information on the size, the number of characters and the subject, and the Information window even shows all the right values. Clicking on the menu bar command Format > Edit source (Quelltext bearbeiten) even reveals good-looking HTML code. Clicking then on Format > View page (Seite betrachten) again shows a blank piece of white nothing. (FYI: the database properties dialogue shows that there are hardly any pictures in the database. However, the page sizes are quite often > 100 kB.)

You can still drag the web archive from DT to the Finder and open it in a text editor; the source seems to include binary data for JPEGs etc., and the archive opens well in Safari. According to the text editor, it’s UTF-8 encoded. (I had quite a problem reading umlauts from Scrapbook’s UTF-8 encoded index.dat, so I chose a third-party library, “TextCommands”, to re-encode the input string from UTF-8 to Unicode.)
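
As an aside on the umlaut problem: AppleScript’s own read command can decode UTF-8 when asked to, which may make a separate re-encoding step unnecessary. A minimal sketch, assuming the file really is UTF-8 and the path is a colon-separated HFS path as used elsewhere in this thread; read_utf8_file is a hypothetical helper name, not something from the original script.

```applescript
-- Sketch: read a UTF-8 encoded text file into AppleScript text.
-- read_utf8_file is a hypothetical helper name.
on read_utf8_file(hfsPath)
	set f to open for access file hfsPath
	try
		-- «class utf8» tells the read command to decode the bytes as UTF-8
		set str to read f as «class utf8»
		close access f
		return str
	on error errText number errNum
		-- Make sure the file is closed before passing the error on
		close access f
		error errText number errNum
	end try
end read_utf8_file
```

It would be called with something like read_utf8_file(name_of_thisfile), where the argument is the path to an index.dat.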

After grabbing pages from within Safari with the “Add web archive for DEVONthink.scpt” script located in ~/library/scripts/Applications/Safari, I think DT treats http URLs differently from file URLs. DT never manages to show pages that were grabbed from addresses like “file:///iamyourmachine/weareusers/thisisyourhome/justgrabme.html” once the source is “disconnected”. Interestingly enough, however, you can still click on the “Open with…” command, select, let’s say, “Safari”, and there is the page that DT couldn’t show.

I’ll attach some code with comments describing how strangely and differently DT deals with different parameters for the “set data” command.


set dtrootgroup to "Inbox_test"
set source_URL to "file://localhost/folders/Scrapbook/20050310200743/index.html"

(*
  dealing with scrapbook folder, looping and repeating, dealing with index.dat etc....
  i'm happy to share it once we've solved this set data-thing....
*)

create_new_rec_webarchive("Das Leben als Chance zur Krise", source_URL, "www.test.de/order/blub.html", {"history", "warfare"}, dtrootgroup, "20050310162344")

on create_new_rec_webarchive(theTitle, scrapbookURL, originURL, scrapbookfolder, devonwebstoregroup, timestamp)
	tell application "DEVONthink Pro"
		try
			-- Set target group
			set AppleScript's text item delimiters to {"/"}
			set targetpath to devonwebstoregroup & "/" & (scrapbookfolder as string)
			set AppleScript's text item delimiters to {""}
			set targetlocation to create location targetpath
			
			-- Create Record
			set theArchive to create record with {name:theTitle, type:html, URL:originURL, path:scrapbookURL} in targetlocation
			
			--#########################
			--#####  the crucial DT lines
			-- Fill the webarchive with content

			--set data of theArchive to download web archive from scrapbookURL 
			-- content not displayed in dt if disconnected from source, 
			-- "open with...> safari" works fine
			
			--set data of theArchive to download web archive from "http://www.heise.de/newsticker/meldung/87932" 
			-- works fine in script, shows page in DT even with 
			--airport off, everything's just perfect

			set data of theArchive to download web archive from "http://localhost/~meandmyhome/index.html" 
			--raises err 1700 from dt, even though the 
			--URL mentioned works fine in Safari

			--set data of theArchive to download web archive from "http://localhost/~meandmyhome/scrapbookroot/20050310200743/index.html" 
			--raises err 1700 from dt, even though the 
			--URL mentioned works fine in Safari
			
			-- stuff commented out
			--set path of theArchive to scrapbookURL
			--set URL of theArchive to originURL
			--set comments = folder
			
			--set creation date = timestamp...
			--..omitted stuff			
			
		on error errText number errNum
			log "###ERROR: " & scrapbookURL
			log errText & ", " & errNum
		end try
	end tell
end create_new_rec_webarchive


I tried accessing my ScrapBook from DEVONthink using Indexing. The main problem was, I’m using ScrapBook as a way of capturing my browsing history – and that just got to be a huge amount of indexing every time I would sync the directory. My script would change the directory name of the scrapbook entry (as seen from DEVONthink) into the title name. But it was just too much.

Enter Spotlight. I found that if I change the index.html files to $TITLE.html, and hide the extension, that MoRU becomes a wonderfully effective way of searching through my web history. After all, I tend to not want things in DEVONthink until I’ve confirmed their value.

Now I doubt this matters much if you only capture into your ScrapBook things that are already valuable.

In that case, I might recommend searching for the command-line utility “webarchiver”, and just converting your entire ScrapBook database into webarchives, which you can then drag-and-drop into DEVONthink.

John

Bugs B and C are solved. Thanks for that.
A still is an issue, albeit a minor one.

I don’t suppose you ever got DEVONthink to effectively archive your scrapbook database…did you?

If so, I would be sincerely grateful if you could find it in your heart to respond back to this post with precisely how you did it, and with what applescript exactly (in layman’s terms if at all possible, as I am completely ignorant when it comes to apple-scripting)

I really look forward to hearing back from you about this! Thanks…

it worked. and now, with this alleged webkit bug fixed, a seamless migration is possible.

as to the script: just change the folders - both file system and devonthink - to your personal settings, save it, compile it, run it. and look what happens.

You’ll also need a collection of scripts called “TextCommands”. osaxen.com/files/textcommands1.1.3.html

scrapbook2devon_build16.applescript

  

global Scrapbook_root
global dtrootgroup

-- Root group in dt for scrapbook files
set Scrapbook_root to "HD:Users:me:Documents:Projekte:Test_Infomanagement:Migration:ScrapbookAuszug1:" as string
--set Scrapbook_root to "HD:Users:me:Documents:Projekte:Test_Infomanagement:Migration:Resteimport" as string
-- /Users/me/Documents/Projekte/Test_Infomanagement/Migration

set dtrootgroup to "--Inbox"


scrapbook2devon()


on scrapbook2devon()
	
	-- root folder for scrapbook subfolders
	set this_folder to Scrapbook_root as string -- (choose folder with prompt "Pick the folder containing the files to process:") as string
	
	-- Determine the individual Scrapbook folders
	tell application "System Events"
		set scrapFolders to every folder of folder this_folder
	end tell
	
	-- Loop through the folders of the individual pages
	repeat with i from 1 to the count of scrapFolders
		
		log "####################################" & return
		set thisFolder to (item i of scrapFolders)
		URL of thisFolder
		log "####################################" & return & return
		--get properties of thisFolder
		--get URL of thisFolder
		--get URL of thisFolder
		--stop
		
		--set filenames
		set name_of_thisfile to (path of thisFolder & "index.dat") -- :-format:
		set indexHtml_url to (URL of thisFolder & "index.html") -- file:..scrapbook-URL
		--log name_of_thisfile
		
		try
			-- open the file and convert it to fix the Unicode problem
			set filecontent to get_filecontent_as_string(name_of_thisfile)
			tell application "TextCommands"
				set filecontentUTF to convert to unicode filecontent from "utf-8"
			end tell
			
			-- write content of index.dat into 2-dimensional list
			-- scrapbook's index.dat's fields: id, type, title, chars, icon, source, comment, folder
			tell application "TextCommands"
				set listdata to search filecontentUTF for "^(\\w+)\\t(.*)$" with regex and individual line matching -- well, the comment folder might be a problem due to possible linefeeds (upd: no br-tags, no chr10. fine.)
			end tell
			
			-- loop through rows of that list and return field/value-pairs
			set id_value to ""
			set title_value to ""
			set source_value to ""
			set comment_value to ""
			set folder_list to {""}
			set type_value to "" -- in scrapbook either blank or "file"
			
			repeat with j from 1 to (count listdata)
				set thisfield to item 1 of item j of listdata
				set fieldvalue to item 2 of item j of listdata
				if thisfield is "folder" then
					set AppleScript's text item delimiters to {tab}
					set folder_list to text items of fieldvalue
					set AppleScript's text item delimiters to {""}
				else if thisfield is "id" then
					set id_value to fieldvalue
				else if thisfield is "title" then
					set title_value to fieldvalue
				else if thisfield is "source" then
					set source_value to fieldvalue
				else if thisfield is "comment" then
					set comment_value to fieldvalue
				else if thisfield is "type" then
					set type_value to fieldvalue
				end if
				log thisfield & ": " & tab & fieldvalue
			end repeat
			
			-- Save scrapbook file in devonthink	
			my create_new_rec_webarchive(title_value, indexHtml_url, source_value, folder_list, dtrootgroup, id_value, comment_value, type_value)
			
		on error errText number errNum
			log "###ERROR in scrapbook2devon subroutine:  " & indexHtml_url
			log errText & ", " & errNum
		end try
	end repeat
	
end scrapbook2devon

on create_new_rec_webarchive(theTitle, indexHtml_url, originURL, scrapbookfolder, devonwebstoregroup, timestamp, commenttext, type_value)
	tell application "DEVONthink Pro"
		--activate
		--with timeout of 120 seconds
		try
			--set theDestination to display group selector "Destination" buttons {"Cancel", "OK"}
			
			-- Set target group
			set AppleScript's text item delimiters to {"/"}
			set targetpath to devonwebstoregroup & "/" & (scrapbookfolder as string) -- deal with scrapbookfolder
			set AppleScript's text item delimiters to {""}
			set targetlocation to create location targetpath
			
			-- Create Record
			if type_value is not "" then log "### TYPE not empty: " & type_value
			set type_value to "" -- why check it at all?
			set theArchive to create record with {name:theTitle, URL:indexHtml_url, type:html} in targetlocation --URL:originURL, path:indexHtml_url, comment:commenttext
			
			log "==== record created in DEVONthink"
			log "indexHtml_url: " & indexHtml_url
			-- Fill it with content etc.
			set data of theArchive to download web archive from indexHtml_url
			set path of theArchive to indexHtml_url
			set URL of theArchive to originURL
			set comment of theArchive to commenttext
			
			--set creation date = timestamp...
			set this_day to get texts 7 thru 8 of timestamp as integer
			set this_month to get texts 5 thru 6 of timestamp as integer
			set this_year to get texts 1 thru 4 of timestamp as integer
			set this_hour to get texts 9 thru 10 of timestamp as integer
			set this_minute to get texts 11 thru 12 of timestamp as integer
			set this_second to get texts 13 thru 14 of timestamp as integer
			
			set myDate to current date
			set day of myDate to 1 -- avoid month overflow while the fields are being set
			set year of myDate to this_year
			set month of myDate to this_month
			set day of myDate to this_day
			set time of myDate to (this_hour * 3600) + (this_minute * 60) + this_second -- seconds since midnight
			set the date of theArchive to myDate
			
			
		on error errText number errNum
			log "###ERROR in create_new_rec_webarchive subroutine: " & indexHtml_url
			log errText & ", " & errNum
		end try
		--end timeout
		--activate
	end tell
end create_new_rec_webarchive


on get_filecontent_as_string(fileRef)
	try
		set f to open for access fileRef
		set str to read f as string
		close access f
		return str
	on error errText number errNum
		log "###ERROR: " & " (get_filecontent_as_string)"
		log errText & ", " & errNum
		return ""
	end try
end get_filecontent_as_string
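
The date handling in create_new_rec_webarchive depends on turning a Scrapbook id such as 20050310200743 into a date. Here is a minimal sketch of that step in isolation, assuming the id is always 14 digits in the form YYYYMMDDhhmmss. The handler name date_from_timestamp is hypothetical, and setting the month from an integer assumes a reasonably recent AppleScript version.

```applescript
-- Sketch: convert a 14-digit Scrapbook timestamp (YYYYMMDDhhmmss) into a date.
-- date_from_timestamp is a hypothetical helper name.
on date_from_timestamp(timestamp)
	set theDate to current date
	-- Set the day to 1 first so that changing year or month cannot overflow into the next month
	set day of theDate to 1
	set year of theDate to (text 1 thru 4 of timestamp) as integer
	set month of theDate to (text 5 thru 6 of timestamp) as integer
	set day of theDate to (text 7 thru 8 of timestamp) as integer
	set h to (text 9 thru 10 of timestamp) as integer
	set m to (text 11 thru 12 of timestamp) as integer
	set s to (text 13 thru 14 of timestamp) as integer
	-- The time property of a date is the number of seconds since midnight
	set time of theDate to (h * hours) + (m * minutes) + s
	return theDate
end date_from_timestamp

date_from_timestamp("20050310200743")
```

The example call should yield a date for 10 March 2005 at 20:07:43. The point of the sketch is that the time of day has to be given in seconds since midnight, not as the six digits hhmmss read as one number.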


Thanks so much for the prompt reply about this script…

I cannot seem to figure out what/where I should enter into the script for the path of my Scrapbook folder in DT, and for the actual path of my Scrapbook captures folder that I have designated in Firefox???

Here is the path of my Scrapbook folder that I’ve created in DT:

Macintosh HD:Users:micahdiamond:Documents:DEVONthink Pro Database:Tech Documents:Firefox Scrapbook:

Here is the path of my Scrapbook folder that I have setup the scrapbook plugin to capture pages to…

Macintosh HD:Users:micahdiamond:Documents:Firefox Scrapbook-Captured Webpages

would there be any way that you could perhaps respond back to this post with the exact code that I should plug in to the script so that I can get it working properly?

Any help you could offer would be greatly appreciated! Thanks…

The first file system path is irrelevant. What is needed instead is the name of the DEVONthink group in which the web archives should reside after the import procedure. Open DT, open the option-command-six view, and choose the title of any group you can see in the left panel. Or create a new one in DT, e.g. “--scrapbook-import”.

Code to be replaced…:


-- Root group in dt for scrapbook files 
set Scrapbook_root to "HD:Users:me:Documents:Projekte:Test_Infomanagement:Migration:ScrapbookAuszug1:" as string 
--set Scrapbook_root to "HD:Users:me:Documents:Projekte:Test_Infomanagement:Migration:Resteimport" as string 
-- /Users/me/Documents/Projekte/Test_Infomanagement/Migration 

set dtrootgroup to "--Inbox" 

Replacing code:


-- folder in file system that contains scrapbook's date-stamped folders (like "20050309161902") 
set Scrapbook_root to "Macintosh HD:Users:micahdiamond:Documents:Firefox Scrapbook-Captured Webpages:" as string 

-- Group in Devonthink where the webpages should reside in after import 
set dtrootgroup to "--scrapbook-import" 


I still cannot get this script to work properly?? This is no doubt due to the fact that I am completely naive when it comes to all aspects of applescript…Perhaps you can help me get this script figured out once and for all…

I think it would be helpful if you could clarify exactly what the various paths in the code of your working script represent and where they point.

What exactly are the 3 entries listed here referring to?
-- Root group in dt for scrapbook files
“HD:Users:me:Documents:Projekte:Test_Infomanagement:Migration:ScrapbookAuszug1:”
“HD:Users:me:Documents:Projekte:Test_Infomanagement:Migration:Resteimport”
/Users/me/Documents/Projekte/Test_Infomanagement/Migration

What is ScrapbookAuszug1? What is Resteimport? What is Migration?

In your response to my last post, you wrote:

-- folder in file system that contains scrapbook's date-stamped folders (like "20050309161902")
set Scrapbook_root to "Macintosh HD:Users:micahdiamond:Documents:Firefox Scrapbook-Captured Webpages:" as string 

The exact folder that contains the date-stamped folders for each Scrapbook capture is:
“Macintosh HD:Users:micahdiamond:Documents:Firefox Scrapbook-Captured Webpages:Data:”

Should I specify this “Data” folder, or do you mean that I should specify the “Firefox Scrapbook-Captured Webpages” folder?
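
Since scrapbook2devon looks for index.dat directly inside each subfolder of Scrapbook_root, the path presumably has to name the folder that directly contains the date-stamped folders. A quick sanity check before running the full import might look like the sketch below. It uses the “Data” path quoted above; the variable name subfolderNames and the error message are illustrative only.

```applescript
-- Sketch: check that Scrapbook_root points at the folder holding the date-stamped folders.
-- Assumption: the date-stamped folders sit directly inside this folder.
set Scrapbook_root to "Macintosh HD:Users:micahdiamond:Documents:Firefox Scrapbook-Captured Webpages:Data:"

tell application "System Events"
	if not (exists folder Scrapbook_root) then error "Folder not found: " & Scrapbook_root
	set subfolderNames to name of every folder of folder Scrapbook_root
end tell

-- The names logged here should look like "20050309161902"
repeat with aName in subfolderNames
	log (contents of aName)
end repeat
count of subfolderNames
```

If the logged names are not timestamps, or the count is zero, the import loop would have nothing sensible to process at that path.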

A couple of random issues that might be contributing to my inability to effectively run this script:

  • Does it matter whether or not I have the ‘Multi-Scrapbook’ feature enabled?
  • Is a restart necessary after installing textcommands library?
  • Does it matter that I am using DEVONthink Pro Office?
  • Does it matter whether Firefox is open when I run the script?

Is there anything else you can think of that might be the problem?

Just for verification, here is the exact code in its entirety that I am trying to launch from the DEVONthink Pro Office script menu:

global Scrapbook_root
global dtrootgroup

-- folder in file system that contains scrapbook's date-stamped folders (like "20050309161902")
set Scrapbook_root to "Macintosh HD:Users:micahdiamond:Documents:Firefox Scrapbook-Captured Webpages:" as string

-- Group in Devonthink where the webpages should reside in after import
set dtrootgroup to "--Firefox Scrapbook-Captured Webpages" 


scrapbook2devon()


on scrapbook2devon()
	
	-- root folder for scrapbook subfolders
	set this_folder to Scrapbook_root as string -- (choose folder with prompt "Pick the folder containing the files to process:") as string
	
	-- Determine the individual Scrapbook folders
	tell application "System Events"
		set scrapFolders to every folder of folder this_folder
	end tell
	
	-- Loop through the folders of the individual pages
	repeat with i from 1 to the count of scrapFolders
		
		log "####################################" & return
		set thisFolder to (item i of scrapFolders)
		URL of thisFolder
		log "####################################" & return & return
		--get properties of thisFolder
		--get URL of thisFolder
		--get URL of thisFolder
		--stop
		
		--set filenames
		set name_of_thisfile to (path of thisFolder & "index.dat") -- :-format:
		set indexHtml_url to (URL of thisFolder & "index.html") -- file:..scrapbook-URL
		--log name_of_thisfile
		
		try
			-- open the file and convert it to fix the Unicode problem
			set filecontent to get_filecontent_as_string(name_of_thisfile)
			tell application "TextCommands"
				set filecontentUTF to convert to unicode filecontent from "utf-8"
			end tell
			
			-- write content of index.dat into 2-dimensional list
			-- scrapbook's index.dat's fields: id, type, title, chars, icon, source, comment, folder
			tell application "TextCommands"
				set listdata to search filecontentUTF for "^(\\w+)\\t(.*)$" with regex and individual line matching -- well, the comment folder might be a problem due to possible linefeeds (upd: no br-tags, no chr10. fine.)
			end tell
			
			-- loop through rows of that list and return field/value-pairs
			set id_value to ""
			set title_value to ""
			set source_value to ""
			set comment_value to ""
			set folder_list to {""}
			set type_value to "" -- in scrapbook either blank or "file"
			
			repeat with j from 1 to (count listdata)
				set thisfield to item 1 of item j of listdata
				set fieldvalue to item 2 of item j of listdata
				if thisfield is "folder" then
					set AppleScript's text item delimiters to {tab}
					set folder_list to text items of fieldvalue
					set AppleScript's text item delimiters to {""}
				else if thisfield is "id" then
					set id_value to fieldvalue
				else if thisfield is "title" then
					set title_value to fieldvalue
				else if thisfield is "source" then
					set source_value to fieldvalue
				else if thisfield is "comment" then
					set comment_value to fieldvalue
				else if thisfield is "type" then
					set type_value to fieldvalue
				end if
				log thisfield & ": " & tab & fieldvalue
			end repeat
			
			-- Save scrapbook file in devonthink   
			my create_new_rec_webarchive(title_value, indexHtml_url, source_value, folder_list, dtrootgroup, id_value, comment_value, type_value)
			
		on error errText number errNum
			log "###ERROR in scrapbook2devon subroutine:  " & indexHtml_url
			log errText & ", " & errNum
		end try
	end repeat
	
end scrapbook2devon

on create_new_rec_webarchive(theTitle, indexHtml_url, originURL, scrapbookfolder, devonwebstoregroup, timestamp, commenttext, type_value)
	tell application "DEVONthink Pro"
		--activate
		--with timeout of 120 seconds
		try
			--set theDestination to display group selector "Destination" buttons {"Cancel", "OK"}
			
			-- Set target group
			set AppleScript's text item delimiters to {"/"}
			set targetpath to devonwebstoregroup & "/" & (scrapbookfolder as string) -- deal with scrapbookfolder
			set AppleScript's text item delimiters to {""}
			set targetlocation to create location targetpath
			
			-- Create Record
			if type_value is not "" then log "### TYPE not empty: " & type_value
			set type_value to "" -- why check it at all?
			set theArchive to create record with {name:theTitle, URL:indexHtml_url, type:html} in targetlocation --URL:originURL, path:indexHtml_url, comment:commenttext
			
			log "==== record created in DEVONthink"
			log "indexHtml_url: " & indexHtml_url
			-- Fill it with content etc.
			set data of theArchive to download web archive from indexHtml_url
			set path of theArchive to indexHtml_url
			set URL of theArchive to originURL
			set comment of theArchive to commenttext
			
			--set creation date = timestamp...
			set this_day to get texts 7 thru 8 of timestamp as integer
			set this_month to get texts 5 thru 6 of timestamp as integer
			set this_year to get texts 1 thru 4 of timestamp as integer
			set this_hour to get texts 9 thru 10 of timestamp as integer
			set this_minute to get texts 11 thru 12 of timestamp as integer
			set this_second to get texts 13 thru 14 of timestamp as integer
			
			set myDate to current date
			set day of myDate to 1 -- avoid month overflow while the fields are being set
			set year of myDate to this_year
			set month of myDate to this_month
			set day of myDate to this_day
			set time of myDate to (this_hour * 3600) + (this_minute * 60) + this_second -- seconds since midnight
			set the date of theArchive to myDate
			
			
		on error errText number errNum
			log "###ERROR in create_new_rec_webarchive subroutine: " & indexHtml_url
			log errText & ", " & errNum
		end try
		--end timeout
		--activate
	end tell
end create_new_rec_webarchive


on get_filecontent_as_string(fileRef)
	try
		set f to open for access fileRef
		set str to read f as string
		close access f
		return str
	on error errText number errNum
		log "###ERROR: " & " (get_filecontent_as_string)"
		log errText & ", " & errNum
		return ""
	end try
end get_filecontent_as_string