Pocket to Devonthink

andreas_schmidt · January 5, 2015, 12:30am

Rewritten Jan 7, 2014

Pocket2Devonthink

This script helps to shovel web articles gathered with Pocket (getpocket.com, née ReadItLater) into the Inbox of the currently selected Devonthink database. The records that are created in DT retain some of their Pocket metadata, including title, creation date, and URL. Unlike existing solutions in this forum, the script reads Pocket’s sqlite database, retrieves the necessary data and then creates the records from the files stored deep inside the “Library/Containers/com.readitlater.PocketMac/” folder.

Use cases

You want to bulk-import all the articles stored in/with Pocket into a Devonthink database
You want to import articles that have been added to Pocket after a certain date. (A bit of a hack.)

Usage

Change the properties loop_min and loop_max. Use low numbers first, say, 1 and 10, to check whether all runs nicely on your system. Later, set loop_max to 20000 or whatever so all your articles are imported.
Change the where_condition property
Run the script. I’ve always started it from within the Script Editor, using cmd-r. Then make a coffee, go for a walk or whatever else. The script imports some 20 articles per minute.
Review the log file. It should open automatically and list all the articles that have not been imported. Usually, problems are caused by a combination of return characters in the title field in the Pocket database and the inability of this script to deal with them. You could either alter the database with an sqlite editor (sqlitepro.com is quite nice), import the articles manually, or just not care. I’ve had some 5 non-imports in 1000 articles here that required some fiddling with the database or manual import. To later import individual articles that were omitted in previous runs, change the where_condition property and make sure loop_min is set to 1.
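For illustration, the settings for the two use cases might look like the fragment below. The date filter is an assumption: it presumes that time_added holds Unix seconds (the script hands that field to date -r), and 1420070400 is only an example value (1 Jan 2015, 00:00 UTC).

```applescript
-- Test run: only the first ten articles, no filter
property loop_min : 1
property loop_max : 10
property where_condition : ""

-- Alternatives for where_condition (use one at a time; note the leading space):
-- a single article that was skipped in an earlier run:
--   property where_condition : " WHERE unique_id='13148'"
-- everything added after a given moment (assumes time_added is Unix seconds):
--   property where_condition : " WHERE time_added>1420070400"
```

With a where_condition set, loop_min and loop_max count within the filtered result, which is why loop_min should go back to 1.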

Known issues

As said, the script stumbles when it meets title fields with return characters. (It doesn’t fall, though. Articles are just not imported.)
Images are not imported
The script does not take care of tags or the favorite and archive status.
Slow. Just some twenty articles per minute.
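One possible way around the return-character issue would be to let SQLite strip line breaks from the title before the text ever reaches AppleScript. The following is an untested sketch, not part of the script below: clean_title_sql and build_select are made-up handler names, and the record marker column that the script appends to its query is left out for brevity.

```applescript
-- Sketch only: builds a SELECT in which SQLite replaces line breaks in titles.
-- char(13) = CR, char(10) = LF, char(32) = space. Using char() avoids quote
-- characters, which matters because the script wraps its SQL in single quotes.
on clean_title_sql()
	return "replace(replace(title, char(13), char(32)), char(10), char(32))"
end clean_title_sql

on build_select(sql_where)
	return "select unique_id, item_id, url, " & clean_title_sql() & ", time_added, word_count, mime, offline_text, offline_web from items" & sql_where & ";"
end build_select

-- Example: the statement for a single article
build_select(" WHERE unique_id=13148")
```

The column order stays the same as in get_articles(), so the rest of the import loop would not need to change.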

-- POCKET2DEVONTHINK
-- script imports articles from a local Pocket for Mac app into the inbox of the current Devonthink database
-- script comes with the use-it-like-you-want-to-and-dont-blame-me licence
-- last changed on Jan 5, 2015
-- recent changes: errorhandling; pdfs for pocket records without mime setting; conditional searches

-- USER SETTINGS; please adapt to your needs
--loop_min/max define the range of pocket articles to be imported
-- 1/10 would, e.g., import the first ten articles you've ever stored in Pocket
-- 1 and, say, 100000 would presumably move all of them into DT
property loop_min : 1 --useful to import only a few documents 
property loop_max : 10 -- to DT, e.g. for testing or after errors 
property where_condition : "" --set to "" or something like " WHERE unique_id='13148'" or " WHERE unique_id>'13148'"; make sure loop_min is small enough


-- some variables required in this script
property scriptlastchanged : "05.01.2015 10:00"
property user_agent : "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5"
property sqlite_path : "sqlite3" --"/Applications/Sente65.app/Contents/MacOS/sqlite3"

property strEOR : "<EOR>" & return
property strRecDelim : quote & strEOR & quote
property sFieldDelim : ";; "
property db_path_p : POSIX path of (path to home folder) & "Library/Containers/com.readitlater.PocketMac/Data/Library/Application Support/Pocket/readItLater3.sqlite"
property quoted_db_path_p : quoted form of db_path_p
property offline_path : POSIX path of (path to home folder) & "Library/Containers/com.readitlater.PocketMac/Data/Library/Application Support/Pocket/offline/cache0/RIL_pages/"
property tempfolder_path : POSIX path of (path to home folder) & "Desktop/import_pocket/"
property timestamp : time of (current date) --for logfile
property insta_url_prefix : "http://www.instapaper.com/text?u=http%3A%2F%2F"
property insta_urls_prefix : "http://www.instapaper.com/text?u=https%3A%2F%2F"
property pocket_url_prefix : "http://getpocket.com/a/read/"


## get list of articles from pocket db and turn it into an arrayish list of lists
if where_condition is "" then
	set the_articles_text to get_articles("")
else
	set the_articles_text to get_articles(where_condition)
end if
set the_articles_text to replaceString(the_articles_text, {"\n"}, "\\n") -- avoids 
set the_articles to textToTwoDArray(the_articles_text, character id 13, ";; ")

writelog("Pocket2Devonthink \n --------------- \nScript last changed: " & scriptlastchanged & "\n" & (current date) & "\n\nPocket articles to create in Devonthink: " & loop_min & " - " & loop_max & "\nCondition: " & where_condition, timestamp)

## LOOP through reading list items
set loop_count to 0
set error_count to 0 -- number is written into log file
repeat with this_article in the_articles
	set loop_count to loop_count + 1
	set itemlog to "" -- this var gets written to the logfile at the end of each repeat when errors_raised
	set errors_raised to false --current item gets into the logfile only when an error occurred
	
	log "\n################\n # " & loop_count & "\n"
	
	# MIN MAX Loop	
	if (loop_count ≥ loop_min) and (loop_count ≤ loop_max) then
		set itemlog to "\n##" & loop_count & "\n"
		
		set itemlog to itemlog & "Raw data on this article according to pocket database: "
		set text item delimiters to ";; "
		set itemlog to itemlog & this_article
		set text item delimiters to ", "
		
		# GET METADATA from reading list
		try
			set uid to item 1 of this_article
			set item_id to item 2 of this_article
			set url_orig to item 3 of this_article
			set title to item 4 of this_article
			set time_added_pocket to item 5 of this_article
			set time_added to timestamp2appledate(time_added_pocket) -- to date rli_date
			set word_count to item 6 of this_article
			set mime to item 7 of this_article
			set offline_text to item 8 of this_article
			set offline_web to item 9 of this_article
			
			set itemlog to itemlog & "\n loop_count: " & loop_count & ";;\nuid: " & uid & ";;\nitem_id: " & item_id & ";;\nurl_orig: " & url_orig & ";;\ntitle: " & title & ";;\ntime_added: " & (time_added as string) & ";;\nword_count: " & word_count & ";;\nmime: " & mime & ";;\noffline_text: " & offline_text & ";;\noffline_web: " & offline_web
			log itemlog
			
		on error errormsg
			set itemlog to itemlog & "\n --> Error while analysing the data list for this article. Usually, this is caused by a return character in the title field. Please add this article manually to DT. \n" & errormsg
			set errors_raised to true
			set error_count to error_count + 1
		end try
		
		
		
		
		# BUILD IMPORT URLs (they might be used or not further down in this script)
		(*
		# Instapaper
		set urlshort to remove_http(url_orig) -- instapaper needs http://, https://, ftp:// removed from url
		if characters 1 through 6 of url_orig as string = "https:" then
			set insta_url to (insta_urls_prefix & urlshort)
		else
			set insta_url to (insta_url_prefix & urlshort)
		end if
		# Pocket
		set pocket_url to pocket_url_prefix & item_id
		*)
		
		
		# COPY FILE to temporary folder on Desktop
		set this_offlinefolder_path to offline_path & uid & "/"
		set this_tempfolder_path to tempfolder_path & uid & "/"
		
		try
			set has_local_file to true
			do shell script "ditto " & (quoted form of this_offlinefolder_path) & " " & (quoted form of this_tempfolder_path)
			--should result in something like this: "ditto '/Users/me/Library/Containers/com.readitlater.PocketMac/Data/Library/Application Support/Pocket/offline/cache0/RIL_pages/10001/' '/Users/me/Desktop/test_pocket2/10001/'"
			-- /Users/me/Library/Containers/com.readitlater.PocketMac/Data/Library/… appears to be inaccessible via applescript
		on error errormsg
			set has_local_file to false
			set itemlog to itemlog & "\n --> apparently no local copy\n" & errormsg
			set errors_raised to true
			set error_count to error_count + 1
		end try
		
		
		# CREATE RECORD in DEVONthink
		set result_record to null
		tell application id "DNtp"
			try
				set location_target to incoming group of current database
				
				## Using Pocket's offline copies located in
				-- Library/Containers/com.readitlater.PocketMac/Data/Library/Application Support/Pocket/offline/cache0/RIL_pages/
				-- slightly different record creation depending on file type and existence of offline copies
				-- would have been easier to just import text.html, web.html, web.pdf - whichever exists
				if mime = "application/pdf" or url_orig ends with ".pdf" then -- doesn't catch those pdfs w/o mime and url without "pdf" in it
					set local_url to "file://" & this_tempfolder_path & "web.pdf"
					set itemlog to itemlog & "\n" & local_url
					set result_record to create PDF document from local_url in location_target
				else if offline_text = "1" then
					set local_url to "file://" & (POSIX path of this_tempfolder_path) & "text.html"
					set itemlog to itemlog & "\n" & local_url
					set result_record to create formatted note from local_url in location_target
				else if offline_web = "1" then
					set local_url to "file://" & this_tempfolder_path & "web.html"
					set local_path to this_tempfolder_path & "web.html"
					set itemlog to itemlog & "\n" & local_url
					set rec1 to import local_path to location_target
					set rec2 to convert record rec1 to rich -- DT doesn't allow setting the URL here (or rather: it doesn't show up in the address line)
					delete record rec1
					set result_record to rec2
				else if not has_local_file then
					set result_record to create record with {URL:url_orig, type:bookmark} in location_target
					--else
					--set result_record to create record with {name:"error", plain text:"Something went wrong with this record in the if-mime-then operation\n\n" & itemlog, type:text} in location_target
				end if
			on error errormsg
				set itemlog to itemlog & "\n --> Something went wrong while creating this record in Devonthink\n " & errormsg
				--writelog(itemlog, timestamp)
				try
					set result_record to create record with {name:"error", plain text:itemlog, type:text} in location_target
				end try
				set errors_raised to true
				set error_count to error_count + 1
			end try
			
			# TEST whether record was created
			try
				set record_created to true
				name of result_record -- raises an error if class type is missing value
			on error errormsg
				set record_created to false
				set itemlog to itemlog & "\n --> No record created in Devonthink\n"
				set errors_raised to true
				set error_count to error_count + 1
			end try
			
			
			# SET METADATA
			if record_created then
				try
					set this_record to result_record
					tell this_record
						set name to (title)
						set the creation date to time_added
						set URL to url_orig
						set comment to "unique_id::" & uid & "\nitem_id::" & item_id & "\nloop_count::" & loop_count
					end tell
				end try
			end if
			
			(*
			-- local copy looks best in 8 of 10 cases; hence I've commented this out
			## Instapaper - create record by downloading via http://www.instapaper.com/text?u=http(s)%3A%2F%2F
			set rec_insta to create formatted note from insta_url in location_target
			tell rec_insta
				set name to (title & " // insta")
				set the creation date to time_added
				set URL to url_orig
				set comment to "unique_id::" & uid & "\nitem_id::" & item_id & "\nloop_count::" & loop_count
			end tell
			
			## Pocket - create record by downloading via getpocket.com/a/read/[item_id]
			-- looks best, but unreliable unless invoked in browser on Pocket's website 
			set rec_pocket to create formatted note from pocket_url in location_target
			tell rec_pocket
				set name to (title & " // pocket")
				set the creation date to time_added
				set URL to url_orig
				set comment to "unique_id::" & uid & "\nitem_id::" & item_id & "\nloop_count::" & loop_count
			end tell
			*)
			
			
			
		end tell
		if errors_raised then writelog(itemlog, timestamp)
		log itemlog
	else if (loop_count > loop_max) then
		exit repeat
	end if
	
end repeat

writelog(("\n" & error_count & " error(s) occurred. (" & (current date) & "). Loop_count: " & loop_count), timestamp)
set logfile to (path to desktop as string) & "Pocket2Devonthink_Log_" & timestamp & ".txt"
set y to POSIX path of logfile
--do shell script "open " & ((path to desktop) as text) & "Pocket2Devonthink_Log_" & timestamp & ".txt"
do shell script "open " & quoted form of y

## SOME FUNCTIONS

## get_articles()
## searches the reference table of pocket db; where-string can be empty; returns list of articles
-- 1. unique_id, 2. item_id, 3. url, 4. title, 5. time_added, 
-- 6. word_count, 7. mime (required to id pdfs via "application/pdf"), 8. offline_text (has text.html), 9. offline_web (has web.html)
-- unique_id is used in file system as well
on get_articles(sql_where)
	log ">>>> GetPocketReferences"
	set sCommand to sqlite_path & " -separator ';; ' " & quoted_db_path_p & " 'select unique_id, item_id, url, title, time_added, word_count, mime, offline_text, offline_web,\"<EOR>\r\"\n\tfrom items" & sql_where & ";'"
	set sResult to (do shell script sCommand)
	(*set AppleScript's text item delimiters to {strEOR}
	set lstResults to paragraphs of sResult
	set AppleScript's text item delimiters to return
	--log ">>get_articles returns: " & return & "\t" & lstResults
	set AppleScript's text item delimiters to ""
	return lstResults *)
	return sResult
end get_articles


on textToTwoDArray(theText, mainDelimiter, secondaryDelimiter)
	set {tids, text item delimiters} to {text item delimiters, mainDelimiter}
	set firstArray to text items of theText
	set text item delimiters to secondaryDelimiter
	set twoDArray to {}
	repeat with anItem in firstArray
		set end of twoDArray to text items of anItem
	end repeat
	set text item delimiters to tids
	return twoDArray
end textToTwoDArray

on timestamp2appledate(timestamp)
	set h to do shell script "date -r " & timestamp & " \"+%Y %m %d %H %M %S\""
	set mydate to current date
	set day of mydate to 1 -- avoids month rollover when today is the 29th to 31st
	set year of mydate to (word 1 of h as integer)
	set month of mydate to (word 2 of h as integer)
	set day of mydate to (word 3 of h as integer)
	set hours of mydate to (word 4 of h as integer)
	set minutes of mydate to (word 5 of h as integer)
	set seconds of mydate to (word 6 of h as integer)
	return mydate
end timestamp2appledate

on remove_http(url1)
	try
		set n to count of url1
		if characters 1 through 6 of url1 as string = "https:" then
			set url2 to characters 9 thru n of url1 as string
		else if characters 1 through 5 of url1 as string = "http:" then
			set url2 to characters 8 thru n of url1 as string
		else
			log "url1: " & url1
			log characters 1 through 4 of url1 as string
		end if
		log url2
		return url2
	on error
		return ""
	end try
end remove_http

on replaceString(theText, oldString, newString)
	set AppleScript's text item delimiters to oldString
	set tempList to every text item of theText
	set AppleScript's text item delimiters to newString
	set theText to the tempList as string
	set AppleScript's text item delimiters to ""
	return theText
end replaceString

on writelog(this_message, timestamp)
	set the log_file to ((path to desktop) as text) & "Pocket2Devonthink_Log_" & timestamp & ".txt"
	try
		open for access file the log_file with write permission
		write (this_message & return) to file the log_file starting at eof
		close access file the log_file
	on error
		try
			close access file the log_file
		end try
	end try
end writelog

Archiv.zip (29.5 KB)

andreas_schmidt · January 7, 2015, 11:17am

I’d be happy to get some comments by fellow coders in this forum on the following (and other) issues:

Return character problem: get_articles() returns the db as text, and textToTwoDArray splits it up into a list of lists of article data. Both need some revision to convert returns early in the script instead of dealing with their effects later.
Images: Pocket doesn’t create what is called ‘formatted notes’ in DT. Such html files save images inline as streams of letters. In Pocket, however, html text and images are saved in different locations in the file system. Unfortunately, Pocket’s html files don’t refer to images by their local file system address, but by placeholder div tags (the RIL_IMG divs) that carry only an image number, such as 2. The file system address for that no. “2” can be looked up in the item_images table of the Pocket db: item_images.unique_id needs to match items.unique_id and item_images.image_id is 2. Pocket then inserts image, caption, and cite tags into that div tag. Hence, to import images as well, the script would have to parse text.html/web.html, look for RIL_IMG divs, query the Pocket database, copy the image files out of the sandbox territory, and insert proper image tags into the text.html/web.html files.
Not-so-well-formed pseudo HTML documents: The documents imported may have an html suffix and look like html files in the browser, but the documents are anything but well-formed. Omitting proper html or body tags, they begin with several div tags.
Sandboxing? I couldn’t get Applescript/Devonthink to import directly from folders in /Library/Containers/com.readitlater.PocketMac/……/. I assume this has been caused by Sandboxing, but what do I know.
Auto-import: It would be nice to have an enhanced version of the script import articles whenever they are added to the database or the filesystem.
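The database lookup for one image placeholder could start from something like the sketch below. Only the table name item_images and its unique_id and image_id fields are taken from the description above. The handler name image_lookup_command, the placeholder database path, and the use of select * (because the column that holds the file path is not named in this thread) are assumptions.

```applescript
-- Sketch only: returns the shell command that asks Pocket's database for the
-- item_images row that belongs to one RIL_IMG placeholder.
on image_lookup_command(db_path, uid, image_id)
	-- coercing to integer keeps stray text out of the SQL statement
	set sql to "select * from item_images where unique_id=" & (uid as integer) & " and image_id=" & (image_id as integer) & ";"
	return "sqlite3 " & (quoted form of db_path) & " " & (quoted form of sql)
end image_lookup_command

-- Example: image no. 2 of article 12301
set cmd to image_lookup_command("/path/to/readItLater3.sqlite", "12301", "2")
-- do shell script cmd -- uncomment to run against the real database
```

The returned row would still have to be mapped to a file below RIL_assets and copied out with ditto, as the script already does for the article folders.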

Some Technical details on Pocket

Folder RIL_pages
folder that contains folders for article’s html/pdf files: Library/Containers/com.readitlater.PocketMac/Data/Library/Application Support/Pocket/offline/cache0/RIL_pages/
Each article has its own folder, named with its unique_id (e.g. 12301)

Folder RIL_assets
Library/Containers/com.readitlater.PocketMac/Data/Library/Application Support/Pocket/offline/cache0/RIL_assets/
Images are not organised by article, but by domain/path/subpath. E.g.,

Pocket’s sqlite database
Library/Containers/com.readitlater.PocketMac/Data/Library/Application Support/Pocket/readItLater3.sqlite.
* Table items has all the articles, unique_id field is used in the file system to store html/pdf files.
* Table item_images contains a list of references to images saved in the file system.
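
To see which columns these two tables actually offer, the sqlite3 command-line tool can print their definitions. A small sketch, assuming the sqlite3 binary is on the path as in the script above; schema_command is a made-up helper name.

```applescript
-- Sketch only: builds a shell command that prints the CREATE statement of one table.
on schema_command(db_path, table_name)
	return "sqlite3 " & (quoted form of db_path) & " " & (quoted form of (".schema " & table_name))
end schema_command

-- Same database location as the db_path_p property in the script
set db_path to POSIX path of (path to home folder) & "Library/Containers/com.readitlater.PocketMac/Data/Library/Application Support/Pocket/readItLater3.sqlite"
do shell script schema_command(db_path, "items")
do shell script schema_command(db_path, "item_images")
```

Run from Script Editor, the result pane shows the output of the last command only; log the first result if both are needed.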

Geo · April 27, 2021, 4:46pm

Andreas,

I’ve just finished importing a bunch of notes to DT with your script. I wanted to say thank you. 6 years later and it still worked like a charm.

I did have a question, tho recognise it’s been ages. The script seems to have imported .weblocs into DT and created full .html copies on my desktop. Is there a way to merge these or does the script do this already? The .weblocs copies in DT have a tonne of structured information I’d like to retain but I’d rather the “clutter-free” & offline body text of the .html copies paired with it.

Anyhow, thank you again. I’m kicking the tires with DT, have a huge library of Pocket notes and it’s been great to import them in and have something meaningful to work with.

— Geoff

andreas_schmidt · May 9, 2021, 3:39pm

Remarkable continuity of Devonthink’s API and Pocket’s under-the-hood database and file system structure. When I wrote that script a while ago, DT didn’t have that “remove clutter” feature (or whatever it is called) and couldn’t convert feeds into persistent markdown/notes/rich text records. Also, I had a ton of articles from websites that were no longer available and I didn’t want to throw away. Hence I tried to import that local data already downloaded by Pocket.

While I still run that script occasionally, I think the in-built features of Pocket and DT allow you to keep searchable copies of Pocket articles in DT: Enable feeds in Pocket (Settings → Privacy/Data security or so → enable RSS feed); add a new feed with that URL in Devonthink; command-click on that feed icon → Information → Format to markdown/rtf/pdf etc.

Pocket is still quite useful when I need to use other operating systems. In addition, it declutters slightly more nicely than DT (e.g., for whatever reason, DT usually removes the author’s name) and has an okayish text-to-speech feature.

Geo · May 10, 2021, 7:56am

Andreas,

First off, thank you again. The script works really, really well and is very appreciated.

I’m still developing my DT workflow. I quite like using Pocket adjacent to DT for all the reasons you listed (Pocket also manages GDPR prompts and logins better, tho my testing is hardly conclusive).

Now that I’ve successfully imported everything, that RSS workflow seems like a good way to stay ‘live’.

Thanks again.