scripting import of listserv archive

We have years of listserv postings archived by month in big text files. Individual messages are separated by “--------------” though not of consistent length. Has anyone created a script that will take such a file and split it up into separate messages as it’s imported into DevonThink? I didn’t see anything like that under the Scripts menu.

TIA.

I would settle for a script that worked after the large archive file was imported into DTP, worked its way down through the text and split the individual messages out into separate records in DTP. Has anyone done something like that?

Here’s a similar script but you will probably have to adjust the text item delimiters:


-- Split InfoSelect topic
-- Created by Eric Böhnisch-Volkmann
-- Copyright (c) 2005. All rights reserved.

using terms from application "DEVONthink Pro"
	tell application "DEVONthink Pro"
		activate
		try
			set this_selection to the selection
			if this_selection is {} then error "Please select some contents."
			my split(this_selection)
		on error error_message number error_number
			if the error_number is not -128 then
				display dialog error_message buttons {"OK"} default button 1
			end if
		end try
	end tell
	
	on split(these_childs)
		local this_child, oldDelimiter
		
		tell application "DEVONthink Pro"
			-- Remember the current delimiters so they can be restored afterwards
			set oldDelimiter to AppleScript's text item delimiters
			-- Message separator: adjust this string to match your archive
			set AppleScript's text item delimiters to "--" & (ASCII character 10)
			
			repeat with this_child in these_childs
				set this_text to plain text of this_child
				if (exists parent 1 of this_child) then
					set this_group to parent 1 of this_child
				else
					set this_group to missing value
				end if
				repeat with this_element in every text item of this_text
					set theName to paragraph 1 of this_element
					create record with {name:theName, type:txt, plain text:this_element} in this_group
				end repeat
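			end repeat
			-- Restore the delimiters saved at the start of the handler
			set AppleScript's text item delimiters to oldDelimiter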
		end tell
	end split
end using terms from

Christian–

I believe I can make that script work, thanks.

–Steve

It worked perfectly, with a little trial and error on the delimiters and a bit of post-cleanup.

–Steve

Follow-up:

For a 0.5 MB archive with 700+ messages to split out, it takes hours to process on my 867 G4PB. Like, 6+ hours (hard to tell when I’m also keeping the PB busy with other stuff). And DT (1.0.2) is locked up while the script runs.

But it works, thanks again.

–Steve

Did you run the script via Apple’s Script Editor? That’s always VERY slow; running it via DEVONthink Pro’s Script menu can be 10-100 times faster.

Well, I saved it as a script where the other menu items were, then I refreshed the script menu and selected it from the script menu. I think I still had the script open in the script editor–would that make a difference? Should I have saved it as a compiled application? I think I tried that but it didn’t show up on the menu, IIRC.

–Steve

No, that shouldn’t make a difference.

Each archive file that I import into DT is 600-1000K and contains maybe 300-400 messages that need to be split out into separate pieces.

Question: Is the sheer volume of the task what takes a long time (basically the script says to find the next occurrence of “[CR][CR]Date:”, split the archive into two pieces, make a new DT entry out of the first piece, repeat the procedure on what’s left), or is there something else consuming CPU cycles, such as DT stopping to rebuild the database index as each new piece is created? While the script runs, DT is totally locked up with a spinning beach ball.

The text handling of AppleScript is probably way too slow to be useful for such amounts of data. You could run the script via the global script menu extra and then DT might still be responsive (but I didn’t try this).

Probably it’s better to split the files first into multiple files via the Unix command “csplit” (see “man csplit” in the terminal) and to import the results afterwards.
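If csplit turns out to be awkward for separators of varying length, the same pre-splitting step could be sketched in Python. This is only a rough sketch, not something from this thread: it assumes a separator is a line consisting of five or more dashes and nothing else, and the names split_archive and write_messages are made up for illustration. The resulting files can then be imported into DT in the usual way.

```python
import re
from pathlib import Path

# Assumption: a separator is a line made up only of five or more dashes
# (optionally followed by spaces or tabs). Adjust to match your archive.
SEPARATOR = re.compile(r"^-{5,}[ \t]*$", re.MULTILINE)


def split_archive(text):
    """Return the non-empty messages found between separator lines."""
    pieces = SEPARATOR.split(text)
    return [piece.strip() for piece in pieces if piece.strip()]


def write_messages(archive_path, out_dir):
    """Write each message of one monthly archive to its own numbered file."""
    archive_path = Path(archive_path)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    text = archive_path.read_text(encoding="utf-8", errors="replace")
    messages = split_archive(text)
    for index, message in enumerate(messages, start=1):
        target = out_dir / f"{archive_path.stem}-{index:04d}.txt"
        target.write_text(message + "\n", encoding="utf-8")
    return len(messages)
```

Calling write_messages("2005-03.txt", "2005-03-split") (both names are placeholders) would leave one text file per message in the output folder. Because the whole archive is split in a single pass, the work grows with the size of the file rather than with the number of messages times the size of the file.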

OK, I’ll check that out. I suspected it was AppleScript that was the bottleneck.

Just thought I’d mention that I’ve been moving over tens of thousands of Yahoo Group messages to DevonThink. First, I run the little Perl utility yahoo2mbox to get the messages into a text file. Then, I break it into manageable chunks and change the file extensions to mbox. I drag and drop them into Entourage and then run the Entourage to DT Pro script I found on this forum. All of this seems to work pretty well. It takes a while, but I come out with a very nice and wonderfully searchable archive. I suppose someone could figure out how to smooth this process out if they wanted to!

Report back–

The script works best if run as an application from the Finder. DT doesn’t display the spinning beachball.

It also works best if I split each 1MB+ archive into 4 separate entries before processing. Seems to go faster.
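For anyone who wants to do that four-way split without cutting a message in half, here is a rough Python sketch. It assumes the archive has already been broken into a list of message strings by some other means, and chunk_messages is a made-up name, not part of DT or of the script above.

```python
def chunk_messages(messages, parts=4):
    """Group a list of messages into at most `parts` consecutive chunks."""
    if parts < 1:
        raise ValueError("parts must be at least 1")
    # Ceiling division: the number of messages per chunk
    size = -(-len(messages) // parts)
    if size == 0:
        return []
    return [messages[i:i + size] for i in range(0, len(messages), size)]
```

Each chunk can then be joined back together with the archive's separator line and saved as its own smaller file before importing.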

It also works best if I don’t leave other apps running. The compiled script will grab as much CPU time as it can get, at least it seems that way to me.

–Steve