help me improve script performance on 50,000+ items

Hi,

I am trying to use an AppleScript I’ve cobbled together to deal with a mass of data. I have a bunch of blog posts that I’m trying to make sense of. They were all downloaded at once, so the creation date is incorrect, and the tag and author data is embedded in the HTML, where I can’t easily act on it in DT. Basically, the script:

  1. looks at each item in the selection (I am selecting only HTML files that I’ve found with a smart group)
  2. renames the file with the Web Page Title (code for this section lifted from the DT Add-on script)
  3. then combs through the HTML and:
    a) extracts the date embedded in the HTML, formats it as an AppleScript date, then changes the date of the DT Item
    b) extracts the tags embedded in the HTML, formats it as an AppleScript list, then adds the tags to the DT item
    c) extracts the author name embedded in the HTML, then adds it as a comment to the DT item.
    d) changes the label so I know the file has been processed.

There are two interrelated problems:

Problem #1: The script works fine on one or a few items, but when I try to process all 50,000+ files it bogs the entire computer down. DT runs at 99.5% of CPU, and each file takes 45 seconds or more to process. At that rate (50,000 × 45 seconds), processing all the files would take nearly a month of continuous running.

Problem #2: I am very inexperienced with AppleScript, so I’m sure a big part of the problem lies therein.

I’m guessing that the computer is bogging down simply on the list of 50,000+ DT items. AppleScript wizards: is there a way to make the script so that it gets a smaller chunk (say 20) of the DT items, processes them, then goes back to get another small chunk, until it’s done with all the items in the list?
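
A rough sketch of that chunking idea, in plain AppleScript, might look like the following. The names `batchSize`, `processBatch`, and `runInBatches` are hypothetical, and a list of integers stands in for the DT selection. Whether working in chunks actually eases the slowdown is untested.

```applescript
-- Sketch only: batchSize, processBatch and runInBatches are hypothetical names,
-- not part of DEVONthink's scripting dictionary.
property batchSize : 20

on processBatch(theBatch)
	-- Placeholder: the per-item work (rename, date, tags, comment) would go here.
	return count of theBatch
end processBatch

on runInBatches(allItems)
	set totalCount to count of allItems
	set processedCount to 0
	set startIndex to 1
	repeat while startIndex <= totalCount
		set endIndex to startIndex + batchSize - 1
		if endIndex > totalCount then set endIndex to totalCount
		-- Copy only this slice out of the big list.
		set theBatch to items startIndex thru endIndex of allItems
		set processedCount to processedCount + (processBatch(theBatch))
		set startIndex to endIndex + 1
	end repeat
	return processedCount
end runInBatches

-- Stand-in data; in the real script this would be the DEVONthink selection.
set sampleItems to {}
repeat with i from 1 to 45
	set end of sampleItems to i
end repeat
runInBatches(sampleItems)
```

With the 45 stand-in items, the loop hands `processBatch` two slices of 20 and one of 5.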

Problem #2.5: What if there’s an HTML file in there somewhere that can’t be processed according to the rules? I’m not sure I’m handling the error correctly. I want the script to just skip that file and move on to the next one.
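
The skip-and-continue behaviour described here can be sketched in plain AppleScript with a `try` block inside the loop. This is a minimal sketch, not DEVONthink-specific: `extractBetween` is a hypothetical helper, and the sample strings are made up to stand in for page sources.

```applescript
-- Sketch only: extractBetween is a hypothetical helper, and the sample strings are made up.
on extractBetween(theText, startMarker, endMarker)
	set savedDelimiters to AppleScript's text item delimiters
	try
		set AppleScript's text item delimiters to {startMarker}
		-- Errors here if startMarker is missing (there is no text item 2).
		set afterStart to text item 2 of theText
		set AppleScript's text item delimiters to {endMarker}
		set foundText to text item 1 of afterStart
		set AppleScript's text item delimiters to savedDelimiters
		return foundText
	on error errorMessage number errorNumber
		-- Always restore the delimiters before passing the error along.
		set AppleScript's text item delimiters to savedDelimiters
		error errorMessage number errorNumber
	end try
end extractBetween

set samplePages to {"<b>first</b>", "no markers here", "<b>third</b>"}
set extracted to {}
set skippedCount to 0
repeat with thisPage in samplePages
	try
		set end of extracted to extractBetween(contents of thisPage, "<b>", "</b>")
	on error
		-- This page does not follow the rules: count it and move on to the next one.
		set skippedCount to skippedCount + 1
	end try
end repeat
{extracted, skippedCount}
```

The second sample string has no start marker, so asking for `text item 2` raises an error. The inner handler restores the delimiters and re-raises it, and the loop's own `try` counts the page as skipped and carries on with the third.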

Thanks so much.

Here’s the code:


property datetagbeginning : "<li class=\"time\"><a href=\"#\">"
property datetagend : "</a></li>
   		   				"
property tagsectionbeginning : "<h4>Tags</h4>"
property tagsectionend : "</div>"
property eachtagbegins : "\">"
property eachtagends : "</a>"
property authorsectionbeginning : "<li class=\"author\">"

property authorsectionend : "</a></li>"
property rightbeforeauthorname : "\">
					"


tell application id "com.devon-technologies.thinkpro2"
	try
		set this_selection to the selection
		set this_count to count of this_selection
		if this_count > 0 then
			show progress indicator "Renaming" steps this_count
			repeat with this_item in this_selection
				try
					set this_type to the type of this_item
					set this_source to missing value
					step progress indicator (name of this_item) as string
					if this_type is equal to html or this_type is equal to webarchive then
						set this_source to source of this_item
						set this_title to get title of this_source
						set the name of this_item to this_title
						set originalDelimiters to AppleScript's text item delimiters
						-- Reuse the source fetched above instead of asking DEVONthink for it again
						set theContents to this_source
						set AppleScript's text item delimiters to {datetagbeginning}
						--Split the source into a list of strings at each occurrence of datetagbeginning
						--Ignore the first item, which is just the text before the first occurrence
						set theItem to text item 2 of theContents
						set AppleScript's text item delimiters to {datetagend}
						set postDate to text item 1 of theItem
						set AppleScript's text item delimiters to originalDelimiters
						set theMonth to word 1 of postDate
						set theDate to word 2 of postDate
						set theYear to word 3 of postDate
						set theHour to word 4 of postDate
						set theMinute to texts 1 thru 2 of word 5 of postDate
						set AMPM to texts 3 thru 4 of word 5 of postDate
						set postDateTime to current date
						-- Set the day to 1 first so that changing the month or year cannot overflow (e.g. when run on the 31st)
						set the day of postDateTime to 1
						set the year of postDateTime to theYear
						-- Look up the month name and set the month by its number
						set monthNames to {"January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"}
						repeat with monthIndex from 1 to 12
							if theMonth is equal to item monthIndex of monthNames then
								set the month of postDateTime to monthIndex
								exit repeat
							end if
						end repeat
						set the day of postDateTime to theDate
						-- Convert the 12-hour clock to 24 hours: 12 AM becomes 0, 12 PM stays 12
						set theHourInt to (theHour as integer) mod 12
						if AMPM is not equal to "AM" then set theHourInt to theHourInt + 12
						set the hours of postDateTime to theHourInt
						set the minutes of postDateTime to theMinute
						set the seconds of postDateTime to 0
						postDateTime
						set creation date of this_item to postDateTime
						-- theContents still holds the page source, so there is no need to fetch it again
						set AppleScript's text item delimiters to {tagsectionbeginning}
						--Split the source into a list of strings at each occurrence of tagsectionbeginning
						--Ignore the first item, which is just the text before the first occurrence
						set chunk1 to text item 2 of theContents
						set AppleScript's text item delimiters to {tagsectionend}
						set chunk2 to text item 1 of chunk1
						set AppleScript's text item delimiters to {eachtagbegins}
						set tagList to chunk2
						set theItems to text items 2 thru (count of text items of chunk2) of chunk2
						set serialArray to tags of this_item
						set AppleScript's text item delimiters to {eachtagends}
						repeat with nextItem in theItems
							set serialArray to serialArray & first text item of nextItem
						end repeat
						set tags of this_item to serialArray
						-- theContents still holds the page source, so there is no need to fetch it again
						set AppleScript's text item delimiters to {authorsectionbeginning}
						--Split the source into a list of strings at each occurrence of authorsectionbeginning
						--Ignore the first item, which is just the text before the first occurrence
						set authorchunk1 to text item 2 of theContents
						set AppleScript's text item delimiters to {authorsectionend}
						set authorchunk2 to text item 1 of authorchunk1
						authorchunk2
						set AppleScript's text item delimiters to {rightbeforeauthorname}
						set authorname to text item 2 of authorchunk2
						set comment of this_item to ("[Author: " & authorname & "]")
						set label of this_item to 2
					end if
				on error from obj to newClass
					log {obj, newClass} -- Display from and to info in log window.
				end try
				
			end repeat
			
			-- Reset the text item delimiters to the default once all items are done
			set AppleScript's text item delimiters to ""
			
			hide progress indicator
		end if
	on error error_message number error_number
		hide progress indicator
		if the error_number is not -128 then display alert "DEVONthink Pro" message error_message as warning
	end try
end tell

How many tags does this script usually assign? And how do you execute it? Executing it via DEVONthink Pro’s scripts menu might be faster.

Finally, could you please create a sample (launch /Applications/Utilities/Activity Monitor.app, select DEVONthink Pro in the list of processes, and click the “Sample process” toolbar item while the script is running) and send it to cgrunenberg - at - devon-technologies.com? Thanks!