Create concordance for a selection of records

korm · June 13, 2014, 6:05pm

This script creates a concordance (a list of the individual words found in a document) for each record in a selection of records.

Use notes:

Select one or more documents – the script does not operate on a group. Document selection can be discontiguous.
The script performs no spell check or reasonableness check on the resulting concordance. A “word” is simply a string of characters delimited by spaces from other words. You almost certainly will get results that look like nonsense mixed with rationality.
Optional: change the property theSeparator to whatever character you want to use to separate items in the report that the script produces. Default is semicolon. If you want carriage returns use:


property theSeparator : return

Optional: change the property sortSwitches to control how the list is sorted. The default (-f) sorts alphanumerically, ignoring case. For the available switches, refer to the manual for sort (open Terminal and type “man sort”, without the quotes). RTFM.
The standard DEVONthink “get concordance” command has optional parameters to sort the concordance by either “frequency” or “weight”. I am not using these parameters here, but the script could easily be adjusted to comment out the sort routine and add the in-built parameters instead (adjust the “get concordance…” statement).
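As a rough illustration of that last note, the loop could ask DEVONthink for an already-sorted concordance instead of piping the list through sort. This is only a sketch: the spelling "sorted by frequency" is an assumption based on the optional parameter described above, so check it against the scripting dictionary of your DEVONthink version before relying on it.

```applescript
tell application id "DNtp"
	set these_items to the selection
	repeat with this_item in these_items
		-- Assumption: the optional parameter is written "sorted by";
		-- substitute "weight" for "frequency" to get the other ordering.
		set thisRecordConcordance to (get concordance of record this_item sorted by frequency)
		-- The list is already ordered here, so the "sort the concordance"
		-- step (the do shell script call) would be commented out.
	end repeat
end tell
```

With this variant the sortSwitches property is no longer used; the rest of the report-building code can stay as it is.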

(*
	Assemble concordance using a selection of records
	Version 2
	
	settings:
	property theSeparator -- set this property to the character you wish to use to separate individual items in the report
	
	property sortSwitches -- set this property to instruct "sort" how to sort the list (see "man sort" in Terminal for instructions)
*)

property theSeparator : "; "
property sortSwitches : "-f"

tell application id "DNtp"
	try
		set these_items to the selection
		if these_items is {} then error "Please select some records."
		
		set theReport to ""
		set theCnt to 0
		
		repeat with this_item in these_items
			-- get my concordance
			
			set thisRecordConcordance to (get concordance of record this_item)
			
			-- sort the concordance			
			set {od, AppleScript's text item delimiters} to {AppleScript's text item delimiters, linefeed}
			set sortMe to thisRecordConcordance as string
			set sortedMe to do shell script "echo " & (quoted form of sortMe) & " | sort " & sortSwitches
			set thisRecordConcordance to (paragraphs of sortedMe)
			set AppleScript's text item delimiters to od
			
			-- create report or append to report			
			set {od, AppleScript's text item delimiters} to {AppleScript's text item delimiters, theSeparator}
			set thisRecordConcordance to thisRecordConcordance as string
			set AppleScript's text item delimiters to od
			
			set theReport to theReport & (the name of this_item) & ":" & return & thisRecordConcordance & return & return
			set theCnt to theCnt + 1
		end repeat
		
		if (theCnt > 0) then
			set theName to display name editor "Confirm the name" default answer ((theCnt as string) & " records - concordance") info "Edit the name of the report, if you wish"
			create record with {name:theName, type:txt, plain text:theReport} in current group
		end if
		
	on error error_message number error_number
		if the error_number is not -128 then display alert "DEVONthink Pro" message error_message as warning
	end try
end tell

Allsop · June 14, 2014, 4:47am

Thanks Korm

joost · July 24, 2014, 4:50am

@korm: I’ve been playing around with the script you provided (thanks btw) and added functions that remove junk and other uninteresting character sequences. I plan to add a “remove common words” method also.

For my purposes, the get concordance by frequency is most interesting and I have adapted your script accordingly. However, I would really like to get the actual word count of each word. It is unclear if I can actually get to the raw text of an article (haven’t researched the dictionary thoroughly enough yet) and before I do a bunch of coding, is there any way I can get this word count from DT directly?

korm · July 24, 2014, 9:38am

I believe you would have to abandon the “get concordance” verb and roll your own. The “plain text” property of each record has the raw data you want. You could create your own subroutine(s) that do the following:

for a record
- accumulate the words for that record
- eliminate common or spurious words
repeat
with the accumulated words list
- sort the list
- for each word on the list
-- count the frequency of that word
-- store the word and its frequency in a list of {word, frequency} pairs
- repeat with next word
end
write the report
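One possible way to sketch that outline, reusing the shell approach from the original script: rather than looping over the words in AppleScript, the text is split into one word per line and counted with sort and uniq -c. The handler name wordFrequencies, the property stopWordPattern, and the report name are made up for this illustration, and lowercasing through tr may only affect ASCII letters.

```applescript
-- Hypothetical setting: an extended regular expression of words to leave out.
property stopWordPattern : "the|and|of|a|to|in"

-- Returns a list of lines such as "   3 word", most frequent first.
on wordFrequencies(theText)
	set cmd to "printf '%s' " & quoted form of theText
	set cmd to cmd & " | tr -s '[:space:]' '\\n'" -- one word per line
	set cmd to cmd & " | tr '[:upper:]' '[:lower:]'" -- fold case
	set cmd to cmd & " | sed '/^$/d'" -- drop empty lines
	set cmd to cmd & " | grep -v -x -E " & quoted form of stopWordPattern -- drop common words
	set cmd to cmd & " | sort | uniq -c | sort -rn" -- count, then order by count
	return paragraphs of (do shell script cmd)
end wordFrequencies

tell application id "DNtp"
	try
		set these_items to the selection
		if these_items is {} then error "Please select some records."
		set theReport to ""
		repeat with this_item in these_items
			-- the plain text property holds the raw text of the record
			set theCounts to my wordFrequencies(plain text of this_item)
			set {od, AppleScript's text item delimiters} to {AppleScript's text item delimiters, return}
			set theCounts to theCounts as string
			set AppleScript's text item delimiters to od
			set theReport to theReport & (the name of this_item) & ":" & return & theCounts & return & return
		end repeat
		create record with {name:"Word frequencies", type:txt, plain text:theReport} in current group
	on error error_message number error_number
		if the error_number is not -128 then display alert "DEVONthink Pro" message error_message as warning
	end try
end tell
```

Each report line keeps the uniq -c layout (count, then word). Because the text is passed on the command line, very long documents may exceed the shell's argument length limit, as with the original sort call.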

Here’s a similar concept.