"get concordance of" question

This is a question for developers and/or users who might have stumbled across an answer.

The current “get concordance of” grabs the concordance of the record with the words sorted by weight.

Is it possible to somehow re-sort those words by frequency?

I’m imagining either a secondary sorting command or else a new version of “get conc…” that uses frequency instead of weight.

My reason is that many of the PDFs I am working with are not OCR’d well, and word weights are simply not effective for them.

Any help would be great.

The AppleScript interface only seems to expose the weighted list, but with a few basic shell commands you may be able to put together a workaround along these lines (it may need a bit of debugging):

-- DEVONthink Pro
tell application id "DNtp"
	set oRec to content record of front window
	set strText to ((plain text of oRec) as string)
	-- split on whitespace with tr (xargs -n1 chokes on quote characters in the text),
	-- lowercase, strip basic punctuation, drop blank lines, then count and
	-- sort by frequency, highest first
	set strSorted to (do shell script "echo " & quoted form of strText & " | tr A-Z a-z | tr -s '[:space:]' '\\n' | sed -e 's/[.,-]//g' | grep -v '^$' | sort | uniq -c | sort -nr")
end tell
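The shell half of that workaround can be tried on its own in Terminal before wiring it into AppleScript. Here is a sketch with a made-up sample sentence (using `tr` to split words, since `xargs -n1` trips over quote characters in real text):

```shell
# Count word frequencies in a sample text, most frequent first.
# tr -s collapses runs of whitespace into single newlines,
# sed strips basic punctuation, grep drops any empty lines,
# uniq -c counts the sorted words, sort -nr ranks by count.
echo "The cat and the dog chased the cat" \
  | tr 'A-Z' 'a-z' \
  | tr -s '[:space:]' '\n' \
  | sed 's/[.,-]//g' \
  | grep -v '^$' \
  | sort \
  | uniq -c \
  | sort -nr
```

The first line of output is the most frequent word with its count (here, "the" with a count of 3).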

Perhaps more usefully, you can, of course, also select the whole Words panel (with its frequency, length, weight and word columns) and then cut and paste to a spreadsheet, where you can sort the data by whichever column you need.

As houthakker has pointed out, there are other ways to accomplish what I asked, which is why I should explain a little more fully what I hope to accomplish.

By using the concordance of DT and a script I was given by cgrunenberg, I hope to send the top X-number of words to DevonAgent in order to scour the web for related pages, documents, etc. Doing this manually provides extraordinary results, but automating the process with a script would be much more useful in my line of research. Unfortunately, I’m not nearly as code-savvy as I should be.

What this allows me to do is combine the PDFs from hundreds of articles, pick out the most frequently occurring words (with the concordance) and use DA to build an even larger corpus of material. Perhaps a complex way to use textual data to mine the internet for more, but DT and DA provide very thorough web-crawling for someone who cannot code bots or spiders.

The code sent to me by cgrunenberg is here, and you can see what it does:

property pMaxWords : 6

tell application id "com.devon-technologies.thinkpro2"
	try
		if not (exists content record) then error "Please select a document."
		set theRecord to content record
		set theWords to get concordance of record theRecord
		set theCount to (count of theWords)
		if theCount is greater than 0 then
			if theCount is greater than pMaxWords then set theWords to items 1 thru pMaxWords of theWords
			set {od, AppleScript's text item delimiters} to {AppleScript's text item delimiters, " "}
			set theQuery to theWords as string
			set AppleScript's text item delimiters to od
		else
			set theQuery to name of theRecord
		end if
		set theQuery to do JavaScript "encodeURIComponent('" & theQuery & "')" in think window 1
		set theurl to "http://www.google.com/search?as_q=" & theQuery & "&num=10&ie=UTF-8&start=0&filter=0"
		-- open tab for URL theurl
		tell application "DEVONagent"
			open URL theurl
		end tell
	on error error_message number error_number
		if the error_number is not -128 then display alert "DEVONthink Pro" message error_message as warning
	end try
end tell

Good application …

You feel confident that a frequency count will work better?

You might need something like a perl regex to act as a kind of junk filter, discarding all the uninteresting words (the|and|that|but|not|have|etc|etc) that will tend to cluster at the top …
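A simple stopword filter like that can be spliced into the shell pipeline with `grep` rather than perl. The word list below is a minimal made-up one; a real one would be much longer:

```shell
# Drop common stopwords before counting.
# -v inverts the match, -i ignores case, -w matches whole words,
# -E enables the (a|b|c) alternation syntax.
echo "the digital humanities and the digital archive" \
  | tr -s '[:space:]' '\n' \
  | grep -viwE '^(the|and|that|but|not|have|of|to|in)$' \
  | sort | uniq -c | sort -nr
```

With the sample input above, "digital" comes out on top with a count of 2, and the stopwords never reach the count at all.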

Thankfully, the concordance in DT can be set to filter out words under a certain length. For instance, my test PDF package of graduate-level digital humanities syllabi shows the following most frequent words over 4 letters: Digital, History, Class, Humanities, Project, Course, Media, Readings, etc.
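That same length cutoff can be reproduced in the shell pipeline with a one-line `awk` filter, for anyone using the workaround instead of the concordance panel. The sample words here are invented:

```shell
# Keep only words longer than 4 characters before counting,
# mimicking the concordance's minimum-length filter.
echo "digital history class media the a of to readings" \
  | tr -s '[:space:]' '\n' \
  | awk 'length($0) > 4' \
  | sort | uniq -c | sort -nr
```

Only the five words longer than four characters (digital, history, class, media, readings) survive to be counted; the short function words are discarded.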

Using those results and picking how many words are chosen from the top, DevonAgent is very effective for finding other syllabi related to my existing library. However, as I mentioned, it must be done manually, which becomes laborious if you want to search using more specific documents.

Really, I think this could be solved by taking the current code for the “get concordance of” command and altering it to sort by frequency. I assume the code currently specifies weight because the concordance itself has no default sort order, but is just a database. Unfortunately, my ability to reverse-engineer code and change it is… limited.