Tagging with Word Count "Class"

pete31 · January 11, 2020, 10:13am

Hi, I often thought tagging records with custom word count “classes” (not with their actual word count) could be very handy. Never really looked into this, but now I did - and it’s not as simple as I thought.

The idea is to add a word count tag to all PDFs (from OCRed screenshots to whole books). Tagging with the exact word count of course doesn’t make much sense, but it does if we use “word count classes” instead.

Not sure if the “classes” I used while trying to write the script are ok, but I’m pretty sure it makes sense for them to go up in larger steps as word counts increase.

What I want to do (actual Word Count on the left, “Class” on the right):

23 → 100
58 → 100

201 → 200
257 → 300
654 → 700

1119 → 1000
1476 → 1500
1598 → 1750
1805 → 2000

3918 → 4000

That’s all fine, but at some point I thought it would be a good idea to have it all in one - and messed everything up.

Any ideas how to go on with higher Word Counts? Are there “functions” from other scripting languages that could be used in AppleScript for this kind of stuff?

Here’s what I have right now

tell application id "DNtp"
	try
		set windowClass to class of window 1
		if {viewer window, search window} contains windowClass then
			set currentRecord_s to selection of window 1
		else if windowClass = document window then
			set currentRecord_s to content record of window 1 as list
		end if
		
		set theWordCountTags to {} -- testing
		
		repeat with thisRecord in currentRecord_s
			set wordCount to word count of thisRecord
			set theLength to (length of (wordCount as string))
			
			if theLength < 4 then
				set d to wordCount / 100
				if d ≤ 1 then
					set wordCountTag to "WordCount: 100"
				else
					set roundedWordCount to (round d) * 100
					set wordCountTag to "WordCount: " & roundedWordCount as string
				end if
				
				set end of theWordCountTags to wordCountTag -- testing
				
				
			else
				-- Scale to one leading digit, e.g. 1598 becomes 1.598
				set theMagnitude to 10 ^ (theLength - 1)
				set d to wordCount / theMagnitude
				set theLeadDigit to wordCount div theMagnitude
				
				-- Numeric thresholds avoid locale-dependent strings such as "1,25"
				-- and leave no gaps between the ranges
				if d < (theLeadDigit + 0.25) then
					set d to theLeadDigit
					
				else if d < (theLeadDigit + 0.5) then
					set d to theLeadDigit + 0.5
					
				else if d < (theLeadDigit + 0.75) then
					set d to theLeadDigit + 0.75
					
				else
					set d to theLeadDigit + 1
				end if
				
				set roundedWordCount to round (d * theMagnitude)
				set wordCountTag to "WordCount: " & roundedWordCount as string
				
				set end of theWordCountTags to wordCountTag -- testing
				
			end if
			
			#set tags of thisRecord to (tags of thisRecord & wordCountTag)
			
		end repeat
		
		return theWordCountTags -- testing
		
	on error error_message number error_number
		if the error_number is not -128 then display alert "DEVONthink" message error_message as warning
	end try
end tell
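
A minimal sketch of the mapping listed above as a standalone handler, so it can be tried in Script Editor without DEVONthink. The handler name `wordCountClass` and its variable names are hypothetical, and it assumes the boundaries implied by the examples: multiples of 100 below 1000 (never less than 100), then steps of +0, +0.5, +0.75 and +1 on the leading digit at every larger magnitude.

```applescript
-- Hypothetical handler (not part of DEVONthink): map a word count to its "class"
on wordCountClass(wc)
	if wc ≤ 100 then return 100
	if wc < 1000 then
		-- Below 1000: nearest multiple of 100
		return (round (wc / 100) rounding as taught in school) * 100
	end if
	
	-- Power of ten of the leading digit, e.g. 1000 for 1598
	set mag to 1
	repeat while wc div mag ≥ 10
		set mag to mag * 10
	end repeat
	
	set leadDigit to wc div mag
	set frac to (wc mod mag) / mag -- part after the leading digit, from 0.0 up to (not including) 1.0
	
	if frac < 0.25 then
		set stepped to leadDigit
	else if frac < 0.5 then
		set stepped to leadDigit + 0.5
	else if frac < 0.75 then
		set stepped to leadDigit + 0.75
	else
		set stepped to leadDigit + 1
	end if
	
	return round (stepped * mag)
end wordCountClass

wordCountClass(1598) -- 1750
```

Inside a `tell application id "DNtp"` block the call would need to be written as `my wordCountClass(wordCount)`.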

BLUEFROG · January 11, 2020, 4:03pm

1119 → 1000
1476 → 1500
1598 → 1750
1805 → 2000

These two appear to be inconsistent compared to the other mappings. Why doesn’t 1598 map to 1600, and 1805 to 1800?

And I’m curious what this is all useful for?

pete31 · January 11, 2020, 4:55pm

Just checked it again: the script does work that way (though I’m not sure yet whether “classes” in steps of 250 make sense; maybe steps of 500 would do for word counts from 1000 to 9999). The problem is everything above that point…
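
One way to compare steps of 250 and 500, and to carry the same idea past 9999, is to express the step as a fraction of the leading digit's power of ten. The following is only a sketch under that assumption; `classForStep` and `stepFraction` are hypothetical names, and it rounds to the nearest multiple, so it will not reproduce the 1598 → 1750 mapping above.

```applescript
-- Hypothetical handler: round wc to the nearest multiple of (stepFraction * magnitude),
-- where magnitude is the power of ten of the leading digit (1000 for 1598, 10000 for 12345)
on classForStep(wc, stepFraction)
	set mag to 1
	repeat while wc div mag ≥ 10
		set mag to mag * 10
	end repeat
	
	set stepSize to stepFraction * mag
	return ((round (wc / stepSize) rounding as taught in school) * stepSize) as integer
end classForStep

classForStep(1598, 0.25) -- 1500 (steps of 250)
classForStep(1598, 0.5) -- 1500 (steps of 500)
classForStep(12345, 0.25) -- 12500 (steps of 2500)
classForStep(12345, 0.5) -- 10000 (steps of 5000)
```

Because the step grows with the magnitude, the number of possible classes per power of ten stays fixed (36 with 0.25, 18 with 0.5), which keeps the tag list from growing without bound.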

I’m dealing with a lot of different sources, some are simple “ticker” news, some are newspaper articles, some are very long, and there are a lot of papers too. So it would be handy to distinguish them by word count (classes). I did not use tags before, but if I could implement them automatically it would be a great help.

Background is that today I initially wanted to test (and implement as a smart rule) another script that auto-tags by frequency and weight concordance. While testing the best balance between frequency and weight (for me), it occurred to me that it could also be useful to change the number of tags this other concordance script automatically attaches, depending on word count…

Yes, I got a “little bit” sidetracked, but the idea of auto-tag with noun concordance and distinguish sources by word count (which often means quality over here) got me

BLUEFROG · January 11, 2020, 7:22pm

I think this would be more easily done by creating smart groups with a Word Count criterion and/or showing the Word Count column and sorting on it.
This would also alleviate having a potentially massive number of Tags.

This is my suggestion…

And here is an example of results with the Words (Word Count) column showing…

pete31 · January 11, 2020, 8:09pm

Sure, that’s why I try to find a way or handler (maybe a “function” from another script language) to “cut” the actual word counts into “classes”

I’ve never used Tags, but inspired by this forum I think they could be very useful (for me). E.g., I’ve got a topic I want to dive into:

“auto-tagging” with frequency and weight concordance will (hopefully) tell me what a PDF is about (I’ve added this record, but that may be years ago).
a WordCount: 12345 Tag opens the opportunity to decide if this record is what I want to see at this moment - considering the time I have or the quality I want to read.

It’s just another layer to sort things at a given moment.

BLUEFROG · January 11, 2020, 8:21pm

It’s your call but I would opt for smart groups as I showed in my example. This is especially true since Tags are used in autocompletion suggestions too. If you have many WordCount tags and type w in a tag, it is going to give you those tags as suggestions too.

pete31 · January 11, 2020, 9:03pm

Thanks, I didn’t consider this before (as I used no tags). The posted script is just an example of failure (while testing I used "_WC: " for “WordCount”, "_F: " for “Frequency” and "_W: " for “Weight” in order to have them always at the beginning).

But I think everything got mixed up here (which is my fault): I wrote about a concordance script (which I wanted to test), while what I’m actually looking for now is a “function” or AppleScript handler that could round word counts. I’m sure that’s possible, but after several hours…

BLUEFROG · January 11, 2020, 9:06pm

Here is my approach you could try if you’d like…

tell application id "DNtp"
	set sel to item 1 of (selection as list)
	set wc to (word count of sel as integer)
	
	-- Exponent of the leading digit, e.g. 3 for 1598
	set digitCount to (count (wc as string)) - 1
	if digitCount ≥ 4 then
		-- Five or more digits: keep two significant digits
		set digitCount to (digitCount - 1)
	else if digitCount = 0 then
		-- Single-digit counts all go into class 10
		set wordCount to 10
		return wordCount
	end if
	set rounder to (10 ^ digitCount) as integer
	set wordCount to (round (wc / rounder) rounding as taught in school) * rounder
end tell

Bear in mind, the level of rounding is very subjective and could be handled differently. For example, I have a PDF with 2,043,000 words. This rounds to 2,000,000 words, not 2,040,000. Also 367,000 words rounds to 370,000.
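
The rounding described here amounts to keeping a fixed number of significant digits. As a rough sketch, assuming a hypothetical handler `roundToSignificant` that takes the number of digits to keep as a parameter (the name and the parameter are illustrations, not part of the script above):

```applescript
-- Hypothetical handler: keep only the first sigDigits digits of wc, rounding the rest
on roundToSignificant(wc, sigDigits)
	-- Count the digits arithmetically instead of via string length
	set digitCount to 1
	set mag to 1
	repeat while wc div mag ≥ 10
		set mag to mag * 10
		set digitCount to digitCount + 1
	end repeat
	
	if digitCount ≤ sigDigits then return wc -- nothing to round away
	
	set rounder to 10 ^ (digitCount - sigDigits)
	return ((round (wc / rounder) rounding as taught in school) * rounder) as integer
end roundToSignificant

roundToSignificant(2043000, 2) -- 2000000
roundToSignificant(2043000, 3) -- 2040000
roundToSignificant(367000, 2) -- 370000
```

Making the digit count a parameter turns the subjective part of the rounding into a single setting: 2 gives the coarser classes from the examples, 3 the finer ones.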

pete31 · January 11, 2020, 9:24pm

That’s exactly what I want to know - how would other users round (or take this challenge)

Doesn’t make much difference for a 2,000,000-word PDF, but it does if you try to distinguish a lot of PDFs with word counts between 300 and 2000 words by their word count. I’ve collected so much stuff (and DEVONthink makes it dangerously easy to do so…), every little hint on what’s possibly important is welcome. I’ll try your script now.