AppleScript delete by similarity

Hi All,

I have the following “problem”:
multiple feeds that sometimes say the same thing.
Is there a way to identify similar items and delete them by some criterion (e.g. date, word count)?
For example:
article 1: “Devontechnologies set to surpass apple revenue in 2024”
article 2: “Devontechnologies will have more revenue than Apple in 2024: apple filing for bankruptcy next month due to various reasons”

Now, the two above are not duplicates, because I use a strict definition of duplicates: it is a system-wide feature, I automate it in some places, and I don’t want to delete stuff by accident.

I want to keep article 2 and delete article 1.
How can I identify article 1?
I believe I’m looking to compare the “See Also” results against a fixed similarity threshold.

PS: putting this as a per-db parameter would be cool!

Sorry, but there is no way to do this automatically, only by hand-curating the articles.
It’s possible they will appear in the See Also & Classify inspector, where you can Control-click > Move All Instances to Trash for an item.

Yes, that’s where I see them! But it also lists things that are less related, which I don’t want to delete.
I was hoping to be able to do that since the classification info is accessible, but I understand that See Also is probably a heavier computation. Too bad, that would have opened up many possibilities.
What about playing around with the definition of a duplicate via AppleScript/Terminal? Is there a parameter we can edit within a script to tweak the search to accomplish the same result?

No, it’s not possible to change the definition of a duplicate, in the application or via AppleScript.

And as you’ve already said, these are not actual duplicates but contextual ones. See Also would be the place where you could manually curate the articles.

One possibility might be a smart rule executing an embedded script and using the trigger on news. Then the script could use the “compare” command to find similar items. If such items are found then the item would be deleted or trashed.
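
A minimal sketch of that idea, not a tested implementation. It assumes the smart rule runs an embedded script through the performSmartRule handler, that the results of “compare” carry a usable score, and that the database exposes a trash group. The minScore threshold and the “keep the longer item” rule based on word count are hypothetical choices for illustration.

```applescript
-- Sketch only: minScore and the word-count rule are assumptions, not DEVONthink defaults.
property minScore : 0.8 -- hypothetical threshold; tune it against your own feeds

on performSmartRule(theRecords)
	tell application id "DNtp"
		repeat with theRecord in theRecords
			set theDatabase to database of theRecord
			set theWordCount to word count of theRecord
			-- Ask DEVONthink for items similar to the incoming article
			set theMatches to compare record theRecord to theDatabase
			repeat with theMatch in theMatches
				-- Skip the article itself in case it is part of the results
				if (uuid of theMatch) is not (uuid of theRecord) then
					-- A close match that is at least as long makes the new item redundant
					if (score of theMatch) ≥ minScore and (word count of theMatch) ≥ theWordCount then
						-- Assumes a trash group exists; the item is moved, not permanently deleted
						move record theRecord to trash group of theDatabase
						exit repeat
					end if
				end if
			end repeat
		end repeat
	end tell
end performSmartRule
```

This only covers the case where the shorter article arrives after the longer one. It would be worth replacing the move with a label or a log entry first, to check which items a given threshold would catch.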

Thanks! I tried it just now to check the results. I made a simple script:

tell application id "DNtp"
	set y to selection of think window 1
	set x to compare record (first item of y)
	return x
end tell

But this returns a list without “weights”, e.g.

{ content id 26578 of database id 4 of application “DEVONthink 3”, content id 26592 of database id 4 of application “DEVONthink 3”, content id 27441 of database id 4 of application “DEVONthink 3”, content id 32529 of database id 4 of application “DEVONthink 3”}

Of these, I know the first three results have high similarity and the others do not.
How can I find the weights, that is, the similarity score we see in See Also, for each of these returned items with respect to the selected record?

See score property of records.
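
For example, a small sketch that pairs each result of “compare” with its score, assuming the score is filled in on the returned records (theReport is just an illustrative variable name):

```applescript
tell application id "DNtp"
	set theSelection to selection of think window 1
	if theSelection is {} then error "Select a record first."
	set theRecord to first item of theSelection
	set theResults to compare record theRecord
	set theReport to {}
	repeat with theResult in theResults
		-- Collect {name, score} pairs so the weights can be inspected in the result pane
		set end of theReport to {name of theResult, score of theResult}
	end repeat
	return theReport
end tell
```

Running it in Script Editor with one item selected should return a list of pairs to compare against what the See Also inspector shows.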

Those are nice articles! )))))

Hahaha! I didn’t even notice that 😛

Just a possibility:

Try this for extracting similar documents that are not PDFs:

  • Set “minSeeAlsoWeight” and “maxSeeAlsoDocs” to your preference. “minSeeAlsoWeight” sets the threshold for the similarity score (0.0 to 1.0).
  • The script can be run when one item is selected in a viewer window, or while you are reading an article/clipping in a document window. I have only tested it (briefly) on text, RTF, and Markdown files.
  • It only searches for similar documents in the current database, where the compared document is located. To extract similar documents from all open databases, remove “to current database” from the line set theResults to compare record theDoc to current database.
use AppleScript version "2.4" -- Yosemite (10.10) or later
use scripting additions

-- ngan 2020.04.029
-- Simplified version

global thisItem, theDoc, theSeeAlsoDocs, theResults

property minSeeAlsoWeight : 0.4
property maxSeeAlsoDocs : 100

-- options
property newViewerWin : true

tell application id "DNtp"
	set {theDoc} to item 1 of {selection}
	set theSeeAlsoDocs to my getSeeAlsoDocs(theDoc, minSeeAlsoWeight, maxSeeAlsoDocs, "")
	set search results of viewer window 1 to theSeeAlsoDocs
end tell

on getSeeAlsoDocs(theDoc, minSeeAlsoWeight, maxSeeAlsoDocs, theSeeAlsoChoice)
	local l
	set l to {}
	tell application id "DNtp"
		set theResults to compare record theDoc to current database
		if theResults ≠ {} then
			repeat with i from 1 to my min(maxSeeAlsoDocs, my max(length of theResults, 1))
				if ((score of theResults's item i) ≥ minSeeAlsoWeight) and (type of theResults's item i is not in {PDF document, group}) then set end of l to theResults's item i
			end repeat
		end if
		return l
	end tell
end getSeeAlsoDocs

on min(x, y)
	if x ≤ y then
		return x
	else
		return y
	end if
end min

on max(x, y)
	if x ≤ y then
		return y
	else
		return x
	end if
end max