I have the following “problem”:
multiple feeds that sometimes say the same thing.
Is there a way to identify similar items and delete them by some criterion (e.g. date or word count)?
For example:
Article 1: “Devontechnologies set to surpass Apple revenue in 2024”
Article 2: “Devontechnologies will have more revenue than Apple in 2024: Apple filing for bankruptcy next month due to various reasons”
Now, the two above are not duplicates under the strict definition of duplicates that I use. I keep it strict because duplicate detection is a system-wide feature, I automate it in some places, and I don’t want to delete stuff by accident.
I want to keep article 2 and delete article 1.
How can I identify article 1?
I believe what I’m looking for is a way to compare the “See Also” results against a fixed similarity threshold.
PS: making this a per-database parameter would be cool!
Sorry, but there is no way to do this automatically; it can only be done by hand-curating the articles.
It’s possible they will appear in the See Also & Classify inspector, where you can Control-click > Move All Instances to Trash for an item.
Yes, that’s where I see them! But it also lists things that are less closely related, which I don’t want to delete.
I was hoping to be able to do that since the classification info is accessible, but I understand that See Also is probably a heavier computation. Too bad, that would have opened up many possibilities.
What about playing around with the definition of a duplicate via AppleScript/Terminal? Is there a parameter we can edit within a script to tweak the search and accomplish the same result?
No, it’s not possible to change the definition of a duplicate, either in the application or via AppleScript.
And as you’ve already said, these are not actual duplicates but contextual ones. See Also would be the place where you could manually curate the articles.
One possibility might be a smart rule executing an embedded script and using the trigger on news. Then the script could use the “compare” command to find similar items. If such items are found then the item would be deleted or trashed.
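The smart rule idea above could look roughly like the following embedded script. This is only a minimal sketch: the 0.8 threshold, the “keep the longer item” criterion based on word count, and moving the shorter item to the database’s trash are assumptions chosen to match the example in the question, not tested behaviour. Try it on a copy of the database first.

```applescript
-- Sketch of an embedded smart rule script (assumes DEVONthink 3).
-- minScore is an assumed threshold; tune it against what See Also shows.
property minScore : 0.8

on performSmartRule(theRecords)
	tell application id "DNtp"
		repeat with theRecord in theRecords
			set theDatabase to database of theRecord
			-- Ask DEVONthink for items similar to the incoming record.
			set theMatches to compare record theRecord to theDatabase
			repeat with theMatch in theMatches
				if (uuid of theMatch is not uuid of theRecord) and ((score of theMatch) ≥ minScore) then
					-- Assumed criterion: keep whichever item has more words.
					if (word count of theMatch) < (word count of theRecord) then
						move record theMatch to (trash group of theDatabase)
					else
						move record theRecord to (trash group of theDatabase)
						exit repeat
					end if
				end if
			end repeat
		end repeat
	end tell
end performSmartRule
```

Because the items are only moved to the trash, anything removed by mistake can still be recovered until the trash is emptied.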
Of these, I know the first three results have high similarity and the others do not.
How can I find the weights, that is, the similarity scores shown in See Also, for each of the returned items with respect to the selected record?
Try this for extracting similar documents that are non-PDF:
Set “minSeeAlsoWeight” and “maxSeeAlsoDocs” to your preference. “minSeeAlsoWeight” sets the threshold for the similarity score (0.0 to 1.0).
This script can be run when you select one item in the viewer window, or when you are reading an article/clipping in a document window. I have only tested the script (briefly) on text, RTF, and Markdown files.
The script only searches for similar documents in the current database, where the compared document is located. You can remove “to current database” from the line set theResults to compare record theDoc to current database to extract similar documents from all open databases.
use AppleScript version "2.4" -- Yosemite (10.10) or later
use scripting additions

-- ngan 2020.04.029
-- Simplified version

global thisItem, theDoc, theSeeAlsoDocs, theResults

property minSeeAlsoWeight : 0.4 -- minimum similarity score (0.0 to 1.0)
property maxSeeAlsoDocs : 100 -- maximum number of results to inspect

-- options
property newViewerWin : true

tell application id "DNtp"
	-- Use the first selected record as the document to compare against
	set theDoc to item 1 of (selection as list)
	set theSeeAlsoDocs to my getSeeAlsoDocs(theDoc, minSeeAlsoWeight, maxSeeAlsoDocs, "")
	-- Show the filtered matches in the frontmost viewer window
	set search results of viewer window 1 to theSeeAlsoDocs
end tell

-- Returns the similar records whose score is at least minSeeAlsoWeight,
-- skipping PDF documents and groups
on getSeeAlsoDocs(theDoc, minSeeAlsoWeight, maxSeeAlsoDocs, theSeeAlsoChoice)
	local l
	set l to {}
	tell application id "DNtp"
		set theResults to compare record theDoc to current database
		if theResults ≠ {} then
			repeat with i from 1 to my min(maxSeeAlsoDocs, my max(length of theResults, 1))
				if ((score of theResults's item i) ≥ minSeeAlsoWeight) and (type of theResults's item i is not in {PDF document, group}) then set end of l to theResults's item i
			end repeat
		end if
		return l
	end tell
end getSeeAlsoDocs

-- Smaller of two numbers
on min(x, y)
	if x ≤ y then
		return x
	else
		return y
	end if
end min

-- Larger of two numbers
on max(x, y)
	if x ≤ y then
		return y
	else
		return x
	end if
end max
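To simply look at the weights rather than filter on them, a shorter variant can print the score next to each name. This is a rough sketch that relies on the same “compare” command and “score” property as the script above; comparing against the database of the selected record instead of the current database is my own assumption about what is wanted.

```applescript
-- Sketch: list the similarity score and name of each See Also candidate
-- for the first selected record. Run it from Script Editor to see the result.
tell application id "DNtp"
	set theSelection to selection as list
	if theSelection is {} then error "Select one record first."
	set theDoc to item 1 of theSelection
	set theReport to ""
	-- Assumption: compare within the database that holds the selected record.
	repeat with theMatch in (compare record theDoc to (database of theDoc))
		set theReport to theReport & (score of theMatch) & tab & (name of theMatch) & linefeed
	end repeat
end tell
return theReport
```

Reading the output for a few known pairs should make it easier to pick a sensible value for “minSeeAlsoWeight”.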