Automatic handling of duplicates with the same name through Smart Rules?

I occasionally have the issue that two Macs import the same document from something like Dropbox. For example, Mac #1 imports it, but before the changes are propagated to Mac #2, that Mac also imports the file, so it ends up in the Inbox twice.

Is there a smart way to deal with those files? That is, they are 100% identical, and I want one of them deleted automatically, leaving only a single copy.

There are smart rules to find duplicated files, but those would remove all instances of the file, not just one. Is it doable, for example, for a Smart Rule to act on only one file at a time? It would remove one duplicate, then Cancel; when it triggers again the next time, only one file exists, so it would no longer be a duplicate.

Are you “importing” or “indexing”? They are different. And are your “indexed” Dropbox folders set to be “offline” or “online”? They should be “offline” to avoid duplication like what I think you are seeing.

I tried both ways, at one point with Hazel just copying things into the Inbox folder.

And the method I use now is an indexed folder with a Smart Rule that applies “Move into Database” to the files from that folder.

My Dropbox content is fully synced offline, so the same file gets downloaded on both machines. Both have DEVONthink running and import it independently, and once they synchronize I have the same file twice.

I’m confused, actually. You appear to be importing the files into the Global Inbox, placed there by Hazel. Then, with the file already in the Global Inbox, you move it into the database?

Maybe I’m misunderstanding what you are doing, but indexing brings a pointer to the files into the database. I think it unusual to have an indexed file in the Global Inbox.

Per the “DEVONthink Manual” p. 144:

Move Into Database: Use this command to move an indexed file into the database. Use this command with caution as it moves the file from its current location into the internals of your database. It does not copy the file.

I have no experience with this command, but it seems like that is the cause of the double entry if this rule is executed on both machines? Should you not be using the simpler “Move” command? And is Hazel active on both machines at the same time, copying the file into the Global Inbox on each machine and hence creating a copy on two distinct machines at the same time?

Maybe I’m misunderstanding or not explaining myself correctly.

What is Preferences > Sync > Conflicts set to?

It’s set to keep latest document

The files are not per se a conflict; it’s just that the same document gets added on different machines at the same time, which creates a duplicate with a different UUID.

Just to rephrase my issue:

  1. Put a file into Dropbox
  2. Dropbox syncs the file to the MacBook and the Mac mini
  3. Both the Mac mini and the MacBook have an indexed-folder Smart Rule on that Dropbox folder and import the file into the database (the file now exists in both the Mac mini and MacBook databases)
  4. When the Mac mini and the MacBook sync the next time, the file exists twice because it was imported twice independently

I am not running Hazel at the same time. It’s just something I tried instead of the indexed folder import in the past.

One solution is of course to not have DEVONthink automatically import stuff on my Mac mini, but if there were some nifty smart way to just squash one of the duplicates automatically (when name and content are 100% identical), I could clean those up with a smart rule and not worry about it :slight_smile:

I saw that there is an AppleScript execution action. Maybe use that with something like a simple md5/sha1 hash to find identical files
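
Something along these lines, maybe (just a rough, untested sketch run against the current selection, not the final rule):

tell application id "DNtp"
	repeat with theRecord in (selection as list)
		-- hash the file on disk; "md5 -q" prints only the hash
		set theHash to do shell script "md5 -q " & quoted form of (path of theRecord)
		log (name of theRecord) & ": " & theHash
	end repeat
end tell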

Got something working. Here’s an AppleScript that computes an md5 hash for every file it gets executed on, compares the hashes, adds “auto_marked_duplicate” to all but one item in each set of identical files, and tags every file it was executed on with “auto_duplicate_processed”.

on performSmartRule(theRecords)
	
	tell application id "DNtp"
		set hashes to {}
		-- set theRecords to records of inbox
		repeat with theRecord in theRecords
			repeat 1 times
				if kind of theRecord is "group" then exit repeat
				-- if tags of theRecord contains "auto_duplicate_processed" then exit repeat
				set filePath to path of theRecord
				set hash to do shell script "md5 -q " & (quoted form of filePath)
				log hash
				set existingTags to tags of theRecord
				set newTags to existingTags & "auto_duplicate_processed"
				
				if hashes contains hash then
					set newTags to (newTags & "auto_marked_duplicate")
				else
					set hashes to hashes & hash
				end if
				
				set tags of theRecord to newTags
			end repeat
		end repeat
	end tell
	
	
end performSmartRule

I could move duplicates directly into the trash, but I just use a second smart rule to do that instead.
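
For reference, that second rule could also be a tiny script of its own; here’s a minimal sketch, assuming the tag name from the script above and items sitting in the Global Inbox:

on performSmartRule(theRecords)
	tell application id "DNtp"
		repeat with theRecord in theRecords
			-- move everything the first script marked as a duplicate into the trash
			if tags of theRecord contains "auto_marked_duplicate" then
				move record theRecord to (trash group of inbox)
			end if
		end repeat
	end tell
end performSmartRule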

Here’s the variant that directly moves all duplicates but one into the trash:

on performSmartRule(theRecords)	
	tell application id "DNtp"
		set hashes to {}
		--- set theRecords to records of inbox
		repeat with theRecord in theRecords
			repeat 1 times
				if kind of theRecord is "group" then exit repeat
				set filePath to path of theRecord
				set hash to do shell script "md5 -q " & (quoted form of filePath)
				
				-- uncomment this line to also check for identical names, not just content
				-- set hash to (name of theRecord) & hash
				
				if hashes contains hash then
					move record theRecord to (trash group of inbox)
				else
					set hashes to hashes & hash
				end if
			end repeat
		end repeat
	end tell
end performSmartRule

There’s a line in there you can uncomment to also make sure the filename is identical. With that enabled, “file1” and “file2” would not get moved to the trash as duplicates, even if their content is identical.

Run it with a rule like this:

Stupid question: What is the repeat 1 times doing there? If you want to do something exactly once, do you even need a repeat?

AppleScript doesn’t have a continue statement for skipping a loop iteration; it can only exit a loop completely. So that repeat 1 times is a fake loop that we can exit early when needed.

I use it for

if kind of theRecord is "group" then exit repeat

to stop doing anything right away if the item is a group and go on to the next item. Without that repeat 1 times, the exit repeat would end the outer loop and stop processing the remaining items entirely :slight_smile:
You can also wrap everything in an if statement, of course; it’s just personal preference.
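
For anyone curious, here’s the same fake-loop trick in isolation (a generic sketch, nothing DEVONthink-specific):

repeat with theNumber in {1, 2, 3, 4, 5}
	repeat 1 times
		set n to contents of theNumber
		if n mod 2 is 0 then exit repeat -- acts like "continue": skip even numbers
		log n -- only reached for odd numbers
	end repeat
end repeat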

repeat with theRecord in theRecords
  if kind of theRecord is not "group" then
…
  end if 
end repeat

should do the trick (I’m not sure if AS has is not, but the idea should be clear). Also, kind is locale-dependent; it’s advisable to use the type property in scripts so that they also work outside of English locales.

DT’s records contain a content hash property, i.e. an SHA1 hash over the document. It should be the same for duplicates, so it might be useful here; it should also be faster than calling the shell on every file.
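
Just to illustrate (assuming theRecord holds a record reference), reading it is a one-liner:

set theHash to content hash of theRecord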

Oh, TIL, thanks! I tried it with content hash (including a couple of files of different types without any text content, like images and PDFs) and it looks to be working fine.

Here’s an updated version that uses type and content hash. I also added a check to skip files that have no duplicates at all, to save on processing time.

on performSmartRule(theRecords)
	tell application id "DNtp"
		set hashes to {}
		-- 	set theRecords to records of inbox
		repeat with theRecord in theRecords
			repeat 1 times
				if type of theRecord as text is "group" then exit repeat -- skip groups
				if number of duplicates of theRecord is 0 then exit repeat -- skip files that don't have duplicates
				
				set hash to content hash of theRecord
				
				-- uncomment this line to also check for identical names, not just content
				-- set hash to (name of theRecord) & hash
				
				if hashes contains hash then
					move record theRecord to (trash group of inbox)
					log "found duplicate " & name of theRecord
				else
					set hashes to hashes & hash
				end if
			end repeat
		end repeat
	end tell
end performSmartRule

Since you’re not exiting the repeat 1 times ... end repeat anymore, the whole thing is unnecessary. Just

repeat with theRecord in theRecords
  if type …
  …
end repeat

is sufficient.

I like to short-circuit the control flow (skip to the next item or return early) whenever I can, to avoid having too much nesting. Because there’s no continue, I’d otherwise have to write an if block for each condition, or make one very long if statement. To me it’s also nicer for quickly adding new filters when needed.

Like I find this easier to read

repeat with theRecord in theRecords
    repeat 1 times
    	if type of theRecord as text is "group" then exit repeat -- skip groups
    	if number of duplicates of theRecord is 0 then exit repeat -- skip files that don't have duplicates
        -- other if rules here

        -- logic here

than this

repeat with theRecord in theRecords
    if type ...
        if number of duplicates ...
            -- other if rules here
                -- logic here

or this

repeat with theRecord in theRecords
    if (type ...) and (number of duplicates ...) and -- other if rules here)
        -- logic here

But yeah, it boils down to whatever style one prefers to code in, and your proposed way is completely fine as well :smile:

At least there are no more duplicates in my Inbox, so I’m happy.

Ah, I missed the end of the line. My bad.
In a modern programming language, you’d do this:

records.filter(r => r.type !== 'group' && r.numberOfDuplicates > 0).forEach(r => {
… process record here …
})

But then you’d have to cope with parentheses and all that… OTOH, you could use nice variable names :wink:

Look Ma, just 6 brackets in 3 lines :smiley:

@syntagm great script, thanks for sharing :slight_smile: I needed to do something similar recently (stopping automatic filing if an item with the same name is already present at the destination). DT’s scriptability makes it unbelievably powerful!

On a side note: There is a Filter Duplicates smart rule built in. :smiley:

But that rule filters all duplicates and doesn’t leave one, right? So if we have file1 and file2 and they are duplicates, that rule would trash both of them, because each is a duplicate of the other.