I occasionally run into the issue that two Macs import the same document, e.g. from Dropbox. Mac #1 imports it, but before the change propagates to Mac #2, that Mac also imports the file, so it ends up in the Inbox twice.
Is there a smart way to deal with those files? They are 100% identical, and I want to delete one of them automatically, leaving only a single copy.
There are smart rules to find duplicated files, but those would remove all instances of the file, not just one. Is it possible for a Smart Rule to act on only one file at a time? It would remove one duplicate, then stop; when it triggers the next time, only one file exists, so it would no longer be a duplicate.
Are you “importing” or “indexing”? They are different. And are your “indexed” Dropbox folders set to be “offline” or “online”? They should be “offline” to avoid the kind of duplication I think you are seeing.
I tried both ways, at one point with Hazel simply copying things into the Inbox/ folder.
The method I use now is an indexed folder with a Smart Rule that uses “Move into Database” on that folder.
My Dropbox content is fully synced offline, so the same file gets downloaded on two machines; both have DEVONthink running and import it independently. Once they synchronize, I have the same file twice.
I’m confused, actually. You appear to be importing the files into the Global Inbox, placed there by Hazel. Then, with the file already in the Global Inbox, you move it into the database?
Maybe I’m misunderstanding what you are doing, but indexing brings a pointer to the files into the database. I find it unusual to have an indexed file in the Global Inbox.
Per the “DEVONthink Manual” p. 144:
Move Into Database: Use this command to move an indexed file into the database. Use this command with caution as it moves the file from its current location into the internals of your database. It does not copy the file.
I have no experience with this command, but it seems like that is the cause of the double entry if this rule executes on both machines. Shouldn’t you be using the simpler “Move” command? And is Hazel active on both machines at the same time, copying a file into each machine’s Global Inbox and hence creating a copy on two distinct machines at once?
Maybe I’m misunderstanding or not explaining myself correctly.
The files are not per se a conflict; the same document just gets added on different machines at the same time, which causes a duplicate with a different UUID.
Just to rephrase my issue:
1. Put a file into Dropbox.
2. Dropbox syncs the file to the MacBook and the Mac mini.
3. Both machines have an indexed-folder Smart Rule on that Dropbox folder and import the file into the database (the file now exists in both the Mac mini’s and the MacBook’s database).
4. When the Mac mini and the MacBook sync the next time, the file exists twice because it was imported twice independently.
I am not running Hazel at the same time. It’s just something I tried in the past instead of the indexed-folder import.
One solution is of course to stop DEVONthink from automatically importing things on my Mac mini, but if there were some nifty way to automatically squash one of the duplicates (when name and content are 100% identical), I could clean them up with a smart rule and not worry about it.
I saw that there is an AppleScript execution action. Maybe use that with something like a simple md5/sha1 hash to find identical files
Got something working. Here’s an AppleScript that computes an md5 hash for every file it is executed on, compares the hashes, adds “auto_marked_duplicate” to all copies but one, and then tags every file it processed with “auto_duplicate_processed”.
on performSmartRule(theRecords)
    tell application id "DNtp"
        set hashes to {}
        -- set theRecords to records of inbox
        repeat with theRecord in theRecords
            repeat 1 times
                if kind of theRecord is "group" then exit repeat
                -- if tags of theRecord contains "auto_duplicate_processed" then exit repeat
                set filePath to path of theRecord
                set hash to do shell script "md5 -q " & (quoted form of filePath)
                log hash
                set existingTags to tags of theRecord
                set newTags to existingTags & "auto_duplicate_processed"
                if hashes contains hash then
                    set newTags to (newTags & "auto_marked_duplicate")
                else
                    set hashes to hashes & hash
                end if
                set tags of theRecord to newTags
            end repeat
        end repeat
    end tell
end performSmartRule
I could move duplicates directly into the trash, but I just use a second smart rule to do that instead.
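For reference, that second rule can be very small. Here’s a minimal sketch of a script action that trashes everything carrying the marker tag; it assumes the tag name from the script above and is untested, so treat it as a starting point:

on performSmartRule(theRecords)
    tell application id "DNtp"
        repeat with theRecord in theRecords
            -- only touch items the first script explicitly marked as duplicates
            if tags of theRecord contains "auto_marked_duplicate" then
                move record theRecord to trash group of database of theRecord
            end if
        end repeat
    end tell
end performSmartRule

A plain smart rule with a tag filter and the built-in “Move to Trash” action would presumably work just as well, without any scripting.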
Here’s the variant to directly move all duplicates but one into trash:
on performSmartRule(theRecords)
    tell application id "DNtp"
        set hashes to {}
        -- set theRecords to records of inbox
        repeat with theRecord in theRecords
            repeat 1 times
                if kind of theRecord is "group" then exit repeat
                set filePath to path of theRecord
                set hash to do shell script "md5 -q " & (quoted form of filePath)
                -- uncomment this line to also check for identical names, not just content
                -- set hash to (name of theRecord) & hash
                if hashes contains hash then
                    move record theRecord to (trash group of inbox)
                else
                    set hashes to hashes & hash
                end if
            end repeat
        end repeat
    end tell
end performSmartRule
There’s a line in there you can uncomment to also require that the filename is identical. With that enabled, “file1” and “file2” would not get moved to the trash as duplicates even if their content is identical.
AppleScript doesn’t have a continue statement for skipping a loop iteration; it can only exit a loop completely. So that repeat 1 times is a fake loop we can exit to skip ahead when needed.
I use it for
if kind of theRecord is "group" then exit repeat
to stop doing anything if the item is a group, abort, and go to the next item. Without that repeat 1 times, the exit repeat would stop the outer loop, and with it the script, completely.
You can also wrap everything in an if statement, of course; it’s just personal preference:
repeat with theRecord in theRecords
    if kind of theRecord is not "group" then
        …
    end if
end repeat
should do the trick (AppleScript does have is not, by the way). Also, kind is locale-dependent, so it’s advisable to use the type property in scripts so that they also work outside English locales.
DT’s records have a content hash property, i.e. an SHA-1 hash of the document. It should be the same for duplicates, so it might be useful here – and it should be faster than calling the shell for every file.
Oh, TIL – thanks! I tried it with content hash (including a couple of files of different types without any text content, like images and PDFs) and it looks to be working fine.
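If anyone wants to inspect the property on their own files first, here’s a quick sketch you can run in Script Editor with a few items selected (it assumes DEVONthink 3’s selected records and content hash properties; untested, so your mileage may vary):

tell application id "DNtp"
    repeat with theRecord in selected records
        log (name of theRecord) & ": " & (content hash of theRecord)
    end repeat
end tell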
Here’s an updated version that uses type and content hash. I also added a check that skips files without any duplicates, to save on processing time:
on performSmartRule(theRecords)
    tell application id "DNtp"
        set hashes to {}
        -- set theRecords to records of inbox
        repeat with theRecord in theRecords
            repeat 1 times
                if type of theRecord as text is "group" then exit repeat -- skip groups
                if number of duplicates of theRecord is 0 then exit repeat -- skip files that don't have duplicates
                set hash to content hash of theRecord
                -- uncomment this line to also check for identical names, not just content
                -- set hash to (name of theRecord) & hash
                if hashes contains hash then
                    move record theRecord to (trash group of inbox)
                    log "found duplicate: " & (name of theRecord)
                else
                    set hashes to hashes & hash
                end if
            end repeat
        end repeat
    end tell
end performSmartRule
I like to short-circuit the control flow (skip the loop iteration and return early) whenever I can, to avoid too much nesting. Because there’s no continue, the alternatives would be an if block for each condition or one very long if statement. To me, the short-circuit style is also nicer for quickly adding new filters when needed.
Like I find this easier to read
repeat with theRecord in theRecords
    repeat 1 times
        if type of theRecord as text is "group" then exit repeat -- skip groups
        if number of duplicates of theRecord is 0 then exit repeat -- skip files that don't have duplicates
        -- other if rules here
        -- logic here
than this
repeat with theRecord in theRecords
    if type ...
        if number of duplicates ...
            -- other if rules here
            -- logic here
or this
repeat with theRecord in theRecords
    if (type ...) and (number of duplicates ...) and (other if rules here) then
        -- logic here
But yeah, it boils down to whatever style one prefers; your proposed way is completely fine as well.
At least no more duplicates in my Inbox so I’m happy
@syntagm great script, thanks for sharing! I needed to do something similar recently (stopping automatic filing if an item with the same name is already present at the destination). DT’s scriptability makes it unbelievably powerful!
But that rule filters all duplicates and doesn’t leave one, right? If file1 and file2 are duplicates, the rule would trash both of them, because each is a duplicate of the other.