I have spent hours this morning cleaning up a large database. However, I noticed many files that I was sure were duplicates, but DT indicated they weren’t. Many of these were audio files, so I decided to check them later and continue the clean/purge.
Now, however, I’ve exported a few samples of these files, and they are indeed exact duplicates: running “diff firstfile secondfile” on the exported copies finds no differences.
I’m very concerned at this point, because I relied on this duplicate check as a sanity check while cleaning up gigabytes of old data, and I feel like I probably have to restart from backup.
Anyone know what would cause DT to miss duplicates like this? Is there a way to resolve it?
I didn’t realize that should be enabled. However, the problem seems to persist. For example, DEVONthink says none of these copies of the same file have duplicates, but exporting and comparing with md5:
It’s easily reproducible as well. For example, I selected the first file in this screenshot. The file lives in the Inbox, and I chose Duplicate To → Inbox. Now I have a second, identical file, but note that the details indicate no duplicates exist, even though I am 1000% confident one does:
Opening a terminal in the directory where those “Web Design for Developers” files are:
% file *
Web Design for Developers: PDF document, version 1.3
Web Design for Developers copy: PDF document, version 1.3
Web Design for Developers copy-1: PDF document, version 1.3
% md5 *
MD5 (Web Design for Developers) = 70909781b44f59ea6e8365c5ef533096
MD5 (Web Design for Developers copy) = 70909781b44f59ea6e8365c5ef533096
MD5 (Web Design for Developers copy-1) = 70909781b44f59ea6e8365c5ef533096
Yet the duplicate functionality doesn’t identify them as duplicates…even though they are.
It seems to work with other file types, at least in my spot checking. However, I’m curious how I can trust it if there are some file types (even known types like PDF) where it’s not working…
Oh, and since I’m trying to restore certain files from backup to make sure I didn’t delete things I thought were duplicated elsewhere (my trust is temporarily shaken): is there a way to compare databases and determine which files are in one but not the other, or am I forced to drop to the command line and do it that way?
@BLUEFROG I will try what you suggest re: the PDF.
@cgrunenberg good to know. That’s a very, very important detail I wasn’t aware of.
So I have a number of a/v files and I need to remove duplicates. Is it safe to delete files directly from the .dtBase2 directory or do I need to export all files, do the deletions there, and then re-import?
Just to be clear, I mean can I operate on the database’s contents outside of DevonThink and still delete files safely. The way you word it makes me think you’re talking about deleting from within DT…just want to confirm we’re talking about the same thing…do you mean I can safely cd into the Database.dtBase2/Files.noindex directory and delete from there manually?
You can, but you certainly shouldn’t. Or rather, you must not (although it is technically possible). Although it looks like an ordinary folder, DT probably relies on internal metadata that does not get updated when you manipulate the data directly.
So what would be the recommended way to track down duplicates, which may or may not be audio/visual files, in DEVONthink? Export the entire database and then re-import? I can do that, but at 100 GB+ it’ll be some effort. Any other ways?
Indexing can be used when there’s a need to access files from other apps. So if you index a whole folder (instead of each file separately), you can access that folder’s contents from, e.g., the Finder.
The simplest way to see whether it would work for you is to test it with some dummy records. In my short test I deleted a file in the Finder, and DEVONthink automatically updated the indexed group, i.e. the record was deleted directly (not moved to the trash). It will probably work from Terminal too, but I never tested that.
Note: Please also test what happens to replicated records when you move them to the indexed group.