DT not accurately labeling files as duplicates

d1rewolf · September 8, 2021, 6:33pm

I have spent hours this morning cleaning up a large database. However, I noticed many files which I was sure were duplicates, but DT was indicating they weren’t. Many of these were audio files, so I decided to check them later and continue the clean/purge.

Now, however, I’ve exported a few samples of these files, and indeed they are exact duplicates, in that “diff firstfile secondfile” after exporting them finds no differences.

I’m very concerned at this point, because I relied on this duplicate check as a sanity check while cleaning up gigabytes of old data, and I feel like I probably have to restart from backup.

Anyone know what would cause DT to miss duplicates like this? Is there a way to resolve it?

Thanks in advance.

d1rewolf

BLUEFROG · September 8, 2021, 7:06pm

Do you have Preferences > General > Stricter recognition of duplicates enabled?

d1rewolf · September 8, 2021, 7:43pm

I didn’t, and I didn’t realize it should be enabled. However, the problem still seems to persist. For example, DEVONthink says none of these copies of the same file have duplicates, but exporting them and comparing with md5 gives identical hashes:

% md5 convowithparents*
MD5 (convowithparents-1.3gp) = 58574312a86ed80facb7cd080048e833
MD5 (convowithparents-2.3gp) = 58574312a86ed80facb7cd080048e833
MD5 (convowithparents-3.3gp) = 58574312a86ed80facb7cd080048e833
MD5 (convowithparents-4.3gp) = 58574312a86ed80facb7cd080048e833
MD5 (convowithparents-5.3gp) = 58574312a86ed80facb7cd080048e833
MD5 (convowithparents-6.3gp) = 58574312a86ed80facb7cd080048e833
MD5 (convowithparents.3gp) = 58574312a86ed80facb7cd080048e833
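
The same check can be scripted when there are more files than are comfortable to compare by eye. The following is a minimal sketch, not anything DEVONthink provides: it assumes the files have already been exported to the current folder, and the helper names `file_digest` and `all_identical` are made up for illustration. MD5 is used only so the digests match the `md5` output above.

```python
import hashlib
from pathlib import Path


def file_digest(path, algorithm="md5", chunk_size=1 << 20):
    """Return the hex digest of a file, read in chunks to keep memory use flat."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def all_identical(paths, algorithm="md5"):
    """True when every listed file has the same digest (also true for 0 or 1 files)."""
    digests = {file_digest(path, algorithm) for path in paths}
    return len(digests) <= 1


if __name__ == "__main__":
    # Assumed location and pattern, mirroring the shell glob in the post above.
    exported = sorted(Path(".").glob("convowithparents*"))
    for path in exported:
        print(file_digest(path), path.name)
    print("all identical:", all_identical(exported))
```

Matching digests only show that the exported bytes are the same. They say nothing about how DEVONthink itself decides what counts as a duplicate.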

d1rewolf · September 8, 2021, 8:14pm

Is there any hash sum or any other indicator I could look at within DT3 Pro to determine why they’re being considered different files?

d1rewolf · September 8, 2021, 8:59pm

It’s easily reproducible as well. For example, I selected the first file in this screenshot. The file lives in the Inbox, and I chose Duplicate To → Inbox. Now I have a second, identical file, but note that the details indicate no duplicates exist, even though I am 1000% confident one does:

BLUEFROG · September 8, 2021, 10:38pm

It shouldn’t necessarily be enabled. It’s optional, hence it’s in the preferences.

What is this Document?

Try it with a known format, like a text file or an image. What do you see?

d1rewolf · September 9, 2021, 2:52am

Opening a terminal in the directory where those “Web Design for Developers” files are:

% file *
Web Design for Developers:        PDF document, version 1.3
Web Design for Developers copy:   PDF document, version 1.3
Web Design for Developers copy-1: PDF document, version 1.3
% md5 *
MD5 (Web Design for Developers) = 70909781b44f59ea6e8365c5ef533096
MD5 (Web Design for Developers copy) = 70909781b44f59ea6e8365c5ef533096
MD5 (Web Design for Developers copy-1) = 70909781b44f59ea6e8365c5ef533096

Yet the duplicate functionality doesn’t identify them as duplicates…even though they are.

It seems to work with other file types, at least in my spot checking. However, I’m curious how I can trust it if there are some file types (even known types like PDF) where it’s not working…

Your thoughts?

Oh, and since I’m trying to restore certain files from backup to make sure I didn’t delete things I thought were duplicated elsewhere (my trust is temporarily shaken), is there a way to compare databases to determine files which are in one but not another, or am I forced to drop to the command line and try to do it that way?

Thanks in advance.

BLUEFROG · September 9, 2021, 3:55am

Regardless of what the shell reports via file or md5, the file(s) are not recognized as PDFs in DEVONthink.

Copy the file to your desktop.
Add the .pdf extension.
Import the file back into DEVONthink.
If it’s recognized as PDF+Text, duplicate it and see how it behaves.

“is there a way to compare databases to determine files which are in one but not another, or am I forced to drop to the command line and try to do it that way?”

No, there is no way to directly compare the contents of two open databases.

In Terminal you could run this command, obviously using the names of your databases…

diff -r -y --suppress-common-lines 17.dtBase2/Files.noindex 17\ copy.dtBase2/Files.noindex

In this test, it yielded the three files I deleted from the copy database…

Only in 17.dtBase2/Files.noindex/eml/a: Attached File.eml
Only in 17.dtBase2/Files.noindex/html/5: Fwd- test.html
Only in 17.dtBase2/Files.noindex/html/9: Voice Note.html
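
The diff command above compares the two trees path by path. A content-based variant is sketched below as a rough illustration: it reports files in the first tree whose bytes appear nowhere in the second, regardless of where they are stored. The function names `digest_index` and `only_in_first` are hypothetical, the two paths are simply the ones from the command above, and the script only reads, never writes.

```python
import hashlib
import os


def digest_index(root):
    """Map content digest -> list of relative paths for every file under root."""
    index = {}
    for folder, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(folder, name)
            sha = hashlib.sha256()
            with open(full, "rb") as handle:
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    sha.update(chunk)
            rel = os.path.relpath(full, root)
            index.setdefault(sha.hexdigest(), []).append(rel)
    return index


def only_in_first(first_root, second_root):
    """Relative paths under first_root whose content is absent from second_root."""
    first = digest_index(first_root)
    second = digest_index(second_root)
    missing = []
    for digest, paths in first.items():
        if digest not in second:
            missing.extend(paths)
    return sorted(missing)


if __name__ == "__main__":
    # Paths taken from the diff example; substitute your own database names.
    original = "17.dtBase2/Files.noindex"
    copy = "17 copy.dtBase2/Files.noindex"
    for rel in only_in_first(original, copy):
        print("Only in", original + ":", rel)
```

Because it hashes every file in both trees, this is slower than diff on large databases, but it still finds a file that exists in both under a different name or subfolder.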

cgrunenberg · September 9, 2021, 7:13am

The duplicate recognition is based on indexed contents (of text, images, PDF etc.) but in case of audio/video files there’s none and therefore duplicates aren’t recognized.

d1rewolf · September 9, 2021, 4:23pm

@BLUEFROG I will try what you suggest re: the PDF.

@cgrunenberg good to know. That’s a very, very important detail I wasn’t aware of.

So I have a number of a/v files and I need to remove duplicates. Is it safe to delete files directly from the .dtBase2 directory or do I need to export all files, do the deletions there, and then re-import?

Thanks in advance.

BLUEFROG · September 9, 2021, 4:27pm

You can delete in the database. You just need to empty the database’s Trash when you’re done.

d1rewolf · September 9, 2021, 4:30pm

Just to be clear, I’m asking whether I can operate on the database’s contents outside of DEVONthink and still delete files safely. The way you word it makes me think you’re talking about deleting from within DT. I just want to confirm we’re talking about the same thing: do you mean I can safely cd into the Database.dtBase2/Files.noindex directory and delete from there manually?

chrillek · September 9, 2021, 5:10pm

You can, but you certainly shouldn’t. Or rather, you must not (although it is technically possible). Although it looks like an ordinary folder, DT probably relies on metadata that does not get updated when you fool around with the data directly.

d1rewolf · September 9, 2021, 5:13pm

Ok, thanks.

So what would be the recommended way to track down duplicates, which may or may not be audio/visual files, in DEVONthink? Export the entire database and then re-import? I can do that, but at 100GB+ it’ll be some effort. Any other ways?

pete31 · September 9, 2021, 5:36pm

Instead of exporting/re-importing you could make use of indexing.

In Finder:

Create a new folder

In DEVONthink:

Index the folder
Move records into indexed group

d1rewolf · September 9, 2021, 5:38pm

Can you explain this further, @pete31? I’m not following how this would allow me to use a third-party tool to deduplicate files contained within a DT3 Pro database.

pete31 · September 9, 2021, 6:04pm

Indexing can be used when there’s a need to access files from other apps. So if you index a whole folder (instead of each file separately) you can access this folder’s contents from e.g. Finder.

Simplest way to see whether it would work for you is to test it with some dummy records. In my short test I deleted a file in Finder and DEVONthink automatically updated the indexed group, i.e. the record was directly deleted (not moved to trash). Will probably work with Terminal too but I never tested that.

Note: Please also test what happens to replicated records when you move them to the indexed group.

chrillek · September 9, 2021, 7:25pm

I’d try to script it. Something like

calculate the md5 hash of every a/v file
save the hash as a user meta data field
sort the files by this field
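
The hashing and sorting steps of that outline could look roughly like the sketch below when run over a folder of exported or indexed files. It does not talk to DEVONthink, so storing the hash in a user metadata field is left out; that step would need DEVONthink’s own scripting support. The extension list, the folder name `exported_av`, and the function names are all assumptions made for illustration.

```python
import hashlib
import os
from itertools import groupby

# Assumed extension list for illustration; adjust to the formats you actually have.
AV_EXTENSIONS = {".3gp", ".mp3", ".m4a", ".wav", ".mp4", ".mov"}


def md5_of(path):
    """MD5 of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hashed_av_rows(root):
    """Return (md5, path) rows for a/v files under root, sorted by hash."""
    rows = []
    for folder, _dirs, files in os.walk(root):
        for name in files:
            if os.path.splitext(name)[1].lower() in AV_EXTENSIONS:
                full = os.path.join(folder, name)
                rows.append((md5_of(full), full))
    return sorted(rows)


def duplicate_groups(rows):
    """Collapse hash-sorted rows into lists of paths sharing a hash (two or more)."""
    groups = []
    for _digest, members in groupby(rows, key=lambda row: row[0]):
        paths = [path for _d, path in members]
        if len(paths) > 1:
            groups.append(paths)
    return groups


if __name__ == "__main__":
    # Hypothetical folder holding the exported or indexed a/v files.
    rows = hashed_av_rows("exported_av")
    for digest, path in rows:
        print(digest, path)
    for group in duplicate_groups(rows):
        print("duplicates:", group)
```

Sorting by hash puts identical files next to each other, which is the same effect as sorting by the metadata field in the outline above.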

d1rewolf · September 9, 2021, 8:03pm

Hmmm…that would’ve probably worked if this was a new folder and I was interested in working with it. As it stands, it’s an existing database that I’ve been working with for some time now.

Is there a way to convert it, or should I just try to export and then index?

Thanks,
d1rewolf

pete31 · September 9, 2021, 8:26pm

What didn’t work?

The idea is to create a new folder. Index it.

When you then move records into this indexed group they will automatically be moved out of the database package into the indexed folder.

Think of this folder as a “bridge” that allows you to make records accessible from other apps (e.g. via the “Open” dialog).

Advantage: DEVONthink doesn’t index each file’s content again.

Disadvantage: If you’re using replicated records, then depending on how you move records into the indexed group, you may lose replicants. I didn’t test this.

As I understand it you are interested in doing something with the records: You want to find out which records are duplicates.

That’s why (temporarily) moving the records into an indexed group is what I would do.

DEVONthink already indexed the records’ content. No need to let it do that again.
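
Once the records sit in an indexed folder, a read-only pass from outside DEVONthink could list which files are byte-identical. The sketch below is one possible approach under stated assumptions, not a tested DEVONthink workflow: the folder name `indexed_folder` and the function names are hypothetical. It groups files by size first, so only files that share a size are hashed, which matters at 100GB+. It prints a plan and deletes nothing; as noted earlier in the thread, a file deleted in Finder was removed from the indexed group directly rather than moved to the trash.

```python
import hashlib
import os
from collections import defaultdict


def sha256_of(path):
    """SHA-256 of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def plan_removals(root):
    """Return sorted (keep, extras) pairs for byte-identical files. Read-only."""
    by_size = defaultdict(list)
    for folder, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(folder, name)
            if os.path.isfile(full):  # skips broken symlinks
                by_size[os.path.getsize(full)].append(full)

    plan = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a byte-identical twin
        by_digest = defaultdict(list)
        for path in paths:
            by_digest[sha256_of(path)].append(path)
        for members in by_digest.values():
            if len(members) > 1:
                members.sort()
                # Keeping the first path alphabetically is an arbitrary choice.
                plan.append((members[0], members[1:]))
    return sorted(plan)


if __name__ == "__main__":
    # Hypothetical indexed folder; nothing is deleted, the plan is only printed.
    for keep, extras in plan_removals("indexed_folder"):
        print("keep:", keep)
        for extra in extras:
            print("  candidate for removal:", extra)
```

Try it on dummy records first, as suggested above, and check what happens to replicants before acting on the output.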