Deleting Duplicates w/ 3rd Party Duplicate Detector

I read that DT's duplicate detection is not byte for byte; the example given was that a file differing by only a comma would still be treated as essentially the same. I actually have a lot of versions of docs that I do need to differentiate, as well as images that might be slightly different (not sure how DT handles those). A duplicate detector could give me more precise control over differences in files and photos, but I'm not sure whether deleting them at the file level instead of via the DT interface would somehow corrupt a non-indexed database. In general, I'm nervous about deleting duplicates in DT because I've had the experience of deleting things that weren't duplicates, even after selecting stricter recognition of duplicates. Sadly, I have thousands of duplicates/triplicates and need to speed up the process. Thoughts on using a 3rd party application?

This article talks about doing this before importing, but the files are already imported:

Thanks!

That would be quite surprising, given that the stricter recognition accounts for file type and size in duplicate detection.

Why do you have so many redundant files?

Well, honestly, because I'm an idiot. :slight_smile: I experimented w/ indexed vs. non-indexed databases, and because I couldn't be sure whether I had deleted something accidentally (something I actually did a couple of times, and I couldn't figure out exactly which files), I kept them all to be safe. Add to that importing files from various old drives (again, to be sure I had everything), and now I'm left w/ a huge, regrettable mess. I just want to clean it all up, but there are thousands of duplicates and I need a faster way to be sure they are exact copies. For non-exact copies I need a quick way to make decisions and delete, or I'll be at it till 2032.

So, w/ stricter recognition, is it a byte-for-byte comparison? No possibility of error? In any case, a 3rd party app would give me more options, like deleting only from certain folders, handling multiple versions of photos, etc. What about the question of using a 3rd party app on the contents of the database file (after clicking on "show contents")? What happens when things are deleted at the file level? Does it cause problems when DT can no longer find those files?

Best,

R

So, w/ stricter recognition, is it a byte-for-byte comparison? No possibility of error? In any case, a 3rd party app would give me more options, like deleting only from certain folders, handling multiple versions of photos, etc.

If files are the same type and the same size and the contents are essentially the same, it's highly likely they're duplicates. As far as byte-for-byte goes, @cgrunenberg would have to comment on the level of precision there.

PS: Even with a third-party utility there is the possibility of error.

What about the question of using a 3rd party app on the contents of the database file (after clicking on "show contents")? What happens when things are deleted at the file level? Does it cause problems when DT can no longer find those files?

We would not advocate that anyone run a third-party application on the internals of a DEVONthink database; things should not be done behind DEVONthink's back.
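If you want an independent byte-for-byte check, run it on files that live outside the database, e.g., on a Finder-visible indexed folder or on copies you have exported first, never inside the database package itself. A rough sketch of the idea in Python, hashing files and grouping identical ones (the folder path is just a placeholder):

```python
import hashlib
from pathlib import Path
from collections import defaultdict

# Placeholder: a folder of exported or indexed files, NOT the database internals.
ROOT = Path("~/Documents/Exported from DEVONthink").expanduser()

def sha256(path, chunk=1 << 20):
    """Hash a file in chunks so large files never load fully into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

# Group files by (size, hash); files sharing both are byte-for-byte identical.
groups = defaultdict(list)
for p in ROOT.rglob("*"):
    if p.is_file():
        groups[(p.stat().st_size, sha256(p))].append(p)

for (size, digest), paths in groups.items():
    if len(paths) > 1:
        print(f"{len(paths)} identical copies ({size} bytes):")
        for p in paths:
            print("   ", p)
```

Anything that comes out identical there is a true exact copy; anything that doesn't match still needs a human decision anyway.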

DEVONthink doesn't compare raw file contents; only indexed data & metadata are used for the recognition.


But when checking the file size, is it an exact match and to what level of precision, e.g., byte count versus kilo/mega/gigabyte count?


The exact size and therefore bytes.


There you go, @romebot :slight_smile:


I do a lot of photography and use a lossless editing workflow, and for what it's worth, Gemini (part of the SetApp subscription) does a pretty good job of finding duplicate and even visually similar images. The top shelf for this is probably Kaleidoscope, and I would have bought that years ago if I weren't so delighted with Delta Walker for all things code or plain text.

I used to swear by Photo Sweeper, and it's great, but Gemini usually does a good enough job and does it much faster. Those are just my opinions on detecting changes in images.
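If you'd rather script a rough pass yourself, a perceptual-hash comparison is one way to flag visually similar images (no claim that's what those apps do internally). A quick sketch using the Pillow and imagehash Python libraries; the folder path and distance threshold are placeholders, and as above, point it at exported or indexed copies, not inside the database:

```python
# pip install pillow imagehash
from pathlib import Path
from itertools import combinations
from PIL import Image
import imagehash

PHOTOS = Path("~/Pictures/to-review").expanduser()  # placeholder folder
THRESHOLD = 5  # max Hamming distance between hashes to call two images "similar"

# Compute a perceptual hash for each image file.
hashes = {}
for p in PHOTOS.rglob("*"):
    if p.suffix.lower() in {".jpg", ".jpeg", ".png", ".tif", ".tiff"}:
        hashes[p] = imagehash.phash(Image.open(p))

# Pairwise comparison is fine for a few thousand photos; it is O(n^2), so
# split large libraries into smaller folders first.
for a, b in combinations(hashes, 2):
    distance = hashes[a] - hashes[b]  # Hamming distance between perceptual hashes
    if distance <= THRESHOLD:
        print(f"visually similar (distance {distance}): {a} <-> {b}")
```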

Never run software that deletes things from DEVONthink databases/bundles or you’ll be in trouble.

If storage is the concern, there are options for filesystem-level deduplication that are totally transparent to software like DEVONthink, but you'll know you need that when you get there (if you get there; I just use compression to speed up I/O).
