Wrong detection of duplicates! Serious problem

I collect some XLSX files from the web, for example with Corona statistics.

I now noticed that several of them were shown as duplicates, which I found irritating.
I then checked this in detail:

One file is shown as “0 replicants” and “66 duplicates” …

I immediately checked this on the command line, and of course those are clearly NOT duplicates!

Their checksums:

    (base) tja@mini:/Volumes/DT_CORONA/$ cksum Cases_2023-04-27.xlsx
    2012546908 74844 Cases_2023-04-27.xlsx

    (base) tja@mini:/Volumes/DT_CORONA/$ cksum Cases_2023-04-20.xlsx
    3422689381 74800 Cases_2023-04-20.xlsx

Not even the file size is the same!

So, my question is: how does DEVONthink determine whether a file is a duplicate of another?
I assumed that a checksum of some sort would be used, but that does not seem to be the case.

This in turn means that people may lose data when they trust DT in this regard and delete apparent duplicates.

I think this is a serious problem!

And what’s your “duplicate” setting in preferences?

Ah.

Didn’t know or remember that this could be configured :open_mouth:

I searched for this now:

I can only find this: General / General / Stricter recognition of duplicates

This is not checked and I suppose that this is the default setting.
I checked it now, and to my relief the duplicates have “vanished” :slight_smile:
Many thanks!

My point may still be valid, as others may also misinterpret the default “duplicates”!
Would not “total binary equality” be the best default for this?
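
To make concrete what I mean by that: byte-for-byte equality is trivial to check, for example on the command line (a sketch using the two files from above):

    # cmp -s is silent and exits with 0 only if the two files are byte-identical
    cmp -s Cases_2023-04-20.xlsx Cases_2023-04-27.xlsx && echo "duplicates" || echo "different"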

I doubt that. Imagine an HTML document with a single changed element attribute. Or MD/PDF documents differing only in punctuation. Etc.
You might want to search the forum for threads on this topic. That would also have revealed the preference’s setting, btw.
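
To illustrate: a single changed attribute already yields a different checksum, even though a reader would call the two files “the same” (a made-up example, not your data):

    # Two HTML snippets whose visible text is identical; only an attribute differs
    printf '<p id="a">Hello</p>' > a.html
    printf '<p id="b">Hello</p>' > b.html
    cksum a.html b.html    # two different checksums for "the same" content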

I understand some reluctance to change something without a good reason, but to be honest for me this feels like a good reason!

The word “duplicate” clearly means a duplicate and not just something similar.

Also, while a text file that differs by a few characters MAY count as a valid “lazy duplicate” for some, this does not seem to be what DT is doing here!

My 66 duplicates were Corona statistics from different months!
They TOTALLY differ, of course.

The only similarity between them may be the number and names of tabs and the number of rows and columns, I suppose.

And those are clearly no “duplicates” anymore, in any sense of the word.

I was curious to understand this and had a look at the Excel sheets:

I now better understand why DT got the idea that they were “duplicates” …

The data was expanded each day, so the table was growing. Still, between the first and the last of the 66 “duplicates”, 66 lines were added, and additionally the last few lines changed in every file because of corrected data.

Still, this is clearly not a duplicate in any useful sense of the word.
“Similars” would be better for this.

Do note this has been the behavior and nomenclature for a very long time. I’m talking well over a decade, as it was the behavior in 2.x as well.
And we have already implemented the stricter recognition of duplicates option in the preferences.

Yes, I understand.
And personally, I am happy now.

But I am trying to make the point that customers MAY lose data if they just believe the word “duplicate” as it is defined by default in the settings.

For me, it seems safer either not to call those “duplicates” or to make the “strict” version the default - just because it prevents potential problems!

Doing so does not create many problems for others … and not changing it just because “we always did it this way” is a rather weak argument, no?

Just imagine ONE customer losing important data because of this - and before they notice it, the backups may already be rotated … and useless.

Isn’t this a valid argument? :wink:

Development would have to assess this change. I don’t dictate this kind of thing.

Would be great :sweat_smile::hugs:

I came across this preference setting. First of all, great that there is one! Is there a description of the difference, i.e. what “stricter” really means?

Thanks!

The help button of the preferences pane opens the related help page which includes a description of this setting.

Thanks for the fast reply.

“Check to have DEVONthink use document contents, file type, file size, and the content hash of the document, when detecting duplicate files.” - does it also look at the file date?
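
For my own understanding, a rough command-line analogue of that description might look like the following; this is only my sketch of “size plus content hash”, certainly not DEVONthink’s actual implementation (file names are placeholders):

    f1=file1.xlsx; f2=file2.xlsx
    # Same size (stat -f%z is the macOS/BSD syntax) and the same content hash?
    if [ "$(stat -f%z "$f1")" = "$(stat -f%z "$f2")" ] &&
       [ "$(shasum -a 256 "$f1" | awk '{print $1}')" = "$(shasum -a 256 "$f2" | awk '{print $1}')" ]
    then
        echo "possible duplicates"
    else
        echo "different"
    fi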

Thank you!

Replying to my own question, here’s the problem I’m trying to solve:

I’m importing emails. Some come from Gmail. Gmail has the memory of an elephant and really doesn’t like getting rid of mails, even if you pull them out of its IMAP server. They’ll tend to stay in an “All” category, which inconveniently shows up as an “Archive” folder for an IMAP client.

This leads to me tending to double-import mails from Gmail.

I’ve added this little scriptlet to my import process:

	-- Deduplicate the Global Inbox: trash every item that still has duplicates.
	-- Assumes an enclosing "show progress indicator" earlier in the import script.
	tell application id "DNtp"
		step progress indicator "Deduplicating..."
		-- Get the Global Inbox
		set globalInbox to incoming group
		
		-- Get all items in the Global Inbox
		set itemList to children of globalInbox
		
		-- Move each non-group item that has at least one duplicate to its database's trash
		repeat with theItem in itemList
			if (type of theItem is not group) and (type of theItem is not smart group) and (number of duplicates of theItem > 0) then
				set this_database to database of theItem
				move record theItem to trash group of this_database
			end if
		end repeat
	end tell

and while it tends to over-detect mails on the relaxed setting, it does not detect the clear duplicate here on the stricter setting:

DEVONthink seems to be so close to solving this problem. Google isn’t interested in people deleting their mails, apparently (all sorts of conspiracy theories incoming)… And DTP shouldn’t really be the one solving it. I just wonder whether one might think about having tighter control over those “stricter” settings. In my use case, for example, I’d want the same subject and the same timestamp to be enough for it to consider a mail a duplicate. But maybe, as it is also using the hash, and there might be slight differences depending on the mail headers and the via dolorosa of each of those instances, it’s not going to be considered a duplicate.
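
For instance, a pre-import pass keyed on exactly those two headers might look like this (my own sketch; it assumes unfolded, single-line Subject and Date headers):

    # List .eml files whose Subject and Date headers match an earlier file
    for f in *.eml; do
        key=$(awk 'tolower($0) ~ /^subject:/ {s=$0} tolower($0) ~ /^date:/ {d=$0} /^\r?$/ {exit} END {printf "%s|%s", s, d}' "$f")
        printf '%s\t%s\n' "$key" "$f"
    done | sort | awk -F'\t' 'seen[$1]++ { print "possible duplicate:", $2 }'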

Thanks!

M

The file creation/modification dates don’t matter but dates inside a file (e.g. in an email header) matter. However, email importing (via the Mail plugin, via View > Sidebar > Import or via drag & drop) skips emails with the same message ID too, therefore importing the same emails shouldn’t cause any duplicates.
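
For reference, the message ID can be inspected directly in exported .eml files, e.g.:

    # Print the Message-ID header of two exported mails (assumes a single-line header)
    grep -i -m 1 '^message-id:' mail1.eml
    grep -i -m 1 '^message-id:' mail2.eml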

Interesting. Here are those two mails; the only difference is the Content-ID:

In that case it’s indeed not a duplicate.

OK thanks. The mail is a duplicate, but that line is different, probably due to that handling by Google.

I’ll try to filter those out before handing them over to DTP; I will observe a bit to see if there’s more than just this difference.
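
A first cut at that filter, assuming the Content-ID header really is the only difference and fits on one line:

    # Compare two mails with their Content-ID headers stripped out
    diff <(grep -iv '^content-id:' mail1.eml) <(grep -iv '^content-id:' mail2.eml)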

Thanks!