I recently downloaded and imported some documents for my bank account… some of them were loaded twice by mistake.
But DT does not identify them as duplicates. The files are official documents and should not contain any bitwise changes between the two downloads. The Finder shows exactly the same size in bytes. I have tried turning the setting for more precise duplicate detection on and off, but without any effect.
Any ideas?
Is the word count identical too?
Yes, they are identical (counted with the script).
Using the concordance, they have identical word counts too, but a few of the words have slightly different weights.
Maybe the order of the words is not identical? In this case both the size & word count could be identical. Could you send both documents to cgrunenberg - at - devon-technologies.com? Then I could try to reproduce this.
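To illustrate the point about word order, here is a minimal Python sketch. It only shows that two texts can share the same size and the same word counts while still differing in sequence; it is not how DT compares documents, and the function name and sample sentences are made up for the example.

```python
from collections import Counter


def same_words_different_order(text_a: str, text_b: str) -> bool:
    """Return True if both texts use the same words equally often
    but arrange them in a different sequence."""
    words_a = text_a.split()
    words_b = text_b.split()
    return Counter(words_a) == Counter(words_b) and words_a != words_b


# Hypothetical sample texts: two words are swapped in the second one.
first = "balance carried forward from previous statement"
second = "balance carried forward previous from statement"

print(len(first) == len(second))                  # same size in characters
print(len(first.split()) == len(second.split()))  # same word count
print(same_words_different_order(first, second))  # order differs
```

If the last check is true for the plain text of both documents, identical size and word count alone would not prove that the contents match.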
I would like to, but it’s the statement of my accounts. That’s why I cannot imagine that there are changes within the documents. Anything else I could check or try?
You could try to convert them to plain text. Are they marked as duplicates?
I’ve compared both the plain text version and the original with Total Commander’s file compare. It says they are identical, but DevonThink does not mark them as duplicates in either case.
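For anyone who wants a second opinion on bitwise identity without a file-compare tool, a checksum comparison is one option. The following is a rough Python sketch using only the standard library; the function names and the file names in the usage comment are placeholders, and the result says nothing about which criteria DT itself applies.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def bitwise_identical(first: Path, second: Path) -> bool:
    """Treat two files as identical if their sizes and SHA-256 digests match."""
    if first.stat().st_size != second.stat().st_size:
        return False  # different sizes can never be bitwise identical
    return sha256_of(first) == sha256_of(second)


# Usage with placeholder file names:
# print(bitwise_identical(Path("statement_1.pdf"), Path("statement_2.pdf")))
```

If this reports a mismatch although the sizes are equal, the two downloads differ somewhere in their content.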
Does reimporting them into a new database make a difference? Any files that you could share would be really appreciated.
If you could ZIP the files, start a support ticket and attach them, we may be able to determine a cause.
And as noted in this blog post, all data is private, doesn’t go outside our company, and is removed when testing is done…
I tried the following:
- Moving the 2 documents between different databases within DT > no effect
- Reimporting the 2 documents to the same database while keeping the old ones in the database > now DT recognizes this as 2 duplicates but it should actually show 3 duplicates
- Importing the 2 documents to a different database > now these 2 are marked as duplicates.
Sorry I cannot provide you the documents but I will do some further testing later with different documents.
Just to give you an update:
I don’t remember if it was exactly this document, but I found a similar case.
There is a tiny little change in the footer of the document because of a change in the management of the bank.
Apparently, the account statement is generated in real time with the current letterhead and is not stored historically fixed in the bank’s mailbox.
I wonder if this is even permissible and correct on the part of the bank. It’s good to know that DT seems to be functioning correctly.
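A small change like this can be hard to spot by eye. One way to locate it is to export both documents to plain text and diff them. Below is a minimal Python sketch with `difflib` from the standard library; the function name and the sample statement text (including the names in the footer) are invented for illustration.

```python
import difflib


def changed_lines(old_text: str, new_text: str) -> list[str]:
    """Return only the lines that were removed ("- ") or added ("+ ")."""
    diff = difflib.ndiff(old_text.splitlines(), new_text.splitlines())
    return [line for line in diff if line.startswith(("- ", "+ "))]


# Hypothetical plain-text exports of two downloads of the same statement.
first_download = (
    "Account statement\n"
    "Balance: 100.00\n"
    "Management Board: A. Example, B. Example"
)
second_download = (
    "Account statement\n"
    "Balance: 100.00\n"
    "Management Board: A. Example, C. Example"
)

for line in changed_lines(first_download, second_download):
    print(line)
```

An empty result would mean the two text exports match line by line.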
Thank you for the update!
Hi, I’ve got the same problem. In the previous version of DT, duplicates were marked. The current version, DT 3.0.4, does not mark two scans of the same page. The word count is identical. I scanned approximately 100 documents from the last 3 years and I know there are duplicates in the database. As usual I want DT to find them, but this doesn’t work. BTW: “Mark duplicates…” in the preferences is checked.
What can I do?
Update: Support replied that I should enable “Stricter recognition of duplicates” in the preferences. I’ll try that.
Note the stricter recognition accounts for file type and file size. They must be the same to be seen as duplicates.
I have the same issue with some duplicate files not being recognized as duplicates. Thought I sent an inquiry this AM but don’t see it in the support listing.
Was trying to clean up some files and dragged a group to the desktop. About an hour later dragged that folder back into the same DB creating 2 identical Groups in the DB. But out of 1629 files only 664 are recognized as duplicates in DT3.
The Group contains various file types and sub groups. I can’t see a pattern for what is recognized as a duplicate vs not.
If I drag a subgroup of that duplicate Group to the Desktop and then re-add it to the database either by importing or indexing, the newly created sub group shows that all files are duplicates with the subgroup I dragged out, but not with the files in the original sub group.
If I drag a subgroup of that original Group to the Desktop and then re-add it to the database either by importing or indexing, the same result occurs where the original subgroup and its files are NOT recognized as duplicates but both newly added subgroups (which are duplicates - actually triplicates) are recognized as duplicates of each other but for only some of the files in the original subgroup.
I have tried the “Stricter recognition of duplicates” setting but it does not alter the behavior. Looking at a sample of files that should be recognized as duplicates vs those that are shows the same data in the information pane.
Also while working with a file that should show as a duplicate in the original subgroup – duplicating it via command-d in DT yields a duplicate file and both the new and original file are recognized as duplicates (good as expected). BUT dragging the same file to the desktop and re-importing yields the same issue where the original file is NOT recognized as a duplicate.
Appreciate any thoughts or suggestions for getting files to accurately be recognized as duplicates?
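As an independent cross-check outside DT, the exported folders can be scanned for byte-identical files. This is only a sketch in Python with the standard library: the function names and the folder path in the usage comment are placeholders, and byte identity is not necessarily the criterion DT uses, so the numbers may differ from what DT shows.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_identical_files(root: Path) -> list[list[Path]]:
    """Group all files below root by content; keep groups with 2+ members."""
    by_digest: dict[str, list[Path]] = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            by_digest[file_digest(path)].append(path)
    return [paths for paths in by_digest.values() if len(paths) > 1]


# Usage with a placeholder folder that holds both exported Groups:
# for group in find_identical_files(Path("~/Desktop/exported").expanduser()):
#     print(len(group), [p.name for p in group])
```

Comparing the number of groups found this way with the number of files DT marks as duplicates can at least show whether the unrecognized files really are identical on disk.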
What types of files are you referring to?
PDFs, Office files (older and newer formats), jpg, webarchives, weblog, rtf, png, gif … mp4, mp3, m4v …
I would make sure your local backups are current and do a File > Verify & Repair on the database. I would also consider doing a File > Rebuild Database to see if this resolves the issue.
OK, that cleared up the vast majority of the issue, with the exception of audio and video files.