False Duplicates

This issue was posted on a different April '09 thread but never resolved there. There is a hint that the issue might have been resolved by email between the member and DT.

ISSUE: DTPro 2.0pb5 lists as duplicates files that are widely disparate. For example, doing a search for the PDF titled “Alexander, E. John Col The Venus” (which is listed as a duplicate) brings up PDF “Alexander Intro” as a duplicate (a completely unrelated text and author) as well as PDF “Alexander Lynching” (also unrelated and listed as a duplicate). This occurs with virtually all duplicates.

Convert the files to plain text - is the text (almost) identical?

I have not yet tried what you suggested (converting the files to plain text), but the text is in no way the same. In addition, the files of which I write are PDFs and Word docs.

Just because the search results identified (in blue bold type) the 3 files you named as duplicates does not logically imply that they are duplicates of each other!

My guess is that your assumption that DT has incorrectly identified those 3 files as duplicates of each other is mistaken. I suspect that you will find that each of those files actually has one or more copies in your database.

Yikes, I’m sorry, Bill. Has my duplicate question upset you? You wrote with exclamation:

As a DT novice I have no problem admitting that I can sometimes be illogical. I have a PhD yet it doesn’t mean any logic was entailed in getting it. :laughing: These forums can be helpful in that they sometimes present the obvious when one is in a doltish state. So again, really, please accept my apology for not being able to figure this out on my own, or realizing any patterns in this thing. Yet and still, my issue exists.

To close things up, I should probably have noted earlier that I provided 3 files in my question solely as examples. In fact, the duplicate issue exists with approximately 72 files (which are collected in my Duplicates Smart Folder). I searched my database for the duplicates of these files and they do not exist. I am not saying that you’re in any way wrong, and I can live with these duplicates as is, I just thought DT might want to know that the issue with duplicates within my database exists. And, I readily admit that this issue could solely be due to my own technological ignorance.

Perhaps someone else will also have the issue later on and can refer back to our discussion or perhaps it’s just an issue with which I am stuck if I choose to use DTPro. Thanks so much for your efforts.

I’m having much the same issue. I added a large number of jpeg files to the database today. Five of them show up in blue font, and if I go to Show Info it tells me that there are 0 replicants and 4 duplicates for each picture. If I search for the name of any jpeg, it finds only the original. If I get out of DTPro altogether and use Finder for the search, it finds only one copy of each picture.

The pictures that have the blue font are not duplicates of one another.
They did come from the same database, but they are completely different photographs.

For some reason DTPro thinks these 5 pictures are duplicates; I have no earthly clue why.

Could it possibly be because they all have the same file name? I know from transferring images from cameras to my Mac sometimes I wound up with several images with the same name due to the naming convention of the camera.

I don’t know if DTPO can distinguish differences among several images with the same filename (such as exif or other metadata characteristics).

A bit off topic here, but this suggestion might be of interest: I’ve used several different applications in place of Apple’s Image Capture app and on all of them I’ve been able to specify a unique filename when images are downloaded. The filename takes the form Tod2009_0620AA.jpg and each transfer is put into a folder named for the current date. This technique has prevented the issue of different images named the same.

No, the jpegs all have different names.


These are not from a camera (well, ultimately they were, but…); they were downloaded from the Allen Brain Atlas. I don’t know if there is some hidden header information that DTPro is parsing to determine whether jpegs are the same. I brought down about 30 images, but only those 5 show up as duplicates.

The problem, as I see it, is that DTPro is getting fooled by whatever convention the Allen Atlas is using.

Ah, then perhaps this problem isn’t as pervasive as originally thought? That is, other groups of images don’t turn up with duplicates, do they?

What happens if you use a batch renaming utility (like, for instance, A Better Finder Rename) and rename your files to something quite different from the Allen Brain Atlas convention, then import them into DTPO?

As you can tell, I’m just fishing around here hoping to hook something… :stuck_out_tongue:

Please send the files to cgrunenberg - at - devon-technologies.com and I’ll check this over here. Thank you!

So the answer was…

It seems DT looks at the preview of the image and determines identity based on what it sees in that thumbnail. These particular images were fluorescence microscopy of genes expressed at a very low level in the brain. As such, the images were 99.9% black and the previews of these 4.5 MB jpegs were 100% black, so they all looked the same to DTPro.
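To illustrate how that can happen, here is a minimal Python sketch. It is not DT's actual algorithm, which isn't documented in this thread. The block-averaging `thumbnail` function, the `dark_image` helper, the 64x64 image size, and the pixel values are all assumptions made up for the illustration. It only shows that two images with different full-resolution pixels can shrink to the same all-black preview when the signal is faint and sparse.

```python
def thumbnail(pixels, size=4):
    """Downsample a square grayscale image (rows of 0-255 ints) to size x size
    by averaging each block and rounding to a whole intensity level.
    Hypothetical stand-in for a preview generator, not DT's real code."""
    n = len(pixels)
    block = n // size
    thumb = []
    for by in range(size):
        row = []
        for bx in range(size):
            total = 0
            for y in range(by * block, (by + 1) * block):
                for x in range(bx * block, (bx + 1) * block):
                    total += pixels[y][x]
            row.append(round(total / (block * block)))
        thumb.append(tuple(row))
    return tuple(thumb)


def dark_image(n, bright_spots):
    """Build an n x n black image with a few dim pixels at the given (x, y) spots."""
    img = [[0] * n for _ in range(n)]
    for x, y in bright_spots:
        img[y][x] = 40  # faint signal on a black background (made-up value)
    return img


# Two different "photographs": the faint spots are in different places.
image_a = dark_image(64, [(3, 5), (40, 12)])
image_b = dark_image(64, [(20, 50), (60, 61)])

print("full images identical:", image_a == image_b)
print("thumbnails identical: ", thumbnail(image_a) == thumbnail(image_b))
```

In this sketch each faint pixel is averaged with 255 black neighbours, so every thumbnail cell rounds down to 0 and both previews end up identical even though the originals differ.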

Oh well, not much anyone can do about that.

Here’s a case where DT consistently identifies documents as “duplicate” when a human would not (sorry, no Turing test prize for DT this year). I regularly download PDFs of 19th century histories from Google Docs - usually 300 to 600 page documents. Google Docs adds a one-page disclaimer to the front of each PDF. The rest of each document is scanned images, not OCR’d. DT always reports that these books are duplicates of one another, although only 0.2% of the content is common. Easy to remedy (remove the Google page), but definitely a false duplicate.
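One plausible reading of this, sketched below in Python, is that a comparison based only on extractable text would see nothing but the shared disclaimer page, because the scanned pages contribute no text at all. This is an assumption for illustration, not a description of how DT actually works. The `extractable_text` and `word_overlap` helpers, the page counts, and the placeholder disclaimer string are all hypothetical.

```python
def extractable_text(pages):
    """Join the text of every page that has any; image-only pages are None."""
    return " ".join(page for page in pages if page is not None)


def word_overlap(text_a, text_b):
    """Jaccard similarity of the two texts' word sets (1.0 means same words)."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)


# Placeholder wording, not the real disclaimer text.
DISCLAIMER = "placeholder disclaimer page added to the front of every download"

# Two different books: one text page each, followed by un-OCR'd scans.
book_a = [DISCLAIMER] + [None] * 400
book_b = [DISCLAIMER] + [None] * 550

similarity = word_overlap(extractable_text(book_a), extractable_text(book_b))
print("text similarity:", similarity)
```

Under these assumptions the two books score as identical on text alone, which would also explain why removing the front page makes the false duplicate go away.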

Could you send two examples to cgrunenberg - at - devon-technologies.com? Thank you!

This is an old thread, but I have a bizarre example of a False Duplicate with 2.5.1. One entry, item A say, is in bold and Get Info indicates that it has 1 duplicate. I use Get Info to go to that supposedly duplicate entry, item B, and find that item B is not in bold and its info says 0 duplicates! They are both PDFs, but one is 141 KB and the other is 720 KB. One is type pdf and the other is type pdf+text. Their names are different. One is imported and the other is indexed.

I think the answer is the same as before … if you can, send the document to DEVONtech so they can evaluate it.

Thanks Korm - I’ve done that now.