Is it possible to compare records using DT’s AppleScript “compare” command with data from a custom metadata field? I have searched far and wide for an existing script in the forums and on GitHub, but I can’t seem to find one that uses the compare command with metadata. I’ve taken a number of stabs at a working syntax for the comparison, content, and record parameters, but I only wind up with either no results or syntax errors.
I have a database in which I archive email, and DT’s duplicate detection is not accurate enough for identifying what I do and do not want to discard. It can miss exact duplicates, and it may flag two or more emails that are extremely similar but should not be considered exact duplicates. For this reason, I have a script that adds a SHA1 checksum to a custom “SHA1” metadata field for each selected .eml file, and I then use that to visually compare the results from DT’s duplicates search.
I’d like to identify the cases where the same checksum appears in two or more records, much like a search where the matching count > 1 for each checksum. I can do the brute-force search and match in AppleScript for each record, but I was hoping the compare command could return results more quickly than writing a search routine from scratch.
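For reference, the grouping described above (flag every checksum that appears on more than one record) can be sketched outside DEVONthink in plain Python; the folder path and function names here are just illustrative placeholders, not part of any DT API:

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def sha1_of(path):
    """Return the SHA-1 hex digest of a file's raw bytes."""
    return hashlib.sha1(path.read_bytes()).hexdigest()

def duplicate_groups(folder):
    """Group .eml files by checksum; keep only checksums with count > 1."""
    by_hash = defaultdict(list)
    for eml in Path(folder).glob("*.eml"):
        by_hash[sha1_of(eml)].append(eml.name)
    return {h: names for h, names in by_hash.items() if len(names) > 1}
```

Each returned group lists files whose bytes are exactly identical, which is the strict notion of “duplicate” being asked for here; the same loop-and-dictionary approach would translate to an AppleScript repeat loop over the records’ custom metadata values.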
The compare command compares record contents, not custom metadata.
I have a database in which I archive email, and DT’s duplicates are not accurate enough for identifying what I do and do not want to be discarded. It can miss exact duplicates, and it may identify two or more emails that are extremely similar but should not be considered exact duplicates.
Have you enabled Preferences > General > General > Stricter recognition of duplicates?
Yes, I have that preference turned on.
I think what causes the most false matches for me is that I have emails with attachments: while the bodies of the emails are nearly identical, as are the names of the attachments, the attachments themselves contain unique data such as transaction records. For example, I have some old Kagi emails, and a single entry in the search results says there are 22 duplicates, but they are not duplicates. When I hash each one using SHA1, the hash of each .eml file is indeed unique.
The SHA1 comparison has been working well for my manual comparisons, but I have some old email archives pulled together now and I need to carefully and accurately remove a fair number of true duplicates.
Perhaps development could implement using the content hash (already built into DEVONthink) for duplicate detection.
And the size of the attachments (or of the emails exported to the Finder) is indeed identical?
Yes, the attachment to each .eml file is a .csv file, and each .csv file is identical. Each of the “duplicate” .eml files is a daily report that has a .csv attachment with column headers but no data rows, representing a day without transactions. So the body of each .eml file is also the same, but the mail headers are all different because the date of each email is different and the server paths and date/times are all different. Hence, I want to keep each and every .eml file because they represent the chronological history of Kagi transactions.
I think the fuzzy duplicate search of DT is great for many circumstances and usages, but it would also be great to have the ability to match on exact duplicates, as I can do by visually comparing hashes. In my case, the files are single .eml files and not packages, so a single hash is suitable for the DT item. I haven’t considered how to handle packages, but I recognize that they may need to be treated differently.
The next release will use the document’s hash too if the stricter recognition is enabled.