I have a bunch of documents scanned to PDFs, and there are many duplicates scattered across my internal and external drives, cloud storage, etc. I imported one set of such documents into my library. Later, I came across another set of similar files and imported it as well, hoping to remove duplicates and sort out the rest. But when I checked my Inbox group, I spotted dozens of files that are duplicates of ones already in the library, yet they were not marked as such.
After hours of experimenting with similar files, different settings and external comparisons, I can highlight the following:
- The files I am referring to are definitely duplicates of files present in the library – bitwise, text-wise, etc.
- When I convert the existing and the newly added files into text within DT4, the resulting text differs. The text recognition isn't very accurate in the first place, and the two versions differ by a few characters here and there.
- I started changing the import encoding, OCR encoding, etc., but it didn't affect the text conversion at all.
- Then I suddenly realised that the existing files had been added using one computer, while the newly added ones were imported using the second computer. I removed the newly added files from DT, transferred them to the first computer and imported them into the library again. Soon after indexing, all added files were finally indicated as duplicates. I repeated this a few times and yes, that is the core of the issue.
- I then thought the difference between the two Macs might be that one has the ABBYY OCR Add-on installed and the other doesn't. I installed the add-on on the second computer and repeated the experiment, but the problem persisted. The settings should be identical, or at least very similar, between the two systems.
It appears that the environment or certain macOS settings may affect the indexing of PDFs upon import, resulting in differences in the text layer and preventing them from being marked as duplicates.
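For reference, bitwise identity can be confirmed outside DT with a checksum comparison. A minimal Python sketch (the demo writes two throwaway copies of the same bytes; with real scans, point `sha256_of` at the two PDF paths instead):

```python
import hashlib
import os
import tempfile

def sha256_of(path: str) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()

# Demo with two identical throwaway files standing in for the two scans
with tempfile.TemporaryDirectory() as d:
    p1, p2 = os.path.join(d, "a.pdf"), os.path.join(d, "b.pdf")
    data = b"%PDF-1.4 demo bytes"
    for p in (p1, p2):
        with open(p, "wb") as f:
            f.write(data)
    print("bitwise identical" if sha256_of(p1) == sha256_of(p2) else "files differ")
```

If the digests match, the files are byte-for-byte the same, so any difference in extracted text must come from the text recognition step, not the files themselves.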
Some info:
- My library was created under DT4b2
- I am using a two-device setup synchronised via an encrypted Local Sync Store
- The files are multilingual
- I have "Stricter recognition of duplicates" enabled
- Both Macs are running macOS Sequoia 15.5
Unfortunately, I can't share the files with you, but I hope the information above helps clarify the issue.
What version of DEVONthink were you running when you updated the database?
I'm not sure what you mean by "updated", but the addition of the new files took place within DT4b3.
Are both Macs using the same kind of CPU, ie Apple or Intel?
How exactly do you add these files? Is OCR performed on one of the machines or not at all?
One device uses the M2 chip, the other uses the M2 Pro.
I simply dragged and dropped the files into the main documents window. I didn't perform OCR manually, but upon addition the files were indexed and a list of words appeared in the properties, which suggests some OCR was performed in the background. The settings in the OCR tab are identical.
Are you able to reproduce this using a document that you could share?
Fair request. I'll try to spot such a document when adding more files to my library and will get back to you.
I managed to replicate the problem:
- Asked ChatGPT to generate a dummy multilingual file as a .docx
- Printed it, then scanned it with Image Capture (without OCR ticked) and merged both pages with Preview
- Dragged and dropped the file into DT4b3 using an M2 computer, and synchronised the database
- Transferred the file to the second computer, synchronised the database and added the file to the library
- These two files are indexed and appear in the Concordance tab with word lists, but are not marked as duplicates.
- If I perform Convert > to Plain Text on one device, the resulting .txt files have minor differences.
- If I add the PDF twice using the same Mac, they are marked as duplicates
Here is the file:
Test-Scan_001.pdf (394.8 KB)
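The minor differences between the two converted .txt files can be surfaced with a plain text diff. A sketch using `difflib`, with hypothetical strings standing in for the two conversions (a single mis-recognised character is typical of OCR drift):

```python
import difflib

# Hypothetical stand-ins for the text extracted on each Mac;
# '1' vs 'l' and '0' vs 'O' are classic OCR confusions.
text_m2     = "Invoice No. 10345\nTotal: 120,00 EUR\n"
text_m2_pro = "Invoice No. lO345\nTotal: 120,00 EUR\n"

diff = list(difflib.unified_diff(
    text_m2.splitlines(), text_m2_pro.splitlines(),
    fromfile="converted_on_M2.txt", tofile="converted_on_M2_Pro.txt",
    lineterm=""))
print("\n".join(diff))
```

Running this on the actual exported .txt files (read them with `open(path).read()`) would pinpoint exactly which characters the two machines recognised differently.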
I just tried this on two machines (a MacBook Air and a Mac Studio) using the same macOS & DEVONthink versions, but the files are marked as duplicates even with strict recognition enabled. Do you use any smart rules that might perform OCR on one machine?
I don't have (and have never had) any smart rules except the defaults, and I don't have any old versions of DT. Could something outside of DT be affecting OCR within DT? I'm not sure what that could be, but if we're talking about major app differences, the second computer (M2 Pro) has Xcode and VS Code installed.
Are you still using the second beta on both computers, or already the third?
I am using the third beta on both computers. Both have the same add-ons installed (including ABBYY).
The set of installed fonts may be different.
This shouldn't affect Apple's Vision framework. Actually, nothing should affect it if the macOS versions are identical, unless it depends internally on the CPU power; but over here (M1 vs. M1 Ultra) this didn't make a difference.
That was probably just Apple's Vision framework OCR'ing the text. When I OCR (ABBYY) your PDF, I get a new one that is slightly bigger than the original (420 vs. 403 KB). Also, its type is now "PDF+Text", whereas the original document is shown as "PDF document" (with DT4b3).
Interestingly, Vision does a fairly good job with the different writing systems (Cyrillic seems OK; I didn't bother to check Chinese, though). ABBYY OCR just produced garbage for the non-Latin scripts; that might be due to my OCR settings.
I didn't know DT relies on Apple's Vision when indexing items. Does it mean that, from time to time, when Apple Vision is updated, adding new files might not trigger a duplication alert, as the textual representation of the same files could differ from those already indexed under a previous Apple Vision version?
I've just tried to OCR the PDF with Preview before importing it into DT and repeated the import of the now PDF+Text file on both computers – they are not marked as duplicates. This suggests DT ignores the document's OCR layer and performs its own indexing upon import.
If you turn that on in DT4's preferences, it does. DT3 does not.
No and no. DT4 uses either the text layer or Vision for indexing. What would it even index without a text layer or the data generated by Vision?
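Assuming the duplicate check compares the extracted text (among other criteria), even one mis-recognised character yields a different text and hence no match. A minimal sketch – `text_fingerprint` is a hypothetical stand-in for whatever comparison DT actually performs, not its real algorithm:

```python
import hashlib

def text_fingerprint(text: str) -> str:
    # Hypothetical stand-in for a text-based duplicate check:
    # normalise whitespace and case, then hash the result.
    normalised = " ".join(text.split()).lower()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

# Two text layers for the same scan, differing by one OCR confusion
layer_a = "Invoice No. 10345 Total 120,00 EUR"
layer_b = "Invoice No. lO345 Total 120,00 EUR"

print(text_fingerprint(layer_a) == text_fingerprint(layer_b))  # prints False
```

This is why two bitwise-identical PDFs can end up unmatched once each machine generates its own, slightly different, text layer.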
DEVONthink isn't actively scanning your documents to update them with Vision. And no, Vision doesn't ignore a valid text layer when importing documents. Pages with a bad text layer may be processed with Vision on import.