I have a bunch of documents scanned to PDFs, and there are many duplicates scattered across my internal and external drives, cloud storage, etc. I imported one set of such documents into my library. Later, I came across another set of similar files and imported it as well, hoping to remove duplicates and sort out the rest. But when I checked my Inbox group, I spotted dozens of files that are duplicates of ones already in the library, yet they were not marked as such.
After hours of experimenting with similar files, different settings and external comparisons, I can highlight the following:
- The files I am referring to are definitely duplicates of files present in the library – bitwise, text-wise, etc.
- When I convert the existing and the newly added files into text within DT4, the resulting text differs. The text recognition isn't very accurate in the first place, and the two versions differ by a few characters here and there.
- I started changing the import encoding, OCR encoding, etc., but it didn't affect the text conversion at all.
- Then I suddenly realised that the existing files had been added using one computer, while the newly added ones were imported using the second computer. I removed the newly added files from DT, transferred them to the first computer and imported them into the library again. Soon after indexing, all added files were finally indicated as duplicates. I repeated this a few times and yes, that is the core of the issue.
- I then thought the difference between the two Macs might be that one has the ABBYY OCR Add-on installed and the other doesn't. I installed the add-on on the second computer and repeated the experiment, but the problem persisted. The settings should be identical, or at least very similar, between the two systems.
It appears that the environment or certain macOS settings may affect the indexing of PDFs upon import, resulting in differences in the text layer and preventing them from being marked as duplicates.
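For reference, bitwise identity can be confirmed outside DT with a checksum comparison. A minimal Python sketch (the demo writes two throwaway copies of the same bytes; with real scans, point `sha256_of` at the two PDF paths instead):

```python
import hashlib
import os
import tempfile

def sha256_of(path: str) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()

# Demo with two identical throwaway files standing in for the two scans
with tempfile.TemporaryDirectory() as d:
    p1, p2 = os.path.join(d, "a.pdf"), os.path.join(d, "b.pdf")
    data = b"%PDF-1.4 demo bytes"
    for p in (p1, p2):
        with open(p, "wb") as f:
            f.write(data)
    print("bitwise identical" if sha256_of(p1) == sha256_of(p2) else "files differ")
```

If the digests match, the files are byte-for-byte the same, so any difference in extracted text must come from the text recognition step, not the files themselves.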
Some info:
- My library was created under DT4b2
- I am using a two-device setup synchronised via an encrypted Local Sync Store
- The files are multilingual
- I have "Stricter recognition of duplicates" enabled
- Both Macs are running macOS Sequoia 15.5
Unfortunately, I can't share the files with you, but I hope the information above helps clarify the issue.
What version of DEVONthink were you running when you updated the database?
I'm not sure what you mean by "updated", but the addition of the new files took place within DT4b3.
Are both Macs using the same kind of CPU, ie Apple or Intel?
How exactly do you add these files? Is OCR performed on one of the machines or not at all?
One device uses the M2 chip, the other uses the M2 Pro.
I simply dragged and dropped the files into the main documents window. I didn't perform OCR manually, but upon addition the files were indexed and a list of words appeared in the properties, which suggests some OCR was performed in the background. The settings in the OCR tab are identical.
Are you able to reproduce this using a document that you could share?
Fair request. I'll try to spot such a document when adding more files to my library and will get back to you.
I managed to replicate the problem:
- Asked ChatGPT to generate a dummy multilingual file as a .docx
- Printed it, then scanned it with Image Capture (without OCR ticked) and merged both pages with Preview
- Dragged and dropped the file into DT4b3 using an M2 computer, and synchronised the database
- Transferred the file to the second computer, synchronised the database and added the file to the library
- These two files are indexed and appear in the Concordance tab with word lists, but are not marked as duplicates.
- If I perform Convert > to Plain Text on one device, the resulting .txt files have minor differences.
- If I add the PDF twice using the same Mac, they are marked as duplicates
Here is the file:
Test-Scan_001.pdf (394.8 KB)
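The minor differences between the two converted .txt files can be surfaced with a plain text diff. A sketch using `difflib`, with hypothetical strings standing in for the two conversions (a single mis-recognised character is typical of OCR drift):

```python
import difflib

# Hypothetical stand-ins for the text extracted on each Mac;
# '1' vs 'l' and '0' vs 'O' are classic OCR confusions.
text_m2     = "Invoice No. 10345\nTotal: 120,00 EUR\n"
text_m2_pro = "Invoice No. lO345\nTotal: 120,00 EUR\n"

diff = list(difflib.unified_diff(
    text_m2.splitlines(), text_m2_pro.splitlines(),
    fromfile="converted_on_M2.txt", tofile="converted_on_M2_Pro.txt",
    lineterm=""))
print("\n".join(diff))
```

Running this on the actual exported .txt files (read them with `open(path).read()`) would pinpoint exactly which characters the two machines recognised differently.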
I just tried this on two machines (a MacBook Air and a Mac Studio) using the same macOS & DEVONthink versions, but the files are marked as duplicates even with strict recognition enabled. Do you use any smart rules that might perform OCR on one machine?
I don't have (and have never had) any smart rules except the defaults, and I don't have any old versions of DT. Could something outside of DT be affecting OCR within DT? I'm not sure what that could be, but if we're talking about major app differences, the second computer (M2 Pro) has Xcode and VS Code installed.
Are you still using the second beta on both computers, or already the third?
I am using the third beta on both computers. Both have the same add-ons installed (including ABBYY).
The set of installed fonts may be different.
This shouldn't affect Apple's Vision framework. Actually, nothing should affect it if the macOS versions are identical, unless it depends internally on the CPU power; but over here (M1 vs. M1 Ultra) this didn't make a difference.
That was probably just Apple's Vision framework OCR'ing the text. When I OCR (ABBYY) your PDF, I get a new one that is slightly bigger than the original (420 vs. 403 KB). Also, its type is now "PDF+Text", whereas the original document is shown as "PDF document" (with DT4b3).
Interestingly, Vision does a fairly good job with the different writing systems (Cyrillic seems OK; I didn't bother to check Chinese, though). ABBYY OCR just produced garbage for the non-Latin scripts; that might be due to my OCR settings.
I didn't know DT relies on Apple's Vision when indexing items. Does it mean that, from time to time, when Apple Vision is updated, adding new files might not trigger a duplication alert, as the textual representation of the same files could differ from those already indexed under a previous Apple Vision version?
I've just tried to OCR the PDF with Preview before importing it into DT and repeated the import of the now PDF+Text file on both computers – they are not marked as duplicates. This suggests DT ignores the document's OCR layer and performs its own indexing upon import.
If you turn that on in DT4's preferences, it does. DT3 does not.
No and no. DT4 uses either the text layer or Vision for indexing. What would it even index without a text layer or the data generated by Vision?
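Assuming the duplicate check compares the extracted text (among other criteria), even one mis-recognised character yields a different text and hence no match. A minimal sketch – `text_fingerprint` is a hypothetical stand-in for whatever comparison DT actually performs, not its real algorithm:

```python
import hashlib

def text_fingerprint(text: str) -> str:
    # Hypothetical stand-in for a text-based duplicate check:
    # normalise whitespace and case, then hash the result.
    normalised = " ".join(text.split()).lower()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

# Two text layers for the same scan, differing by one OCR confusion
layer_a = "Invoice No. 10345 Total 120,00 EUR"
layer_b = "Invoice No. lO345 Total 120,00 EUR"

print(text_fingerprint(layer_a) == text_fingerprint(layer_b))  # prints False
```

This is why two bitwise-identical PDFs can end up unmatched once each machine generates its own, slightly different, text layer.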
DEVONthink isn't actively scanning your documents to update them with Vision. And no, Vision doesn't ignore a valid text layer when importing documents. Pages with a bad text layer may be processed with Vision on import.