Is it okay for synced databases to have a different number of unique words?

My databases are synced across my iMac, Air M1, and iPad.
And some of the databases are huge, both in size and word count, so it has taken a really long time to get them synced across the devices.

But what I recently found out is that one of the databases shows a different number of unique words on two devices (iMac and Air, both running Big Sur 11.5.2). All the other numbers are the same (e.g., total words, groups, total items, and so on). The databases on both devices are synced through iCloud (CloudKit) and they all stay synced.

Should they be the same since they are synced?

And some of the databases are huge, both in size and word count,

My question is: do they need to be?

Are you running a different operating system on the two Macs?

I have a lot of scanned books, so my databases are inevitably big.
And as I mentioned, my devices are running on the same OS (Big Sur 11.5.2).

Sorry, I overlooked that.

@cgrunenberg would have to comment on the possibility of having a different number of unique words.

This might happen, e.g., if different versions of DEVONthink & macOS were used to build the database, or if the database contains some really huge documents with text longer than 16 million characters and is synchronized.
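If you want to double-check a document against that threshold, a rough sketch like this would do it (reading an exported plain-text copy outside DEVONthink; the file path is hypothetical):

```swift
import Foundation

// Rough check of an exported plain-text copy against the
// 16-million-character threshold mentioned above.
let url = URL(fileURLWithPath: "/path/to/exported-document.txt") // hypothetical path
if let text = try? String(contentsOf: url, encoding: .utf8) {
    let limit = 16_000_000
    print("\(text.count) characters,", text.count > limit ? "over the threshold" : "under the threshold")
}
```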

Thanks for your answer.

Both devices have the same versions of DT3 and macOS.
And I don’t have a document with more than 16 million characters; the longest document in this database has about 9.7 million characters. And when I compare some synced documents with a large number of characters and words, they are identical on both devices.

So, is there any other possible reason?
And most importantly, is it potentially problematic? I’m wondering if there is a need for me to, for example, rebuild the databases or clear the sync location (which I recently did).

How big of a discrepancy?

total 1,193 items (44.3 GB)
4,732,136 unique, 89,965,701 total
vs.
total 1,193 items (44.3 GB)
4,728,041 unique, 89,965,701 total

@cgrunenberg would have to respond to that, but the numbers don’t seem outrageously different.
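For scale, that’s a difference of 4,732,136 − 4,728,041 = 4,095 unique words, i.e. less than 0.1%, while the total of 89,965,701 words is identical on both machines.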

But did the devices always use the same versions? As the total number of words is identical, it’s quite likely that certain documents (e.g. PDFs) were indexed using different versions. Anyway, usually it shouldn’t be an issue at all. In the worst case, e.g. if a search doesn’t find a document, a rebuild will fix this.
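To make the unique-vs-total distinction concrete, here is a minimal Swift sketch (purely illustrative, not DEVONthink’s actual indexer): two machines can agree on the total word count yet disagree on the unique count if some tokenization detail, e.g. case folding, differs between them (a hypothetical mechanism chosen only for illustration).

```swift
import Foundation

// Illustrative only: "unique" vs. "total" word counts for a text.
// Toggling case folding changes the unique count but not the total count.
func wordStats(for text: String, foldCase: Bool) -> (unique: Int, total: Int) {
    var words: [String] = []
    text.enumerateSubstrings(in: text.startIndex..<text.endIndex, options: .byWords) { word, _, _, _ in
        if let word = word {
            words.append(foldCase ? word.lowercased() : word)
        }
    }
    return (Set(words).count, words.count)
}

let sample = "Naïve naive NAÏVE words words"
print(wordStats(for: sample, foldCase: false)) // (unique: 4, total: 5)
print(wordStats(for: sample, foldCase: true))  // (unique: 3, total: 5)
```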

Glad to hear that it shouldn’t be an issue.

No, both devices have always used the same version. The database was built on my iMac (Intel) with the latest version of DT3 (I believe it’s 3.7.2) and it was synced through iCloud (CloudKit) to my Air (M1) with the same version of DT3. (Maybe Intel/M1 could have been the reason?)

At least for DEVONthink this doesn’t make a difference; I’m not sure whether it might affect system frameworks, but it shouldn’t.

I just found something interesting. I rebuilt the database on both devices, and surprisingly, the discrepancy became greater; now even the total word counts, not just the unique word counts, differ between the devices. So it seems to me that my Air M1 and my Intel iMac rebuild their databases differently, even though they are running the same OS and the same version of DT3.

Are the same third-party apps installed on both computers? In case of documents requiring third-party Spotlight plug-ins this might also make a difference.

In addition, if the database shouldn’t contain any sensitive/private data, then it would be great if you could send us a copy exported via File > Export > Database Archive… so that we can check this over here.

Since the database has some copyrighted materials, I can’t share it. Sorry about that.

But I did another test:

  • deleting the databases on both devices (with a zip backup),
  • clearing the sync location,
  • restoring the database from Time Machine and rebuilding it in DT3 (on my iMac),
  • and syncing it through Bonjour (not through CloudKit this time) to my Air M1.

Now the database properties are completely identical on both devices. So my guess (which may be wrong) is that the CloudKit sync works differently from the Bonjour sync, so that the receiving device has to do some processing (rebuilding?) of the database.

The sync logic that updates databases is actually the same in all cases.

Just a thought here. Is it possible that the cause is the ambiguity in Unicode for some characters (for some characters there are both a precomposed glyph and an equivalent sequence of combining characters)? Different OS versions or libraries may have treated these inconsistently over time. Offhand, I would think that compression and encryption could both be affected by this.

This is unlikely as DEVONthink’s index uses a normalized Unicode variant internally. Do your databases contain any huge documents with more than 16 million characters? Or do your Macs use different versions of DEVONthink and/or macOS?
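To illustrate what that normalization is about (a minimal Swift sketch, not DEVONthink code): a precomposed character and its combining-character equivalent already compare as equal, and NFC normalization makes the underlying scalars identical as well.

```swift
import Foundation

let precomposed = "é"          // U+00E9 LATIN SMALL LETTER E WITH ACUTE
let combining   = "e\u{0301}"  // U+0065 + U+0301 COMBINING ACUTE ACCENT

// Swift string comparison is canonical-equivalence aware:
print(precomposed == combining)                 // true

// But the underlying scalars differ until the text is normalized:
print(precomposed.unicodeScalars.count)         // 1
print(combining.unicodeScalars.count)           // 2

// NFC normalization (via Foundation) makes the stored form identical:
let normalized = combining.precomposedStringWithCanonicalMapping
print(normalized.unicodeScalars.count)          // 1
```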

I don’t have particularly large databases and I don’t have databases running on more than one Mac at a time. I would expect you to be careful about this, but it is possible that other players involved such as CloudKit and encryption providers may not have been as careful. Having said that, though, I agree that it is very unlikely to be the source of the problem.

Third-party services don’t have any impact; DEVONthink’s synchronization transfers encoded (and frequently compressed) data using checksums.
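As a rough illustration of checksum-verified transfer (a generic sketch, not DEVONthink’s actual sync format; all names are hypothetical): the sender records a hash of each encoded chunk, and the receiver recomputes and compares it before accepting the data.

```swift
import Foundation
import CryptoKit

// Generic illustration of checksum-verified transfer (names are hypothetical).
func checksum(of data: Data) -> String {
    SHA256.hash(data: data).map { String(format: "%02x", $0) }.joined()
}

// Sender side: encode the payload and record its checksum.
let payload = Data("some record content".utf8)
let expected = checksum(of: payload)

// Receiver side: recompute the checksum and reject the chunk on mismatch.
func verify(received: Data, expected: String) -> Bool {
    checksum(of: received) == expected
}

print(verify(received: payload, expected: expected))  // true
```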