Is it okay for synced databases to have a different number of unique words?

My databases are synced across my iMac, Air M1, and iPad.
And some of the databases are huge, both in size and word count, so it has taken a really long time to get them synced across the devices.

But what I recently found out is that one of the databases shows a different number of unique words on two devices (iMac and Air, both running Big Sur 11.5.2). All the other numbers are the same (e.g., total words, groups, total items, and so on). The databases on both devices are synced through iCloud (CloudKit) and they all stay synced.

Should they be the same since they are synced?

And some of the databases are huge, both in size and word count,

My question is: do they need to be?

Are you running a different operating system on the two Macs?

I have a lot of scanned books, so my databases are inevitably big.
And as I mentioned, my devices are running on the same OS (Big Sur 11.5.2).

Sorry, I overlooked that.

@cgrunenberg would have to comment on the possibility of having a different number of unique words.

This might happen, e.g., if different versions of DEVONthink & macOS were used to build the database, or if the database contains some really huge documents with text longer than 16 million characters and is synchronized.
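If you want to double-check a document against that threshold, a rough sketch like this would do it (reading an exported plain-text copy outside DEVONthink; the file path is hypothetical):

```swift
import Foundation

// Rough check of an exported plain-text copy against the
// 16-million-character threshold mentioned above.
let url = URL(fileURLWithPath: "/path/to/exported-document.txt") // hypothetical path
if let text = try? String(contentsOf: url, encoding: .utf8) {
    let limit = 16_000_000
    print("\(text.count) characters,", text.count > limit ? "over the threshold" : "under the threshold")
}
```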

Thanks for your answer.

Both devices have the same versions of DT3 and macOS.
And I don’t have a document with more than 16 million characters; the longest document in this database has about 9.7 million characters. And when I compare some synced documents with a large number of characters and words, they are identical on both devices.

So, is there any other possible reason?
And most importantly, is it potentially problematic? I’m wondering if there is a need for me to, for example, rebuild the databases or clear the sync location (which I recently did).

How big of a discrepancy?

total 1,193 items (44.3 GB)
4,732,136 unique, 89,965,701 total
vs.
total 1,193 items (44.3 GB)
4,728,041 unique, 89,965,701 total

@cgrunenberg would have to respond to that, but the numbers don’t seem outrageously different.
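For scale, that’s a difference of 4,732,136 − 4,728,041 = 4,095 unique words, i.e. less than 0.1%, while the total of 89,965,701 words is identical on both machines.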

But did the devices always use the same versions? As the total number of words is identical, it’s quite likely that certain documents (e.g. PDFs) were indexed using different versions. Anyway, usually it shouldn’t be an issue at all. In the worst case, e.g. if a search doesn’t find a document, a rebuild will fix this.
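To make the unique-vs-total distinction concrete, here is a minimal Swift sketch (purely illustrative, not DEVONthink’s actual indexer): two machines can agree on the total word count yet disagree on the unique count if some tokenization detail, e.g. case folding, differs between them (a hypothetical mechanism chosen only for illustration).

```swift
import Foundation

// Illustrative only: "unique" vs. "total" word counts for a text.
// Toggling case folding changes the unique count but not the total count.
func wordStats(for text: String, foldCase: Bool) -> (unique: Int, total: Int) {
    var words: [String] = []
    text.enumerateSubstrings(in: text.startIndex..<text.endIndex, options: .byWords) { word, _, _, _ in
        if let word = word {
            words.append(foldCase ? word.lowercased() : word)
        }
    }
    return (Set(words).count, words.count)
}

let sample = "Naïve naive NAÏVE words words"
print(wordStats(for: sample, foldCase: false)) // (unique: 4, total: 5)
print(wordStats(for: sample, foldCase: true))  // (unique: 3, total: 5)
```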

Glad to hear that it shouldn’t be an issue.

No, both devices have always used the same version. The database was built on my iMac (Intel) with the latest version of DT3 (I believe it’s 3.7.2) and it was synced through iCloud (CloudKit) to my Air (M1) with the same version of DT3. (Maybe Intel/M1 could have been the reason?)

At least for DEVONthink this doesn’t make a difference; I’m not sure whether it might affect system frameworks, but it shouldn’t.

I just found something interesting. I rebuilt the database on both devices, and surprisingly, the discrepancy became greater; now even the total word counts, not just the unique word counts, differ between the devices. So it seems to me that my Air M1 and my Intel iMac rebuild their databases differently, even though they are running the same OS and the same version of DT3.

Are the same third-party apps installed on both computers? In case of documents requiring third-party Spotlight plug-ins this might also make a difference.

In addition, if the database shouldn’t contain any sensitive/private data, then it would be great if you could send us a copy exported via File > Export > Database Archive… so that we can check this over here.

Since the database has some copyrighted materials, I can’t share it. Sorry about that.

But I did another test:

  • deleting the databases on both devices (with a zip backup),
  • clearing the sync location,
  • restoring the database from Time Machine and rebuilding it in DT3 (on my iMac),
  • and syncing it through Bonjour (not through CloudKit this time) to my Air M1.

Now the database properties are completely identical on both devices. So my guess (which may be wrong) is that the CloudKit sync works differently from the Bonjour sync, so that the receiving device has to do some processing (rebuilding?) of the database.

The sync logic that updates databases is actually the same in all cases.

Just a thought here. Is it possible that the cause is the ambiguity in Unicode for some characters (for some characters there are both a precomposed glyph and an equivalent sequence of combining characters)? Different OS versions or libraries may have treated these inconsistently over time. Offhand, I would think that compression and encryption could both be affected by this.

This is unlikely as DEVONthink’s index uses a normalized Unicode variant internally. Do your databases contain any huge documents with more than 16 million characters? Or do your Macs use different versions of DEVONthink and/or macOS?
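To illustrate what that normalization is about (a minimal Swift sketch, not DEVONthink code): a precomposed character and its combining-character equivalent already compare as equal, and NFC normalization makes the underlying scalars identical as well.

```swift
import Foundation

let precomposed = "é"          // U+00E9 LATIN SMALL LETTER E WITH ACUTE
let combining   = "e\u{0301}"  // U+0065 + U+0301 COMBINING ACUTE ACCENT

// Swift string comparison is canonical-equivalence aware:
print(precomposed == combining)                 // true

// But the underlying scalars differ until the text is normalized:
print(precomposed.unicodeScalars.count)         // 1
print(combining.unicodeScalars.count)           // 2

// NFC normalization (via Foundation) makes the stored form identical:
let normalized = combining.precomposedStringWithCanonicalMapping
print(normalized.unicodeScalars.count)          // 1
```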

I don’t have particularly large databases and I don’t have databases running on more than one Mac at a time. I would expect you to be careful about this, but it is possible that other players involved such as CloudKit and encryption providers may not have been as careful. Having said that, though, I agree that it is very unlikely to be the source of the problem.

Third-party services don’t have any impact; DEVONthink’s synchronization transfers encoded (and frequently compressed) data using checksums.
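As a rough illustration of checksum-verified transfer (a generic sketch, not DEVONthink’s actual sync format; all names are hypothetical): the sender records a hash of each encoded chunk, and the receiver recomputes and compares it before accepting the data.

```swift
import Foundation
import CryptoKit

// Generic illustration of checksum-verified transfer (names are hypothetical).
func checksum(of data: Data) -> String {
    SHA256.hash(data: data).map { String(format: "%02x", $0) }.joined()
}

// Sender side: encode the payload and record its checksum.
let payload = Data("some record content".utf8)
let expected = checksum(of: payload)

// Receiver side: recompute the checksum and reject the chunk on mismatch.
func verify(received: Data, expected: String) -> Bool {
    checksum(of: received) == expected
}

print(verify(received: payload, expected: expected))  // true
```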