Dropbox sync approach to avoid duplicates

I’ve been struggling with duplicates generated by DEVONthink. I’ve settled on an approach, but would like to confirm that it is the best one and that I’m not missing something that could make it easier.

I have DEVONthink on two laptops and DEVONthink To Go on my phone. Because of DEVONthink To Go, I need to use DEVONthink’s database synchronization with “Synchronize content of indexed items” enabled. Also, I do need access to my documents on other platforms, so I need to use cloud synchronization (with iCloud and Dropbox) and have DEVONthink index that content.

All of my cloud content is kept available on disk at all times rather than downloaded only on use.

I tried to use DEVONthink’s database synchronization (I’ve tried iCloud and Dropbox) between my three devices. I often end up with duplicate entries - two identically named entries in DEVONthink pointing to the same external file. This usually happens on my secondary laptop, probably because of the relative timing of the cloud synchronization and DEVONthink’s own synchronization when I first start up that laptop after many changes have been made on my primary one.

Your (absolutely exquisite) manual discusses my use case and doesn’t raise any red flags except for duplicated content in the cloud; I have plenty of space so that is of no concern.

My current approach to avoid this is to use DEVONthink’s synchronization only on my primary laptop and DEVONthink To Go. On my secondary laptop I have unsynchronized databases which index the same cloud folders.

I do see that DEVONthink’s synchronization settings allow me to choose how to deal with conflicts. I have that set to “Duplicate documents”. But, duplicates due to editing conflicts are not what I’m trying to address. If there were a real conflict, I would want the duplication so that I could intervene and possibly merge the documents.

Is using different databases on my secondary laptop the best I can hope for? I believe that I’m losing some things by not having the same databases on that laptop.


Working through your post properly will take more time than I have right now. However, what immediately comes to mind is that you mention using two internet sync services, Dropbox and iCloud. Are your databases being synced with both, or does each database use one or the other? I believe the latter is recommended. Perhaps the former contributes to the duplicates? I do not know for sure. Just asking.

@rmschne thanks for the response.

The DEVONthink synchronization uses Dropbox. I was originally using iCloud for that but switched to Dropbox for greater reliability. I made that switch a while ago.

Each indexed file is only on one of the two cloud services.

So to clarify:

  • You have indexed cloud files in your database(s)
  • You’re currently using Dropbox for your sync
  • You were syncing two laptops and a mobile device, but because you’re getting duplicate items in your database on the second laptop, you’ve now removed the second laptop from the sync

Please correct any of the above if it’s wrong!

My first thought is that it’s possibly going to be a bit of a pain now rectifying the second laptop, as the changes you’ve now made to your database on that device are going to want to sync if you create a new sync between the two laptops.

I guess my follow-up questions to help others propose a solution are:

  • How frequently were you syncing between devices? (Are we talking maybe a day’s work syncing and occasional duplicates, or are you going weeks between syncs with hundreds of amends and duplicates?) If e.g. the two laptops come together once a day, is Bonjour a better solution for your sync needs between those two devices?
  • Are you clear in your mind as to which device is the primary/parent device that should be the source of truth? (That doesn’t really affect your sync, but I think it’s important to be clear as it can sometimes alter our behaviour, and for the Bonjour connection I mentioned previously you’ll need to be clear on this.)

What version of macOS and DEVONthink are you running?

Your clarification is spot on.

The plan was never to reintroduce synchronization to the secondary laptop. Because cloud folders are indexed, the cloud service guarantees both laptops have the same files. Also, all the tags seem to come across as properties of the files themselves.

I had it set to automatic on both laptops, but whenever I noticed the little dot next to the database, I would tend to click on the synchronize toolbar button. So, my primary laptop, which I use during my workday, is synchronized frequently. After work, I grab my secondary laptop, make my way to the couch, and then fire up DEVONthink. It would synchronize automatically rather soon after that. I then might notice duplicates.

Such a good question. Yes, I’m clear. My primary laptop is the source of truth for everything; even all my backups are centered around the files that are present on its drives.

Both laptops are running macOS 14.5 (Sonoma) and DEVONthink 3.9.6.

  • Are you running the same version of Dropbox on both Macs?
  • And the Dropbox folder in the same relative location on both machines?

Maybe the simplest example, from when I was synchronizing the databases on the secondary laptop, would make it more clear.

  • Cloud folder “F” is indexed in the synchronized database.
  • I modify a markdown document “m” in folder “F” on the primary laptop.
  • When I begin working on the secondary laptop I notice that there are two DEVONthink entries in the group, identically named and referencing the same document “m”. One of them has the same item link as the one on the primary laptop. The other one has a new item link.

It’s possible that my memory is slightly off. It could be that this duplication happens only when I’m moving files on the primary or adding new ones. But, I’m somewhat confident that modifications can also cause it.

In DEVONthink, hold the Option key and choose Help > Report bug to start a support ticket.

Both are running Dropbox v203.4.4857.

Yes.

It’s not just Dropbox that can cause this. I used to synchronize my databases with CloudKit and had the problem. Also, I’ve had the problem with cloud files on my iCloud drive; those cloud files are in individually indexed folders in ~/Documents.

But you’re also talking about disparate mechanisms here. DEVONthink’s sync engine is not using the Dropbox application, nor is it using iCloud Drive on your Mac. So you essentially have two or three independent sync processes doing their own thing and hopefully not stepping on each other’s toes. DEVONthink strives to avoid that as much as possible. We don’t control and can’t account for the behavior of the others.


I worried about the interaction until I read page 63 in the DEVONthink manual under the heading “INDEXING AND SYNC” (with its opening line “Often people index content from the local repository of a cloud service like Dropbox.”). I stopped trying to synchronize the databases on the secondary laptop after concluding that the manual was overly optimistic.

DEVONthink is my most important tool. I’m happy to modify my behavior if it would make database synchronization more reliable. For example, I could move all my documents to either iCloud or Dropbox and also do my database synchronizations to those same services. That would eliminate one of the three players in the mix.

The problem that DEVONthink suggests it has solved in the manual seems to be a hard one. I’ve speculated that the situation relates to the timing of the updates (the ones from DEVONthink’s synchronization and the ones from the cloud service). They could both be trying to create the new file on disk at the same time. However, I have tried waiting a while, allowing the cloud sync to finish before running DEVONthink. Even with that, I’ve seen the duplication.

The problem would go away if I were to turn off “Synchronize contents of indexed items”. The manual suggests that as a space saving measure. Unfortunately, the manual also says that one shouldn’t do that if using DEVONthink To Go, since the content in the synchronized databases is the only content it has access to.

If the world was static, all problems would be gone. But just as there are small outbreaks e.g., of smallpox, etc., things change over time and problems can resurface. Dropbox isn’t sitting still and neither are we, developmentally. Just look at the Crowdstrike debacle going on. Innocuous update, right? Seemingly, and with far more disastrous results than a duplicated document here and there :wink:

The problem would go away if I were to turn off “Synchronize contents of indexed items”. The manual suggests that as a space saving measure. Unfortunately, the manual also says that one shouldn’t do that if using DEVONthink To Go, since the content in the synchronized databases is the only content it has access to.

That is correct, re: DEVONthink To Go. Even if you weren’t using a shallow sync, the contents need to be available in the sync location for it to transmit data between it and a remote sync location… which leads me to: Why are you using a remote sync option at all? To beat the deceased equine once more:

Regarding syncing, the first question you need to ask yourself is, “Do I need a remote sync option?”. Consider these questions…

  • Do you need to sync between machines – especially non-portable desktop Macs – in different geographic locations?
  • Do you have a colleague, assistant, significant other, etc. that needs frequent updates to synced data?
  • Do you need to use a shallow sync, i.e., Download Files: On demand in DEVONthink To Go?

If the answer is no to any of these questions, a local sync on your network is suggested.

There is a forum post on syncing: Sync Types Explained. This is a good place to start.
There is also one specifically about Bonjour: Bonjour Simplified

If nothing else, I would disable the remote syncs and try the Bonjour sync on your local network to see if the duplication persists. I have a suspicion it won’t, but that’s what the scientific method is for :wink:

PS: See my response to your support ticket before proceeding.


You guessed it - I’m using “Download Files: On demand” in DEVONthink To Go. I often need access to those files when I’m away from home.

But, I was one of those silly people who massively oversized their phones. As a test, I’m going to turn off the download on demand and switch to Bonjour. We’ll see how it goes.

I see that you responded to my bug report, offering a more advanced tweak. I’ll start with the more mainstream one of using Bonjour. If that doesn’t solve the problem, I’ll move on to the other.

Thanks for all your support.

P.S. I know many people sing the praises of DEVONthink; I’ll add my voice to that chorus. But, the quality of the PDF manual deserves the same. In no way am I criticizing it because of the section which motivated me to pursue something which didn’t work.


You’re welcome :slight_smile:

But, I was one of those silly people who massively oversized their phones. As a test, I’m going to turn off the download on demand and switch to Bonjour. We’ll see how it goes.

Just as we do with remote versus local syncs, we recommend using a shallow sync when it’s needed. It was implemented at a time when a 64GB iPhone was top of the line. And while it’s still an available option on a 1TB iPad, it introduces a dependency on a remote connection which may not always be available or responsive. And in our experience over the years, there’s rarely a need for someone to carry all their databases on mobile so we’re not advocating just filling a device because it’s possible :smiley:

And having offline storage, i.e., a full sync, is part of our data model and philosophy, which you can read about if you have nothing more interesting to do on a Sunday afternoon - haha!…

I see that you responded to my bug report, offering a more advanced tweak. I’ll start with the more mainstream one of using Bonjour. If that doesn’t solve the problem, I’ll move on to the other.

In this case, that tweak may improve performance as well.

But, the quality of the PDF manual deserves the same. In no way am I criticizing it because of the section which motivated me to pursue something which didn’t work.

As I am the one responsible for the manual – including any errors – I appreciate the generous comments :heart: :slight_smile:


No luck with the Bonjour replication. I did some significant work on my primary laptop, moving some directories around. When I synchronized my secondary laptop many files were duplicated. None of those files were involved in the changes I was making.

I’m going to try that other approach, adjusting the advanced setting, and see if it fixes things.

I’m going to go back to using Dropbox as my point of synchronization. I was out and about today and wanted to look at something I had just been working on at home, but I had forgotten to trigger a sync before leaving. So, for the way I work, it’s not going to be enough to just keep all the content on the phone.

What is the best way to clean up 34 duplicates (17 items, 1 duplicate for each)? They all look the same; they have the same item names in DEVONthink and they reference the same files. It’s quite tedious to pick the right ones to delete.

Something is quite wrong. I don’t know the cause. But what makes picking the “right” ones to delete tedious? If they are duplicates, then delete all but one. Would Smart Rules help? Are they really duplicates, or different versions of what you think are the same items?

I had so many issues with indexed items in cloud services that I now use as few indexed folders as I can. The “tree body problem” (wink, wink) with cloud services is epic, and it happens not only with DT but with any other application. It seems that the combination of a) a cloud provider, b) operating system machinery, and c) a third-party file processor ends in duplicates/conflicts.

In my experience, most of the issues arise because the cloud provider is not fast enough to process the file-in-the-cloud, which then creates an interlock that ends in a duplication/conflict.

Duplication: while the cloud provider is handling the sync or otherwise controlling and “touching” the file, a third actor that is faster than the cloud provider (and almost anything is faster than, say, Dropbox) handles the same file, and both save it. The result is a duplicate.

Conflict: because it is slow, the cloud provider finds the same file modified both in its cloud and locally, and the cloud copy fights with the local one. The result is a duplicate file, usually but not always with “conflict” added to its name.

I’ve tested most of the public ones: iCloud, Dropbox, Drive, OneDrive (which is really a joke). The least problematic is… Synology Drive Client, both online-only and always local.

I’m talking about roughly 400,000 files / 1.2 TB.

Now DT/DTTG enters the picture, again from my own experience. Even playing with the hidden options, you will end up with duplicates. Do you want fewer duplicates? Use Synology Drive, but you will still get some.

A workaround for the problems mentioned here, and only if you don’t need immediate sync, is to keep all indexed files outside the cloud, and then use a utility like FreeFileSync to synchronize the changes with the files in the cloud (which can be online-only, as DT does not touch them). When you arrive at your destination, do the reverse, from cloud to local.
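
To make the idea concrete, here is a minimal Python sketch of the one-way “copy what changed” step. It is not how FreeFileSync works internally, only an illustration of the workflow; the function name `mirror_changes` and the folder layout are assumptions for the example, and it never deletes anything in the target.

```python
import filecmp
import shutil
from pathlib import Path


def mirror_changes(source: Path, target: Path) -> list[Path]:
    """Copy files that are new or changed in source into target.

    One-way only: nothing is ever deleted from target.
    Returns the relative paths that were copied.
    """
    copied = []
    for src in sorted(source.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source)
        dst = target / rel
        # Compare full contents, not just size and modification time.
        if dst.exists() and filecmp.cmp(src, dst, shallow=False):
            continue  # identical content, nothing to do
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)  # copy2 also preserves modification times
        copied.append(rel)
    return copied
```

Before leaving you would call something like `mirror_changes(local_folder, cloud_folder)`, and on arrival `mirror_changes(cloud_folder, local_folder)`, where both folder paths are whatever you use on your own machines. A real tool also handles deletions, renames, and conflicts, which this sketch deliberately ignores.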

They are all duplicates. They are named identically in DEVONthink and they link to the same file in the cloud folder.

On the main computer I have an item bozo.pdf → bozo.pdf (on Dropbox) and it has an item link x-devonthink-item://xyz. On my secondary computer I have two entries that look identical - “bozo.pdf → bozo.pdf”. One will have the same item link and one will have a new one. I would like to simply delete the new one. How do I identify which one that is? So far, the only way I’ve found is to copy the item links and paste them into an editor to view. It’s quite tedious. It would be very helpful if there were some way to show the item link as a column in the list of files. After all, I only have 34 duplicates to delete.
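
To spell out the comparison I’m doing by hand, here is a rough Python sketch. It assumes the name, file path, and item link of each entry have somehow been exported from both machines into plain lists; that export step is not shown, and the `Entry` record, the `find_new_duplicates` helper, and the `abc` link are hypothetical names for illustration, not anything DEVONthink provides.

```python
from collections import defaultdict
from typing import NamedTuple


class Entry(NamedTuple):
    """One database entry; a hypothetical record, not a DEVONthink type."""
    name: str
    path: str       # path of the indexed file on disk
    item_link: str  # e.g. "x-devonthink-item://xyz"


def find_new_duplicates(primary, secondary):
    """Return secondary entries that share a file path with another secondary
    entry and whose item link does not exist on the primary machine."""
    known_links = {e.item_link for e in primary}

    by_path = defaultdict(list)
    for entry in secondary:
        by_path[entry.path].append(entry)

    to_delete = []
    for entries in by_path.values():
        if len(entries) < 2:
            continue  # this file is not duplicated on the secondary machine
        to_delete.extend(e for e in entries if e.item_link not in known_links)
    return to_delete


primary = [Entry("bozo.pdf", "/Dropbox/bozo.pdf", "x-devonthink-item://xyz")]
secondary = [
    Entry("bozo.pdf", "/Dropbox/bozo.pdf", "x-devonthink-item://xyz"),
    Entry("bozo.pdf", "/Dropbox/bozo.pdf", "x-devonthink-item://abc"),
]
for entry in find_new_duplicates(primary, secondary):
    print(entry.item_link)  # the entry with the new link, safe to delete
```

The entry whose link matches the primary is kept; only the one with the unfamiliar link is reported.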

This is my suspicion as well. On the other hand, in one experiment I did, I was careful when logging in to my secondary laptop to give the cloud services plenty of time to synchronize before starting DEVONthink. I still got new duplicates.

I’m not quite sure of the state of things in the sync location. The primary laptop, when subsequently synchronizing, never acquires the duplicates. Also, I’m 99% sure that one time, when I added a new file to a group that had some duplicates, the duplicates went away. Finally, when I fully recreated my databases in DEVONthink To Go last night, I got duplicates.

In all cases, I’ve not experimented enough to derive reproducible results and my memory of things could be off.

I still haven’t tried the advanced setting change that was recommended outside this thread. I’ll probably get to that tonight.

I wasn’t referring to the time it takes to actually synchronize the file, but to the time it takes to check whether the local file has changed relative to the cloud one. For example, Dropbox subscribes to the entire filesystem instead of just its own folder. When a file changes, Windows (read to the end) notifies Dropbox of the change, passing the new data. Dropbox then compares that change with its internal data and, if the file needs to be uploaded, uploads it. My Windows machine has nearly 3 million files; each change is passed to Dropbox, which must find the same file, compare, and decide what to do. 3M filenames are nothing for a good comparison algorithm, but Dropbox’s seems to be the worst of all (not counting OneDrive, which can take one or two minutes to “detect” a change in 400,000 files).

I’ve said Windows because I know Windows internals better than macOS, but on macOS it is the same: you subscribe to filesystem events and react to them. That is the dramatically slow part.

Some time ago, on Windows and in C#, I ran some tests. Reading 500,000 file names with date and size from disk took less than 10 seconds. I don’t remember the exact timings, but finding filenames with partial matching took on the order of microseconds. I was using the “standard” C# API and LINQ, which is not the fastest way to query data, but it is optimized enough. So why are cloud providers so slow at comparing and checking files? The information is not on the disk: Windows passes it via a memory block, and the cloud provider has all its data in memory…
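
For anyone who wants to try something similar, here is a rough Python analogue of that kind of test. It is not the original C# code, and the helper names `scan_tree` and `changed_files` are made up for this sketch. It builds an in-memory index of path, size, and modification time, then compares two such snapshots; deleted files are not reported.

```python
import os


def scan_tree(root):
    """Return {path: (size, mtime)} for every regular file under root."""
    index = {}
    stack = [root]
    while stack:
        current = stack.pop()
        try:
            with os.scandir(current) as entries:
                for entry in entries:
                    if entry.is_dir(follow_symlinks=False):
                        stack.append(entry.path)
                    elif entry.is_file(follow_symlinks=False):
                        info = entry.stat(follow_symlinks=False)
                        index[entry.path] = (info.st_size, info.st_mtime)
        except OSError:
            continue  # skip directories that cannot be read
    return index


def changed_files(old, new):
    """Paths that are new, or whose size or mtime differs between two scans."""
    return sorted(path for path, meta in new.items() if old.get(path) != meta)
```

Wrapping `scan_tree` in `time.perf_counter()` calls on a large folder gives a feel for how cheap the raw listing and comparison are, although the numbers will depend heavily on the disk and the operating system cache.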
