Incremental Time Machine Backups of DevonThink Pro Databases

I’ve been using DT since, IIRC, DEVONnote in 2004, but I remain generally unclear about DEVONthink’s backup strategies. This question specifically regards how DT interacts with Time Machine.

There seems to be no question that Time Machine will back up the entire database package, but it is unclear whether Time Machine will look inside the package and back up only the changed files within. Further, I can’t tell whether such an incremental backup will restore properly as a coherent DT database.

Based on my recent experience, I believe Time Machine tries to back up the entire package whenever even the smallest change is made.

Problem Story: In the past my DT databases were fairly small and mostly text files, but after I decided to use tagging heavily, I moved gigs of data (probably the majority of my work data) inside various DT databases, so that now some are in the tens of gigs in size. I used a combination of Carbon Copy Cloner and a custom unix kludge, but neither was network-based, and I had an increasing tendency to forget to maintain them after moving to a laptop.

After a drive failure on the laptop, I decided to switch to Time Machine for its automatic nature and easy support for laptops.

I had problems backing up initially, which led me to discover a failing hard drive. I cloned the drive with CCC and then had the drive swapped out. I thought I had fixed the problem.

While twice attempting an initial backup of a single 230 GB drive, I left it running overnight via direct Ethernet, and in the morning the backup had reached around 225 GB and said it had only about 15 minutes to go. Thinking everything was fine, I opened DT to do some work, and 30 minutes later noticed that Time Machine was still running. Only now it reported that it had backed up 235 GB with ~250 GB pending. Worse, the size of the pending data continued to grow. The estimated time remaining always said “5 seconds.”

Using opensnoop and rwsnoop on backupd, I saw that it was churning through and copying the entirety of the DT databases I had opened. Since these databases were multiple gigabytes in size, backupd was trying to copy another ~30 GB, and apparently did so again whenever I made the least change. Clearly, this was untenable in my case.
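
For anyone who wants to watch this themselves, these are roughly the commands I used (both are DTrace scripts bundled with OS X and need root; I believe the -n flag filters by process name on both):

    # Show every file that backupd opens, live.
    sudo opensnoop -n backupd

    # Show backupd's reads and writes, to gauge how much data it is
    # actually copying.
    sudo rwsnoop -n backupd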

This has become a rather serious problem for me because I have centralized perhaps the majority of my data inside DT, especially vital small notes and snippets that I must back up on a continuous basis. If Time Machine can’t work with DT, I need to rethink my entire backup strategy.

So, does Time Machine back up DT database packages incrementally or not?

If not, what is the recommended solution for incremental (e.g., hourly) backups over the network for laptops?


That’s what I had assumed based on my knowledge of the underpinnings, which is one of the reasons I avoided using Time Machine in the first place, both for DT and some other apps. However, newer information I came across while surfing seemed to say otherwise, so I thought I would try it. (Time Machine is especially good with mobile laptops, where other options lag behind, IMHO.)

I would note that, as far as I can tell, the 2.5.2 DTP manual doesn’t even mention Time Machine. I feel this is an oversight on the manual’s part, given that Time Machine is now largely regarded as a core function of the OS itself as everything moves away from physical media. You should just state that Time Machine can’t be used.

(As an aside, one of the problems I have encountered using DTN and DTP is that I have used them so long that I tend to assume the capabilities I learned years ago are still in effect; e.g., syncing used to be only between DT processes on separate machines, and now that seems to have changed. I just don’t have time to keep active track of all the changes. Doubtless, this contributed to my misunderstanding. You might consider adding a “things that have recently changed” option to the tip screen for us old hands.)

Well, at one point, the system reported one of them as 117 GB after the restore, but that rectified down to 17 GB. (The hard drive failure was in hardware and fairly subtle; I had to drill down to the SMART sensors via the command line to find it. It had been going on for a couple of weeks before I caught it, and it took some repairs to get things back even off the backup.) Another was 12 GB, and there were some smaller multi-GB ones. I was using them to organize rather a lot of scanned historical text in PDF, plus a lot of large images.
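
(For anyone curious, the command-line check I mean is something like the following; diskutil ships with OS X, while smartctl comes from the third-party smartmontools package, and disk0 is just my drive’s identifier:)

    # Quick pass/fail SMART status for a drive.
    diskutil info disk0 | grep SMART

    # Full SMART attribute dump, if smartmontools is installed:
    # smartctl -a /dev/disk0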

I ran through all the verify, repair, backup, and rebuild steps and then exported them entirely.

Right now, I’ve decided to rebuild the databases from scratch by indexing the large files and maybe even some of the small text files. That, combined with the new (to me at least) local sync, should give me the redundancy I want. I’ll post another question about Sync if I need to.

Thanks

If you’re interested in keeping an incremental backup and don’t mind acquiring some new skills, I recommend using a local sync store and keeping it in a git (or other good version control system) repository. Use a cron job to make daily or hourly commits, making sure that these commits don’t occur during synchronizations, which would be bad. Then, if you need to restore your database, you check out an appropriate commit (e.g., 09/12/2013 at 5:00 PM) and just import the database through the normal Sync interface.
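
Here’s a minimal sketch of that setup. The sync store path, script location, and schedule are all placeholders; adjust them to your machine and pick a cron time that can’t collide with a running sync:

    # One-time setup: put the local sync store under version control.
    cd ~/SyncStore
    git init
    git add -A
    git commit -m "initial snapshot"

    # ~/bin/snapshot.sh: commit the current state of the store.
    #!/bin/sh
    cd "$HOME/SyncStore" || exit 1
    git add -A
    # If nothing changed since the last run, git simply makes no commit.
    git commit -m "snapshot $(date '+%Y-%m-%d %H:%M')" >/dev/null 2>&1

    # crontab entry (edit with crontab -e): run at half past each hour,
    # offset from on-the-hour syncs.
    # 30 * * * * /Users/you/bin/snapshot.sh

To restore, use git log to find a commit near the point in time you want, git checkout that commit, and then import the store through the normal Sync interface as described above.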

This isn’t perfect – Sync stores change format occasionally, and 2.6.2 will change it massively because it adds record-level encryption on stores and some other features – but it should be good enough to provide peace of mind along with a maximal amount of configurability and optimal space utilization.

Version control is not the simplest subject, so it requires some research and time spent messing around with tutorials, but I do think it works very well with this situation.

korm contacted me via PM and asked me to clarify my comments, since I’ve previously been observed on several occasions flying into hysterics when users mention “Sync” and “backup” in the same sentence.

Sync is purely synchronization. It copies things, including mistakes. If you delete every document in your database and then sync, congratulations: you have just deleted every document in all of your copies of the database. If there is a database error, or a bug in Sync, or you work in DEVONthink while drunk… it’s very easy to destroy everything. Just because Sync keeps a copy of your database on Dropbox doesn’t mean it’s a backup, in any meaningful sense.

In my mind, you don’t have a backup unless you have several versions of your database from several points in time that you can flip back and forth between in finite time. This is what a version control system provides.

Of course, this doesn’t protect against all catastrophes. Hard drives still fail. Software still has bugs. I recommend doing whatever puts your mind at ease.

In terms of elegance and feature set, though, I think Sync + git is pretty hard to beat.

I use Time Machine and have no problems with it. When my MacBook Pro is on my desk, I mount the RAID external storage and start up Time Machine, which continues to make backups every hour as long as that storage is mounted.

Time Machine does make incremental backups of DEVONthink Pro Office databases. For example, for the last couple of days I’ve not had the Promise RAID unit mounted to my laptop. During that time I modified all five of the databases that I normally run, with scanning and Web downloads as well as new rich text notes. New content was added in other apps as well, including Mail.

The database files add up to 6 GB in storage size. The initial Time Machine backup was 2.07 GB. Obviously, the entire package files did not have to be copied. As additional confirmation, I added new content to three of the databases (including one that is 4 GB) and also added content in the Finder by exporting several files from DEVONthink. I forced a new Time Machine backup before the next hourly interval, and the backup was 624 MB.
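
If you want to verify this on your own machine, tmutil (bundled with OS X since 10.7) can compare two backups and summarize how much data actually changed between them. The volume and snapshot names below are just examples from my setup:

    # List the completed backups on the mounted backup volume.
    tmutil listbackups

    # Compare two of them; the summary at the end reports how much was
    # added, removed, and changed (sudo may be needed to read the
    # backup's files).
    tmutil compare \
        "/Volumes/Promise RAID/Backups.backupdb/MacBook Pro/2013-09-12-090000" \
        "/Volumes/Promise RAID/Backups.backupdb/MacBook Pro/2013-09-12-100000"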

As korm noted, DEVONthink databases are dynamic. I agree that in principle one should run Time Machine backups while databases are closed. But in practice DEVONthink is always running on my Mac, so it is running while Time Machine backups are happening. On several occasions I’ve checked for possible problems when I was modifying a database during a backup (although I’ve never tested by adding thousands of files during one). I took a screenshot of Database Properties, then moved the database aside, replaced it with the Time Machine backup, and compared Database Properties. No difference, although I think differences are possible in some cases. Even if that happened, the next hourly backup would likely resolve it.

I’m a believer in backups, although I haven’t had to resort to one in years, as my databases are stable.

A potentially more serious problem is backing up a database that is already damaged. Time Machine (and most other backup utilities) can’t check databases for errors and will happily back up a damaged database. Once in a while, I’ll run Tools > Verify & Repair to check; again, I haven’t seen errors in a long time. But it’s a good habit to develop.

My gold standard for backups is the Database Archive that can be produced by (at least) DEVONthink Pro and DEVONthink Pro Office. It does an error check before starting the backup, and will stop and alert the user if there is a problem. I periodically update Archives of my important databases and store them offsite.

Here’s where I’ll sing the praises of Time Machine (and Thunderbolt) for moving everything to a new computer. When I got my current MacBook Pro, I mounted the Promise RAID to it and transferred everything from Time Machine via Thunderbolt. It was the fastest data transfer to a new computer that I’ve ever experienced, and there were no problems at all. All my Preferences settings, registrations, cookies, and so on were moved as well.

You guys are making me nervous about backing up the DTPO database. I’ve been letting Time Machine do hourly backups, and I do a nightly incremental backup to the ‘Copy’ cloud service using Apple’s ‘Backup’ program. Am I not as safe as I thought? If you open DEVONthink’s preferences and go to the ‘Backup’ tab, it says right at the bottom of the window:
“Backups contain only the metadata storage and search index. Use Time Machine to backup complete databases including internal and external files.”

So, are Apple’s Time Machine and Backup programs making reliable copies or not? If not, I need to get everything out of DEVONthink and just do indexing. I was hoping to keep most files stored in DTPO and use it as the center of my little world here!

@clane47 – don’t be nervous. Understanding how these things work makes your data safer.

I’d summarize the discussion as follows:

  1. The best backup is File > Export > Database Archive, done periodically and saved off the machine or even off site. Full archival off-machine or off-site backups are the best practice for backing up any kind of data.
  2. Time Machine backups can be reliable, but they are incremental and could have anomalies – so restore a Time Machine backup from time to time and validate that it has what you think it should (see the sketch after this list). This periodic verification/validation is good practice with any backup, IMO.
  3. Closing DEVONthink while backing up is a good practice, as it is for any data belonging to any application – but experts believe that for daily use it’s safe not to do this. BUT, see #1 and #2.
  4. Incremental backups done by services such as CrashPlan might not be useful – verify that you’re getting what you think you’re getting.
  5. If you feel comfortable with the technology, the alternative solution suggested by Nathan is a good one. BUT, see #1 and #2.
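
For #2, here is one low-tech way to spot-check a restored database; the paths and database name are placeholders, and DEVONthink should be closed so the live package isn’t changing underneath the comparison:

    # Compare a restored package against the live one, file by file.
    # Differences in content files matter; recently touched index or
    # metadata files inside the package may legitimately differ.
    diff -rq ~/Databases/Work.dtBase2 ~/Restored/Work.dtBase2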