Verifiable long term document storage

I have been using Devonthink for many years now, so much so that it has become the standard repository for everything. But I live in fear of losing a document by accident and not noticing it until years later.

Has anybody come up with a way of checking the repo to make sure that all files are still in place and correct? Perhaps an Archive where old documents can be stored, but with a way of running a shasum check over it to be absolutely sure nothing changed or was corrupted.
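To make concrete what I mean by a shasum check, here is a minimal sketch. It is illustrative only: the throwaway directory and file name stand in for a real archive export, and it uses Python’s hashlib rather than the shasum CLI, but the idea is the same.

```python
# Sketch: build a SHA-256 manifest over an archive folder, then verify it later.
# The archive path here is a throwaway directory; point it at a real export.
import hashlib
import pathlib
import tempfile

archive = pathlib.Path(tempfile.mkdtemp())
(archive / "invoice.pdf").write_bytes(b"%PDF-1.4 dummy")

def manifest(root: pathlib.Path) -> dict:
    """Map each file's relative path to its SHA-256 digest."""
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }

baseline = manifest(archive)   # store this manifest somewhere safe

# Years later: recompute and compare to detect loss or corruption.
current = manifest(archive)
assert current == baseline, "archive changed or corrupted"
```

The manifest file itself would of course need to live outside the archive (or be signed), so that corruption can’t silently alter both.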

The question of git has come up in the past. It would be ideal if the archive could be committed to a git repo so it can ALWAYS be recovered at a much later date. But I understand git fails because the documents and database are stored in different locations. Is that still true with 3.0?

Any thoughts or suggestions about safe verifiable long term storage would be extremely welcome.


File > Verify & Repair Database… doesn’t support checksums, but at least it can verify the integrity of the database and that all files exist. The latest version performs this automatically if a database wasn’t closed properly (e.g., due to a crash, kernel panic, or power outage).

It would be ideal if the archive could be committed to a git repo so it can ALWAYS be recovered at a much later date.

That wouldn’t be the case if you are offline or your network went down.

But I understand git fails because the documents and database are stored in different locations.

Imported documents are located inside the internals of a DEVONthink database.
However, you should not be putting a DEVONthink database in the cloud. You could periodically export a ZIP file (see File > Export > Database Archive or Script menu > Export > Daily Backup Archive) for cloud use.

I didn’t phrase my question particularly well. What I am trying to detect is the accidental loss of a random document, either in Devonthink or in the file system. It could be an accidental delete or some other mistake.

I already have a full backup schedule in place using a combination of Time Machine and periodic offline and offsite backups, so data recovery isn’t the problem.

This question is about verifiability. I want Devonthink to be even more dependable than a file cabinet containing numbered documents. In five years time I want to know for sure everything put into that cabinet is still there and not corrupted. I’d like a list of anything that has been removed or changed over the five year period. Git can do that.
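The kind of change report I have in mind could be sketched by diffing two checksum manifests, one taken now and one taken years ago (the paths and digests below are made up):

```python
# Sketch: report removed/added/changed files between two checksum manifests,
# where each manifest maps a relative path to a content digest.
def diff_manifests(old: dict, new: dict) -> dict:
    return {
        "removed": sorted(old.keys() - new.keys()),
        "added":   sorted(new.keys() - old.keys()),
        "changed": sorted(p for p in old.keys() & new.keys() if old[p] != new[p]),
    }

# Hypothetical manifests from two points in time.
old = {"taxes/2019.pdf": "aaa", "letters/bank.pdf": "bbb"}
new = {"letters/bank.pdf": "ccc", "notes.txt": "ddd"}

report = diff_manifests(old, new)
print(report)
# {'removed': ['taxes/2019.pdf'], 'added': ['notes.txt'], 'changed': ['letters/bank.pdf']}
```

That is essentially the report `git status` would give, just without needing git.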

Recovering the lost file is easy using Time Machine, but it would be so much easier if I could just shut down Devonthink and then have git check out an earlier version of the database. I don’t understand the issue. This seems to be the same as using Time Machine to go back to an earlier date.

What exactly is the problem with using git on the database?

Forget the cloud btw … that is not the purpose of this question.

It’s very likely that the contents of the database package won’t be restored consistently. E.g. just restoring the files/folders is not sufficient.


So … I am trying to recover an important tax file lost from 2019 using Apple backup. It’s messy so I need to come back to this question. I need a way to be able to go back to earlier versions of a Devonthink database. It seems to me that the ideal solution is to commit the database to git every so often.

This question has been asked several times over the years but I don’t see a definitive answer. Some confusion seems to exist about what exactly git is. It is NOT cloud-based software; it simply lets you take snapshots of a directory tree. It is highly efficient because it stores deltas from previous versions.
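To illustrate the workflow I imagine, here is a sketch against a throwaway repo (the file name and the inline git identity are invented for the example, and the real database would have to be closed first):

```python
# Sketch: using git as a snapshot/restore tool for a folder of documents.
# Everything here uses a temporary directory; paths are illustrative only.
import pathlib
import subprocess
import tempfile

repo = pathlib.Path(tempfile.mkdtemp())
(repo / "doc.txt").write_text("tax return 2019\n")

def git(*args):
    """Run git in the sketch repo with a throwaway identity."""
    return subprocess.run(
        ["git", "-C", str(repo),
         "-c", "user.email=you@example.com", "-c", "user.name=you",
         *args],
        check=True, capture_output=True, text=True,
    ).stdout

git("init")
git("add", "-A")
git("commit", "-m", "snapshot 1")

# Simulate an accidental deletion, then ask git what changed.
(repo / "doc.txt").unlink()
missing = git("status", "--porcelain")   # reports the deletion, e.g. " D doc.txt"

# Recover the file from the last snapshot.
git("checkout", "--", "doc.txt")
print((repo / "doc.txt").exists())
```

That mechanically works; whether it is wise for a live binary database is the question the replies below address.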

Can git be used and, if not, why?

You probably could use Git but it would not be a good idea for non-text files. Git’s main advantage is being able to compare changes and track differences; it would not be able to do that for a .pdf document and certainly not for an entire .dt3 database.

A better solution would be either Apple Time Machine or a (superior) 3rd-party backup program like Carbon Copy Cloner. These keep incremental or snapshot backups on an hourly/daily/weekly/monthly schedule you specify.

It might work if the databases are closed before taking a snapshot.

Time Machine backs up fine, but if you have a corrupt file it will be backed up as well. Eventually you lose the uncorrupted original.

When I’ve asked in the past, DT tech support said no. But I’m wondering if they really know what git is. I don’t understand why a Time Machine restore is OK but a git checkout is not.

I know git sucks with binary deltas, btw, but mostly we don’t change binaries (PDFs etc.).

Of course we know what git is. We use it in-house.

And git is not a backup system, so comparing it to Time Machine seems quite odd, even if you’re just referring to a file restore.

git is made to do what it does. Time Machine as well.

While this won’t help with the backup part of the equation, Howard Oakley has a multi-part series on file integrity and several apps that create and store checksums in extended attributes that allow one to check for corruption, though not repair it.

MacPAR deLuxe creates secondary files to provide protection against corruption, though it becomes inefficient as file size grows.

Not sure how well those tools would integrate in DEVONthink and/or your workflow. One of Oakley’s apps is a command-line tool and can be called from an AppleScript. You might be able to find a way to automate creating checksums and periodically verifying files via smart rules, assuming DEVONthink preserves all xattrs within sync stores.

Git user here. Being able to recover older versions of your DEVONthink databases (or any databases) is a sensible idea. IMHO, the use of git is not the best approach, and I’ll try to summarize why.

The purpose of git is to let you store, track, and retrieve versions of files. However, it’s optimized for text files, which it stores not by keeping full copies of every version but by storing the differences from one version to the next. This is easy to do when a file is stored on disk as plain text (there are known methods for doing that), but as you yourself alluded to, it’s generally not possible to do if the file is a binary object such as a database or a movie file. Yes, non-text files can be stored in git, but then git has to store the entire file (albeit compressed) every time you save a version. The consequence is that space usage goes up dramatically if you edit/annotate a lot of PDFs, images, and other non-text files in your database.

Using git also requires manual action; it is not something that will automatically save a version every time you make a change in some other software like DEVONthink. You could arrange a scheme to do that (or DEVONthink could incorporate it), but at minimum, that’s going to lead to a lot of very small changes being saved all the time, compounding the problem described above. Could you take snapshots periodically, say every hour? That might help, but there remains another issue: database consistency.

A DEVONthink “database” actually contains individual files (the assets like the PDFs and other files), plus some kind of core database index (containing the metadata, tags, and other aspects that are not stored in the files). I don’t know the details of DEVONthink’s implementation, but that index is probably a binary object on disk. For reasons discussed above, git would have to save the whole index file when it takes a snapshot. But software systems like DEVONthink don’t save their state to disk constantly: doing so would be prohibitive performance-wise. Instead, some dynamic state is in computer memory, and only at certain points in time does that memory content get saved to disk. The issue is that no changes can take place to the index file on disk while a snapshot is being saved in git, because you need the disk contents to be in a consistent state at the time of the save operation. (I think this is what @cgrunenberg was referring to when he said the database would have to be closed before taking a snapshot.) The practical implication is that either DEVONthink flushes everything to disk and becomes temporarily unresponsive while the git save takes place, or it keeps running for a short time without saving changes to disk, which brings the risk of data corruption if something bad happens (e.g., power goes out) at the wrong moment.

The longer you wait between snapshots, the more accumulated changes to both the database index and some subset of individual file assets there will be since the last snapshot was taken. The more changes there are, the more work git has to do to save a snapshot, and thus, the longer the save operation will take. The longer it takes, the more irritated the user or the greater the risk elsewhere.

There are database systems that work around that – after all, database backups are not a new concept, and maybe DEVONthink could use such a system. But, given all the issues above (and probably many I didn’t think of), you can see that git per se is just not a good match for this purpose.


@mhucka, thank you for your insightful and knowledgeable response to the frequently suggested git concept.

Would it be possible to PM? I’d like to have a brief “blue sky” chat with you before I pen a forum response; we appear to be on the same page.

You can reach me using “sgbirch” on most channels (Twitter, gmail.com, Facebook etc).

If you do have time to ping me but I don’t respond a spam filter probably intervened on the channel you chose. Let me know here.

I’m currently in Peterborough, London TZ, 8 hours ahead of PST.

Given your expertise, could you also construct a guaranteed bullet-proof clause(s) for the updated license agreement for this proposed feature that would indemnify (or otherwise protect) the software company from any kind of court actions should anyone claim the storage guarantee didn’t work and caused them loss of data.

@sgbirch, as others mention, git is designed for file content that is largely text-based and can be tracked per line, not for blobs of data. This page discusses the reasons why git isn’t the correct choice for backing up “large” non-text files, and how bup attempts to work around them: https://stackoverflow.com/questions/17888604/git-with-large-files
http://episodes.gitminutes.com/2013/10/gitminutes-24-zoran-zaric-on-backups.html

One option would be to use a dedicated APFS volume for your DT store and then use snapshots within that volume to preserve known good copies. The way commercial database backup solutions work is to “stop” the database from writing while a “snapshot” is created; they then run a verification check on the snapshot copy, and if it passes, it is marked “good”. This may sound trivial to implement, and if it is, grab your favorite IDE and get to work. However, there is a reason that good backup software that can manage database integrity fetches a high price.


That is a valid point. I suppose the answer is to make the mechanism a hook into (say) a SHA1SUM generator. Then the guarantee is that the checksum is correctly created and nothing else. The fact that hash collisions can occur is well known, however rare they might be.