Avoid rewriting database files, to allow syncing databases

I use unison and rsync to copy databases between computers and to a backup disk. In the current implementation of DTPO, I have two problems:

  1. Suppose computers A and B have identical copies of xyz.dtbase. I open xyz.dtbase on both computers but make changes to its contents only on A; on B it is view-only. However, xyz.dtbase is updated on both machines, and therefore unison gets confused as to which copy should overwrite the other.

  2. Suppose I have a local computer A and a remote storage server, to which I back up data using rsync over ssh. I open many database files in a day, but make changes to the contents of only some of them. However, all database files that were opened get rewritten, so rsync sends all of them. This unnecessarily adds to the backup time and uses up more bandwidth.

Could there be a switch to suppress this rewriting of the database files when no editing has occurred?
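For context on why this matters: by default, rsync (and unison) decide whether to re-send a file from its size and modification time alone, not its contents. A Python sketch of that “quick check” (the function name is mine, purely illustrative):

```python
import os

def needs_transfer(src: str, dst: str) -> bool:
    """Sketch of the 'quick check' rsync performs by default: a file
    is re-sent when its size or mtime differs, regardless of whether
    the bytes actually changed. So a rewrite with identical content
    still triggers a full re-copy."""
    if not os.path.exists(dst):
        return True
    s, d = os.stat(src), os.stat(dst)
    return s.st_size != d.st_size or int(s.st_mtime) != int(d.st_mtime)
```

This is why a database that is merely “touched” on close gets re-sent in full on the next backup run.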

I’m convinced that the logical problems involved in resolving differences between two databases are so deep and pervasive that I don’t trust any software to handle them, at least without human intervention. That’s true even in the relatively simple case of a TextEdit file on two computers. Suppose that I make different edits to both files. Think about it. So I “equalize” two databases using the History tool and export/import changes between the two databases. That can result in some duplicate files, which I think is the logically correct result.

I don’t know. Does rsync use the “last opened” mark as a cue? But it seems to me that information can be important to some database users.

Most of the schemes I can think of would require a change to rsync.

I’m not asking you to make that judgement. I’m merely asking not to rewrite these database files when nothing is changed.

Unison and rsync make smarter decisions than you assume. I have synced my data between my home office, my lab office, a laptop, and a pocket HDD every day (indeed several times a day) for more than two years, and there has been absolutely no problem with this. Unison asks about or skips (depending on what you specify on the command line) files for which it does not know which copy is newer. DEVONthink rewrites databases that didn’t get changed, and this presents a problem.

That’s not the point. If you use TextEdit, you don’t “save” files to which you made no change, and any reasonable tool like rsync or unison will know which copy is newer. I’m merely asking that DEVONthink not save when there is no change made to the database contents.
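The requested write-if-changed behavior is simple to sketch; here is a minimal, hypothetical Python illustration (my own code, not DEVONthink’s):

```python
import os

def save_if_changed(path: str, data: bytes) -> bool:
    """Write `data` to `path` only when the bytes differ from what is
    already on disk. When the content is identical, skip the write
    entirely, so the mtime stays untouched and sync tools see nothing
    to copy. Returns True if a write actually happened."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            if f.read() == data:
                return False  # identical content: no write, no mtime bump
    with open(path, "wb") as f:
        f.write(data)
    return True
```

The cost is one extra read-and-compare per file on close, which is cheap next to re-copying the whole file over the network.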

Last opened timestamp is irrelevant. Last modified timestamp is.

DEVONthink keeps database files named DEVONthink-{[1-9],10}.database under each database directory, and when the database is opened and then closed, all ten of these files are rewritten even if no change was made to the database or its contents during a brief, view-only session. Being rewritten is not the same as merely being opened and closed. I’m merely asking that these database files not be rewritten when nothing is changed.

They don’t. Unison and rsync are smart enough to work with Devonthink reliably. It just requires that files that didn’t get updated not be rewritten.

Some of your points taken. :slight_smile:

However, there will be times when database files have to be rewritten even if the user makes no changes to content. Backup and optimization are obvious cases.

The changed database structure in version 2.0 should considerably reduce the size of the files required to be rewritten in your backup scenario. In the current database structure, text (plain and RTF), HTML, and WebArchive files are stored within the “soup” of the monolithic database, which can become very large for that reason. In the future version of the database, those files will be stored individually in the Finder. The database files per se (including the Concordance and AI code) will become much smaller.

You missed my point about differing edits of the same TextEdit file on two computers. A backup decision based on the most recent modification time would be wrong. A skip would be wrong. Possibly important information would be lost. Both versions of the file should be retained. That’s why I referred to equalization rather than synchronization of the information on the two computers used to edit the file.

Yes, it might be possible to set up rules for requesting a user decision in such cases. But it becomes difficult for the user to make the decision in the absence of an opportunity to review the documents, especially if multiple files are involved – an automated procedure could become a frustrating experience with high potential for judgement error.

I’ve seen that issue in my own databases and in other users’ databases in cases where copies of a database on different computers have been used and both copies have been modified.

So I try like heck to avoid differing working copies of the same database on two computers. Sometimes it happens. That’s when I use Tools > History to select new or modified content in each database, and File > Export > Files & Folders to send that content to the other copy of the database. Now both copies have the same content (although the copies will have different modification date/times).

That’s not a problem. That’s a user-initiated operation, and most likely the data elements in the database files are rearranged and therefore need to be rewritten.

That’s nice, but not absolutely necessary for what I am trying to do here. The fact that DEVONthink “touches” database files and changes the last-modified timestamp confuses unison into requiring user intervention, which could be avoided if DEVONthink behaved as I requested above. The database files can be quite big (at least in my usage), but that’s not problematic at all. They add extra traffic to backing up and syncing, but that’s also not a big deal. That DEVONthink changes the mtime even when the database files are bitwise identical is the source of the annoyance.
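For what it’s worth, exactly this case (same bytes, different mtimes) is easy to detect. A small Python sketch (the function name is my own):

```python
import hashlib
import os

def identical_but_touched(path_a: str, path_b: str) -> bool:
    """True when two files hold the same bytes but carry different
    mtimes, i.e. the case that forces unison to ask the user which
    copy should win, even though nothing of substance changed."""
    def digest(p: str) -> bytes:
        h = hashlib.sha256()
        with open(p, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 16), b""):
                h.update(chunk)
        return h.digest()

    same_bytes = digest(path_a) == digest(path_b)
    same_mtime = int(os.stat(path_a).st_mtime) == int(os.stat(path_b).st_mtime)
    return same_bytes and not same_mtime
```

(rsync can be told to compare contents with its `--checksum` option, but that forces a full read of every file on both ends, which defeats the point of a fast incremental backup.)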

I’m not talking about cases where two sets of database files receive different edits at all (see my original posts). At least one end is a static, unused copy or is used for viewing only. Yet, due to the mtime change, unison gets confused and requires the user to select which copy to keep.

Besides, if the file is plain text, Unison has a way to merge diffs. I won’t explain it here, but I suggest you look into Unison, which is a very useful program for anyone who uses multiple computers daily.

I’m not demanding a perfect, universally applicable, and scalable solution. I’m merely seeking a solution that can cover what one individual can achieve on multiple computers in a day’s work. I’m perfectly happy to make manual adjustments or manual consolidation for the very small fraction of the data I generate every day that can’t be handled by Unison’s intelligent judgement.

Again, that’s not the scenario I’m talking about, as you’ll find if you read my first posting here. If the database structure is updated in version 2 as you described, it only expands what you can do with a simple tool like Unison. You’ll be able to modify two copies of a database and consolidate them later automatically, as long as you don’t edit the same file across copies of the database. But at this point I don’t need that level of operation.

Yeah, I did switch from your original post to a discussion of multiply modified copies of a database. And yes, Unison is neat, but I can present it with issues for which it will make the wrong decision.

The problem remains that in the current database there are housekeeping chores on a database that require writing to disk, without user intervention and even if the database isn’t modified (in content and organization). And scheduled backup will initiate writing to disk.

One of the neat features is that the database remembers which windows (views and documents) were open at close (if Preferences is so set). That requires writing to disk. Features such as that would have to be foregone to do what you are asking. There are a number of things a user may do while opening and viewing documents – without changing content – that can trigger writing to disk, such as changing the apparent (as viewed) font size. The database remembers that. You might switch among view types during viewing of content. The database remembers that.

That’s why I commented that there might have to be a new convention for marking a database as unchanged (in content and organization, but not, perhaps, for remembering open windows) that would be written to disk, and a corresponding modification of rsync to recognize the new convention. But I think that the revised database structure in version 2.0 will reduce your problem of long backup times; the issue won’t be entirely moot, but less aggravating.

What about SuperDuper! smart backups?

Setting aside the “sync” problems (which weren’t listed as a problem in the first place), the issue here is that metadata (the state of windows, the “openness” of the database) is stored as changes in the database.

There are two ways around this that I can see:

  1. Move metadata into a secondary file and bundle the core database and metadata file as an OS X package. Sync tools will end up copying both when “real” changes happen, but when the database is just opened and a few files opened then closed, the only thing that gets copied over is the metadata file (which, presumably, would be really small).

  2. Make these types of database writes non-touching writes. In other words, keep the “last modified” date the same after making such a change. This is kinda “cheating” and I wouldn’t advise building the system this way, but it would achieve the purpose of not triggering a sync when there is nothing of substance to sync (and window positions, etc., are not at all likely to be “important” from a syncing perspective…)
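Both options above can be sketched in a few lines of Python; all file and function names here are illustrative, not DEVONthink’s:

```python
import json
import os

CORE = "Core.database"      # illustrative name for the big monolithic file
SIDECAR = "ViewState.json"  # illustrative name for the small metadata file

def save_state(package_dir: str, state: dict) -> None:
    """Option 1: volatile view state lives in a small sidecar file
    inside the package, so a view-only session never touches the
    large core file; sync tools re-copy only the tiny sidecar."""
    with open(os.path.join(package_dir, SIDECAR), "w") as f:
        json.dump(state, f)

def write_without_touching(path: str, data: bytes) -> None:
    """Option 2 (the 'cheating' variant): rewrite the file but restore
    its previous atime/mtime, so mtime-based sync tools see nothing to
    copy. Note that a checksum-based compare would still spot the new
    bytes, so the two ends could silently diverge."""
    st = os.stat(path) if os.path.exists(path) else None
    with open(path, "wb") as f:
        f.write(data)
    if st is not None:
        os.utime(path, (st.st_atime, st.st_mtime))
```

The sidecar approach is the safer design: the mtime then honestly reflects what changed, rather than hiding a real write from the sync tool.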

I use ChronoSync for my backup/syncing (my DT database is just backed up, not synced, for whatever it’s worth), and have the same problem. In fact, I’d wager that every user who has a sensible, incremental backup sees the exact same problem! Just opening the database file means the whole thing gets copied over again on the next (fundamentally incremental) backup. This is a pain in the rear, and only gets more painful as the size of my database grows.

Workaround: Want to have a copy of a database on a second computer that’s only for viewing and that won’t trigger all these problems? Try copying the database to the second computer, then choose Command-I to open the Info panel and lock the thing. Now it’s read-only. You can do searches, read documents and use See Also. But you can’t modify the content and the modification date and last opened date won’t change.

Otherwise, for unlocked databases…

As noted, the revised database structure when version 2.0 is released will reduce the size of the database file that is changed by opening a database that is only viewed, without modification of documents.

For example, my main database contains thousands of text, HTML and WebArchive documents in the ‘monolithic’ database file. So that’s a big file to synch. In version 2.0 all of those documents will be moved out and stored as individual files in the Finder, inside the database package file. That will reduce the size and memory footprint of the database file per se. A simple backup synch of one database package file won’t take as long in version 2.0.

The initial post in this thread did raise the issue of syncing two copies of a database, only one of which had been modified by content change, to a common backup. Of course, synching issues should also be remembered in a case where both copies of the database have been modified, if only because it is so likely to happen.

Because a database ‘remembers’ features and settings that are convenient to the user, there will be changes made on disk when one opens and views a database. (Cache files stored to disk on the two computers may well differ even if the two copies of the database have both only been read and searched, without modification of document content. But perhaps the synch software can ignore cache files.)

Even splitting out those changes into a separate small data file will still present some logical problems to synch software if two copies are to be ‘merged’ as one backup. Version 2.0 will reduce the size of the data files that pose problems, but not the logical issue of which database file should be ignored or overwritten. A simple “most recent” decision could be wrong.

That’s the problem with generalizations. Some pesky reader will come up with a counterexample.

I do daily automatic offsite backups. The backup software (BackJack, for those interested) is automated enough, and polite enough about system resources, that I’ve never noticed any particular issues around DT Pro’s way of doing things. My own generalization is that if your backup solution isn’t seamless, it’s probably time for a new backup solution.

Katherine

I just want to throw in my two cents here and add that I heartily agree. I have a ~1 GB database that I want to store on my iDisk, as I move frequently between desktop and laptop. But it is so slow to sync that I really question the value – every time I open DTP I have to write 100 MB over the network even if I haven’t changed a thing. That’s just a huge pain… not to mention wasteful of resources.

I understand the argument about prefs and what not, but would suggest that those belong in a separate file – everyone else uses plists, why not DTP? I don’t even mind if DTP wants to do it once in a while so it can optimize. That’s fine, but every single time is really annoying.

I’m glad to know that so many people had the same problem as mine.

As mentioned above several times, I think rewriting the database files when there is no change in the contents is unnecessary and provides little advantage. I personally don’t care if the cursor position isn’t saved, particularly because cursor-position information isn’t effectively utilized even within a single session in the same database when you move between multiple files, whether plain text, RTF, or PDF. I requested that this information be preserved a couple of years ago, but DEVONtechnologies doesn’t seem to be interested. So why should the database files be rewritten to preserve information that isn’t used?

I’d rather save the time for data transfer, save bandwidth, save per GB charge on online storage service, and save battery life on my laptop.

Just a suggestion: if you need to copy database files between computers, ZIP them first, because the archive files will transfer far more rapidly. I do this routinely when I take my DNote file on the road. And it will also work with DT projects, though they are often much larger.

The procedure is to locate the DNote folder in Library: Application Support, select the folder, and then Finder: File: Archive. Copy the ZIP file to a flash drive (or park it on your .Mac space), and then on the laptop, copy the file to Application Support, trash the old DNote folder, and decompress the ZIP file. You should also remove the ZIP files from both drives.

That doesn’t really help, because the bulk of the database package in my case is PDF files that don’t compress much. Also, with a modern hard drive and a fast network connection, just copying the data is faster than compressing it first. When I use a slower network, I use unison or rsync over ssh (with compression turned on) to skip files that don’t need to be copied and compress only those that do. Zipping the whole database package is much less efficient because you end up compressing and copying files you don’t need to update.

Think of the database structure of the DT 1.x applications as a monolithic “soup” which holds, in a way that cannot be differentiated by the Finder, all the text, HTML, and WebArchive files in the database. That includes the images and video and audio media that may be contained in RTFD and WebArchive documents. And of course the “monolithic” database file also includes the database code – the mechanics and AI features. Any change in this monolithic file requires recopying for incremental backup.

Also included in the database package file are PDF, postscript, images, QuickTime media and “unknown” file types, but those are stored in the Finder. If these files are not modified, they need not be recopied in incremental backups.

A future upgrade of the DT applications will substantially modify the database structure. All documents will be stored as files in the Finder. That will reduce the size of the database itself, as well as substantially reducing memory requirements for opening a database. That, alone, will reduce the size of incremental backup copying.

The idea was raised of using .plist files to store information such as the state of open views and documents. If stored externally to the database – perhaps as preferences in the User Library for each database – the downside would be that moving a database to a different computer would lose that information. But that information can be important, e.g. as part of the instructional design of a database for teaching purposes. And I can foresee advantages to such a design under Leopard.
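For what it’s worth, reading and writing such a property list is straightforward with Python’s standard plistlib; the keys below are illustrative, not DEVONthink’s actual schema:

```python
import plistlib

def save_window_state(plist_path: str, state: dict) -> None:
    """Persist per-database window/view state in a property list
    external to the database itself. Keys are hypothetical examples
    of the kind of volatile state discussed in this thread."""
    with open(plist_path, "wb") as f:
        plistlib.dump(state, f)

def load_window_state(plist_path: str) -> dict:
    """Read the state back; returns the same dict that was saved."""
    with open(plist_path, "rb") as f:
        return plistlib.load(f)
```

If the plist traveled inside the database package (rather than in the User Library), moving the database to another computer would keep the window state while still keeping the big database file out of every view-only sync.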

I’ve been running Time Machine to an external FireWire drive attached to my MacBook Pro. It’s so easy and seamless that it requires no attention. It will be even more transparent with the future version of DEVONthink applications, as one will be able to see individual files for recovery. I like the frequency of automatic hourly backups when I’m working. And my 500 GB backup drive will never run out of space, unless I wish to maintain historical archives for some reason. My first hard drive was much larger (physically) and held only 5 MB, but cost more than 8 times as much as the 500 GB drive.

Bill,

I think you are changing the point. Some of us are arguing that rewriting database files when there is no change is unnecessary, costly, and inconvenient. What we would like to hear is that DEVONthink will stop this useless and costly behavior. How the database’s data structure changes in version 2 is a totally different story, and even if the database file gets smaller, my argument still holds: DEVONthink should stop unnecessarily rewriting database files. I don’t need to store cursor positions or window configuration, especially because this information is not utilized effectively, and the cost and inconvenience of rewriting the database outweigh the advantage of saving such information. Regardless of the sizes, rewriting database files when there is no change in content is obnoxious when using unison and other smart file-system synchronizers.

I’d also add that my 1 TB external RAID unit is almost full due to large PDF files made with DTPO’s scan and OCR. I really hope that DTPO saves PDF files with the JBIG2 compression algorithm, like Acrobat CS2. (This was the subject of another post of mine.)

There won’t be significant changes in the DT 1.x applications, other than maintenance updates for Leopard compatibility.

We’ll see what happens next. :slight_smile:

DEVONthink doesn’t do this, it’s only updating necessary parts.

This information is not part of the database.

Anybody who uses iDisk, or WebDAV in general, on the Mac should know that Apple’s implementation is the most horrible imaginable and results in many file operations on the WebDAV side for each one on your hard disk (even though the WebDAV protocol allows for greater efficiency).

An example: if you rename a file in the Finder, over WebDAV on the Mac this will result in the creation and deletion of an empty file, a copy of the old file to the server, and a rename of that file. This is why Apple “created” the option of syncing iDisk with an on-disk representation, to make you think things go faster.

Backing up large or many files to iDisk on a regular basis is really not a good idea. The workaround suggested earlier in this thread, compressing your database into a single ZIP file, is really your best option if you want to go this route.

It did until version 1.3.3. It appears that this behavior was quietly corrected in the 1.3.4 release that just came out: the database files, including their timestamps, are no longer modified after view-only use of the database. This is a highly appreciated improvement.

I was referring to the folder open/close state etc. mentioned by Bill in his earlier postings. This info seems to be saved somewhere else. Anyhow, I and the others who use rsync, unison, and similar tools will most likely welcome this change in version 1.3.4.