Splitting DBs or not

I’ve been using DT for about 3-4 years for all my personal and business receipts. Everything imported is a PDF file. I’ve always kept each year as a separate db, for size reasons and out of a general paranoia that something larger will crash more often than something smaller :smiley:
The only drawback is that when searching for something I have to know which year to search. As I add years to the dbs (and get older), that seems to be more and more of a challenge.
So as a test I made a combined db of the last 4 years of receipts and found the combo is smaller than my ’07 db alone! The combined db is just over 900 megs without backups; ’07 was almost that size by itself.
How can this be? Am I better off keeping all the years as separate dbs or should I combine?

Your key phrase in comparing the sizes is “without backups”. The default setting for internal Backup folders is three, and a new database doesn’t yet have any content in those Backup folders. In version 1.x all text-type files, including text (plain, rich and other text formats), HTML and WebArchive, are stored in the monolithic database, and so are also copied into the internal Backup folders (as are images contained in RTFD and WebArchive files).

Other Imported file types, including PDF, PostScript, images, QuickTime media and “unknown” file types, are stored in the internal Files folder and are not duplicated in the internal Backup folders, except for their text content, which is included in the monolithic database and therefore is duplicated in the Backup folders.
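If you’re curious how much of a database’s disk footprint is Backup folders versus the Files folder, here’s a quick sketch in Python. The package path and the internal folder names (Backup, Backup0, etc. plus Files) are assumptions based on a Pro-style database; adjust for your own setup:

```python
import os

def folder_size(path):
    """Total size in bytes of all files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass  # skip unreadable entries
    return total

# Hypothetical database package path; point this at your own database.
db = os.path.expanduser("~/Documents/Receipts2007.dtBase")

for entry in sorted(os.listdir(db)):
    full = os.path.join(db, entry)
    if os.path.isdir(full) and (entry.startswith("Backup") or entry == "Files"):
        print(f"{entry}: {folder_size(full) / 1024**2:.1f} MB")
```

Summing the Backup folders this way makes the “without backups” difference visible at a glance.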

When you open a database the monolithic database is loaded into memory. But the internal Backup folders are not loaded into memory, nor are the contents of the Files folder. Those internal Backup folders do require additional disk storage space, but there’s no memory use penalty for the security they provide.

So the real issue in deciding whether to separate databases by year or aggregate them is whether your computer has enough physical RAM to run the aggregated database quickly.

Each user must be the judge of whether database performance is satisfactory. As I frequently say, I’m spoiled. I want instant gratification: fast response. That’s especially important for my main database, the one that is open most of the time and that reflects my main interests.

For that database my criteria are: 50 milliseconds or less for a single-term “ignore case” search, an initial lag of no more than 3 or 4 seconds the first time I invoke Classify or See Also, and essentially instantaneous appearance of successive Classify or See Also suggestions. Note that a single-term Exact search is faster: on my ModBook with 4 GB RAM a search for “Indiana” produces 471 results in 0.006 seconds. Of course, an Exact-string Wildcards search for “Thus,” takes longer: 2.056 seconds to find 1,184 items. An illustration of how many writers start a sentence with the string “Thus,”. :slight_smile:

At the moment I have 1,485.6 MB free RAM and have had 0 pageouts since the last restart 30 hours ago. I’ve got several other applications open. Note that my ModBook is a custom Mac tablet computer modified by Axiotron based on a lowly MacBook, and not even the latest generation of MacBook. It’s not a supercomputer in the current Mac product line. (But it’s incredibly more powerful than the CDC 6600 mainframe computer that I used many years ago.)

That main database contains more than 25,000 documents with a total word count of more than 29 million words. I just optimized it (using the Backup Archive script). The disk storage space is 4.64 GB, including 3 internal Backup folders.

The limiting factor for database performance is whether available free RAM can satisfy the database’s memory requirements; that’s far more important than CPU speed. I can run this database on my TiBook with 1 GB RAM, but I encounter more and more frequent slowdowns when I work it hard. The same database runs acceptably on my MacBook Pro with 2 GB RAM, but with occasional slowdowns. On the ModBook I’ve still got enough RAM headroom for significant growth of the database without performance lags. Slowdowns occur when a procedure runs out of physical RAM and begins to use Apple’s Virtual Memory, which swaps data back and forth to disk to complete the procedure.
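Those free RAM and pageouts numbers are easy to watch yourself. A minimal sketch that reads them from macOS’s vm_stat command (the parsing here is an assumption; vm_stat’s output format has varied across OS X releases):

```python
import subprocess

# vm_stat reports memory statistics in pages; the page size is in its header.
out = subprocess.run(["vm_stat"], capture_output=True, text=True).stdout

page_size = 4096  # default; the actual size appears in vm_stat's first line
stats = {}
for line in out.splitlines():
    if "page size of" in line:
        page_size = int("".join(ch for ch in line if ch.isdigit()))
    elif ":" in line:
        key, _, value = line.partition(":")
        stats[key.strip()] = int(value.strip().rstrip("."))

free_mb = stats.get("Pages free", 0) * page_size / 1024**2
print(f"Free RAM: {free_mb:.1f} MB")
print(f"Pageouts: {stats.get('Pageouts', 0)}")
```

A steadily climbing pageouts count while DEVONthink is working is the telltale sign that the database has outgrown physical RAM.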

I’ve got a number of other databases, managing more than 150,000 documents in all. I create topical databases containing clusters of information related to my interests and needs. That’s done for two reasons: to keep the performance of the databases fast on the computers that I use, and also to improve the focus of See Also suggestions for some of my collections of references.

What about database integrity? Your other point was a concern for safety of your data. My experience is that databases are very stable and reliable on a stable computer. I keep a pretty stock OS X operating system (no haxies or other hacks). I run routine preventive maintenance on the operating system and disk directory. Within the last 3 years I’ve had to resort to a backup only once, when I experimented by installing an Input Manager plugin that a user had installed on his computer – it scrambled my database, too.

I have a Belt & Suspenders attitude. When I’ve added a batch of new content to a database I run Tools > Verify & Repair. If there are no errors I then run Tools > Backup & Optimize. I don’t wait for a scheduled backup when I’m adding batches of content.

Once in a while, when I’ve been making changes to a database (new content, writing and editing, reorganization), I’ll invoke Scripts > Export > Backup Archive at a convenient break time. When I return from break the database has been verified and optimized, and has current internal and external backups. But what if my hard drive were to fail, or my laptop were lost or stolen?

I use Leopard’s Time Machine. But what if someone were to steal my computer equipment while I’m away from home? Or my house catches fire?

Backup Archive files are my answer to both those questions. Periodically I’ll save recent Backup Archive files to DVD and drop the disc off at my bank. That’s added insurance. My databases are far more valuable to me than the computers that host them. Off-site storage is cheap (free except for the cost of discs) compared to the insurance premiums I pay on my house and car, and the potential payoff is comparable.

Backup Archive isn’t available to users of DEVONthink Personal. But the DT PE database, the folder named DEVONthink at ~/Library/Application Support/, can be copied as a zipped file and stored elsewhere. Caution: Always Quit DEVONthink before making a Finder copy of the database. That caution holds for Pro/Pro Office databases, too.
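For those who like to script that copy-as-zip step, here’s a minimal Python sketch. It assumes the default DT PE location named above and a date-stamped zip on the Desktop (both adjustable); quit DEVONthink first, per the caution above:

```python
import datetime
import os
import shutil

# Default DT Personal database location; adjust if yours differs.
db_folder = os.path.expanduser("~/Library/Application Support/DEVONthink")

# Date-stamped archive name, e.g. DEVONthink-backup-2008-03-15.zip
stamp = datetime.date.today().isoformat()
archive = os.path.expanduser(f"~/Desktop/DEVONthink-backup-{stamp}")

# shutil.make_archive appends .zip and archives the DEVONthink folder.
shutil.make_archive(archive, "zip",
                    root_dir=os.path.dirname(db_folder),
                    base_dir=os.path.basename(db_folder))
print(f"Wrote {archive}.zip")
```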

Wow, thanks for the detailed response, Bill.
You’re right about the internal Backup folders; they do take up space that I’m not seeing right away in the new combined db.

As for RAM, I do a lot of Photoshop work so I have over 8 gigs. The combo opens fast and runs smoothly! I run Cocktail every day and try to keep my system well maintained.

About 6 months ago I had an issue with one of my yearly dbs, and since then I’ve been paranoid about integrity. In retrospect, I think what might have happened is this: since I run my db from an encrypted disk image on my hard drive (7.9 gigs, sized for double-layer DVD backup), I may have been running low on free space inside the image while working in DT, and that caused the issue. I had no problems before or after that event, but it was enough to give me the willies.
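One way I could guard against that: check free space on the mounted image before a work session. A quick sketch, assuming the image mounts at a volume named “DT Receipts” (hypothetical; use your own mount point):

```python
import shutil

# Hypothetical mount point of the encrypted disk image.
volume = "/Volumes/DT Receipts"

usage = shutil.disk_usage(volume)
free_mb = usage.free / 1024**2
print(f"Free space on {volume}: {free_mb:.0f} MB")

# Warn well before the image fills up; DT needs working room
# for its monolithic database and internal Backup folders.
if free_mb < 500:
    print("Warning: low free space -- consider a larger image.")
```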

What has been interesting about combining the years is finding all the duplicate files that I could convert to replicants and then delete, which has saved me some room. One function I wish DT had: select a duplicate (shown in blue) and see the path to the other copy. That would save time when dealing with duplicates.
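In the meantime, a sketch of a workaround outside DT: hash file contents in a folder of exported receipts and print the paths of byte-identical copies (the folder path is hypothetical, and this only catches exact file duplicates, not DT’s broader duplicate detection):

```python
import hashlib
import os
from collections import defaultdict

# Hypothetical folder of exported receipts; adjust to your own.
root = os.path.expanduser("~/Documents/Receipts")

by_hash = defaultdict(list)
for dirpath, _dirs, files in os.walk(root):
    for name in files:
        path = os.path.join(dirpath, name)
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                h.update(chunk)
        by_hash[h.hexdigest()].append(path)

# Print each set of identical files with their full paths.
for paths in by_hash.values():
    if len(paths) > 1:
        print("Duplicates:")
        for p in paths:
            print(f"  {p}")
```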

John