Maximum Database Size, Again

tharpold · July 19, 2022, 7:10pm

For many reasons having to do with my academic workflow these days – too many writing projects! – I’ve found that my primary use of DT3, as a repository for texts used in my research and writing (mostly PDFs, journal articles and books; I’m a prof in the humanities), has been to combine multiple large-ish databases into one or two very large databases that are, essentially, my local giant encyclopedia. This gives me maximal flexibility in searching, linking, tagging, replicating, etc. across a very large number of files, and makes it easier to discover all kinds of cross-connections in the fields in which I work.

I’m aware that doing this probably reduces DT3’s speed and efficiency in some basic functions but that loss seems to be more than made up by increased convenience and efficiency for the wetware end of application use (my brain): it’s like having one giant archive with everything present at the same time, albeit carefully structured with groups and tags.

So… I’m working on a regular basis with a couple of big databases (140 GB + 470 million words total, and 58 GB + 75 million words total). All of the contents of each of those are imported files (no indexed files). And then I have a fair number of much smaller databases, project-related, teaching-related, or archival in nature, none of which is bigger than 10 GB or so (most are much smaller than that), and many of which include indexed files from elsewhere on my computer (a 16 GB MacBook Pro Intel). I keep the two big databases open all the time and open or close the project and teaching databases as I require them. I practice scrupulous backup procedures: Time Machine, Carbon Copy Cloner, Arq, multiple redundancies saved.

What would I lose or gain if I further combined the two big databases into a single really big database? As I keep them open all the time, their words (total and unique) are always loaded; in effect, as I understand how DT3 manages this stuff, the performance penalties of using two large databases, when they are open all the time, are roughly the same as using one combined database open all the time. My guess is that there is a good bit of overlap in the unique words of the two databases, so merging them into one really big database, while it would increase the file size of the largest database DT3 manages, would not increase the total word count, and the merged unique word count would be lower than the sum of the two separate unique counts. (I don’t know how much that matters.)

Are there any obvious advantages in terms of memory footprint or other performance of going for the large, single database + satellite specialty databases? Are there any real and worrisome downsides?

(Made a couple of edits after initially posting to clarify some details.)

cgrunenberg · July 20, 2022, 8:33am

The memory usage of one large database is actually lower than that of several smaller databases that are always open anyway. Multiple databases have other disadvantages, like no replicants or duplicate recognition across databases, but of course also several advantages: better performance, lower memory usage if they are not always open concurrently, and faster synchronizing and I/O (verify, optimize, backup, etc.). They also make it easier to limit See Also & Classify results and to separate independent stuff (e.g. professional vs. private).

tharpold · July 20, 2022, 12:08pm

Thanks, cgrunenberg, I’ve merged the two databases into one (210 GB, 540 million total words, 5.5 million unique words) and will watch how performance is affected.

My computer is a 2017 MacBook Pro 13" Retina, with 16 GB of RAM, so DT3 performance with big databases is already an issue sometimes. Am hoping to upgrade to an M2 MacBook Pro early next year, but will have to accept trade-offs re speed vs workflow efficiency for how I primarily use DT3.

Can see already some reduced RAM use overall (a couple of GB saved). Searches, document navigation, etc., don’t seem much different. See Also & Classify is noticeably slower for files with many connections to other files – I expected that – but still usable. “Verify & Repair” and “Optimize” take significantly longer, but I don’t do those things more than once a week usually. The duration of backups via Time Machine, Carbon Copy Cloner, and Arq, all of which only back up changed files within packages, has not changed much at all. I already segregate working projects / teaching notes / personal files from the big “Bookshelf” database that now includes all my reference materials, so that will limit See Also & Classify to only research-relevant hits. Will be interesting to see what other advantages/disadvantages accrue over time of this one-big-library-or-encyclopedia approach.

cgrunenberg · July 20, 2022, 12:11pm

The memory usage is actually only comparable after verifying all databases; otherwise the usage might vary a lot due to caching.

matthias · August 20, 2023, 9:47am

It’s already been a year. Would you mind sharing more advantages/disadvantages of combining the two big databases?