Database - why use more than one?

thelostone · November 1, 2014, 10:54am

I have DEVONthink Pro and I was wondering what are the pros and cons of using one database vs multiple, because one could also arrange data within groups inside a database.

I could think of only three reasons:

1) RAM / performance issues on slower computers if the database file is too large
2) Work and Personal, i.e. entirely unrelated items that must be kept separate
3) Temporary vs permanent (feeds or other items that one would mostly delete later, saving only some, can be kept in a separate database)

For me, probably only no. 3 could be applicable. But what is the functionality loss in keeping items in groups within one database vs different databases? (Like one can’t replicate or search across databases?)

Bill_DeVille · November 1, 2014, 6:07pm

Your comment that one cannot replicate items across databases is true. However, given the organization and topics covered in my multiple databases, there have been a few cases in which I’ve stored an item in more than one database. Not often, perhaps only a handful of instances.

Your comment that one cannot search across databases isn’t true. I prefer using the full Search window for searches, and it can search across all open databases if desired (as well as making inspection of all search settings easier). If each database has the option to create Spotlight indexing checked, a Spotlight search can be performed to search across all databases, including closed databases. Tip: when Spotlight displays the search results, choose the option to view all results. A Finder window opens, and the results that are within a DEVONthink database will display the blue shell icon. Select such a result and press Space to see a Quick Look display, or double-click to open the item within the DEVONthink environment.

Your comment 2) about separation of work and personal information can be meaningful. When I first started using DEVONthink back in 2002 I worked in a governmental agency as a division administrator. Under law, all work-related issues, especially those involving regulatory issues, were subject to subpoena. I kept such data in a separate database (yes, that was possible, although unwieldy, back then and became much easier when DEVONthink Pro was introduced). There’s another set of data that I’ve always kept separate, relating to past activities in conducting seminars that included federal personnel in the NSA, CIA, Secret Service and DoD. Even the names of some of those persons remain sensitive information.

Let’s talk about your reason 1) for multiple databases. A single database may well be feasible in terms of performance when one starts out. When I first started using DEVONthink, I was using a TiBook with 1 GB RAM and fairly limited storage capacity. As I continued adding content I stretched the capacity of the TiBook to its limits. My next Mac had 2 GB RAM and a larger drive, then 4 GB, then 8 GB and my current MacBook Pro Retina has 16 GB RAM and a 500 GB SSD. But over the years I’ve accumulated more than 250,000 documents in a number of databases, some of which I rarely need to access. Were I to try to manage that collection in a single database, my current Mac would slow to a crawl, and I suspect it would frequently run out of free RAM even with a capacity of 32 GB RAM. For that matter, if I stored all those databases on my computer’s internal drive, its capacity would be strained. Many of them are kept on an external drive, available when needed.

Designing multiple databases allows for growth of document collections in manageable ‘chunks’ that can fit into the capacity of one’s computer.

Now let’s talk about your comment that topics within a single database can be separated by group. Yes, that’s true. But it can result in less effectiveness of the AI assistants that I often use, especially See Also and Classify.

My main database is the one I most often use for research and writing. It reflects my professional interests of long standing in environmental sciences and technologies, environmental case studies, policy issues and environmental law and regulation. It contains about 30,000 documents covering a number of scientific and engineering topics, case studies, policy papers and reports and environmental law and regulations (principally U.S. and EU). The total word count of this database is more than 40,000,000 words (about the size of the Encyclopedia Britannica). I’ve been lovingly pruning and updating this collection for many years.

A separate but related database contains methodological topics such as environmental sampling procedures, sample analysis procedures, environmental data analysis procedures (statistical and quality assurance procedures), risk assessment methodologies and cost-benefit methodologies.

Why did I separate those materials? One reason is that combining these two large databases would result in performance slowdowns. But that’s not the really important reason. If, for example, I’m researching a topic such as mercury contamination in edible fish, I want to look at toxicology, case studies of human health impacts and existing or proposed regulatory limits. When I do a search or invoke See Also in that database, I don’t want to be distracted by numerous references to sampling procedures, sample preparation, analytical procedures &c. Likewise, when I’m researching issues related to sample preparation for analysis of mercury in fish and choice among analytical procedures in the other database, I wouldn’t want to be distracted by essentially irrelevant issues including toxicology of mercury &c.

By separating the topical coverage of these databases, I’ve increased the effectiveness of DEVONthink’s tools for helping me work with the information content of each of them. I’ve got a number of other databases that meet other needs and interests. I don’t capture every file on my computers into a database, however, as many of them are of little or no importance for adding value to my databases.

DEVONthink provides two very distinct methods for capture of documents. Import copies the documents into a database. Index creates Paths to the documents, but the actual files remain external to a database. Both are valid approaches. My personal preference is for Import, so that my databases are self-contained. When I capture a document from the Finder, I’ll delete the original file, or archive it to an external drive. I don’t use citation manager software. I’m old school, having worked with publication of massive bibliographies and with citations in published papers in the days before computers were available to individuals. I make sure that the information necessary for footnotes or citations of every document I collect is available, if needed. My databases have always been very stable, and of course I use a multiple backup strategy (Time Machine backups and also Database Archives of the most important databases).

As always, others are free to consider my personal preferences and workflows eccentric, and to choose approaches that work for their own needs.

thelostone · November 2, 2014, 9:52am

Very interesting.

Thank you very much for the detailed reply

jprint714 · November 17, 2014, 4:20pm

I have a similar query about this…

So far, I’ve created one DB for work, which includes all of my projects and articles (divided into groups and corresponding subgroups), as well as my group and subgroups for my book project. I’m not sure if it would be prudent to create another DB for my next book (just starting) or if I should instead just create a group and subgroups for the next book and save it in the “Work” DB. This book will have material that relates to the old one, but be different.

And then the question is where would it be best to have tags… I’ve used DTP for years, but haven’t really used Tags – yet! I’m starting to re-think how to use them for annotation and organizational purposes. When I do set them up properly, should they stay in one unified location (where, as time goes by, the number of tags would grow by large numbers)? Or is it best to have them in a separate DB where a newly created tagging system would be neatly self-contained in one location (though would also be divorced from other, overlapping areas)?

Also, I’m basically using the default organizational set up that DTP produces whenever one creates a new database. But…I’m not sure if that’s the best system, and if there’s a better way that would suit my organizational and research needs – nor am I sure how best to determine what would best suit me.

I welcome any ideas / suggestions on these matters… Thank you!

korm · November 17, 2014, 6:04pm

Apart from technical considerations about the machine you are using, my rule of thumb about creating a database is that it should be the “largest self-contained repository for related information”. So, for example, I create different databases for work I do for different offices within the same client business – because each office’s work is self-contained and documents for one office are rarely used for another office. So I will have a different database for offices “A” and “B”. But, for a given office, I don’t create different databases for different projects because the information I collect or create for one project is generally useful or relevant to other projects for that office. So, within the “A” database I’ll have group hierarchies for the different projects I oversee for the “A” office, as well as groups for general information or research about “A”'s issues and concerns.

When I create a new database, I don’t concern myself with structure. Structure in a database emerges over time as I work with the information I gather through research, the documents I create, etc. Periodically I’ll step back and see if the structure in the database needs rearranging and I’ll spend an hour or so pruning and adjusting – but usually I do this only a couple or three times a year. Emergent structure/organization of data is the best practice I think and works best with the DEVONthink UI.

I use replication extensively – but I am aware there is an enormous potential downside with replication: the more I replicate the more I am locked into that database and its content. It is not easy to split a database where replication is widely employed. There is no reason not to replicate – just be aware of the lock-in factor.

If I have documents relevant to more than one database (say, notes or scanned workbooks) I’ll index them into multiple databases and not import them.

Tags are not sharable between databases (except in a particular case regarding the Global Inbox). I don’t recommend copying or moving tags between databases because of some unwanted side effects related to inadvertently copied documents. I’ve written here before about not being a fan of tags – the effort to manage and curate them is an order of magnitude more than their value, IMO. But, if you want to explore their use, start small. If you need the same tag in more than one database just tag the two (or more) documents in those databases individually. You’re going to have to do that anyway, and DEVONthink creates the tags on the fly. I’ve never seen the point of having nested tags.

Bill_DeVille · November 17, 2014, 6:36pm

I’m in agreement with korm on his point that group organization is something that tends to evolve over time in most of my databases, rather than something I spend a lot of time designing a priori, when a database is first created. Once in a while I’ll dive into a reorganization of material when my uses of the content have become apparent in practice. As most of my databases are entirely self-contained (contain no Indexed items), I’m free in those databases to move things around without the complications that might result from Indexed content.

And like korm, I don’t spend much time and effort on tagging, for the same reason. My usual use of tags is during the progress of a research project, where I find them useful for identifying and categorizing references and my notes. When that project is finished, I usually erase the tags that were used for it.

jprint714 · November 18, 2014, 1:40am

Thanks very much for the terrific replies! I initially created a separate DB for the next book project, but think it’s best that I just move those groups/subgroups into my main work DB.

@korm - a small question: I think I understand your process, and assume that the DB’s for your different offices mean that those DB only contain “self-contained repository for related information” – and that there’s no overlap with data from those DB’s that might be linked (or synced) to other office DB’s, correct?

Thanks for the tip on tags… I included them because of the new approaches w/ annotation that I’m considering (per my other post today), so that’s very helpful to hear. I’d like to employ a modest approach with them as well, and appreciate the sage words of caution. @Bill_DeVille - It seems painful to hear about trashing all of those tags after a project! Really!? After all of that work creating them??

Quickly… Is it advisable to research other approaches to DB structure? For example, if there’s a better hierarchical approach to nesting groups/subgroups (or smart folders, tags, inboxes, etc.)? Or is that a waste of curatorial time, like overwhelming tag management?

Thanks, guys!

korm · November 18, 2014, 3:43am

Yes

Sure – there’s enough opinion and advice on information structure to fill libraries. I don’t think it is possible to decide questions about structure in the abstract – without evaluating the characteristics and relationships in one’s own data. And it is not possible to finish one’s analysis of the data until all the data is available. So, until that point is reached, the structure will be emergent and fluid. Which is a good thing – and a good challenge for DEVONthink.

Bill_DeVille · November 18, 2014, 3:48pm

I’m lazy. I don’t want to do work that isn’t repaid in more efficient and effective use of my information. My attitude towards tags is that it would take a lot of time and effort to do a really comprehensive tagging job on new items as they are entered into a database, and there isn’t enough payback to justify that time and effort. The literature of information science isn’t very kind to tagging, and I agree.

When I’m working on a project I may create a handful of tags that are highly specific to that project, and that take little time and effort to apply, e.g., to selected documents in the database that are relevant in one way or another to the project. There’s a good use of tagging, as far as I’m concerned.

Precisely because those tags are highly relevant to a specific project, in my experience they wouldn’t be useful to the next project I undertake and may even be impediments to that work. That’s why I clear them out of my database after I’ve gotten proper use of them. But for interesting projects, I’ll select the group created for that project, export it to make sure its tags are saved, and import the exported content back to a database that’s archived for that work product.

Consider me a curmudgeon about tags. While they can be useful, I think most people tend to spend too much time and effort on them without a commensurate return.

jprint714 · November 19, 2014, 4:39pm

Thanks @korm and @Bill_DeVille for your replies. In a way, both of you have touched upon some of the issues I’m trying to resolve, that relate to this thread.

Re: Tags, as @Bill_DeVille said,

… I’ve been told that Tags should stay in the first hierarchical area in a DB – as opposed to a group/subgroup for a particular project. Is that correct? It seems like the use and location of tags directly affect their utility. That is, if Tags are located in the main hierarchical pane of the DB, I can see how Tags group disparate documents and files located in different groups/subgroups. That can be a useful way to see disparate information – that is, seeing how documents from different projects have similar points of convergence. But I suppose it can also overwhelm the tagging system, and thereby reduce its efficacy. If Tags can (or should) be moved to a group/subgroup for a particular project, then maybe it would make them a more useful research tool. Hence my question about them vis-a-vis creating a new DB per project, and how to re-jigger my organizational setup.

Re: @korm’s point about other organizational approaches to DB’s, namely,

I understand what you mean, and I suppose I was fishing for alternate models (or templates) that might be useful to emulate. So, my challenge w/ the DTP organization approach I’ve chosen is that I tend to get overwhelmed with the system of nesting groups/subgroups. There might not be a clear solution to this – I might just need to deal with an organizational system built on collecting vast and disparate amounts of information, and placing them into tidy subgroups (for the topics that most directly relate to an area). I’ve tried using replicants as a way to make this system a bit less gangly, e.g., replicating files that could have more than one location (as opposed to having to quickly locate the particular subgroup in which specific info ought to be housed). As I said, I was just trying to get a sense of other alternative approaches. I welcome any other ideas or suggestions. Thanks!

korm · November 19, 2014, 5:01pm

So, I happen to do a lot of work with projects and tasks, which lend themselves to a somewhat repetitive folder hierarchy – so that’s one model. When I start a task I know what subfolders to make.

On the other hand, a lot of DEVONthink users have reported that they are very relaxed about structure because tools such as See Also & Classify, Auto Classify, Search, Smart Groups, and tagging, can help put documents into buckets of related content and locate it later. This approach wouldn’t work for projects, I don’t think, but the approach can be great when you’re in the mode of researching and collecting and not ready to sift and refine.

You might want to read Steven Berlin Johnson’s books on emergence, and his essays on DEVONthink. The latter are almost 10 years old, but relevant. I also like what Mark Bernstein has written about emergent structure. He is the maker of Tinderbox, which was created to capitalize on emergence. But you don’t need to use Tinderbox to benefit from his thinking.

jprint714 · November 19, 2014, 8:14pm

Thanks @korm. Really appreciate your feedback, as always.

I’ve actually created a template for creating a repetitive folder hierarchy as well. That’s been helpful as far as creating the folders and some of the start-up files. But getting through all of the nested groups and subgroups to get to particular files and folders is tedious and time consuming. There’s just gotta be a better way.

Anyway, I’ll def. take another look at the See Also & Classify & Auto Classify tools – they might provide some insight into a better solution. I suppose that’s why I’ve been considering Tags as an under-utilized tool that could provide an alternate approach… I keep thinking there might also be a shortcut w/ replication, as far as a quicker / easier way to place the replicant file through my mass of folders/subfolders…(it’s a terrific tool, but that process also seems somewhat tedious and time consuming). Ah well…

I’ll also check out the books as well as Tinderbox. You’re not the first person to mention Tinderbox, so you’ve piqued my curiosity! Anyway…I thank you again…

OogieM · December 2, 2014, 5:18pm

I use multiple databases to separate logical groups of data, Android Development, Farm, Personal etc.

Makes it easier to file stuff.

I also hate tags due to difficulty in keeping them current.

JMichaelTX · May 10, 2015, 5:43am

Hi Bill. I’ve been following a lot of your posts, and have found them very helpful.

Could you please expand on and/or provide some references for your comment about information science not being kind to tagging?

I’m coming from Evernote, where many consider tagging to be the primary means of organization, and essential for finding Notes without a lot of false positives. I’ve just started using DTP and I’m trying to determine the best organization methods for it.

Thanks.

Bill_DeVille · May 11, 2015, 5:07pm

Back in the Stone Age of computing, when there were only mainframe computers that searched tapes containing cryptic items of information, I was project director of the Environmental Systems Applications Center at Indiana University.

The mission was to help disseminate results of federally funded research and development that might be pertinent to environmental issues and problems.

A Quonset hut on the campus held shelves of shoe boxes, each containing abstracts of papers that were filed by unique identifiers. There were some 2,000,000 such abstracts. We also received computer searchable tape reels that held information consisting of keywords tied to an abstract identifier. The concept was that, when we received a search request for information (usually from an industry or governmental agency), our staff would attempt to translate that request into one or more keywords that could be searched for on the tapes. The keyword(s) would then be sent via a punched card to a high priest at the computer center. Later – perhaps days later – the high priest would summon us to receive the computer’s output, which consisted of a strip of paper on which were printed abstract identifiers that resulted from the search.

This strip of paper would be carried over to the Quonset hut, where staff would remove identified abstracts from their shoeboxes, make Xerox copies of them, then refile the abstracts to their shoeboxes. Now the set of copies of abstracts would be sent to the staff member who was assigned to the search request. The abstracts would be reviewed for relevance to the search request. Relevant items would be organized using paperclips and sent back to the Quonset hut where staff would paste them on letter-size sheets and Xerox them as the search output.

Primitive as that now seems, this was state of the art back in the 1960s!

What could go wrong?

First, there was the step of translating the request for information into keywords that could be searched for on the computer tapes. We typically used grad students familiar with the relevant scientific or engineering discipline(s) deemed relevant to the request, to select appropriate keywords. First problem: two different persons, equally familiar with the disciplines, are highly likely to come up with different sets of keywords to use for a search. Even the same person, presented the request at different times, is highly likely to vary the set of keywords to be searched. Of course, variations in the keywords to be searched are likely to result in variations of the search results.

There were the same problems at the other end. I visited the federal agencies that provided the abstracts and coded them for searching by keywords. They also found that different persons would vary in keyword assignments to the same abstract, and that the same person, asked to translate the information content of the same abstract at different times, often produced different sets of keywords. No amount of training could make that issue go away, nor was it (as a practical matter) possible to mitigate the problem by creation of glossaries of standardized keywords.

So one major problem is consistency. This is in part a logical issue, as a rich language can provide multiple ways to identify the same concept. It is also a behavioral issue, resulting from human choices or attitudes that may vary among individuals, as well as in the same individual over time. It is a significant problem in assignment of keywords or tags.

The other major problem is comprehensiveness. How many potentially important elements of information might there be in any document, and how might that number be affected by the context in which the document is evaluated, and how much variation among individuals (and by the same individual over time) might exist? From my perspective, this is the most important reason not to do a priori keywording or tagging of all new items as they are entered into a database. The level of effort to do a good job, for example including variations in the context within which something might be more or less important, just isn’t worth it. Which is to say, the return on investment of time and effort becomes unsatisfactory. By the same token, doing a priori keywording or tagging in a very limited way is almost certain to miss opportunities to use information in other ways, in other contexts than the one that was in mind when the tag was assigned. I don’t think this adds much value.

Fortunately, we no longer suffer from the many deficiencies of operation that were implicit in that old computer information center. Nowadays, we can do full text searches of documents. We don’t have to wait for days to see results; we have true interactivity, so we can try variants of searches very quickly. In DEVONthink, we have AI assistants to help us organize content, and to suggest items that are contextually similar to one being viewed.

I use group organization in most of my databases. I rarely bother to design deeply nested groups. I agree with korm that organization is best addressed as an emergent feature of database use. I spend a bit of time now and then reorganizing a database in cases where I’ve found that makes it more useful, and that’s likely to be a much better return on my time and effort than had I tried to do the design a priori, before getting experience with that database and how I use and evolve its content.

Don’t become obsessive/compulsive about database organization, tagging or keywording. You will save a lot of time and effort if you limit those activities to the minimum that turns out to be rewarding in use of your information.

My financial database is tightly organized by categories and year. That makes it easy to file new content such as receipts, and easy to work with at tax time. My research databases, in which I spend a lot more time, tend to be more loosely organized. I rarely use tags, and then at the level of working on a project, such as for identification of useful references and notes. When the project is finished, those tags are removed, as they would be of little or no use for the next project.

JMichaelTX · May 11, 2015, 7:25pm

Bill,

Thanks for the history lesson on searching for digital information.

I’m not anywhere close to “obsessive/compulsive” as you suggested. I have found some tagging to be useful in other systems (Mac OS X, Evernote, Outlook, GMail, etc), and found your statement “The literature of information science isn’t very kind to tagging, and I agree.” to be surprising. From my experience, information systems seem to be moving away from, or at least adding to, the hierarchical directory/folder/group organizational approach, and adding tags to that. Mac OS X made some major additions to support tagging in Mavericks and Yosemite. For the record, I have found both very useful.

So, I was just wondering if you have any references, studies, etc. that support your statement?

Thanks.

OhioSteve · December 23, 2020, 3:25pm

I realize that this is a very old post, but as a beginner I found it to be a blessing. I’ve been using Evernote (EN) for 6 years, but due to recent degradation of performance, I’m migrating my piddly 4,000 notes to DEVONthink 3.6.1 Pro. I know there are plenty of EN users with over 20,000 notes, and I haven’t heard much about limits, but there are some. For the Premium version, documentation says the limits are 100,000 notes, 200MB file size for individual notes, 1000 Notebooks and 100,000 tags. That’s a lot of storage. For me, that’s never been the problem. My problem has been getting specific information out of the database. Evernote has only one database, and searches are poorly filtered unless the user applies a thorough and consistent tagging system. I came in with 35 years of experience with hierarchical storage methodology: physical file cabinets, DOS, Windows. EN doesn’t promote this structure, but rather focuses on topical and relational tagging. Unfortunately I learned the critical importance of tags the hard way, after having over 2000 notes in my database. I spent over an hour trying to find software registration data in an untagged note. A part of my search text was in the note title, but for some reason the note was not presented in the search results. My point: learn how the application handles input and output of your data, lay down the information storage foundation properly and then build upon that.

BLUEFROG · December 23, 2020, 3:29pm

Welcome @OhioSteve

Thanks for sharing your experience and welcome to the club!