Best Database Practices to auto-classify “everything”

Dear DTPO Power Users,

I find the thought of dropping a file into a database that automatically classifies it topically for me rather powerful.

I have read and re-read a number of well written and thought-out posts in this forum, but as a Newbie who needs a much better system to help my disorganized tendencies, I need some clarification as I wrap my little head around such a powerful application.

I’d like to see if I’m on the right track; I’m considering using DT to capture and organize lots of information - i.e. “almost everything” in my Documents folder or that would be heading there (pertinent emails, web pages, screenshots, ideas- text or audio, images, docs…).

I understand DT’s AI auto-classification will not magically classify everything (yet :wink: !), but from what I gather (before I go setting up databases and dumping my current and future Mac into DT!), there seem to be some better ways to assist the AI.

So, here it goes…
The classification AI (and Show Also) is based on contextual relationships found in the topics of documents, not their file name or the name of any Groups (folders) they are under hierarchically or their tags (or other metadata if they have any). That is, the AI is concerned with the “what” in the documents and items, not who created or edited them how, when, why or where…

What I think this means for “best practices”:
1. The nomenclature of documents (files) and Groups can be based on my own personal needs for a database and is not related to the topical taxonomy of the AI.
2. The topical structuring of folders (Groups) and subfolders, however, is critical and generally the deeper the hierarchies the better, keeping documents out of parent Groups.
3. It’s best to keep databases organized and separated by unrelated topics, at as high a level as fits the purposes of the database.
4. This or that…, which a newbie won’t know to ask.

Good so far?!

What I’m not sure this means for “best practices” (assuming I’m “basically good so far” - lol):
1. If Groups essentially function as Tags and I’ll have lots and lots of Groups and sub-groups in order to assist the AI, why would I want to use Group (folder) Tags? I would end up with, at least it seems to me, a visual plethora of “tag mess”, which is one reason I want to keep tags to a minimum and maybe even drop them entirely; it’s too easy for me to end up just tagging the crap out of everything into a meaningless oblivion. (I can see using “ordinary tags” as it fits the purpose and needs of a database, particularly for quickly capturing and labeling an item in order to later remember why I grabbed it in the first place when I organize and classify it, but also to help with future searches and smart folders.)

2. Yet, in order to help me topically organize my thoughts and the database, I could see making some (but not too many!!!) of the higher parent folders Group Tags. This might also make it quicker to tag captured items, to organize and classify them later, or to find them in searches, as their child groups would inherit that Group Tag. Two problems, however, arise (I think):

 A. First, I see no way to visually distinguish between Group Tags and Ordinary Tags (unless I create only two main parent tag groups, one for each, and that’s not too helpful).  

 B. Second, it seems this could make the database a bit “unruly” by mixing non-Tagged Groups with Tagged Groups. (I think I “hung up” a test database when changing the “exclude Groups from Tagging” Preference and changing Groups back and forth between tagged and non-tagged too often?).   

 Are those two problems really problems? If so, how big and are they easily solved?

3. In place of what could be a “tagging into oblivion” tendency (tagging is somewhat useful in other applications if not overdone, so that a document can be located by a meaningful label rather than by its location in the file/directory structure), it seems that perhaps a somewhat generous but reasoned use of replicants might be helpful for items that have higher-level parents that are either possibly or less-strictly related topically, or not clearly relatable to the AI? (I need to avoid organizing and sorting and cleaning lots of tags; the file structure is more than enough for me already! lol)

4. This and that, which a newbie would miss.

Thanks for any help and advice. :slight_smile:

There’s a lot going on with your post, more than I can respond to (and more than I have answers to).

I’d say your feelings about see also/classify are pretty good - the relationship between documents within groups is the most important factor in how the AI works. Deep hierarchies are a relative thing – the documentation doesn’t necessarily forbid having documents at the root of a group; it suggests you should avoid having groups and documents at the same level. For example:

[database]
  [group a]
  [group b]
    [group b1]
    [group b2]
  [group c]
  [group d]
    [group d1]
    [document a]
    [document b]

I see no issues leaving [group a] or [group c] as a single level if their contents are sufficiently interrelated. Breaking up a relatively homogenous group can lead to unnecessary and illogical fragmentation (based on some contrived criteria derived simply for the purposes of breaking up the group into subgroups). This can throw off the AI more than it helps, since it doesn’t accommodate arbitrary distinctions between groups of documents!

Looking at [group d], this situation doesn’t make sense. If you have a sub-group, there should be no documents in the root of [group d] because, as you note, the documentation says not to do this; it is confusing for the AI. Either Documents A and B should be in their own sub-groups, or they should be in [group d1]. However, in the latter case, with [document a] and [document b] also in [group d1], everything in [group d1] could just sit in the root of [group d] and there’s no need for the subgroup. Basically, to avoid confusing the AI, don’t mix groups and documents at the same level.

As per the documentation for [group b], creating two subgroups is wise, assuming there are two reasonably logical and coherent conceptual subgroups. What the documentation warns against is filing things directly under [group b] when [group b] contains subgroups.

So there’s nothing wrong with having a single-level hierarchy if that’s what works. In the example, [group d] makes a sub-group for the sake of it and to no benefit. Meanwhile, [group b] works well, as long as you don’t file any documents directly in [group b]; everything MUST be in a subgroup of [group b].
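The “don’t mix groups and documents at the same level” rule is easy to check mechanically. Below is a minimal Python sketch that walks a tree like the one above and reports any group holding both sub-groups and documents. The `Group` class, the `mixed_groups` function and the placeholder documents in groups a, b and c are all made up for illustration; this models the example tree only and does not talk to DEVONthink in any way.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Group:
    """A hypothetical stand-in for a group: a name, sub-groups, and documents."""
    name: str
    groups: List["Group"] = field(default_factory=list)
    documents: List[str] = field(default_factory=list)


def mixed_groups(group, path=""):
    """Return the paths of all groups that contain both sub-groups and documents."""
    here = f"{path}/{group.name}" if path else group.name
    found = [here] if group.groups and group.documents else []
    for child in group.groups:
        found.extend(mixed_groups(child, here))
    return found


# The example tree from above; the "doc N" entries are invented filler.
database = Group("database", groups=[
    Group("group a", documents=["doc 1", "doc 2"]),
    Group("group b", groups=[
        Group("group b1", documents=["doc 3"]),
        Group("group b2", documents=["doc 4"]),
    ]),
    Group("group c", documents=["doc 5"]),
    Group("group d", groups=[Group("group d1")],
          documents=["document a", "document b"]),
])

print(mixed_groups(database))  # ['database/group d']
```

Only [group d] is flagged: [group a] and [group c] hold documents only, and [group b] holds sub-groups only.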

Tags are a difficult subject. There are lots of opinions about tagging. It’s their very flexibility that makes tags such a hotly contested topic.
You are familiar enough with how tags differ from groups taxonomically, obviously, so it’s really just a matter of how you want to use them, if at all. This is really a personal and creative, and ultimately arbitrary part of this whole situation. What I’ve found most useful is: be flexible and open-minded, and read lots of different approaches to organizing (and tagging, specifically). No single person’s approach will be directly transplantable into your case, but you can likely take bits and pieces to create something that is your own.

To organize my academic literature and general work database (which is organized very differently from, for example, my personal files database), I used to use a mix of documents organized by topic-groups. I’d then use tags to associate documents that might cover some similar concepts but be about different topics. (e.g., I have a group called “food studies” and a group called “waste politics”, and since there is food studies literature on risk, and waste politics literature on risk, I might use a “risk” tag to bring those documents together under that tag despite being under different substantive topic groups).

This worked reasonably well because it basically gave me two ways of accessing or surfacing documents based on a number of concepts or areas of study that are, ultimately, non-mutually-exclusive.

Then I ended up with lots of tags, and some of those tags overlapped with groups out of carelessness or out of some kind of taxonomical messiness.

In addition, the other downside is that with over 1500 documents, I had not universally or consistently applied this technique. If I went to the “risk” tag, there was a strong likelihood that there were other documents about “risk” that weren’t tagged as such, so it was unreliable. It was also the case that what constituted a paper “about risk” is subjective. How much of the paper needs to attend to that concept to justify tagging it as such? Does the mere mention of “risk”, or the citing of a major scholar in the area, justify it or not? Taken together, these inconsistencies meant that tags were unreliable. They weren’t exhaustive, so I’d need to resort to searching or the “see also” pane anyway, somewhat defeating the purpose of the tag. Moreover, since the criteria for tagging are arbitrary, I likely ended up with “false positives” – those documents that really aren’t useful in the end because they aren’t sufficiently related.

If I put all this effort into tagging, but ended up using search and See Also anyhow, why bother tagging in this way?

So my new system, which is still being “developed” (that’s a bit of an overblown term but I don’t know what else to use), is as follows:

  • Maintain my (still unsatisfactory) topic-groups and rely mostly on search and See Also, forgoing tags altogether for topic/subject-related retrieval. In instances where there is a VERY strong link, replicate documents as necessary (so perhaps I have a Food Studies document and a Waste Politics document, each in their own respective groups, but also replicated to a “risk” group, and so on). This would only be in circumstances where there is a very clear need to do this, and does not create a second level of abstraction (of “tags” as somehow transcendental).

  • Use tags as a way to quickly access documents based on their purpose. Documents that might be good readings for a course get tagged with that course code. Documents that pertain to a specific manuscript get tagged with that manuscript’s name. A general list of documents that I should read get added to a “ReadingList” tag that I can add to and remove from as I complete reading them.

So tags are no longer a way of dealing with the contents of the files, they are a way of organizing documents by their purpose. “These are documents I need to read; these are documents that would make good readings for ENSC 321; these are documents that pertain to this manuscript; etc.”

I’ve gone from over 100 tags to about 12 tags (the exact number varying a bit depending on the number of concurrent projects), and each tag’s purpose is very clear and obvious.

The advantage to this spills over to something like DTTG (DEVONthink To Go) on iOS. It takes about 3 taps to open a document in my ReadingList tag (either in DTTG or in the document provider), compared to the many many more it would take to find that document in a group or scrolling through a long list of 100+ tags!

That said, I’m not sure if I’m still 100% satisfied with this. I’m going to roll with it a bit longer and see how it matures. I am debating, however, swinging the pendulum the other way entirely for my work files database and doing something like this:

  • All documents in groups based on VERY coarse categories (say, 4 or 6, as opposed to the current 70+ groups and sub-groups).
  • Use tags to denote subject matter and topic areas, using symbols to force certain “organizational” tags to the top (e.g., #ReadingList) to keep those easily accessible. Tags are easy to create and apply, so I can just tag as I go along reading a given document. Replicating to groups is a bit more cumbersome to do “on the fly”, and can sometimes cause grief with indexed files.

This would almost just be an inversion of the current scheme.

As for the “Exclude groups from tagging” toggle… I’m not sure how much I grasp the immediate utility of this, I suspect it would be worth experimenting with to see if it opens up something that is helpful for you. Personally, at this point, I’m not sure how it would fit in to my taxonomy, but it’s something I would consider.

So my advice still stands. Read a lot of other people’s ideas, experiment, and be flexible (not too wedded to one way as the “right” way). Don’t have documents and groups at the same level of hierarchy.

Scott,

Thank you for such a thoughtful and helpful reply. I knew my question had some big bites I was chewing on, and then wondered after posting, “Who will answer this?” lol (The quality and care of this forum in-and-of-itself makes me want to dive into the DT technologies.)

You’ve made some great points that have helped me clarify my questions.

Hierarchies
Your explanation of hierarchies for the AI, especially with the visual, helped a lot. I’m encouraged to know I’m pretty much on the right track here, but see more clearly what assists and what confuses the AI. This was a particularly helpful point on what to avoid:

I need to be “Group by Topic” conscious as I restructure my folders. I don’t want to “subgroup them to death”.

However, it is my current understanding that the deeper the hierarchies the better for the AI. (The depth being relative to the needs of the database.) Is that correct?

If I’m correct, then the documents in Sub-Groups are essentially Sub-Topics to the AI.

So, as long as I see a strong conceptual sub-topic within a group, I should go for it and create those sub-groups?

Tags
Your thoughts here have really helped. :slight_smile:

Tags can be used so creatively and differently that we can each find them amazingly helpful. I do enjoy hearing about how others find ways to use them; however, most of the time, while inspirational as to how I could tag better or differently, the use is fairly case specific.

I understand this type of tag to be an “Ordinary” tag in DT, but your “risk” tag really nailed it for me:
Even though it resulted in false positives, which, I agree, for a research database would lead to “why bother tagging at all”, the power, it seems, was in referencing concepts that were potentially related:

In terms of tagging and searching in general, but particularly with the AI, I realized what I didn’t want to lose:
Non-mutually-exclusive concepts can be highly related topically; that is, not an “either/or”, but a “both/and”.

When I see such concepts, I can replicate them, but when I do, I probably need to consider why and possibly create a new group (e.g. “risk”) to replicate them to as well, which would make my database tighter topically (as long as I don’t create a second level of transcendental abstraction - lol - your comment there on “tags” was point on and hilarious!).

However, when I can’t or don’t see such a relationship, setting up the database so that the AI helps me find those relationships when I don’t, seems critical to the purpose of See Also. It also makes the Classify function work better.

To that end, not using tags to reference the contents of the data topically makes a lot of sense to me as a “best practice” for the AI. I need to name my Groups topically, not the tags, and not confuse the two. Does that make sense?

Your use of tags based on purpose is a good one! If I can’t find the purpose for capturing and keeping a document, much less classifying it, then why bother - lol. Knowing its purpose would tell me not only which database to put it in, but more precisely what to do with it once it’s there, e.g. when, who, where, how…

Excluding Group Tags from Tagging
Although I can visually identify Groups that are also tags since they are yellow, I do not know how to tell apart, among the Tags in DT, those that are Group tags from those that are Ordinary tags, so I don’t find them helpful.

In addition, the mixing of Groups that are tagged with those that are not creates an additional hierarchy in the Tags that probably needs to be locked so that I don’t lose documents. I’m still looking into this mixing option, as a few higher-level “coarse” parent topic Groups might be helpful, but I’m not sure how to make it work or how it affects the database; “Exclude Groups from Tagging” seems to operate at a fairly deep level in the structure of the database.

(More to test and learn there.)

My head now spinning

That is where I was going to start with DT before reading more in the forum and the documentation! I could create just a few high level categories and the AI would eventually catch on.

My recent understanding, however, is that I need as many logically topically coherent groups and sub-groups when designing and setting up a DT database in order to assist in training the AI! Is that correct?

However, once a database has been used enough with a fairly deep hierarchy (say 50-100+ groups and subgroups), then perhaps the AI will only need a few higher level coarse categories retained, the others could be removed, and then “the machine will have learned” and we can literally throw documents in it that it will classify as accurately as before, but better yet find even more See Also. A good question?!

Then, I could see taking all the Names of the lower level groups and sub-groups to be discarded and creating them into a system of Ordinary tags, not for the AI, but for me. (My memory is great. It’s just short - especially compared to a computer’s!)

Thanks again Scott!!! :slight_smile:

In terms of the DT AI, which is a key point of interest and purpose for diving into DT for me, and before I take the time and energy to set up databases properly, you’ve certainly brought some clarity. I really appreciate your thoughts and feedback.

See Also Vs. Classify
As I understand it (and I may be wrong):
Classify: compares the selected files with contents of groups - in other words, consistency and strong relationships within groups matters for Classification. “What group has files that are similar to this one?”

See Also: looks for concordance between files regardless of their grouping - In other words, See Also is largely agnostic to how files are grouped. “Which files in this database look like this one?”
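To make the distinction concrete, here is a toy Python sketch of the two questions. It is emphatically NOT how DEVONthink’s AI works internally (I don’t know that); it just uses plain word counts and cosine similarity as a stand-in. The function names `classify` and `see_also`, the group names and the tiny document texts are all invented for the illustration.

```python
import math
from collections import Counter


def vector(text):
    """Bag-of-words vector: word -> count."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def see_also(doc, database):
    """Rank individual documents by similarity, ignoring which group they sit in."""
    target = vector(doc)
    scored = [(cosine(target, vector(text)), name)
              for docs in database.values()
              for name, text in docs.items()]
    return [name for score, name in sorted(scored, reverse=True) if score > 0]


def classify(doc, database):
    """Rank groups by similarity between the document and each group's combined contents."""
    target = vector(doc)
    scored = []
    for group, docs in database.items():
        combined = Counter()
        for text in docs.values():
            combined.update(vector(text))
        scored.append((cosine(target, combined), group))
    return [group for score, group in sorted(scored, reverse=True) if score > 0]


# Invented mini-database: group name -> {document name: text}
database = {
    "Food Studies": {
        "food-risk.txt": "food safety risk consumers",
        "farming.txt": "farm food crop soil",
    },
    "Waste Politics": {
        "landfill-risk.txt": "landfill waste risk policy",
        "recycling.txt": "waste recycling policy city",
    },
}

new_doc = "risk assessment for landfill waste"
print(classify(new_doc, database))   # ['Waste Politics', 'Food Studies']
print(see_also(new_doc, database))   # top hit is 'landfill-risk.txt'
```

Note that `see_also` also surfaces food-risk.txt from the other group, because it never looks at group membership, whereas `classify` only ever answers with a group.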

Hierarchies and AI:
The depth of a hierarchy is not really related to the ability for the AI to categorize (though I do not know the ins and outs of the DTPO AI… so I may be wrong).
What is important is the strength of the relationship between documents within a given group, and how distinct they are from documents in any other group. Whether this is 1 level, or 10 levels, depends on what is required for your taxonomical needs.

So let’s take a straightforward example of PDF documents that contain searchable/selectable text. Let’s say they are bills from 3 different companies: A, B and C.
All bills from Company A are formatted identically with exactly the same header information.
All bills from Company B are formatted identically with exactly the same header information.
All bills from Company C are formatted identically with exactly the same header information.

And the contents of any given bill from Company A are totally different from any given bill from Company B or C.

So we have a strong textual relationship between bills from any given company, but very weak relationship between bills from different companies (A bill from Company A is very UNLIKE a bill from Company B, for example).

Now, we might have something like:


Database
   [Bills May 2015] (3 documents)
   [Bills June 2015] (3 documents)
   [Bills July 2015] (3 documents)
   [Bills August 2015] (0 documents)

Let’s say we’ve got all our bills sorted out for May-July, and we just got a new batch of bills for August from Companies A…C. We have them in our inbox and we want our AI to take care of them.

The AI is going to have a very hard time knowing where to put those bills. It will probably recommend placing our new bills in the May-July folders, since it says “I see that these folders have other documents that look like these new ones”. But that isn’t what we want! We want them in [Bills August 2015]. But since [Bills August 2015] is empty, the AI doesn’t know what belongs in there. May-July DO have documents it can base its recommendations on, and so that’s what it will spit out.

Not helpful though!

However, let’s say we aren’t so concerned about grouping bills by date (we can always include dates in file names, or sort a group by date, etc). If we want to group them by company instead, we might do something like this:


Database
   [Bills Company A] (X documents)
   [Bills Company B] (Y documents)
   [Bills Company C] (Z documents)

Here, we have three groups, one for each company, and each group already contains some number of bills from that company. We get a new batch of bills from Companies A…C sitting in our inbox. If we tell DTPO to auto classify the bill from Company A (or we open the See Also/Classify pane), the odds are VERY STRONG that the top recommendation will be [Bills Company A].
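The two filing schemes can be compared with a toy Python sketch. Again, this is an assumption-laden stand-in and not DEVONthink’s real algorithm: similarity here is just the share of distinct words two texts have in common, a group’s score is the average similarity to its documents, and the three bill texts and the `score_groups` function are invented.

```python
def words(text):
    """Distinct lowercase words in a text."""
    return set(text.lower().split())


def jaccard(a, b):
    """Share of words the two sets have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def score_groups(new_doc, groups):
    """Average similarity of new_doc to each group's documents; empty groups score 0."""
    target = words(new_doc)
    scores = {}
    for name, docs in groups.items():
        sims = [jaccard(target, words(d)) for d in docs]
        scores[name] = sum(sims) / len(sims) if sims else 0.0
    return scores


# Invented bill texts: identical within a company, different across companies.
bill_a = "alpha energy invoice kwh meter"
bill_b = "bravo water invoice litres meter"
bill_c = "charlie telecom invoice data plan"

by_month = {
    "Bills May 2015": [bill_a, bill_b, bill_c],
    "Bills June 2015": [bill_a, bill_b, bill_c],
    "Bills July 2015": [bill_a, bill_b, bill_c],
    "Bills August 2015": [],
}
by_company = {
    "Bills Company A": [bill_a, bill_a, bill_a],
    "Bills Company B": [bill_b, bill_b, bill_b],
    "Bills Company C": [bill_c, bill_c, bill_c],
}

new_bill = bill_a  # the August bill from Company A looks like every other Company A bill

month_scores = score_groups(new_bill, by_month)
company_scores = score_groups(new_bill, by_company)
print(month_scores)    # May, June and July tie; August scores 0.0
print(company_scores)  # Company A scores 1.0, well ahead of B and C
```

In the by-month scheme the empty August group can never win and the other three months are indistinguishable; in the by-company scheme there is one clear winner.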

As you can see, both schemes are only 1 level deep, but one works with the AI, and the other completely confounds the AI. Even if we were to do something like:


Database
   [Business]
      [Employee Record]
      [Recruitment Strategy Files]
      [Bills]
        [Utilities]     
          [Bills Company A] (X documents)
          [Bills Company B] (Y documents)
          [Bills Company C] (Z documents)
   [Pleasure]

Now we have our bills from Company A…C 4 levels deep. There would be no change in how well the AI can categorize things based on this change to hierarchy alone.
All that matters is how similar documents are within a group and how different groups are from one another, and how consistent this is. (so, if you filed your bills really badly and had bills from Company B in the Company A group and vice versa, the AI is not going to do a good job!)
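Here is the same point as a toy Python sketch. It only illustrates my claim under a stated assumption: the scoring looks at the documents inside each group and never at the group’s path, so nesting cannot change the result. Whether DEVONthink’s AI behaves exactly like this is something I can’t confirm. The `similarity` and `best_group` functions and the bill texts are invented.

```python
def similarity(a, b):
    """Toy measure: share of distinct words two texts have in common."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def best_group(new_doc, groups):
    """groups maps a path (tuple of group names) to a list of document texts.

    Only the documents are scored; the path itself is never inspected.
    """
    def score(path):
        docs = groups[path]
        return sum(similarity(new_doc, d) for d in docs) / len(docs) if docs else 0.0
    return max(groups, key=score)


# Invented bill texts, one format per company.
flat = {
    ("Bills Company A",): ["alpha energy invoice kwh meter"] * 3,
    ("Bills Company B",): ["bravo water invoice litres meter"] * 3,
    ("Bills Company C",): ["charlie telecom invoice data plan"] * 3,
}

# Same groups, pushed four levels deep as in the tree above.
nested = {("Business", "Bills", "Utilities") + path: docs
          for path, docs in flat.items()}

new_bill = "alpha energy invoice kwh meter"
print(best_group(new_bill, flat))    # ('Bills Company A',)
print(best_group(new_bill, nested))  # ('Business', 'Bills', 'Utilities', 'Bills Company A')
```

The winning group is the same in both layouts; only the path leading to it differs.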

Tags
Yeah I think you’re highlighting some of the important things to consider. Tags don’t really influence (Auto) Classify or See Also, it would be purely for your manual retrieval. However, tags are a lot easier to create on the fly whereas replicating is a bit more labour intensive. Tradeoffs!

You’re totally correct, under this scheme, Classify would not work well (though See Also would work just fine because that’s based primarily on concordance between text contents of documents and not on groups). So this strategy would be very bad if you want to rely on the AI for sorting new, incoming documents. It would be fine for retrieving files using “See Also”.

Hope this helps!

If you haven’t already, I’d strongly recommend Bill’s post on Tips for See Also & Classify (Tips on Classify & See Also)

which gets into some more of the nitty-gritty (and is definitely correct unlike my own statements which could well be incorrect!)