Is classification of documents into multiple folders useful?

I have been under the impression that the AI weights documents by how they are classified, and that the more structure in the classification scheme (i.e., nestedness), the better. I have dutifully constructed a hierarchy of folders that reflects how I view my particular portion of the universe.

Needless to say, this takes more effort than dumping all the documents into one folder. Lots more. I have 4000+ records and 70 groups.

So I did an experiment. My workflow is to import documents into the inbox and then replicate or move each one to 1–3 folders, so I was curious how the AI’s “See Also” output changed when a PDF sat in the inbox vs. when it had been classified. I ran this experiment on 10 documents. The result?

No difference whatsoever. The same documents showed up in the See Also window, ranked the same way, regardless of the classification scheme.

This suggests to me that

  1. at best, the AI treats the words in the title of the folder as just more of the many words in the document, and
  2. I’ve been wasting a lot of time classifying.

This is too bad for a second reason. It would be pretty nifty if I could tell the AI to weight by folder classification. And, for that matter, by the title I give the document. This metadata, which I personally input, should matter more.

But in the meantime, it would be useful if folks could clear this up once and for all: regarding the performance of “See Also”, do the title and folder classification really matter?

1 Like

The Classify AI routine weighs existing documents by how they are classified and recommends classification of new additions based on that analysis, comparing contextual patterns of words used in the new document to contextual patterns of words in the existing groups of documents. Similarly, the Group routine examines word usage and contextual relationships among a group of selected, unclassified documents and creates new groups in which to place similar documents.

But the See Also and See Similar Text routines weigh the contextual patterns of words in the content of documents in order to suggest other documents that may be contextually similar to the one that’s being viewed.
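To make the distinction concrete, here is a toy sketch. DEVONthink’s actual algorithms are unpublished, so everything below (the bag-of-words vectors, the cosine measure, the pooled-group trick) is an assumption for illustration, not the real implementation; the point is only that See Also scores raw content against raw content, while Classify scores content against existing groups:

```python
# Toy mental model only. DEVONtechnologies has not published the real
# algorithms; the representation and similarity measure here are assumptions.
from collections import Counter
import math

def vector(text):
    """Reduce a document to a bag-of-words frequency vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-frequency vectors."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) \
         * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def see_also(doc, corpus):
    """'See Also' analogue: rank other documents by content similarity alone."""
    v = vector(doc)
    return sorted(corpus, key=lambda d: cosine(v, vector(d)), reverse=True)

def classify(doc, groups):
    """'Classify' analogue: compare a new document against the pooled text
    of each existing group and suggest the closest groups first."""
    v = vector(doc)
    centroids = {name: vector(" ".join(docs)) for name, docs in groups.items()}
    return sorted(centroids, key=lambda name: cosine(v, centroids[name]),
                  reverse=True)
```

Note that neither sketch ever consults a document’s Name or its location in the group hierarchy, which matches the behavior reported above.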

The Name of a document is metadata and may or may not also be in the content of that document. I prefer – for my own edification – to give descriptive names to documents. But other users may use cryptic names related, perhaps, to an organizational scheme. See Also doesn’t “care”.

Classification is another form of metadata about the contents of a database. As in the case of naming documents, I like to organize documents in terms of relationships among them that I understand and am comfortable with. I will look in the contents of a particular group for a document, because that’s where I likely will have placed it. That’s often a convenience to me. But if I do a database-wide search, the Search tool will find that document even if I made a mistake in filing it. Search doesn’t “care”, nor does See Also. But of course if I’m consistent about the way I file things, the Classify AI routine will make better and better suggestions about the location of new content as my database grows.

I’ll confess that I’m often more interested in finding things (Search) or in finding relationships between ideas (helped by See Also) than in very definitively organizing the placement within groups of each and every document in my database. So I’m grateful that Search and See Also don’t punish me for getting sloppy with organization (although the Classify routine’s utility weakens the sloppier I get).

1 Like

When I use DEVONthink, my only goal in classifying documents is to create a “value added” scenario for See Also.

“See Also” is the tool that allows me to find related documents that I had forgotten about; it is the “second brain”. Based on what Bill says, it seems that we don’t “train” DT for See Also at all. Instead, it always looks for links within the text of existing documents but ignores the very personal, and hence singularly powerful, information we use to place each document in the context of our individual research programs.

Here is an example of why this is a real problem.

In ecology, the word “stoichiometry” has been introduced recently to integrate a variety of studies (from biogeochemistry to nutrition to population ecology) into a coherent new research program. In most of the PDFs I own that are relevant to the field of stoichiometry, the word never shows up in the text of the article. It may show up in the title I give the article, and it definitely shows up when that article is replicated to the Stoichiometry folder.

(I am sure any user of DT can think of a similar scenario. )

The notion that this info is lost when I “See Also” a document is disappointing. :cry: And it seems easily remediable by adding a preference: “use title and group names in See Also, and weight by X”.

I’m almost afraid to ask this next question. But is folder and title information incorporated into a “Find” search? Does using the “Fuzzy” option increase or decrease this likelihood?

1 Like

As an old chemist and erstwhile philosopher and historian of science I’ve always been interested in how terminology and methodologies spread from one field to another.

Stoichiometry is basically the discovery of, and methodologies associated with, quantitative measurements that have predictive value. The concept was familiar to the ancient Greeks and has been a mainstay in chemistry for a long time, e.g., reaction equilibria.

Political scientists discovered quantitative approaches back in the 1960s, and I once teased a friend about papers that read like “Tom Swift and His Electric Factor Analysis Machine”. (Tom Swift was the hero of a series of children’s novels in the early 20th century; he used science and technology to solve mysteries.)

Quantitative approaches are important in many fields, certainly in ecology. I remember a landmark study of field mouse populations related to nutrient availability that was done on a tract of land over a period of years. There were equilibria between population size of the mice and the quantity of nutrients on the site, but with temporary undershoots or overshoots of population size related to nutrient levels. In general, population size adjusted to nutrient availability, a dynamic relationship. Other studies have focussed on behavioral changes of mice (or their predators) during periods of excessive population density, or in adaptation to inadequate nutrient availability.

You are interested in classifying ecological literature that uses stoichiometric approaches, even though the term “stoichiometry” doesn’t appear in the content of many of the writings.

See Also is probably “smarter” than you think. Its forte is finding similarities of words, and especially the contextual relationships among words, in a collection of documents. No, See Also doesn’t look at Names or at the group locations of documents (although classification may help you, the human part of the interactive team, organize your own thought).

Let me give an example. Dogs are canines. So are wolves, foxes and coyotes. Suppose you are viewing an article about dogs, which doesn’t include the term “canine”. You invoke See Also and find that the list includes an article about wolves, even though the term “dog” doesn’t appear in that article about wolves. How did that happen? Somewhere in that database is a “bridge” document that includes the term “canine” as related both to dogs and to wolves. The greater the number of such “bridge” documents, or the greater the frequency with which the relationship is defined even in a single “bridge” document, the more likely See Also is to make such a connection.
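In toy form (invented vocabulary, and certainly not DEVONthink’s actual code), the situation looks something like this. A plain word-overlap measure only makes the dog-to-wolf connection in two hops, by way of the bridge; whatever See Also actually computes can evidently collapse that into one, but the principle is the same:

```python
# Toy illustration of the "bridge" effect. The vocabulary is invented and
# this is not DEVONthink's actual algorithm.
dog    = {"dog", "bark", "fetch"}
wolf   = {"wolf", "pack", "howl"}
bridge = {"canine", "dog", "wolf"}   # a glossary-style "bridge" document

def overlap(a, b):
    """Crude similarity: fraction of shared vocabulary (Jaccard index)."""
    return len(a & b) / len(a | b)

print(overlap(dog, wolf))     # 0.0  no direct connection at all
print(overlap(dog, bridge))   # 0.2  the dog article surfaces the bridge...
print(overlap(wolf, bridge))  # 0.2  ...and the bridge surfaces the wolf
```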

Take that as a tip. You are trying to force a connection among documents by grouping them. The connection may be the concept of stoichiometry, but that term doesn’t exist in many of the documents in your collection. One way to enhance the behavior of See Also so it makes that connection would be to make sure there are one or more documents in the collection that “bridge” the term “stoichiometry” to other terms or word patterns common to the concept. That bridge document might be a beautifully written overview of the field, or it might be a “nonsense” document that is basically a glossary of related terms, perhaps repeated for emphasis.

I still organize my database collections, at least to some degree, for my own benefit. I can’t create and hold in my mind tables of all the tens of millions of words in my database, and the patterns in which those words occur, the way See Also can. But my database isn’t trained as a chemist, or ecologist, or economist, or whatever my interests may be. So I’m responsible for determining the pertinence of documents suggested by See Also. Some of those suggestions may be “dumb” while others are “brilliant”; it’s up to me to make the distinction. This is human/machine interaction, and I often find it very useful.

Sometimes I find it useful to follow a trail of See Also suggestions. The first list of suggestions may not give me what I’m looking for, but selecting a document from that list and invoking See Also again may lead me to the discovery of a relationship I hadn’t thought of.

2 Likes

“You are trying to force a connection among documents by grouping them.”

What?

You are presuming that a database is going to have enough information that “See Also” can learn that wolves and dogs are both canines without listening to the user. The user shouldn’t have to stock the database with enough documents for See Also to learn from; See Also should be able to learn from everything, and the user can turn off what’s taken into account if they want.

See Also gets it wrong all the time, though now that I know what it looks at, the list of things I’m given makes a lot of sense. Unfortunately, as it stands now, it’s pretty much useless.

My databases are not ideal; they are not built for AI. If I only need two papers in a given field, then that’s all that will be there, which means there often isn’t enough information for your bridge. And while I would and do input additional information in the form of classification, naming, etc., I’m not going to add more documents just so the AI can learn.

It seems ridiculous to me that the AI is designed to look at only some of the information, and that when that isn’t enough, the user hasn’t done enough, the user is trying to force connections. I think that users, with non-artificial intelligence, might just know something about their databases, and the AI should take advantage of that. And if the users don’t know anything, if their inputs hinder the AI, hopefully they would know enough to realize that and tell the AI to ignore them. :wink:

In all seriousness, though: it’s not as if this is a package for simpletons. It has a learning curve, and users need to be able to either self-teach or read a manual, so I don’t think that letting them decide how much input the AI accepts would overwhelm them, and it would make this feature work (and work better) for a lot more people.

1 Like

Include the following when calculating See Also:

  [x] Document text (default)
  [ ] Document title
  [ ] Folder
  [ ] Folder hierarchy

Then assume there is some intelligent weighting, such that the information the user brings to the document is weighted more heavily.
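In pseudocode terms, the request amounts to something like the sketch below. No such knob exists in DEVONthink today, so the field names, the weights, and the crude similarity measure are all invented for illustration:

```python
# Sketch of the *requested* behavior, not an existing DEVONthink feature.
# All field names, weights, and the similarity measure are invented.
from dataclasses import dataclass
from typing import List

@dataclass
class Record:
    text: str
    title: str
    groups: List[str]   # every group the record is filed or replicated into

def word_overlap(a: str, b: str) -> float:
    """Crude similarity: fraction of shared words (Jaccard index)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def see_also_score(candidate: Record, current: Record,
                   w_text: float = 1.0, w_title: float = 2.0,
                   w_group: float = 3.0) -> float:
    """Blend content similarity with user-entered metadata, weighting the
    metadata more heavily because the user chose it deliberately."""
    shared = set(candidate.groups) & set(current.groups)
    union = set(candidate.groups) | set(current.groups)
    group_sim = len(shared) / len(union) if union else 0.0
    return (w_text  * word_overlap(candidate.text,  current.text)
          + w_title * word_overlap(candidate.title, current.title)
          + w_group * group_sim)
```

With the group weight set high, two papers replicated into the same Stoichiometry group would surface in each other’s See Also lists even though the word never appears in either PDF.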

1 Like