Classify vs. See Also

transdef · November 7, 2014, 6:29pm

I carefully studied Bill’s 2012 missive “Tips on Classify & See Also” to see if I could figure out why Classify doesn’t work for me. I even broke up a folder so that it contained only subfolders, in an effort to see if that would help Classify. That made no difference.

Here’s the confusion: See Also is able to consistently figure out what group similar files have been placed in. DTPO clearly operates Classify differently than it does See Also. I’d appreciate help in understanding how context recognition in Classify is different than in See Also, so I can try to accommodate its needs.

I’ve been accumulating documents for years now in DT, and am finally forced to solve this Classify problem. I’m really looking forward to harnessing the power of DT to get my workspace unclogged.

Bill_DeVille · November 7, 2014, 9:40pm

As you noted, See Also is able to “figure out” the group locations of documents it suggests. But that’s only because the group location is metadata associated with the suggested documents. That’s what philosophers would call an accidental association. See Also is looking at documents, not groups.

Classify can work very well, or not. Its effectiveness depends on the contextual relationship similarity of the documents contained in each group. That, in turn, depends on the user’s design of groups.

Suppose I’ve got a collection of documents that cover three scientific topics: genetics, astronomy and quantum physics. How might I file them?

Scenario 1: I create a group for each topic and file the documents according to their topic. In this case, it’s highly probable that if I add a new paper about quantum physics and ask Classify to recommend a location for it, the top-ranked location will be the quantum physics group. Great! Classify is wonderful.

Scenario 2: I create groups by year of publication of the papers and file them accordingly. I add a new document about quantum physics to the database, select it and ask Classify where it should go. Unless, by sheer accident, there was a preponderance of papers about quantum physics in the group holding papers published in a certain year, Classify wouldn’t be able to suggest any group with high ranking as the location for the new paper. Bah! Classify is a bummer.

That’s pretty much Classify in a nutshell. If groups are designed by topic, it’s likely that the documents filed in each group have similar contextual relationships, because they use terms and patterns of terms distinctive for that topic, so that Classify can assign distinctive contextual relationship patterns to the groups themselves. Each group has high topical coherence in ways that distinguish it from all the other groups. Classify is wonderful. But if the documents contained in each group have low contextual relationship similarity, the groups will have low topical coherence. Classify is a bummer.
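The two scenarios can be illustrated with a toy sketch. This is not DEVONthink's algorithm, which is not described in this thread; it assumes a plain bag-of-words cosine similarity and a hypothetical six-paper corpus, and it measures "topical coherence" as the mean pairwise similarity of the documents inside each group.

```python
from collections import Counter
from itertools import combinations
from math import sqrt

def vec(text):
    """Bag-of-words term counts (an assumed stand-in for 'contextual relationships')."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def coherence(docs):
    """Mean pairwise similarity of the documents in one group."""
    pairs = list(combinations(docs, 2))
    return sum(cosine(vec(a), vec(b)) for a, b in pairs) / len(pairs)

# Hypothetical mini-corpus: (topic, year, text)
papers = [
    ("genetics", 2012, "gene dna mutation genome"),
    ("genetics", 2013, "dna genome sequencing gene"),
    ("astronomy", 2012, "star galaxy telescope orbit"),
    ("astronomy", 2013, "galaxy orbit star planet"),
    ("quantum", 2012, "quantum photon entanglement spin"),
    ("quantum", 2013, "photon spin quantum qubit"),
]

def group_by(index):
    """Group paper texts by topic (index 0) or by year (index 1)."""
    groups = {}
    for paper in papers:
        groups.setdefault(paper[index], []).append(paper[2])
    return groups

def mean_coherence(groups):
    return sum(coherence(docs) for docs in groups.values()) / len(groups)

by_topic = mean_coherence(group_by(0))  # Scenario 1
by_year = mean_coherence(group_by(1))   # Scenario 2
print(f"by topic: {by_topic:.2f}, by year: {by_year:.2f}")
```

In this contrived corpus the topic groups share most of their vocabulary while the year groups share none, so a group-level comparison has something distinctive to match against only in Scenario 1.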

transdef · December 16, 2014, 4:30pm

Bill, I see from your reply that I did not write precisely enough. Let me try again:

My experience is that See Also is very good at finding similar documents to the ones I am using as test subjects. The similar documents found had been previously manually placed by me in one specific topic grouping (your Scenario 1, below). I conclude from this that the See Also algorithm is very capable of recognizing the contextual relationships in the topic groups I’ve created.

[Please note that this has nothing to do with metadata or accidental associations. See Also is doing what it was designed to do.]

Classify works differently. Many times, the group that See Also turns up most frequently is shown as the top group. However, even if the only other documents found by See Also are documents still in the Inbox, Classify will still not give that frequent group a green light, as its score is too low. It consistently shows a large number of other groups, often none of which were associated with documents selected by See Also. [These other groups seem to be held over from previous uses of Classify/See Also.]

I’m obviously missing something important here about using Classify. What I’ve been asking–and I need to ask again–is “Why does Classify act differently than See Also?” Is it possible the scoring process is being distorted by prior uses of Classify/See Also?

I’m drowning in data, and need your help to be able to use Classify.

Bill_DeVille · December 16, 2014, 6:41pm

Both Classify and See Also make an analysis of the contextual relationships in a selected document.

In the case of See Also, the contextual relationships in the selected document are compared to the contextual relationships of each and every other document in the database.

In the case of Classify, the contextual relationships in the selected document are compared to the ‘pattern’ of contextual relationships in each and every group in the database.
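The document-versus-group distinction can be sketched as follows. This is only an illustration under stated assumptions: bag-of-words cosine similarity stands in for the real analysis, a group's "pattern" is assumed to be the pooled term counts of its documents, and the names `see_also`, `classify` and `database` are hypothetical, not DEVONthink functions.

```python
from collections import Counter
from math import sqrt

def vec(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical database: group name -> document texts
database = {
    "Genetics": ["gene dna mutation genome", "dna genome sequencing gene"],
    "Astronomy": ["star galaxy telescope orbit", "galaxy orbit star planet"],
}

def see_also(text, database):
    """Rank individual documents; the group is only carried along as metadata."""
    q = vec(text)
    hits = [(cosine(q, vec(doc)), doc, group)
            for group, docs in database.items() for doc in docs]
    return sorted(hits, reverse=True)

def classify(text, database):
    """Rank groups by comparing the query to each group's pooled term counts."""
    q = vec(text)
    scores = []
    for group, docs in database.items():
        pattern = Counter()
        for doc in docs:
            pattern.update(vec(doc))
        scores.append((cosine(q, pattern), group))
    return sorted(scores, reverse=True)

new_doc = "genome sequencing of a dna sample"
top_doc = see_also(new_doc, database)[0]
top_group = classify(new_doc, database)[0]
print("See Also:", top_doc[1], "(in", top_doc[2] + ")")
print("Classify:", top_group[1])
```

Here `see_also` returns a list of documents and `classify` returns a list of groups; both lists are ranked rather than cut down to a single answer.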

“Prior use” has nothing to do with the suggestions made.

Although both See Also and Classify present a ranking of suggestions, other suggestions are presented in addition to highly ranked suggestions. I sometimes find a lower-ranked suggestion the most useful one. That’s why the algorithms don’t limit suggestions to just one. Choices are presented to the user.

I gave two examples of how documents might be filed by group in a database.

In the first, where the contents of each document within a group were filed by a topic such as Quantum Physics, it’s highly likely that the existing contents of that topical group will have a pattern of contextual relationships of the terms used in that discipline (vocabulary, frequencies of use of terms and associations among the terms present in the content) such that a newly added document about that topic will be pointed to the Quantum Physics group by Classify. Obviously, if one files documents about other topics into the Quantum Physics group, the group’s pattern of contextual relationships will be reduced in coherency, but if all the documents are topically related, the group’s pattern of contextual relationships will have high coherency, so that Classify works well.

In the second example, I gave an example of non-topical groups such as groups based on client names, rather than on the topical relationships of the contents of the group. In that organizational structure, the group named Mary Smith might contain a wide variety of topics all of which are related only by the fact that they pertain to Mary Smith. Classify will look only at the content of the documents in the group, and won’t find a highly coherent pattern of contextual relationships in that group. There’s nothing “wrong” with this approach to classification. My financial database uses a similar approach. I don’t need Classify to help me file items into such groups, as I know that, for example, any document related to Mary Smith belongs in the Mary Smith group.

Sometimes, when I’m using Classify I may decide to file a document into more than one suggested group. If I select more than a single group the document will be replicated into the selected groups. Again, there’s nothing “wrong” with this approach.

Once in a while I may find that a particular document becomes a “magnet” that makes Classify suggest a location that I don’t consider useful. For example, I once included within a database the PDF user manual for a new car that I had bought. Suddenly, Classify began suggesting the group containing the car manual for a wide range of new documents. Why? That user manual was large and dealt with a great many topics. The solution was to open the Info panel of that document and check the option to exclude it from classification.

Most of my databases are organized by topic, and I find Classify a useful assistant. Some databases are not organized by topic but by, e.g., financial transactions organized by type and date. Classify wouldn’t be useful–but I don’t need it anyway, as I already know where to file a new document in that organizational structure.

Starting out in a new database, if you are using topical groups you must first populate them with related content. As your database grows, Classify will become increasingly useful in suggesting locations for documents related to your topical groups.

transdef · December 16, 2014, 9:16pm

Bill, I’ve asked you very specific questions and you’ve replied with what seem to be canned responses about how you use DT, which aren’t all that relevant to what I asked.

I can tell from the specificity of the See Also suggestions that DT can see the pattern of relationships within at least some of my topical groups. All the suggestions are either from the appropriate group, or from the Inbox. I think it is working just fine.

For reasons you haven’t yet explained, Classify blurs out the suggestions amongst a large number of groups, many of which have no obvious relationship to the document at hand. I’ve never gotten a green light. What’s going on here?

Bill_DeVille · December 16, 2014, 11:43pm

Do you expect Classify to make only a single suggested group location? Why should it? Classify usually suggests several possible locations, based on analysis of the similarity of the contextual relationships among each group’s contents as potential locations for a new document.

DEVONthink doesn’t think like a human, or have the benefit of a human’s education and experience concerning a topic such as Ecology. It is working with the occurrences and frequencies of use of text strings in documents, using algorithms that rate possible similarities among documents (See Also) or group contents (Classify). On the other hand, humans are not wired to “think” like DEVONthink, which can compare a document to tens of thousands of others in the space of a second, or possible filing locations of a new document very rapidly.

In a real sense, organization of documents is for the convenience of the human user, reflecting thought or experience when groups are created and initially seeded with content. DEVONthink doesn’t need database organization to conduct searches, and See Also still works pretty well if there’s no organization at all in a large collection of documents.

Even in a discipline such as Ecology (supposing that’s a group category) there can be important subtopics which may have different contextual relationship patterns, e.g., a subgroup called Diversity, a subgroup called Predator-Prey Relationships, a subgroup called Invasive Species, etc. I often find it useful to reflect such subcategories of documents in my collection, and also find that Classify can “learn” them quite well if my initial seeding of them has been done well and I continue to make informed decisions about filing as my collection grows over time.

What would happen were I to always accept the top-ranked location suggested by Classify? Especially in the early days of a growing document collection, the groups would probably tend to diverge from my initial views of their purposes. In that case, they would become more satisfactory to Classify, but less satisfactory to me. Instead, I use Classify to provide me with suggestions but supervise it. In my main database, which has been developing for more than 12 years, most of Classify’s suggestions are very good ones. Almost always, I click on one of the top two or three suggested groups to file a new item. Once in a while I decide that none of the suggestions are appropriate, indicating the need to create a new group. That saves me time and effort, compared to manual filing into a database with hundreds of groups.

chrbyr · January 3, 2015, 1:24pm

I was interested to read this exchange because my experience, which is short but intense, is based on setting up a database from scratch, which now contains c. 9000 items in c. 6500 client folders. I’ve been setting up a database whose groups are named by transaction date and client name, e.g. 14-12-25 St Claus.

Initially I imported loads of folders I had created in a virtual hard disk on my Mac containing documents, and that was great: reciprocal groups were created housing my documents, with the titles of the groups as previously, in the form set out above.

I then imported loads of emails from Mail in tranches from the last 4 years, highlighting them and clicking Auto Classify, thinking that this would do the job. The result was a mess, with files being slotted in all over the place. I’ve had to go through groups individually to sort this out, and it’s taken me ages.

I’m not sure if my question is the same as the original poster’s, but it’s similar. What should I have done differently to avoid this? Is Bill suggesting that I should have dealt with each email individually? There are 6000 of them! I was hoping the AI would do the job.

cgrunenberg · January 5, 2015, 2:16pm

Classify does not only use the most similar documents (like See Also does), it uses both an average similarity to all contents in a group and the best similarities. Therefore groups should be topical (e.g. grouping based on time or projects is not ideal) and shouldn’t contain too many documents, otherwise it might be a good idea to add subgroups.
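A rough sketch of that idea, with loud caveats: the thread does not say how the average and the best similarities are actually combined, so the equal weighting below (`weight_best=0.5`), the bag-of-words cosine similarity, and the example groups are all assumptions made for illustration only.

```python
from collections import Counter
from math import sqrt

def vec(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def group_score(query, docs, weight_best=0.5):
    """Blend the best single match with the average over the whole group.

    The blend and its weight are hypothetical, not DEVONthink's formula.
    """
    q = vec(query)
    sims = [cosine(q, vec(doc)) for doc in docs]
    average = sum(sims) / len(sims)
    best = max(sims)
    return weight_best * best + (1 - weight_best) * average

# A topical group versus a mixed (e.g. per-client) group that happens
# to contain one closely matching document.
topical = ["coal ash disposal rules", "coal ash pond disposal"]
mixed = [
    "coal ash disposal rules",
    "holiday party invitation",
    "invoice for printing services",
    "parking permit renewal",
]

query = "coal ash disposal"
print(round(group_score(query, topical), 2), round(group_score(query, mixed), 2))
```

Both groups contain an equally good best match, which is the kind of evidence a document-level comparison like See Also surfaces, yet the mixed group's score is pulled down by its unrelated contents (about 0.87 versus 0.54 with these toy inputs). That may be one reason a group can show up repeatedly in See Also results while still scoring low in Classify.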

Bill_DeVille · January 5, 2015, 3:50pm

Organization by metadata such as date and client name rather than by topical Content of documents will make the Classify AI much less useful, to the point where I wouldn’t try to use it.

In the group holding documents for St Claus, there will likely be a wide range of topics, based on analysis of the textual Content (body text) of the documents. It might contain correspondence about various topics, invoices, payment records, contracts for various projects or whatever. As a result, the Classify algorithm will not be able to discern distinctive patterns of contextual relationships that distinguish the contents of each group, compared to the other groups in the database. If you invoke Classify, it will probably make suggestions, but you will usually find them not in accordance with your intended filing location. Remember that Classify doesn’t look at the Names of documents, only their Content.

If you could ask the Classify AI about your organizational system by client name, it would respond that the contents of your groups appear arbitrary, as it cannot see coherent contextual relationship patterns.

The upside is that, as you are filing documents by their Name (including client name) and the groups are based on client name organization, you already know where each document belongs. I would use the Groups & Tags panel to drop one or more of the new documents whose Name includes “St Claus” into the St Claus group. (To file a big batch of documents, I would do a Name search for each client name I see, rather than dealing with each document singly, and drag all the search results onto the corresponding client name group in the Groups & Tags panel.)
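The logic of that batch approach can be sketched in plain Python. This is not DEVONthink scripting and uses no DEVONthink API; it simply works on lists of name strings, and `file_by_client` and the sample names are hypothetical. In the application itself, the equivalent steps are the Name search and drag described above.

```python
def file_by_client(names, clients):
    """Map each client to the document names that mention that client.

    Names matching no client, or more than one, are left for manual review.
    """
    filed = {client: [] for client in clients}
    unfiled = []
    for name in names:
        matches = [c for c in clients if c.lower() in name.lower()]
        if len(matches) == 1:
            filed[matches[0]].append(name)
        else:
            unfiled.append(name)
    return filed, unfiled

# Hypothetical document names and client groups
names = [
    "14-12-25 St Claus invoice",
    "14-11-02 Mary Smith contract",
    "Re: lunch on Friday",
]
clients = ["St Claus", "Mary Smith"]

filed, unfiled = file_by_client(names, clients)
for client, docs in filed.items():
    print(client, "<-", docs)
print("needs manual filing:", unfiled)
```

Because the match is on the Name rather than the Content, it does not depend on how topically coherent each client group is.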

That’s the way I handle filing into my financial database. Each new receipt or banking or investment transaction is filed by type and date. I know already precisely where each document should be located, and don’t use Classify.

But Classify works wonderfully in my research database that contains tens of thousands of documents organized into hundreds of groups by topic/subtopic, based on the text contents of the documents.

chrbyr · January 6, 2015, 9:46am

Thanks, Bill, it’s helpful to know that.

While my expectations may have been misplaced, apart from that I’ve been on the right track. In fact, once I’ve transferred a file to the relevant group by dragging and dropping, when I subsequently use the top hat on the list for a file that belongs in the same group, more often than not it offers the correct group and I can transfer it with a double click, so it’s been useful for that.

It was a bit gruesome going through hundreds of groups when I first discovered the situation but I’m wiser now!

Happy New Year!

transdef · January 11, 2015, 5:29am

Is there an optimal size for a group? I’d like guidelines on what will make Classify work well.

Bill_DeVille · January 11, 2015, 4:59pm

In my experience the issue isn’t so much the optimal size of a group. The point is rather that I can very often benefit (as does the utility of Classify) from recognizing that a large number of items in a group can be a promising motherlode of data for creating subtopical groups.

Example: In my research database dealing with environmental issues, I have 3,899 documents that contain the term “coal”. Indeed, I do have a group that contains documents about coal, although not all the documents that include that term are located inside that group.

Coal remains a major source of heat energy, most prominently today for the generation of electricity. Mining of coal affects the environment and can involve health effects. Burning coal in a power plant results in the emission of a variety of air pollutants, e.g., mercury, resulting in health effects and in technical and regulatory approaches to control them. More recently, there has been increasing interest in carbon dioxide releases from coal and regulatory approaches to reduce emissions. The solid waste residues from coal burning also present environmental issues, such as disposal of coal ash.

Not surprisingly, in addition to documents focussed on scientific and technical issues related to coal and its uses, my database contains many that deal with policy issues and legislation and–inevitably–politics, economics, risk and cost/benefit assessments.

For that reason, my Coal group has a number of subgroups to help me conceptualize and organize my document collection about it. And many of the documents included in that Coal group are replicated in other groups as well, such as one under the topic of Climate Change and another under the topic of Energy (of course, those “big topic” groups have subgroups as well), &c.

Most of my groups do not contain subgroups. But I find it useful to use hierarchical organization for “big” topics that clearly lend themselves to subclassifications. When I’ve done that well and consistently, Classify helps me file new content very well.

Ideally, a group that contains subgroups should not itself contain documents. Sometimes I’m messy and break that rule. But when I’ve got time I try to clear out such cases, often finding it necessary to create new subgroups to classify items that didn’t fit in the existing structure.

mrkwr · January 12, 2015, 5:54pm

This may be off-topic, but it strikes me that Classify and See Also should both take more account of any tagging information as well as the text itself. This could analyse the whole tag network, not just look at documents with the same or related (parent/sibling/child) tags.