Suggestion - Automatic Topic classification

Frederiko · February 2, 2015, 11:35am

DT already has the neat ability to auto-classify, auto-group and suggest related items. Auto-classify and auto-group (the difference between the two I haven’t fully grasped) rely on having a certain number of pre-sorted documents on which the algorithm can base its assumptions.

I wonder if DT couldn’t also have a similar system for automatic topic generation where DT places suggested topics in tags.

There is a very interesting paper on the idea which might be of interest to the DT Devs:

[url]http://amaral-lab.org/publications/high-reproducibility-and-high-accuracy-method-automated-topic-classification/[/url]

Frederiko

korm · February 2, 2015, 12:41pm

Interesting wish list, though I wonder if it is commercially viable at DEVONthink’s price point. There are certainly a lot of technical challenges to address. Topic modeling assumes that there are one or more ranges of topics (which might be fuzzy and are usually not discrete) for each corpus. In other words, the challenge for developing topic modeling that is useful to both of us, for example, is that my “personal corpus” will probably map to topics that yours does not. So, which topic sets do the developers choose to use and what is the source?

If you are interested in a concept demo, look at this topic modeling experiment. This demo will work with DEVONthink when applied to an indexed corpus of documents. The OpenCalais semantic web service is another, related, example.

One could use DEVONthink’s existing classification tools by creating a topic model within the database over time. Create groups that represent topics (or topic sets) and add snippets to those topic/groups that are exemplars of the topic. For example, make a group for “voucher policy” and add numerous snippets of around 50 words each that are exemplars of statements about that topic. This is not a new technique – many researchers salt their databases with topic groups like this – but it is an example of a do-it-yourself topic model that can emulate formal models such as the LDA model mentioned in the paper @Frederiko cited.
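To make the do-it-yourself idea concrete, here is a minimal sketch of exemplar-based topic suggestion. It is not how DEVONthink classifies internally (that is not described in this thread); it simply compares a document's word counts with the pooled word counts of each topic group using cosine similarity. The group names, snippets, stopword list and function names are all made up for illustration.

```python
import math
import re
from collections import Counter

# Assumption: a tiny hand-picked stopword list; a real setup would use a fuller one.
STOPWORDS = {"the", "a", "of", "in", "on", "for", "that", "its", "to", "and"}

def vectorize(text):
    """Turn text into a bag-of-words Counter, ignoring stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical topic groups, each "salted" with short exemplar snippets.
topic_groups = {
    "voucher policy": [
        "School vouchers let parents spend public funds on private tuition.",
        "Critics argue that voucher programs drain money from public schools.",
    ],
    "water quality": [
        "Nitrate levels in the river exceeded the drinking water standard.",
        "Monitoring wells showed groundwater contamination near the landfill.",
    ],
}

def build_profiles(groups):
    """Pool the word counts of every snippet in a group into one profile."""
    profiles = {}
    for topic, snippets in groups.items():
        profile = Counter()
        for snippet in snippets:
            profile.update(vectorize(snippet))
        profiles[topic] = profile
    return profiles

def suggest_topic(text, profiles):
    """Return the best-matching topic and the score for every topic."""
    doc = vectorize(text)
    scores = {topic: cosine(doc, profile) for topic, profile in profiles.items()}
    return max(scores, key=scores.get), scores

profiles = build_profiles(topic_groups)
best, scores = suggest_topic(
    "The state expanded its voucher program for private schools.", profiles
)
print(best, scores)
```

With only two snippets per group the match rests on a handful of shared words, which is why salting each group with numerous exemplars matters: the more snippets, the broader the vocabulary each topic profile covers.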

Bill_DeVille · February 2, 2015, 8:17pm

Yes, that is an interesting paper. It represents an approach to classifying by topic large collections of files containing text. The purpose of such classification would be to allow the researcher to winnow down from the collection items related to a topic. In this case, the algorithms would have already tagged files by topic, so the researcher can quickly find potentially useful documents by tag.

Logically, there’s no difference between tags and keywords. Applying such a classification to an item is usually done to make it easier to retrieve by a search, and/or to aggregate or relate items that share the classification.

There are serious logical and methodological problems with applying and using keywords or tags. These can be summarized as problems of consistency, comprehensiveness and context. My own experience as director of a computer information center back in the days when enormous collections of documents could only be searched by keyword turned me into a curmudgeon concerning the return on investment of time and energy in trying to apply such a priori classifications to every document in my databases. I regard tags or keywords as sometimes useful, but I take the time to assign them only for limited purposes, where the investment of my time and energy is likely to be well repaid.

But the cited paper is about assigning tags by the computer, so as to help the researcher avoid reading all those documents and applying topical tags. Great! The human doesn’t have to do the work! It’s likely that an algorithm will be more consistent than would a human in applying tags, but the logical and methodological issues of comprehensiveness and context remain. (Do you really want to see hundreds or thousands of tags per document, depending on nuances of comprehensiveness and context, and that relate not only to the document itself but to others in the database?)

Information science has made a lot of progress in assisting humans to work with the information content of documents. Computers can do some things very quickly, for which the human brain isn’t wired. We have already reached the stage where a human researcher can use a computer synergistically, interacting with the computer’s software to find and analyze information much more easily than in the pre-computer past. But true semantic analysis hasn’t yet been achieved, nor is one’s computer trained in disciplines such as chemistry, ecology and so on. In that sense, it remains the human’s responsibility to evaluate information found or suggested by the computer.

DEVONthink and DEVONagent have been suggesting keywords or topics identified by contextual relationships in text content since these applications first appeared. In DEVONthink, open a document to display its content. In the navigation bar immediately above the pane in which the document is displayed there’s a Keyword button. Click it, and a list of suggested keywords is displayed. In DEVONagent Pro’s Digest view of search results a list of topic terms is displayed, and even a graphical display of relationships among search hits by topic terms.

In DEVONthink, Option-click on any single word term in a document. A list of all other documents in the database that contain that term is displayed. (That’s not complicated, of course.)
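For readers curious why a term lookup like this is cheap, the usual technique is an inverted index that maps each word to the documents containing it. The sketch below is a generic illustration, not DEVONthink's actual implementation; the document names, texts and function names are invented.

```python
import re
from collections import defaultdict

def build_index(documents):
    """Map each lowercased word to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for term in set(re.findall(r"\w+", text.lower())):
            index[term].add(name)
    return index

def documents_containing(index, term, exclude=None):
    """List documents that contain the term, optionally skipping the current one."""
    return sorted(index.get(term.lower(), set()) - {exclude})

# Hypothetical three-document database.
documents = {
    "notes.txt": "Topic models assign tags to documents.",
    "memo.txt": "Tags and keywords are logically the same.",
    "draft.txt": "Semantic analysis remains a hard target.",
}

index = build_index(documents)
others = documents_containing(index, "tags", exclude="notes.txt")
print(others)
```

The index is built once, so each later lookup is a single dictionary access rather than a scan of every document.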

Indeed, some of the algorithms in DEVONthink and DEVONagent Pro go well beyond simple use of terms contained in a document, such as the keywords listed for that document. See Also, for example, may suggest among a list of similar documents in the database one that may not contain the same keywords, yet is found to be contextually related! Such a suggestion can be unexpected and, if on review I find the suggestion useful, it’s the kind that makes me shout Eureka! – it’s a conceptual relationship I hadn’t thought of. No, See Also isn’t true semantic analysis that can identify a concept regardless of the terms used to express it. That’s a really tough target for information science. But sometimes See Also approaches the results that would be expected from true semantic analysis. Of course, it’s up to the user to evaluate suggestions and recognize the really useful ones.

As time goes on, computer hardware becomes more powerful and software more powerful and sophisticated. Isn’t that wonderful?