The problem of auto(suggest)-tagging

This topic is most relevant to those investigating how to use tags to classify and retrieve knowledge from long (academic) literature with multiple attributes (field, topic, context, subject, theory, methodology, design, evidence, author, etc.) and from short research notes. Note that this post asks “is auto(suggested) tagging feasible?”, not “is it effective to use tags for info-categorisation?”. The answer to the latter is likely contingent on the user’s preferences, habits, workflow, and the purpose and design of the tags.

The context. I have been working on-and-off on a script that suggests a list of possible tags, drawn from the existing tag-groups, for categorising literature (pdf) and research notes (rich text or markdown). I am not a seasoned programmer, nor do I have any knowledge of AI. Therefore I focus on brute-force matching methods and on utilising DT’s internal concordance and see-also functions.

If you are reading this post, I look forward to hearing about alternative methods and methodologies, within the capacity of a single user, for tackling the problem of auto-tagging.

My findings
First, it is not feasible to precisely auto-tag long and content-complex materials. However, a cocktail of brute-matching methods can create a useful shortlist of suggested tags. Such a shortlist helps the user narrow down the choices among existing tags and reduces the creation of redundant tags during categorisation. Second, shorter cited texts and research notes are typically more homogeneous and have fewer attributes. A cocktail of multiple brute-matching methods, together with good file-naming practice, can produce rather precise tag suggestions.
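To make the “cocktail” idea concrete, here is a minimal sketch (Python for illustration only, not my actual DT script; the tag names and the min_votes threshold are invented) of merging the suggestions of several matching methods into a ranked shortlist:

```python
from collections import Counter

def shortlist(suggestions_per_method, min_votes=2):
    """Merge tag suggestions from several matching methods.
    Tags nominated by more methods rank higher; min_votes trims one-off noise."""
    votes = Counter(tag for method in suggestions_per_method for tag in method)
    return [tag for tag, n in votes.most_common() if n >= min_votes]

# Three hypothetical methods each nominate a few existing tags:
print(shortlist([
    {"theory/institutional", "methodology/qualitative"},
    {"theory/institutional", "field/management"},
    {"theory/institutional", "methodology/qualitative"},
]))
# -> ['theory/institutional', 'methodology/qualitative']
```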

An important implication of these two findings, as far as knowledge retrieval is concerned, is to consider extracting relevant and homogeneous clusters of short text from a long article and writing research notes on each of those clusters before tagging [Note 1]. Finally, the effectiveness of auto(suggest)-tagging depends on the design of the schemata of matching words, the fine-tuning of the parameters in the brute-matching methods, and the training of DT’s AI by the user.

The methods
There are three elements in my design of the matching process: the matching content (MC), which represents the essence of a document; the matcher (MM), which is compared against the matching content; and the matching operator (MO), which sets the matching condition. In sum, a matching process finds MM within MC under the mechanism of MO.
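For concreteness, a minimal sketch of this three-element design, with simple substring operators standing in for MO (Python for illustration; the operator names and example phrases are my own assumptions):

```python
def mo_any(mc: str, mm: list[str]) -> bool:
    """MO: match if any phrase of the matcher occurs in the content."""
    return any(phrase in mc for phrase in mm)

def mo_all(mc: str, mm: list[str]) -> bool:
    """MO: match only if every phrase of the matcher occurs in the content."""
    return all(phrase in mc for phrase in mm)

def mo_min_hits(mc: str, mm: list[str], k: int = 2) -> bool:
    """MO: match if at least k phrases of the matcher occur in the content."""
    return sum(phrase in mc for phrase in mm) >= k

def match(mc: str, mm: list[str], mo) -> bool:
    """A matching process: find MM within MC under the mechanism of MO."""
    return mo(mc.lower(), [phrase.lower() for phrase in mm])

mc = "We draw on institutional theory to study field-level change."
print(match(mc, ["institutional theory", "legitimacy"], mo_any))  # True
print(match(mc, ["institutional theory", "legitimacy"], mo_all))  # False
```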

I eliminate two designs of MM and MO right at the beginning. For MM, I reject merely relying on the tag’s name. For MO, it’s more complicated. As far as I am aware, there are two possible methodological approaches. A typical methodology would likely start by building a database of frequency distributions and associations between words in the pool of documents, followed by an algorithm (or, in my case, a manual process) to assign or create a list of possible tags for each recurring combination of high-occurrence and/or highly distinctive keywords. I reject this approach because it is not feasible for a single user to prepare such an infrastructure (the database of keywords and their associations). Instead, I go for a naive approach: I create a schema of phrases and keywords for each of my more commonly used tags. These schemas become the individual MM of each tag. In this way, I can improve the MM of the more essential tags, and extend the scheme to more tags, on an on-going basis.
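A minimal sketch of what such per-tag schemas could look like (all tag names, phrases, and the min_hits parameter are invented for illustration):

```python
# Each commonly used tag gets its own matcher (MM): a hand-written
# schema of phrases and keywords. All entries below are made up.
TAG_SCHEMAS = {
    "theory/institutional": ["institutional theory", "isomorphism", "legitimacy"],
    "methodology/qualitative": ["case study", "grounded theory", "interviews"],
    "design/longitudinal": ["longitudinal", "panel data", "over time"],
}

def suggest_tags(mc: str, min_hits: int = 1) -> list[str]:
    """Suggest every tag whose schema has at least min_hits phrases in MC,
    ranked by the number of hits."""
    text = mc.lower()
    hits = {tag: sum(phrase in text for phrase in phrases)
            for tag, phrases in TAG_SCHEMAS.items()}
    return sorted((t for t, n in hits.items() if n >= min_hits),
                  key=lambda t: -hits[t])

note = "A longitudinal case study, with interviews, of legitimacy struggles."
print(suggest_tags(note))
# -> ['methodology/qualitative', 'theory/institutional', 'design/longitudinal']
```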

My methods:
For MC: There are three ways of extracting the essence of a document: (1) the plain-text content, (2) the concordance, and (3) the other documents that are “reasonably” similar to the focal document. The first consideration is that the three types of MC require different designs of MM, have different optimal MOs, and may serve different tagging ontologies. The second consideration is which elements of the design are critical for each type of MC: for (1), what proportion of the plain text is needed; for (2), how the concordance should be ranked and which MO to use; for (3), what MC can be extracted from, or proxied by, those “similar” documents.
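DT provides the concordance natively; the sketch below merely shows the shape of such an MC when built from scratch, under arbitrary assumptions (the stop-word list, the 30% proportion for (1), and the top-20 cut-off are all invented):

```python
import re
from collections import Counter

# A tiny, arbitrary stop-word list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are", "that", "this"}

def concordance(plain_text: str, proportion: float = 0.3, top_n: int = 20):
    """MC type (2): rank the terms of a slice of the plain text by frequency.
    The slice itself, controlled by `proportion`, is MC type (1)."""
    head = plain_text[: int(len(plain_text) * proportion)]
    words = re.findall(r"[a-z]+", head.lower())
    freq = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return freq.most_common(top_n)
```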

The discussion of MM and MO will continue when I find some more time to write. In sum, the operationalisation of my methodology is rather primitive and straightforward: just a lot of scripting. The results seem to serve the purpose of targeted knowledge retrieval from my research notes, but less so from the original literature.

[Note 1] An idea for a longer-term project: tag the parts, then use that information to tag the whole at a later stage. Obviously, the value of this process lies not in immediate categorisation but in achieving effective knowledge retrieval over the long term.