Relevant, Automatic Tagging

I searched this forum but could not find a solution to the following request: have DTP create tags based on a document’s most frequent keywords. I don’t know whether this can be done; the main problem is filtering out useless words such as prepositions, definite articles, etc.

When I open the Words panel and click Frequency, I can quickly see the most relevant nouns and adjectives used, which gives me a bird’s-eye view of the main concepts developed in a PDF article. If such a feature is not possible, how can I obtain a tag cloud for a given PDF? These are just some suggestions that come to mind.
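The core of this request can be sketched outside DEVONthink. What follows is a minimal Python sketch, assuming the PDF’s text has already been extracted into a plain string. The function name `candidate_tags` and the tiny `STOP_WORDS` set are hypothetical illustrations, not anything DEVONthink provides.

```python
import re
from collections import Counter

# Hypothetical, deliberately short stop-word list; a usable one would be far longer.
STOP_WORDS = {
    "the", "a", "an", "of", "in", "on", "to", "and", "or", "is", "are",
    "for", "with", "that", "this", "it", "as", "by", "at", "from",
}

def candidate_tags(text, top_n=5, min_length=3):
    """Return the top_n most frequent words that are not stop words."""
    # Lowercase the text and keep alphabetic runs only (digits and punctuation are dropped).
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(
        w for w in words if w not in STOP_WORDS and len(w) >= min_length
    )
    return [word for word, _ in counts.most_common(top_n)]

sample = "The theory of governance and the governance of the state. Governance matters."
print(candidate_tags(sample, top_n=3))
```

Filtering prepositions and articles is the easy part. Whether the surviving words make good tags is a separate question.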

I wonder if that would be less useful in practice than in theory. Imagining how this would work: if auto-tagging is not 100% “useful” 100% of the time, then the database’s set of tags would become increasingly less “useful”. (Define “useful”? Depends on what you’re looking for.)

So, since no automation at this price level can be 100% accurate, the database is going to get crufty and increasingly in need of pruning and other maintenance. The suggested feature thus might be short-term glitz in exchange for long-term drudgery.

Looking at the “keywords” listed for several of my documents, my conclusion is that automatic creation of tags based upon them would be confusing and often counterproductive.

Here’s a simple example. In one of the documents a street address is listed, 3600 Florida Boulevard. “Florida” is listed as a keyword. But the document has nothing to do with the state of Florida, and to apply that tag would be confusing. For one thing, keywords are single-term and not multi-term. There’s a big difference between a reference to the state of Florida or a location within that state, and a reference to a street address on a Florida Boulevard that’s not in Florida.
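The single-term problem is easy to reproduce. Here is a small Python sketch, assuming plain extracted text; the helper names `single_terms` and `two_word_phrases` and the sample sentence are made up for illustration. Counting single words surfaces “florida”, while counting adjacent word pairs shows that it only ever appears as part of “florida boulevard”.

```python
import re
from collections import Counter

def single_terms(text):
    """Count individual words (alphabetic runs, lowercased)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def two_word_phrases(text):
    """Count adjacent word pairs; note that pairs may span sentence boundaries."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(" ".join(pair) for pair in zip(words, words[1:]))

sample = "Deliver to 3600 Florida Boulevard. The Florida Boulevard office is open."

print(single_terms(sample)["florida"])
print(two_word_phrases(sample)["florida boulevard"])
```

Even with phrase counts, nothing in the text says whether Florida Boulevard is in Florida, so the ambiguity remains for a human to resolve.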

And then there are problems with automatic tag generation resulting from the fact that synonyms, similar words and nicknames exist and are often used. A document about Florida could exist in which the term “Florida” isn’t used, but the phrase “Sunshine State” is. I would probably recognize the relationship between the terms, but DEVONthink won’t.

I’m something of a curmudgeon about tags. When I use them, I want their uses to be very precise. I would be unhappy with a tag cloud that is so cloudy that it resembles a London fog instead. :)

That said, although I usually want my searches to be based on very precise handling of the search criteria, there are times I’ll explore the results of pressing the Similar Words button in the full Search window, which may reveal interesting words in context, or spelling variations. For similar reasons, I may choose the Fuzzy search option, which can often include results based on variant spelling, typos or OCR text recognition errors.
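To illustrate the general idea behind fuzzy matching (this is not a description of how DEVONthink implements its Fuzzy option), here is a minimal Python sketch using the standard library’s `difflib`. The helper name `fuzzy_matches`, the word list, and the 0.8 cutoff are assumptions chosen for the example.

```python
from difflib import get_close_matches

def fuzzy_matches(query, vocabulary, cutoff=0.8):
    """Return words from vocabulary whose similarity to query is at least cutoff."""
    return get_close_matches(query, vocabulary, n=10, cutoff=cutoff)

# Hypothetical words found in a database, including a typo of "government".
vocabulary = ["goverment", "government", "garden", "boulevard"]

print(fuzzy_matches("government", vocabulary))
```

A lower cutoff catches more typos and OCR errors but also lets in more unrelated words, which is the same precision trade-off discussed above for tags.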

Unlike my use of tags, which I want to be precise and predictable, I use See Also and See Related Text in order to look for relationships of concepts that I wouldn’t have thought of. I don’t want See Also to only suggest documents that are obviously topically similar to the one I’m viewing. Like Steven Johnson, I want a bit of chaos in the way the algorithms operate, surprising me with suggestions that may be valid but unexpected.

Irrespective of the technical feasibility of the idea, there are some more theoretical/practical considerations for sure. Bill has definitely pointed to some major ones with respect to the Florida example.
To provide another example similar to Bill’s, I deal with a few subject areas that can be described using relatively standard words. One of the best examples would be “Governance”, “Government” and “Governmentality”.
I have a large swath of academic literature that deals with theories of government and governance. However, I also have a considerable body of news, government and media documents that refer to governance and government in a completely different context (e.g., “The government of Canada has…”, “Government Officials decided…”, “Changes in the governance structure of the CFIA…”). While these things are related inasmuch as there are theories of govern*, and there are some real-world cases or instances of govern*, the two shouldn’t be conflated, and the relationship doesn’t always hold. Consequently, I have a tag for academic literature on theories of governance which explicitly excludes documents that simply use those words. All the media/government stuff gets organized based on the actual subject matter that is being addressed. This way my “governance” tag isn’t populated by a swath of irrelevant material that just so happens to frequently include the words “government” or “governance”.

But there is perhaps the broader question of whether tagging based on frequency of occurrence is the best use of tags, or is truly helpful. From an organizational standpoint, I see it running into several issues, not least the one identified by Bill and reiterated by me. Tags are likely best applied manually (or with the assistance of the AI), based on whatever schema is best suited to your needs and the content you are organizing. This will offer precision and keep things reasonably tidy.

Search and saved searches might be the best way to discover/organize content based on word occurrence. Searching for a word or phrase of interest will result in the most relevant results being displayed first (usually the ones in which the word or phrase appears the greatest number of times). You will invariably get false positives (e.g., a search for “government” will return both government documents with a page header that repeats 150 times and an academic article in which the theory of government and governance is discussed), but at least you’ll be getting a broad overview, and this seems to me to be a bit of what you are after. If you need to drill down, you use your precisely applied tags. Granted, unlike your request, you’ll need to know, in advance, the term you want to look for, so it isn’t necessarily conducive to discovering new themes, just discovering an a priori defined theme in new places. That being said, I think if you want to discover totally new themes, reading the actual documents is likely the best way to go, and failing that, some use of the Concordance feature.
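The page-header false positive can be shown with a toy example. This Python sketch assumes relevance is nothing more than a raw occurrence count, which is a simplification for illustration and not a description of DEVONthink’s actual ranking; the function `rank_by_occurrence` and the sample documents are hypothetical.

```python
def rank_by_occurrence(documents, term):
    """Rank documents by how many times term appears as a whole word."""
    term = term.lower()
    scored = [
        (name, text.lower().split().count(term))
        for name, text in documents.items()
    ]
    # Drop documents with no hits, then sort by count, highest first.
    hits = [(name, count) for name, count in scored if count > 0]
    return sorted(hits, key=lambda item: item[1], reverse=True)

documents = {
    # A report whose repeated page header contributes 150 occurrences.
    "report": "government " * 150 + "inspection results",
    "article": "a theory of government and governance",
    "recipe": "flour and water",
}

print(rank_by_occurrence(documents, "government"))
```

The report outranks the article purely because of its header, even though the article is the one that actually discusses the concept.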

Thus I think tags are best done manually and precisely, as Bill suggests. If you want to go by frequency of appearance, searches are likely the best bet if you know the term you want to find in advance. If you want to discover new terms, themes, etc., you’re best off actually reading the content, or at least using concordance (database level) or keywords (document level) to discover the words that appear frequently.

Just my 5¢. I’m not trying to dismiss your idea by any stretch, just offering some additional brainstorming as to how you might be able to proceed in the absence of the feature you are requesting.