Does anyone have a method to tag languages in feeds (or documents more generally)?

I use RSS feeds. Some of the feeds are from social media sources like Mastodon, and some of the items coming through are written in different languages. I’d like to detect articles that are not written in certain languages (primarily English) so that I can filter them out. The RSS sources don’t offer a way to only get items written in specific languages, so I’m trying to set up post-processing to filter them out. A relatively simple approach would be to have a DEVONthink smart rule that detects the language used in the content of an item and tags the item appropriately; then I can write another smart rule to delete items that match (or don’t match) the desired language tags.

Which leads me to this question: does anyone already have a scheme for tagging (or even just detecting) the main language in which a document is written?

Does this page on RSS Language Codes help?

Thanks for the link. Unfortunately, the problem is that the RSS feeds are the result of searches, so there’s no language info in the feeds themselves. Also, people post in different languages on the same Mastodon servers, so the servers are not an indication of language either.

Update: some further searching revealed that benoit.pointet posted more or less the same question in 2019, but the link to his script is broken in the Discourse post. The answer posted at the time by @cgrunenberg sounds promising, but I can’t figure out what the “integrated language detection” is. I must be missing something obvious.

This prompted me to search the DEVONthink user manual for 3.9, and I found this on p. 233 of the PDF version:

md_language: An abbreviation of the detected language in the contents of a file. For a list of values, select a language in the criteria and note the abbreviation to use.

But in my case, when I look at the built-in Language field in the DEVONthink metadata, the value is empty, and I’m not sure how it’s supposed to work. Does DEVONthink normally fill in the language automatically? If so, does it normally work for RSS feeds? Maybe the problem is that it’s not working for the specific cases I’m trying.

It works over here:

I’ve set German as the primary and English as the secondary language in the macOS system preferences. Maybe that’s needed for all languages you want to detect?


This seems to work

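-- Get the dominant language of a string

use AppleScript version "2.4"
use framework "Foundation"
use framework "NaturalLanguage" -- NLTagger is part of the NaturalLanguage framework
use scripting additions

-- Sample strings; uncomment one to try a different language
#set theText to "The dominant language of the string set for the linguistic tagger."
#set theText to "Die dominante Sprache des Zeichenfolgensatzes für den linguistischen Tagger."
set theText to "La langue dominante de la chaîne définie pour le tagueur linguistique."

-- No tag schemes are needed just to identify the language
set theNSTagger to current application's NLTagger's alloc()'s initWithTagSchemes:{}
theNSTagger's setString:theText

-- dominantLanguage() returns a language code (for example, "fr" is the code for French)
set theNSTagger_dominantLanguage to (theNSTagger's dominantLanguage()) as string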
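For the original goal of tagging feed items, the same detection could probably be wrapped in a smart rule script along the following lines. This is only a sketch and has not been checked against a real database: the "lang-" tag prefix and the dominantLanguageOf handler are made-up names, and it uses NLLanguageRecognizer's dominantLanguageForString: class method instead of NLTagger, which should give the same kind of language code. It assumes the item's text is available through the record's plain text property.

```applescript
-- Sketch of a DEVONthink smart rule script: tag each record with its dominant language.
use AppleScript version "2.4"
use framework "Foundation"
use framework "NaturalLanguage"
use scripting additions

-- Hypothetical tag prefix; change it to fit your own tagging scheme.
property tagPrefix : "lang-"

on performSmartRule(theRecords)
	tell application id "DNtp"
		repeat with theRecord in theRecords
			set theText to plain text of theRecord
			if theText is not "" then
				set theLanguage to my dominantLanguageOf(theText)
				-- Skip records whose language could not be determined
				if theLanguage is not "" then
					set tags of theRecord to (tags of theRecord) & {(my tagPrefix) & theLanguage}
				end if
			end if
		end repeat
	end tell
end performSmartRule

-- Hypothetical helper: returns a language code as text, or "" if none was detected.
on dominantLanguageOf(theText)
	set theLanguage to current application's NLLanguageRecognizer's dominantLanguageForString:theText
	if theLanguage is missing value then return ""
	return theLanguage as string
end dominantLanguageOf
```

A second smart rule could then delete or move items whose tags do not include the wanted language, e.g. anything without a "lang-en" tag. Very short posts may be misdetected, so it is probably safer to move items to a review group first rather than deleting them outright.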


OK, I figured out what is happening, but I don’t know why it’s happening.

In the DEVONthink preferences, in the Data tab, there is a field called “Language”. I had turned on the visibility of this field and then looked at documents in my database. None of them have a value for that field in the UI, so I assumed this meant none had a value for the language.

But if I search based on the language field, it works! So in fact, the language detection in DEVONthink must be working. Something is simply wonky with the user interface (or I’ve done something to screw it up).

Thank you for posting what you did.

I was not aware of NLTagger. Thank you for posting this; it is useful to know.


That’s just an example (probably to show users what types of Custom Meta Data are available). I deleted them and never experienced any problems.

DEVONthink stores the recognized language internally. There’s also a criterion (md_language) that can be used for searching as well as in smart groups and smart rules.

The default custom metadata field “Language” is a separate thing: it is optional, it is not set automatically, and it has a different prefix (mdlanguage). It’s of course also possible to change or remove this field.
