Language-specific searches

kh1 · December 13, 2004, 10:32am

A search feature that I often miss in search engines is language-specific searching, i.e. searching only pages in a given set of languages, or excluding a particular set of languages. Google lets me pick a single language, but not multiple languages, and it does not support exclusions.

An example of how useful exclusions would be: searching for information about my favourite musical instrument, the oud, yields tons of web pages in Dutch, where “oud” means “old”. Adding terms such as “music” eliminates the Dutch pages, but also some of the pages that I want to see. Excluding frequent Dutch words is a better approach, but a challenge for those who don’t know Dutch well enough. I’d much prefer searching for “oud outside of Dutch pages” or “oud in German, English, or French pages”.
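The two kinds of query described above (“outside of Dutch pages” and “in German, English, or French pages”) can be sketched as a simple include/exclude filter. This is only an illustration: it assumes each result already carries a language code, and the names `filter_by_language`, `include`, `exclude` and the `"lang"` key are made up for the example, not taken from any existing search engine.

```python
def filter_by_language(pages, include=None, exclude=None):
    """Keep pages whose language passes the include/exclude rules.

    pages   -- list of dicts with a "lang" key (hypothetical result format)
    include -- if given, only pages in these languages are kept
    exclude -- pages in these languages are always dropped
    """
    include = set(include or [])
    exclude = set(exclude or [])
    kept = []
    for page in pages:
        lang = page["lang"]
        if lang in exclude:
            continue  # e.g. "oud outside of Dutch pages"
        if include and lang not in include:
            continue  # e.g. "oud in German, English, or French pages"
        kept.append(page)
    return kept


# Made-up results for the search term "oud".
results = [
    {"title": "Oud tuning basics", "lang": "en"},
    {"title": "Oud en nieuw", "lang": "nl"},
    {"title": "Die Oud im Orient", "lang": "de"},
]

not_dutch = filter_by_language(results, exclude=["nl"])
german_english_french = filter_by_language(results, include=["de", "en", "fr"])
```

With the sample data, both calls drop the Dutch page and keep the other two.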

Such a feature would also be nice in DEVONthink, but more important in DEVONagent.

Maria · December 14, 2004, 7:56am

kh,

I understand your problem – and your love for the oud – but here I would like to add another scenario about search in different languages:

I have my files in 3 to 5 languages, but all about the same topics. So I mostly search English and French together, then German, then Japanese and Chinese together, grouping the languages where words look similar, or else I perform slow OR-searches. It would be great to be able to search for one topic in several languages at once.

One important drawback: searches of this kind only work if pages declare their language, and most pages don’t. In DT, we would have to create that kind of metadata on our own. In addition, some pages are written in several languages.

So I see many, many problems.

Maria

kh1 · December 14, 2004, 8:52am

I like the idea of multiple-language search as well, but I don’t see how this could be done realistically. Even supposing that huge dictionaries were available (many search terms are rather specialized), the translation of a term depends on context and no computer program can supply that. I’d rather choose the terms myself in all languages. Storing them in a personal dictionary might be an option though.
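To make the personal dictionary idea concrete, here is a rough sketch of how hand-picked terms could be turned into one OR-search. Everything in it is an assumption for illustration: the dictionary layout, the function name `build_or_query`, and the quoted `OR` syntax are not taken from DEVONthink or DEVONagent.

```python
# Hypothetical personal dictionary: the user picks the terms for each
# language, so no automatic translation is involved.
PERSONAL_DICTIONARY = {
    "lute": {"en": "lute", "de": "Laute", "fr": "luth"},
}


def build_or_query(term, languages, dictionary=PERSONAL_DICTIONARY):
    """Combine the stored terms for the given languages into an OR query."""
    entries = dictionary.get(term, {})
    terms = []
    for lang in languages:
        word = entries.get(lang)
        if word and word not in terms:  # skip missing terms and duplicates
            terms.append(word)
    if not terms:
        terms = [term]  # nothing stored: search for the term as typed
    return " OR ".join(f'"{t}"' for t in terms)


query = build_or_query("lute", ["en", "de", "fr"])
```

A term that is not in the dictionary is passed through unchanged, so the search still works before any entries have been added.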

The language identification issue looks simpler to me. Even though language declarations are rare, statistical analysis of the text works pretty well for anything consisting of full sentences, or better yet paragraphs. One could even do the identification paragraph by paragraph and flag the unclear ones (headlines etc.) as “unknown”.
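As a very rough sketch of paragraph-by-paragraph identification, the code below counts frequent function words per language and flags anything too short or too ambiguous as “unknown”. It is a simplification of the statistical analysis mentioned above, not a real identifier: the tiny word lists, the thresholds `min_words` and `min_share`, and the function names are all assumptions chosen for the example.

```python
import re

# Tiny illustrative lists of frequent words; a real identifier would use
# much larger lists or character n-gram statistics.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "in", "is", "that", "it", "for", "with"},
    "nl": {"de", "het", "een", "en", "van", "is", "dat", "op", "niet", "voor"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "ein", "eine", "mit", "von"},
    "fr": {"le", "la", "les", "et", "est", "un", "une", "des", "pas", "pour"},
}


def guess_language(paragraph, min_words=5, min_share=0.2):
    """Return a language code, or "unknown" if the evidence is too weak."""
    words = re.findall(r"[^\W\d_]+", paragraph.lower())
    if len(words) < min_words:
        return "unknown"  # headlines etc. are too short to judge
    scores = {lang: sum(w in sw for w in words) for lang, sw in STOPWORDS.items()}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (best, best_hits), (_, second_hits) = ranked[0], ranked[1]
    if best_hits / len(words) < min_share or best_hits == second_hits:
        return "unknown"  # too few hits, or a tie between two languages
    return best


def label_paragraphs(text):
    """Split text on blank lines and label each paragraph separately."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [(guess_language(p), p) for p in paragraphs]


sample = (
    "Oud music\n\n"
    "The oud is a musical instrument that is played in many countries of the world.\n\n"
    "Het huis van de oude man is niet groot en dat is een probleem."
)
labels = [lang for lang, _ in label_paragraphs(sample)]
```

Labelling each paragraph separately also gives a handle on pages written in several languages, since a page can then carry more than one label.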

cgrunenberg · December 16, 2004, 10:55am

Although that’s not exactly the requested feature, a future release will implement a language filter: you define your preferred languages in the preferences, and DA will try to eliminate all pages in other languages. Of course, this will only support the most common languages (English, French, German, Spanish etc.).

kh1 · December 16, 2004, 11:42am

That’s already good progress, thanks!