Documents in different languages

mwi-bln · August 26, 2009, 9:54am

I need some advice on best practice for handling documents in different languages. My DT database consists of several hundred documents, half of them in English, half of them in German.

Currently I am mixing all the documents in one database. When searching I usually use ‘OR’ to look for the English or German term. But some of the best features (‘See Also’, Wiki links, etc.) don’t work well in a multi-language database.

Does anybody have a similar use case? I would be interested to hear about your workflow for handling that scenario.

Marius

PS: Since this is my first post I don’t want to miss the chance to say how useful DT is for me. Please keep up the good work!

Declan · August 26, 2009, 1:31pm

I’m sure many people have similar needs.
I’ve wondered if the way to help a bit is to insert at least a few keywords in the comments field. Better than nothing.
But I don’t really know how the comments field is intended to be used… it seems that it could be used for keywords, for tags, for comments (as the name suggests) or anything else. Unfortunately you just get one box to type everything in, so I’m not sure it’s helpful to use that field for all of the above possible uses.
I’m looking forward to hearing some knowledgeable responses to your query.

Declan

Johannes · August 26, 2009, 2:17pm

Most of my documents are in German but some are in English. Using See Also or classify with them is almost useless. I think there is no workaround (except having separate databases based on language, but that makes it difficult to keep things together by content).

WikiLinks for different languages can easily be solved using aliases (some manual work of course).

In the long run the AI needs to be language aware. I would not think of some kind of automatic translation (like some suggested in another thread). But having a language flag for each file could perhaps help the AI to compare only within the same language.
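To make the language-flag idea concrete, here is a minimal sketch in Python. It is not how DEVONthink works internally: the `Doc` class, its `language` field, and the word-overlap `similarity` function are all assumptions made up for illustration. It only shows how a per-document flag could restrict a See Also style comparison to documents in the same language.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    name: str
    language: str  # hypothetical per-document language flag, e.g. "de" or "en"
    text: str


def words(text: str) -> set[str]:
    """Lower-cased set of whitespace-separated words."""
    return set(text.lower().split())


def similarity(a: Doc, b: Doc) -> float:
    """Jaccard overlap of word sets; a simple stand-in, not DEVONthink's AI."""
    wa, wb = words(a.text), words(b.text)
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def see_also(target: Doc, docs: list[Doc], same_language_only: bool = True) -> list[Doc]:
    """Rank the other documents by similarity, optionally within one language."""
    candidates = [
        d for d in docs
        if d is not target
        and (not same_language_only or d.language == target.language)
    ]
    return sorted(candidates, key=lambda d: similarity(target, d), reverse=True)
```

With `same_language_only=False` the function falls back to comparing against every document, which is roughly the mixed-database situation described in this thread.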

Johannes

Senhal · August 28, 2009, 8:06am

The main language I have files in is English, but I’ve got quite a number in French, German, Norwegian, and a smattering of other languages (Old Occitan, anyone? Thankfully the OCR has a setting for modern Provençal…). I wouldn’t really want the AI to look solely within the subset of documents in the same language as the one under consideration (that is, effectively, what happens most of the time today), as, for me, language and content are for the most part quite separate. Occasionally (e.g., in documents discussing a person with a distinctive name, or where a large number of key terms are the same in at least some languages) the AI can perform decently.

Until machine translation has reached a rather higher degree of sophistication I believe the answer is simple: tags. See, e.g., this post viewtopic.php?f=4&t=8124#p37902 and some others in that thread, which explain why a number of those of us working in several languages are clamouring for tags to be implemented. I can also provide some examples of cases where tags would fill a space sorely lacking even when using ‘English’ OR ‘German’ searches…

gnoli · August 28, 2009, 4:51pm

Yes, it is true, but tagging is toooooooo slow to implement …

Bill_DeVille · August 28, 2009, 5:59pm

See Eric’s blog (and read the comments) about tagging at devon-technologies.com/scrip … ess/?p=904

Note that tagging is not ‘flat’. I can tag an article about a dog with the tag, Dogs. And I can tag all articles about dogs under the tag, Canines. And I can tag all articles about canines as Mammals – and so on. If I have a database about living organisms I can create a complete taxonomic system of tags for them. But in that same database I can also have other tags for other purposes, for example, notes about the kinds of wildlife that I observe at my log cabin in the woods of Brown County, Indiana.
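A rough Python sketch of the "not flat" idea, for readers who think in code. The `PARENT` table and the document names are invented for illustration, and this is not a description of how DEVONthink stores tags; it only shows how a document tagged Dogs can also be found under Canines and Mammals while an unrelated tag lives alongside.

```python
# Hypothetical tag hierarchy: each tag maps to its parent (None = top level).
PARENT = {
    "Dogs": "Canines",
    "Canines": "Mammals",
    "Mammals": None,
    "Brown County wildlife": None,  # a separate tag for a separate purpose
}


def expand(tag):
    """Return the tag followed by all of its ancestors."""
    chain = []
    while tag is not None:
        chain.append(tag)
        tag = PARENT.get(tag)
    return chain


def find(docs, tag):
    """Names of documents carrying `tag` directly or via a more specific tag."""
    return [
        name for name, tags in docs.items()
        if any(tag in expand(t) for t in tags)
    ]


# Invented example documents and their tags.
DOCS = {
    "Beagle article": ["Dogs"],
    "Wolf article": ["Canines"],
    "Cabin notes": ["Brown County wildlife"],
    "Fox at the cabin": ["Canines", "Brown County wildlife"],
}
```

Here `find(DOCS, "Mammals")` returns the beagle, wolf, and fox entries, while `find(DOCS, "Dogs")` returns only the beagle article.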

In a financial database I can tag an expense record to the month/year in which the expenditure was incurred, to a category of expenditures (business travel, for example), to a specific project for which the expense was incurred, and so on to whatever extent such tags are useful. Then I can do a search for, as an example, all the documentation of expenditures made for a specific project (or for all projects), for a given month/year, or for the full year. There they are! Note that DEVONthink is a document management database, not a ‘numbers’ database. But for tax purposes, documentation is important to justify the numbers in a spreadsheet – which I can also tag in that financial database.
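The kind of search described above amounts to asking for records that carry several tags at once. A small Python sketch, assuming invented record names and tag labels (none of them come from a real database, and this is not DEVONthink's own search syntax):

```python
# Hypothetical expense records, each with a set of tags (all names invented).
RECORDS = {
    "hotel receipt": {"2009-08", "business travel", "Project A"},
    "train ticket": {"2009-08", "business travel", "Project B"},
    "printer paper": {"2009-07", "office supplies", "Project A"},
}


def search(records, *required):
    """Names of records that carry every one of the required tags."""
    wanted = set(required)
    return [name for name, tags in records.items() if wanted <= tags]
```

For example, `search(RECORDS, "Project A", "2009-08")` narrows the results to the hotel receipt, and `search(RECORDS, "Project A")` returns everything tagged to that project regardless of month.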

I’ll never spend a lot of time tagging most content as I add it to my database. If I tried to figure out all the possibly useful tags for each new document I would never get anything else done. But the tagging system can be a very powerful and flexible way to filter, aggregate and find information as I need it.

Yes, it can help aggregate documents in different languages in useful ways…

Declan · August 28, 2009, 6:43pm

But what about entering keywords in the comments box? Would that not have the same effect as tagging (except that it does not offer autocomplete of existing tags)?

I’m always cautious about overusing the comment box, with the result that I barely use it at all. I can’t quite figure out what kind of information to put in. Anybody got any thoughts on this matter?

Declan

sjk · August 28, 2009, 7:31pm

That topic has been discussed in other threads, e.g. recently: starting here.

twicks · August 29, 2009, 10:57pm

I’m sure, like scruffy moonshiners or hemp growers; the mind boggles at the thought of Snuffy Smith stumbling about your wooded acreage.