Integrating translator for improving AI-capabilities

tormod · May 7, 2009, 8:21am

Hi everyone,

I came up with an idea yesterday after reading this interesting account of DT use:
parezcoydigo.wordpress.com/2008/ … earch-iii/

One problem that is brought up there, and one that I am sure I will run into as well, is working with DT databases that contain multiple languages. In the linked example it is Spanish and English, in my case it will be Swedish and English. The AI search and “See also” features will be of little use when you have multiple languages.

My simple (conceptually) idea is to integrate a translation engine (for example Google Translate) into DT. By translating all text in the background you can then choose (or it can be automatically guessed) a language to search in, and the results will be based on the translated version of your clippings.

I do not know how to implement this practically or in any detail, but I just wanted to get this thought up for discussion.
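
The idea above can be sketched roughly as follows. This is only an illustration under loud assumptions: the tiny `TOY_DICTIONARY` stands in for a real translation engine, and the names `translate_to_pivot`, `build_index` and `search` are hypothetical, not anything DT provides. Every clipping is translated into one pivot language, and the search runs over the translated text.

```python
from collections import defaultdict

# Hypothetical stand-in for a real translation engine (assumption, not a real API).
TOY_DICTIONARY = {
    "sv": {"hund": "dog", "varg": "wolf", "och": "and"},
}

def translate_to_pivot(text, source_lang, pivot="en"):
    """Word-by-word lookup; words missing from the table pass through unchanged."""
    if source_lang == pivot:
        return text.lower()
    table = TOY_DICTIONARY.get(source_lang, {})
    return " ".join(table.get(word, word) for word in text.lower().split())

def build_index(documents):
    """Map each pivot-language word to the ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, (lang, text) in documents.items():
        for word in translate_to_pivot(text, lang).split():
            index[word].add(doc_id)
    return index

def search(index, term):
    """Return the sorted ids of documents whose translated text contains the term."""
    return sorted(index.get(term.lower(), set()))

documents = {
    "note-1": ("en", "The dog barked"),
    "note-2": ("sv", "hund och varg"),
}
index = build_index(documents)
print(search(index, "dog"))  # ['note-1', 'note-2']
```

A real engine would replace the dictionary lookup, and the quality of the search would then depend entirely on the quality of its translations.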

Declan · May 7, 2009, 12:30pm

Something like this would be great!

Declan

hendrix · May 7, 2009, 1:45pm

I hadn’t posted anything on this issue before because I had no idea about how this could be overcome. I use DT to manage a collection of documents and references on social services in several countries, so I have documents in Spanish, English, French, Italian and Catalan… and a few others. This has limited the usefulness of searches. So anything that improves things in this field would be very welcome.

Bill_DeVille · May 7, 2009, 5:27pm

Maria, who works with multiple European and Asiatic languages, suggested this years ago.

If all languages had one-to-one correspondences of nouns, verbs, adverbs, prepositions and adjectives (just differently spelled or drawn and differently pronounced) and used the same syntax – the structure that gives a string of words meaning – it would be a relatively trivial exercise to realize a long-standing dream, accurate machine translation of text from one language to another, with preservation of the information content and ‘meaning’ of the translation.

Unfortunately, that’s not the way languages have evolved. There are very often not simple one-to-one correspondences between words in different languages, and there are significant structural differences among them. Even in cases where there appear to be simple correspondences, the context of terms apparent to the author and the reader may be difficult to manage for machine translation. For example, the word “Florence” could refer to Florence Nightingale or to an Italian city (with a different Italian spelling in the latter case).

Google Translator represents years of experience in machine translation from one language to another, including a lot of empirical ‘tuning’ based on vocabularies and contexts. The actual results of translations, in terms of preservation of intended meaning, range from pretty good to awful. The files corresponding to each language are large, and processing takes time, whether by direct access to those files or via Internet access to them.

So DEVONthink doesn’t provide an automatic means of searching across a database comprised of documents in different languages, to pull related content regardless of the language in which it was written.

But the syntax of searches in DEVONthink 2 allows mixing of terms, exact strings and wildcards in highly structured queries of indefinite length.

That means that a user who has a database containing multiple languages, and who is familiar with the correspondences of terms among those languages, can create a query to search across them. This requires user familiarity with the terms used in the various languages, and perhaps long nested disjunctions in the query. If you create such queries, you will probably want to save them as text documents in order to avoid reconstruction each time such a search is needed.
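
As a rough sketch of how such a query might be assembled rather than typed by hand, the snippet below builds the nested disjunctions from a table of corresponding terms. The `OR`/`AND` operator spelling, the parentheses and the double quotes around exact strings are assumptions about the query syntax, and `build_query` is a hypothetical helper, so check the operators against the DEVONthink 2 documentation before relying on it.

```python
def quote(term):
    """Wrap multi-word terms in double quotes so they are treated as exact strings."""
    return f'"{term}"' if " " in term else term

def build_query(concepts):
    """Each inner list holds one concept's terms in every language of the database.

    Terms for the same concept are joined with OR; the concepts are joined with AND.
    """
    groups = []
    for terms in concepts:
        groups.append("(" + " OR ".join(quote(t) for t in terms) + ")")
    return " AND ".join(groups)

concepts = [
    ["social services", "servicios sociales", "services sociaux"],
    ["elderly", "ancianos", "anziani"],
]
query = build_query(concepts)
print(query)
# ("social services" OR "servicios sociales" OR "services sociaux") AND (elderly OR ancianos OR anziani)
```

The resulting string can be pasted into the search field and saved as a text document for reuse.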

Which brings me to an interesting point. I once talked about how it’s possible to ‘teach’ the See Also artificial intelligence feature via ‘bridges’ between terms that may not exist already in the documents contained in a database. The example I used was the canine family, which includes wolves, dogs, foxes and coyotes. If I were to create a new document that contains a multiply-repeated phrase such as “canine/canines: wolf/wolves, dog/dogs, fox/foxes, coyote/coyotes” and then view a document that’s about foxes but which doesn’t contain the term ‘canine’, See Also may then see relationships to other documents about canines, wolves, etc.

In other words, the act of saving those laboriously constructed queries that search across multiple languages could build bridges between documents in different languages.
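
A bridge document of this kind could be generated from the same table of corresponding terms. The sketch below is only an illustration: the helper names are hypothetical, the repeat count is a guess (the canine example above says only "multiply-repeated"), and whether See Also treats "/" as a word separator is an assumption.

```python
def bridge_line(concept_terms):
    """Join each concept's cross-language variants with '/', and the concepts with ', '."""
    return ", ".join("/".join(variants) for variants in concept_terms)

def bridge_document(concept_terms, repeats=10):
    """Repeat the bridge line so the associated terms carry some weight in the document."""
    line = bridge_line(concept_terms)
    return "\n".join([line] * repeats)

concept_terms = [
    ["dog", "perro", "chien"],
    ["wolf", "lobo", "loup"],
    ["fox", "zorro", "renard"],
]
text = bridge_document(concept_terms, repeats=3)
print(text)
```

Saving the printed text as a plain-text document in the database gives See Also a place where the terms from each language occur together.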

tormod · May 7, 2009, 8:09pm

Thank you for your replies.

Bill, your suggestion on teaching the program with the help of “translating pages” or “association pages” is a really good idea. I can definitely see myself implementing that.

But while the problems with machine translation that you describe are definitely real, to my ears the argument boils down to: “it wouldn’t work as well as we would like”. The AI is wonderful, but by no means perfect. The “see also”-function doesn’t really know what I also want to see (or should see), it doesn’t even make educated guesses. It just parses the content of the clippings, by the use of an advanced algorithm.

Adding another function to this algorithm wouldn’t make it better at everything, but better at some things. Probably worse at some things, but you see my point.

I can see the problems of implementing it, and it might not be worth it, especially with your suggestion as an alternative. But… maybe.

parezcoydigo · May 9, 2009, 7:01pm

With DT1.5, I did feel like this was a bit of an issue. I would simply do searches in both languages if I needed to find something. With DT2, I’m thinking about segregating the databases into English/Spanish, in part because you can have more than one database open.

But, in reading Bill’s description above about teaching the AI using saved queries/query docs, I’m rethinking that.

I haven’t actually jumped to DT2 yet, though, as I’m finishing this book and want to wait for that to make the jump.