storing information in different languages

thoresson · October 31, 2006, 4:23pm

Hi,

I’m evaluating DT Pro as a reference database for web archives I want to save for future work. Some will be in Swedish, but most in English. Should I have a separate database for each, or can I somehow trick DT Pro to find references between languages? If I add a line or two of “freeform fuzzy keywords” to each entry?

/Anders

Timotheus · October 31, 2006, 8:52pm

See devon-technologies.com/phpBB … ht=cavallo

Maria · November 1, 2006, 12:33am

It is an interesting thread but already quite long, so I will add some new ideas and information here.

Recently I started building a database on a very limited field of study with information mainly in three languages. It is easy to add similar documents written entirely in these languages into the respective groups manually. Then I started a table for each group where I entered important words in 5 columns for different languages and writing systems, where I do not just enter one word per column but whole groups of similar meaning.

I do this for myself to master the subject in three languages equally, but it would be great to have DevonThink taking these tables as main hint for its AI functions and for relating and classifying topics: Entries in one row mean almost the same. DT cannot do that.

This is a severe drawback in my opinion, and regarding the large number of scholars among the DT Pro users it is a real pity that there seems to be no effort in that direction from the developers. Or is there?

Best,
Maria

eboehnisch · November 1, 2006, 7:08am

While DEVONthink Pro does not come with a built-in thesaurus to find similar words in multiple languages, you can, of course, add keywords in two or more languages to the comment field. This does not have much influence on the AI-based Classification and See Also functions, though.

Maria · November 1, 2006, 9:53am

Hi,

are you planning to do something for multilingual researchers?

Maria

eboehnisch · November 1, 2006, 10:51am

We are planning many things, but, honestly, the number of multi-language researchers is much smaller than the number of those asking for, say, multi-database and multi-user capabilities. So, your ideas are on our list, but DEVONthink 2.0 and a teamwork-capable variant are definitely higher on the list. I hope for your understanding, Maria.

Maria · November 1, 2006, 11:15am

Sure, I understand that. Thanks for a clear answer. And I am one of those who are looking forward to multi-database capabilities!

Maria

Birgitt · November 1, 2006, 2:13pm

I guess that most people not living or working in English speaking countries have to collect information in more than one language: their own plus English. Therefore the number of people using Devonthink with more than one language cannot be that small. And what about the developers? They seem to be living in different countries and working with lots of languages. Multi-database support is something I’m looking forward to, but multi-language support would also be wonderful!

Birgitt

Timotheus · November 1, 2006, 8:58pm

This issue, this really fundamental issue, has already been discussed various times on this forum, but it seems to me that we’re running round in circles, that we’re not making any progress at all. My impression is that Devon’s chiefs don’t feel any sense of urgency in this respect.

And I must confess that from a company which is partly based in the United States, and partly on the Continent, one could expect perhaps more eagerness to improve its flagship on this point.

Let’s put it bluntly, without violating in any way the known truth: in the humanities at least, an academic who reads and collects books and articles in his mother tongue only is no serious academic, even if that mother tongue is English. This sole fact suffices to prove that a data base application which doesn’t take into account the multilingual orientation of large parts of the academic world has some serious shortcomings.

But perhaps our friends at Devon’s don’t have any clear ideas about what could be done in order to make DT the privileged data base application for a multilingual community. Perhaps we should therefore come up with some workable suggestions, which could be implemented rather easily.

eboehnisch · November 1, 2006, 10:09pm

Oh, we have a very clear picture of what this would mean. As you pointed out, we’re a multi-language, multi-continent company. But also a very small one.

And, if you read my words carefully, you will see that I have not said that it’s not a useful extension to DEVONthink, but that we don’t have the resources to do everything at once. There are features requested by many of our users that we will satisfy first, e.g. multi-database and multi-user capabilities.

Multi-language features as you described are on our to-do list, they’re just not on rank #1.

Bill_DeVille · November 1, 2006, 10:50pm

I was only half joking when I suggested recently to Eric that we should add a plugin called “Rosetta” to DEVONthink. Oops! That name is taken already.

“Rosetta” would be a huge plugin, I’m afraid, with tables for terms in a number of languages and the ability to see the correspondences between words and/or phrases in multiple languages. So if one document is in German, another in Japanese and still another in Swedish the plugin would enable DT to compare the term usages in those three documents for purposes of searching, classification, grouping or “See Also” detection of contextual similarities.

In its most sophisticated form “Rosetta” would represent a comprehensive approach to machine translation of languages, and that’s difficult enough that DEVONtechnologies doesn’t have the development resources to tackle that approach. That’s still “bleeding edge” computer science.

Maria and others have suggested a much simpler kind of “Rosetta” that would take a list of keywords or tags – perhaps entirely user-generated – and maintain tables that could be used for searching, classification, grouping and “See Also” functions to pull together similar documents written in multiple languages. Eric has noted that one can enter keywords in the Comments field of the Info panel for a document, but this remains of limited utility for some of the artificial intelligence functions.

Even this conceptually simpler form of “Rosetta” would require rather significant development resources. In the most easily implementable form, I suspect that the keyword or tag tables would have to be built in a single language, else one would need a table to set up correspondences between the same keywords or tags entered in different languages.
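To make the simpler, table-based idea concrete, here is a minimal sketch in Python. It says nothing about how DEVONthink's AI actually works; the table `ROWS` and the functions `build_index`, `concepts` and `overlap` are hypothetical names invented for illustration. Each row of the user-made table is treated as one concept, so two documents can be compared by the concepts they mention rather than by their literal words.

```python
import re

# Hypothetical user-made correspondence table: each row lists terms that
# should count as the same concept, regardless of language.
ROWS = [
    {"horse", "horses", "cheval", "chevaux", "pferd", "pferde"},
    {"dog", "dogs", "chien", "chiens", "hund", "hunde"},
]


def build_index(rows):
    """Map every term to the number of the row (concept) it belongs to."""
    return {term: row_id for row_id, row in enumerate(rows) for term in row}


def concepts(text, index):
    """Return the set of concept ids whose terms occur in the text."""
    words = re.findall(r"\w+", text.lower())
    return {index[w] for w in words if w in index}


def overlap(text_a, text_b, index):
    """Jaccard overlap of the concepts found in two texts (0.0 to 1.0)."""
    a, b = concepts(text_a, index), concepts(text_b, index)
    return len(a & b) / len(a | b) if a | b else 0.0
```

With this table, `overlap("Das Pferd und der Hund", "Le cheval et le chien", build_index(ROWS))` gives 1.0, although the two sentences share no literal word. The sketch also shows why a correspondence table is needed as soon as keywords are entered in more than one language: without the row ids there is nothing to compare.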

DEVONtechnologies is a small company and already has a very ambitious set of priorities for application development and enhancement.

So, unless someone can suggest a “quick and easy” development approach to implementing a usable form of “Rosetta” in DT applications, it’s not a project likely to be undertaken in the next year or so.

ndouglas · November 2, 2006, 9:52pm

I suggest that each DTP user buy an English -> Language_X/Language_X -> English dictionary and make a sheet in DTP with two columns – one with the German/Swedish/Swahili/Pig-Latin word, and one with the English word.

I mean, that’s basically what is being requested here, right?

Maria · November 2, 2006, 10:24pm

Hi,

would be great if that worked. Of course, there is no need to buy a dictionary.

I understand Eric’s and Bill’s remark about a small company that has to decide how to use limited resources. I cannot judge how difficult it is to implement a feature in the AI that treats the entries in one table row as identical, but I have to go by what the developers say: it seems to be difficult.

If a person works in a certain field of research it is a matter of a day to set up such a list. But as you see from the posts from the developers’ side, there is no use in setting up such tables.

I wish DT good luck with their developments so that they get to this problem as early as possible. DT is already so important for my work, and with the various search options I could get along without AI as well.

Best, Maria

Timotheus · November 3, 2006, 5:45am

I’m afraid that Eric’s words are a polite way of saying: “Don’t expect anything of any substance here in the coming years”. If this were the case, I would regret it very much.

We need to make DT understand that someone who is looking for the occurrences of “horse” will probably also be interested in the occurrences of “cheval” and “Pferd”; and that someone who is interested in finding “horse”, “cheval” and “Pferd” will almost certainly also be interested in finding “horses”, “chevaux” and “Pferde”.

Is this really so difficult and so time-consuming to implement? Maria and others have already made some interesting and in my opinion very ‘workable’ suggestions. I really hope we’ll see something of this kind in the near future.
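The “horse” example amounts to query expansion: look the search word up in a table and search for every term in the same row. A minimal sketch, assuming a hand-made table that lists singular and plural forms explicitly; `ROWS`, `expand` and `search` are made-up names, and this is not a description of DEVONthink's search.

```python
import re

# Hypothetical concordance: one row per concept, singular and plural forms
# written out by hand for each language.
ROWS = [
    ["horse", "horses", "cheval", "chevaux", "Pferd", "Pferde"],
]


def expand(query, rows):
    """Return every term that shares a row with the query (case-insensitive)."""
    q = query.casefold()
    for row in rows:
        if any(term.casefold() == q for term in row):
            return list(row)
    return [query]  # unknown term: search for it unchanged


def search(query, documents, rows):
    """Return the names of documents containing any expanded term as a whole word."""
    terms = expand(query, rows)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    return [name for name, text in documents.items() if pattern.search(text)]
```

Searching for “horse” in a small collection that contains “Un cheval blanc” and “Zwei Pferde im Stall” then returns both documents. Listing the plurals by hand avoids any need for language-specific grammar rules, at the cost of a longer table.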

Bill_DeVille · November 3, 2006, 7:20am

Maria and Timotheus, you note that a researcher can rather easily put together lists of corresponding terms in multiple languages, especially for projects that you are working on.

Why don’t you try doing that? I did a little experiment with some sheets and placed them into groups where those terms would be likely to occur. I simply created a sheet with corresponding terms listed in several languages, as Timotheus suggested. I was able to get See Also to pick up some similarities based on the fact that similar terms were in the group, though in different languages. The ranking of such suggestions in my tests was low, but they did show up.

And a little sheet for that classic case: canine, dog, wolf, fox, coyote, etc. would ‘assist’ See Also to find such contextual relationships between documents.

Actually, rich text would be better, as one could Option-click on a selected term to see what documents also contain that term (doesn’t work in sheets, which are plain text).

Instead, simply create a rich text note and populate it with a term repeated in different languages that are contained in one’s database. Timotheus’ example of “horse” in different languages would be a neat test in his database. Find it by searching for the term in any language. Select and Option-click the equivalent term in Hungarian or Greek and those documents that contain the term in that language are displayed in a sliding drawer.

Sure, rich text columnar layouts would be better. So use a competent word processor or spreadsheet that can lay out multiple columns and enter the corresponding terms in multiple languages in that. A PDF from Papyrus 12, Excel or Mellel could be used, and Option-click works.

So users might be able to experiment with such approaches, without waiting on DEVONtech to do something. And perhaps your experiments could give DEVONtechnologies a grasp on how to make it work better.

On the other hand, think what a huge project it would be for DT Pro to do that sort of thing up front. As DEVONtech can’t know what terms or languages might be used in a database, the kitchen sink approach would be a big project. There are hundreds of languages and major dialects. And we have users working in some rather exotic, uncommon languages and dialects.

Maria · November 3, 2006, 8:32am

Bill,

I also wrote that I do that already. And I wrote that I asked Christian whether this has any effect on the AI, and he says it does not.

Now I will go on reading the rest of your post!

Maria

Maria · November 3, 2006, 8:40am

Bill,

after reading the rest of your post: You encourage us to do useless work. Please don’t, I am busy enough.

The See Also function works across languages in some cases, naturally; I even get some tiny results in cases where the structure is already perfect. But, as Christian, the one who should know DT best, stated, sheets have no effect on the AI function.

It would be good to assign some priority to concordance sheets or RTFs in which DevonThink’s AI could look for structure and treat certain parts as one.

btw, I did not intend to insist on this, as I wrote in my earlier reply to Eric’s mail, but these remarks made me – angry or something. So I took the time and just wrote an answer.

Best,
Maria

eboehnisch · November 3, 2006, 9:40am

Maria,

As you may know, we are highly interested in solutions that work as automatic as possible. So, if we add something like this, it will definitely be something that uses as-complete-as-possible thesauri instead of relying solely on manually created lists. I believe that lists that you put together manually will never be completely satisfactory. However, does anyone here know relatively complete and maybe even free online thesauri?

ndouglas · November 3, 2006, 1:49pm

I guess it didn’t come across that my other post was sarcastic – I was saying that everyone should get one of the dozens of bilingual dictionaries out there and type the whole thing in while DEVONetc constructs the functionality. I wasn’t referring to people actually using sheets and expecting that to work.

Anyway: this is, unless I’m missing something, a colossal task with a large cost in time, labor, reliability, bandwidth, patience, etc.

(of debatable use) dict.org/links.html – DICT is a client run on a Linux (or OS X, presumably) machine that contacts an external server, downloads definitions, and displays them. It might be nice. The standard’s obviously open, so integrating it into DEVONagent (or as a GPL plugin for DEVONagent) might be possible.

freedict.org/ – Somewhat limited, but GPL’ed bilingual dictionaries in XML format. I suppose distributing the databases with DEVONthink might be kind of hefty.

download.wikimedia.org/enwiktionary/20061016/ – Downloading the Wiktionary database is another alternative. I think they use fairly standard table layouts and such. It’s in XML. Might be able to parse it with some reg exp or something in order to set which words are synonyms of each other and in which language, and so on. I don’t really know. It’s licensed under the GFDL, which means (unless I don’t understand) that you’d have to make the source for the dictionaries freely available.

Which might not be that big a deal – just a bigass XML file somewhere on your site. What would be a pain in the neck is that the db itself, for English-language, is 34.7 megabytes, and so the size would rack up really quickly if you made DEVONthreepio, fluent in however many forms of communication, and told everyone to download it.

And I’m not sure if you want the bandwidth costs of a DEVONagent plugin or something attacking your site often (whenever a page is saved?) to look up the translations of every word in the article.

In short, it’d be a colossal download and a metric butt-ton of bandwidth for any multi-lingual dictionary. Not to mention the hassles of updating the database, getting people to download it, and so forth. And logistical problems like words spelled the same with multiple meanings, commonly-misspelled words, words that require accents on certain letters that may or may not be typed by whoever input the text, words that are in multiple languages, verb recognition and conjugation, and so forth. And it seems like it would be rather slow, since it would either dump a huge synonym bank into the database for each document you write, every time you save it, or have to search through an even larger database bank whenever it attempts to classify or “see also” an article.

That being said… I have to say that I like the idea of being able to use each record of a particular sheet as a term followed by synonyms or translations. Users might be able to upload and share their translation sheets. It might also be nice if you could make Aliases for the automatic wikilinks available in the same or a similar way, or disable whether or not a certain article is automatically linked to – all from a central sheet (or group of sheets). It could even be used to add a sort of relational database functionality.

Of course, I know nothing about programming, so this might be completely ridiculous, but it seems to me that making sheets more functional could only be a good thing.
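One of the logistical problems listed above, accents that may or may not have been typed, can at least be softened by folding accents before terms are compared. A minimal sketch using Python's standard `unicodedata` module; the function name `fold` is made up for this example, and nothing here is taken from DEVONthink itself.

```python
import unicodedata


def fold(text):
    """Case-fold text and strip combining accents, e.g. 'Élève' -> 'eleve'."""
    # NFKD splits an accented letter into its base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", text.casefold())
    # Dropping the combining marks leaves only the base letters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

Comparing `fold(a) == fold(b)` instead of `a == b` makes “élève” and “eleve” match. The folding is lossy, though: in Swedish, for instance, å, ä and ö are letters in their own right, so folding can merge words that are actually different.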

</stops before he mentions that he wants to embed queries in sheets>

Maria · November 3, 2006, 1:50pm

Eric,

thanks for this interesting mail.

(1) “automatic as possible”: This sounds good. As someone who could use Devonthink only for specific searches and manually filing replicants in groups, this is a new world. But it is what I thought of, when I tested DT for the first time, and it would be great to see this coming true!

(2) “complete-as-possible”: Someone understanding a certain field never needs completeness but knows what is relevant. So I am sure I can build relevant lists for my field of study. Still, this is not how DT’s AI seems to work. A “Masse statt Klasse” matter where quantity will finally end up in quality?

(3) “free online thesauri”: I know of the Monash university dictionary projects, Leo etc. But I wonder whether these projects can be used to support commercial software? I will have a look around in the next days. I can say though that in my field of study these thesauri are close to useless: no such vocabulary or wrong translation…

Anyway, great to see that something is going on. I am sure that it does not only concern a minority of researchers in the humanities but medicine, technological development etc. outside the English-speaking world as well.

Best,
Maria