Multiple languages and the AI (Classify, See Also)

ostwaldj · August 1, 2013, 12:57pm

I have thousands of RTF and PDF documents in various languages. The vast majority of the documents are in a single language (barring proper nouns): a given document is in English, or in French, or in Dutch…

I have separate databases for each language, for when I want to find French documents on the same topic, or Dutch documents on the same topic. But I want to keep as much as possible in my main (English-language) database, including image PDFs of foreign-language documents that can’t be OCRed. (At the moment I have a separate database of mixed-language documents, such as volumes of correspondence where half of the letters are in French and the other half are in English; I’m not really sure what to do about them yet.)

I know that the Classify feature is based on the groups, so I have groups for my thematic categories, and the documents in those groups are only in English: my notes, quotes from English-language documents, or English translations of documents. The groups look like Economics>Bank of England>Opposition to.

Assuming that Classify is only affected by the groups, I’ve used the tags as a parallel/intersecting hierarchy to record where each document comes from (whether it is an original or a note based on an original). So I have one set of hierarchical tags for primary sources (e.g. Primary>Archives>British Library>Add MSS 61234>Add MSS 61234 p13.rtf), and another set of hierarchical tags for secondary sources (e.g. Secondary>book>biography>McKay Prince Eugene.rtf).

My questions:

Am I correct in assuming that keeping French and Dutch documents only in the tags (and out of the groups) will prevent Classify from becoming confused? That is, can you safely keep documents in multiple languages in a single database as long as they aren’t in groups? They can still be tagged and otherwise searched, just not Classified.
‘See Also’ is presumably different, since it ignores groupings and just looks at documents, which presumably means it might be influenced by documents that are only located in the tags. But what kind of effect does this have on the See Also results? I understand that See Also can’t find French documents that discuss the same topic as an English document (DT not knowing that chien is the same thing as dog…) - the AI can’t translate or work across languages.
But will having multiple languages in the same database affect See Also’s ability to find all relevant documents within a single language? Will it, for example, make it harder to find similar English-language documents starting from a given English document? I’m not sure whether this is simply a matter of the number of cognates shared between the languages, or what other factors play a role.
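
To make the question concrete, here is a rough sketch of what I imagine is going on. It assumes See Also behaves something like a word-overlap (bag-of-words cosine) similarity, which is purely my guess; I don't know DT's actual algorithm, and the `tokenize` and `cosine` helpers and the sample sentences are made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep runs of letters only (accented letters included).
    return re.findall(r"[^\W\d_]+", text.lower())


def cosine(a, b):
    # Cosine similarity between the raw word-count vectors of two texts.
    # This is an assumed stand-in for See Also, not DT's real method.
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


english_a = "the dog chased the cat across the garden"
english_b = "a dog and a cat played in the garden"
french = "le chien a poursuivi le chat dans le jardin"

same_language = cosine(english_a, english_b)   # shares "the", "dog", "cat", "garden"
cross_language = cosine(english_a, french)     # no words in common
accidental = cosine(english_b, french)         # only the coincidental token "a" matches

print(same_language, cross_language, accidental)
```

Under that assumption, the French sentence scores nothing against the first English one even though it says the same thing, but it picks up a small score against the second English one purely because "a" happens to be a word in both languages. That kind of accidental overlap (plus genuine cognates and proper nouns) is what I'm worried might muddy the English-only results.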

Thoughts and suggestions?
Thanks.