OCR languages: what’s the point of the combined-languages options, e.g. Japanese and English?

Hello,

In DEVONthink 3 Preferences > OCR > Primary Language, there are some options that combine two languages, e.g.:

  • Chinese Simplified and English
  • Chinese Traditional and English
  • Japanese and English
  • Korean and English

I also found this in DEVONthink 3 Help’s OCR section:

Note: The primary language and the secondary languages are treated equally.

My question is: let’s say I’m using DEVONthink’s OCR feature to convert a scanned PDF written in a mixture of Japanese and English. In this case, what are the differences between the following two settings?

  1. Primary Language: Japanese and English
  2. Primary Language: Japanese, Secondary Language: English (or Primary: English, Secondary: Japanese, since the two options are treated equally)

Would those two settings yield different OCR results? If not, what’s the point of those combined options?

Could someone provide some insight? I’m confused. Thanks in advance :smiley:

Some possibilities:

  • It’s an obsolete holdover from the days when FineReader supported only one language per document, and this was a hack to work around common Japanese practice.
  • It’s faster.
  • It’s less memory intensive.
  • It’s more accurate.

Only benchmarking will tell.

Hi jerwin, thanks for your input!

If I do benchmark the performances, I’ll keep your theories in mind and post the test results in this thread. :smiley:

Hmm, this is weird.

I decided to try contacting ABBYY’s customer support, so I first searched for ABBYY’s user manual. The manual for the newest version, ABBYY FineReader PDF 15, can be found here (PDF download, 5.1 MB). In the “Supported OCR and document comparison languages” section on page 288, I didn’t find any options that combine multiple languages like the ones in DEVONthink’s OCR preferences.

I am not familiar at all with ABBYY FineReader PDF’s product itself, so correct me if I’m wrong here, but could it be that the options with combined languages originate from DEVONthink rather than ABBYY?

Given the state of the documentation, it could be a simple omission (read: the documentation is a piece of crap). Also, DT is using their library product, which might have options that differ from the program’s.

Hi chrillek, thanks for the input!

That gave me a good chuckle haha. I wasn’t prepared for such low quality when downloading the documentation, I just dived straight into the relevant section and took it for granted that it could be trusted (i.e. detailed and up-to-date).

I did consider the fact, so I did make the (not very solid) assumption that the options would be similar.

Anyways, I hope this thread could receive some official explanation from DEVONtechnologies :smiley:

Here I have attached a PDF and converted plain text with…

As seen in the plain text, Japanese as the primary with English as secondary yields better results.
Japanese only yields no discernible English; English only, no discernible Japanese.

@aedwards would be more conversant on this but Japanese and English is the best of the bunch.

Hello Jim, thank you for the concise and helpful tests! It didn’t occur to me that it would be much easier to compare benchmark results if you choose OCR to txt instead of searchable PDF. The conclusion is very useful, and I’ll keep it in mind when using OCR on applicable PDFs.

A small note, I think you meant “English Only” instead of “English primary and Japanese secondary” in your second test? The OCR result seems to imply so, since it doesn’t contain any discernible Japanese.

You’re welcome :slight_smile:
Actually the second test should be English with Japanese as secondary. However I’m in bed so I can’t double-check just yet. :sleeping:

Thank you for the hard work! :smile:

ABBYY provided combined languages in older versions of their OCR; these were removed in ABBYY v12. DT3 still supports them by converting each one to a primary and a secondary language, i.e.:

Selected Language: Japanese and English
Primary: Japanese
Secondary: English

The only time this is not the case is if you are running macOS 10.11, where DT will use ABBYY v11.
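
As a rough illustration of the conversion described above, here is a minimal Python sketch. The function name `split_combined` and the string handling are hypothetical and only mirror the option names shown in the preferences; this is not DEVONthink’s or ABBYY’s actual code.

```python
from typing import Optional, Tuple

# Hypothetical: every combined option in the preferences ends with this suffix.
COMBINED_SUFFIX = " and English"


def split_combined(selected: str) -> Tuple[str, Optional[str]]:
    """Return (primary, secondary) for a selected OCR language option."""
    if selected.endswith(COMBINED_SUFFIX):
        # "Japanese and English" becomes primary "Japanese", secondary "English".
        return selected[: -len(COMBINED_SUFFIX)], "English"
    # A plain option has no implied secondary language.
    return selected, None


for choice in ("Japanese and English", "Chinese Simplified and English", "Korean"):
    print(choice, "->", split_combined(choice))
```

Under this assumption, the combined options are only a shorthand for a primary/secondary pair, so no separate combined dictionary is involved.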

Hello Alan, thank you for your insight, that makes things much clearer to me now.

I still have one question: since “Primary: Japanese and English” and “Primary: Japanese, Secondary: English” are functionally identical, why did these two configurations yield different, although admittedly similar, results in Jim’s benchmarks? Here are the specific two benchmarks I’m referring to:

Thank you for your patience and time.

They should be similar. I will check these results.

In DEVONthink:
Japanese primary, English secondary reads an English O as 〇.
English primary, Japanese secondary doesn’t pick up on the Japanese text at all.

Japanese and English tends to recognize O as O and 〇 as 〇, as appropriate.

But there’s no difference between the methods in FineReader PDF 15. Moreover, it looks to be even more accurate, at least in terms of not producing garbage characters… It even correctly picks up the ① and ② symbols. I don’t read Japanese, though, so I may have missed something.

Finereader 15.zip (481.5 KB)

Hello jerwin, thanks for the reply.

Indeed, hence my previous confusion about Jim’s benchmark result:

I don’t have access to FineReader PDF, but I’m curious about the discrepancies you described as well.

I found why these are different: the “Japanese and English” option was mistakenly setting English as the primary language rather than Japanese. I will fix this in the next update.

Japanese primary English Secondary reads an English O as 〇
Japanese and English tends to recognize O as O and 〇 as 〇, as appropriate

The reason these produce different results is that the primary language has a greater weighting than the secondary language. If Japanese is the primary language, the character O is a close enough percentage match to 〇 that the engine does not look at the secondary language. With English as the primary language, O will match O, but 〇 will not be a close enough match, so the engine will look at the secondary language for a match.
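
To make the weighting idea concrete, here is a toy Python model of that fallback. Everything in it is an assumption for illustration: the similarity scores, the `THRESHOLD` cutoff, and the `recognize` function are made up and are not how ABBYY’s engine is actually implemented.

```python
from typing import Dict, Optional, Tuple

# Made-up scores: (shape on the page, language) -> (best glyph, similarity).
CANDIDATES: Dict[Tuple[str, str], Tuple[str, float]] = {
    ("latin_O", "English"): ("O", 0.95),
    ("latin_O", "Japanese"): ("〇", 0.85),
    ("ideographic_zero", "English"): ("O", 0.60),
    ("ideographic_zero", "Japanese"): ("〇", 0.95),
}

THRESHOLD = 0.80  # hypothetical "close enough" cutoff


def recognize(shape: str, primary: str, secondary: Optional[str] = None) -> str:
    """Try the primary language first; consult the secondary only on a weak match."""
    glyph, score = CANDIDATES[(shape, primary)]
    if score >= THRESHOLD or secondary is None:
        # The primary match is good enough, so the secondary is never consulted.
        return glyph
    alt_glyph, alt_score = CANDIDATES[(shape, secondary)]
    return alt_glyph if alt_score > score else glyph


for primary, secondary in (("Japanese", "English"), ("English", "Japanese")):
    for shape in ("latin_O", "ideographic_zero"):
        print(primary, "primary:", shape, "->", recognize(shape, primary, secondary))
```

With these made-up numbers, Japanese as primary turns both shapes into 〇, while English as primary keeps O as O and falls back to Japanese for 〇, which matches the behavior described above.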

Hello, Alan! Thank you for finding the root of the issue.

As I wrote in the original post:

I guess this statement from DEVONthink 3 help is either outdated or incorrect? Either way, please also fix it in a future update. Thanks :smiley:

We will update the documentation.
