OCR languages: what’s the point of the combined-languages options, e.g. Japanese and English?

xurc · November 26, 2021, 8:55am

Hello,

In DEVONthink 3 Preferences > OCR > Primary Language, there are some options that combine two languages, e.g.:

Chinese Simplified and English
Chinese Traditional and English
Japanese and English
Korean and English

I also found this in DEVONthink 3 Help’s OCR section:

Note: The primary language and the secondary languages are treated equally.

My question is: let’s say I’m using DEVONthink’s OCR feature to convert a scanned PDF written in a mixture of Japanese and English. In this case, what are the differences between the following two settings?

Primary Language: Japanese and English
Primary Language: Japanese, Secondary Language: English (or Primary: English, Secondary: Japanese, since the two options are treated equally)

Would those two settings yield different OCR results? If not, what’s the point of those combined options?

xurc · November 27, 2021, 12:07pm

Could someone provide some insight? I’m confused. Thanks in advance

jerwin · November 28, 2021, 12:47am

Some possibilities:
It’s an obsolete holdover from the days when Finereader supported only one language per document, and this was a hack to work around common Japanese practice.
It’s faster
It’s less memory intensive
It’s more accurate

Only benchmarking will tell.

xurc · November 28, 2021, 3:50am

Hi jerwin, thanks for your input!

If I do benchmark the performances, I’ll keep your theories in mind and post the test results in this thread.

xurc · November 28, 2021, 9:04am

Hmm, this is weird.

I just decided I could try contacting ABBYY’s customer support, so I searched for ABBYY’s user manual. The manual for the newest version, ABBYY FineReader PDF 15, can be found here (PDF download, 5.1 MB). In the “Supported OCR and document comparison languages” section on page 288, I didn’t find any options with multiple languages combined like the ones in DEVONthink’s OCR preferences.

I am not familiar at all with ABBYY FineReader PDF’s product itself, so correct me if I’m wrong here, but could it be that the options with combined languages originate from DEVONthink rather than ABBYY?

chrillek · November 28, 2021, 9:19am

Given the state of the documentation, it could be a simple omission (read: the documentation is a piece of crap). Also, DT is using their library product, which might have options differing from the program.

xurc · November 28, 2021, 10:37am

Hi chrillek, thanks for the input!

That gave me a good chuckle haha. I wasn’t prepared for such low quality when downloading the documentation; I just dived straight into the relevant section and took it for granted that it could be trusted (i.e. detailed and up-to-date).

I did consider the fact that DT uses ABBYY’s library product rather than the desktop program, but I made the (not very solid) assumption that the options would be similar.

Anyways, I hope this thread could receive some official explanation from DEVONtechnologies

BLUEFROG · November 28, 2021, 5:12pm

Here I have attached a PDF and converted plain text with…

Japanese as Primary and English secondary: JaP-EnS.zip (410.3 KB)
English primary and Japanese secondary: EngP-JaS.zip (403.7 KB)
Japanese Only: Ja only.zip (402.3 KB)
Japanese and English: Japanese and English.zip (399.3 KB)

As seen in the plain text, Japanese as the primary with English as secondary yields better results.
Japanese only yields no discernible English; English only, no discernible Japanese.

@aedwards would be more conversant on this but Japanese and English is the best of the bunch.

xurc · November 29, 2021, 2:02am

Hello Jim, thank you for the concise and helpful tests! It didn’t occur to me that it would be much easier to compare benchmark results if you choose OCR to txt instead of searchable PDF. The conclusion is very useful, and I’ll keep it in mind when using OCR on applicable PDFs.
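
For anyone repeating this comparison, plain-text OCR outputs can be compared with Python’s standard `difflib`. This is only a rough sketch: the function name `compare_ocr` and the two sample strings are made up for illustration and are not taken from Jim’s attachments.

```python
import difflib

def compare_ocr(text_a, text_b):
    """Return a similarity ratio (0.0 to 1.0) and a unified diff of two OCR outputs."""
    ratio = difflib.SequenceMatcher(None, text_a, text_b).ratio()
    diff = list(difflib.unified_diff(
        text_a.splitlines(),
        text_b.splitlines(),
        fromfile="setting A",
        tofile="setting B",
        lineterm="",
    ))
    return ratio, diff

# Invented sample outputs; in practice, read the two exported .txt files instead.
sample_a = "Order No. 10\n注文番号 10"
sample_b = "Order No. 1〇\n注文番号 1〇"

ratio, diff = compare_ocr(sample_a, sample_b)
print(f"similarity: {ratio:.2f}")
print("\n".join(diff))
```

The ratio gives a quick sense of how far apart two settings are, and the diff shows exactly which characters changed, such as a 0 being read as 〇.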

A small note, I think you meant “English Only” instead of “English primary and Japanese secondary” in your second test? The OCR result seems to imply so, since it doesn’t contain any discernible Japanese.

BLUEFROG · November 29, 2021, 7:51am

You’re welcome
Actually the second test should be English with Japanese as secondary. However I’m in bed so I can’t double-check just yet.

xurc · November 29, 2021, 7:57am

Thank you for the hard work!

aedwards · November 29, 2021, 9:50am

ABBYY provided combined languages in older versions of their OCR engine; these were removed in ABBYY v12. DT3 still supports them by converting each one to a primary and a secondary language, e.g.:

Selected Language: Japanese and English
Primary: Japanese
Secondary: English

The only time this is not the case is when you are running macOS 10.11, where DT will use ABBYY v11.
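
The conversion described above can be pictured as a simple lookup. This is only a sketch, not DEVONthink’s actual code: the function name `split_combined` and the table are hypothetical, only the Japanese and English row is confirmed in this thread, and the other three rows are assumed by analogy from the option names.

```python
# Hypothetical lookup table; not DEVONthink's actual implementation.
COMBINED_OPTIONS = {
    "Japanese and English": ("Japanese", "English"),  # confirmed in this thread
    # The rows below are assumed by analogy with the option names.
    "Chinese Simplified and English": ("Chinese Simplified", "English"),
    "Chinese Traditional and English": ("Chinese Traditional", "English"),
    "Korean and English": ("Korean", "English"),
}

def split_combined(selected):
    """Map a selected language option to a (primary, secondary) pair.

    Options that are not combined are returned as (selected, None).
    """
    return COMBINED_OPTIONS.get(selected, (selected, None))

print(split_combined("Japanese and English"))
print(split_combined("Japanese"))
```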

xurc · November 29, 2021, 11:26am

Hello Alan, thank you for your insight, that makes things much clearer to me now.

I still have one question: since “Primary: Japanese and English” and “Primary: Japanese, Secondary: English” are functionally identical, why did these two configurations yield different, although admittedly similar, results in Jim’s benchmarks? The two benchmarks I’m referring to are “Japanese as Primary and English secondary” (JaP-EnS.zip) and “Japanese and English” (Japanese and English.zip).

Thank you for your patience and time.

aedwards · November 29, 2021, 12:00pm

They should be similar. I will check these results.

jerwin · November 30, 2021, 7:39am

In DEVONthink:
Japanese primary, English secondary reads an English O as 〇.
English primary, Japanese secondary doesn’t pick up on the Japanese text at all.

Japanese and English tends to recognize O as O and 〇 as 〇, as appropriate.

But, there’s no difference between methods in Finereader PDF v 15. Moreover, it looks to be even more accurate, at least in terms of not producing garbage characters… It even correctly picks up the ① and ② symbols. I don’t read Japanese, though, so I may have missed something.

Finereader 15.zip (481.5 KB)

xurc · November 30, 2021, 8:13am

Hello jerwin, thanks for the reply.

Indeed, hence my previous confusion about Jim’s benchmark results.

I don’t have access to FineReader PDF, but I’m curious about the discrepancies you described as well.

aedwards · November 30, 2021, 1:05pm

I found why these are different: “Japanese and English” was mistakenly setting English as the primary language rather than Japanese. I will fix this in the next update.

Japanese primary English Secondary reads an english O as 〇
Japanese and English tends to recognize O as O and 〇 as 〇, as appropriate

The reason these produce different results is that the primary language has a greater weighting than the secondary language. So basically, if Japanese is the primary language, the character O is a close enough percentage match to 〇 for it not to look at the secondary language, whereas with English as the primary language, O will match O; however, 〇 will not be a close enough match, so it will look at the secondary language for a match.
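
The weighting described above can be illustrated with a toy model. This is a minimal sketch under loud assumptions: the match scores, the `THRESHOLD` value, and the function name `recognize` are all invented for illustration and say nothing about how ABBYY’s engine actually scores characters.

```python
# Toy model of primary-first matching. All scores and the threshold are invented.
THRESHOLD = 0.85

# For each scanned glyph: the best candidate and match score per language.
SCORES = {
    "latin_O": {"Japanese": ("〇", 0.90), "English": ("O", 0.99)},
    "ideographic_zero": {"Japanese": ("〇", 0.99), "English": ("O", 0.70)},
}

def recognize(glyph, primary, secondary):
    """Accept the primary language's candidate if it scores above THRESHOLD.

    Only when the primary match is too weak is the secondary language consulted.
    """
    char, score = SCORES[glyph][primary]
    if score >= THRESHOLD:
        return char  # good enough, so the secondary language is never checked
    sec_char, sec_score = SCORES[glyph][secondary]
    return sec_char if sec_score > score else char

for primary, secondary in [("Japanese", "English"), ("English", "Japanese")]:
    for glyph in SCORES:
        result = recognize(glyph, primary, secondary)
        print(f"{primary} primary, {glyph}: {result}")
```

With these made-up numbers, Japanese as primary turns a Latin O into 〇 because the primary match already clears the threshold, while English as primary keeps O as O and falls back to Japanese only for 〇.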

xurc · November 30, 2021, 1:20pm

Hello, Alan! Thank you for finding the root of the issue.

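As I wrote in the original post, DEVONthink 3 Help’s OCR section says:

Note: The primary language and the secondary languages are treated equally.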

I guess this statement from DEVONthink 3 help is either outdated or incorrect? Either way, please also fix it in a future update. Thanks

aedwards · November 30, 2021, 1:46pm

We will update the documentation