OCR working incorrectly for Russian documents

pavel.lyakhovsky · January 7, 2013, 1:32pm

Hello all,

OCR stopped working for Russian text some time ago. I have since moved to a new MacBook, but I am not sure whether this is related to the problem.

I am using a ScanSnap S1300 to import paper documents into DEVONthink Pro Office. The scan setting is 300 dpi for color and 600 dpi for B&W documents. All scanned docs are automatically OCRed. All English documents are OCRed correctly, and the text layer can be used for document search and copied to other applications. Russian OCRed text appears as garbage and cannot be used for search either.

For example, the highlighted Russian text here:

is recognized as total garbage:

However, it looks like an encoding problem, because the same characters are always represented by the same garbage characters.
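One way to test the "same character, same garbage" observation is to compare a short passage you can read on the page with what the text layer returns. The snippet below is only a minimal sketch: the function name `substitution_map` is hypothetical, and the garbled strings are invented for illustration, not taken from my actual OCR output. If every expected character always maps to the same output character, a fixed encoding or font mapping is the likely culprit rather than a recognition failure.

```python
def substitution_map(expected, garbled):
    """Return a char -> char mapping if `garbled` is a consistent
    substitution of `expected`; return None otherwise."""
    if len(expected) != len(garbled):
        return None
    mapping = {}
    for want, got in zip(expected, garbled):
        # setdefault stores the first substitution seen for `want`;
        # any later disagreement means the garbling is not consistent.
        if mapping.setdefault(want, got) != got:
            return None
    return mapping


# Invented example strings (assumption, for illustration only).
print(substitution_map("молоко", "vjkjrj"))  # consistent mapping
print(substitution_map("молоко", "vjkxrj"))  # inconsistent, so None
```

A consistent mapping suggests the text could in principle be repaired by translating characters back, while `None` points toward genuinely wrong recognition.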

I have tried deleting and reinstalling the ABBYY plugin, but without any success.

Any help will be appreciated because without normal OCR of Russian text for subsequent search the app becomes useless.

Best regards,
Pavel

pavel.lyakhovsky · January 10, 2013, 11:05am

Any chance of getting a response from the support team?

jradel · March 9, 2013, 5:49am

I think it’s more than an issue with OCR. I imported some PDF+text documents that are bilingual English + Russian. On the original document I can highlight and copy both the Cyrillic and the Roman characters, it works fine, and I can paste them into the search box and find the document. If I create a merged document, the Roman text still works fine, but the Cyrillic text is now garbage and copies as garbage. Also, if I search for a Russian word, the merged document is not found, only one or more of the original documents that were merged together. A search for an English word finds both the merged document and one or more originals.

This rather defeats the purpose of Mac Unicode support and is quite the disappointment, yes?

megob · May 10, 2016, 2:23pm

I know this topic is three years old, but I will try to push it up. I have this problem nearly every day now, as I have to OCR a lot of Russian documents (mostly PDFs).

I tried to set Russian as the only OCR language, but to no avail. My results sometimes resemble those of @pavel.lyakhovsky; sometimes the encoding just seems to switch around: some letters Cyrillic, some Latin, some Greek.
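To put a number on the "some Cyrillic, some Latin, some Greek" mix, the copied text can be tallied by script. This is a rough sketch, assuming the first word of each character's Unicode name is a good enough script label; `script_counts` is a hypothetical helper name, and the sample string is made up rather than copied from my documents.

```python
import unicodedata
from collections import Counter


def script_counts(text):
    """Count letters by the first word of their Unicode character name
    (for example CYRILLIC, LATIN, GREEK)."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            # Fall back to "UNKNOWN" for characters that have no name.
            name = unicodedata.name(ch, "UNKNOWN")
            counts[name.split()[0]] += 1
    return counts


# Made-up sample: "Привет" with a Latin "p" and a Greek epsilon swapped in.
sample = "\u041fp\u0438\u0432\u03b5\u0442"
print(script_counts(sample))
```

A text layer that is supposed to be Russian but shows a noticeable share of LATIN or GREEK letters would match the lookalike-substitution behaviour described above.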

This is in fact all the more astonishing as ABBYY (the supplier of the OCR engine, as far as I know) is a Russian company.

Has anyone had success in OCRing Russian documents, or does the support team have a solution?

cgrunenberg · May 10, 2016, 3:21pm

Mac OS X’s PDF engine isn’t fully compatible with PDF documents created by ABBYY using certain encodings. You could disable the option to enter metadata after recognition. Does this improve the results? However, editing/annotating the PDF documents might break the text layer again.

BLUEFROG · May 10, 2016, 3:22pm

Меня зовут Анна Зайцева. Я русская и русский - мой родной язык.Перед приездом в США я закончила Московский Гуманитарный Институт и получила диплом “Лингвист, переводчик”. Сейчас я живу во Флориде.
Я переведу Ваши документы с русского языка на английский и с английского на русский. Использование моего сайта позволяет Вам перевести документы качественно, быстро, удобно и недорого. В течении 24 часов(включая выходные) я вышлю Вам e-mail с переведенными документами в формате PDF. Если у Вас есть вопросы, Вы можете связаться со мной непосредственно.
С уважением, Анна

I have no idea what this means, so apologies if there’s anything offensive.

This is the text from a PDF I just created by doing OCR on a random image with Russian text.

Note, as with ANY language, if the original is poor quality, OCR will struggle.

Frederiko · May 10, 2016, 3:25pm

Have you perhaps tried the full version of ABBYY FineReader 12?

BLUEFROG · May 10, 2016, 3:28pm

Все счастливые семьи похожи друг на друга, каждая несчастливая
семья несчастлива по-своему. Все смешалось в доме Облонских. Жена узнала, что муж был в связи с бывшею в их доме
француженкою-гувернанткой, и объявила мужу, что не может жить с ним в одном доме. Положение это продолжалось уже третий день и мучительно чувствовалось и самими супругами, и всеми членами
семьи, и домочадцами. Все члены семьи и домочадцы чувствовали, что нет смысла в их сожительстве и что на каждом постоялом дворе случайно сошедшиеся люди более связаны между собой, чем они, члены семьи и домочадцы Облонских. Жена не выходила из своих
комнат, мужа третий день не было дома. Дети бегали по всему дому,
как потерянные; англичанка поссорилась с экономкой и написала записку приятельнице, прося приискать ей новое место; повар ушел
вчера со двора, во время самого обеда; черная кухарка и кучер
просили расчета. На третий день после ссоры князь Степан Аркадьич Облонский — Стива, как его звали в свете, — в обычный час, то есть в восемь часов утра, проснулся не в спальне жены, а в своем кабинете, на сафьянном диване. Он повернул свое полное,
выхоленное тело на пружинах дивана, как бы желая опять заснуть надолго, с другой стороны крепко обнял подушку и прижался к ней щекой; но вдруг вскочил, сел на диван и открыл глаза.

One more, again with apologies if necessary. Note that line breaks may be odd due to the conversion to plain text.

megob · May 10, 2016, 4:51pm

no apologies needed – you posted an ad, and then world classics ))

No success here. Metadata was off, and still is. Quality is set to accurate, resolution same as scan. There is not one Cyrillic letter in the results (and the output seems to change somehow). PDF quality is not too bad, either (it should at least give some results). See attachment.

EDIT ok rtf is not allowed as attachment … will try again

BLUEFROG · May 10, 2016, 8:13pm

Just for due diligence, my OCR settings…

megob · May 10, 2016, 8:49pm

Same setup here (only Russian as first language). Even with the best possible PDF, no success.

OS X system language is English.

BLUEFROG · May 11, 2016, 2:28am

Please start a Support Ticket and ZIP and attach the PDF you’re trying to convert. Thanks.