OCR working incorrectly for Russian documents

Hello all,

OCR stopped working for Russian text some time ago. I have moved to a new MacBook, but I am not sure whether this is related to the problem.

I am using a ScanSnap S1300 to import paper documents into DEVONthink Pro Office. The scan setting is 300 dpi for color and 600 dpi for B&W documents. All scanned docs are automatically OCRed. All English documents are OCRed correctly, and the text layer can be used for document search and also copied to other applications. Russian OCRed text appears as garbage and cannot be used for search either.

For example, the highlighted Russian text here:

is recognized as total garbage:

However, it looks like an encoding problem, because the same characters are always represented by the same garbage characters.
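A minimal Python sketch of how this suspicion could be checked, assuming you can type a short passage by hand and copy the corresponding garbled text out of the PDF. The function name and the sample strings are hypothetical and are not taken from the actual documents. If every original character always maps to the same garbage character, the text layer is probably intact and only the character mapping is wrong.

```python
def build_substitution_map(expected, extracted):
    """Return a char -> char map if the garbling is a consistent
    one-to-one substitution of `expected`, otherwise None."""
    if len(expected) != len(extracted):
        return None
    mapping = {}
    for src, dst in zip(expected, extracted):
        # setdefault stores dst the first time src is seen; a later
        # mismatch means the substitution is not consistent.
        if mapping.setdefault(src, dst) != dst:
            return None
    return mapping


# Hypothetical sample: a hand-typed word and its garbled counterpart.
expected = "молоко"
extracted = "vjkjrj"

mapping = build_substitution_map(expected, extracted)
if mapping is None:
    print("Not a consistent substitution")
else:
    print("Consistent substitution:", mapping)
```

A result of `None` would point away from a simple encoding mix-up and toward genuinely failed recognition.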

I have tried deleting and reinstalling the ABBYY plugin, but without any success.

Any help would be appreciated, because without working OCR of Russian text for subsequent search the app becomes useless.

Best regards,
Pavel

Any chance of getting a response from the support team?

I think it’s more than an issue with OCR. I imported some PDF+text documents that are bilingual English + Russian. In the original documents I can highlight and copy both the Cyrillic and the Roman characters without problems, and I can paste them into the search box and find the document. If I create a merged document, the Roman text still works fine, but the Cyrillic text is now garbage and copies as garbage. Also, if I search for a Russian word, the merged document is not found, only one or more of the original documents that were merged together. A search for an English word finds both the merged document and one or more originals.

This rather defeats the purpose of Mac Unicode support and is quite the disappointment, yes?

I know this topic is three years old, but I will try to push it up. I have this problem nearly every day now, as I have to OCR a lot of Russian documents (mostly PDFs).

I tried to set Russian as the only OCR language, but to no avail. My results sometimes resemble those of @pavel.lyakhovsky; sometimes the encoding just seems to switch around: some letters Cyrillic, some Latin, some Greek.
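A small Python sketch for making the "switching around" visible, assuming the text layer can be copied out of the PDF as plain text. It counts letters per script using the first word of each character's Unicode name; the function name and the sample string are hypothetical. Look-alike letters such as Latin "e" and Cyrillic "е" are indistinguishable on screen, so the sample is written with escapes.

```python
import unicodedata
from collections import Counter


def script_histogram(text):
    """Count letters by script, using the first word of the Unicode
    character name (e.g. CYRILLIC, LATIN, GREEK) as the script label."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        script = name.split(" ", 1)[0] if name else "UNKNOWN"
        counts[script] += 1
    return counts


# Hypothetical sample: Cyrillic letters with a Greek rho (U+03C1)
# and a Latin "e" mixed in.
sample = "\u041f\u03c1\u0438\u0432e\u0442"
print(script_histogram(sample))
```

A Russian passage that reports a noticeable share of LATIN or GREEK letters is a sign that the recognized characters were mapped to the wrong code points, which would also explain why searching for a Russian word fails.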

This is in fact even more astonishing, as ABBYY (the supplier of the OCR engine, as far as I know) is a Russian company.

Has anyone had success in OCRing Russian documents, or does the support team have a solution?

Mac OS X’s PDF engine isn’t fully compatible with PDF documents created by ABBYY using certain encodings. You could disable the option to enter metadata after recognition; does this improve the results? However, editing/annotating the PDF documents might break the text layer again.

I have no idea what this means, so apologies if there’s anything offensive. :mrgreen:

This is the text from a PDF I just created by doing OCR on a random image with Russian text.

Note, as with ANY language, if the original is poor quality, OCR will struggle.

Have you perhaps tried the full version of ABBYY FineReader 12?

One more - again with apologies, if necessary :smiley: . Note line breaks may be odd due to the conversion to plain text.

no apologies needed – you posted an ad, and then world classics ))

No success here. Metadata was off, and is still off. Quality is set to accurate, resolution same as scan. In the results there is not one Cyrillic letter (the output seems to change somehow). The PDF quality is not too bad, either (it should at least give some results). See attachment.

EDIT: OK, RTF is not allowed as an attachment … will try again

Just for due diligence, my OCR settings…

Same setup here (only Russian as first language). Even with the best possible PDF, no success.

The OS X system language is English.

Please start a Support Ticket and ZIP and attach the PDF you’re trying to convert. Thanks.