OCR stopped working for Russian text some time ago. I have since moved to a new MacBook, but I am not sure whether that is related to the problem.
I am using a ScanSnap S1300 to import paper documents into DEVONthink Pro Office. The scan settings are 300 dpi for color and 600 dpi for B&W documents. All scanned documents are OCRed automatically. English documents are OCRed correctly, and their text layer can be searched and copied to other applications. Russian OCRed text appears as garbage and cannot be used for search either.
I think it’s more than an issue with OCR. I imported some PDF+text documents that are bilingual English + Russian. In the original document I can highlight and copy both the Cyrillic and the Roman characters without problems, and I can paste them into the search box and find the document. If I create a merged document, the Roman text still works fine, but the Cyrillic text is now garbage and copies as garbage. Also, if I search for a Russian word, the merged document is not found, only one or more of the original documents that were merged together. A search for an English word finds both the merged document and one or more of the originals.
This rather defeats the purpose of Mac Unicode support and is quite the disappointment, yes?
I know this topic is three years old, but I will try to push it up. I have this problem nearly every day now, as I have to OCR a lot of Russian documents (mostly PDFs).
I tried setting Russian as the only OCR language, but to no avail. My results sometimes resemble those of @pavel.lyakhovsky; sometimes the encoding just seems to switch around: some letters Cyrillic, some Latin, some Greek.
This is all the more astonishing because ABBYY (the supplier of the OCR engine, as far as I know) is a Russian company.
Has anyone had success OCRing Russian documents, or does the support team have a solution?
Mac OS X’s PDF engine isn’t fully compatible with PDF documents created by ABBYY using certain encodings. You could disable the option to enter metadata after recognition. Does this improve the results? However, editing or annotating the PDF documents might break the text layer again.
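To check whether a given text layer is affected, it can help to copy a few words out of the PDF and count which scripts the letters actually belong to. The snippet below is only a rough sketch using Python's standard `unicodedata` module; the function names `script_counts` and `looks_mixed` are made up for illustration and have nothing to do with DEVONthink or ABBYY. It treats the first word of each character's Unicode name (for example `CYRILLIC`, `LATIN`, `GREEK`) as the script, which is a simplification but is enough to spot words that mix alphabets.

```python
import unicodedata
from collections import Counter


def script_counts(text):
    """Count letters per script, taking the first word of each
    character's Unicode name as an approximation of its script."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # ignore digits, whitespace and punctuation
        name = unicodedata.name(ch, "")
        script = name.split(" ")[0] if name else "UNKNOWN"
        counts[script] += 1
    return counts


def looks_mixed(word):
    """Return True if a single word contains letters from more than one script."""
    return len(script_counts(word)) > 1


if __name__ == "__main__":
    # Paste text copied from the PDF text layer here (hypothetical sample).
    copied = "\u041f\u0440\u0438\u0432\u0435\u0442 \u041fp\u0438\u0432e\u03c4"
    for word in copied.split():
        print(word, dict(script_counts(word)), "mixed" if looks_mixed(word) else "ok")
```

If words that look Russian on screen are reported as mixed, or come back as entirely Latin or Greek, the recognition itself may be fine and the damage is more likely in how the text layer is encoded or rewritten.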
no apologies needed – you posted an ad, and then world classics ))
No success here. Metadata was off, and still is. Quality is set to accurate, resolution the same as the scan. There is not a single Cyrillic letter in the results (the encoding seems to change somehow). The PDF quality is not too bad, either (it should at least give some results). See attachment.
EDIT: OK, RTF is not allowed as an attachment … will try again