Reconvert PDF: OCR not picking up macrons


I’m going over some PDFs in Latin I converted years ago, but have never got round to reading…

I can’t remember which language I used to do the conversion back then, but the PDF layer isn’t picking up the macrons: or rather, it is picking up the macrons, and it’s converting them to the wrong accents…

I thought this would simply be because I’d not got the right language the first time, so I’ve tried reconverting the text, with the Maori language (it uses macrons, which means you don’t have to mess around with custom keyboard layouts). It still doesn’t work:

Original text:

OCR layer (both the original and the Maori/Latin version I tried today):

Dum Latinus in Italiâ in pâce diûturnâ rēgnat, Trōia seu Ilium, clârissima Asiae urbs, post bellum decem annorum tandem ā Graecis capta est. Graeci enim, cum urbem viexpugnâre nōn possent, dolo usisunt: equum ingentem ē ligno fabricâvërunt eumque militibus armātis complêvêrunt, quibus praefecti erant Ulixes et Pyr-

As you can see, some of the macrons are converted correctly, but others aren’t, in a fairly random fashion.
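If the wrong accents turn out to be consistent, one possible stopgap is to post-process the extracted text rather than the PDF itself. The following is only a rough sketch. It assumes, based on the sample above, that the stray marks are circumflexes and diaereses standing in for macrons. The function name `fix_macrons` is made up for illustration. It will also "correct" any circumflex or diaeresis that genuinely belongs in the text, so it is only suitable for texts where those marks never legitimately occur.

```python
import unicodedata

# Combining marks that the OCR layer seems to substitute for the macron.
# Assumption drawn from the sample above: circumflex and diaeresis.
WRONG_MARKS = {"\u0302", "\u0308"}
MACRON = "\u0304"
VOWELS = set("aeiouyAEIOUY")


def fix_macrons(text: str) -> str:
    """Replace a circumflex or diaeresis on a vowel with a macron."""
    # Decompose so each accented letter becomes base letter + combining mark.
    decomposed = unicodedata.normalize("NFD", text)
    out = []
    for ch in decomposed:
        if ch in WRONG_MARKS and out and out[-1] in VOWELS:
            out.append(MACRON)  # swap the wrong mark for a macron
        else:
            out.append(ch)
    # Recompose into single precomposed characters where they exist.
    return unicodedata.normalize("NFC", "".join(out))


print(fix_macrons("in pâce diûturnâ rēgnat"))
```

This only touches the text after extraction; it does not change the OCR layer stored in the PDF.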


Is this likely to be an issue with the language of the conversion, or the quality of the conversion?

Is changing the language and re-OCRing an already OCR’d text likely to succeed, or is there a way of stripping out the existing layer and starting from scratch?


@aedwards would have to assess this.

From the testing I did, I couldn’t detect them either.

I think language

By default, the Latin language recogniser in ABBYY OCR does not support macrons, and I assume the same is true of Maori. I will talk to ABBYY to see if there are ways to support macrons.

Thanks Alan!

The Apple Maori keyboard layout does support macrons by default (AIUI they’re part of the language in the way that macrons aren’t for Latin, but I could be wrong), so you’d think that ABBYY would recognise that. It will be interesting to see their response.

Thanks again.

I don’t know what characters ABBYY has defined for Maori. Unfortunately, because it is not a widely used language, there are no examples or information on their support site that I can use as a reference. Once I have an answer I will let you know.

After a bit of playing, setting the language to Latvian works! It looks like it will reconvert existing PDF+Text files, too, which is a bonus.

Italiae incolae prīml Aborīginēs fuērunt, quorum rēx Sāturnus tantā iūstitiā fuisse dlcitur ut nec servīret quis- quam sub illo nec quidquam suum proprium habēret,

Yes, it’s not perfect (prīml should be prīmī) but OCR is never perfect anyway, so this is good enough for my needs, and there’s no need to do any further digging just on my account.
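A possible explanation for why Latvian helps, and for where it may still fall short: the modern standard Latvian alphabet includes ā, ē, ī and ū, but not ō or ȳ. The sketch below simply compares the two sets. Whether ABBYY's Latvian recogniser uses exactly the standard alphabet is an assumption on my part, and the names in the code are made up for illustration.

```python
# Long vowels of the modern standard Latvian alphabet (lowercase):
# a-macron, e-macron, i-macron, u-macron.
LATVIAN_LONG_VOWELS = set("\u0101\u0113\u012b\u016b")

# Macron vowels typically marked in Latin teaching texts (lowercase):
# the four above plus o-macron and y-macron.
LATIN_MACRON_VOWELS = set("\u0101\u0113\u012b\u014d\u016b\u0233")


def uncovered(needed=LATIN_MACRON_VOWELS, available=LATVIAN_LONG_VOWELS):
    """Return the needed characters that the available set lacks."""
    return sorted(needed - available)


print(uncovered())
```

If that assumption holds, long o is the one to watch for when proofreading, which would fit "quorum" and "illo" coming out without macrons in the sample above.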

Thanks for looking into it, though — I appreciate your help.