Reconvert PDF: OCR not picking up macrons

brookter · April 28, 2020, 6:54pm

Hi,

I’m going over some PDFs in Latin I converted years ago, but have never got round to reading…

I can’t remember which language I used to do the conversion back then, but the PDF layer isn’t picking up the macrons: or rather, it is picking up the macrons, and it’s converting them to the wrong accents…

I thought this would simply be because I’d not got the right language the first time, so I’ve tried reconverting the text with the language set to Maori (it uses macrons, which means you don’t have to mess around with custom keyboard layouts). It still doesn’t work:

Original text:

OCR layer (both the original and the Maori/Latin version I tried today):

Dum Latinus in Italiâ in pâce diûturnâ rēgnat, Trōia seu Ilium, clârissima Asiae urbs, post bellum decem annorum tandem ā Graecis capta est. Graeci enim, cum urbem viexpugnâre nōn possent, dolo usisunt: equum ingentem ē ligno fabricâvërunt eumque militibus armātis complêvêrunt, quibus praefecti erant Ulixes et Pyr-

As you can see, some of the macrons are converted correctly, but others aren’t, in a fairly random fashion.
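For text already copied out of the OCR layer, the wrong accents can be patched after the fact. The following is only a minimal sketch, assuming that every circumflex or diaeresis on a vowel stands for a macron in the print (true of the sample above, but it would also rewrite a genuine diaeresis such as the one in "poëta"). The function name `restore_macrons` is made up for illustration; it works on plain text and does not touch the PDF layer itself.

```python
import unicodedata

# Combining marks the OCR layer produced where the print has a macron.
# Assumption: in this text they are always misread macrons.
WRONG_MARKS = {"\u0302", "\u0308"}  # combining circumflex, combining diaeresis
MACRON = "\u0304"                   # combining macron
VOWELS = set("aeiouyAEIOUY")


def restore_macrons(text: str) -> str:
    """Replace a circumflex or diaeresis on a vowel with a macron."""
    # NFD splits each accented letter into base letter + combining mark.
    decomposed = unicodedata.normalize("NFD", text)
    out = []
    for ch in decomposed:
        if ch in WRONG_MARKS and out and out[-1] in VOWELS:
            out.append(MACRON)  # swap the wrong mark for a macron
        else:
            out.append(ch)
    # NFC recombines base letter + macron into a single character.
    return unicodedata.normalize("NFC", "".join(out))


if __name__ == "__main__":
    print(restore_macrons("in pâce diûturnâ rēgnat"))
```

This only repairs the accents; it cannot recover macrons the OCR dropped altogether (as in "annorum"), and it leaves spacing errors such as "usisunt" alone.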

So…

Is this likely to be an issue with the language of the conversion, or the quality of the conversion?

Is changing the language and re-OCRing a text that has already been OCR’d likely to succeed, or is there a way of stripping out the existing layer and starting from scratch?

Thanks…

BLUEFROG · April 28, 2020, 7:04pm

@aedwards would have to assess this.

From the testing I did, I couldn’t detect them either.

Silverstone · April 28, 2020, 7:44pm

I think it’s the language.

aedwards · April 29, 2020, 8:42am

By default the Latin language recogniser in ABBYY OCR does not support macrons and I assume that Maori would be the same. I will talk to ABBYY to see if there are ways to support macrons.

brookter · April 29, 2020, 10:58am

Thanks Alan!

The Apple Maori keyboard layout does support macrons by default (AIUI they’re part of the language in the way that macrons aren’t for Latin, but I could be wrong), so you’d think that ABBYY would recognise that. It will be interesting to see their response.

Thanks again.

aedwards · April 29, 2020, 2:28pm

I don’t know what characters ABBYY has defined for Maori. Unfortunately, as it is not a widely used language, there are no examples or information on their support site that I can use as a reference. Once I have an answer I will let you know.

brookter · April 29, 2020, 3:58pm

Alan,

After a bit of playing, setting the language to Latvian works! It looks like it will reconvert existing PDF+Text documents, too, which is a bonus.

Italiae incolae prīml Aborīginēs fuērunt, quorum rēx Sāturnus tantā iūstitiā fuisse dlcitur ut nec servīret quis- quam sub illo nec quidquam suum proprium habēret,

Yes, it’s not perfect (prīmI should be prīmī) but OCR is never perfect anyway, so this is good enough for my needs, and there’s no need to do any further digging just on my account.

Thanks for looking into it, though — I appreciate your help.
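As a rough way of seeing why Latvian helps and where it may still fall short, the sketch below compares the long vowels a macronised Latin text can contain with the standard modern Latvian alphabet. It assumes the Latvian recogniser covers exactly that alphabet, which nobody in this thread has confirmed for ABBYY; the variable names are purely illustrative.

```python
import unicodedata


def chars(s: str) -> set:
    """Return the set of characters in s, normalised to precomposed form."""
    return set(unicodedata.normalize("NFC", s))


# Long vowels a macronised Latin text may contain (lowercase only).
LATIN_MACRON_VOWELS = chars("āēīōūȳ")

# Standard modern Latvian alphabet (lowercase).
# Assumption: the OCR engine's Latvian character set matches this.
LATVIAN_ALPHABET = chars("aābcčdeēfgģhiījkķlļmnņoprsštuūvzž")

covered = LATIN_MACRON_VOWELS & LATVIAN_ALPHABET
missing = LATIN_MACRON_VOWELS - LATVIAN_ALPHABET

print("covered by Latvian:", " ".join(sorted(covered)))
print("not in Latvian:    ", " ".join(sorted(missing)))
```

Under that assumption, ā, ē, ī and ū are covered while ō and ȳ are not, so long o may still come out without its macron when the language is set to Latvian.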