OCR woes...

I upgraded from DT Pro to Pro Office today, because I wanted to OCR some PDF scans. The scans are from the local authority’s planning website. The couple I’ve tried took a very long time to convert, and the readable PDFs came out as a sort of blurry nonsense. For example, this PDF was rendered into gibberish.

Any suggestions about what I may be doing wrong?

I experimented with downloading one of those pages, the Application Form, with similar unsatisfactory results.

Although Preview will open the original downloaded PDF, Graphic Converter (which can open multipage PDF files) reports that this is an “unknown” file type, or a broken file, possibly with a JPEG stream. So there’s something wrong with the file as I downloaded it.

I’ve sent everything I could gather about the file to Annard.

Interesting… I get that too with GraphicConverter. It’s fixed by saving it out of Preview under a new name. However, the OCR still then produces gibberish…

This problem is being investigated by IRIS; it seems to be Mac-specific. We will bring out an updated maintenance release when we get an updated version from them.
As I told another user by email, your best workaround for now is to keep both versions: the original and the one with the searchable text (because the OCR does work for the examples that I have received so far).

I downloaded, dragged into DTPO, did a Convert.

Came out fine. All looks just like the original, text is selectable.

Did I do something wrong, err, I mean right?

I’m also getting gibberish when I try to convert my scanned PDFs to readable files.

I’m scanning on an HP C6180 and importing the PDFs to DTPO. The PDFs look fine, but when I convert them using Data>Convert>to Searchable PDF, the resulting file looks as if it’s been stretched horizontally.

If I recognize the text directly in ReadIRIS 11.6.1 using the PDF created on the HP scanner, the results are just fine.

Oddly, when I downloaded a PDF from the site linked in the OP, the searchable PDF created in DTPO was fine, although it took an inordinately long time.

Perhaps it’s related to the properties of the PDF file.

Terry

I was beginning to think this was only happening to me. I was scanning straight into DTP from a ScanSnap. Page 1 would be fine, but page 2 onwards would be unviewable. Eventually I got round to looking at the scanned file and that was fine.

This has had a hidden bonus for me. I now scan to PDF, review the file in Preview, remove any unwanted pages, e.g. blanks where the scanner thought it was double sided or where the extra pages are unnecessary for my needs. This is hopefully reducing the amount of space my files take up. Think I’ll stick with this approach and when the OCR system is fixed I’ll start importing files using OCR again.

I downloaded version 1.5.1 today hoping that my OCR woes would be cured, but was sadly disappointed. Converting PDFs to Searchable documents still renders the underlying PDF a pixelated mess.

BUT THEN, I looked in Preferences and saw that my OCR settings were using a dpi of 150. I was scanning at 400. When I changed the dpi in Preferences to 400 the conversion worked fine, as did Import>Images (with OCR).
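For anyone wondering why a dpi mismatch like this could matter so much, here is a rough sketch of the arithmetic in Python. It assumes the OCR step re-renders each page image at the dpi set in Preferences, which is only a guess at the cause and not something confirmed in this thread. The US Letter page size is also an assumption, and `pixel_size` is just an illustrative helper.

```python
def pixel_size(width_in, height_in, dpi):
    """Return the (width, height) in pixels of a page rendered at the given dpi."""
    return round(width_in * dpi), round(height_in * dpi)

# Assumed page size: US Letter, 8.5 x 11 inches (not stated in the thread).
scanned = pixel_size(8.5, 11, 400)    # dpi used at the scanner
resampled = pixel_size(8.5, 11, 150)  # dpi found in the OCR preferences

# Linear scale factor and the fraction of the original pixels that survive.
linear_ratio = 150 / 400
pixel_ratio = (resampled[0] * resampled[1]) / (scanned[0] * scanned[1])

print(scanned, resampled)
print(f"linear scale: {linear_ratio:.3f}, pixels kept: {pixel_ratio:.1%}")
```

Under those assumptions a 3400 x 4400 pixel scan would be cut down to 1275 x 1650, roughly 14% of the original pixel count, which would fit the pixelated look described above.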

I can’t say that the new version fixed anything since I hadn’t fiddled with preferences in the prior version. But it works.

My prior work-around was to recognize the text directly in Readiris before importing the files to DTP.

Terry

I’ve turned the quality down in DTPO and it works fine with my ScanSnap at full speed.