Scan of pdf and OCR

traurig · December 15, 2009, 12:11pm

How does the OCR tool in DTPO2 work?

I scanned a paper document and imported it into DTPO2 via OCR.
The original file is a PDF of about 1.6 MB.
The imported (OCRed) file is a PDF+Text of about 6 MB.

I think the OCR generates an additional file with the text, and this text is linked to each position in the original.

Is that right?

When I open the scanned document in DTPO2, what am I looking at?

The original picture or the text?

Normally text files are smaller than pictures, so why is the scanned file bigger than the original?

Jochen

Bill_DeVille · December 15, 2009, 8:46pm

Think of a searchable PDF as having two ‘layers’ — an image layer, which one sees when viewing the file within a PDF viewer, and a text layer, which lies below the image layer and isn’t visible to the viewer.

A PDF produced by a scanner contains only the image layer, which is a picture of the scanned paper copy.

OCR (Optical Character Recognition) ‘looks’ at the picture of the paper that was scanned, and ‘sees’ the individual characters of text contained in the picture, converting, for example, a picture of the letter ‘A’ to a computer text character ‘A’. The result of the OCR conversion will be a searchable PDF containing both an image ‘layer’ and a text ‘layer’.

Why is the resulting searchable PDF often much larger than the image-only PDF? That’s because in the OCR procedure the original image isn’t retained intact, but is recreated as a bitmap picture of the original image. Most OCR applications make the new PDF image layer with the tools built into OS X, and these aren’t very size-efficient. (Hopefully, we will see more efficiency in the OS X tools in the future.)

It’s true that the added text layer after OCR also adds a bit to the size of the searchable PDF, but this is a minor addition in size, compared to the size increase resulting from re-creation of the image layer.
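To give a feel for the proportions, here is a rough back-of-the-envelope sketch in Python. The page size (A4), the 300 dpi resolution, 24-bit colour and 3000 characters per page are all assumptions chosen for illustration, and the figures are uncompressed, so a real PDF will be smaller. The point is only that the picture outweighs the text by several orders of magnitude.

```python
# Rough, uncompressed size comparison for one scanned page.
# All numbers are illustrative assumptions, not measurements from DTPO.

PAGE_WIDTH_IN = 8.27    # A4 width in inches (assumed page size)
PAGE_HEIGHT_IN = 11.69  # A4 height in inches (assumed page size)
DPI = 300               # assumed scan resolution
BYTES_PER_PIXEL = 3     # 24-bit colour, before any compression
CHARS_PER_PAGE = 3000   # assumed amount of text on a full page


def raw_image_bytes(width_in, height_in, dpi, bytes_per_pixel):
    """Uncompressed size of the bitmap for one page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bytes_per_pixel


def text_layer_bytes(chars, bytes_per_char=1):
    """Very rough size of the recognised text alone (no positioning data)."""
    return chars * bytes_per_char


image_bytes = raw_image_bytes(PAGE_WIDTH_IN, PAGE_HEIGHT_IN, DPI, BYTES_PER_PIXEL)
text_bytes = text_layer_bytes(CHARS_PER_PAGE)

print(f"image layer (raw): {image_bytes / 1_000_000:.1f} MB")
print(f"text layer (raw):  {text_bytes / 1_000:.1f} KB")
print(f"ratio: about {image_bytes // text_bytes}:1")
```

Compression shrinks the image considerably in practice, but the text remains the small part of the file.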

DTPO Preferences > OCR contains some user-adjustable settings that allow some reduction of the file size of the resulting searchable PDF, by reducing the resolution (dpi — dots per inch) or the quality of images. The result is a compromise, as lower dpi and image quality settings will reduce the viewed or printed quality of the PDF, as well as the file size.
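Why the resolution setting has such a strong effect can be sketched with a little arithmetic. The number of pixels in a page grows with the square of the dpi, so halving the resolution leaves roughly a quarter of the pixels. The snippet below is only an illustration: the 300 dpi starting point is an assumption, the function name is made up for this example, and the compressed file size will not follow the pixel count exactly. The image quality setting is not modelled here at all.

```python
def relative_pixel_count(new_dpi, old_dpi):
    """Fraction of the original pixel count left after changing the resolution.

    Both the width and the height scale with dpi, hence the square.
    """
    return (new_dpi / old_dpi) ** 2


ORIGINAL_DPI = 300  # assumed resolution of the original scan

for dpi in (300, 200, 150, 72):
    fraction = relative_pixel_count(dpi, ORIGINAL_DPI)
    print(f"{dpi:>3} dpi -> {fraction:.0%} of the original pixels")
```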

Beginning in public beta 8, there’s a checkbox to retain the original scan settings (primarily resolution) for the OCRed PDF. This will make the resulting searchable PDF look better, but will also result in growth of file size.

So there’s a balancing act, depending upon the user’s needs and preferences, in choosing the resolution and quality of the searchable PDFs after OCR.

traurig · December 16, 2009, 10:43am

Thanks

Very helpful feedback

Jochen

traurig · December 16, 2009, 4:42pm

I’m not very familiar with OCR.

Suppose I have a searchable PDF with text.
The document contains the digits 6,6,6,6,6 and 5.
The OCR made a mistake and recognized the 5 as a 6.

So the document's text contains only 6,6,6,6,6,6.

If I search for 6,

the search finds all six of them.

When I go to the wrong 6, what do I see?
The wrong 6 from the text layer or the correct 5 from the image layer?

Jochen

Bill_DeVille · December 16, 2009, 4:57pm

Because the image layer is a picture of the original paper and is a faithful representation of it, you will see 6,6,6,6,6,5.
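A toy model in Python may make this concrete. It is a deliberately simplified sketch, not how a PDF actually stores its data: the `Glyph` class and the `search` and `view` functions are invented for this example. Each position on the page has the character visible in the image layer and the character OCR stored in the hidden text layer. Searching consults only the text layer; viewing shows only the image layer.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    shown: str       # what the image layer shows (the scanned paper)
    recognised: str  # what OCR stored in the hidden text layer


def search(page, query):
    """Return the positions whose text-layer character matches the query."""
    return [i for i, glyph in enumerate(page) if glyph.recognised == query]


def view(page, position):
    """Return what the reader sees at a position: always the image layer."""
    return page[position].shown


# Five correctly recognised 6s, then a 5 that OCR misread as a 6.
page = [Glyph("6", "6") for _ in range(5)] + [Glyph("5", "6")]

hits = search(page, "6")
for position in hits:
    print(f"hit at position {position}: you see {view(page, position)!r}")
```

In this model the search reports six hits for 6, and the last hit displays a 5. The flip side is that a search for 5 finds nothing, because the text layer does not contain one.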

traurig · December 16, 2009, 5:26pm

Great and thanks.

Jochen