Think of a searchable PDF as having two ‘layers’ — an image layer, which one sees when viewing the file within a PDF viewer, and a text layer, which lies below the image layer and isn’t visible to the viewer.
A PDF produced by a scanner contains only the image layer, which is a picture of the scanned paper copy.
OCR (Optical Character Recognition) ‘looks’ at the picture of the scanned page and ‘sees’ the individual characters of text it contains, converting, for example, a picture of the letter ‘A’ to the computer text character ‘A’. The result of the OCR conversion is a searchable PDF containing both an image ‘layer’ and a text ‘layer’.
Why is the resulting searchable PDF often much larger than the image-only PDF? Because the OCR procedure doesn’t retain the original image intact; instead, it recreates the image layer as a new bitmap picture of the original. Most OCR applications build that new PDF image layer with the tools built into OS X, and those tools aren’t very size-efficient. (Hopefully, we will see more efficiency in the OS X tools in the future.)
It’s true that the added text layer after OCR also adds a bit to the size of the searchable PDF, but this is a minor addition in size, compared to the size increase resulting from re-creation of the image layer.
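To get a rough feel for the proportions, here is a back-of-the-envelope sketch in Python. The page size (US Letter), the 300 dpi resolution, and the 3,000-character page are illustrative assumptions, not measurements from any OCR application. The figures are for uncompressed pixel data; a real PDF compresses its image layer, so actual files are much smaller, but the image typically still dwarfs the text.

```python
def raw_image_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size, in bytes, of a page rendered as a bitmap."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


# Assumed values, for illustration only.
PAGE_W_IN, PAGE_H_IN = 8.5, 11.0   # US Letter
DPI = 300
CHARS_ON_PAGE = 3000               # a fairly dense page of text

gray = raw_image_bytes(PAGE_W_IN, PAGE_H_IN, DPI, 8)    # 8-bit grayscale
color = raw_image_bytes(PAGE_W_IN, PAGE_H_IN, DPI, 24)  # 24-bit color
text = CHARS_ON_PAGE                                    # roughly 1 byte per character

print(f"grayscale bitmap: {gray:,} bytes")
print(f"color bitmap:     {color:,} bytes")
print(f"text layer:       {text:,} bytes (approx.)")
print(f"grayscale bitmap is about {gray // text:,}x the text")
```

Even allowing for heavy image compression, the text layer remains a small fraction of the total, which is why re-creating the image layer dominates the size change.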
DTPO Preferences > OCR contains some user-adjustable settings that allow some reduction of the file size of the resulting searchable PDF, by reducing the resolution (dpi, or dots per inch) or the quality of images. The result is a compromise: lower dpi and image quality settings reduce the file size, but they also reduce the viewed or printed quality of the PDF.
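The effect of the resolution setting is easy to underestimate, because the pixel count grows with the square of the dpi. The short Python sketch below illustrates this for an assumed US Letter page; the dpi values are examples only, not the options offered in DTPO, and the pixel count is just a proxy for file size, since compression also plays a part.

```python
def pixel_count(width_in, height_in, dpi):
    """Number of pixels in a page bitmap at the given resolution."""
    return round(width_in * dpi) * round(height_in * dpi)


# Assumed page size (US Letter) and example resolutions.
PAGE_W_IN, PAGE_H_IN = 8.5, 11.0
baseline = pixel_count(PAGE_W_IN, PAGE_H_IN, 300)

for dpi in (300, 200, 150):
    n = pixel_count(PAGE_W_IN, PAGE_H_IN, dpi)
    print(f"{dpi} dpi: {n:,} pixels ({n / baseline:.0%} of the 300 dpi page)")
```

Halving the resolution from 300 to 150 dpi leaves only a quarter of the pixels, so a modest drop in dpi can give a large saving in size, at the cost of a visibly coarser image.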
Beginning in public beta 8, there’s a checkbox to retain the original scan settings (primarily resolution) for the OCRed PDF. This makes the resulting searchable PDF look better, but it also increases the file size.
So there’s a balancing act, depending upon the user’s needs and preferences, in choosing the resolution and quality of the searchable PDFs after OCR.