Embedded Fonts and Image Size in OCR'ed PDF

I am re-processing some previously OCR’ed PDF files (the existing text layer is terrible) and the new versions are much improved, particularly on pages with multiple columns of text.

The file size of the resulting PDF, however, is dramatically bigger.

When viewed at ‘Actual Size’, the new PDF is much larger and blurry, and Acrobat DC tells me that it also contains a number of embedded fonts that were not in the original PDF.

Multiplied over several documents, the folder size is immense.

Is there anything I can do to mitigate this?

Please define “immense” and what you’re comparing it to.

A folder with 12 files went from 75MB to 469MB.

If I extrapolate that across the other folders I’m planning to process, the final folder might be in the region of 11GB.
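As a rough check on that extrapolation, here is a minimal Python sketch of the arithmetic. The function names are made up for illustration, and it assumes the other folders grow by the same factor as this one (and 1 GB = 1000 MB), which may not hold.

```python
def growth_factor(before_mb, after_mb):
    """How many times larger the folder became after OCR."""
    return after_mb / before_mb


def implied_original_mb(projected_mb, factor):
    """Original size implied by a projected size, assuming uniform growth."""
    return projected_mb / factor


# Figures from this thread: 12 files went from 75 MB to 469 MB.
factor = growth_factor(75, 469)
print(f"Growth factor: {factor:.2f}x")

# An ~11 GB projection would then correspond to roughly this much source material.
original = implied_original_mb(11_000, factor)
print(f"Implied original total: {original:.0f} MB")
```

In other words, the 11 GB estimate corresponds to somewhat under 2 GB of source PDFs growing a little over sixfold.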

I’ve initially tried changing the DPI setting but it doesn’t go below 150 DPI. Judging by the ‘Actual Size’ view, I’m guessing the originals are 72 DPI.
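To get a feel for why the DPI matters so much, here is a small Python sketch of how pixel count scales with DPI. It assumes a US Letter page (8.5 x 11 in) purely for illustration, and that 72 DPI figure is only my guess about the originals. Actual file size also depends on the image compression used, so this only shows the raw pixel counts.

```python
def pixel_count(width_in, height_in, dpi):
    """Number of pixels in a page image of the given physical size and DPI."""
    return round(width_in * dpi) * round(height_in * dpi)


def pixel_ratio(old_dpi, new_dpi):
    """Pixel count grows with the square of the DPI ratio."""
    return (new_dpi / old_dpi) ** 2


# Assumed page size: US Letter. Assumed original resolution: 72 DPI (a guess).
low = pixel_count(8.5, 11, 72)
high = pixel_count(8.5, 11, 150)
print(f"72 DPI: {low:,} px, 150 DPI: {high:,} px")
print(f"Ratio: {pixel_ratio(72, 150):.2f}x")
```

So if the pages really are being resampled from 72 to 150 DPI, each page image would hold a bit more than four times as many pixels before compression.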

A resolution below 150 DPI is not recommended for OCR, as text recognition can be significantly affected.

The larger file sizes usually occur when the file has been resaved after the OCR, which can be due to:

  • Transferring annotations
  • Adding metadata

ABBYY has better compression of PDFs than Apple’s PDFKit. However, the next update fixes an issue where the PDF file was resaved when it did not need to be.

So as long as the original files do not have annotations and you are not adding additional metadata, the file size in the next update should be significantly smaller.

A notable example so far is a source PDF at 6.1 MB, which grows to 49.7 MB after the OCR process.

I don’t know how much the newly embedded fonts contribute to the increase in file size, but the page images are obviously being resized larger (and are fuzzy as a result), so presumably that also contributes to the bigger file.

The newly enlarged files are also quite laggy to work with. They load noticeably slower, and page-to-page navigation is also sluggish with the much larger page images.

Just sent you a message with details of how to try the latest OCR beta, which may fix these issues.

I have similar issues. Can I try the beta as well?

Some interesting observations from some further experiments I did:

  • Turning off Deskew and Page Orientation increases file size a little bit
  • Turning off the compression option decreases file size. My 6.1MB original PDF grows to 31MB instead of 49.7MB - very curious, what is compression for if it makes big files even bigger?
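For anyone wanting to repeat these comparisons on whole folders, here is a minimal Python sketch that totals the PDF sizes in a "before" and an "after" folder. The function names and folder layout are hypothetical; it only reads file sizes and does not touch the PDFs themselves.

```python
from pathlib import Path


def total_pdf_bytes(folder):
    """Sum the sizes of all PDF files under a folder, including subfolders."""
    return sum(
        p.stat().st_size
        for p in Path(folder).rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def compare_folders(before, after):
    """Return (bytes before, bytes after, growth ratio) for two folders."""
    b = total_pdf_bytes(before)
    a = total_pdf_bytes(after)
    ratio = a / b if b else float("nan")  # avoid dividing by zero on an empty folder
    return b, a, ratio
```

Calling `compare_folders` with a copy of the originals and a copy processed with a given combination of OCR settings would give one ratio per experiment, which makes it easier to compare settings than eyeballing individual files.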

Unfair question: In light of this and the ABBYY license problem, any idea when an update that resolves these issues is due? I have a bunch of files needing conversion, so it seems prudent to stop for now.

The larger file sizes are due to the issue I mentioned in a previous post:

ABBYY has better compression of PDFs than Apple’s PDFKit. However, the next update fixes an issue where the PDF file was resaved when it did not need to be.

The ABBYY licence issue will be resolved in the next update. Fixes for issues with ABBYY’s handling of input PDF files will be included in their next update; however, I do not have a date for that release at present.

I see similar behaviour even if the only change is deleting a blank page from a scan made without blank page recognition. But this only affects PDFs I got from third parties (scanners, I believe).

Documents scanned by my local ScanSnap scanner are totally fine. There the OCR is done not by DTP3 but by the scanner software (which also uses ABBYY). This is one of the reasons not to upgrade from Mojave to Catalina, as the ScanSnap software will stop working there (AFAIK).

Thanks, I did not realize that.