Dramatically increased size of PDF after OCR

I’m using the Office version (DEVONthink Pro Office) to convert scanned PDFs (about 30–50 MB each) to PDF+Text to make them searchable.

However, after the process, the resulting PDF is about 3x the size! I tried it on 2 of my zillion scanned PDFs and they went from 44 MB to 150 MB and from 32 MB to 120 MB…

I know embedding text will increase the size, but it shouldn’t be 3x or even more!

I tried PDF Toolkit to reduce the size, but it seems I can only do it by losing quality, which is something I don’t want. What puzzles me is that the 44 MB PDF has the same visual quality (at high zoom) as the 150 MB one, so the difference must come from the image compression algorithm.

Is there anything I can do to get the PDF near the original size without losing image quality?

Thanks in advance.

This is a well-documented consequence of the ABBYY OCR engine that DEVONthink Pro Office licenses.
Threads:

There is very little that can be done to reduce file size without a commensurate decrease in image quality.

Adobe ClearScan (since rebranded, I believe) is one of the only (and I’d say the best) ways to accurately OCR a file and (usually) decrease the file size. However, the catch is that you need an Adobe Acrobat subscription, which runs pretty steep.

DEVONthink’s OCR is very accurate, but it does inflate the size SUBSTANTIALLY unless you really scale back the quality. It’s an unfortunate tradeoff. I am not normally able to sacrifice quality since many of the scanned things I have are long-form reading, and I can’t tolerate looking at pixelated text for hours. I use the OCR in DEVONthink very rarely for this reason, though periodically I will use it for smaller things that I forget to OCR when I scan them.

My hope is that ABBYY eventually develops a more space-efficient engine that DEVONthink can license, or that DEVONthink finds a different engine to license.

Thank you for your answer. I saw those posts, but since they are from 2010 to 2015, I thought something would have been done by DEVONthink or ABBYY by now.

Then I’m afraid I’m still going to have those PDFs without search capabilities. I cannot afford (well, don’t want) to have my 130 GB PDF library multiplied by 3 or 4.

Oh man, the height of absurdity is having a 45 MB scanned PDF end up at 1.8 GB after making it searchable!!!

Completely unacceptable.

This must be resolved by DEVONthink or ABBYY. :confused: :confused:

Which settings do you use in Preferences > OCR? I’d suggest using the default quality (75%).

I’m using 100% as I don’t want a quality reduction of the scanned result.

However, just as a test, I’m going to set it to 75%.

@rfog: What resolution is set?

Original, 600 DPI.

Currently I’m testing at 600 dpi/75% quality; after that I will test at 300 dpi with 100% and 75% quality. This is one of the heaviest (in image complexity) documents I have.

It will take some time, as it takes several hours to process the 216-page document.

600 dpi is excessive and should not be used, unless your scan was black and white only. (That would be a 1-bit scan, which has less data per channel and would compress more easily.) In greyscale or color, it would not be a good idea regardless.

300 dpi would be okay if you were intending to have the document printed in a commercial process.

For desktop printing (or onscreen-only use), a maximum of 200 dpi would be sufficient.
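
To put rough numbers on it (back-of-the-envelope figures of mine, assuming an 8-bit greyscale A4 page, before any compression): 600 dpi is about 4960 × 7020 px ≈ 35 MB of raw image data per page, 300 dpi is about 2480 × 3510 px ≈ 8.7 MB, and 200 dpi is about 1650 × 2340 px ≈ 3.9 MB. The raw data grows with the square of the resolution, so halving the dpi leaves the compressor only a quarter of the data to work with.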

In the end, I think you are right.

Once this try is finished, I will do it at 300 dpi and 75%.

One more question, just out of curiosity: if I have a scanned PDF of, say, 200 dpi, and I have selected 300 in the configuration, is the new PDF generated at 200 dpi or at some kind of interpolated 300 dpi?

I am not sure what method it would use, but it would be upsampled to 300 dpi.
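
If you want to see what actually ended up inside the file, here’s a quick sketch of mine (not a DEVONthink feature; it assumes Poppler’s pdfimages tool is installed, e.g. via Homebrew at /usr/local/bin, and that a PDF is selected in the frontmost window) that lists the embedded images and their effective resolution:

-- Sketch: list the embedded images of the selected PDF, including the
-- x-ppi/y-ppi columns, to see at what resolution they were actually stored.
-- Assumes Poppler's pdfimages is installed (e.g. via Homebrew) at /usr/local/bin.
tell application id "DNtp"
	set theRecord to item 1 of (selected records)
	set thePath to path of theRecord
	-- pdfimages -list prints one row per embedded image with its size and resolution
	do shell script "/usr/local/bin/pdfimages -list " & quoted form of thePath
end tell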

Ok, I did some tests and I’m not happy with the results.

The original file I was converting was about 45 MB, 60 pages of A4 at 600 dpi.

Results of making it searchable:
600 dpi, 100% quality: 1.9 GB
300 dpi, 100% quality: 1.8 GB
300 dpi, 75% quality: process failed with an error.

As this conversion takes more than 3 hours to complete on my iMac i7 with 24 GB of RAM, I took another PDF:

A 46.2 MB scanned PDF at 200 dpi. Results:
300 dpi, 100%: 534 MB
200 dpi, 100%: 250 MB
200 dpi, 75%: 70 MB
200 dpi, 95%: 140 MB

Going from 100% to 75%, the result is noticeably worse, visible to the naked eye on a Retina display.

Going from 100% to 95% is not noticeable, but the difference in size is big.

However, this kind of compression performance is still unacceptable for a professional tool.

Don’t scan to a PDF. Scan to a JPG (and you can use high quality, if you’d like) or TIFF (and use LZW compression, if available).

Hello everybody, it’s 2021 now and I’m catching up on these threads … and on converting PDFs I poorly scanned and OCR’d in 2013 outside of DEVONthink 3. I’m currently finding that running OCR on top of those old PDFs gives me files within +/- 10% of the original file size, which suggests that the issue noted in this thread is solved.

Does everyone else have that experience? I feel pretty good about file efficiency right now but tell me if I’m wrong.

I’m so glad I found this community. Thank you for all the great articles!

Welcome @Avi

It’s hard to predict the size of output due to the variables involved but we’re glad you’re seeing results you like.

Unfortunately, it hasn’t been solved yet. In practice, the size of the OCRed file sometimes decreases and sometimes increases, and in both cases the change can be dramatic. As BLUEFROG says, it’s unpredictable (this doesn’t mean the size varies from run to run for the same file; it varies from file to file).

The final PDF size can be dependent on whether the file has to be modified after OCR. This can happen when the original document contains annotations or other metadata that need to be transferred to the PDF generated by the OCR. These changes are saved using Apple’s PDFKit, which does not provide the same level of compression as ABBYY.

One way to check this is to select the PDF document, open the Document Properties in the Inspector, and look at the Creator field. If the Creator is macOS Version xx.xx Quartz PDFContext, the content was resaved after OCR; otherwise it will show ABBYY FineReader Engine.
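
If you want to check more than a handful of files, here’s a rough script sketch of my own (not a built-in feature) that reads the Spotlight creator attribute for each selected record from the shell; as far as I can tell it mirrors that Creator field, provided Spotlight has indexed the files:

-- Sketch: print the creator of each selected PDF, to spot files that were
-- resaved by Apple's PDFKit (Quartz) instead of being written by ABBYY.
tell application id "DNtp"
	repeat with eachRecord in (selected records)
		set thePath to path of eachRecord
		-- kMDItemCreator is the Spotlight attribute behind the Creator field
		set theCreator to do shell script "mdls -raw -name kMDItemCreator " & quoted form of thePath
		log (name of eachRecord) & ": " & theCreator
	end repeat
end tell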


Not an ideal solution (and I’ve not gotten around to trying it myself), but PDF Squeezer has command-line integration, so in principle one could script it to compress a PDF after OCR if the file size has increased or gone beyond a certain size.

If anyone’s interested, here’s a smart rule script for automating PDF Squeezer compression:

-- Smart rule action: run the PDF Squeezer command-line tool (pdfs) on each
-- record, replacing the file with the compressed version.
on performSmartRule(theRecords)
	-- Path to an exported PDF Squeezer settings profile (.pdfscp)
	set theProfilePath to "[ path to .pdfscp file ]"
	tell application id "DNtp"
		repeat with eachRecord in theRecords
			set thePath to path of eachRecord
			-- Compress the file in place using the chosen profile
			set theResult to do shell script "/usr/local/bin/pdfs " & quoted form of thePath & " --replace --profile " & quoted form of theProfilePath
			-- If the tool printed no output, flag the record via custom metadata
			if theResult is "" then add custom meta data 1 for "mdfilecompressed" to eachRecord
		end repeat
	end tell
end performSmartRule

Setup-wise, after installing the command-line tool within the app, you just need to export your chosen PDF Squeezer settings as a .pdfscp file and put that path in the theProfilePath line near the top of the script. mdfilecompressed should be set up as a boolean field in your custom metadata.

The smart rule itself then looks something like this:

Of course, use at your own risk and test before doing any batch processing!

Worth the $9.99/month fee?