I’m using DEVONthink Pro Office to convert scanned PDFs (about 30-50 MB each) to PDF+Text to make them searchable.
However, after the process, the resulting PDF is about 3x the size! I tried it on 2 of my zillion scanned PDFs: one went from 44 MB to 150 MB and another from 32 MB to 120 MB…
I know embedding text will increase the size, but not by 3x or even more!
I tried PDF Toolkit to reduce the size, but it seems I can only do that by losing quality, which is something I don’t want. What puzzles me is that the 44 MB PDF has the same visual quality (under heavy zoom) as the 150 MB one, so the difference must be the image compression algorithm.
Is there anything I can do to get the PDF near the original size without losing image quality?
This is a well-documented consequence of the ABBYY OCR engine that DEVONthink Pro Office licenses.
Threads:
There is very little that can be done to reduce file size without a commensurate decrease in image quality.
Adobe ClearScan (since rebranded, I believe) is one of the only ways, and I’d say the best, to accurately OCR a file and (usually) decrease the file size. However, the catch is that you need an Adobe Acrobat subscription, which runs pretty steep.
DEVONthink’s OCR is very accurate, but it does inflate the size SUBSTANTIALLY unless you really scale back the quality. It’s an unfortunate tradeoff. I am not normally able to sacrifice quality, since many of the things I’ve scanned are long-form reading, and I can’t tolerate looking at pixelated text for hours. I use the OCR in DEVONthink very rarely for this reason, though periodically I will use it for smaller things that I forgot to OCR when I scanned them.
My hope is that ABBYY eventually develops a more space-efficient engine that DEVONthink can license, or that DEVONthink finds a different engine to license.
Thank you for your answer. I saw those posts, but since they date from 2010 to 2015, I thought something would have been done by DEVONtechnologies or ABBYY by now.
Then I’m afraid those PDFs will remain without search capability. I cannot afford (well, don’t want) to have my 130 GB PDF library multiplied by 3 or 4.
Currently I’m testing at 600 dpi/75% quality; after that I will test at 300 dpi at 100% and at 75% quality. This is one of the heaviest (in image complexity) documents I have.
It will take some time, as processing the 216-page document takes several hours.
600 dpi is excessive and should not be used unless your scan is black and white only (a 1-bit scan carries less data per channel and compresses more easily). In greyscale or color, it would not be a good idea regardless.
300 dpi would be okay if you were intending to have the document printed in a commercial process.
For desktop printing (or onscreen-only use), a maximum of 200 dpi is sufficient.
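The jump from 300 to 600 dpi is bigger than it looks: pixel count grows with the square of the resolution, so a 600 dpi page carries 4x the raw data of a 300 dpi one and 9x that of a 200 dpi one. A quick back-of-envelope check, assuming a US Letter page (8.5 x 11 inches):

```python
def scan_megapixels(width_in, height_in, dpi):
    """Raw pixel count, in millions, for a page scanned at a given resolution."""
    return width_in * dpi * height_in * dpi / 1e6

# Assumed US Letter page, 8.5 x 11 inches
for dpi in (200, 300, 600):
    print(f"{dpi} dpi -> {scan_megapixels(8.5, 11, dpi):.1f} MP")
```

This prints roughly 3.7 MP at 200 dpi, 8.4 MP at 300 dpi, and 33.7 MP at 600 dpi; all that extra data has to be absorbed by compression, which is why the dpi setting dominates the final file size.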
Once this try is finished, I will do it at 300 dpi and 75%.
One more question: if I have a scanned PDF of, say, 200 dpi, and I have selected 300 dpi in the configuration, is the new PDF generated at 200 dpi or at some kind of interpolated 300 dpi? Just out of curiosity.
Hello everybody, it’s 2021 now and I’m catching up on these threads … and on converting PDFs I poorly scanned and OCR’d in 2013 outside of DEVONthink 3. I’m currently finding that running OCR on those old PDFs gives me files within +/- 10% of the original file size, which suggests the issue noted in this thread has been solved.
Does everyone else have that experience? I feel pretty good about file efficiency right now but tell me if I’m wrong.
I’m so glad I found this community. Thank you for all the great articles!
Unfortunately, it hasn’t been solved yet. In practice, the size of an OCRed file sometimes decreases and sometimes increases, and in both cases the change can be dramatic. As BLUEFROG says, it’s unpredictable (this doesn’t mean the size varies from run to run for the same file; rather, it varies from file to file).
The final PDF size can be dependent on whether the file has to be modified after OCR. This can happen when the original document contains annotations or other metadata that need to be transferred to the PDF generated by the OCR. These changes are saved using Apple’s PDFKit, which does not provide the same level of compression as ABBYY.
One way to check this is to select the PDF document, open the Document Properties in the Inspector, and look at the Creator field. If the Creator is “macOS Version xx.xx Quartz PDFContext”, the content was re-saved after OCR; otherwise it will show “ABBYY FineReader Engine”, as below.
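If you want to check a batch of files rather than inspect them one by one, here’s a rough Python sketch that scans a PDF’s raw bytes for an uncompressed /Creator literal string. It’s a heuristic, not a parser: the key can also live in a compressed object stream, in which case this returns None and the Inspector (or Spotlight’s `mdls -name kMDItemCreator`) is the reliable route.

```python
import re
from typing import Optional

def pdf_creator(data: bytes) -> Optional[str]:
    """Look for an uncompressed /Creator literal string in raw PDF bytes.

    Returns the creator text, or None when the key is absent or stored
    in a compressed object stream (common in newer PDFs)."""
    m = re.search(rb"/Creator\s*\(([^)]*)\)", data)
    return m.group(1).decode("latin-1", "replace") if m else None
```

Usage would be something like `pdf_creator(Path("scan.pdf").read_bytes())`, checking whether the result mentions ABBYY or Quartz.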
Not an ideal solution (and I’ve not gotten around to trying it myself), but PDF Squeezer has command-line integration, so in principle one could script it to compress a PDF after OCR if the file size has increased or gone beyond a certain size.
If anyone’s interested, here’s a smart rule script for automating PDF Squeezer compression:
on performSmartRule(theRecords)
	-- Path to the exported PDF Squeezer settings profile
	set theProfilePath to "[ path to .pdfscp file ]"
	tell application id "DNtp"
		repeat with eachRecord in theRecords
			set thePath to path of eachRecord
			-- Compress the file in place using the chosen profile
			set theResult to do shell script "/usr/local/bin/pdfs " & quoted form of thePath & " --replace --profile " & quoted form of theProfilePath
			-- Empty output is treated as success; flag the record as compressed
			if theResult is "" then add custom meta data 1 for "mdfilecompressed" to eachRecord
		end repeat
	end tell
end performSmartRule
Setup-wise, after installing the command-line tool from within the app, you just need to export your chosen PDF Squeezer settings as a .pdfscp file and set theProfilePath in the script to that path. mdfilecompressed should be set up as a boolean in your custom metadata.
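The script above compresses every matched record; the size-gating idea mentioned earlier (only compress when OCR inflated the file, or the result exceeds some ceiling) is simple enough to sketch outside AppleScript. A minimal Python version, where the pdfs install path and the 50 MB cap are my own assumptions:

```python
import subprocess
from pathlib import Path

PDFS = "/usr/local/bin/pdfs"    # PDF Squeezer CLI (assumed install location)
SIZE_CAP = 50 * 1024 * 1024     # arbitrary 50 MB ceiling

def should_compress(size_before: int, size_after: int, cap: int = SIZE_CAP) -> bool:
    """Compress only when OCR grew the file, or the result exceeds the cap."""
    return size_after > size_before or size_after > cap

def compress_in_place(pdf: Path, profile: Path) -> None:
    """Overwrite the PDF using an exported .pdfscp settings profile."""
    subprocess.run(
        [PDFS, str(pdf), "--replace", "--profile", str(profile)],
        check=True,
    )
```

One would record the file size before OCR, compare it afterwards with should_compress, and only then shell out to pdfs; that avoids re-compressing files the OCR pass already left alone.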
The smart rule itself then looks something like this: