PDF--Convert to Searchable question

arathe · May 6, 2009, 1:57am

Hi!

I do a lot of importing PDFs and then converting them to Searchable (via the contextual menu ‘Convert’ command). (I really like this feature!)

I notice that the image (the crispness of the text) often looks slightly less good in the converted file. And I noticed after doing a conversion today that my original file (not converted) weighs in at 6.1 MB – but the converted file, including the new text layer, is only 4 MB. I’m wondering – is the smaller file size due to a lessening of image quality?

And–if this is the case–are there user-accessible settings that control this? Can I opt for larger files after conversion, if I want them crisper and more readable?

Thanks in advance for your help!

arathe · May 6, 2009, 2:34am

OK–one more development.

I found the OCR section under DevonThink’s Preferences. (Since I’m not scanning, I didn’t think these would apply.) Sure enough–If I bump the Quality up to 100% (it was at 75%), the resulting file size when I rescan is no longer 4 MB, but 10MB. Unfortunately, the visible quality increase is marginal.

Here’s a thought… Is there any way to combine the original (imported) PDF with the new text layer generated by DTP’s OCR process? I don’t know if this is do-able – but that way I’d have the crisp text of the original and the readable text generated by DTP, as well!

Pie in the sky?

kmlawson · September 18, 2010, 10:14pm

I also wonder about this - I have a lot of 1.5MB scanned articles and want to add an OCR layer.

However, even if I simply want to leave the images completely alone - the result is always huge PDFs at least several times larger. Is this just an inefficient PDF engine in the OS that re-creates the PDF after OCRing it?

Is there any way to preserve the original size and add a text layer through the OCRing to prevent such a large increase in file size?

If I tweak the OCR settings to lower the quality, I just get much lower-quality images and a file that is still larger than the original.

tabuny · December 31, 2010, 2:24am

I posted on this issue of PDF file size soon after DTPO was first introduced, so I am glad to see that other people are raising it. I have given up on using DEVONthink Pro Office for OCR to create searchable PDFs, even though this is one of the key features of the product. The file sizes it produces are just too big. For example, here are some test scans to PDF that I did with my Fujitsu ScanSnap S1500M:

Black & White Scan, 400 dpi:
Academic paper, 24 pages total. 16 pages paragraph text, 8 pages tables and graphs.

Raw Scan, no OCR: 4.5 MB
OCR’d via “Convert to Searchable PDF” in the Fujitsu File Option Tab: 4.6 MB
OCR’d via Adobe Acrobat 9.x: 4.7 MB
OCR’d via Fujitsu’s “Scan to Searchable PDF” Application Option: 19.0 MB
OCR’d by DEVONthink Pro Office: 22.0 MB

Color Scan, 200 dpi:
Full color magazine article, 9 pages total. 6 pages primarily color photos, 3 pages primarily text.

Raw Scan, no OCR: 3.5 MB
OCR’d via “Convert to Searchable PDF” in the Fujitsu File Option Tab: 3.6 MB
OCR’d via Adobe Acrobat 9.x: 3.6 MB
OCR’d via Fujitsu’s “Scan to Searchable PDF” Application Option: 6.9 MB
OCR’d by DEVONthink Pro Office: 8.9 MB

As you can see, DTPO produces by far the biggest files, and with black & white scans it is especially wasteful.
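For a quick comparison, the figures above can be expressed as growth factors relative to the raw scan. This is only a minimal sketch of that arithmetic; the dictionary and function names are made up for illustration, and the sizes are copied from the two test lists.

```python
# Sizes in MB, copied from the two test scans listed above.
BASELINE = "Raw scan, no OCR"

bw_400dpi = {
    BASELINE: 4.5,
    "Fujitsu File Option tab": 4.6,
    "Adobe Acrobat 9.x": 4.7,
    "Fujitsu Application Option": 19.0,
    "DEVONthink Pro Office": 22.0,
}

color_200dpi = {
    BASELINE: 3.5,
    "Fujitsu File Option tab": 3.6,
    "Adobe Acrobat 9.x": 3.6,
    "Fujitsu Application Option": 6.9,
    "DEVONthink Pro Office": 8.9,
}


def growth_factors(sizes):
    """Return each OCR result's size divided by the raw scan's size."""
    base = sizes[BASELINE]
    return {
        name: round(size / base, 2)
        for name, size in sizes.items()
        if name != BASELINE
    }


for label, sizes in (("B&W 400 dpi", bw_400dpi), ("Color 200 dpi", color_200dpi)):
    print(label)
    for name, factor in growth_factors(sizes).items():
        print(f"  {name}: {factor}x the raw scan")
```

On these numbers the DTPO output is roughly 4.9 times the raw black & white scan and roughly 2.5 times the raw color scan, while the Acrobat and Fujitsu File Option results stay within about 5% of the original.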

I am doing all my OCR via the Fujitsu software for original scans. For PDF’s already on my computer that I need to convert to Searchable PDF, I use Adobe Acrobat, because the file size stays small.

It’s a real shame that this feature is so poorly implemented in DTPO.

If anyone has suggestions or anyone from DEVON Technologies can explain the issue, I would appreciate it.

sjk · December 31, 2010, 4:38pm

With default Preferences > OCR settings Convert to Searchable PDF shrinks my Image Capture scans significantly more, e.g. 6.5MB original down to 330KB PDF+Text version.

Checking “Resolution: Same as scan” and leaving Quality at 75% produced an 800KB PDF+text version of a 6.5MB original. Haven’t done a 100% quality test, nor closely inspected for quality differences between all versions but the smaller ones look good/readable enough and preferably sized for emailing.

avatar · January 2, 2011, 1:16pm

I converted a PDF (non-OCRd imaged text) of 172 pages, 5.9Mb in size, via the “Convert to searchable PDF”. The OCR settings are, I think, still set at the default, being 150dpi, 75% quality, recognition “automatic”.
It certainly caused my MacBook Pro (2.53 GHz Intel Core 2 Duo) to heat up quite a bit while it was doing it.
The resultant OCRd file is 27.5Mb in size.
I’m amazed at the OCR quality. The OCRd file looks word-perfect, but, given that the OCR is probably not always 100% accurate, I’d have to go through the entire file to make sure.
However, I don’t know if the resultant file size is “too big” or not.

Bill_DeVille · January 2, 2011, 2:35pm

avatar, that’s not a bad file size for a 172-page document after OCR.

You will see variabilities in OCRed file size depending on the original. The more color or gray-scale images, the larger will be the file size. Unless images are important, I usually set for black & white scans. If there are very important images but only in one or two pages of a multipage document, I may do the main scan as black & white, then scan the image page(s) in full color as a separate “run”, then insert those pages later into the searchable PDF.

The routines built into OS X for rasterizing a page are not as efficient as those in Adobe’s Acrobat Pro application, so files are often larger when OCRed using ABBYY in DT Pro Office. But I think the accuracy of the ABBYY OCR is better than Acrobat’s.

avatar · January 2, 2011, 4:44pm

Thanks for the image tip. The accuracy is indeed amazing!

Paul_G · June 9, 2011, 3:21am

I’m getting some very strange file size issues when I convert a pdf to searchable.
I have a 9 page B&W pdf. The initial file size is 1.4Mb.
I convert to searchable, and the resultant file size is 25.1Mb!
How on earth can it be ending up as 17 times larger?

My OCR settings are 100% quality, resolution same as scan - which to me implies the graphic part should be just the same as in the original. That leaves the text ‘overlay’, and I can’t see how that can be so enormous! I could believe it ending up double the size, but this is ridiculous.

It’s not just this file - this is just an example.

This is really screwing things up for me - I’m trying to shift to using DTPO to organise my files, but this sort of massive file size inflation makes it impractical.

Can anyone tell me how to get DTPO to stop bloating my files?

Bill_DeVille · June 9, 2011, 4:21am

Sure. Do NOT check the option to keep the original scan resolution, which will produce huge files.

I usually scan with Preferences > OCR set for 150 dpi and 50% image quality. The result will be searchable PDFs that are somewhat larger than the original scanner output image. For items such as receipts, invoices, etc. where FAX quality is acceptable, I set the dpi to 96, which results in searchable PDFs that are significantly smaller than the original scanner output image.

Paul_G · June 9, 2011, 4:43am

Thanks for that. It seems surprising though - I must be misunderstanding something. I thought that keeping original scan resolution would result in the same sort of file size - two files, same image, same file format at the same resolution should be the same size, right? And then add on a text layer, which shouldn’t be much more?

It’s almost as if the original pdf is being converted to a massive graphic - it’s not, is it?

And if it is, is there a way to keep the ‘image’ layer as per the original? I have other (similar) files that have been sent to me as already searchable pdfs, and they are under 1Mb, which would imply that they aren’t using a massive image file underneath - if they can do it, why not the searchable pdfs created by DTPO?

jorgeg · June 9, 2011, 7:12am

There’s something wrong with converting an existing PDF into a searchable PDF. Even if you set the “Same as scan” and 100% quality settings on the OCR preferences tab, the resulting file shouldn’t end up being that many times bigger, there’s absolutely no reason for that.

So, I made a few experiments and came up with the conclusion that there’s a problem with the OCR engine when converting an existing PDF. The test is very simple and you can try it too.

For example, I scanned a page with nothing but text on it. I scanned it using the standard OSX “Image Capture.app” and saved it as a .jpeg file. It is a 400dpi scan and it weighs 2.2 MB.

You can try the same steps with a similar image.

Import the image into DEVONthink.
Go into DT preferences and set OCR resolution to 200 dpi and 75% quality setting (or something similar).
Select the imported .jpeg file and select the “Data -> Convert -> to searchable PDF” action

The OCR process took only 10 seconds on my computer, and the resulting searchable PDF weighed in at 568 KB. That’s a lot smaller than the 2.2 MB of the original image file. The quality of the generated searchable PDF is quite good too.

Now, here’s the second part of the experiment. We want to make a non-searchable PDF of the original image file, and then try converting that PDF to a searchable PDF and compare with the previous experiment. The result should be the same, or very close, right?

Select the imported image file (the one that weighs 2.2MB) and make a duplicate; select “Data -> Duplicate”.
Now we have two .jpeg images, each 2.2 MB in size.
Select both images and select “Data -> Merge”
DT will create a normal non-searchable PDF containing both images. As expected the PDF weighs 4.4 MB.
Select that PDF and make sure you can see the sidebar with the two thumbnails of the two pages.
Select one page within the sidebar and right click to select the “Delete Selected Page” action.

Now you have a PDF with just one page with the original image. As expected the file size goes down to 2.2 MB, the same as the original image. Now it’s time to OCR this file.

With this new PDF selected launch the “Data -> Convert -> to searchable PDF” action again.
… and what do you think? The OCR process took 6 minutes! I clocked it. Compare that to the 10 seconds it took to OCR the .jpeg file.
When the OCR finished, the resulting file weighed 5.2 MB. Again, compare that to the 568 KB of the conversion from the .jpeg file.

I’ve tried this a few times with different files and with PDFs that I got from other places. The OCR process always takes a lot of time to convert non-searchable PDFs into searchable ones, and the file sizes always increase at least two-fold.

This hasn’t happened to me when converting .jpeg images; in those cases the OCR process is very quick and the resulting searchable PDFs are a lot smaller and of good quality.

There’s definitely something wrong with OCRing existing PDFs, but I don’t know if it’s a problem with the ABBYY engine or OSX’s Quartz engine.
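The gap between the two runs above is easy to quantify. The sketch below only restates the reported figures (10 seconds and 568 KB for the .jpeg, 6 minutes and 5.2 MB for the PDF); the names are illustrative, and it assumes 1 MB = 1000 KB.

```python
# Figures reported for the two runs above: same scanned page, same OCR settings.
jpeg_run = {"seconds": 10, "size_kb": 568}
pdf_run = {"seconds": 6 * 60, "size_kb": 5.2 * 1000}  # assumes 1 MB = 1000 KB


def slowdown_and_bloat(fast, slow):
    """Return (time factor, size factor) of the slow run relative to the fast one."""
    time_factor = slow["seconds"] / fast["seconds"]
    size_factor = slow["size_kb"] / fast["size_kb"]
    return round(time_factor, 1), round(size_factor, 1)


time_factor, size_factor = slowdown_and_bloat(jpeg_run, pdf_run)
print(f"PDF input took {time_factor}x as long and came out {size_factor}x as large")
```

In other words, going through a PDF wrapper made the same page take about 36 times as long to OCR and produced a file about 9 times as large.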

pacito · June 9, 2011, 8:26am

I’ve noticed this too: OCR taking a long time, and my new PDF file has sometimes been so large that DTPO couldn’t import it (not enough memory) - up to 1.5 GB!

Bill_DeVille · June 9, 2011, 2:06pm

jorgeg, If you check the box in Preferences > OCR to retain the resolution of the original scan, a new image will be rasterized from the original using Apple’s Quartz code, and it will be much larger than the original PDF output from a scanner.

The default settings in Preferences > OCR are 150 dpi and 50% image quality. These settings are a compromise between file size and view/print quality and I use them for most OCR conversions. For many documents such as receipts I use 96 dpi and 50% image quality, and the searchable PDFs are smaller than the original scanned image and are about the view/print quality of FAX.

There may be some confusion about the meaning of the 50% image quality setting. When the scanned image is analyzed and converted during OCR, images such as graphics and pictures are converted to JPEG images, which are set at 50% quality. If you are familiar with JPEG image processing, for many purposes the 50% quality setting is ‘good enough’ and considerably reduces file size.
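To put rough numbers on the resolution settings discussed above, here is a back-of-the-envelope sketch of how much raw pixel data a single re-rasterized page holds before any compression. The US Letter page size (8.5 x 11 inches) and the bit depths are assumptions for illustration only; nothing in this thread states which pixel format DTPO or Quartz actually uses.

```python
def raster_pixels(dpi, width_in=8.5, height_in=11.0):
    """Pixel count of one page rendered at the given resolution.

    The default page size (US Letter) is an assumption, not from the thread.
    """
    return round(width_in * dpi) * round(height_in * dpi)


def uncompressed_mb(dpi, bits_per_pixel):
    """Raw size of one page in MB (1 MB = 1,000,000 bytes), before compression."""
    return raster_pixels(dpi) * bits_per_pixel / 8 / 1_000_000


for dpi in (96, 150, 400):
    print(
        f"{dpi:>3} dpi: {raster_pixels(dpi):>10,} pixels | "
        f"1-bit B&W {uncompressed_mb(dpi, 1):6.2f} MB | "
        f"8-bit gray {uncompressed_mb(dpi, 8):6.2f} MB | "
        f"24-bit color {uncompressed_mb(dpi, 24):6.2f} MB"
    )
```

Pixel count grows with the square of the resolution, so a 400 dpi page carries about seven times the pixels of a 150 dpi page. If a 1-bit black & white scan were re-rendered as an 8-bit or 24-bit image at the same resolution, the encoder would have 8 to 24 times as much raw data to compress, which would be consistent with the size increases reported here.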

If you have PDFs that you would like to make smaller in file size, a utility such as PDFShrink can be used.

Of course, if you wish to incorporate images from a PDF in a printed publication and wish to have maximum quality in that image, keep the original scanner output for that purpose.

jorgeg · June 9, 2011, 7:29pm

So the problem is with how Quartz extracts the image from the PDF?

For the test I described above, I did not select the “Same as scan” option, and had a 200dpi and 75% quality setting.

I repeated the test with a setting of 150dpi and 50% quality. Now the conversion took even longer: 8 minutes vs the 6 minutes of the previous test. And the resulting file was still larger, only by 1 MB, but I still don’t think that’s acceptable.

And leaving aside the time it takes and the size of the resulting file, the other annoyance is that the resulting PDF renders very slowly and scrolling feels very jumpy and heavy. Even displaying the thumbnails for the pages takes a lot of time, nearly a second for each page.

Again, this only happens when converting PDFs into searchable PDFs. Converting normal images works super fast and the results are very small files with great quality (OCRed in 10 seconds and a 568KB file), and these PDFs render fast and you can scroll through the pages smoothly.

I don’t think this huge difference in performance should be acceptable behavior.

In my particular situation, this is very annoying since I deal more often with existing PDFs rather than scanning documents myself. And all these PDFs for which I don’t have the original images take a lot of time (hours!) to convert, and what’s worse is that their jumpy scrolling and slow rendering irritate me a lot.

Paul_G · June 10, 2011, 8:34am

Thanks Bill. That certainly explains what’s happening. But as a layperson, my first thought is “Why on earth would it do that?”. It seems ridiculous. There’s already an image in the original scan - I mean, that’s the point, it’s an image without any text, that’s what we’re trying to change. Why replace this low-size, clear image in that pdf with either a gigantic clear image or a low size degraded image? If that’s what’s needed for the OCR to work, fair enough - but even then, couldn’t you then discard the new image after OCR and merge the text with the original small & clear image instead?

It’s a pretty serious usability issue for me:

I need to share my documents, so they need to stay looking good, not degraded.
I need to email my documents, so they can’t be massive sizes.
I need to find my documents, so I need to OCR them.
I’d imagine lots of people have those same needs.

So - is there any way around it? Any way of scanning in documents and getting a searchable result that is not degraded, and file size only marginally larger than the original scan? (allowing for the fact that a text layer has to be added in - but text doesn’t take up much space). Could I do it with a different OCR? (I’d pay!) Or OCR outside of DTPO? Or is it just something that is universal in OSX OCRing? (For that matter, could I do it better in - gasp horror - Windows?)