File size of scanned and searchable pdf docs

I upgraded to DTPO 2 a few days ago, after having used 1.5.x with pleasure for some years, and have an issue concerning the file size of scanned docs. I use a ScanSnap.

As an example, I did the following: I scanned two pages from a magazine. The file written to disk is about 884 KB. Then the OCR starts.
OCR preferences:

  • Convert to searchable PDF is switched on
  • Searchable PDF Set attributes is switched on
  • Image resolution is 150
  • Image quality is 50%
    The resulting file size in DTPO is 5.3 MB!

If I switch off the Set attributes option, the file size is only 699.2 KB. But in that case I can’t set the file name, etc.

The quality of the OCR in both cases is exactly the same (number of recognized or unique words).

So, what is happening when the Set attributes option is switched on?

I have also scanned the full magazine; in DTPO 1.5.x that used to produce a file of about 35 MB; it now becomes 280 MB!!

I’m sure you can understand that neither system speed nor hard-disk space benefits from this, and that I still want to be able to set file attributes when scanning and importing docs.

Please let me know what to do about this.

Henk-Jan

Thanks for noting this. I haven’t fully tested for different file sizes with and without Preferences > OCR - Set attributes checked. But with scans of two documents I did find that the size of the PDF+Text file was roughly 50% larger if Set attributes was checked.

Personally, I always do my ScanSnap scans with Set attributes unchecked.

Why? 1) Because if I’m scanning a series of documents, everything comes to a grinding halt when that Set attributes panel pops up, waiting for me to do something. If Set attributes is turned off, the queue moves right along, performing OCR on each scanned document without a pause. 2) Because it saves me typing time. Almost all of my scanned documents contain their title in text. I find it more convenient to Control-click on that selected title in the PDF+Text and choose the contextual menu option, Set Title As. I normally don’t add keywords. But if I were to do that, I would add them to the Comment field, and would be able to inspect the document to choose keywords, rather than have to remember or guess when that Set attributes panel pops up in the midst of a multi-document scanning session.

I’ll note to Annard the possibility of a file size difference related to the preference, although I didn’t see the amount of difference that you saw.

I seem to be having a similar issue. I just got a ScanSnap S300M, and although I’m impressed with the speed and size of the scanner, I’m a little concerned about file size.

Previously, I had been using an HP Flatbed/ADF scanner to PDF bills and other text documents. I used Acrobat Pro to perform the scan and OCR the file. A typical (mostly text) file using this setup would be approximately 40kB per page.

In contrast, using the ScanSnap and DTPO, I’m getting approximately 500kB per page on a similar black and white document. What’s strange is that if I scan the document into DTPO without OCR, it comes in at a reasonable size (still more than double Acrobat at 100kB), but running OCR in DT after the fact again results in a 5x increase in file size. I can’t figure out how adding plain text to a PDF via OCR could increase the file size so much. This is with Set Attributes unchecked.

I’ve also tried scanning the same document into Acrobat using the ScanSnap and the exact same settings. The text/image is noticeably crisper than the same document scanned into DT, plus the file size is only 60kB/page.

Am I doing something wrong in DT? I’ve tried adjusting all of the OCR settings, and the only way I’ve been able to approach Acrobat’s file size is by reducing the resolution so low that the document is barely readable.

Interesting. Prior to the 2.0beta release, I had found that DTPO made smaller OCR’ed PDFs than Acrobat.

I just did a comparison between 1.5.4 and 2.0beta. I had had “Set attributes” checked previously, but unchecked it for this test. The screen shot below shows the file sizes for the “un-OCR’ed” PDF, then for 1.5.4, and then for 2.0beta:

ocr-test.png

The 1.5.4 OCRs are smaller than what 2.0beta is currently producing. For comparison, I did two OCRs in Acrobat 8.0: “antliff-2007.pdf” was 17.8 MB and “delanty-05.pdf” was 9.3 MB, which made them larger than 1.5.4 but smaller than 2.0beta.

I’m no great expert with Acrobat, although I use it between my ScanSnap and DTPO, so someone else might be able to get smaller file sizes than I can.

HTH, Charles

First of all, I’m impressed with DT and have used it every day since version 2.0 came out.

However, I too am not happy with the current file size of the scanned PDFs. I can confirm the following values (all 300 dpi scans, a single page, black and white):

Device: Canon Pixma MP780, using the Canon MP Navigator SW:

Searchable PDF Size (PDF/OCR done by MP Navigator) : 68 KB

Scan via ExactScan (Compressed PDF/Low quality): 580 KB!!!
Optimized afterwards by Acrobat Professional 8.0: 44 KB!

It would be great if the internal DT scan/PDF creation routine could be improved. Thirteen times the file size that is actually necessary bloats my hard disk unnecessarily - hmm :question:

It would also be great if someone from DEVONtechnologies could give us users feedback on this topic.

Best regards,

Lemuba

Further, I have the files available here so the DEVON team can check the files and results for themselves; just let me know where to send them.

I must say that I have downgraded back to 1.5.4 in order to avoid both rescanning later on and the big files. I’ll wait for the final release.
Apart from this, DTPO is really fantastic!

HJ

To HJ.

I hope that the mods/DEVONtechnologies will participate in this discussion now and wake up :wink:
Several of us are running into similar file-size issues.
I will stay with version 2.0 for now, as I’m still getting started with DT and don’t have that many pages per day - however, this needs to be solved ASAP - hopefully :question:

What I believe to have found out so far: if you scan, as I do, via ExactScan (Compressed PDF, lowest quality), the displayed quality of the final PDF page image is still much too good. If I then take this file, including the recognized text, and compress/optimize it afterwards with Acrobat Professional, I can reduce the file size by a factor of roughly ten. That is not important for one page, but ten pages then make a difference of around 5 MB. Of course the displayed quality is rougher, but still good enough and readable.

So I believe that the creation of the final PDF needs an option for higher compression - or, in other words, a lower display quality.

Not forgetting that we are discussing a beta version, and that I’m more than thankful for this great piece of software!!!

Sorry for my bad English, but it is my first language :laughing:

:slight_smile: :slight_smile: :slight_smile: Patience! :slight_smile: :slight_smile: :slight_smile:

Hi all-

I pulled out my CD copy of ABBYY Finereader for Scansnap 3.0 Mac Edition, and wanted to see what it could do for me. I believe the most recent version is 4.0. No upgrade option on the ABBYY site.

Too bad it only works on Scansnap documents.

Here are some comparative OCRed PDFs. I don’t tend to fiddle a lot with settings to get better results:

strogatz.png
So DTPO 1.5.4 still produces the best OCR for me at this point.

Best, Charles

Charles, hold onto the image-only strogatz-1993 PDF and try it with DTPO2 pb3. :slight_smile:

Ok-

I just installed DTPO pb3 and ran the OCR against “strogatz-1993.pdf.” It came in at 3.8 MB, right where DTPO 1.5.4 put it as well.

I’m pretty happy!

Best, Charles

Yes!!!

File size and quality are perfect now.

Thanks!!!

I’ve experimented with OCR on 2 files.

What’s great:

  • accuracy: much better than DT version 1.x, and better than Acrobat Pro
  • speed
  • resource use (doesn’t hog machine resources compared with Acrobat Pro)
  • interface in DT

What’s not great:

  • file size: e.g., a 40 MB scan converts to 250-300 MB with OCR (even with Set attributes turned off)

I know others have mentioned this. So I tried a few files scanned in at about 30-40 megabytes (they are 200-300 page scans of books that I’m converting to electronic versions).

On the one hand: hard drive space is inexpensive. On the other: I would really like to see DT compete with Acrobat on file size. Everything else is great–but the file size is larger than I would like.

I’m also getting huge file sizes for PDF+Text files using DTP beta 3 and ABBYY.
An example: 14 page PDF no OCR = 1.5 MB;
PDF+OCR ABBYY = 3.4 MB (with poor resolution);
PDF+OCR Acrobat = 0.9 MB (with great resolution).

Hi all, I now have installed the DTP 2.0 pb3 with ABBYY engine. Of course very curious about what happens with the OCR processed files.

I used these settings, each time on the same file; each time the source scan is a two-page doc of 1.1 MB.

  1. Import into DTP 1.5.4, results in 1536 recognized words
    (at 160 dpi, 50% quality)

  2. import into DTP 2.0 pb3.0:
    2.1 import at 160 dpi, 50% quality. File size: 669 kB; # recognized words: 1352
    2.2 import at 180 / 50. File size: 766 kB ; recognized words: 1393
    2.3 import at 220 / 50. File size: 1020 kB; recognized words: 1380
    2.4 import at 180 / 70. File size: 971 kB; recognized words: 1384

Importing at 180/50 seems to work best. However, the number of recognized words is about 10% lower than with the IRIS OCR engine used in previous DTP versions. What has been fixed is the huge file sizes.
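A quick back-of-envelope check of these numbers (a sketch only; it assumes an A4 page, and the figures are illustrative, not measurements of DTPO): the pixel count of a page image grows with the square of the scan resolution, so file size should trend upward roughly quadratically with dpi, damped by however well the compressor handles the extra pixels.

```python
# Rough sketch: a page image's pixel count scales with the square of
# the scan resolution. Assumes an A4 page (8.27 x 11.69 inches);
# absolute numbers are illustrative only.

def pixels(dpi, width_in=8.27, height_in=11.69):
    """Total pixels for one page scanned at the given resolution."""
    return round(width_in * dpi) * round(height_in * dpi)

base = pixels(160)
for dpi in (160, 180, 220):
    ratio = pixels(dpi) / base
    print(f"{dpi} dpi: {pixels(dpi):,} px ({ratio:.2f}x the 160 dpi pixel count)")
```

The reported sizes (669, 766, 1020 kB) grow more slowly than the pixel counts (1.00x, 1.27x, 1.89x), which is what you would expect if the extra pixels compress well.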

Another thing is the time the OCR process now takes. I haven’t checked in detail (with a stopwatch) but the OCR process takes (much?) more time than IRIS did.

Conclusions:

  • the new version does not inflate the file size :smiley:
  • the new ABBYY engine does not recognize as much as IRIS did :frowning:
  • the new engine takes much more time :frowning:

Are these findings in line with other experiences?
Do you know why there are about 10% fewer recognized words (which is quite a lot, I think)?

Thnx,
HJ

In our labs we now have the same workflow as the original ABBYY code, which performs much better than IRIS (in our tests, of course; YMMV), and we hope to release it ASAP. The file size is also a lot smaller now (this was mentioned in the release notes).
As to the speed: since we have chosen to limit the memory use of the OCR process, more disk access is needed. We think this trade-off is worth it, since IRIS could completely stall a machine during OCR. Now your machine remains responsive during this process.
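The memory/disk trade-off described above can be sketched in a few lines of toy Python. This is purely illustrative and not DEVONthink’s actual implementation: a streaming pipeline reads one page at a time, so peak memory stays bounded at a single page regardless of document length, at the cost of one disk access per page, whereas an eager pipeline loads everything first.

```python
# Illustrative sketch (not DEVONthink's actual code) of the trade-off:
# eager OCR loads every page before processing; streaming OCR reads
# one page at a time, so peak memory is bounded by a single page
# at the cost of more disk access.

def ocr_eager(load_page, n_pages, ocr):
    pages = [load_page(i) for i in range(n_pages)]  # all pages resident at once
    peak = len(pages)
    return [ocr(p) for p in pages], peak

def ocr_streaming(load_page, n_pages, ocr):
    peak, out = 0, []
    for i in range(n_pages):
        page = load_page(i)      # one disk read per page
        peak = max(peak, 1)      # only this one page is resident
        out.append(ocr(page))
        del page                 # release before the next read
    return out, peak
```

Both produce identical results; only the peak number of resident pages differs, which is why the machine stays responsive in the streaming case.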

The real question is why the nice, crisp, losslessly compressed 1-bit (black and white) original image in the original scanned PDF has to be replaced with a lossily compressed 8-bit representation in the PDF+Text version. That’s where the big increase in file size comes from.

Maybe there’s a technical reason it has to be that way, but OCR’d PDFs would be so much better at every level if the resulting PDF could retain the original image.
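Some back-of-envelope arithmetic supports this point. The sketch below assumes a US Letter page at 300 dpi and a ballpark ~10:1 compression ratio for both CCITT G4 (lossless, on 1-bit text pages) and JPEG (lossy, on 8-bit grayscale); the ratios are assumptions for illustration, not measurements of any particular engine.

```python
# Back-of-envelope arithmetic: why swapping a 1-bit image for an
# 8-bit one inflates the PDF. US Letter page at 300 dpi assumed;
# the ~10:1 compression ratios are illustrative ballpark figures.

DPI = 300
W_PX, H_PX = int(8.5 * DPI), int(11 * DPI)  # 2550 x 3300 pixels

raw_1bit = W_PX * H_PX // 8   # 1 bit/pixel, packed into bytes
raw_8bit = W_PX * H_PX        # 8 bits/pixel grayscale

ccitt_g4 = raw_1bit / 10      # assumed ~10:1 lossless ratio on text
jpeg     = raw_8bit / 10      # assumed ~10:1 lossy ratio

print(f"1-bit raw: {raw_1bit/1024:.0f} KB -> ~{ccitt_g4/1024:.0f} KB after G4")
print(f"8-bit raw: {raw_8bit/1024:.0f} KB -> ~{jpeg/1024:.0f} KB after JPEG")
```

Even if both compressors achieved identical ratios, the 8-bit version starts out eight times larger, which lines up with the per-page sizes reported earlier in the thread (tens of KB per page for 1-bit scans versus several hundred KB after OCR).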