Optimal Fujitsu ScanSnap S510M Settings for OCR

m021478 · October 5, 2008, 6:20am

Before I spend the next couple of weeks scanning a ridiculous number of documents using my Fujitsu ScanSnap S510M scanner, I wanted to make sure that I have my Fujitsu ScanSnap Manager Settings configured properly, as well as my DEVONthink Pro Office OCR Settings…

Can someone please have a look over my ScanSnap Manager & DT Preferences in the screenshots below, and get back to me to confirm that they look fine, or that they don’t…Thanks!

farm4.static.flickr.com/3250/291 … 189f_o.png

farm4.static.flickr.com/3164/291 … f343_o.png

farm4.static.flickr.com/3125/291 … 4a52_o.png

farm3.static.flickr.com/2258/291 … 3f1a_o.png

Let me know if you’ll require information about any of my other settings, or about my workflow in general, in order to be able to tell me accurately one way or another if my settings are configured properly…

Thanks!

Bill_DeVille · October 5, 2008, 6:05pm

Whether or not those settings are “optimal” depends in large part on your intended use of the PDFs. For most of my scans, they would not be optimal.

The choice of Best (Slow) together with automatic color detection in ScanSnap Manager’s settings will ensure that any pages that contain color will be scanned at 300 dpi, which is a good choice in that case – good OCR accuracy should result. Pages that do not contain color will be scanned at 600 dpi by ScanSnap.

Comment: Of course, the higher the scanning resolution the slower the scanning procedure. If color isn’t important to me (or isn’t present in the copy to be scanned) I’ll choose Better (Faster) and set for Black & White scanning. That will result in 300 dpi scan resolution. If the copy is clean and has high contrast, I’ll get acceptable OCR accuracy. If the copy material has small font and/or low contrast, I’ll kick scanning resolution up to Best (Slow) to improve OCR accuracy.

In other words, rather than a “one size fits all” strategy, I’ll adapt settings to the copy material.
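To make that rule of thumb concrete, here is a minimal Python sketch of the decision described above. The function name and its parameters are hypothetical, purely for illustration; ScanSnap Manager is configured through its settings dialog, not through code. The 600 dpi figure for Best (Slow) with Black & White is an assumption carried over from the statement that non-color pages scan at 600 dpi in Best mode.

```python
def pick_scan_settings(has_color, small_font=False, low_contrast=False):
    """Return (quality, color_mode, dpi) following the rules of thumb in this post.

    Hypothetical helper: the names are illustrative, not a ScanSnap API.
    """
    if has_color:
        # Color matters: Best (Slow) with automatic color detection,
        # which scans pages containing color at 300 dpi.
        return ("Best (Slow)", "Auto Color Detection", 300)
    if small_font or low_contrast:
        # Difficult black & white copy: raise the resolution for better OCR.
        # 600 dpi is assumed from the Best-mode figure for non-color pages.
        return ("Best (Slow)", "Black & White", 600)
    # Clean, high-contrast black & white copy: the faster setting is enough.
    return ("Better (Faster)", "Black & White", 300)


if __name__ == "__main__":
    print(pick_scan_settings(has_color=False))
    print(pick_scan_settings(has_color=False, small_font=True))
    print(pick_scan_settings(has_color=True))
```

The point of the sketch is only that the choice depends on the copy material, so a batch of mixed documents is better sorted into piles first than scanned with one setting.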

The default settings in ScanSnap Manager > Compression and in DTPO Preferences > OCR > Image Resolution & Image Quality represent compromises related to file size and to “acceptable” visual and printed quality.

Your modification of the Compression settings in ScanSnap Manager will only affect copy containing color, but in that case will greatly increase the file size of the temporary page files sent to DTPO by the scanner. That can use up free hard drive space at an amazing rate.

In the same way, your modifications of the default settings in DTPO Preferences > OCR > Image Resolution & Image Quality will result in much larger files than the default settings. During the OCR and page-by-page re-rasterizing of the PDF received from ScanSnap Manager, temporary files are produced. Once again, for OCR processing of a large PDF, a great deal of free hard drive space may be used temporarily.

I’ve seen user reports of running out of hard drive space in the process of scanning and OCRing a large document, even though there were quite a few gigabytes of free HD space before the procedure. The higher the resolution and image quality settings, the more likely this might happen.
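A rough back-of-the-envelope calculation shows why free space can vanish so quickly. The Python sketch below estimates the size of one page as an uncompressed raster image. It assumes US Letter paper (8.5 x 11 inches) and assumes the intermediate images are held uncompressed, which may not match what ScanSnap Manager or DTPO actually write to disk, so treat the numbers as an upper bound rather than a measurement.

```python
def raw_page_bytes(dpi, bits_per_pixel, width_in=8.5, height_in=11.0):
    """Size in bytes of one uncompressed page image (assumed US Letter)."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8


if __name__ == "__main__":
    color_300 = raw_page_bytes(300, 24)  # 24-bit color at 300 dpi
    bw_600 = raw_page_bytes(600, 1)      # 1-bit black & white at 600 dpi
    print(color_300)        # 25245000 bytes, about 25 MB per page
    print(bw_600)           # 4207500 bytes, about 4 MB per page
    print(100 * color_300)  # 2524500000 bytes, about 2.5 GB for 100 pages
```

Doubling the resolution quadruples the pixel count, so a 100-page color document processed at high quality can plausibly need several gigabytes of scratch space even though the finished PDF is far smaller.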

Comment: The default compression and image quality settings in ScanSnap Manager, in combination with the Best (Slow) setting, will result in a viewed or printed PDF that looks very satisfactory alongside the original copy, unless one wishes to send the output for professional printing, in which case the quality might be further tweaked, like your settings.

The default settings in DTPO Preferences > OCR > Image Quality & Resolution are a compromise between file size and view/print quality.

What are my objectives in scanning and OCRing material into my DTPO database? Primarily, they are to add the searchable textual content to the database; to have a readable image layer of the PDF, and if necessary to have a readable printout of a captured PDF. The default settings in DTPO are roughly equivalent to a fax of the original copy, good enough to read or print for most of my purposes.

Many of my scans are of correspondence, contracts, invoices, bills and the like – the inevitable detritus of everyday existence that results in piles of paper – that I need to keep. Scanning those into a database at the default settings lets me get rid of all that paper, and my database allows me to find anything I may need, and read it or print it clearly enough.

One of my continuing projects is scanning and OCRing some of my old books and papers that were written before the days of personal computers. Generally, the default settings produce clear readable copy from black & white scans. But I might not be satisfied with graphical material on some pages. Rather than do a complete rescan at higher resolution or image quality, I rescan those pages with higher quality DTPO preference settings and replace those pages in my OCRd PDF. That lets me keep reasonable file sizes.

In other words, feel free to adapt settings to the nature and use of the materials. Scanning and saving everything at high quality is usually overkill.

m021478 · October 6, 2008, 1:21am

My objectives in scanning and OCRing material into my DTPO database are to have a searchable (and if necessary re-printable) copy of everything I scan, because I intend to use my shredder on almost all scanned documents after they’ve been scanned and backed up.

In other words…my objective is to create what can be considered an archival copy of any and all documents I scan…

Bill_DeVille · October 6, 2008, 3:51am

So the issue boils down to what is meant by “archival copy”.

Scholars working on scans of the Dead Sea Scrolls need extremely high-resolution scans, with huge file sizes. Tiny details must be clearly visible at high magnifications of the image. Why? Because there have been arguments about seemingly minute variations in the shapes of characters, some of which may be incomplete. My laptop drive could hold few such scans.

But most of the documents I scan need only meet the requirement that I can easily read them whether onscreen or when printed. Subjected to high magnification, they would look blurry. But I don’t worry about possible forensic examinations of the characters in a scan of a letter that was produced on a typewriter. All I need to worry about is whether or not the letter will be readable in my database. A letter scanned in black and white at 300 dpi, and saved into my database with the default OCR settings of DTPO, serves my needs. I have a searchable, readable and printable archive of that letter.

Once in a while, in order to handle images with low contrast or small-font text, I may increase the resolution and/or image quality settings in DTPO. That will balloon the resulting file size, sometimes amazingly so. To avoid that as much as possible, one can separately scan and save one or just a few pages at higher quality, rather than all pages, then insert the higher quality pages using Acrobat or Preview.

It’s all about compromise. The higher the resolution and image quality, the longer the scan will take. The higher the resolution and image quality in DTPO settings, the larger the file will be, and the fewer the PDFs I can fit on my laptop’s 200 GB hard drive. Depending on my setting choices, the same paper copy might result in a PDF of 500 KB or 42 MB.
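The trade-off in the post above can be put into numbers with a short Python sketch. It uses the 200 GB, 500 KB and 42 MB figures quoted there, and it assumes decimal units and that the whole drive is available for PDFs, which is of course optimistic.

```python
DRIVE_BYTES = 200 * 10**9  # 200 GB laptop drive, decimal units assumed


def pdfs_that_fit(pdf_bytes, drive_bytes=DRIVE_BYTES):
    """Number of PDFs of a given size that fit on the drive."""
    return drive_bytes // pdf_bytes


if __name__ == "__main__":
    small = pdfs_that_fit(500 * 10**3)  # 500 KB per PDF
    large = pdfs_that_fit(42 * 10**6)   # 42 MB per PDF
    print(small)           # 400000
    print(large)           # 4761
    print(small // large)  # 84
```

On those assumptions, the higher-quality settings cut the capacity of the same drive from roughly 400,000 documents to under 5,000.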

Cameron · November 21, 2008, 3:55am

I just bought a ScanSnap 300M and this thread answered all questions I had about proper settings. Thanks for the fantastic response, Bill! Great thread.

jptherrien · January 15, 2009, 10:02am

Perhaps, but I want this to be as “fire & forget” as possible, so as to have a workflow that is as fully automated as possible.

I have to scan a lot of paper documentation, which I need to keep/archive as well for legal purposes (aggg!!!), but I need to be able to reprint them with “decent” quality if needed (part of my BCP strategy), and I need the OCR to be reasonably accurate for the search engine to work correctly.

I’m using a S300M in “Best” mode and sometimes the automatic color detection doesn’t work correctly (color documents are not detected as such), so I have forced the “color mode” (or whatever it’s called) by default in ScanSnap Manager.