Create small, high-quality black & white OCR'ed PDFs from scans

A lot of my documents inside DT are black and white (not gray or color) scans of text documents.

Info: The scans of my documents are PDFs containing 1-bit images.

Issue

When DT OCRs a PDF, the integrated ABBYY FineReader engine creates 8-bit JPEG2000-encoded images inside the PDF. This results in large PDF files (going from 1 bit to 8 bits per pixel is a huge difference) with reduced quality (more “blurry” looking) compared to the original PDF file.
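
To put the 1-bit versus 8-bit difference into numbers, here is a rough back-of-the-envelope sketch in bash. It assumes a US Letter page (8.5 x 11 inches) scanned at 600 dpi; the page size is my assumption for illustration, and `raw_page_bytes` is a made-up helper name. It only computes uncompressed raster sizes (ignoring row padding), so the real file sizes will be smaller and depend on how well JBIG2 or JPEG2000 compresses the page.

```shell
#!/bin/bash
# Uncompressed raster size of one scanned page at a given bit depth.
# Assumption: US Letter page (8.5 x 11 in); dimensions are given in tenths
# of an inch so the arithmetic stays in integers.

raw_page_bytes() {
    local dpi=$1 bits_per_pixel=$2
    local width_px=$(( 85 * dpi / 10 ))    # 8.5 in wide
    local height_px=$(( 110 * dpi / 10 ))  # 11 in tall
    # Total bits, rounded up to whole bytes
    echo $(( (width_px * height_px * bits_per_pixel + 7) / 8 ))
}

echo "1-bit: $(raw_page_bytes 600 1) bytes"
echo "8-bit: $(raw_page_bytes 600 8) bytes"
```

Under these assumptions a page is about 4.2 MB of raw data at 1 bit and about 33.7 MB at 8 bits, so the 8-bit codec starts with eight times as much data before any compression is applied.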

Solution

ABBYY FineReader inside DT doesn’t have this issue when I convert a multipage TIFF image (instead of a PDF) into an OCR’ed PDF file. This produces an OCR’ed PDF containing 1-bit JBIG2-encoded images that are much smaller than the 8-bit JPEG2000 images and also look better (scans stay “sharp” and don’t get “blurry”).

Using the following technique, I was able to reduce the file size of, e.g., a 35-page black & white OCR’ed PDF from 14 MB to 2 MB while getting better quality at the same time.

Here’s a quality comparison.
Top: my method described here – resulting in the small size PDF
Bottom: the default method resulting in the large PDF

So here’s how I do it

  1. Create a scan resulting in a black and white PDF file (I typically scan with ScanSnap and create 600dpi scans) → all my black and white scans are written into a folder called “bw”

  2. I use Hazel to watch this folder for incoming PDFs, convert each PDF into a multipage TIFF using the free Ghostscript software via the command line, and finally delete the original PDF file


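    #!/bin/bash
    # Here's my bash script (Hazel passes the matched PDF's path as $1)
    
    query="$1"
    
    # Render at the same resolution as the scan
    resolution=600
    
    ghostscriptApp=/usr/local/bin/gs
    # Swap the trailing .pdf for .tiff
    tiffName="${query%.pdf}.tiff"
    
    # tiffg4 writes a 1-bit, CCITT Group 4 compressed multipage TIFF
    "$ghostscriptApp" -o "$tiffName" -sDEVICE=tiffg4 -r"$resolution" "$query"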
    
  3. The same folder has a second Hazel rule that waits for TIFFs and sends them to DT for OCR (using a folder’s UUID as the target)


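    -- Here's the AppleScript I use (Hazel passes the matched file as theFile)
    
    tell application "Finder"
    	-- File details (not used further below)
    	set _path to (the POSIX path of theFile as string)
    	set {_name, _extension} to {name, name extension} of theFile
    	-- Make sure the file extension is visible
    	set extension hidden of theFile to false
    end tell
    
    tell application id "DNtp"
    	-- Look up the target group by its UUID, then OCR the TIFF into it
    	set theGroup to get record with uuid "ENTER HERE YOUR UUID"
    	set theImport to ocr file theFile to theGroup
    end tell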
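
If you don't use Hazel, the conversion in step 2 could also be run as a small loop over the scan folder. This is only a sketch: the function names `tiff_name_for` and `convert_folder` and the example folder path are made up for illustration, while the Ghostscript options are the same ones used in my script above. It leaves the original PDFs in place, so delete them yourself once you've checked the TIFFs.

```shell
#!/bin/bash
# Sketch: convert every PDF in a folder to a 1-bit multipage TIFF.
# tiff_name_for and convert_folder are hypothetical helper names.

# Derive the TIFF name by swapping a trailing .pdf for .tiff
tiff_name_for() {
    local pdf=$1
    echo "${pdf%.pdf}.tiff"
}

convert_folder() {
    local folder=$1 resolution=${2:-600}
    local gs_bin
    gs_bin=$(command -v gs) || { echo "Ghostscript (gs) not found" >&2; return 1; }

    local pdf
    for pdf in "$folder"/*.pdf; do
        [ -e "$pdf" ] || continue   # folder contains no PDFs
        "$gs_bin" -o "$(tiff_name_for "$pdf")" -sDEVICE=tiffg4 -r"$resolution" "$pdf"
    done
}

# Example call (the folder path is an assumption):
# convert_folder "$HOME/Scans/bw"
```

Stripping only a trailing `.pdf` keeps the name derivation safe for paths that happen to contain `.pdf` somewhere else, e.g. inside a folder name.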
    

Additional info concerning my workflow

The target OCR folder inside DT is actually added to DT as an “external folder”. So it’s not inside DT’s database. It’s indexed.

So after DT has finished the OCR, the OCR’ed PDF shows up inside the “external folder”. From there I run my Hazel rules again to check the PDF contents, rename the files based on the OCR’ed info, and sort them into the final folders inside DT.

If you are interested, let me know and I will write more about it, but I guess there are already threads about it.


Cool tip. Re the blurriness of the 8-bit version created by ABBYY’s software: That’s to be expected. They probably “invent” gray values at the border between black and white, which blurs these borders.
BTW: There’s a canned Automator action to convert PDFs to images which can be configured to produce TIFFs (not arguing against Ghostscript, of course). There’s also an Automator action to apply a Quartz filter to a PDF. That can be used to convert a color PDF to a b/w one. I have no idea if that makes it smaller though, and/or less blurry. I just stumbled upon these two.

In my opinion, the main reason for the blurriness is the change from 1-bit to 8-bit encoding. Each pixel can then take not just 2 states (black/white) but 256, so the JPEG2000 codec used for the 8-bit images can never be as precise as the JBIG2 codec is for 1-bit images.

The Automator action is not as good, since it doesn’t produce a single multipage TIFF but multiple single-page TIFFs ;) It also doesn’t produce 1-bit TIFFs. It produces 8-bit grayscale TIFFs.


Development would have to look into the possibility of preserving the bit depth on the OCR’d output.
@aedwards: I don’t know if this is possible or not.