A lot of my documents inside DEVONthink (DT) are black-and-white (not gray or color) scans of text documents.
Info: The scans of my documents are PDFs containing 1-bit images.
Issue
When DT OCRs a PDF, the integrated ABBYY FineReader engine creates 8-bit JPEG 2000 encoded images inside the PDF. This results in large PDF files (since 1-bit to 8-bit is a huge difference) with reduced quality (more “blurry” looking) compared to the original PDF file.
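As a rough illustration of why the bit depth matters, here is the uncompressed pixel data for a single page at both depths. The page dimensions are an assumption (A4 at 600 dpi, about 4960 x 7016 pixels), and `raw_bytes` is just a hypothetical helper name. Real PDFs store compressed images, so actual file sizes are far smaller, but the 8x gap in raw data is what the encoder has to work against.

```shell
#!/usr/bin/env bash
# raw_bytes WIDTH_PX HEIGHT_PX BITS_PER_PIXEL -> uncompressed size in bytes
# (ignores row padding and any compression)
raw_bytes() {
    echo $(( $1 * $2 * $3 / 8 ))
}

# Assumed page: A4 at 600 dpi, about 4960 x 7016 pixels.
echo "1-bit: $(raw_bytes 4960 7016 1) bytes"   # 1-bit: 4349920 bytes
echo "8-bit: $(raw_bytes 4960 7016 8) bytes"   # 8-bit: 34799360 bytes
```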
Solution
ABBYY FineReader inside DT doesn’t have this issue when I convert a multipage TIFF image (instead of a PDF) into an OCR’ed PDF file. This produces an OCR’ed PDF containing 1-bit JBIG2 encoded images that are much smaller than the 8-bit JPEG 2000 images and also look better (scans stay “sharp” and don’t get “blurry”).
Using the following technique, I was able to reduce the size of, e.g., a 35-page black-and-white OCR’ed PDF file from 14 MB to 2 MB while maintaining better quality.
Here’s a quality comparison.
Top: my method described here – resulting in the small size PDF
Bottom: the default method resulting in the large PDF
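Besides comparing by eye, one way to check which bit depth and encoding a PDF actually contains is poppler's `pdfimages -list`. The sketch below summarizes its output. It assumes poppler is installed and that the listing uses the usual column layout (field 8 is `bpc`, field 9 is `enc`); `summarize_encodings` and `scan.pdf` are hypothetical names.

```shell
#!/usr/bin/env bash
# summarize_encodings: read `pdfimages -list` output on stdin and print
# how many images use each bit depth / encoding combination.
# Assumes the usual poppler column layout, where field 8 is "bpc"
# (bits per component) and field 9 is "enc" (encoding).
summarize_encodings() {
    # Skip the two header lines, then count each "bpc enc" pair.
    awk 'NR > 2 && NF >= 9 { count[$8 "-bit " $9]++ }
         END { for (k in count) print count[k], k }' | sort
}

# Usage (scan.pdf is a hypothetical file name):
#   pdfimages -list scan.pdf | summarize_encodings
```

For a PDF produced by the TIFF route I would expect every image to be reported as 1-bit `jbig2`, and for the default route as 8-bit `jp2`.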
So here’s how I do it:
- Create a scan resulting in a black and white PDF file (I typically scan with ScanSnap and create 600 dpi scans) → all my black and white scans are written into a folder called “bw”
- I’m using Hazel to watch this folder for incoming PDFs, convert them into multipage TIFFs using the free Ghostscript software via the command line, and finally delete the original PDF file

    # Here's my bash script
    query=$1
    resolution=600
    ghostscriptApp=/usr/local/bin/gs
    # Replace only the trailing .pdf with .tiff
    tiffName=${query%.pdf}.tiff
    "$ghostscriptApp" -o "$tiffName" -sDEVICE=tiffg4 -r"$resolution" "$query"

- The same folder contains a second rule that waits for TIFFs and sends them to DT (using a group’s UUID as the target), OCR’ing them on the way in

    -- Here's the AppleScript I use
    tell application "Finder"
        set _path to (the POSIX path of theFile as string)
        set {_name, _extension} to {name, name extension} of theFile
        set extension hidden of theFile to false
    end tell
    tell application id "DNtp"
        set theGroup to get record with uuid "ENTER HERE YOUR UUID"
        set theImport to ocr file theFile to theGroup
    end tell
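The conversion step above leaves deleting the original PDF to Hazel. If you would rather do it in the script itself, a minimal sketch could look like this, removing the PDF only when Ghostscript succeeded and the TIFF is non-empty. The function name `convert_and_remove` and the `GS_BIN` / `RESOLUTION` variables are my own assumptions; the Ghostscript options are the same ones used in the script above.

```shell
#!/usr/bin/env bash
# Sketch: convert one PDF to a Group 4 multipage TIFF, then delete the PDF
# only if the conversion clearly worked.
# GS_BIN and RESOLUTION are hypothetical names; their defaults mirror the script above.
GS_BIN="${GS_BIN:-/usr/local/bin/gs}"
RESOLUTION="${RESOLUTION:-600}"

convert_and_remove() {
    local pdf="$1"
    local tiff="${pdf%.pdf}.tiff"

    # Refuse anything not ending in .pdf, so the TIFF name can never
    # collide with the input name.
    case "$pdf" in
        *.pdf) ;;
        *) echo "skipping non-PDF: $pdf" >&2; return 1 ;;
    esac

    # Delete the source only when gs exits cleanly and the TIFF has content.
    if "$GS_BIN" -o "$tiff" -sDEVICE=tiffg4 -r"$RESOLUTION" "$pdf" && [ -s "$tiff" ]; then
        rm -- "$pdf"
    else
        echo "conversion failed, keeping: $pdf" >&2
        return 1
    fi
}

# Usage inside a Hazel shell script action, where $1 is the matched file:
#   convert_and_remove "$1"
```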
Additional info concerning my workflow
The target OCR folder is actually added to DT as an “external folder”, so it’s indexed rather than stored inside DT’s database.
So after DT has finished OCR’ing the PDF, the OCR’ed PDF is visible inside the “external folder”. From there I run my Hazel rules again to check the PDF contents, rename the files based on the OCR’ed info, and sort them into the final folders inside DT.
If you are interested, let me know and I will write more about it, but I guess there are already threads about it.