"Combine into Single Document" & Bitmap images

hangaround · December 11, 2012, 7:09pm

After importing my scans with Acrobat for the last few months, I decided to give DT another try.

Thanks for making the “Combine into Single Document” checkbox work.
Unfortunately it is still unusable if you are scanning in bitmap mode (“Text”): Bitmap scans that are bundled into “Documents” get jpeg-compressed upon import (we all know that applying jpeg compression to bitmap images is one of the most stupid things a program could do …)

To reproduce it do the following:

Scan a document in bitmap mode (“Text”) as Tiff with “Combine into Single Document” checked.
Scan the second page of the document.
Send the whole “Document” to DT.

–> It gets OCRed and jpeg-compressed.

– Scanning pages as single pages and bundling them afterwards in ImageCapture into a “Document” yields the same result.
– Merging them after import in DT is not an option because this adds an overhead of about 300% (no idea why).

Single scanned pages that are sent to DT as single pages get compressed correctly (CCITT).

hangaround · January 9, 2013, 9:36am

Hmm, ~1 month passed and not a single reaction or statement …

Normally I don’t like to bump my own post.
Is the problem not reproducible? Or am I the only one scanning text documents?

Any answer or statement would be much appreciated.

Thank you

cgrunenberg · January 16, 2013, 2:00pm

The image conversion is currently necessary due to shortcomings/issues (and necessary workarounds) of the Abbyy engine but we’ll try to improve this in the future.

hangaround · January 16, 2013, 2:30pm

Thanks for the info.

As the current behavior renders text import via scanner virtually non-functional for anything but single-page docs, I’m happy to hear that you will try to “improve” this.

I hope it gets the appropriate priority on your todo list; after all it’s a feature I paid for with DT Pro Office – and it used to work somehow in the past (IIRC, long time ago …).

mfuggle · May 14, 2013, 12:10am

This issue is causing significant additional workload. I am trying to get to a paperless office, but scanning multi-page documents to PDF and checking ‘Combine into a Single Document’ doesn’t work. As a result I am having to combine documents manually as a second stage of the process. This is a significant failing and not what I thought I had paid for.

Cheers
Martin Fuggle

Bill_DeVille · May 14, 2013, 1:26am

Here is where I sing the virtues of the ScanSnap. ScanSnap Manager can be set to output a multipage paper document inserted into the document feeder as a multipage PDF.

The maximum capacity of the feeder is 50 pages. Suppose I want to scan a 200 page stack of paper as a single PDF? I check an option in ScanSnap Manager to ask whether the scan should be continued after it has run through the content of the feeder (in that mode, just insert another stack of paper and hit the Scan button). The effect is to allow me to keep adding batches of the 200-page stack until all have been scanned, then click the ‘Finish’ button to output the 200-page PDF.

Or suppose I’ve got a stack of 1-page forms to scan? I configure ScanSnap Manager to produce a PDF for each page, then drop a stack of 42 forms into the feeder and hit the Scan button. 42 1-page PDFs will result. (What if the forms are n-page duplex? No problem, there’s a configuration for that.)

So I avoid having to split or merge PDFs after processing.

hangaround · May 14, 2013, 1:53am

A ScanSnap is certainly a nice thing. If I had to scan stacks of 50 or 200 pages regularly, I certainly would have bought one already.

My “stacks” are mainly text (b/w) documents of 5, 10 or maybe 15 pages, hardly worth a ScanSnap. Nevertheless I would like to get each “stack” in 1 PDF and not with jpeg compressed text.

I think this is not asking for too much.

hangaround · August 7, 2013, 2:14pm

Since the post above, DT has gone through at least one update, and there is still no fix for the bitmap scanning workflow. This is rather disappointing, especially since a working integration of text scanning should be one of the more important features of DT (paperless office).

Currently I can work around this deficiency by using only OS X’s default Image Capture interface and DevonThink, w/o any third-party software.

Please see the following “manual workflow” as a proof of concept that a sound workflow is feasible on Mountain Lion, everything is already there, you just have to implement it properly into DT:

Manual workflow

Open Image Capture (the one from OS X, not DT).
In Image Capture set
– “Kind” to “Text”
– “Resolution” to “300 dpi” (or higher)
– “Format” to “TIFF”
– the checkmark “Combine into single document”
Scan two pages

We get an LZW-compressed two-page TIFF B (monochrome) file of about 160KB

In the Finder drag the TIFF file from your scan folder into DT’s inbox
Right-click the file in the inbox and choose “Convert to Searchable PDF” from the context menu

DT does its OCR job, and as the final file we get a two-page PDF; the images contained in the PDF are now CCITT-compressed monochrome bitmap images, file size is about 80KB

This is an acceptable outcome for scanned text pages.[^footnote 1]

Now let’s compare this with what happens when we use DT’s implemented workflow (substantial differences are marked in bold):

DevonThink’s workflow

In DT open Image Capture via the “Import from Scanner or Camera…” menu
Choose the same settings as above
Scan two pages

If we look in DT’s Scans folder we now see two one-page TIFF files; the format is LZW-compressed TIFF G (grayscale), file size about 140KB each (=280KB total)

In DT’s preferences, under “OCR”, set a JPEG quality of 60%
Go back to DT’s Image Capture window, select your scanned document in “Documents” in the left sidebar. The document contains the two TIFF pages.
In the bottom right corner click the “Send to” icon

DT does its OCR job, and as the final file we get a two-page PDF; the images contained in the PDF are now JPEG-compressed RGB color images, file size is about 1100KB
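
To illustrate why the pixel format matters before any compression is applied, here is a rough sketch of the uncompressed data per page at each bit depth. The page size is an assumption (A4 at 300 dpi, taken as 2480 x 3508 pixels); the post does not state the paper size, and row padding is ignored.

```python
# Assumed page dimensions: A4 at 300 dpi (not stated in the post).
WIDTH_PX = 2480
HEIGHT_PX = 3508

def raw_bytes(bits_per_pixel: int) -> int:
    """Uncompressed pixel data for one page, ignoring row padding."""
    return WIDTH_PX * HEIGHT_PX * bits_per_pixel // 8

formats = {
    "1-bit bilevel (what the Text setting should give)": 1,
    "8-bit grayscale (what DT's Scans folder holds)": 8,
    "24-bit RGB (what ends up in the PDF)": 24,
}

for label, bits in formats.items():
    size = raw_bytes(bits)
    factor = size / raw_bytes(1)
    print(f"{label:52s} {size / 1e6:6.2f} MB  ({factor:.0f}x)")
```

Under these assumptions a 1-bit page is about 1.1 MB raw, the grayscale version is 8 times that, and the RGB version 24 times, so the compressor starts with far more data for the same black-and-white page.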

The result of this comparison:

If we choose the first (“manual”) workflow (scan with OS X’s Image Capture --> drag in Finder to DT --> convert to searchable PDF) we get our two scanned pages in a correct 80KB losslessly compressed PDF file.
If we choose the implemented DT workflow we get our two scanned pages in a 1100KB PDF file with JPEG artifacts.[^footnote 2]
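
The figures quoted in the two workflows can be put side by side with a few lines of arithmetic. The sizes are the approximate values from this post; the labels and the helper name `ratio_to_baseline` are only illustrative.

```python
# Approximate file sizes in KB, as reported in the two workflows above.
sizes_kb = {
    "manual: two-page TIFF (1-bit, LZW)": 160,
    "manual: final PDF (CCITT)": 80,
    "DT: two one-page TIFFs (grayscale, LZW)": 280,
    "DT: final PDF (JPEG, quality 60%)": 1100,
}

# Use the manual workflow's final PDF as the reference size.
BASELINE_KB = sizes_kb["manual: final PDF (CCITT)"]

def ratio_to_baseline(kb: float) -> float:
    """How many times larger a file is than the 80KB CCITT PDF."""
    return kb / BASELINE_KB

for label, kb in sizes_kb.items():
    print(f"{label:42s} {kb:5d} KB  {ratio_to_baseline(kb):6.2f}x")
```

With these numbers the PDF from DT’s workflow is roughly 14 times the size of the one from the manual workflow, and it is lossy on top of that.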

The fatal fault seems to happen after scanning with DT’s Image Capture: the TIFF is saved as an 8-bit grayscale image instead of a 1-bit one (it’s as if the “Text” setting in the scan interface was ignored).
The second fault, the conversion to RGB JPEG, is probably a consequence of the first.
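
For anyone who wants to check the first fault on their own scans, the bit depth and compression of a TIFF can be read straight from its header. The following is a minimal sketch using only Python’s standard library. It reads the first page of a classic (non-BigTIFF) file and only the handful of tags needed here; the tag numbers and compression codes come from the TIFF 6.0 specification, while the function names are made up for this example. It says nothing about the images embedded in the final PDF.

```python
import struct
import sys

# Tag numbers and compression codes as defined in the TIFF 6.0 specification.
TAGS = {258: "bits_per_sample", 259: "compression", 277: "samples_per_pixel"}
COMPRESSION = {1: "none", 2: "CCITT RLE", 3: "CCITT Group 3",
               4: "CCITT Group 4", 5: "LZW", 6: "JPEG (old)", 7: "JPEG",
               32773: "PackBits"}
FIELD_TYPES = {3: (2, "H"), 4: (4, "I")}  # SHORT and LONG only


def read_first_page_tags(data: bytes) -> dict:
    """Return the tags listed in TAGS from the first IFD of a classic TIFF."""
    order = {b"II": "<", b"MM": ">"}.get(data[:2])
    if order is None:
        raise ValueError("not a TIFF file")
    magic, ifd_offset = struct.unpack(order + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF (BigTIFF is not handled)")
    (count,) = struct.unpack(order + "H", data[ifd_offset:ifd_offset + 2])
    result = {}
    for i in range(count):
        start = ifd_offset + 2 + 12 * i  # each IFD entry is 12 bytes
        tag, ftype, n = struct.unpack(order + "HHI", data[start:start + 8])
        if tag not in TAGS or ftype not in FIELD_TYPES:
            continue
        size, fmt = FIELD_TYPES[ftype]
        raw = data[start + 8:start + 12]
        if n * size > 4:  # values too large to fit inline are stored elsewhere
            (offset,) = struct.unpack(order + "I", raw)
            raw = data[offset:offset + n * size]
        result[TAGS[tag]] = struct.unpack(order + fmt * n, raw[:n * size])
    return result


def describe(tags: dict) -> str:
    """Summarise pixel format and compression; missing tags use TIFF defaults."""
    bits = tags.get("bits_per_sample", (1,))[0]
    samples = tags.get("samples_per_pixel", (1,))[0]
    code = tags.get("compression", (1,))[0]
    if samples == 1 and bits == 1:
        kind = "bilevel (1-bit)"
    elif samples == 1:
        kind = f"grayscale ({bits}-bit)"
    else:
        kind = f"{samples}-channel color"
    return f"{kind}, {COMPRESSION.get(code, 'other')}"


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], "rb") as handle:
        print(describe(read_first_page_tags(handle.read())))
```

Run against a file from the manual workflow, this should report a bilevel LZW image; run against a file from DT’s Scans folder, the observations above suggest it would report 8-bit grayscale instead.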

I hope it is a bit clearer now that I’m not asking for super-complicated additional features. It’s all doable by means that are already on board with OS X. It’s just a question of a thoughtful implementation.
Please correct me if I’m wrong.

[^footnote 1]:Please note that this is still suboptimal. With JBIG2 compression we could achieve a file size of 40–50KB without visible loss; but I won’t ask for this because it probably isn’t doable without third-party libraries.

[^footnote 2]:We could reduce the JPEG artifacts to get a near-lossless quality by setting JPEG quality in the preferences to 90%, but file size will then go up to 2MB or more.