Quality issue of PDF after converting to 'searchable PDF'

Ian_Hocking · February 9, 2008, 6:05pm

Hi there

This is my first post to the forum, so I hope I get the etiquette right! Apologies if not.

Overall, great product, but I’m having one niggling issue:

When I import a scanned book page into DevonThink Pro Office - whether as JPEG, PDF or TIFF - DTPO happily creates a new database entry with a preview that shows a sharp, clear scan. Great. However, when I right click the database item and select ‘Convert to searchable PDF’, the resulting document is so heavily pixelated that it is unreadable.

I’ve tried pushing up the quality and DPI settings in OCR (in the Prefs) to 100%, but to no avail. The new doc is still unreadable. I can select text within the searchable document and paste this into a text editor - the text is perfectly correct, so I guess the invisible text layer added by the OCR engine is fine. But why is the PDF now heavily pixelated?

Confusingly, I scanned some pages of a book only a couple of days ago and didn’t have this issue: conversion produced a same-as-original clarity in the converted PDF, and the OCR text layer was fine. Can anyone help?

(I don’t know if it’s relevant, but when I try to perform an OCR on a PDF, DTPO stops responding and virtually locks up my computer - after I finally got Activity Monitor to run, I saw that the RDE process (the OCR engine) was hogging 1.5 gig of my RAM…until I murdered it.)

If anyone can help, I’d appreciate it…

Cheers
Ian

PS My system is 2GHz MacBook Pro running Leopard, everything patched and up-to-date.

annard · February 9, 2008, 9:21pm

This is a known issue with the current version of the OCR software. We have informed IRIS and await an update from them. As a workaround you can keep both the original scan and the OCRed PDF in the database so you could find the text and in the future convert them again.

Ian_Hocking · February 9, 2008, 9:35pm

Ok, thanks, Annard.

Ian_Hocking · February 11, 2008, 6:35pm

Me again.

I’ve just finished scanning in a book that is useful for a book I’m writing. As expected, the OCR process made a mess of it, and I’m finding your workaround (i.e. search the text using the find function) not really workable. For example, if I search for ‘Kamo’, I get a hit within the scanned document, but because the OCR’d doc is mush I can’t see any context, or indeed the info I’m looking for. I have to open up the un-OCR’d doc and try to coordinate the two. This is a bit of a struggle with easy, single-word searches, but ‘Kamo’ appears a lot in the book I want to search, and I need to hop around it - this just isn’t possible.

I know this is an issue with a subset of the app that you’re not responsible for, but can you think of any other workaround? Are there particular circumstances that cause the mush? The main reason for getting DTPO was the excellent OCR facility I experienced with the trial, and that didn’t mush my scans. Did I OCR in a different way the first time round?

Any tips would be greatly appreciated.

annard · February 11, 2008, 6:55pm

I think it has something to do with the later versions of Mac OS X or QuickTime, because those changed. Other than that I do not have a solution at the moment, I’m afraid. Have you tried scanning with a higher dpi? Or in colour? The next maintenance release will have an updated OCR engine, but that one still had some files (although fewer) that caused identical problems.

Ian_Hocking · February 11, 2008, 10:45pm

Thanks for another swift response, Annard. I’ll try some combinations and see what I come up with.

Ian_Hocking · February 18, 2008, 6:30pm

Just in case it’s useful for anyone, I’ve experimented and found that this combination works:

Import a colour TIFF, at least 300 dpi, at 3325 by 2423 resolution for an A4-sized scan

Basically, it looks as though the OCR engine doesn’t like either the overall image size (in megs) or the resolution to be above a certain threshold.
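
For reference, the pixel dimensions and raw image size implied by a scan resolution can be worked out with a few lines of Python. This is only an illustrative sketch: the function names are made up for this example, the A4 size (210 x 297 mm) is the standard one, and nothing here reflects how the OCR engine itself measures a page or where its threshold might lie.

```python
MM_PER_INCH = 25.4
A4_WIDTH_MM, A4_HEIGHT_MM = 210, 297  # standard A4 paper size


def a4_pixels(dpi):
    """Return (width, height) in pixels of a full A4 page scanned at `dpi`."""
    width = round(A4_WIDTH_MM / MM_PER_INCH * dpi)
    height = round(A4_HEIGHT_MM / MM_PER_INCH * dpi)
    return width, height


def raw_bytes(width_px, height_px, bytes_per_pixel=3):
    """Uncompressed size of a bitmap; 3 bytes per pixel assumes 24-bit colour."""
    return width_px * height_px * bytes_per_pixel


def implied_dpi(pixels, length_mm):
    """Resolution implied by `pixels` spread over `length_mm` of paper."""
    return pixels * MM_PER_INCH / length_mm


if __name__ == "__main__":
    w, h = a4_pixels(300)
    print("A4 at 300 dpi:", w, "x", h, "pixels")
    print("Raw 24-bit size in MB:", raw_bytes(w, h) / 1_000_000)
    print("dpi implied by 3325 px on the long edge:", implied_dpi(3325, A4_HEIGHT_MM))
```

By this arithmetic a full A4 page at 300 dpi is 2480 by 3508 pixels, so a 3325 by 2423 image would be a little under 300 dpi if it covered the whole page.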

tcrombez · February 23, 2008, 7:25pm

I don’t think the OCR engine dislikes high-resolution files. Using DTPO, I import a lot of scanned book pages (typically batched in PDFs of about 20 pages) at 600 dpi. These files can easily weigh 15 to 20 MB. The results are quite satisfying to me.

The only (minor) difference I see is that I use the “Import…” function to OCR the pages, not “Convert to searchable PDF”. But that shouldn’t really make a difference, or should it?

Ian_Hocking · February 23, 2008, 7:30pm

Thanks, tcrombez, I’ll try the import function too.

As an update, I’ve been having a lot of success using multi-page TIFF files. These aren’t small, but since I’ve just upgraded the HD in my MacBook Pro it’s OK for the time being.

I’m using the VueScan software to produce the multi-page TIFF, then dragging them into DevonThink. The OCR seems happy enough with them, and the resulting PDFs are reasonably sized.

mpm · February 27, 2008, 4:29am

Not really a question here, just recounting my own experience and hoping that IRIS can make improvements to the OCR engine for our benefit.

I just upgraded to the DTPro Office version a few days ago to explore the OCR. My target PDF files are chemistry patents in the pharmaceutical industry, some of which can be quite large, a couple hundred pages and around 30 MB in size.

I have looked through many of the old posts regarding the OCR process and the resolution of the imported image. After some testing I settled on 96 dpi to save on space. But seeing the size of my DT database after importing just 8 patents I began having serious doubts about the scalability of converting patent files in this way.

The ability to convert the images files to searchable text is really quite wonderful when it works. The “layering of the text on the original image”, if that’s correct to say, is also very helpful. I believe I can get text dumps of these patents, but they will not be optimally formatted or contain the in line tables or figures. I can now quickly locate keywords regarding biological activity that are buried amongst a large amount of compound enumerations.

After coming to grips with the length of time it takes for the OCR process to complete a job, I did fine with PDFs of 2 - 12 MB. It even worked for a 28 MB file, 212 pages with 628 KB of extracted text.

But today I have been unable to OCR a second 28 MB PDF of around 200 pages. I archived a lot of stuff onto an external disk to make space. Nevertheless, the OCR took 16 GB of free disk space down to 0 KB, as reported by the Finder. One potentially relevant item: I used Preview to rotate a few of the pages in the original patent and then saved it. This was to get some important figure labels in the correct left-right orientation for the benefit of the OCR engine. I don’t know if that introduced any problems for IRIS. I let the engine run for several hours, but eventually it just consumes all my disk space and I assume it will not be able to finish under those circumstances. I will have to try the original PDF without the page rotations to see if that is the problem.

Anyway, I hope that the memory consumption and speed issues with the OCR engine can be improved by IRIS.

Even with that, it seems that the forthcoming DTP with its new storage architecture will be essential for scalability.

mike

TPKeelan · March 15, 2008, 2:39am

I downloaded version 1.5.1 today hoping that my OCR woes would be cured, but was sadly disappointed. Converting PDFs to Searchable documents still renders the underlying PDF a pixelated mess.

BUT THEN, I looked in Preferences and saw that my OCR settings were using a dpi of 150. I was scanning at 400. When I changed the dpi in Preferences to 400 the Conversion worked fine, as did Import>Images (with OCR).

I can’t say that the new version fixed anything since I hadn’t fiddled with preferences in the prior version. But it works.
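
A rough way to see why the dpi setting matters so much for small print is to count how many pixels are left for each character. The sketch below is only an illustration: it assumes the page image is simply resampled to the dpi set in Preferences (which this thread does not confirm), and the function name is made up for the example.

```python
POINTS_PER_INCH = 72  # typographic points per inch


def glyph_height_px(font_size_pt, dpi):
    """Approximate height in pixels available to text of a given point size."""
    return font_size_pt * dpi / POINTS_PER_INCH


if __name__ == "__main__":
    for dpi in (150, 400):
        print("9 pt text at", dpi, "dpi:", glyph_height_px(9, dpi), "px tall")
```

Under that assumption, 9-point text gets well under half the pixel height at 150 dpi that it had in a 400 dpi scan.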

My prior work-around was to recognize the text directly in Readiris before importing them to DTP.

Terry

MattyG · April 16, 2008, 1:51pm

I know that the developers can only do so much about issues with the OCR engine, but is there any way to (I believe the correct terminology would be) add a text layer without touching the image layer of a PDF? Let’s say that I have downloaded an old article that is not searchable. I have tried many combinations of settings for image quality and resolution in DTPO (up to 400 dpi and 100% quality), and they all result in reduced image clarity and often increased file size. If I already have a PDF, there really should not be any reason for DTPO to touch the image layer unless I am trying to optimize file size. I realize that it uses a completely different engine, but in Acrobat, for example, I can OCR a PDF without having the program touch the actual image layer.

(if it would help, I could upload a screenshot comparing image quality of multiple settings with the original file)

annard · April 16, 2008, 3:37pm

The IRIS OCR engine that we use will always change the original image, so there is not much that we can do about it. For us it’s a black box: we give it an image of some sort and out comes a layered PDF file. What happens inside is black magic.

Acrobat does all the work except the OCR themselves (they seem to use a library from IRIS that does only a small part of what we want). We don’t want to be experts in (dis)assembling PDF files, so we are not going to do that.

Our approach to OCR is fire-and-forget. If you need more control you will need to use Acrobat or the full-blown ReadIRIS application where you can fine-tune the process completely.

N.B. The next maintenance release will contain another update to the OCR engine. IRIS claims that this update solves the originally reported problem in this thread. And I can confirm it with the test files some of you have sent to us.

aboon · November 20, 2009, 12:38pm

To date - November 2009 - still a serious issue. If you trash originals after import with OCR from an existing PDF, even with 600 dpi colour at 100% quality, you will have only one new original with dramatically reduced readability and increased file size compared to the source file. This is not a preview effect; exporting as files and reopening in Acrobat shows what happened with your source file.

So, you cannot rely on DT(O) alone as your single paperless scientific journals archive if you read on screen or with e-readers (see attached examples with different settings; import with OCR, and subsequent export).

Bill_DeVille · November 20, 2009, 9:06pm

There were issues discussed above that are not the same as the issue that I believe you are raising today.

The ABBYY OCR engine in DTPO2 produces better quality images after conversion at the DTPO2 defaults than did the IRIS OCR engine used in DTPO 1.x with equivalent dpi and image quality settings.

When scanner output or an image-only PDF is subjected to OCR, the original image is not retained. The image is recreated, and the creation of a new bitmap almost always results in a significant increase in size of the image layer — a larger file size. (I’d love to see Apple tackle that, one of these days.)

That’s why the default resolution and image quality settings in DTPO2 Preferences > OCR are 150 dpi and 50% image quality. This represents a compromise between view/print quality and the file size of the stored PDF after OCR. The view/print quality would improve at higher dpi and quality settings, but at the penalty of a rapid ballooning in file size. (Note: In the current public beta 7 and earlier DTPO2 releases, it can be counterproductive to scan at high resolution and also set the DTPO2 OCR preferences to a high resolution and image quality – the result may be unreadable PDFs, especially under Snow Leopard. This issue will go away in the next release.)
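
The ballooning can be illustrated with a little arithmetic: the number of pixels in the recreated bitmap grows with the square of the resolution. The sketch below covers only that raw pixel count; the final PDF size also depends on the compression and image quality setting, which are not modelled here, and the function name is made up for the example.

```python
def pixel_growth(base_dpi, new_dpi):
    """Factor by which the pixel count of a page grows when going
    from base_dpi to new_dpi (both dimensions scale linearly)."""
    return (new_dpi / base_dpi) ** 2


if __name__ == "__main__":
    for dpi in (150, 300, 600):
        print(dpi, "dpi has", pixel_growth(150, dpi), "times the pixels of 150 dpi")
```

So doubling the setting from 150 to 300 dpi quadruples the pixels to be stored, and 600 dpi means sixteen times as many, before compression.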

I do most of my scans using a ScanSnap scanner at the ‘Best’ setting, with automatic color recognition. The effective scan resolution is 600 dpi for black & white copy and 300 dpi for copy that contains color. We recommend 300 dpi (or a bit better) to allow the OCR software to accurately recognize text characters, especially those in small font sizes.

I find the resulting searchable PDFs in my databases easy to read, and with readable print output. If the original paper copy is ‘clean’ — without blemishes such as coffee stains or handwritten underlining or highlighting — OCR recognition is excellent.

Yes, there is some degradation of the view/print quality caused by the compromises to keep file sizes down to a reasonable level. But I don’t get complaints from others to whom I’ve sent searchable PDFs exported from my databases.

If, however, I planned to publish in a book or article images of PDFs resulting from scans, I would probably retain the original scan image for that purpose.

I also have to say that not all scanners are equal in scanned image quality, even nominally at the same scan resolution. As is the case with other gadgets such as digital cameras, printers or audio equipment, similar published specifications don’t necessarily indicate similar quality.

aboon · November 20, 2009, 11:31pm

Thank you for this clear and fast response. I tried one thing after your reply: convert to searchable PDF with Acrobat Pro, and then import into DTO as files and folders without OCR. The result is a DTO-searchable PDF with original-PDF readability, and a file size reduced (!) from ca. 400 to ca. 250 KB.

I am still wondering about the advantage of DT OCR over converting with Acrobat.

Arthur

MDAnderson · November 21, 2009, 5:20am

I’m afraid I don’t have any useful advice to add to this thread, except the observation that 90% of the time Devonthink does what I expect and want it to do but the remaining 10% of the time it produces very faded, grey, chunky and terrible PDFs.

When I end up with one of these terrible scans, which seem to have no consistent reason to end up that way, no amount of fiddling with larger file sizes, slower OCR or adjusting DPI inside devonthink has ever solved anything, it only makes the files much bigger and equally awful.

In those cases, 100% of the time for me so far, I have gotten much better clarity, smaller file size and no problems by using Acrobat Pro.

My low-tech solution has been: use devonthink to automate everything as much as possible and when it won’t work, do it manually with Acrobat Pro 9, which always seems to work.

Yes, Adobe is terrible bloated awfulness, but with version 9.2 they seem to have reached that magic build # where everything is working until they break it again a few updates from now.

aboon · November 23, 2009, 9:19pm

What do you think of:

OCR with Adobe for all files, then:
‘Files and Folders’ import to DT?

It doesn’t cost any extra time, it saves almost half of file sizes, and you keep the good screen readability.

Arthur

Bill_DeVille · November 23, 2009, 9:30pm

If that works for you, great!

I’ve got Acrobat Pro, and it does a reasonable OCR job.

But I find that I generally get better accuracy using DTPO/ABBYY OCR. I’m satisfied with the view/print quality of ABBYY searchable PDFs, and I’ve got lots of hard drive space, so I go for the convenience of direct scanner output to DTPO.