pb3 OCR generates black background

robballan · February 14, 2009, 4:11pm

Odd: the new ABBYY OCR engine replaces white backgrounds in PDF documents with black, causing the text blocks to appear disconnected. Can this be overcome?

Bill_DeVille · February 14, 2009, 4:27pm

What was the source of the image that was OCR’d? If the original image was imported into a DTPO2 database, what was its Kind (see the Info panel)?

I’ve never seen such a problem with any scans I’ve made, but several users have sent in PDFs that came from sources such as JSTOR or a historical newspaper company, and OCR’ing those produced PDFs with one or all pages rendered black. Annard has sent samples of PDFs (both the original image and the OCR’d result) to ABBYY for analysis.

JRPars · February 14, 2009, 6:36pm

What I’ve found is that if I OCR a file that is already PDF + text, the black background occurs. If the file is simply a PDF, OCR converts it without the black background.

mitchellm · February 14, 2009, 7:54pm

I’ve just run into the black background issue also. My situation occurs with a PDF (not PDF + Text) document downloaded from a financial institution. I wanted to convert to searchable. I get a resulting black background for all of these PDFs.

I don’t get the black background (so far) for other PDFs that I’ve converted via OCR, and I did several such conversions yesterday of image-only PDFs of research articles.

So in the end, I have no idea why today’s PDFs are different from yesterday’s, but clearly something is different. The only difference I’m noting is that today’s PDFs have some color in them, while yesterday’s were black and white only.

robballan · February 15, 2009, 4:42am

My original document is an archival NY Times web page (a PDF), which I saved to DTPO2 by dragging the link. I then loaded the page in DTPO2 and ran “capture PDF”, which creates a PDF+text file inside the database. And then I ran “convert to searchable PDF”. Result: black background areas.

If I examine the “captured” PDF file in Acrobat, it reveals that the file is composed of several separate image blocks (headline, body copy, NY Times logo). These images remain as in the original. But the border areas separating them turn black after OCR’ing.

KP1 · February 15, 2009, 3:58pm

Me, too. Any fix?

Here’s an image:

Bill_DeVille · February 16, 2009, 2:40pm

Annard has referred several such files to ABBYY and is waiting for a response.

annard · February 16, 2009, 7:23pm

I think I have a fix, but please send the original documents (not the converted ones!) to support@devon-technologies.com so I can check them.

ionos · February 16, 2009, 7:35pm

I attached a sample file that produces black results to bug #348206.

Best,
-i

robballan · February 26, 2009, 10:24pm

Looks like pb3r2 fixes it.