Correctly OCR'd PDF files imported strangely (s p a c e s)

I just ran into a strange DT problem. It’s been mentioned before, but I haven’t found a ‘proper’ answer yet.

In my quest to use DT for all my research papers, I first had to buy Adobe Acrobat to turn image PDFs into recognisable text, and to be able to comment on and edit those papers.

I was happy to discover the OCR process worked like a dream; I could do searches within Acrobat with no trouble whatsoever.

Not so much in DT, though. Once imported, I found that DT puts spaces inside words in seemingly arbitrary places. Obviously, searching for ‘attitude’ in a text doesn’t work if DT indexes it as ‘a t titu de’.
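To illustrate with a minimal sketch (the sample text is made up): a whole-word search can’t match a token that the index has broken apart.

```python
# Minimal sketch of why phantom spaces break search.
# The sample text mimics what DT extracts from the OCR'd PDF.
extracted = "the a t titu de of the respondents"

# A whole-word search for 'attitude' finds nothing...
print("attitude" in extracted.split())           # False

# ...even though the letters are all there once spaces are removed;
# but stripping every space also destroys real word boundaries,
# so it is no basis for an index.
print("attitude" in extracted.replace(" ", ""))  # True
```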

It’s been mentioned that changing the index settings ‘might’ help, and sometimes only slightly. The problem is that it doesn’t always work, and when it doesn’t, the failures seem completely arbitrary. I simply don’t have the time to manually test every paper to see whether it has been indexed correctly.

So, my question: How would I go about solving this issue? Because I have Acrobat, I can change almost anything about a PDF (barring certain security settings). Surely someone must have figured out what exactly prevents DT from indexing text in a PDF that works perfectly in Preview and Acrobat (Reader)?

There is a certain urgency to this problem, as I don’t want to be forced to re-import and edit all the papers that I’m about to import. Please help!

Hi, Hilko. The problem you have encountered is not uncommon. It isn’t caused by DTPO’s indexing of the text content, but by OCR errors in converting an image-only PDF to PDF+Text.

The reason is that no OCR software can read and convert the text in an image-only PDF as accurately as a human can. The overall accuracy of OCR software has improved dramatically over the last ten years or so, but I don’t expect OCR to attain the human-level ability to read and convert an image of text into computer-readable text within the next ten or twenty years, if ever.

If the scanned paper copy is “clean” (no marks or blemishes), was scanned at an adequate resolution for OCR (usually at least 300 dots per inch), the text is set in a “standard” font at 12 points or larger, there are no mixed fonts near each other, and there are no graphics closely juxtaposed to text, you should expect very good accuracy, with few if any errors.

I’ve scanned and OCR’d hundreds of pages from paper with no errors at all in the resulting PDF+Text. That’s the ideal case, where the original paper copy meets all of the above criteria.

But I’ve also scanned thousands of pages from paper whose conversions contain OCR errors, because the criteria listed above were not met in the original document.

In the vast majority of cases, transforming paper to searchable PDF+text in my database still makes the information content of those paper documents more accessible to me, even with some OCR errors. Overall, I’m delighted by the results.

I always send OCR’d material to my database as PDF+Text. That’s because the PDF+Text format is “self-validating”. Even an original paper copy in poor condition, one that leads to numerous OCR problems, can be viewed and printed (and interpreted by me), so it will still make sense to me. In such a case, though, I’ll probably add notes to the document’s Comment field to help DTPO find it for me (just as I do when I send handwritten notes to my database as PDF images). If I need to extract the text from that document using Data > Convert > plain or rich text, I can correct errors by looking at the “original” image layer in the corresponding PDF+Text document. Even on my MacBook Pro screen, I can place the text version side by side with the PDF+Text version as an aid to proofing and error correction.

I believe I’ve bought and tested every OCR application available on the Mac since the earliest ones (I’ve owned just about every version of Acrobat Pro, including version 7.x, but haven’t tested version 8 yet). In my own experience, the IRIS 11 engine used in DTPO is overall the fastest and most accurate I’ve used.

The training and pre-editing features in some OCR software have never seemed useful to me. I’ve tried them. In my experience, training on one document is likely to lead to even more errors in the next document I wish to OCR, because the next scan will probably contain different fonts, and so on. Pre-editing can be very time-consuming and (to me) irritating, so I don’t bother. It’s easier to edit a text conversion later on.

Bottom line: One must accept that while OCR can be very useful, the current state of the art is not error-free, for a number of reasons, and it will be a long time, if ever, before error-free OCR is attained.

Near the top of my wish list would be an application that would allow me to correct OCR errors in a PDF+Text document without changing the existing image layer. I’m not aware of such an application, but I hope someone is developing it. :slight_smile:

After some extensive searching and tinkering, I discovered that Acrobat is partly to blame: it often misreads gaps between letters as spaces.

Here’s the problem, though: Within Acrobat I can search for specific words with no trouble. When I open the exact same file in Preview or import it into DT, searching for these words does not work. Does Acrobat use some intelligent method to filter out ‘fake’ spaces within words? Surely there must be a way to save a PDF retaining these ‘fixes’?

Again, even if the Acrobat OCR is flawed, searching within the application itself somehow works fine. And yet, in Preview and DT it does not, because of these odd spaces.

Could you attach one of those PDFs in an email to Support? We’ll take a look at it.

I just sent a mail with the attached ‘example’ PDF.

If this problem cannot be solved, would a possible ‘hack’ be to export the PDF from Acrobat to ‘text’ (which does result in text with no ‘phantom spaces’) and then put that text in the Comments box of the PDF in question? Are there any drawbacks to doing this?

Sorry for the double post, but after trying this, here are my findings:

For searches, this solution works, since DT searches the Comment field. However, for the See Also and Classify functions it does not.

A possible workaround is to put the plain text file in the same group as the original ‘garbled’ PDF and (preferably) give it the same name. This way, for any PDF that tends to put spaces inside words, or (as I’ve also found) PDFs that glue words together, one could use the ‘parallel’ plain text file to make the See Also and Classify functions work accurately (a scripted version of this is sketched below).

This does feel like an inelegant hack, though…

Edit: I just tested it, and the results are very good. Two papers on a similar subject that don’t rank closely in ‘See Also’ in their original PDF form appear right below each other when the ‘parallel’ plain text versions are used.
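For anyone who wants to automate the ‘parallel file’ part, here is a minimal sketch, assuming Python with the pypdf library (both my choice, not anything DT ships). One caveat: a generic extractor may reproduce the very phantom spaces we’re trying to avoid, so text exported from Acrobat itself is the safer source; the sketch mainly shows the same-name, same-folder convention.

```python
from pathlib import Path
from pypdf import PdfReader  # assumed library; any text extractor would do

folder = Path("~/Papers").expanduser()  # hypothetical folder of PDFs

for pdf_path in folder.glob("*.pdf"):
    reader = PdfReader(pdf_path)
    # Concatenate the text layer of every page.
    text = "\n".join(page.extract_text() or "" for page in reader.pages)
    # Caveat: like PDFKit, this may reproduce the phantom spaces;
    # if so, export the text from Acrobat instead and save it here.
    pdf_path.with_suffix(".txt").write_text(text, encoding="utf-8")
```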

Just as a followup.

Hilko’s idea of ‘attaching’ a good (or edited) text capture to the database, in cases where there are substantial errors in the text layer of a PDF, is a good one. For example, give the text file the same name as the PDF, perhaps appending an additional character such as ‘1’ to the text file’s name.

I found that Hilko is using Acrobat 8 to OCR PDFs. There are changes in Acrobat 8 that seem to have caused compatibility problems when those PDFs are read as PDF+Text using Apple’s PDFKit: hundreds of extra spaces are added between letters in a document.

Anyway, when I ran OCR using DTPO on a sample PDF provided by Hilko, DTPO produced fewer actual OCR errors than did Acrobat 8. :slight_smile:

I do hope you’ll follow up this ‘plug’ for DTPO with some help on my other findings :wink:. Considering that many of the documents that have problems are actually OLD research papers, and the fact that I had these problems before installing Acrobat 8, I think my meddling with OCR only exposed the problem and its possible source (for me, anyway), rather than caused it.

Perhaps others could check this out as well? I find that many research papers show this problem, particularly the older ones in the human sciences, and especially when the spacing between words is already a bit ‘vague’. You might not have noticed, since these papers, if they’re OCR’d, allow you to select text and often correctly index a number of words. You can see whether the problem exists by clicking on the Keywords button (and seeing either strung-together words or half-words), by converting to plain text, or by using the Concordance to find odd words.
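If you’d rather not spot-check every file by eye, a rough automated check is possible. Here is a sketch (again assuming Python with pypdf; the 40% threshold is a guess): it flags PDFs whose text layer has an unusually high share of one- and two-letter tokens, the signature of phantom spaces.

```python
from pathlib import Path
from pypdf import PdfReader  # assumed library

def looks_garbled(pdf_path, threshold=0.4):
    """Flag PDFs whose text layer is mostly 1-2 letter fragments."""
    reader = PdfReader(pdf_path)
    text = " ".join(page.extract_text() or "" for page in reader.pages)
    tokens = [t for t in text.split() if t.isalpha()]
    if not tokens:
        return True  # no usable text layer at all
    short = sum(1 for t in tokens if len(t) <= 2)
    return short / len(tokens) > threshold  # threshold is a guess

for pdf in Path("~/Papers").expanduser().glob("*.pdf"):
    if looks_garbled(pdf):
        print("check manually:", pdf.name)
```

A mirror-image check on very long tokens would catch the glued-together words the same way.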

Edit: You know, you could just give me a free upgrade to DTPO and solve all my problems :wink:.

Hilko has uncovered some very real problems with PDFs in his database.

The first, extra spaces showing up when PDFKit or Preview examines the text of PDFs OCR’d using Acrobat 8, is certainly real (and confirms some of my fears about Acrobat 8 ). Such PDFs are searchable by Acrobat itself, but many words can’t be searched in DT Pro or in Preview because they see the words as broken up by spaces between letters. So we have a new kind of compatibility problem, and I hope Apple can modify PDFKit and Preview to get around it.

The second problem he discovered is that many of his older PDFs from social sciences sources contain ‘mashed together’ words that also cause search problems. (I ran OCR on one of those with DTPO and got very good results.)

We physical scientists sometimes tease social scientists, and I couldn’t resist telling Hilko that my old PDFs from reputable science journals have ‘clean’ text and search very well. (We really shouldn’t tease social scientists, I suppose, as their subject matter usually isn’t as simple and easy to address as the things we work with.)

In any case, the worst PDFs I’ve ever looked at came from mathematics journals. So is it OK to tease mathematicians?

As I mentioned in a previous post, I’m having this same problem. It is not viable, I believe, to use the hack mentioned above, even if I did have that kind of time.

I hope the good folks at DT can find a solution to this; in order for DT to remain usable, this issue must be addressed, one way or another. Acrobat is the standard, whether we like it or not. Acrobat is not going away any time soon, and most PDFs people receive will come from it. I know DT relies on what Apple gives it in the form of PDFKit; but if Apple doesn’t change PDFKit, it will still be up to DT to work, one way or another.

The “AI” features of DT such as “See Also” are the major draw for many users (though there are myriad “niceties” in dealing with documents and notes in DT that only deep usage unveils). However, the AI is rendered useless when this problem occurs; without it, DT basically becomes just another folder hierarchy. DT is a real “Pro” app in my opinion; but to be and remain one, it has to do what it says it does, regardless of the “pre-made” technologies it may have built on. The users of the software, people who effectively live in DT, need it to.

Hi, talazem. You are quite right in that users of DT Pro need to be able to effectively search PDF documents, which are currently among the most ‘universal’ of file formats.

Actually, there are a number of flavors of PDF: some exist because of format changes introduced by Adobe in different versions of their software, and some relate to the specific needs of the creators of PDFs, such as differences in color management for ‘pre-press’ files, and so on.

If you look at the PDF version of files downloadable from journals, they are usually created to an older version of the specification, so that potential compatibility problems are reduced. For example, the PDF versions of the articles in this week’s issue of Science are version 1.4, and many journals continue to use version 1.3 to distribute PDF documents.

PDFs produced by printing as PDF under Mac OS X are version 1.3 or 1.4.
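If you’re curious which version a given file is, it’s easy to check yourself: every PDF begins with a %PDF-1.x header. A minimal sketch (plain Python, no libraries; the file name is made up):

```python
def pdf_version(path):
    """Read the %PDF-1.x header from the first bytes of the file."""
    with open(path, "rb") as f:
        header = f.read(8)               # e.g. b'%PDF-1.4'
    if not header.startswith(b"%PDF-"):
        raise ValueError("not a PDF file")
    return header[5:].decode("ascii")    # e.g. '1.4'

print(pdf_version("paper.pdf"))  # hypothetical file name
```

(Acrobat can also override the header with a /Version key in the document catalog, so treat this as a first approximation.)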

There is a problem when PDFs OCR’d by Acrobat 8 are opened in Preview or in the PDFKit viewer in DT Pro: searching and indexing don’t work properly. I expect that Apple will react to this soon with changes to Preview and PDFKit. It would be a major development job for DEVONtechnologies to code a replacement for PDFKit.

For the moment, the vast majority of PDFs “out there” remain compatible with PDFKit and Preview. But the problems introduced by Acrobat 8 need to be addressed soon.

I’ve held off on upgrading to Acrobat 8 because I saw some early reports of problems, such as incompatibility with the ScanSnap Manager driver.

We may add an option to edit the “hidden” text layer of OCR’d PDFs in a future version. However, when you scan at 600 dpi or higher, the results should be satisfactory. With the Fujitsu ScanSnap, 300 dpi is not enough; you need 600 dpi. But that’s all up to the OCR engine, and we already use the best we can get on the Mac, that is, ReadIRIS 11.0.