Problem extracting text from PDF

acl · November 20, 2008, 8:55pm

Hello, I use DTP to organise papers for work (physics). These tend to be in pdf format. I have the following problem: Some of these papers (almost all preprints I download from arxiv.org and some of the papers I download from online journals) confuse OS X (and DTP), so that the text is read with incorrect spacing. If there are more spaces than necessary, I don’t care as it does not affect my searching. But if spaces are missing, this renders searching completely useless (not to mention “See Also”!).

I have tried pdftotext and PDFKit and, while the precise spaces that get lost differ, the problem does not disappear. Also, I have checked with Acrobat, and it reads the files correctly. That is, if I open a pdf with Skim, Preview or DTP, copy some text and paste it into a text document, lots of words run together; if I repeat this with Acrobat, there is no such problem. (The same check works with “export to RTF” or whatever.) So clearly the problem is not with DTP itself.
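In case it helps to pin down which files are affected, here is a rough sketch of a check for run-together words. It assumes the text has already been extracted to a string (for example with pdftotext); the function names and the thresholds (18 characters, 5%) are arbitrary guesses for illustration, not anything provided by DTP or PDFKit.

```python
import re


def run_together_ratio(text, max_len=18):
    """Return the fraction of alphabetic tokens longer than max_len characters.

    Ordinary English words rarely exceed ~18 letters, so a high ratio hints
    that spaces were lost during extraction. max_len is an assumed cut-off.
    """
    tokens = re.findall(r"[A-Za-z]+", text)
    if not tokens:
        return 0.0
    long_tokens = [t for t in tokens if len(t) > max_len]
    return len(long_tokens) / len(tokens)


def looks_run_together(text, max_len=18, threshold=0.05):
    """Flag text whose share of over-long tokens exceeds threshold (assumed 5%)."""
    return run_together_ratio(text, max_len) > threshold
```

Files flagged this way would be the candidates for OCR; the rest can be left alone. The cut-offs will probably need tuning for papers with long technical terms.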

My current solution is to OCR the files with DTPO. This works, but it is slow and produces ridiculously large pdfs (even with image resolution 150dpi and quality 75%, which produces ugly results but readable text). It also loses colour (not important) and cross-references etc. (bad).

I have tried using acrobat pro to save as “optimized pdf” and making it compatible with earlier versions of acrobat (that’s what you can choose there), but it doesn’t help.

The best solution I’ve found so far is to use acrobat pro to export as images, then reimport and OCR that; this has the same disadvantages as doing OCR in DTPO, plus I don’t see how to automate it; however, the resulting files are much smaller.

So to finish this long-winded post, does anybody have any tips on how I could make OS X (hence DTP) read my text pdf files correctly without exporting to images and then doing OCR again?

Thanks in advance!