I’m glad you raised this issue, as it has led me to a very pleasant discovery: under OS X 10.6.1 I found that the problem of hard line endings at the end of each line of converted or clipped text derived from PDFs has gone away. In a number of tests with scanned PDFs, converted or clipped text viewed as RTF could be reflowed in width (no longer limited to the line width of the PDF) while maintaining paragraph flow. It was not necessary to remove extra Returns at the end of each line.
The downside was that a word at the end of a line occasionally ran together with the word at the beginning of the next line. Fortunately, the Spell Checker in TextEdit flags those words, and in most cases a double-click brings up a suggestion of properly separated words.
But I still remembered fighting those hard line endings only a few months ago, when I had extracted quotes from several PDFs. So I checked on another of my Macs that’s still running Leopard and, sure enough, text conversions or clippings from PDFs still had hard line endings.
Thank you, Apple! Now, could you look at that little problem of run-together words?
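For anyone still on Leopard who has to strip those extra Returns by hand, here is a minimal sketch of how the cleanup could be scripted in Python. It is not part of TextEdit or of any OCR application mentioned here. The function name `unwrap_hard_breaks` is made up for illustration, and the sketch assumes that paragraphs in the extracted text are separated by a blank line, which will not be true of every PDF.

```python
import re


def unwrap_hard_breaks(text: str) -> str:
    """Join hard-wrapped lines into paragraphs.

    Assumption: a blank line marks a paragraph boundary; every other
    line break is treated as a hard line ending left by the PDF.
    """
    # Normalise Windows and old Mac line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")

    # Split into paragraphs on one or more blank lines.
    paragraphs = re.split(r"\n\s*\n", text)

    unwrapped = []
    for para in paragraphs:
        # Trim each line and drop empty ones, then join with a single space
        # so the last word of one line never runs into the next line's first.
        lines = [line.strip() for line in para.split("\n")]
        joined = " ".join(line for line in lines if line)
        if joined:
            unwrapped.append(joined)

    # Keep one blank line between paragraphs.
    return "\n\n".join(unwrapped)


if __name__ == "__main__":
    sample = "This quote was clipped\nfrom a PDF and wraps\nat the page width.\n\nSecond paragraph\nstarts here."
    print(unwrap_hard_breaks(sample))
```

Joining with a space avoids the run-together words, but a word that was hyphenated across a line break would still come out with a stray hyphen and space, so the result is worth a pass with the Spell Checker.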
Alarik, like you I’ve been scanning some of my old publications, written years ago, into computer-readable format.
I’ve got several OCR applications, some of which let me output scanned images in various formats, including Word, RTF, Open Office, HTML and variations of PDF output. For example, ReadIRIS Pro 12 offers three variations on PDF output: text, text + image, and image + text. The first two of those PDF options make the text converted by OCR the frontmost layer, which is what you see. The last recreates the original scan image as ‘what you see’, plus an underlying searchable text layer (like DTPO2’s ABBYY PDF output).
The other options provide smaller final file sizes than a searchable PDF which displays a picture of the original document, and provide easier editing as well.
But I always end up choosing the searchable PDF that displays an image that’s faithful to the original document, even though it takes more storage space.
Why? Because I don’t want to see errors such as garbled characters when I’m reading the document. Some of my old documents have blemishes (a coffee stain, perhaps) or a handwritten annotation. These can cause character recognition errors. If I’ve got a document with errors and don’t have the original available, I may not be able to correct such errors with confidence. It’s rather like having a copy of a contract that I can’t be certain is really faithful to the original contract; I don’t like that.
So I pay a penalty in storage space to keep a faithful rendition of the original document in the image that I see. If I need to extract the text and then find OCR errors, I’ve always got the original as the basis for correcting such errors. In that way I save myself the need to manually edit a great many documents.