Deleting carriage returns at the ends of lines from scanned

Alarik · October 17, 2009, 11:17pm

Hi,

I am trying to scan text pages and to run ocr upon them.

I have a new multifunction HP printer/scanner, and there are difficulties getting it to work correctly. Suffice it to say, I am able to scan multiple text pages and to open them in DEVONthink Pro Office. The images can be converted to searchable PDFs, and these, in turn, can be changed into RTF files. Great, so far.

However, pdfs appear to put carriage returns at the end of each line. This renders the converted file virtually unusable–I have way too many pages to convert to spend the time erasing thousands of carriage returns.

What on earth am I doing wrong?

The IRIS ocr software that came with the HP is quite opaque, and there is no easy/automatic way of scanning the whole page as a single text document, that is, without the carriage returns at the end of each line.

I am reluctant to spend a lot of money on an application till I know if I can make it work in the fashion I require.

I am running Leopard 10.5.8.

Image Capture, by the way, is unable to make use of the HP scanner component, I’m not sure why–it’s to do with the mysteries of Twain.

Thank you! Sorry to pester with a gnarly, no doubt familiar problem!

Alarik

Bill_DeVille · October 18, 2009, 12:32am

You are not doing anything wrong. Those hard line endings are inevitable.

If you check your PDFs by clipping text from them, you will find that every line of text has a hard ending. That’s the nature of the beast.

It’s a bit of a nuisance when a multiline quotation is clipped from a PDF, but the extra returns can be removed manually from a few lines, or with the aid of a Service (e.g., one to remove line endings provided by the free Word Service from DEVONtechnologies). Of course, if you are using a Service to remove line endings (perhaps from an RTF conversion of a long document), you may first need to insert an extra Return between paragraphs.
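For anyone who would rather script this than use a Service, here is a minimal Python sketch of the same idea. It assumes the converted text has been saved as plain text and that paragraphs are already separated by a blank line (the "extra Return" mentioned above). The function name `unwrap_hard_line_endings` is made up for illustration; it is not part of DEVONthink, Word Service, or any OCR package. Words hyphenated across a line break are not rejoined.

```python
import re


def unwrap_hard_line_endings(text: str) -> str:
    """Join the lines inside each paragraph; blank lines mark paragraph breaks."""
    # Normalise Windows (\r\n) and classic Mac (\r) line endings to \n.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # A paragraph boundary is a newline followed by at least one blank line.
    paragraphs = re.split(r"\n[ \t]*\n+", text)
    unwrapped = []
    for paragraph in paragraphs:
        # Drop empty lines and surrounding whitespace, then rejoin with spaces.
        lines = [line.strip() for line in paragraph.split("\n") if line.strip()]
        if lines:
            unwrapped.append(" ".join(lines))
    # Keep one blank line between paragraphs in the output.
    return "\n\n".join(unwrapped)


if __name__ == "__main__":
    sample = "The quick brown\nfox jumps over\nthe lazy dog.\n\nA second\nparagraph."
    print(unwrap_hard_line_endings(sample))
```

If the paragraphs are not separated by blank lines, everything collapses into a single paragraph, which is why the extra Return has to go in first.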

I’ve got thousands of pages of searchable PDFs, and I let them sit in my database with all those hard line endings. I can search them and read them; their information content is useful. Only when I need to extract a quotation and find that I need to strip line endings do I mutter about that.

Alarik · October 18, 2009, 3:59am

Bill,

You amaze me with your speed and ubiquity in the forum. I’m new to this place, but I see your comments everywhere.

I understand what you are saying. I think I need a dedicated ocr program that will do the job properly; a long time ago, I had TextBridge Pro running on os 8, and that worked fairly well.

My dilemma is that I wish to scan an old ms. of mine, many hundreds of pages. Going by way of pdfs is perhaps not worth the candle(s).

I wanted to try the OCR program whose engine is embedded in DEVONthink Pro Office, but they do not provide a trial.

Really, I thank you for your time!

All the best,

Alarik

Bill_DeVille · October 18, 2009, 5:35pm

I’m glad you raised this issue, as it has led me to a very pleasant discovery: under OS X 10.6.1 I found that the problem of hard line endings at the end of each line of converted or clipped text derived from PDFs has gone away. In a number of tests of scanned PDFs, converted or clipped text viewed as RTF could be reflowed in width (no longer limited to the line width of the PDF) and still maintained paragraph flow. It was not necessary to remove extra Returns at the end of each line.

The downside was that there was occasional run-together of a word at the end of a line with a word at the beginning of the next line. Fortunately, the Spell Checker in TextEdit flags those words, and in most cases double-click results in a suggestion of properly separated words.
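The same kind of check can be roughed out in a script. The sketch below is only an illustration, not how TextEdit's Spell Checker works: it assumes you supply your own set of known words (the `known_words` argument and the function name `split_run_together` are hypothetical), and for a token that is not in that set it lists every way of cutting it into two known words.

```python
def split_run_together(token, known_words):
    """Return (left, right) pairs where both halves of token are known words."""
    lowered = token.lower()
    # A token that is itself a known word is not a run-together candidate.
    if lowered in known_words:
        return []
    # Try every cut point and keep the cuts that yield two known words.
    return [
        (lowered[:i], lowered[i:])
        for i in range(1, len(lowered))
        if lowered[:i] in known_words and lowered[i:] in known_words
    ]


if __name__ == "__main__":
    # A tiny stand-in word list; a real one would come from a dictionary file.
    words = {"the", "cat", "sat", "them", "at"}
    print(split_run_together("thecat", words))
```

A token may have more than one plausible split, so the suggestions still need a human eye, much as with the Spell Checker.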

But I still remembered fighting those hard line endings only a few months ago, when I had extracted quotes from several PDFs. So, checking on another of my Macs that’s still running Leopard, sure enough, text conversions or clippings from PDFs still had hard line endings.

Thank you, Apple! Now, could you look at that little problem of run-together words?

Alarik, like you I’ve been scanning some of my old publications, written years ago, into computer-readable format.

I’ve got several OCR applications, some of which let me output scanned images in various formats, including Word, RTF, Open Office, HTML and variations of PDF output. For example, ReadIRIS Pro 12 offers three variations on PDF output: text, text + image, and image + text. The first two of those PDF options make the converted text read by OCR the frontmost image. The last recreates the original scan image as ‘what you see’, plus an underlying searchable text layer (like DTPO2’s ABBYY PDF output).

The other options provide smaller final file sizes than a searchable PDF which displays a picture of the original document, and provide easier editing as well.

But I always end up choosing the searchable PDF that displays an image that’s faithful to the original document, even though it takes more storage space.

Why? Because I don’t want to see errors such as garbled characters when I’m reading the document. Some of my old documents have blemishes (a coffee stain, perhaps) or a handwritten annotation. These can cause character recognition errors. If I’ve got a document with errors and don’t have the original available, I may not be able to correct such errors with confidence. It’s rather like having a copy of a contract that I can’t be certain is really faithful to the original contract; I don’t like that.

So I pay a penalty in storage space to keep a faithful rendition of the original document in the image that I see. If I need to extract the text and then find OCR errors, I’ve always got the original as the basis for correcting such errors. I save myself the need to manually edit a great many documents in that way.

Alarik · October 18, 2009, 6:34pm

Bill,

Your note arrived opportunely. I’ll be visiting the Apple Store next week (to have the not-so-super SuperDrive replaced on my MBP) and just maybe I’ll buy a copy of Leopard. I’m not keen to start in with a new system but, still, if that solves the problem, as you describe it, I just may take the plunge.

Thank you.

Good luck/have fun with your old publications and mss.

All the best,

Alarik

Bill_DeVille · October 18, 2009, 7:02pm

Just to be clear, I’m now running Snow Leopard, OS X 10.6.1. Leopard, OS X 10.5.x, still has those pesky hard line endings in text clipped or converted from PDFs.

Alarik · October 18, 2009, 8:04pm

I understand. I was only waiting for the first few iterations of Snow Leopard before jumping in.

I’m waiting to see if IRIS support can solve this problem; if not, I’ll up my ante and buy DEVONthink Office Pro and, along with Snow Leopard, go the way of turning images into pdfs and pdfs into my execrable old-fart prose. If I can get the big fancy DEVON along with an OCRer then I’d rather go for that than some wiggy optical eccentric character failing recognition program.

I’ve long owned DEVONthink 1.x but was never enough of a user to warrant the extra bump up in investment. However the capacity to use multiple databases and OCR make this attractive.

Bill, Thank you for the help and the good information.

Alarik