OCR to plain text?

R_Barre · February 16, 2008, 3:45am

If I import a scanned document, is it possible for DT to save it as plain text, and lose the original document’s formatting? This would be helpful to me in dealing with documents whose text is formatted in columns or odd fonts.

Bill_DeVille · February 16, 2008, 4:13am

The OCR process will result in a PDF file.

Data > Convert > to Plain text can then be used to produce a plain text document.

Because OCR errors can happen, especially if the original copy contains unusual fonts or has blemishes, that PDF version may prove invaluable for correcting errors.

The Return that Adobe’s PDF format adds after each line of text can be irritating when you want an excerpt to wrap normally in a word processing document.
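
If those hard line breaks get in the way, they can be removed after the fact. Here is a minimal Python sketch, not a DEVONthink feature: it assumes that paragraphs in the converted text are separated by blank lines and that every other line break is just a wrap. The function name `unwrap_hard_breaks` is made up for illustration.

```python
import re


def unwrap_hard_breaks(text: str) -> str:
    """Join the lines inside each paragraph; keep blank lines as paragraph breaks.

    Assumption: a blank line marks a paragraph boundary, and any other
    line break is a hard wrap left over from the PDF.
    """
    # Normalize Windows and classic Mac line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")

    # Split on one or more blank (or whitespace-only) lines.
    paragraphs = re.split(r"\n\s*\n", text.strip())

    unwrapped = []
    for paragraph in paragraphs:
        # str.split() with no arguments splits on any run of whitespace,
        # so line breaks and repeated spaces collapse to single spaces.
        words = paragraph.split()
        if words:
            unwrapped.append(" ".join(words))

    return "\n\n".join(unwrapped)


sample = (
    "The OCR process\n"
    "will result in\n"
    "a PDF file.\n"
    "\n"
    "Second paragraph\n"
    "here.\n"
)
print(unwrap_hard_breaks(sample))
```

Text that has no blank lines between paragraphs would collapse into a single paragraph, so it is worth checking the result against the PDF.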

R_Barre · February 17, 2008, 11:29pm

Yeah, I just tried Convert to Plain Text and was appalled at how much correcting I’d have to do. So I’ll probably just stick with PDFs.

But if I want to place one section of a PDF in a group and another section of the PDF in another group, what’s the best way to split them up?

eboehnisch · February 24, 2008, 7:49pm

You can open the PDF externally in Preview or Acrobat via the contextual menu for splitting it up.

R_Barre · February 25, 2008, 9:40pm

“You can open the PDF externally in Preview or Acrobat via the contextual menu for splitting it up.”

Not exactly sure what this means, but I have yet to find a way to split a .pdf anywhere OTHER than at a pre-existing page break. If I want to create a new .pdf starting in the middle of page 1 and ending in the middle of page 2, I can’t seem to do it. I have to take all of pages 1 and 2.

Is there some way to split .pdfs OTHER than at pre-existing page breaks?

eboehnisch · February 27, 2008, 7:58am

As PDF is a page-oriented format, not even Adobe Acrobat, the mother of all PDF editors, can do this. Sorry.