Accurately Scanning a Few Pages or Paragraphs in DTPO

takoma · January 14, 2007, 1:39am

My most common scanning need is to scan 1-10 paragraphs or 1-10 pages from a book or magazine/journal. After they are in DTPO, I use the search function to find the sections I want, and pull paragraphs out to use for writing and webpages – so for both searching and copying I want the content to be exactly what is in the book or journal/magazine.

I’m not finding an easy way to do this with the DTPO scanning function and a Canon Lide 70 scanner. When a page is scanned in, it gives me a PDF image of the page, but compared to a dedicated OCR program such as Omnipage, it is difficult to select several paragraphs in the middle of a multicolumn journal page, and difficult to keep readjusting the scan area to just give me the right hand page or the left hand page of a book. (Yes, facing book pages can be scanned together, but then, as best I can make out, it is not possible to select in DTPO a paragraph on just one page.)

Also, I haven’t yet found an easy way to proof the DTPO scan. I’ve discovered that if I select all or part of the PDF scan in DTPO and use the service DEVONthink Pro – Take Rich Note, I can create a file with the underlying text, which is good. However, I also find that the IRIS OCR engine frequently misidentifies a number of characters and words in the text, even when scanning from new books with uncomplicated layouts. Correcting them is much more painstaking than proofing in Omnipage.

I am also finding that if I work just with the DTPO scanned pages, searches often don’t work correctly if a word in the PDF has been hyphenated across a line break or if the phrase extends across a line break.

The conclusion I’m coming to is that DTPO scanning works well for large numbers of PDFs or single-column documents – I’m thinking of Bill DeVille and the 100,000 documents in his DTPO databases. The loss in accuracy is counterbalanced by the ease of bringing in a huge number of documents.

However, for people like me who periodically want an accurate rendering of a few paragraphs or pages from a book or journal, a better workflow might be to use the zoning and proofing functions in a dedicated OCR program like Omnipage, and then bring it over to DTPO as a rich text file.

I am still a novice with regard to both the Mac and DTPO, so I might well be missing one or more tools in DTPO which would make it easier to use the scanning function in DTPO for the work I am doing.

Comments and suggestions would be appreciated.

Thank you,

Mitchell

Bill_DeVille · January 14, 2007, 6:52am

Hi, Mitchell.

If you need to extract several paragraphs from a multi-column or 2-page scanned and OCR’d document, try launching it under Preview.

Preview makes it easy (after OCR) to capture a block of text from a single column in a multicolumn document, or a single page from a two-page scan. Preview allows you to ‘draw’ a text block and then copy it to the clipboard. To do that, choose the text tool, and hold down the Option key when selecting a block – such as a column – of text. Paste the clipboard into a rich text document. You will probably wish to add an extra Return at the end of each paragraph, then select the text and choose Services > Format > Reformat.
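For anyone who prefers to script the cleanup, here is a minimal Python sketch of the same idea: treat blank lines as paragraph breaks and join the hard-wrapped lines inside each paragraph. It is only a rough stand-in, not a description of what the Reformat service does internally, and the function name `reflow_paragraphs` is made up for illustration.

```python
import re


def reflow_paragraphs(text: str) -> str:
    """Join hard-wrapped lines within each paragraph.

    Assumption: paragraphs are separated by at least one blank line,
    which is why an extra Return is added after each paragraph first.
    """
    # Split on one or more blank lines to get the individual paragraphs.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    reflowed = []
    for paragraph in paragraphs:
        # Drop empty lines and surrounding spaces, then join with single spaces.
        lines = [line.strip() for line in paragraph.splitlines() if line.strip()]
        reflowed.append(" ".join(lines))
    return "\n\n".join(reflowed)


if __name__ == "__main__":
    pasted = "The quick\nbrown fox.\n\nSecond\nparagraph."
    print(reflow_paragraphs(pasted))
```

The pasted column comes back as two clean paragraphs, ready for spell-checking.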

Finally, you can use spell-check and correct OCR errors.

This will make you appreciate all the more publications that use a 1-column layout. It’s essentially equivalent in the amount of time and effort to using OmniPage. But the OCR with the IRIS engine is faster and more accurate than the OmniPage engine. If you only occasionally need to clip some text from a PDF, it will be much faster to simply OCR the pages in DTPO and then clip text as needed with the above procedure, rather than painstakingly do this for each column and page in OmniPage. (I’ve also got OmniPage, but haven’t used it for some time.)

Your comment about search problems in some PDFs – especially using the Phrase operator in DT Pro Office – when there are hyphens and line breaks holds true whether the scan has been made in DTPO or any other OCR software.
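To illustrate why hyphens and line breaks defeat a phrase search, and one possible workaround on converted text, here is a small Python sketch. It rejoins words split by a hyphen at a line end and collapses line breaks into spaces before matching. This is not how DTPO searches; the function names are hypothetical, and the rule will also remove the hyphen from a genuine compound (such as "multi-column") that happens to break at a line end.

```python
import re


def normalize_for_search(text: str) -> str:
    """Undo end-of-line hyphenation and collapse whitespace to single spaces."""
    # "para-\ngraphs" -> "paragraphs" (assumes the hyphen is a line-break hyphen)
    joined = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # Turn remaining line breaks and runs of spaces into single spaces.
    return re.sub(r"\s+", " ", joined).strip()


def phrase_found(phrase: str, text: str) -> bool:
    """Case-insensitive phrase match against the normalized text."""
    return normalize_for_search(phrase).lower() in normalize_for_search(text).lower()


if __name__ == "__main__":
    ocr_text = "accurate rendering of a few para-\ngraphs from a\njournal page"
    print("few paragraphs from a journal" in ocr_text)            # raw match fails
    print(phrase_found("few paragraphs from a journal", ocr_text))  # normalized match
```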

Remember that if the purpose of including the OCR’d PDF in your database is to be able to view or print the information, the image layer of the PDF will be trustworthy, much more so than an OCR conversion directly to RTF. If the original copy is clean, it’s highly probable that you will be able to find the PDF via searches or See Also. If you need occasionally to copy some text for a quote or footnote, that can be done using Preview, or by simply selecting Data > Convert > to Rich Text while viewing the PDF, then selecting and correcting, if necessary, the desired text.

But if you need absolute fidelity in the text layer prior to OCR, you may spend hours checking, correcting and proofing a multipage document. I find it much quicker to check documents such as court records after OCR in DTPO, by converting the resulting PDF to plain text and checking it for errors. In such a legal document one can then make notes about any errors in the Comment field of the Info panel. But in checking out one such example running to 150 legal-size pages, the only OCR errors were those on the first and last pages that were errors in reading the printed words on the clerk’s official stamp. (Although the OCR was done on a faxed copy of the original, the court record was easy to OCR, because it was entirely in upper-case Courier type.)

Remember that, to be strictly accurate, you should not correct typos that appear in the original copy. That court record mentioned above had several typos.

takoma · January 15, 2007, 8:42pm

Dear Bill,

Many thanks for the timely reply.

I’m realizing one of my problems may be the habits I’ve built up over many years of using text-orientated free-form databases. I’m just used to working with the text – highlighting, annotating, and adding comments. I know there are ways to do much of this with the PDF files, but I’m not fully comfortable with (or knowledgeable of) many of these features.

I like the Select - Option Key in Preview – that does make selecting one section from a multicolumn page possible. I’m finding it does the same thing right in DTPO – so if all one wants to do is select the paragraph, there is no need even to go to Preview.

Initially I was confused about your recommendation to “select the text and choose Services > Format > Reformat.” Then I figured out this was part of WordService from DEVONtechnologies. Once I downloaded and installed it, I could do as you suggest.

I think the larger issue is, as you suggest, what one wants to do with the content one has brought into the DT database. If one uses DT primarily to find the needle in the haystack using searches and See Also with image files, then it is probably not worth the effort to pull out and correct the underlying text.

However, if one has a smaller haystack and wants to extensively comment on the content, break down the content into multiple documents, and use sections in other applications, then a rich text file with the underlying text may be more useful than the image file, and therefore it may be worth the effort to create the rich text file.

As I work more with DTPO, I’m finding that deciding whether to use the DTPO scanning engine or a dedicated OCR program to create the rich text file depends a lot on the characteristics of the scanned material. Simple layouts, such as your court material – 150 pages, one column, in upper case Courier – flow really well in DTPO.

At the other extreme, when I used DTPO to scan in facing pages from the beginning of a book chapter with a small typeface, the result was a mess. Apparently IRIS OCR was confused that the first page began lower down than the second page – the paragraph ordering was jumbled and whole sections were gibberish. Using a dedicated OCR program I could catch the confusion as it occurred and force the paragraph ordering, if needed.

Perhaps a future version of DTPO will give the user additional options regarding how much control they want of the OCR process and whether a rich text file is wanted in addition to (or even instead of) the scanned image.

Thanks again for your help. DTPO provides so many tools and functions, it is easy for a newcomer to miss parts that would facilitate the job he or she is trying to get done. I’ve learned a lot from you and from experienced users who participate in the forums.

Comments and suggestions appreciated.

Mitchell

Bill_DeVille · January 15, 2007, 10:13pm

Hi, Mitchell. Thanks for telling me that the current version of PDFKit allows one to Option-drag to select a block of text in a multicolumn document page displayed in DT Pro. I hadn’t checked that out, as early versions of PDFKit didn’t have that feature.

I’m delighted to find that out.

Yes, you are correct that when OCRing a page with blocks of text, the order of blocks often differs in the text layer from their order in the image layer. So if you use Data > Convert to produce a text version of the PDF it will take some editing to put the text elements into proper order.

That’s not a problem if you are reading or printing the PDF. But it can be frustrating if you need to edit and use the converted text.
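As a purely hypothetical illustration of the reordering involved: if an OCR tool exposed each text block together with its position on the page (DTPO's Data > Convert does not, as far as this thread shows), a two-page or two-column scan could be put into reading order by sorting on column first and vertical position second. The `TextBlock` type, its fields, and `reading_order` are all assumptions invented for this Python sketch.

```python
from dataclasses import dataclass


@dataclass
class TextBlock:
    left: float  # distance from the left edge of the scan (hypothetical field)
    top: float   # distance from the top edge of the scan (hypothetical field)
    text: str


def reading_order(blocks: list[TextBlock], page_width: float) -> list[str]:
    """Return block texts left column first, then top to bottom within a column.

    Assumption: exactly two columns (or two facing pages) split at the midline.
    """
    midline = page_width / 2

    def sort_key(block: TextBlock) -> tuple[int, float]:
        column = 0 if block.left < midline else 1
        return (column, block.top)

    return [block.text for block in sorted(blocks, key=sort_key)]


if __name__ == "__main__":
    # The left page starts lower down than the right page, as at a chapter opening.
    blocks = [
        TextBlock(left=320, top=50, text="right page, first paragraph"),
        TextBlock(left=20, top=200, text="left page, second paragraph"),
        TextBlock(left=20, top=120, text="left page, first paragraph"),
    ]
    for line in reading_order(blocks, page_width=600):
        print(line)
```

Sorting by vertical position alone would put the right-hand page first here, which is the kind of jumbling described above.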

As far as correcting OCR errors, I still think it’s faster to view the PDF and converted text document pages side-by-side to correct OCR errors if I’m going to use the text as a quote in another document, although of course that doesn’t actually correct text errors in the PDF document itself.

The developers of Papyrus have hinted that a future version will allow one to read normal PDF documents. I’m hopeful (though still skeptical) that one could then directly correct errors in the text layer. We’ll see. I’m using Papyrus 12 because it has a hybrid PDF file type that lets me produce fully editable PDFs.