Best way to OCR PDF files?

WritingStudio · July 5, 2005, 8:26pm

I’ve recently upgraded to DTPro and need to get serious about learning this program. The primary use I plan for it is to keep and track research, which at the moment is 70% in PDF files that are images of old journal articles.

Does anyone have a good way to OCR (or equivalent) the PDF files so they can be searchable? The PDF2TXT feature doesn’t work for these files, unfortunately. I have Omnipage 11 on an old Wintel system, but anything: 1) more recent, 2) more accurate, and 3) for a Mac (w/ Tiger) would be great to hear about.

TIA

Bill_DeVille · July 5, 2005, 9:06pm

There are several OCR approaches for the Mac ranging from some fairly expensive pro/semipro scanner/software combinations to general purpose software for a range of scanners.

I’ve got several OCR packages, including OmniPage Pro X. The most recent in development seems to be ReadIRIS 9, which works pretty well for me. Accuracy is generally good enough that I don’t bother to edit, and saving as the PDF image with ‘hidden’ text lets me read the document accurately, while DT has a reasonably good copy of the text for search purposes. Small type is likely to be bollixed in the OCR translation, but of course I can read it onscreen in the original PDF image.

Most consumer scanners are fairly slow. Be sure to use one that can accept a FireWire or USB 2 connection, as that helps.

WritingStudio · July 5, 2005, 10:04pm

Hi, Bill,

Thanks for getting back to me.

My biggest problem is with PDFs that I’ve downloaded, not ones I’ve scanned myself. (I do have an Epson 3200 Perfection scanner w/ FireWire that I use for scanning stuff myself. It’s a sweet little machine.)

It seems that since I own Omnipage 11 I’m eligible for the upgrade price, which puts Omnipage Pro X and ReadIris Pro 9 in the same ballpark, dollar wise. I’ll do some closer investigating of those two since I need something that can handle as wide a tolerance range as possible for conversion.

The ‘hidden text’ with original PDF image is exactly what I’m hoping for. (I’m assuming here that DTPro can search the ‘hidden text’ portion of the PDF.)

Learning curve is lookin’ mighty steep, but DTPro seems to be the only thing in sight that will give me any semblance of control over this enormous stack of info-bits I’ve accumulated over the last few years. Your help is appreciated.

Bill_DeVille · July 6, 2005, 12:57am

I lean towards ReadIRIS Pro 9 for its handling of PDFs, both reading them and for the PDF+text output option.

My scans are done at 300 dpi or higher so that there’s a good image for OCR. Unless I want to preserve a document at good resolution for printing, I usually process the PDFs after OCR to reduce file size, using something like PDFShrink’s ebook settings.
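As a rough illustration of why 300 dpi scans are worth shrinking afterwards, here is a small sketch of the arithmetic. The page size (US Letter, 8.5 x 11 in) and the bit depths are assumptions for the example, not something stated in this thread, and real PDFs are compressed, so actual files will be smaller than these raw figures.

```python
def scan_size(width_in, height_in, dpi, bits_per_pixel):
    """Return (width_px, height_px, uncompressed_bytes) for one scanned page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    return width_px, height_px, total_bits // 8


# Assumed page: US Letter (8.5 x 11 in); 8-bit grayscale vs 24-bit color.
w, h, gray_bytes = scan_size(8.5, 11, 300, 8)
_, _, color_bytes = scan_size(8.5, 11, 300, 24)
print(f"{w} x {h} px, grayscale {gray_bytes / 1e6:.1f} MB, color {color_bytes / 1e6:.1f} MB")
```

Under those assumptions a single page is 2550 x 3300 pixels, roughly 8.4 MB uncompressed in grayscale and about three times that in color, which is why a stack of multipage scans adds up quickly.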

Multipage document scans can be done as a series of images into the OCR software, then OCR’d and output as a multipage PDF. For single-page documents, I just adjust the setting to automatic OCR and that’s it.

One other trick. I like Index imports of PDF files. (I wish that could be done for PDFs imported directly into the database.) I have a folder named “123 Index to DEVONthink”. After reducing the file size of OCR’d PDFs, I dump them all into that folder. The trick is that I’ve set up a folder action script that automatically indexes new content to DEVONthink Pro. (Several Folder Action scripts were included on your DEVONthink/DEVONthink Pro download disk image. Go to the Help menu in the Finder, select Mac Help and search for folder actions.)
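For readers curious about the idea behind a "drop folder", here is a minimal sketch in Python. It is not the Folder Action script mentioned above (those are AppleScript and are triggered by the Finder); it only polls a folder and hands each newly arrived PDF to a callback. The names `new_pdfs`, `watch`, `handle` and `max_polls` are made up for this example, and what the callback does with the file (for instance, passing it to an indexing step) is left open.

```python
import os
import time


def new_pdfs(folder, seen):
    """Return PDF paths in `folder` that are not yet in `seen`, and record them."""
    found = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if name.lower().endswith(".pdf") and path not in seen:
            seen.add(path)
            found.append(path)
    return found


def watch(folder, handle, interval=5.0, max_polls=None):
    """Poll `folder` and call `handle(path)` once for each new PDF.

    `max_polls=None` means poll until interrupted.
    """
    seen = set()
    polls = 0
    while max_polls is None or polls < max_polls:
        for path in new_pdfs(folder, seen):
            handle(path)
        polls += 1
        if max_polls is None or polls < max_polls:
            time.sleep(interval)
```

Calling something like `watch("123 Index to DEVONthink", print)` would print each PDF once as it shows up in that folder. Polling is the simplest approach to sketch; a real Folder Action is event-driven and does not need a running loop.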

It’s possible that DEVONtechnologies may in the future add features to make scanning and OCR even easier.

ChemBob · July 6, 2005, 4:34am

I’m out of town and not online as much as usual, but here’s what I’d suggest given that you already have .pdf files of your documents. Get Adobe Acrobat Professional and use the OCR capabilities in it. It opens the .pdf, you tell it to OCR and it keeps the original image and the hidden text. I’ve had really good luck with it and it is what I always use for this purpose.

ChemBob

WritingStudio · July 6, 2005, 7:53pm

Bill, I downloaded the trial of ReadIris 9 and it works great! The few tests I did came in as “PDF+text” and I was able to search them. The best part was how fast it was with very little interaction on my part. All those scores of PDFs no longer look quite so daunting.

ChemBob, Thanks for the note on Acrobat. I checked the website and while I have used the program in the past on Wintel machines (I think the last version I owned was 5), the Mac version seems a bit, uh, shorted - a lot of the features are Win only - as well as a demo available only for Windows. I do miss the ability to highlight, make notes, etc., that the full Acrobat program gives me. If I’m reading the Adobe site correctly, only the ‘professional’ version allows me to create PDFs that can be annotated in Reader 7. I do love that ability, so I’ll keep an eye on any updates they have for the Mac version.

Many thanks!

One problem solved… now I have to go figure out sheets and records.

(I’m assuming there’s a manual in the works but that it’s not ready yet. What I really need is “DTPro for the Clueless” but somehow I don’t think that’s going to show up on Amazon any time soon.)

Bill_DeVille · July 6, 2005, 8:21pm

The DT Pro Tutorial database provides some documentation, illustrations and tips. Play with it for a while.

howarth · July 6, 2005, 9:31pm

The Mac version of Adobe Acrobat Professional certainly allows highlighting and comments. If you are running Tiger, you may now do the same in Preview. I always open PDF files with Preview, because it’s fast, whereas Acrobat Reader or Professional seem to take forever.

ChemBob · July 7, 2005, 7:47am

Hmmm, well I admit I haven’t checked in a while but Adobe has always had a policy of maintaining absolute parity between their programs that are available both on the Mac and the PC. This has been to a fault actually, with Adobe not taking advantage of features that would have increased the Mac version’s worth relative to the PC version. Maybe this has changed but I’ve used Acrobat 7.x really effectively in my projects lately. I’ve used it to OCR pdfs and been amazed at its accuracy. I received a BIG (too big, LOL) report from Los Alamos for review where all the chapters were in .pdf format. I used Acrobat to highlight, comment, and edit every aspect of those files before sending them back to Los Alamos. So I don’t know what I might have been missing in the Mac version but, whatever it was, I found everything I was needing to do these jobs.

Oh, another thing I do with it…when I save a Word doc to .pdf it starts a new document every time I change the page orientation from vertical to wide (horizontal, whatever, for a table, e.g.). So it is possible to end up with quite a number of .pdf files from a single Word document. Acrobat allows me to assemble those .pdf files into the single .pdf they should have been, prior to sending the document to my client.

Anyway, I hope you find something that helps you get your work done efficiently, as we are all striving to do.

ChemBob

Bill_DeVille · July 7, 2005, 5:10pm

ChemBob:

I generally agree with most of your comments, but you just made one that I don’t agree with.

Adobe has never given the Mac versions of Acrobat anything like parity with the Windows versions. Check out the feature list of Acrobat and Acrobat Professional and you will see that many (especially the ones I would like to have!) are Windows only. That’s still true for Acrobat Professional 7. The Mac version should be cheaper than the Windows version.

I’ve used Acrobat to put together thousands of pages of material in hundreds of technical reference documents. There were times when I simply had to resort to Windows to get things done right. The last time I had to do something like that, I used Stone Studio’s Create, which let me stitch together a document from PDF, Word RTF and RTFD elements, with full and easy control of layout, pagination and headers and both PDF and HTML output (which looked and printed identically). And I can edit the Create document! (There’s a learning curve, but Create and PStill are all Cocoa code.)

I use Acrobat these days for Web page captures, because the resulting PDF has working hyperlinks. It does that better than anything else I’ve got. But comes the day when Apple lets me save a Web Archive to PDF with hyperlinks retained, and Acrobat is gone, as far as I’m concerned.

toddm · August 19, 2005, 4:43am

Is there a way to OCR PDF files without a scanner? I have hundreds of PDFs that cannot be searched. I was hoping that there was a program out there that could take my already-existing PDFs and make them searchable, without the need for a scanner. Sorry if this is a dumb question.

rollo · August 19, 2005, 5:14am

I use two ways. One is to open the file in Omnipage Pro X, where you can OCR it with a high degree of accuracy and save it as Text, RTF, Word or others. The other is to use Acrobat Pro with “Save As…”, which lets you save it in RTF or Word formats.

Rollo
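With hundreds of existing PDFs, it may help to first sort out which ones actually lack searchable text. The following is a crude sketch, not a reliable test: it assumes that image-only scans define no fonts, so it simply looks for the `/Font` key in the raw file bytes. PDFs that store their objects in compressed object streams can hide that key and be flagged wrongly, and a file with fonts may still have poor text. The function names are made up for this example.

```python
from pathlib import Path


def probably_has_text_layer(pdf_path):
    """Crude heuristic: True if the raw PDF bytes mention a /Font resource."""
    data = Path(pdf_path).read_bytes()
    return b"/Font" in data


def needs_ocr(folder):
    """List PDFs under `folder` (recursively) that show no sign of a text layer."""
    return sorted(
        str(p)
        for p in Path(folder).rglob("*.pdf")
        if not probably_has_text_layer(p)
    )
```

Running `needs_ocr` on a research folder gives a candidate list to feed to whichever OCR program you settle on; spot-check a few files before trusting it.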

ChemBob · August 19, 2005, 1:51pm

Todd, the posts above your query in this thread address this in some detail. Perhaps you can find the information you are requesting in the above posts.

ChemBob

toddm · August 19, 2005, 3:01pm

Thanks I will try one of those methods. Wow this forum is helpful!

rickl · August 22, 2005, 2:05am

My main theme haunting these forums was supposed to be getting to know better the software I already own rather than buying new stuff. But reading this thread has already got me wanting Create and Acrobat Pro! My impression is that Create isn’t terribly popular among the mainstream Mac community (“mainstream” being defined here as people like us who use DT, Circus Ponies Notebook, Mellel, Bookends, and Omni software, with minor additions and substitutions allowed), but checking out the website it looks awesome. If time allows, Bill, any elaborations on Stone software would be most welcome.

WritingStudio · August 22, 2005, 7:06pm

Toddm,

My original inquiry was about PDFs that already existed, not ones I was scanning. The Readiris 9 program I bought has been great for this, allowing me to OCR and then search within DT Pro while at the same time keeping the image file intact so I can see the original. So far, it has worked just the way I wanted it to. I went with Readiris 9 because: 1) it was easy and did what I wanted it to do without a lot of fuss, and 2) the price was cheap - I downloaded the demo version and toward the end of the demo period they sent me an email saying I could buy the whole package for the upgrade price. Very nice. (I may still get Acrobat down the road, but at the moment I use its features so little that I can do it on my old Win XP system and transfer the files to my Mac.)

I’m a long way from being proficient in DT Pro, but this has allowed me to get a foothold in an otherwise overwhelming product. I’m learning as I need it rather than trying to absorb it all at once. At this point, it’s more about corralling the chaos than controlling it, but since I had info all over my hard drive (and elsewhere) it’s been nice to start to have it all in one place. (Spotlight has helped, especially for quick and dirty one-off searches, so between the two I’m starting to feel that I haven’t “lost” too much information.) Remember that last scene in Raiders of the Lost Ark? That’s how I felt my information was being stored without DT Pro. I knew I had the info, just didn’t know where!

(Although the COMMENTS addition that has been wished for/discussed elsewhere here would be fantastic for me.)