"Automating" OCR of multiple files?

Since upgrading to Office, I’ve been slowly OCRing a lot of the non-searchable PDF articles and books in my collection, and it’s been working great.

One thing I do quite often is split a large PDF file into 50-page chunks, using a 10-line Python program I wrote. I then run Import > Images (with OCR) on all the files at once. It would be nice to just sit back and let that run, but I can’t, because after each file it asks me to confirm the new filename, author, etc.

Is there any way to tell DT to just use whatever default values it comes up with, instead of my having to press Enter for each one?

It’s not a major problem; I don’t mind keeping my eye on it and pressing Enter every few minutes, but sometimes I just want to start it as a big batch job and leave to do something else for a while.

Thanks,
Jay P.

Nice suggestion. It would make a good option to allow auto processing of multiple files.

Hi, Jay. Go to DTPO Preferences > OCR.

Uncheck the “Set attributes” option and your queue of PDFs will chug along uninterrupted.

Perfect, thanks so much!

Jay P.

Jay,

Care to share your 10-line Python program?

revheck

Sure, no problem. It requires pyPdf (pybrary.net/pyPdf/) which may or may not work with the version of Python installed on OS X by default. I use a more modern version of Python (which you can just get from python.org, if the default version doesn’t work).

splitter.py (takes the filename as an argument)


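import sys
import pyPdf

CHUNK = 50  # pages per output file

# Leave the input open: its pages are read when each output is written.
reader = pyPdf.PdfFileReader(open(sys.argv[1], "rb"))
total = reader.numPages

# Ceiling division, so an exact multiple of 50 does not produce an empty last file.
for ps in range((total + CHUNK - 1) // CHUNK):
    output = pyPdf.PdfFileWriter()
    start = ps * CHUNK
    end = min(start + CHUNK, total)  # one past the last page of this chunk
    for page in range(start, end):
        output.addPage(reader.getPage(page))
    with open("file-" + str(ps) + ".pdf", "wb") as out_file:
        output.write(out_file)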

merge.py (takes a list of .pdfs to merge, I usually just do *.pdf)


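import sys
import pyPdf

files = sys.argv[1:]  # merged in the order given on the command line

output = pyPdf.PdfFileWriter()

for f in files:
    # Leave each input open: its pages are read when the output is written.
    reader = pyPdf.PdfFileReader(open(f, "rb"))
    for page_number in range(reader.numPages):
        output.addPage(reader.getPage(page_number))

with open("result.pdf", "wb") as out_file:
    output.write(out_file)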

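Something to note about merge.py: if the original document is over 500 pages, merge.py might run into problems because of the way I name the files in splitter.py. With more than ten chunks, *.pdf lists file-10.pdf before file-2.pdf, so the pages would be merged out of order.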
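
A possible workaround, sketched on the assumption that the chunks keep their file-N.pdf names: sort the file names by the numbers inside them rather than alphabetically before merging. The helper name natural_key is made up for this example and is not part of pyPdf.

```python
import re

def natural_key(name):
    # Hypothetical helper: split "file-10.pdf" into ["file-", 10, ".pdf"]
    # so that runs of digits are compared as numbers, not as text.
    return [int(part) if part.isdigit() else part
            for part in re.split(r"(\d+)", name)]

names = ["file-10.pdf", "file-2.pdf", "file-0.pdf", "file-1.pdf"]

plain_order = sorted(names)                     # alphabetical, as *.pdf expands
natural_order = sorted(names, key=natural_key)  # numeric chunk order

print(plain_order)
print(natural_order)
```

In merge.py this would amount to changing the files line to files = sorted(sys.argv[1:], key=natural_key). Another option would be to zero-pad the chunk number when splitter writes the files, so that plain alphabetical order already matches page order.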

My workflow so far is to split a .pdf with splitter.py, then import all the resulting files into DT Pro. Then I export the resulting OCR’ed files back to the file system somewhere and run merge.py to turn them back into a single, searchable PDF.

Hope this helps! Sorry the splitter.py code is so messy; normally I’d do a cleaner job than that :slight_smile: