Since upgrading to Office, I’ve been slowly OCRing a lot of the non-searchable PDF articles and books in my collection, and it’s been working great.
One thing I do quite often is split up a large PDF file into 50-page chunks, using a 10-line Python program I wrote. Then I run Import>Images (with OCR) on all the files at once. It would be nice to just sit back and let that run, except I can’t, because after each file it asks me to confirm the new filename information, author, etc.
Is there any way to tell DT to just use whatever default values it comes up with, instead of my having to press Enter for each one?
It’s not a major problem. I don’t mind keeping my eye on it and pressing Enter every few minutes, but sometimes I just want to start it as a big batch job and leave to do something else for a while.
Sure, no problem. It requires pyPdf (pybrary.net/pyPdf/) which may or may not work with the version of Python installed on OS X by default. I use a more modern version of Python (which you can just get from python.org, if the default version doesn’t work).
splitter.py (takes the filename in as an argument)

    import sys
    import pyPdf

    CHUNK = 50  # pages per output file

    # Open the source PDF named on the command line.
    reader = pyPdf.PdfFileReader(open(sys.argv[1], "rb"))
    total = reader.numPages

    # Step through the document CHUNK pages at a time; the last chunk may be shorter.
    for ps, start in enumerate(range(0, total, CHUNK)):
        output = pyPdf.PdfFileWriter()
        for page in range(start, min(start + CHUNK, total)):
            output.addPage(reader.getPage(page))
        # Chunks are named file-0.pdf, file-1.pdf, ...
        with open("file-" + str(ps) + ".pdf", "wb") as out:
            output.write(out)

merge.py (takes a list of .pdfs to merge, I usually just do *.pdf)
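
    import sys
    import pyPdf

    files = sys.argv[1:]
    output = pyPdf.PdfFileWriter()

    # Append every page of every input file, in the order given.
    for f in files:
        reader = pyPdf.PdfFileReader(open(f, "rb"))
        for page_number in range(reader.numPages):
            output.addPage(reader.getPage(page_number))

    # Write the result once, after all pages have been collected.
    with open("result.pdf", "wb") as out:
        output.write(out)
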
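Something to note about merge.py: if the original document is over 500 pages, merge.py might run into problems because of the way I name the files in the splitter script. With more than ten chunks, *.pdf expands in alphabetical order, so file-10.pdf gets merged before file-2.pdf and the pages come out in the wrong order.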
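One possible way around the ordering problem is to sort the filenames by their chunk number before merging. This is only a rough sketch: the helper names chunk_index and in_page_order are made up for illustration, and it assumes the files still have names ending in -N.pdf after they come back out of DT, which may not be the case for every export.

```python
import re

def chunk_index(filename):
    """Return the integer N from a name ending in '-N.pdf', or -1 if there is none.

    Hypothetical helper; assumes the chunk naming used by the splitter script.
    """
    match = re.search(r"-(\d+)\.pdf$", filename)
    return int(match.group(1)) if match else -1

def in_page_order(filenames):
    """Sort chunk files by chunk number instead of alphabetically."""
    return sorted(filenames, key=chunk_index)

# The order a shell glob would typically hand over for a long document.
names = ["file-0.pdf", "file-1.pdf", "file-10.pdf", "file-2.pdf"]
print(in_page_order(names))
```

In merge.py this would mean replacing files = sys.argv[1:] with files = in_page_order(sys.argv[1:]). Another option would be to zero-pad the chunk number when the files are written, so that alphabetical order and page order agree.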
My workflow so far is to split a .pdf with splitter.py, then import all the resulting files into DT Pro. Then I export the resulting OCR’ed files back to the file system somewhere, and run merge.py to turn them back into a single, searchable PDF.
Hope this helps! Sorry the splitter code is so messy, normally I’d do a cleaner job than that.