Looking for some newbie advice

tauchris · September 3, 2012, 3:22pm

Hi all,

I have a set of about 24,000 documents, all scanned .TIFF files, individual pages. I need to build a searchable database of these documents, and I plan to use DEVONthink Pro Office (which I have been using for personal document management for a while now, and love it!)

Two issues I’d appreciate some head-start/pointers on:

I need to import all these .TIFF files into DEVONthink, and run OCR on them to generate a searchable index. I started doing that through the GUI, and quickly realized that I would need to hang around to click “Okay” for each of the 24,000 documents, one at a time. Obviously, THAT is not going to happen. Can someone suggest a straightforward scripted approach to doing the import+OCR indexing?
All of the .TIFF files contain single pages from original paper documents (which are long gone). I have a catalog file that shows me how many pages were in each original file, so I have the data I need to merge files into multi-page documents. Is that easier to do before I import into DEVONthink (in which case, what tools would you suggest for merging .TIFF documents?) Or if it’s easier to do inside DEVONthink, can someone suggest a scriptable way to do that?

Thanks for any advice!
Chris

korm · September 3, 2012, 4:03pm

In DEVONthink > Preferences > OCR uncheck the preference “Enter metadata after text recognition” and DEVONthink will stop pestering you.

You might want to hold off merging your documents because for some uses (e.g. See Also & Classify) the discrete OCRd pages might have an advantage.

A script that looks in your catalog and determines how to merge a given document is going to be complex and likely to fail from time to time, IMO. Personally, I wouldn’t trust it. The worst case is that the catalog is wrong, or the TIFFs are out of order or named incorrectly: the script would start munging together pieces that don’t belong together, and you’d need to browse every document to figure that out.
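If a catalog-driven merge is attempted anyway, one way to reduce that risk is to build the merge plan first and refuse to proceed when the catalog and the files disagree. The Python below is a minimal sketch, not a tested workflow. It assumes the catalog can be read as (document name, page count) pairs and that sorting the file names gives the original page order; both are assumptions, as is the function name `plan_merges`.

```python
from typing import Dict, List, Sequence, Tuple


def plan_merges(
    catalog: Sequence[Tuple[str, int]], pages: Sequence[str]
) -> Dict[str, List[str]]:
    """Assign consecutive single-page files to documents by page count.

    catalog: hypothetical (document name, page count) pairs, in order.
    pages:   single-page TIFF file names, already in page order.
    Nothing is merged here; the result is only a plan to review.
    """
    # Reject impossible page counts before doing anything else.
    for doc_id, count in catalog:
        if count < 1:
            raise ValueError(f"{doc_id}: page count must be at least 1")

    # The totals must agree, otherwise every later document would be shifted.
    expected = sum(count for _, count in catalog)
    if expected != len(pages):
        raise ValueError(
            f"catalog expects {expected} pages, found {len(pages)} files"
        )

    plan: Dict[str, List[str]] = {}
    cursor = 0
    for doc_id, count in catalog:
        if doc_id in plan:
            raise ValueError(f"duplicate document name in catalog: {doc_id}")
        plan[doc_id] = list(pages[cursor:cursor + count])
        cursor += count
    return plan
```

Calling `plan_merges([("doc-A", 2), ("doc-B", 1)], sorted(names))` returns a dictionary mapping each document to its page files. A matching total does not prove the pages are in the right order, so spot-checking a few planned documents by eye is still worthwhile before handing the plan to a merge tool such as libtiff's `tiffcp`.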

I’d suggest you do not import all the TIFFs into DEVONthink, index them instead, and run OCR against the indexed TIFFs. The resulting PDF+Text will be imported into the database – which is what you want – but not the TIFFs. You won’t need the TIFFs in your database – right? Make sure the setting in Preferences to trash the original is OFF, so your valuable collection isn’t at risk. I assume you have at least two backups of the TIFFs - one of them is off site?

Before working this process with 24,000 documents – try it on a dozen or so. DEVONthink’s ABBYY OCR software can render pretty lousy resolution on graphics - regardless of the resolutions set in OCR preferences, in my experience. Over here, I prefer Acrobat X Pro for OCR. YMMV.
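For the trial run, a small script can pick the dozen test pages so the sample is not just the first few files in the folder. This is a sketch under assumptions: the folder layout, the `.tif`/`.tiff` extensions, and the function name `pick_sample` are all illustrative, and choosing the sample at random is one option rather than a recommendation from this thread.

```python
import random
from pathlib import Path
from typing import List


def pick_sample(folder: str, size: int = 12, seed: int = 0) -> List[Path]:
    """Return up to `size` TIFF files chosen at random from `folder`.

    A fixed seed keeps the selection repeatable between runs.
    """
    tiffs = sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in {".tif", ".tiff"}
    )
    rng = random.Random(seed)
    # Never ask for more files than the folder actually holds.
    return sorted(rng.sample(tiffs, min(size, len(tiffs))))
```

The returned paths can be copied to a scratch folder and indexed from there, which keeps the trial well away from the original collection.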

See also this recent dialog: indexing v. Add to database question

tauchris · September 4, 2012, 2:16pm

Thanks, @korm. Sounds like wise advice. I don’t think I’m going to pop for Acrobat X, though. ABBYY will just have to do.