TIFFs and text files - import from Relativity?

I have a set of single-page TIFFs with text files containing the OCR’ed text extracted from the TIFF images, and bearing the same name as the corresponding image file. There is also a delimited-text “load” file that contains metadata, including information about which images constitute which documents:

EFI0000001,PROD001,.\PROD001\IMAGES\IMG001\EFI0000001.tif,Y,,,4
EFI0000002,PROD001,.\PROD001\IMAGES\IMG001\EFI0000002.tif,,,,
EFI0000003,PROD001,.\PROD001\IMAGES\IMG001\EFI0000003.tif,,,,
EFI0000004,PROD001,.\PROD001\IMAGES\IMG001\EFI0000004.tif,,,,
EFI0000005,PROD001,.\PROD001\IMAGES\IMG001\EFI0000005.tif,Y,,,5
EFI0000006,PROD001,.\PROD001\IMAGES\IMG001\EFI0000006.tif,,,,
EFI0000007,PROD001,.\PROD001\IMAGES\IMG001\EFI0000007.tif,,,,
EFI0000008,PROD001,.\PROD001\IMAGES\IMG001\EFI0000008.tif,,,,
EFI0000009,PROD001,.\PROD001\IMAGES\IMG001\EFI0000009.tif,,,,
EFI0000010,PROD001,.\PROD001\IMAGES\IMG001\EFI0000010.tif,Y,,,1
EFI0000011,PROD001,.\PROD001\IMAGES\IMG001\EFI0000011.tif,Y,,,5
EFI0000012,PROD001,.\PROD001\IMAGES\IMG001\EFI0000012.tif,,,,
EFI0000013,PROD001,.\PROD001\IMAGES\IMG001\EFI0000013.tif,,,,
EFI0000014,PROD001,.\PROD001\IMAGES\IMG001\EFI0000014.tif,,,,
EFI0000015,PROD001,.\PROD001\IMAGES\IMG001\EFI0000015.tif,,,,
EFI0000016,PROD001,.\PROD001\IMAGES\IMG001\EFI0000016.tif,Y,,,4
EFI0000017,PROD001,.\PROD001\IMAGES\IMG001\EFI0000017.tif,,,,
EFI0000018,PROD001,.\PROD001\IMAGES\IMG001\EFI0000018.tif,,,,
EFI0000019,PROD001,.\PROD001\IMAGES\IMG001\EFI0000019.tif,,,,

The columns are, left to right:

Unique identifier,
group identifier,
file path,
[Y/blank], where Y marks the first page of a document (single-page documents such as EFI0000010 are flagged too),
n/a,
n/a,
page count.
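
For anyone who wants to check a load file like this before doing anything with it, here is a minimal Python sketch that groups the page rows into documents and flags any document whose stated page count does not match the number of rows found. The names `Page`, `read_pages`, `group_documents` and `count_mismatches` are made up for illustration, and it assumes the seven-column layout described above with no header row.

```python
import csv
import io
from typing import NamedTuple


class Page(NamedTuple):
    page_id: str       # column 1: unique identifier
    group: str         # column 2: group identifier
    path: str          # column 3: image path as written in the load file
    first_page: bool   # column 4: "Y" marks the first page of a document
    page_count: int    # column 7: only populated on first-page rows


def read_pages(text):
    """Parse load-file text into Page records, skipping blank lines."""
    pages = []
    for row in csv.reader(io.StringIO(text)):
        if len(row) < 4:
            continue
        count = int(row[6]) if len(row) > 6 and row[6] else 0
        pages.append(Page(row[0], row[1], row[2], row[3] == "Y", count))
    return pages


def group_documents(pages):
    """Start a new document at every first-page row; return lists of Pages."""
    documents = []
    for page in pages:
        if page.first_page or not documents:
            documents.append([])
        documents[-1].append(page)
    return documents


def count_mismatches(documents):
    """Return IDs of documents whose stated page count differs from the rows found."""
    return [doc[0].page_id for doc in documents
            if doc[0].page_count != len(doc)]
```

Running `count_mismatches(group_documents(read_pages(text)))` on the full load file should return an empty list if every document is complete.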

Finally, there is another file with substantive metadata, which uses odd delimiters:

þProd Beg BatesþþProd End BatesþþProd Beg AttachþþProd End AttachþþPage CountþþEmail SubjectþþFromþþRecipientsþþCCþþBCCþþSent DateþþSent TimeþþReceived DateþþReceived TimeþþSubjectþþFile NameþþAuthorþþCreate DateþþCreate TimeþþModified DateþþModified TimeþþCustodianþþFile ExtensionþþOriginal PathþþMD5 HashþþTextFileþþNativeFileþ
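
A rough Python sketch for reading a file like this, in case it helps. It relies only on the þ character wrapping each field, as seen in the header above, so it should work whether or not there is an invisible separator character between fields (load files of this kind often use one, but that is an assumption about this file). The function names are hypothetical, and the sketch does not handle field values that themselves contain þ or line breaks.

```python
import re

QUOTE = "\u00fe"  # þ, the text qualifier visible in the header line

# Capture whatever sits between each opening and closing þ.
FIELD = re.compile(QUOTE + "(.*?)" + QUOTE)


def split_fields(line):
    """Return the þ-wrapped values on one line, in order (empty fields kept)."""
    return FIELD.findall(line)


def read_records(lines):
    """Treat the first line as column names and map each later line to a dict."""
    lines = iter(lines)
    header = split_fields(next(lines))
    records = []
    for line in lines:
        values = split_fields(line)
        if values:
            records.append(dict(zip(header, values)))
    return records
```

Pass it an open file handle (opened with whichever encoding makes þ decode correctly) and look up values by column name, for example `record["Prod Beg Bates"]`.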

This is a format designed to be imported into a commercial document database called “Relativity”, which is commonly used by attorneys. I’d like to import the docs into DEVONthink Pro Office as a database, but without the ability to distinguish documents and link the text to the TIFFs, there is little use.

Is this feasible?

-M.

So…

I think I’ve figured out how to combine the TIFF files into documents, using the tiffutil command, which I never knew existed. Because those documents should have the same name as the corresponding text files, it shouldn’t be too hard to find the image when I get a hit in the text file. It would be great, though, if there were a way to automagically link the two.
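
One possible way to keep text and image together: since the text files are per page, the text for a combined document is spread over several files. Below is a minimal Python sketch that concatenates the page-level text into one file named after the first page, so it matches the combined TIFF. It assumes the text files use a `.txt` extension and sit together in one folder; `merge_page_text` and its parameters are hypothetical names, and the form feed used as a page separator is just one choice.

```python
from pathlib import Path


def merge_page_text(page_ids, text_dir, out_dir):
    """Concatenate per-page text files into one file named after the first page."""
    text_dir, out_dir = Path(text_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    parts = []
    for page_id in page_ids:
        page_file = text_dir / f"{page_id}.txt"  # assumed naming: <page id>.txt
        if page_file.exists():
            parts.append(page_file.read_text(encoding="utf-8", errors="replace"))
    target = out_dir / f"{page_ids[0]}.txt"
    # A form feed between pages keeps the page boundaries recoverable.
    target.write_text("\f".join(parts), encoding="utf-8")
    return target
```

Writing to a separate output folder leaves the original page-level text untouched.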

For the curious, what I did is, through the magic of RegEx, I rearranged each block of the data file from this:

EFI0000001,PROD001,.\PROD001\IMAGES\IMG001\EFI0000001.tif,Y,,,4
EFI0000002,PROD001,.\PROD001\IMAGES\IMG001\EFI0000002.tif,,,,
EFI0000003,PROD001,.\PROD001\IMAGES\IMG001\EFI0000003.tif,,,,
EFI0000004,PROD001,.\PROD001\IMAGES\IMG001\EFI0000004.tif,,,,

to this:


tiffutil -cat EFI0000001.tif EFI0000002.tif EFI0000003.tif EFI0000004.tif -out EFI0000001.tif

et voilà! instant shell script!
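
For anyone who would rather not do this with RegEx, the same rearrangement can be sketched in Python. This is only an illustration under a few assumptions: the load file has the seven-column layout shown earlier, the commands are run from the folder containing the TIFFs, and the combined files go into a separate folder (the hypothetical `combined`) so that no input file is overwritten, which differs from the command above. `tiffutil_commands` is a made-up name.

```python
import csv
import io
import shlex
from pathlib import PureWindowsPath


def tiffutil_commands(load_file_text, out_dir="combined"):
    """Build one `tiffutil -cat` command per document in the load file."""
    documents = []
    for row in csv.reader(io.StringIO(load_file_text)):
        if len(row) < 4:
            continue  # skip blank or malformed lines
        # The load file uses Windows-style paths; keep only the file name.
        name = PureWindowsPath(row[2]).name
        if row[3] == "Y" or not documents:
            documents.append([])  # "Y" starts a new document
        documents[-1].append(name)

    commands = []
    for names in documents:
        inputs = " ".join(shlex.quote(n) for n in names)
        output = shlex.quote(f"{out_dir}/{names[0]}")
        commands.append(f"tiffutil -cat {inputs} -out {output}")
    return commands
```

Joining the returned list with newlines and saving it to a file gives the shell script; the output folder has to exist before the script runs.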

I don’t know if running a 250,000-line script will make my MacBook blow up, or how long it will take, but if all goes well I should have a complete set of documents by morning. As long as I didn’t make any typos.

Meanwhile, I can’t figure out how to get DEVONthink to show me more than the first page of any multi-page TIFF.

I’d be crazy to try to import 300,000 images into DEVONthink with OCR, right?

If tried in one single batch… yes. This would need to be split up into many, many batches, or be run on a server running DEVONthink with absolutely nothing else to do for a good long while. Alternatively, you could use a third-party OCR app, but 300,000 files will take a VERY long time in any case.

Take 300,000 files at 30 seconds per page (assuming one page per file, and a FAST OCR processor). That’s 9,000,000 seconds, or about 104 days flat out, no breaks. Got a lot of peanut butter and bread standing by for energy while you monitor the process? :mrgreen:
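
To play with the numbers, the estimate can be written as a tiny Python function. The 30 seconds per page is the rough figure from above, not a measurement, and the `workers` parameter is a hypothetical addition for the case where the job is split across several machines or processes.

```python
def ocr_days(pages, seconds_per_page, workers=1):
    """Days of continuous processing for a batch of pages."""
    return pages * seconds_per_page / workers / 86_400  # 86,400 seconds per day


print(f"{ocr_days(300_000, 30):.1f} days")  # 300,000 pages at 30 s each
```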