TIFFs and text files - import from Relativity?

temjeito · August 15, 2015, 4:50pm

I have a set of single-page TIFFs with text files containing the OCR’ed text extracted from the TIFF images, and bearing the same name as the corresponding image file. There is also a delimited-text “load” file that contains metadata, including information about which images constitute which documents:

EFI0000001,PROD001,.\PROD001\IMAGES\IMG001\EFI0000001.tif,Y,,,4
EFI0000002,PROD001,.\PROD001\IMAGES\IMG001\EFI0000002.tif,,,,
EFI0000003,PROD001,.\PROD001\IMAGES\IMG001\EFI0000003.tif,,,,
EFI0000004,PROD001,.\PROD001\IMAGES\IMG001\EFI0000004.tif,,,,
EFI0000005,PROD001,.\PROD001\IMAGES\IMG001\EFI0000005.tif,Y,,,5
EFI0000006,PROD001,.\PROD001\IMAGES\IMG001\EFI0000006.tif,,,,
EFI0000007,PROD001,.\PROD001\IMAGES\IMG001\EFI0000007.tif,,,,
EFI0000008,PROD001,.\PROD001\IMAGES\IMG001\EFI0000008.tif,,,,
EFI0000009,PROD001,.\PROD001\IMAGES\IMG001\EFI0000009.tif,,,,
EFI0000010,PROD001,.\PROD001\IMAGES\IMG001\EFI0000010.tif,Y,,,1
EFI0000011,PROD001,.\PROD001\IMAGES\IMG001\EFI0000011.tif,Y,,,5
EFI0000012,PROD001,.\PROD001\IMAGES\IMG001\EFI0000012.tif,,,,
EFI0000013,PROD001,.\PROD001\IMAGES\IMG001\EFI0000013.tif,,,,
EFI0000014,PROD001,.\PROD001\IMAGES\IMG001\EFI0000014.tif,,,,
EFI0000015,PROD001,.\PROD001\IMAGES\IMG001\EFI0000015.tif,,,,
EFI0000016,PROD001,.\PROD001\IMAGES\IMG001\EFI0000016.tif,Y,,,4
EFI0000017,PROD001,.\PROD001\IMAGES\IMG001\EFI0000017.tif,,,,
EFI0000018,PROD001,.\PROD001\IMAGES\IMG001\EFI0000018.tif,,,,
EFI0000019,PROD001,.\PROD001\IMAGES\IMG001\EFI0000019.tif,,,,

The columns are, left to right:

Unique identifier,
group identifier,
file path,
[Y/-] where Y indicates the first page of a multi-page document,
n/a,
n/a,
page count.
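
As a rough illustration of how that layout can be read, here is a minimal Python sketch that groups page rows into documents by starting a new document at every row flagged Y. The `Document` class and `group_pages` function are names made up for this example, and it assumes the file is plain comma-separated text with no quoting.

```python
import csv
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Document:
    first_id: str                          # identifier of the first page
    pages: List[str] = field(default_factory=list)
    declared_pages: Optional[int] = None   # page count from the last column, if present


def group_pages(lines):
    """Group load-file rows into documents; a 'Y' in column 4 starts a new one."""
    docs = []
    for row in csv.reader(lines):
        if len(row) < 4:
            continue  # skip blank or malformed rows
        ident, path, flag = row[0], row[2], row[3]
        count = row[6] if len(row) > 6 else ""
        if flag == "Y" or not docs:
            docs.append(Document(ident, [], int(count) if count else None))
        docs[-1].pages.append(path)
    return docs


SAMPLE = r"""EFI0000001,PROD001,.\PROD001\IMAGES\IMG001\EFI0000001.tif,Y,,,4
EFI0000002,PROD001,.\PROD001\IMAGES\IMG001\EFI0000002.tif,,,,
EFI0000003,PROD001,.\PROD001\IMAGES\IMG001\EFI0000003.tif,,,,
EFI0000004,PROD001,.\PROD001\IMAGES\IMG001\EFI0000004.tif,,,,
EFI0000005,PROD001,.\PROD001\IMAGES\IMG001\EFI0000005.tif,Y,,,5
"""

docs = group_pages(SAMPLE.splitlines())
for doc in docs:
    print(doc.first_id, len(doc.pages), doc.declared_pages)
```

Comparing `len(doc.pages)` with `doc.declared_pages` over the whole file is a cheap way to spot rows that are missing or out of order.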

Finally, there is another file with substantive metadata, which uses odd delimiters:

þProd Beg BatesþþProd End BatesþþProd Beg AttachþþProd End AttachþþPage CountþþEmail SubjectþþFromþþRecipientsþþCCþþBCCþþSent DateþþSent TimeþþReceived DateþþReceived TimeþþSubjectþþFile NameþþAuthorþþCreate DateþþCreate TimeþþModified DateþþModified TimeþþCustodianþþFile ExtensionþþOriginal PathþþMD5 HashþþTextFileþþNativeFileþ
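
For what it's worth, that header can be split in a few lines of Python. This is only a sketch: it assumes þ wraps every field and that whatever sits between fields (in Concordance-style load files this is commonly an invisible control character, ASCII 20) can be ignored. `split_dat_line` is a made-up helper name, and the header below is shortened from the one above.

```python
import re

QUOTE = "\u00fe"  # the thorn character (þ) that wraps each field

# One field is everything between a pair of thorn characters.
FIELD = re.compile(f"{QUOTE}([^{QUOTE}]*){QUOTE}")


def split_dat_line(line):
    """Return the field values of one thorn-quoted line.

    Assumes values never contain the thorn character themselves.
    """
    return FIELD.findall(line)


# Shortened header, as it appears when the separator is invisible or lost.
header = "þProd Beg BatesþþProd End BatesþþPage CountþþTextFileþþNativeFileþ"
print(split_dat_line(header))
```

Because each field contributes exactly one pair of thorn characters, the same function should handle lines where the separator character is still present as well as lines where it has been stripped.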

This is a format designed to be imported into a commercial document database called “Relativity”, which is commonly used by attorneys. I’d like to import the docs into DEVONthink Pro Office as a database, but without the ability to distinguish documents and link the text to the TIFFs, it would be of little use.

Is this feasible?

-M.

temjeito · August 15, 2015, 10:03pm

So…

I think I’ve figured out how to combine the TIFF files into documents, using the tiffutil command, which I never knew existed. Because those documents should have the same name as the corresponding text files, it shouldn’t be too hard to find the image when I get a hit in the text file. It would be great, though, if there were a way to automagically link the two.

For the curious, what I did is, through the magic of RegEx, I rearranged each block of the data file from this:

EFI0000001,PROD001,.\PROD001\IMAGES\IMG001\EFI0000001.tif,Y,,,4
EFI0000002,PROD001,.\PROD001\IMAGES\IMG001\EFI0000002.tif,,,,
EFI0000003,PROD001,.\PROD001\IMAGES\IMG001\EFI0000003.tif,,,,
EFI0000004,PROD001,.\PROD001\IMAGES\IMG001\EFI0000004.tif,,,,

to this:


tiffutil -cat EFI0000001.tif EFI0000002.tif EFI0000003.tif EFI0000004.tif -out EFI0000001.tif

et voilà! instant shell script!
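
The same rearrangement can also be done without RegEx. Below is a minimal Python sketch that reads the load-file rows and prints one `tiffutil -cat` line per document, using only the flags shown above. The function name and the `combined` output folder are assumptions for the example. The output goes to a separate folder as a precaution, since the command above writes over its own first input and I have not verified how tiffutil handles that.

```python
import csv
import shlex
from pathlib import PureWindowsPath


def tiffutil_commands(lines, out_dir="combined"):
    """Build one 'tiffutil -cat ... -out ...' command per document."""
    groups = []
    for row in csv.reader(lines):
        if len(row) < 4:
            continue  # skip blank or malformed rows
        name = PureWindowsPath(row[2]).name  # file name without the .\PROD001\... prefix
        if row[3] == "Y" or not groups:
            groups.append([])  # a 'Y' flag starts a new document
        groups[-1].append(name)

    commands = []
    for pages in groups:
        # Hypothetical output location: out_dir must exist before the script runs.
        args = ["tiffutil", "-cat", *pages, "-out", f"{out_dir}/{pages[0]}"]
        commands.append(" ".join(shlex.quote(a) for a in args))
    return commands


SAMPLE = r"""EFI0000001,PROD001,.\PROD001\IMAGES\IMG001\EFI0000001.tif,Y,,,4
EFI0000002,PROD001,.\PROD001\IMAGES\IMG001\EFI0000002.tif,,,,
EFI0000003,PROD001,.\PROD001\IMAGES\IMG001\EFI0000003.tif,,,,
EFI0000004,PROD001,.\PROD001\IMAGES\IMG001\EFI0000004.tif,,,,
EFI0000005,PROD001,.\PROD001\IMAGES\IMG001\EFI0000005.tif,Y,,,5
"""

commands = tiffutil_commands(SAMPLE.splitlines())
print("\n".join(commands))
```

Redirecting the printed lines into a file gives the shell script. Single-page documents get a command too, although simply copying those files would probably do.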

I don’t know if running a 250,000-line script will make my MacBook blow up, or how long it will take, but if all goes well I should have a complete set of documents by morning. As long as I didn’t make any typos.

Meanwhile, I can’t figure out how to get DEVONthink to show me more than the first page of any multi-page tiffs.

temjeito · August 15, 2015, 10:15pm

I’d be crazy to try to import 300,000 images into DEVONthink with OCR, right?

BLUEFROG · August 15, 2015, 10:26pm

If tried in one single batch… yes. This would need to be split up into many, many batches - or be run on a server running DEVONthink with absolutely nothing else to do for a good long while. Alternatively, you could use a third-party OCR app, but 300,000 files will take a VERY long time in any case.

korm · August 16, 2015, 11:48am

300,000 files at 30 seconds per page (assuming one page per file, and a FAST OCR processor): that’s about 104 days flat out, no breaks. Got a lot of peanut butter and bread standing by for energy while you monitor the process?
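
The arithmetic behind that figure, as a quick Python check. The 30 seconds per page is the assumed rate from the post, not a measured one.

```python
pages = 300_000
seconds_per_page = 30  # assumed OCR speed, taken from the estimate above

total_seconds = pages * seconds_per_page
days = total_seconds / 86_400  # seconds in a day

print(f"{total_seconds:,} s = {days:.1f} days")
```

That works out to 9,000,000 seconds, or a little over 104 days of continuous processing.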