I have a set of single-page TIFFs with text files containing the OCR’ed text extracted from the TIFF images, and bearing the same name as the corresponding image file. There is also a delimited-text “load” file that contains metadata, including information about which images constitute which documents:
EFI0000001,PROD001,.\PROD001\IMAGES\IMG001\EFI0000001.tif,Y,,,4
EFI0000002,PROD001,.\PROD001\IMAGES\IMG001\EFI0000002.tif,,,,
EFI0000003,PROD001,.\PROD001\IMAGES\IMG001\EFI0000003.tif,,,,
EFI0000004,PROD001,.\PROD001\IMAGES\IMG001\EFI0000004.tif,,,,
EFI0000005,PROD001,.\PROD001\IMAGES\IMG001\EFI0000005.tif,Y,,,5
EFI0000006,PROD001,.\PROD001\IMAGES\IMG001\EFI0000006.tif,,,,
EFI0000007,PROD001,.\PROD001\IMAGES\IMG001\EFI0000007.tif,,,,
EFI0000008,PROD001,.\PROD001\IMAGES\IMG001\EFI0000008.tif,,,,
EFI0000009,PROD001,.\PROD001\IMAGES\IMG001\EFI0000009.tif,,,,
EFI0000010,PROD001,.\PROD001\IMAGES\IMG001\EFI0000010.tif,Y,,,1
EFI0000011,PROD001,.\PROD001\IMAGES\IMG001\EFI0000011.tif,Y,,,5
EFI0000012,PROD001,.\PROD001\IMAGES\IMG001\EFI0000012.tif,,,,
EFI0000013,PROD001,.\PROD001\IMAGES\IMG001\EFI0000013.tif,,,,
EFI0000014,PROD001,.\PROD001\IMAGES\IMG001\EFI0000014.tif,,,,
EFI0000015,PROD001,.\PROD001\IMAGES\IMG001\EFI0000015.tif,,,,
EFI0000016,PROD001,.\PROD001\IMAGES\IMG001\EFI0000016.tif,Y,,,4
EFI0000017,PROD001,.\PROD001\IMAGES\IMG001\EFI0000017.tif,,,,
EFI0000018,PROD001,.\PROD001\IMAGES\IMG001\EFI0000018.tif,,,,
EFI0000019,PROD001,.\PROD001\IMAGES\IMG001\EFI0000019.tif,,,,
The columns are, left to right:
Unique identifier,
group identifier,
file path,
[Y/blank], where Y indicates the first page of a document,
n/a,
n/a,
page count.
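To make the grouping concrete, here is a rough Python sketch of how I imagine the load file could be read. It assumes a row with Y in the fourth column starts a new document and every following row without Y belongs to that same document. The function name group_pages and the dictionary keys are made up for illustration, and the sample is just the first five rows from above.

```python
import csv
import io

# First five rows of the load file shown above (raw string keeps the backslashes).
SAMPLE = r"""
EFI0000001,PROD001,.\PROD001\IMAGES\IMG001\EFI0000001.tif,Y,,,4
EFI0000002,PROD001,.\PROD001\IMAGES\IMG001\EFI0000002.tif,,,,
EFI0000003,PROD001,.\PROD001\IMAGES\IMG001\EFI0000003.tif,,,,
EFI0000004,PROD001,.\PROD001\IMAGES\IMG001\EFI0000004.tif,,,,
EFI0000005,PROD001,.\PROD001\IMAGES\IMG001\EFI0000005.tif,Y,,,5
"""


def group_pages(load_file_text):
    """Group page rows into documents (hypothetical helper).

    Assumption: a 'Y' in column 4 marks the first page of a document.
    """
    documents = []
    for row in csv.reader(io.StringIO(load_file_text)):
        if len(row) < 4:
            continue  # skip blank or malformed lines
        page_id, group_id, image_path, first_page_flag = row[:4]
        if first_page_flag.strip().upper() == "Y" or not documents:
            # Start a new document, named after its first page.
            documents.append({"doc_id": page_id, "group": group_id, "pages": []})
        documents[-1]["pages"].append(image_path)
    return documents


documents = group_pages(SAMPLE)
for doc in documents:
    print(doc["doc_id"], len(doc["pages"]), "page(s)")
```

Since each text file has the same base name as its TIFF, the OCR text for a page could presumably be found by swapping the extension on each path in "pages", but where the text files live relative to the images is something I would have to check.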
Finally, there is another file with substantive metadata, which uses odd delimiters:
þProd Beg BatesþþProd End BatesþþProd Beg AttachþþProd End AttachþþPage CountþþEmail SubjectþþFromþþRecipientsþþCCþþBCCþþSent DateþþSent TimeþþReceived DateþþReceived TimeþþSubjectþþFile NameþþAuthorþþCreate DateþþCreate TimeþþModified DateþþModified TimeþþCustodianþþFile ExtensionþþOriginal PathþþMD5 HashþþTextFileþþNativeFileþ
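For reference, here is a minimal Python sketch of how a line like that header might be split into field names. It rests on assumptions I have not verified: that þ wraps each field as a text qualifier, and that the fields are separated by the non-printing control character 0x14, which may simply not have survived pasting here. If no such character is present, the sketch falls back to splitting on the doubled þ, which cannot tell an empty field apart from a field boundary. The function name split_fields is made up.

```python
QUOTE = "\u00fe"    # þ, assumed to wrap every field
SEPARATOR = "\x14"  # assumed field separator; not visible in the pasted header


def split_fields(line):
    """Split one line of the metadata file into a list of field values."""
    line = line.rstrip("\r\n")
    if SEPARATOR in line:
        fields = []
        for part in line.split(SEPARATOR):
            # Remove exactly one qualifier from each end, if both are present.
            if len(part) >= 2 and part.startswith(QUOTE) and part.endswith(QUOTE):
                part = part[1:-1]
            fields.append(part)
        return fields
    # Fallback: no separator character found, so treat "þþ" as the boundary.
    if line.startswith(QUOTE):
        line = line[1:]
    if line.endswith(QUOTE):
        line = line[:-1]
    return line.split(QUOTE + QUOTE)


header = "þProd Beg BatesþþProd End BatesþþPage Countþ"
print(split_fields(header))
```

If the first row is parsed this way, later rows could be zipped against it to get a dictionary per document, keyed for example by "Prod Beg Bates", which looks like it should match the unique identifier of the first page in the load file.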
This is a format designed to be imported into a commercial document database called “Relativity”, which is commonly used by attorneys. I’d like to import the docs into DEVONthink Pro Office as a database, but unless I can distinguish the individual documents and link the text to the TIFFs, there is little point.
Is this feasible?
-M.