Importing Database files to DEVONthink Database - Legal etc.

Hi all,

I am trying to figure out a way (probably through AppleScript) to import a set of PDFs and modify their metadata within DTPro based on a data file from a legal document database (such as Concordance). The database file basically has a row of fields such as begpage, endpage, createdate, author etc., and I want to capture that data when I import the files into DTPro. As it stands, there are probably not the right number of fields in the DTPro database to capture all of these, but I could at least get the most important data from these files and import them. Can a script modify all the data fields? I may reach out directly to Eric to see if this is something they would like to support, as I know they certainly target the legal market, but I wondered what people on the boards thought. Thanks.

Not enough information to know how to program this … but …

Of the metadata mentioned (guessing at their exact definitions): “createdate” can be set in a script. “begpage”/“endpage” have no equivalent properties in DEVONthink, so you would have to store them in a comment, or not at all. “author” can be set, but not easily.

And, once you have imported this library with associated metadata – what would you want to be doing with the database? Your use case(s) would also affect what a programmer would design.

Have you considered using a reference manager – for example Bookends, which works well with DEVONthink?

Hello gelbin,

I manage a number of constantly updated trial litigation databases (20,000+ documents each) with DEVONthink, so I have constructed workflows to deal with scenarios like yours. There are limited options for adding metadata to documents, as you cannot programmatically modify the document metadata itself, only the metadata stored internally by DEVONthink.

Some of the fields typically found in legal indexes are easy to deal with. An AppleScript can read in a CSV with a discovery schedule and allocate the fields appropriately:

Dates
DEVONthink has two usable date fields, creation date and date added. Typically I will use the creation date for the date credited to the document in the discovery schedule. The date added field is useful for items like email attachments, which need both the date of their email (so they appear correctly in sort order) and their own logical date of creation. This allows for two views of a document: the date it was transmitted and the date it was created.

People
People such as authors, participants, observers and correspondents are best dealt with by storing them as tags, perhaps with a prefix to distinguish their use in other contexts. So the author of a document might be tagged ‘au-jsmith’, the participant ‘pt-jsmith’, and so forth. This allows for easy creation of smart groups to extract the documents relevant only to jsmith, whether he is the author or a participant.
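As a sketch of that tag-naming convention (in Python rather than AppleScript, purely to illustrate the idea; the prefixes and field names are assumptions based on the examples above):

```python
# Sketch of the prefixed-tag convention described above.
# The prefixes ("au-", "pt-", "co-") and field names are illustrative assumptions.
PREFIXES = {"Author": "au", "Participant": "pt", "Correspondent": "co"}

def make_tags(row):
    """Turn the people fields of one metadata row into prefixed tags."""
    tags = []
    for field, prefix in PREFIXES.items():
        for person in row.get(field, "").split(";"):
            person = person.strip().lower().replace(" ", "")
            if person:
                tags.append(f"{prefix}-{person}")
    return tags

row = {"Author": "J Smith", "Participant": "J Smith; A Jones"}
print(make_tags(row))  # ['au-jsmith', 'pt-jsmith', 'pt-ajones']
```

A smart group matching the tag `au-jsmith` then pulls out everything authored by that person, independently of documents where he merely participated.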

Document numbering
Document numbering (such as the item number in a discovery schedule) should be added as a zero-padded number prefixed to the document description. Keeping a consistent numbering scheme makes it easy to extract the document number and description for purposes such as preparing indexes.
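The zero-padded prefix might look like this (a minimal sketch; the four-digit width is an assumption):

```python
def numbered_name(item_no, description, width=4):
    """Prefix a description with a zero-padded discovery item number
    so that lexicographic sorting matches numeric order."""
    return f"{item_no:0{width}d} {description}"

print(numbered_name(7, "Letter from Smith to Jones"))  # 0007 Letter from Smith to Jones
print(numbered_name(123, "Invoice"))                   # 0123 Invoice
```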

Some fields are more complex to deal with …

Page Numbering
A peculiar characteristic of litigation that distinguishes it from most other fields involving document management, is that the numbering of documents is constantly changing and being added to. Different jurisdictions around the world deal with this problem differently but typically the problem is this:
A document starts off with no number. When a document is incorporated in a discovery schedule it will get a page number specific to that discovery schedule. The document will be numbered yet again when it is used in an interlocutory application. At some point the document may form part of a trial bundle, at which point it will receive yet another number. All along the way these documents are being referred to in trial briefs, affidavits and various forms of submissions, and the reference must be to the correct page number in context.

Using conventional techniques, take a few plaintiffs and defendants with a couple of thousand documents, with discovery schedules, supplementary discovery schedules and requests for discovery flowing back and forth at a furious pace, and you have a job which can have attorneys and clerks pulling their hair out for weeks at a time while trying to understand how the case is evolving and not miss a beat. DEVONthink is better than Red Bull and Prozac for dealing with this. It is, however, a topic all of its own which may not be relevant to you.

This topic has already probably become much too esoteric for this forum but if you would like to discuss it more please send me a private message.

Frederiko

1 Like

Thanks to both of you. The data sets that I am looking at are standard Concordance output sets. Tags seem like a good way to somewhat deal with the issue, though the number of tags may get lengthy. It seems tags are easily applied through AppleScript. I understand that some of the metadata fields in DTPro are harder than others to update by script, so I need to look into it. Once again, we are talking about databases with millions of pages, and thus hundreds of thousands of documents. A challenge… but…

Hi again:

Let me give an example. When documents are searched for and produced by a company, one of the important pieces of information is whose file the document came from - the custodian. This information can be conveyed as part of the Concordance data file that contains metadata from the document set being produced. I want to be able to capture this information in some way in DTPro (which is cheaper and way better than the standard litigation databases).

I am hoping that I can translate the data provided to fields within DTPro, or, as suggested, use tags for those things for which there is no good place to put the data. For example, I use the Spotlight comments field for my “attorney notes” on a document. I also use the URL field as a place to store notes. I have noticed that some of the fields are not editable, though I have been able to edit some of them using scripts. So I am hoping that I can figure out:

  1. which DT fields are easily scripted

  2. then translate the data to appropriate fields using a script

  3. then create tags for the remaining fields that I want to capture the data from and apply tags as indicated by the data

Does that make any more sense?

If someone can help me answer 1 above, then I would be off and running - though I am sure I'll soon be back here for help!

For anyone paying close attention, here is what the data fields are in the data file:

BegBates EndBates BegAttach EndAttach Custodian Author Sender To CC BCC Title/Subject DateSent DateRcvd DateCrtd DateSvd LastPrintedDate FileExtension FilePath Attachment Name MessageUnitID NativeLink


Just a note: Spotlight Comments are visible and searchable on your machine. Depending on how “private” this data is, you may not want to use Spotlight Comments for this purpose.

Gotcha.

You really want to use tags to manage the metadata for all those documents? It’s easy to delete a tag from the Tags group and never know it happened.

Just saying. If it was my livelihood I’d probably not consider tags a good option.

Fair enough. I think I could use replicants in folders… scriptable?

Two comments -

a) consider very carefully which of this information you actually need. Some of it looks like information specific to identifying the document in Concordance and may be irrelevant for your purposes. A good example is the LastPrintedDate field. I assume this is the date the document was last printed from the Concordance database, and it would thus be entirely irrelevant to you.

b) the most often overlooked place for storing metadata is the filename.

In general, store metadata in tags when the precise piece of metadata is shared by more than one item. Store metadata in the filename or description when the metadata is specific to that document and will never be repeated (such as a document identification number). Store data in the filename where you are unlikely to ever have to look at the data directly but want it available for possible future export, or to be able to search for it.

Store metadata in a file name with a consistent schema and you will be fine. Something like this works - _UnitID_DateSaved_Custodian_File name
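That schema amounts to a simple join; a sketch (the field values are hypothetical, and the leading underscore follows the example above):

```python
def schema_filename(unit_id, date_saved, custodian, original_name):
    """Build a filename following the _UnitID_DateSaved_Custodian_File name schema.
    The empty first element produces the leading underscore."""
    return "_".join(["", unit_id, date_saved, custodian, original_name])

print(schema_filename("MU0001", "2010-03-15", "jsmith", "memo.pdf"))
# _MU0001_2010-03-15_jsmith_memo.pdf
```

Because the separator is consistent, the embedded fields can later be recovered with a single split on "_" for export or searching.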

Here are some suggestions on how I would deal with each item:

BegBates & EndBates: Only the first page number needs to be stored. The end page number is calculable. This is best stored as a prefix in the Spotlight comment.
BegAttach & EndAttach: I am guessing, but I would think this is similarly a calculable page number
Custodian: filename
Author: tag
Sender: tag
To: tag
CC: tag
BCC: tag
Title/Subject: Descriptions
DateSent: Date added
DateRcvd: Date modified
DateCrtd: Would probably ignore
DateSvd: Would probably ignore
LastPrintedDate: Would probably ignore
FileExtension: already saved as part of filename
FilePath: probably relevant only to concordance
Attachment Name: I would think the attachment is stored separately. If you really need to tie the file and attachment together (and it doesn't flow from the ordering of the documents), adopt a schema whereby a file is prefixed in its description with a number such as 0100 … and an attachment will be 0101 …
MessageUnitID: filename
NativeLink: probably ignore
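The suggestions above can be condensed into a lookup table that an import script works through row by row; sketched here in Python for clarity (the destinations mirror the list above, and any field not in the table is simply ignored):

```python
import csv
import io

# Destination for each Concordance field, following the suggestions above.
DESTINATION = {
    "BegBates": "comment",
    "Custodian": "filename",
    "Author": "tag", "Sender": "tag", "To": "tag", "CC": "tag", "BCC": "tag",
    "Title/Subject": "description",
    "DateSent": "date added",
    "DateRcvd": "date modified",
}

def plan_row(row):
    """Group one CSV row's non-empty fields by where they should be stored."""
    plan = {}
    for field, value in row.items():
        dest = DESTINATION.get(field)  # fields like NativeLink or FilePath are skipped
        if dest and value:
            plan.setdefault(dest, []).append((field, value))
    return plan

sample = io.StringIO("BegBates,Custodian,Author,NativeLink\nAB0001,jsmith,ajones,x.msg\n")
for row in csv.DictReader(sample):
    print(plan_row(row))
# {'comment': [('BegBates', 'AB0001')], 'filename': [('Custodian', 'jsmith')], 'tag': [('Author', 'ajones')]}
```

The actual writing of each destination (tags, filename, comment, dates) would still be done from AppleScript against DEVONthink; this only shows the routing logic.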

1 Like

I appreciate the thoughtful response. I think I will work on something that can script inputting the data from the Concordance file somewhat along the lines of your suggestions. I know there is a concern about using tags as a long-term spot to reference documents, but I use them that way now, and I am not sure of a better way to go about things, despite the risks. There are more robust solutions, but they suck in terms of usability, so I am left getting by with something not entirely ideal either way. I will come back to this when I get some time to try scripting things. Thanks again.

1 Like

Hello again, I have a new case and new documents and am finally getting around to trying to do this. I can bring the metadata into a sheet (from a CSV) that has columns of metadata. I am planning on translating the columns in the sheet to fields for each record in DTPro. I am at the beginning stages of trying to get the “Custodian” reference out of column 9 of the sheet and copying it to the “Author” field for the first record, and then on down through the sheet/records. What I have now is not doing a darn thing. Anyone see what I am missing? At least if it was doing something, I would feel like I could correct it, but it is not doing anything I can see.

Help appreciated.

tell application "DEVONthink Pro"
	set this_selection to the selection
	if this_selection is {} then error "Please select a sheet."
	set theSheet to item 1 of this_selection
	-- rowCnt was never set in the original, so the repeat loop never ran
	set rowCnt to count of cells of theSheet
	set theCustodianArray to {}
	set currRow to 2 -- start at 2 to skip the header row
	repeat while currRow ≤ rowCnt
		try
			-- row and column were swapped in the original; Custodian is column 9
			set Custodian to get cell at theSheet row currRow column 9
			copy Custodian to the end of theCustodianArray
			-- "Custdian" was a typo; note "child currRow" assumes the records
			-- sit in the same group, in the same order as the sheet rows
			set author of child currRow to Custodian
		on error errMsg
			log errMsg -- the original bare try block silently swallowed every error
		end try
		set currRow to currRow + 1
	end repeat
	set theMessage to "Custodian: " & (theCustodianArray as string)
	display alert "DEVONthink Pro" message theMessage
end tell

1 Like

Did you have any luck? I’m in a similar situation and just starting to look for options.

I have a script working that does the job for some of my desires.

I am back at it once again. I have a script that makes the metadata into tags. One question that I feel I should know the answer to, but don't:

How do I script the PDF metadata (like the author field, etc.)? I could use those to store some of these fields. Thanks.