Organizing large data collection

Kai108 · May 13, 2016, 12:39pm

I’m a bit helpless when trying to work with a large, ill-behaved data collection.
My starting point is a large collection of data on events, including various document types:

Media (video, audio, image scan, transcript, edited summary)
File type (mp4, mp3, doc, pdf; the PDFs are typically text only, some with embedded video or audio)
Languages (mostly English, some Italian, German, other, mixed)
Scope (full event, summary, both)
Naming (most filenames include event-code, many also event-date)
Periodically, twice a year, I receive an “updated” version of some of the material (1 TB).
To work with the material efficiently, I need some way to handle this variety:

eliminate real duplicates;
update filenames to include a unique event identifier;
automate this process to some degree, to apply to periodic updates
I can imagine how to work with the material (adding tags, creating intelligent groups, …) once the raw material mess is sorted out.
Accomplishing the prep tasks has proved (too) challenging for me: ‘duplicate’ doesn’t consider file type, so I need “sequential” groups. That works for limited variety, but it is easy to lose consistency.
Any suggestions for a way to best approach this?
Kai
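
For the "eliminate real duplicates" step, one possible starting point outside the database is to group files by a hash of their contents. The following is only a minimal sketch in Python, assuming the material sits in an ordinary folder tree; the function names `file_digest` and `find_exact_duplicates` are made up for illustration. It finds byte-identical files only, so a PDF and a doc of the same transcript will not be matched.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_exact_duplicates(root):
    """Map digest -> sorted list of paths for content that occurs more than once."""
    # First pass: group by size, which is cheap and avoids hashing most files.
    by_size = defaultdict(list)
    for p in Path(root).rglob("*"):
        if p.is_file():
            by_size[p.stat().st_size].append(p)

    # Second pass: hash only files that share their size with another file.
    by_digest = defaultdict(list)
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a byte-identical twin
        for p in paths:
            by_digest[file_digest(p)].append(p)

    return {d: sorted(ps) for d, ps in by_digest.items() if len(ps) > 1}
```

Grouping by size first matters with video in the terabyte range, since a file with a unique size is never read at all. The result can be reviewed by hand before anything is deleted.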

Frederiko · May 13, 2016, 1:24pm

I would keep a copy of each dataset separately for comparison purposes, i.e. not in DT but in the form in which it arrived. Before trying to merge a new dataset into my tagged and marked DT database, I would compare the previous dataset with the most current one using a tool like VisualDiffer. There are probably more sophisticated tools for this kind of comparison than VisualDiffer, but I haven’t had need of them. It seems that you need a very sophisticated comparison tool with logging and versioning capabilities.

Frederiko
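
The comparison of the previous dataset with the current one can also be scripted, which helps with the "automate for periodic updates" goal. This is a rough sketch in Python, not a replacement for a proper diff tool: it assumes both datasets are kept as plain folders, and `snapshot` and `compare_snapshots` are illustrative names, not part of any existing tool.

```python
import hashlib
from pathlib import Path


def snapshot(root, chunk_size=1 << 20):
    """Map each file's path relative to root -> SHA-256 digest of its contents."""
    root = Path(root)
    result = {}
    for p in root.rglob("*"):
        if not p.is_file():
            continue
        h = hashlib.sha256()
        with open(p, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        result[p.relative_to(root).as_posix()] = h.hexdigest()
    return result


def compare_snapshots(old, new):
    """Report relative paths that were added, removed, or changed between snapshots."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }
```

A call such as `compare_snapshots(snapshot("dataset_old"), snapshot("dataset_new"))` (folder names hypothetical) returns three lists that can be logged for each update. Files that were renamed between deliveries show up as one removal plus one addition, so this sketch does not track renames.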

Kai108 · May 14, 2016, 9:32am

Thanks for the quick reply. I’ll prepare the data using VisualDiffer, and see how far it gets me.
I was considering keeping the video data (1.5 TB) on an external disk while importing the other files (after some weeding). Is that a good idea?
Furthermore, is there a smart way to keep track of the scope of data available for each ‘event’, like ‘pdf only’ or ‘no video’?
I thought of smart groups, but that gets messy pretty soon with all the combinations of criteria, plus the “intermediate” groups needed for nested filters. Tags could also be a way, but they seem to require great discipline to keep consistent.
Thanks, Kai
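
One way to keep track of the scope per event without maintaining it by hand is to derive it from the filenames. The sketch below is Python and rests on two loud assumptions that are not from the thread: that event codes look like `EV` followed by four digits, and that file extension is a good enough proxy for media type. The pattern `EVENT_CODE`, the table `MEDIA_BY_EXT`, and both function names are hypothetical and would need adapting to the real naming scheme.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumption: event codes look like "EV0123". Adjust to the real scheme.
EVENT_CODE = re.compile(r"EV\d{4}")

# Assumption: extension -> media category. Extend for the real collection.
MEDIA_BY_EXT = {".mp4": "video", ".mp3": "audio", ".pdf": "pdf", ".doc": "doc"}


def scope_by_event(filenames):
    """Map event code -> set of media categories found for that event."""
    scope = defaultdict(set)
    for name in filenames:
        match = EVENT_CODE.search(name)
        event = match.group(0) if match else "UNKNOWN"
        media = MEDIA_BY_EXT.get(Path(name).suffix.lower(), "other")
        scope[event].add(media)
    return dict(scope)


def describe(media):
    """Turn a set of media categories into a short label such as 'pdf only'."""
    kinds = sorted(media)
    if len(kinds) == 1:
        return f"{kinds[0]} only"
    if "video" not in kinds:
        return "no video"
    return ", ".join(kinds)
```

Because the labels are computed from the files themselves, they can be regenerated after each update, which sidesteps the discipline problem with hand-maintained tags. Files whose names carry no recognizable event code land under `UNKNOWN` and can be reviewed separately.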