Auto Grouping: How Robust is it?

So here’s the deal.

I came into possession of a large archive of e-mail messages (in mbox format, if you’re familiar with it) and I wanted to suck them into DEVONthink Pro to see what it’d make of it all. I wrote up a little Ruby script to chunk the file into almost 10,000 HTML files (one per message) and then imported them into a new database. So far, so good.
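For anyone wanting to reproduce the splitting step, here is a minimal sketch in Python rather than Ruby. It is not the script used above: the function names, the output file naming, and the HTML layout are all assumptions made for illustration, and it only keeps the first plain-text part of each message.

```python
import html
import mailbox
from pathlib import Path


def body_text(message):
    """Return the first text/plain part of a message, or '' if there is none."""
    for part in message.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True) or b""
            charset = part.get_content_charset() or "utf-8"
            try:
                return payload.decode(charset, errors="replace")
            except LookupError:
                # Unknown charset label in the message: fall back to UTF-8.
                return payload.decode("utf-8", errors="replace")
    return ""


def split_mbox(mbox_path, out_dir):
    """Write one HTML file per message in the mbox; return how many were written.

    The file naming (message-00001.html, ...) is a hypothetical choice.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    box = mailbox.mbox(mbox_path)
    count = 0
    try:
        for count, message in enumerate(box, start=1):
            subject = html.escape(str(message.get("Subject", "(no subject)")))
            sender = html.escape(str(message.get("From", "")))
            body = html.escape(body_text(message))
            page = (
                f"<html><head><title>{subject}</title></head><body>"
                f"<h1>{subject}</h1><p>From: {sender}</p><pre>{body}</pre>"
                "</body></html>"
            )
            (out / f"message-{count:05d}.html").write_text(page, encoding="utf-8")
    finally:
        box.close()
    return count
```

Calling `split_mbox("archive.mbox", "html_out")` (both paths hypothetical) would leave a folder of HTML files that can be imported in one go.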

Now I’m wondering how well the “Auto Group” feature will stand up to the challenge of grouping this volume of data. I’ve tried a couple of small samples – say, selecting a hundred or so messages and auto-grouping them – and it finished the job in a few seconds (and it did an admirable job of grouping them, I might add). But I’m wondering what the big-O complexity of this operation is, and whether it’s just going to cause my MacBook to burst into flames if I ask it to auto-group all 10,000 items.
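To put a rough number on that worry: if auto-grouping compared every item against every other item, the work would grow with the number of pairs, n(n-1)/2. That is purely an assumption for the sake of the estimate; nothing here says how DEVONthink Pro actually does it. A quick sketch of the pair counts for a 100-item sample versus the full set of about 9600:

```python
def pair_count(n):
    """Number of distinct unordered pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2


for n in (100, 9600):
    print(f"{n} items -> {pair_count(n)} pairs")
```

Under that assumption the full set involves roughly 9,300 times as many comparisons as the 100-item sample, so a few seconds on the sample would not scale to a few minutes on the whole database.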

Does anyone have experience using the auto-grouping feature on data sets of this size? Any suggestions about how I could attack the problem differently?

Well, since there weren’t any immediate replies, and because I’m just that impatient, I charged ahead to see what DEVONthink Pro would do with the aforementioned database.

I selected all of the approximately 9600 message items and asked DTPro to auto-group them. The first pass, in which it “compared text” for the items, took approximately 20 minutes. The second pass, in which it grouped the items, took another 35 minutes. So, close to an hour’s worth of processing on my 2.0 GHz MacBook with 1 GB of RAM – but it did finish.

I was a little disappointed (and surprised) by the results. It did create 303 new groups, each of which contained a handful of messages – presumably with each group corresponding to all of the messages in a given thread – but it left the remaining 9000 or so messages ungrouped. I may try again to auto-group those remaining messages, but I assume that it will take another hour to do so and I’m not sure what to expect at the end of the run; will it attempt to construct groups for those items this time?

Lyle, the results will vary depending on the content of the items – how closely they are actually contextually related.

I’ve sometimes dumped thousands of documents into a new database and then done some preliminary manual organization by doing a series of searches, and moving each set of search results into a new group.
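The search-then-move routine can be sketched outside the application like this. It is only an illustration of the idea: it uses plain case-insensitive substring matching rather than DEVONthink's search, and the function name, the data layout, and the "first matching search wins" rule are all assumptions.

```python
def group_by_search(documents, searches):
    """Bucket documents by the first search term they contain.

    documents: dict mapping document name -> text
    searches:  dict mapping group name -> search term (checked in order)
    Returns (groups, unclassified), where groups maps group name -> list of
    document names and unclassified lists everything no search matched.
    """
    groups = {name: [] for name in searches}
    unclassified = []
    for doc_name, text in documents.items():
        lowered = text.lower()
        for group_name, term in searches.items():
            if term.lower() in lowered:
                groups[group_name].append(doc_name)
                break  # a document is moved into one group only
        else:
            unclassified.append(doc_name)
    return groups, unclassified
```

Whatever ends up in `unclassified` is the material left over for a later pass.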

I’ve also used the auto-group command on still unclassified material.

Once I’ve created at least a rudimentary organization that makes sense to me, I’ll then move to auto-classify blocks of the still unclassified material.

Sometimes I’ll end up with a rag-tag collection of items that neither DT Pro nor I can figure out how to handle. That goes into a new group named “Unclassified”. :)