Correct use of Auto group and Auto Classify...

eben · August 6, 2012, 8:23am

I find myself avoiding these as DT gets it wrong more often than not. Is this due to my documents not being classified correctly already?

I’m about to go paperless and I’m tempted to create a whole new database for all my scanned documents. This leads me to believe that I might be able to scan the whole lot and let auto group do the heavy lifting. Can anybody comment on success in this area?

edgley · August 6, 2012, 11:53pm

Am about to do the same, so no tips, but we can share the pain together

eben · August 7, 2012, 12:12am

I should get my scanner in the next couple of days. I’ll try to post everything I’m doing so it might help, and others might have some bright ideas. I’m trying to automate as much as I possibly can. I’ll be writing AppleScripts, shell scripts and whatever other utilities I need to make this as hands-off as possible.

I’ll also be automating backups to Amazon S3 so it might get interesting.

I’m determined to get the software to do as much for me as it possibly can. I’d love to have it down to just feeding the scanner and pushing its scan button with everything happening by magic™. I think I should be able to automate file naming and classifying of repetitive documents, leaving only exceptions to be dealt with.

edgley · August 7, 2012, 1:50am

Good luck, I don’t have enough to worry about automating it.

I do consider this magic though:
amazon.co.uk/Vupoint-Magic-W … 94&sr=8-14

So many more things can now be captured

eben · August 7, 2012, 10:25am

iPhone cameras are pretty good, you’d be surprised what you can capture with it. We “scan” white boards all the time with phone cameras. I sometimes wonder why we still have printing white boards in this day and age (I guess it’s the older generation still hanging on to it).

Also I’ve got a tendency to hoard. Devonthink has been both a blessing and a curse. I am, however, working on a solution to review everything periodically to purge stale information.

pvonk · August 8, 2012, 1:44am

I have one of these, it’s the most frustrating device I ever bought. I can never get a good straight scan, nor can others in my office. Scanned text is all wavy. Blew $100 on that purchase. I’m actually thinking of just throwing it away.

But using a scan app on my iPhone (takes pictures in a special way), I get pretty good scans.

eben · August 8, 2012, 10:20pm

So I got my new toy, the ScanSnap s1500M, and it’s as awesome as what people said it would be, 5 year old technology to boot.

So I scanned every piece of paper on my desk and in my paper inbox (it’s now gone off to a better home). After DTPO OCR’d it I dropped about 130 documents into a new empty database. From there I selected all documents and auto grouped. That worked rather well, but it still grouped some documents in a rather strange way, so I selected all documents in the group and auto grouped again. It seems to get it right after about 2 or 3 levels of nesting.

I have another question, and this is possibly for the developers. How do sub groups affect auto classify? I get the idea that you have to work with the AI rather than expect it to adapt to you. Could someone please explain how the grouping and classifying algorithms work? I’m assuming it’s a basic Bayesian classifier, but it’s so smart, so maybe it’s a little tweaked.
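To make the "basic Bayesian classifier" guess concrete, here is a minimal Python sketch of a naive Bayes classifier in which each group is a class and the documents already filed in it are the training data. It only illustrates that assumption. Nothing in this thread confirms how DEVONthink actually classifies, and the class name `NaiveBayesGroups`, the group names and the sample texts are all made up for the example.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesGroups:
    """Toy multinomial naive Bayes: one class per group, documents as bags of words.

    Hypothetical illustration only; not DEVONthink's actual algorithm.
    """

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # group -> word -> count
        self.doc_counts = Counter()              # group -> number of documents
        self.vocab = set()                       # every word seen so far

    def add(self, group, text):
        """File a document into a group, which also trains the classifier."""
        words = text.lower().split()
        self.word_counts[group].update(words)
        self.doc_counts[group] += 1
        self.vocab.update(words)

    def classify(self, text):
        """Return the most probable group for the text, or None if untrained."""
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best_group, best_score = None, -math.inf
        for group, n_docs in self.doc_counts.items():
            counts = self.word_counts[group]
            total_words = sum(counts.values())
            # Log prior: how many documents the group already holds.
            score = math.log(n_docs / total_docs)
            for word in words:
                # Laplace smoothing so unseen words do not zero out a group.
                score += math.log((counts[word] + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_group, best_score = group, score
        return best_group


clf = NaiveBayesGroups()
clf.add("Insurance", "car insurance policy premium renewal")
clf.add("Insurance", "home insurance policy premium")
clf.add("Parking", "parking receipt ticket fee")
clf.add("Parking", "parking building receipt")
print(clf.classify("insurance premium invoice"))
```

In a model like this, the contents of every group feed the word statistics, so a group that mixes unrelated documents blurs the suggestions for everything else. That would be one possible reason why specific, consistent groups classify better than broad ones.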

I’m determined to crack this correct usage pattern since it will seriously make life a lot easier.

For those wondering about my automatic renaming of files… I’ve taken a leaf out of other members’ books. As long as my groups are named correctly it doesn’t really matter what the file is called. I do think this requires your groups to be rather specific though!

edgley · August 9, 2012, 9:25pm

I didn’t know about using auto groups to let the AI do the work from the start.
That changes things.

Hmm, maybe time for a test DB and see what that comes up with, thank you for this info!

eben · August 9, 2012, 10:20pm

I’ve done another test. Last night I scanned my company’s first year of trading (it was only 100 PDFs) and I let Auto Group do its magic again.

It grouped most of the documents and left maybe 20 ungrouped (to be fair, I think the OCR had a hard time with them, so it couldn’t really classify them). Everything that it grouped automatically it sub grouped correctly too.

I did notice that receipts for parking buildings were all grouped together by parking company, my instinct tells me to create a new combined group with them as sub groups, but I don’t know how that would affect the AI.

From past experience if you manually group things the AI does not classify things all that well later on. I’d really appreciate other people’s experience with this. I think there must be a way to get the AI to do the lifting 99% of the time if I can figure out how to work with it rather than make it work with me.

edgley · August 9, 2012, 10:25pm

I have just dumped my 326 files into the inBox.
I have deleted all my existing groups.
The only two I left are the index to my graphics files and my RSS.

I have run auto group, am having a look, but it don’t look too good so far.

Good job this is a copy of my main DB

eben · August 9, 2012, 10:38pm

Run auto group on things inside groups if they don’t look good, repeat the process and see until it starts to look better.

My first reaction was the same as yours. This will give you an idea of how the AI classifies things.

Initially I had NDA’s, Contractor Contracts and Insurance all in the same group.

I auto grouped across that and it created 3 groups with the documents in the right groups.

Insurance was still for 3 different things. After I ran auto group on that I had another 3 groups. Things looked much better.

What this means is that your groups might be scattered a little, but it does work. This is why I’m trying to figure out if sub groupings affect the classifier.
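One way to picture why re-running auto group inside a group can give a cleaner split is the Python sketch below. It is a guess at the general idea, not DEVONthink's actual algorithm: the function `auto_group`, the `threshold` value, the weighting scheme and the sample documents are all assumptions made up for illustration. Words are weighted by how rare they are within the current selection, so a word shared by every document in the selection stops counting and the remaining differences decide the grouping.

```python
import math
from collections import Counter


def auto_group(docs, threshold=0.2):
    """Greedily group docs (name -> text) by weighted word overlap.

    Hypothetical sketch. Word weights are computed from the docs passed in,
    so words shared by every document in the selection carry no weight.
    """
    bags = {name: set(text.lower().split()) for name, text in docs.items()}
    doc_freq = Counter(word for words in bags.values() for word in words)
    weight = {word: math.log(len(bags) / n) for word, n in doc_freq.items()}

    def similarity(a, b):
        union = sum(weight[w] for w in bags[a] | bags[b])
        if union == 0:
            return 1.0  # indistinguishable within this selection
        return sum(weight[w] for w in bags[a] & bags[b]) / union

    groups = []  # each group is a list of names; its first member is the seed
    for name in bags:
        best = max(groups, key=lambda g: similarity(g[0], name), default=None)
        if best is not None and similarity(best[0], name) >= threshold:
            best.append(name)
        else:
            groups.append([name])
    return groups


docs = {
    "car1": "insurance policy premium car",
    "car2": "insurance policy premium car renewal",
    "home1": "insurance policy premium home",
    "home2": "insurance policy premium home renewal",
    "park1": "parking receipt fee",
    "park2": "parking receipt fee ticket",
}

top_level = auto_group(docs)
print(top_level)  # insurance documents end up together, parking separately

# Second pass: auto group again, but only on the members of the first group.
second_pass = auto_group({name: docs[name] for name in top_level[0]})
print(second_pass)  # car and home insurance now separate
```

On the first pass "insurance", "policy" and "premium" pull all four insurance documents together. On the second pass those words appear in every selected document, so they weigh nothing and "car" versus "home" decides the split.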

edgley · August 9, 2012, 10:56pm

Sorry, getting confused.

I now have 40 groups and 50 ungrouped files; not looking very good.
Do you mean I now select them all, as they have been first auto-grouped, then run auto-group a second time?

Or undo the first auto-group, and run it again on a complete set of un-grouped files?

Thank you.

eben · August 9, 2012, 11:04pm

So the 50 ungrouped files will be a manual job (but I’m still unsure what this will do to the AI). What this is suggesting to me is that there’s nothing similar to it according to the AI.

Look inside the 40 groups (you have around 280 documents grouped in there). See how well they are grouped. If there’s more than one category of document in a group, select all the files in that group and run auto group again.

Just for your reference, I’ve run auto group on a database that contains 4500 archived emails. About 450 were not grouped (a quick look at some of them suggests that I might be able to simply delete most of them as they contain random one-off emails. Note this might not be the case, but I certainly don’t think I’ll be grouping them with the others).

I’ve looked at some of the groups and found one group that contains emails with links to youtube videos - it seems fairly consistent. I haven’t had time to investigate further.

I’m sure if we keep experimenting we’ll find a workflow that helps.

edgley · August 9, 2012, 11:07pm

One of the things it has done is not group things together enough, rather than too much.
For example, it has split the RSS stories into groups from each feed provider for most, but not all. So some RSS feeds have two groups.

I wonder if I should drop these groups together first, as most of my groups do not have enough items. Ah, I shall create a master group for manuals, move one group into there, and then see if I can get group to see that from the hat button.

eben · August 9, 2012, 11:22pm

Ah, I see where things are different. Try collecting all your RSS feeds in a new database (my reasoning is that it might affect auto classify, but you will probably lose See Also). Then try auto group on all the feeds and see what it produces.

In my case I only had scanned documents in my database (Since I treat it exactly the way I’d treat my paper filing cabinet). The email Archive only contains archived emails.

Another thing to try, once you have the different types of media auto grouped, is to merge the databases, keeping one top-level group per media type. From there I think auto classify might actually work rather well.

Again I’m just speculating here, but it’s certainly worth a try…

edgley · August 9, 2012, 11:40pm

Going to keep with the one DB for now, I really want to see the power of everything together. I am starting to see some logic in the folders.

I have been grouping together, using the suggested moves and will auto-group the new large groups. If nothing else, beats an evening of playing Minecraft

eben · August 9, 2012, 11:44pm

I think one database should be fine, possibly with parent groups introduced to gather the many similar groups together. I do think keeping manual classification to a minimum is probably key…

edgley · August 10, 2012, 12:59am

I think that the AI is more clever than that.
If I manually drag one item, because the auto suggest is not right, and then select another one that is similar, auto suggest updates and now does work.

I am now wondering if you can use the OCR part of DTO to OCR all images, and then auto group them.
Oh, no need to wonder…

eben · August 10, 2012, 1:11am

I know that it does “learn” but I’m wondering how much tolerance there is for getting things right!? Possibly a decent auto group to get it mostly into a structure that it’s happy with, followed by some tweaking, would do the trick?

How much time are you willing to spend on this?

edgley · August 10, 2012, 1:21am

lol, like anything new I find I make a point of not counting the hours I spend on it.
Part of it is nothing more than a life long passion for cataloguing things.

I have spent hours trying to get my music into the correct genres. In the end I have adapted and made my own, and now I can find my music, and quickly.

As this is something that I hope will be with me for years, I am happy to put time into it to get it right. I have been looking for this software for years, so it’s a small price, to me.

Add to that, how cool is it when it works?

Let’s not forget that Macs are just for designer types: