Auto Classify failure

NTParsons · July 15, 2015, 10:08pm

Hi - I’ve never used Auto Classify before, so this might be some stupid thing everyone knows but me. OK so I created a new database and folder structure for storing paid bills by category, Phone, Utilities, Medical Insurance Statements, and so on. I thought this was EXACTLY the kind of thing Auto Classify should be able to sort for me. I loaded a bunch of pdfs of paid bills into their appropriate folders so DTpro could “see” them and have a model for what goes where. Then I stuck a bunch of unsorted paid bills into the inbox, and hit Auto Classify and… nothing. A little window popped up and said “Unclassified” for each item.

Fine, I thought. It needs time to index or whatever. I waited till the next day, tried again. Same thing. All unclassified.

So what did I do wrong?

Bill_DeVille · July 15, 2015, 11:29pm

On the contrary, documents like bills and receipts are among the worst candidates for using Classify or Auto Classify. This AI assistant looks at the contextual relationships among the contents of each of your groups in order to determine patterns of terms, their associations and frequencies. It then looks at unclassified documents to see if the contextual relationships of the contents of each document are sufficiently similar to a group’s pattern to make a filing determination. The AI looks only at the text content (the body of text), not the Name or other metadata. Most bills and receipts don’t give the AI much to work with.
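To make the idea concrete, here is a rough sketch in Python. It is not DEVONthink's actual algorithm, which isn't public; it only assumes a simple word-count comparison (cosine similarity) between a document and the combined text of each group. The group names, sample texts, function names and the 0.3 threshold are all made up for illustration.

```python
import math
import re
from collections import Counter

def term_counts(text):
    """Lowercase the text and count its words (a crude bag-of-words)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def classify(text, groups, threshold=0.3):
    """Return the most similar group name, or 'Unclassified' if nothing is similar enough.

    `groups` maps a group name to the texts already filed in it (hypothetical data).
    The threshold is an arbitrary illustrative cutoff, not a DEVONthink setting.
    """
    doc = term_counts(text)
    profiles = {name: term_counts(" ".join(docs)) for name, docs in groups.items()}
    best, score = max(((n, cosine(doc, p)) for n, p in profiles.items()), key=lambda x: x[1])
    return best if score >= threshold else "Unclassified"

groups = {
    "Physics": ["branes and strings vibrate in extra dimensions",
                "supersymmetry and string compactification"],
    "Ecology": ["marsh sediment supports wetland plants and birds",
                "wetland hydrology and marsh vegetation"],
}

print(classify("strings in compactified dimensions", groups))
print(classify("amount due account number", groups))
print(classify("", groups))
```

In this toy version the first document shares enough vocabulary with the Physics group to be filed there, while a bill-like snippet, or a document with no text at all, falls below the cutoff and comes back as Unclassified.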

I’ve got collections of other kinds of documents for which Classify and Auto Classify do work. For example, a collection of scientific papers from various disciplines and subdisciplines will give the AI a lot to work with. If I initially create appropriate groups and seed them with appropriate documents in each group, the AI will “see” distinctions between each group, and filing suggestions by Classify for new content that is about one of those scientific disciplines will become more and more accurate as the database grows, so that eventually I might begin to use Auto Classify. This works very well in my large Main research database, which has hundreds of topical groups.

The upside of the fact that AI isn’t really appropriate for bills and receipts is that your organizational structure for them is probably very simple and straightforward, so that you can easily toss new items into appropriate groups.

I scan a lot of paper to searchable PDFs. I scan into a database named Incoming Scans. That database holds some 33 smart groups to collect items that can be defined by search criteria into groups so that I don’t need to rummage through a large collection of scans to isolate categories of documents for filing purposes.

Example: One smart group is based on a search that pulls anything from my water utility, Brown County Water Utility, into it. The content will include monthly water bills, as well as any special notices. When I feel like it, I’ll use the Change Date script to assign Creation Date to the date of the bill or notice, so that I can sort by date and file the items into another database where I keep records of expenses. I have no plans to sell my cabin, but potential buyers might want to see utility costs.

Other smart groups pull scans together from each of my bank and investment accounts, medical, home and vehicle insurance, etc. And so on.
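For anyone who thinks better in code, the smart-group approach amounts to plain rule matching. The sketch below is ordinary Python, not a DEVONthink feature or script; the rule table, the group names other than the water utility, and the `matching_groups` helper are hypothetical and only show the logic of "if the text contains this phrase, the item belongs to that group".

```python
# Hypothetical rule table: group name -> phrases that identify its documents.
SMART_GROUPS = {
    "Water Utility": ["brown county water utility"],
    "Phone": ["at&t"],
    "Gas": ["gas company"],
}

def matching_groups(ocr_text, rules=SMART_GROUPS):
    """Return the names of all groups whose phrases occur in the text.

    Matching is case-insensitive, and runs of whitespace are collapsed so that
    a phrase broken across lines by OCR can still be found.
    """
    text = " ".join(ocr_text.lower().split())
    return [name for name, phrases in rules.items()
            if any(phrase in text for phrase in phrases)]

print(matching_groups("BROWN COUNTY\nWATER UTILITY\nAmount due: $31.20"))
print(matching_groups("Grand opening flyer"))
```

Anything that matches no rule stays behind to be handled item by item. Note that this only works on scans that already carry a text layer.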

When I’ve emptied those 33 smart groups there will still be stuff left that I have to manage item by item. There usually aren’t many of those. I’ve saved time and effort.

korm · July 15, 2015, 11:31pm

Are all these PDFs OCRd? I.e., is their Kind “PDF+Text”? Classify cannot work with plain PDFs or images.

NTParsons · July 16, 2015, 10:29pm

Bill - Thanks for that reply. Very interesting.

So let’s say I have bills from the ATT phone company, the mortgage bank, and the gas company. Each company’s bills are pretty much identical to one another (all ATT’s bills are like other ATT bills) and carry the company’s name and other identifiers that should flag them as the same class of document. Are you saying that even under those circumstances, the AI won’t catch it?

korm, in answer to your question: I scanned these myself using my ScanSnap. I actually don’t know if they’re “readable PDFs” or not, since I (sorry) didn’t know there were such things. I thought a PDF was a PDF. I thought they were all readable with an OCR program, which then turns them into text.

Have I made a fool out of myself here?

thanks
t

korm · July 16, 2015, 10:59pm

Documents scanned with your ScanSnap are not necessarily “readable” by the computer. You’re looking at an image when you open the PDF – which you can read – but your computer can’t see text in images without help. So, we can OCR the PDF and add an invisible (to you) layer of text that DEVONthink and other applications can read. DEVONthink Pro Office has a built-in feature to OCR PDFs. ScanSnap Manager does also. The way to tell if the document is an OCRd PDF is to make the “Kind” column visible in DEVONthink and see if the PDF’s kind is “PDF+Text” or not. “PDF+Text” means the PDF has been OCRd.

NTParsons · July 17, 2015, 6:07pm

Ah ha. OK then. I just looked and Kind only shows PDF.

So questions:
1 - If I OCR these docs and make them readable, will the AI be able to auto classify ATT bills with other ATT bills, bank statements with other bank statements? Or is that just off the table, because the AI isn’t meant to do that under any circumstances?

I like the smart groups solution too; I’m familiar with the logic of it from using Hazel to classify these docs before. I’m guessing I need to OCR everything to make that work as well. I also imagine there’s a way of doing that within DTpro; I’ll go find it.

thanks for your help
tucker

BLUEFROG · July 17, 2015, 7:50pm

OCR is only available in-application with DEVONthink Pro Office.

Bill_DeVille · July 18, 2015, 2:15pm

The scanned bills would have to be OCRed in order to provide text for the AI assistant to work with.

My experience with bills is that there’s quite a lot of variation from one vendor to another, so the group holding bills isn’t likely to have a distinctive pattern of language usage and associations of terms. It’s more like a hodgepodge.

Classify – and eventually Auto Classify – begins to shine in a se database where the groups are topically organized, and in which the topics have quite distinguishable differing patterns of terms and associations of terms. For example, DEVONthink’s filing suggestions can be very on target in a database that holds topics as distinctive in their jargon and general language usages as String Theory and Wetlands Ecology. DEVONthink doesn’t know beans about physics or ecology, but its algorithms are capable of seeing differing contextual relationships in text content in such topical groups.

NTParsons · July 21, 2015, 1:30am

Hi Bill and thank you again.

But what’s an “se database”?

BTW I own DEVONthink Pro Office, so does that mean I can OCR the PDFs already in there and make them searchable?

thanks
tucker

Bill_DeVille · July 21, 2015, 2:14am

You may add the Kind column to a view window to identify PDFs that have not been OCRed. Choose View > Columns > Kind. The Kind of a searchable PDF is PDF+Text. If the PDF is image-only it is a candidate for OCR, and its Kind is PDF.

In DEVONthink Pro Office you may select a PDF document and choose Data > Convert > to Searchable PDF to perform OCR.
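If you have a long list to go through, the Kind check boils down to a simple filter. The sketch below is plain Python over a hand-copied listing, not a DEVONthink script; the file names and the `needs_ocr` helper are hypothetical, and only the two Kind values come from the column described above.

```python
def needs_ocr(records):
    """Return the names of records whose Kind is plain "PDF" (image-only, no text layer).

    `records` is a list of (name, kind) pairs, e.g. typed in from the list view.
    """
    return [name for name, kind in records if kind == "PDF"]

# Hypothetical listing for illustration.
records = [
    ("water-2015-06.pdf", "PDF"),
    ("phone-2015-06.pdf", "PDF+Text"),
    ("mortgage-2015-06.pdf", "PDF"),
]

print(needs_ocr(records))
```

Everything this returns is a candidate for Data > Convert > to Searchable PDF.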