Improving confidence score for "classify" and "see also"

macula · December 7, 2011, 7:08am

Better late than never: I have started to realize the usefulness of the “see also” and “classify” drawer in DTPO. However, the overall confidence score (at the top of the drawer) tends to be below 25% for the vast majority of my documents.

Why? How can I improve this?

My first thought was to exclude large files (e.g. book-length PDFs) and their parent folders from these two mechanisms. There would be a significant downside in this case, however.

The database in question is thematic (focused on musicology) but admittedly large (40+ million total words) and rather multidisciplinary. It consists largely of RTFs and OCR’ed PDFs.

Any ideas? Thank you.

korm · December 7, 2011, 10:05am

Christian’s reply here indicates that accuracy improves as your group hierarchy becomes broader and deeper. Many users also see better accuracy with a larger number of small documents than with a smaller number of large ones. Since the latter factor might be out of your control, you might experiment with a broader/deeper group hierarchy (i.e., more groups).
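To give a rough intuition for the small-versus-large document point, here is a toy sketch using plain word-count vectors and cosine similarity. This is only an assumption for illustration: DEVONthink's actual algorithm is not documented in this thread, and the sample texts and function names below are made up. It shows how a long document covering many topics can score lower against a focused group than a short document on the same topic.

```python
import math
from collections import Counter


def vectorize(text):
    """Turn text into a bag-of-words count vector (hypothetical helper)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# A focused group, represented by the words of its existing documents.
group = vectorize("fugue counterpoint baroque fugue subject answer counterpoint")

# A short document on the same topic.
short_doc = vectorize("fugue subject answer counterpoint")

# A long document: the same passage plus material on many other topics.
long_doc = vectorize(
    "fugue subject answer counterpoint "
    "acoustics psychology history economics notation printing archive patronage"
)

short_score = cosine(group, short_doc)
long_score = cosine(group, long_doc)

print(f"short document vs. group: {short_score:.2f}")
print(f"long document vs. group:  {long_score:.2f}")
```

In this toy model the extra, unrelated vocabulary in the long document dilutes its match with the group, which would fit the observation that book-length PDFs tend to drag scores down.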

Greg_Jones · December 7, 2011, 10:30am

It is also helpful to not have a mix of documents and groups in the same group. In other words, a group should contain either documents or sub-groups, but not both.

macula · January 19, 2012, 8:47pm

Greg, I am resurrecting this thread as my efforts to increase the “quality” of my database in terms of “see also” confidence scores have so far been unsuccessful.

May I ask why such mixtures of subfolders and files are detrimental to accurate classification?

Side note: Once again, I very much wish that DevonTechnologies would provide official and technically sound information about how its AI features work, or at least how they could be optimized on the user’s end. This lack of information is very unfortunate and haphazard.