Classify made easy

By far, the biggest time sink in working with DTP, and the one most likely to promote carpal tunnel syndrome, is integrating new documents into the database from the inbox.
Right-click → Replicate → Folder X
Right-click → Replicate → Folder Y
Right-click → Move to → Folder Z

Imagine highlighting a new document, pressing Classify, and having all the possible folders presented, each with a checkbox.
Check the ones to replicate to. Hit Replicate.
Check the one to move to. Hit Move.
Bliss. :slight_smile:
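The flow being proposed could be sketched roughly like this. This is purely illustrative pseudo-logic, not DEVONthink's actual API; the function and group names are made up for the sake of the example.

```python
# Hypothetical sketch of the proposed batch Classify flow: given one
# document and the checkboxes the user ticked, replicate the item into
# every checked group, then move it into the single chosen group.
def batch_classify(document, groups, replicate_checked, move_choice):
    """Replicate `document` into each checked group, then move it once."""
    for name in replicate_checked:
        groups[name].append(document)     # replicate: same item, extra group
    groups[move_choice].append(document)  # move: the item's final home
    return groups

groups = {"Folder X": [], "Folder Y": [], "Folder Z": []}
result = batch_classify("new-paper.pdf", groups,
                        replicate_checked=["Folder X", "Folder Y"],
                        move_choice="Folder Z")
```

One pass through the checkbox list replaces three separate right-click trips per document.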

Seriously, I use DTP less than I should, and likely get less out of it when I do, because it makes filing in an informative way so tedious. :frowning:

Right-click? I just drag and drop for Moves.

Afraid I can’t help much with Replicate, as I rarely use it. But iirc there’s a magic key that you can use to turn a drag and drop from a Move to a Replicate. Bill?

Katherine

Katherine, if there’s a drag & drop modifier for “replicate to” I don’t know it – but that could be useful.

I often use the floating Groups panel to move items.

My Incoming group often contains hundreds to thousands of items. I often drop in large numbers of results from a DEVONagent search, for example. And when I’m downloading articles from journals as RTF notes I send them to Incoming for later classification.

Periodically I trim Incoming down using searches and move or replicate the search results to groups for classification. I often have to create new groups for current research interests, so Auto-Classify wouldn’t work. After doing that, I delete replicants left in the Incoming group. No problem with carpal tunnel syndrome, as there are few repetitive movements.

At the moment I’ve got more than 7,000 items in my Incoming group. Have been too busy with a project and with Leopard support to organize stuff recently. :slight_smile:

But See Also works well for me, anyway.

I see there are many different tactics when training the mysterious AI.

I am an ecologist, which is all about webs of relationships. Any given PDF reprint will thus be moved into its core subject, but likely replicated to at least two more (e.g., interaction, taxon, ecosystem type). Even dragging and dropping 50 reprints into 3 groups apiece in a given session is a fair bit of mouse work, especially when you have 100 or so groups, nested just so. I can feel it when I’m done. And it takes time.

Since each PDF is different, there is no easy way to sort and do this in batches. It’s open a paper, scan the abstract, classify, and move on. Auto-classify is never perfect, but it usually brings up 70% of the likely candidates. Thus my request for the ability to deal with all the suggestions in one batch.

How hard can it be? :wink:

But this brings up a deeper issue. Am I overdoing it in classifying reprints? I guess it depends on how the AI works. So here are two relevant questions about the use of folders:

  1. To what extent is the text in a folder’s name weighted relative to the text in the document itself?

  2. Would it be better to have 10 basic folders and distribute each PDF into two of them, or on the order of 100 folders, one for each possible pairing of those 10 topics?
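For what it’s worth, the arithmetic behind question 2 works out as follows (a throwaway sketch; the counts assume each PDF lands in exactly two of the 10 basic topics):

```python
# Counting the folder schemes behind question 2:
# scheme A: 10 basic folders, each PDF replicated into 2 of them;
# scheme B: one dedicated folder per topic pairing.
from itertools import combinations

basic_topics = 10
unordered_pairs = len(list(combinations(range(basic_topics), 2)))  # C(10, 2)
ordered_pairs = basic_topics ** 2  # 10^2, if order and self-pairs count

print(unordered_pairs)  # 45 distinct unordered topic pairs
print(ordered_pairs)    # 100 pairings if order and self-pairs count
```

So "100 folders" is the 10² figure with ordered/self pairs; if only distinct unordered pairs matter, scheme B needs just 45 folders.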

I have abandoned most of my topical folders in DTP and simply arranged materials by date of accession. That’s similar to the system that most libraries use as they acquire new items. So I have a folder for 2007, into which go all the files, URLs, and PDFs that I acquired this year.

Then I use Search, Classify, and See Also to locate topics or particular words and phrases. Sorting them by date reveals a trail of learning that’s often helpful. I found this so useful in DTP that I now have a chronological arrangement in my paper files, with a typed listing of those files placed in DTP.

No longer having to sort and classify saves much time and frustration. It’s also good to review files that are older than 10 years and winnow them down to essentials.

I think the names of groups and documents are pretty much irrelevant. Yes, for my own convenience I give them names meaningful to me. But it’s the content of documents that’s important.

My main database deals with environmental interests. That pulls in a broad range of scientific and technical disciplines, from chemistry to toxicology to food and agriculture to molecular biology and genetics and yes, ecology. And economics, engineering, law and policy. And cultural patterns and behaviors. Energy sources and alternatives, geology, hydrology and geopolitics.

Take a case such as high arsenic concentrations in a community water well in Pakistan. There are many such wells that have arsenic levels high enough to raise concerns about the health effects on people drinking the water. Some wells are more contaminated than others. Approaches to sampling and analysis run into issues of economics and available resources. Evaluation of alternatives should take into account history. In many cases, previous water use involved contaminated surface water that resulted in a high incidence of cholera and other waterborne pathogens that had even worse health effects. So there had been governmental and nongovernmental funding and encouragement of development of ground water resources to reduce the high mortality rates resulting from consumption of contaminated surface water.

Indeed, there had been demonstrable reductions in mortality rates, especially childhood mortality, resulting from the historical push away from pathogen-contaminated surface water. But the unforeseen levels of arsenic contamination in some subsurface water resulted in a new environmental problem.

Technically, it is easy to remove arsenic from drinking water, although there is an associated cost. In the case of a poor community in Pakistan, however, the questions of which technology to use, of the resources available, and of who pays and how the cost is to be borne make solving the environmental problem complex. Some technologies are very simple and inexpensive, but they require education in how to use them, and understanding and adoption of the procedures by the affected population.

And of course one should worry that simply frightening people away from contaminated well water might lead them to return to drinking pathogen-laden surface water. So we have information and communication issues.

I talked about that case to emphasize the point that few if any environmental issues can be understood or addressed within a narrow context.

Some environmentalists advocate the precautionary principle as a guide to decision making about adoption of new technologies, or even continued use of existing technologies. Basically, the precautionary principle states that if a technology has, or could have, any adverse impact, then it should not be adopted. My opinion is that the precautionary principle is one of the dumbest possible decision rules, and that it leads to decisions much worse than even the flawed decision to move from surface water to ground water in some areas of Pakistan.

We talk about “sustainability” a great deal in environmental policy literature. There are some basic and true principles embodied in that concept. But many use the term as wrongly as the concept was presented in the book “The Limits to Growth” back in the 1970s. The resource predictions in that book were wrong, because it assumed that human behavior is too static and unadaptable to change when problems are presented. Some use the concept to argue that human society must revert to a much more primitive lifestyle to remain sustainable. Sorry, guys. I know that hoe-culture agriculture is the most energy-efficient and land-use-efficient, but I don’t intend to spend all my time chopping weeds with a hoe. That’s not the necessary solution to our environmental problems.

OK, so I’ve got tens of thousands of reference materials covering a number of disciplinary areas. How do I organize them?

I treat a group as a cluster of related contents. Most of my groups are not tightly subdivided, although in the course of a project I’ll sometimes subdivide a group if that seems useful. I’ve got 657 groups at the moment. I’ve split out many thousands of strictly technical references, such as chemical analytical methodologies, sampling protocols, statistical data evaluation procedures and the like, into a separate database. Most of my projects deal with analysis of specific problems, comparisons of regulatory approaches between the U.S. and the EU, a variety of policy issues and the like. I often seek an overview of issues – how a number of factors fit together in giving a picture of the problem or issue.

I don’t spend much time on organization of database contents. That’s because I’ve learned that the next project really won’t benefit from time spent organizing for the previous project. :slight_smile:

I do use See Also a great deal, and sometimes the related “See Selected Text” to isolate for the AI routine a particular paragraph or section of a document that is of interest.

I neither want nor expect See Also to suggest a list of other documents that are “just like” the one I’m reading. Instead, I’ll be interested in finding new relationships that I hadn’t thought of. That’s rather like serendipity, but with computer assistance to help me explore my database in new ways. Many times I’ll do a trail of See Also operations, running See Also on a suggested document, and then again on something interesting that pops up on that suggested list. Once in a while there’s a Eureka! experience, and then all the time spent in gathering fodder for the database seems worthwhile.

DT Pro isn’t a chemist or ecologist. The database doesn’t “know” anything about these disciplines, or any other. It’s the responsibility of the user to understand and interpret the usefulness of the AI suggestions. But it can be a very useful interactive process.

This is certainly an eye-opener. I have been under the impression that the AI uses how you associate documents (i.e., put them in folders) to improve its ability to predict what you are looking for. So even if it doesn’t use the names on the folders, it uses the information as to what is and is not grouped together.

If this is true, then there is an optimum set of tactics (number of folders, nestedness of folders, replicates) that increases the AI’s performance. If so, I’m looking for what those tactics are.

If it is not true, then I have been wasting a lot of time when I should have been dumping my PDFs into a few folders. :neutral_face:

All I want is the truth. Just gimme some truth. :wink:

There is no question but that the more tightly organized a database, the better the performance of the Classify AI routine will become for suggesting classification of new content. In a large, well-organized database the Auto Classify AI feature can become quite reliable.

But classification isn’t significantly related to Search performance.

And I’ve got to say that the performance of the See Also and See Selected Text AI features remains very useful to me in my less than optimally organized databases. So I confess that my database organization often becomes rather sloppy.

My database isn’t like a file cabinet, which requires very tight and systematic organization if I’m to find and use the information stored in the file cabinet.

If I always approached the information content of my databases in the same systematic way, I’m sure that I would spend more time on classification of the contents. But – except in the case of my databases for financial records, where tax year is extremely important – most of my projects differ in what information I’m interested in, and in how I might select and organize content for the project. So I tend to spend less time on organization than on other database work, such as building new content related to my interests. Result: most of my databases are far from optimally organized. So be it.

The drag-and-drop approach turns out to be a drag. It is so 1990s and takes ages to perform. Bill’s description of his database is a good example of the state of affairs in my DT Inbox: valuable nuggets of information amidst cloudy chaos and utter crap culled from the web, a whole week of work to sort the mess with my drag-and-drop, CTS-aching fingers:

Click. Select. Click. See Also or Classify. Click. This group seems relevant, but what’s inside? Clickclickclickclick […] OK, let’s move this file. Where the heck is my groups palette? Click. There! Now where is this group? Clickclickclick. Now DRAG this #**##* file into the… whoops, the groups look so small on my screen that I hit the wrong group. Where is my file? Clickclickclick…

Average time to sort one PDF file: 1 minute.
Time needed to sort 700 files: why should I do this?

In comparison, using Quicksilver in the Finder:

Select a file, invoke QS (my hands can stay on the keyboard), type the first letters of “Move to…”, type the first letters of the relevant group, browse the group with the cursor keys, hit Enter, done.

Average time to sort a file: less than 10 seconds.
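Taking the per-file times quoted above at face value (1 minute by drag-and-drop versus roughly 10 seconds by keyboard), the backlog math looks like this:

```python
# Back-of-the-envelope totals for the 700-file backlog, using the
# per-file times quoted above (1 min drag-and-drop vs. ~10 s keyboard).
files = 700
drag_drop_hours = files * 60 / 3600    # 1 minute per file
quicksilver_hours = files * 10 / 3600  # ~10 seconds per file

print(round(drag_drop_hours, 1))   # ~11.7 hours of mousework
print(round(quicksilver_hours, 1)) # ~1.9 hours at the keyboard
```

Nearly a dozen hours of mousework versus under two hours at the keyboard, which is the whole argument in two lines.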

Unfortunately, in DEVONthink “Move To…” is only a contextual menu item. If it were in the main menu bar, it could be accessed via Quicksilver and the keyboard.

When I bought DEVONthink, I was under the impression that the program would help me organize a lot of data. With the current implementation, I am happier with the Finder plus Quicksilver. Yes, See Also is useful, but why is organizing files so unnecessarily difficult?

So be it?

Mark