Organisation with regard to AI

Hello,

I am seriously considering investing in DTOP. I’ve been reading through the posts in this subforum (I’m about halfway through), but I feel that some of what I am reading is now out of date because of DT updates, so I wanted to check a few things with you.

I am a PhD student, and one of my main uses for DT would be a scientific literature database. I want it to help me organise papers and search them efficiently, but I am also very interested in the AI: it sounds like it could really help make links between papers and could be great for generating ideas.

Hence I am interested in how organisation within a database affects the AI. I do understand that there are as many ways of organising content as there are users, so I am not asking for “the” way, but for guidance, so that I can optimise my way for the AI.

Firstly, it sounds like earlier on groups were important for the AI, whereas tags were ignored. Is this still the case, or are groups and tags now equivalent with regard to AI?

Secondly, do groups (and tags if they are equivalent) affect certain parts of the AI or all of it? I.e. are they only important for automatic grouping and sending your newly added documents to the right place, or do they also affect searches and links made through “see also”?

Thirdly, I understand that more granular hierarchy might be better for the AI. I am happy to do that if that helps, but I wonder if I can shoot myself in the foot a bit here if I am not strict enough about classification.

To explain, my PhD focuses on two main themes, A and B, with an important side theme, C. So I can have group A, group B and group C, each with subgroups (in reality it would be around 6-7 groups with subgroups). BUT many of the PDFs will deal with two or all three topics. Is it then crucial that I replicate those consistently to all three groups? For example, I might think that an article deals mainly with A, so I might be happy to stick it in group A, as I am likely to look for it in that context first, and I’m sure the search is powerful enough to find it when needed anyway. But if I fail to replicate it in the other two groups too, will that confuse the AI?

Any advice would be appreciated. I’ve imported about 350 pdfs into DTOP and I am playing with it, but I can’t quite see what’s important for the AI + I imagine that AI needs time to learn too, so certain things might not be evident straight away.

I realise that newbies can be annoying, but this forum seems so friendly and full of knowledgeable folk that I couldn’t resist asking!

~M

The reason I love and use DEVONthink Pro Office for my own research and data management is that it frees me from the necessity of a lot of grunt work such as highly hierarchical, detailed organization into groups and/or tagging for reliable and comprehensive identification of the information content of documents.

Before computers became available, it was necessary to file documents in a consistent way in order to retrieve them. In the early days of computerized databases, before full-text searches of content became available, it was necessary to use keywords (tags) to retrieve them.

But there are two fundamental weaknesses in retrieving documents by their organization or by keyword/tag. The first is completeness: the same document may be useful in multiple contexts, and identifying every context in which it is relevant would take a lot of time and effort for the many documents in my database. The second is consistency: different persons (and certainly the same person at different times) are likely to organize or keyword/tag the same documents differently, and I found that no amount of training could solve that problem. (From 1969-72 I developed and ran a university computer information center that disseminated the results of federal R&D that were relevant to environmental issues. Searches were by keyword/tag, and the weaknesses of this approach were highly evident, both up front, when federal agencies applied the identifiers, and later, when my staff devised search strategies to meet information requests. Those weaknesses are often noted in the literature of computer science and document management.)

I discovered DEVONthink in its first year of release, and it revolutionized my management of my digital documents. For the first time I could truly integrate the information content of documents that were of different filetypes. And DEVONthink included artificial intelligence assistants to help me file new content appropriately (including the use of replicants when multiple filing locations might be appropriate), and to look for documents that might be contextually similar to one that I was viewing.

DEVONthink has continued to evolve over time to the point that I consider it the best research assistant I’ve ever had. It has AI assistants and powerful searches that also enable creation of smart groups on the fly. From the beginning it enabled ‘marking’ by keywords (in the Info panel of a document) and later brought tags into its arsenal. DEVONthink Pro and Pro Office have large AppleScript dictionaries, enabling automation or extension of procedures via scripting. And (perhaps because of my training in the old days, when lab equipment to tackle bleeding-edge research wasn’t available off the shelf but had to be cobbled together from what was already in the lab or available at a hardware store or Radio Shack), I found that I could often create kludges to accomplish tasks for which no built-in tools were available, such as replicating search results into a new group in order to perform subsequent procedures on those items.

Some of the tools of DEVONthink such as searches, tags and See Also work well in a database that has no group organization, such that all documents are held at the root level of the database. Some users operate in that mode, at least for some databases. Of course, the Classify assistant becomes useless in that case.

Christian Grunenberg, the architect of DEVONthink, notes that See Also is somewhat improved in a database that holds groups. But See Also still can ‘find’ contextually similar documents in different groups or at the root level.

Personally, I do create groups, but for my own edification and convenience rather than DEVONthink’s. I rarely create multilevel hierarchies of groups, as that takes more time and effort, although I sometimes find it useful. Most of the groups I create in my research databases contain contextually related (topical) content, so that the Classify routine works well in suggesting locations for filing new content. But in some databases such as my Financial database, I do use a highly structured hierarchical group organization and don’t use Classify for filing chores. I know precisely where to file a new invoice or receipt in that structure, e.g., by Year, category and Vendor. (And I don’t need to do a search in the Financial database to find a receipt in that database, as I already know where it will be found.)

As for keywords or tags, I almost never apply them as new content is added to the database. Why? Because, as may be evident from my previous remarks, I don’t consider doing a good job of tagging to be an economical use of my time; it doesn’t pay off well enough.

Most of my tagging is done at the project level, for example when I’m identifying references or notes useful for that project and ranking their information content for specific purposes. When I finish that project I will usually delete those tags, as they would likely have little or no utility (and might be counterproductive) for the next project I undertake. But in my Financial database I might create permanent tags identifying cost items for a project.

So my advice on the level of effort to put into group organization or tagging is to tune it to your needs and workflows in that database, so that you accomplish what you wish to do with the minimum effort that still gives a good payoff. In a database like my Financial one, I gave a good deal of prior thought to group design, and that paid off. In my research databases, I sometimes start with one-level topical groups and may later find it useful to reorganize some of them into a hierarchical structure.

Tips: When devising hierarchical structures for groups or tags, give sub-groups unique names, so that when Classify suggests a group, its name isn’t a common one such as Miscellaneous or Physics that is also used in other groups or tags, which would be confusing. And when filing documents or assigning tags in hierarchical structures, always do so at the lowest appropriate level of the structure; otherwise, the hierarchy won’t make logical sense. Never file or assign at the top level of a hierarchy; if that has occurred, move items down into the lowest level of appropriate sub-groups or sub-tags.
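The unique-name tip can be checked mechanically before a structure is built. Here is a minimal Python sketch, assuming the planned hierarchy has been typed out by hand as nested dictionaries. The `find_duplicate_names` helper and the sample group names are hypothetical illustrations, not part of DEVONthink or its scripting dictionary.

```python
from collections import defaultdict


def find_duplicate_names(tree):
    """Return {name: [full paths]} for every group name used more than once."""
    seen = defaultdict(list)

    def walk(node, prefix):
        # Each key is a group name; each value is a dict of its sub-groups.
        for name, children in node.items():
            full = prefix + (name,)
            seen[name].append("/".join(full))
            walk(children, full)

    walk(tree, ())
    return {name: paths for name, paths in seen.items() if len(paths) > 1}


# Hypothetical planned hierarchy (names invented for illustration).
groups = {
    "Research": {"Physics": {}, "Miscellaneous": {}},
    "Teaching": {"Physics": {}, "Lecture notes": {}},
}

duplicates = find_duplicate_names(groups)
print(duplicates)
```

With this sample, "Physics" is reported under both "Research" and "Teaching", so one of the two would be a candidate for renaming to something more specific.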

Final tip: Don’t get obsessive about organization or tagging, or you won’t have time to get any real work done with your database.

Thanks for this post! Very interesting for me. I especially like the last sentence. However, I must admit that I’m a bit afraid to let go of hierarchical groups and depend on the software for retrieval of documents. Maybe I should try and see how it goes.

cheers,
Tom

Bill, thank you for putting the time and thought into answering. I agree on many points.

TAGS

I dislike tags in general because I think that the effort-to-return ratio is very poor, so if they don’t help the AI in any significant way I will be more than happy not to use them (or to use them very sparingly as temporary tags that get removed as soon as possible, similar to your project-level tags). It sounds like I should just be able to use smart groups where before I’d have used temporary tags, so that’s great.

The second reason I wondered about tags and the AI is that I sometimes use tags to mark the next action needed, e.g. “check the cite key”. I wouldn’t want the AI to treat documents that need their cite keys checked as more similar to each other. They aren’t; they just happen to all need cite keys.

GROUPS

As for groups, I currently hold my research in fairly broad categories (folders in Finder). So I would be happy to just move those to DT as groups.

I keep all those at the same level (not nested), but they would be easy enough to nest within each other if that helps the AI. However, I tend to be relatively relaxed about categorising things, hence my question whether groups would hurt more than they would help by confusing the AI if one isn’t obsessive about replicating things to all relevant places.

For example, one of the things I work on is inbreeding. I have an “inbreeding” folder where all the relevant papers go, including many that mention or include inbreeding depression. Particularly useful papers dealing specifically with inbreeding depression go to an “inbreeding depression” folder. So I could nest both “inbreeding” and “inbreeding depression” within a parent group in DT to keep them separate from the other topics. But not all papers mentioning inbreeding depression are in the “inbreeding depression” folder, only the crucial ones, so I wondered if that would confuse things for the AI. There are many papers that cover 2-3 topics, but I like them as an example of topic A, so they end up in folder A, even though they could easily fit in B or C (and I’m sure plenty of other people would rather have them there).

Basically, I’m trying to understand the AI a bit more so that when I dump my current system into DT I can make good use of the AI. I don’t want to re-invent the wheel and don’t want to introduce unnecessary things to my workflow/DB. But I am happy to make small changes (e.g. nest the folders) if that helps the AI and requires little effort on my part. If making the group structure work means I have to start religiously replicating the files to all relevant groups… I don’t have the time or patience for that.

Thank you for the tips, that’s useful too.