Using Keywords?

jchiar · June 14, 2007, 3:06pm

Hi, I am a new user of DEVONthink Personal.
Is there a way to use keywords to search for documents?
I would like to use it to track things like receipts, etc…

Is this possible? I saw categories like personal, important… But this isn’t exactly what I was thinking of…

Thanks

Bill_DeVille · June 14, 2007, 5:17pm

Why I don’t bother to add tags to most of the tens of thousands of documents in a reference collection:

Most of the time, especially when I’m doing research in a large collection of reference materials, I don’t need or use keywords or tagging (except by group classification).

In the early days of document databases keyword and other forms of tagging were absolutely necessary, because there was no other way to find stuff.

Tagging is a lot of work and takes time, and is difficult to do consistently. My databases typically contain tens of thousands of documents, and I often add large blocks of information that, if tagged, would require a large number of tagging classifications.

Instead, I rely on the search capabilities plus the contextual recognition See Also feature. DT “sees” all the text in the database and can “see” other documents that contain patterns of words that may be similar to those in the document I’m viewing.

That avoids a very significant limitation of any tagging scheme, i.e. the fact that tags must be applied in the first place, and provide a limited spectrum of the possible classifications that might have been invoked for a particular document. Tagging is static; it provides a classification thought of by the user at the time of tagging, even if tagging is done on multiple occasions.

So if I do a search based on tags, the results will be a regurgitation of the classifications of documents that I made when those documents were tagged. Yes, there can be some degree of sophistication in handling and relating multiple tags, but I’m very often looking for new insights into material with which I’m already familiar. That’s when I explore the material using See Also, which looks at word uses and patterns that are less limiting than any tags I might have used. Often I’ll follow a trail of See Also suggestions, picking an interesting item and running See Also on it, and so forth.
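As a rough illustration of ranking documents by shared word patterns, here is a minimal Python sketch. It is not DEVONthink's actual See Also algorithm, which this thread does not describe; the function names (`word_counts`, `cosine`, `see_also`) and the sample documents are hypothetical, and plain word-count cosine similarity is only a stand-in technique.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Lower-case the text and count how often each word occurs."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def see_also(current, documents, limit=3):
    """Rank the other documents by similarity to the one being viewed."""
    base = word_counts(documents[current])
    scores = [
        (cosine(base, word_counts(text)), name)
        for name, text in documents.items()
        if name != current
    ]
    ranked = sorted(scores, reverse=True)[:limit]
    # Documents that share no words at all are not worth suggesting.
    return [name for score, name in ranked if score > 0]


# Hypothetical sample data; no tags are assigned anywhere.
docs = {
    "invoice": "invoice for printer paper and toner",
    "receipt": "receipt for toner and printer ink",
    "cabin": "log cabin in the woods",
}
print(see_also("invoice", docs))
```

Following a "trail" as described above would amount to calling `see_also` again on one of the returned names.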

Tagging is simply a form of a filing system and however sophisticated, is based on the user’s preconceptions about the data that’s tagged.

So I call DT the best research assistant I’ve ever had, because it can often lead me to insights about relationships between terms and concepts that I hadn’t thought of (and so couldn’t have tagged).

That said, there are times when I do use tagging, often in a nonpermanent way.

Your example of receipts is a good example. When I’m doing a project I’ll create an organizational structure for related documents, which will include project costs, receipts, and invoices. Creation of a subgroup for receipts is itself a form of tagging. But I might also add to the Info panel Comment field the keyword “receipt” and perhaps also a keyword for the project name. (DT Pro and DT Pro Office provide a script that would let me select the contents of the Receipts subgroup and add my keywords to all of them at once.) That’s additional tagging. Perhaps for invoices I might add a State marker to indicate whether or not it has been paid. Perhaps I might add labels (usually only temporary) to some items, perhaps to indicate which tasks remain incomplete. All of these characteristics are searchable, and so can help me look for all of the items that possess one or more of the tagged characteristics.

Other forms of tagging are available. I might, for example, do a search, select all or some of the results and replicate them to a new group. Or create a smart group. So, for example, I can easily create a group of replicants representing all unpaid invoices, or incomplete tasks (for a specific project, or for the entire database).

When I’m using DT’s See Also or See Selected Text AI features, I’ll often tag (mark) some interesting suggestions temporarily using labels. I might then search for that assigned label characteristic and create a new subgroup representing important reference sources for a project, for example. (I usually remove label tags when I’m finished, so that they are available for other future purposes.)

There’s user-provided metadata, and then there’s DT metadata:

While DT provides tools to let the user add and search for tags of documents, what I value most is the metadata built into the DT databases.

Unlike most other document databases, DT has a glossary of words that ‘sees’ all of the terms used throughout the database and ‘knows’ which documents contain each term, as well as frequency and patterns of usage of those and similar terms. That provides the basis for AI assistance to the user, allowing metadata identification and analysis that avoids much of the need for user tagging.
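The kind of word "glossary" described here can be pictured as an inverted index with per-document counts. The sketch below is only an illustration under that assumption; DEVONthink's internal data structures are not documented in this thread, and `build_glossary` and the sample documents are hypothetical names.

```python
import re
from collections import defaultdict


def build_glossary(documents):
    """Map each word to {document name: number of occurrences in that document}."""
    glossary = defaultdict(dict)
    for name, text in documents.items():
        for word in re.findall(r"\w+", text.lower()):
            glossary[word][name] = glossary[word].get(name, 0) + 1
    return glossary


# Hypothetical sample data.
docs = {
    "invoice": "invoice for toner, more toner and paper",
    "receipt": "receipt for toner",
    "cabin": "log cabin in the woods",
}
glossary = build_glossary(docs)

# Which documents mention "toner", and how often?
print(glossary["toner"])
```

Because the index is built from the text itself, every word acts as a searchable "tag" without anyone having to assign it.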

So I’ve got built-in “tag clouds” that are available to me without the need to individually tag my tens of thousands of documents, and are not limited by my preconceptions about the contents and relationships among documents.

That allows me to interact with my database when I’m exploring ideas, without the drudgery of tagging every document, and without the limitations of my preconceptions about using tags. That’s why I call my database a research assistant, better than any I’ve actually had (human research assistants, especially good ones, are expensive).

Yes, in the process of interaction with my database I often add additional metadata. Future versions of the DT applications will enhance user addition of metadata, but still provide that invaluable base of built-in metadata about each document and how that document ‘fits’ with the whole collection through textual contexts (note the plural of “contexts” – that’s important).

That’s why I don’t have to go through the drudgery of tagging every document, except for my own special purposes (and in a limited way). And that’s why I can explore a document collection from any perspective I find useful, without limitations imposed by a preconceived, necessarily limited and probably inconsistent tagging scheme.

jchiar · June 14, 2007, 8:56pm

What I have is receipts that I scanned into PDFs…
I am guessing it can’t understand the text in those, only text files, correct?

Bill_DeVille · June 14, 2007, 9:02pm

PDFs produced by scanning, but without having gone through OCR, contain no text. They are only images.

In that case, you might want to add a keyword such as “receipt” to the Comment field of the PDF’s Info panel in your database. The Comment field is searchable.

If you used DT Pro Office, the scanned image can be automatically run through OCR and saved to the database. Now the text is searchable.

jchiar · June 15, 2007, 12:05am

I have an OCR application. My question is, does DEVONthink Pro Office scan and OCR directly to the database somehow, or can I accomplish the same thing with my own OCR software?

Bill_DeVille · June 15, 2007, 1:40am

Yes, DT Pro Office can scan and OCR directly to the database. Works great with my ScanSnap scanner. DT Pro Office can also OCR and save to the database image-only PDFs stored on a mounted drive, or even OCR-convert image-only PDFs that have already been imported to the database.

Yes, you can use an OCR application to perform OCR on the output of your scanner, then import the resulting PDF+text files to your DT Pro (or DT Personal) database.

Maria · June 18, 2007, 8:04am

Why tags are so important to me:

Bill,
you wrote about tags in bold, so do I. Tagging has become essential for my workflow; I do this on a systematic basis in the Finder comment field, using an app called NiftyBox.

People who work in multiple languages have to use tags because there is no way for DT to recognise that a document about 考古学 is about the same topic as a document about archaeology. As long as we cannot build an index with corresponding words, we cannot use automatic classification.

And I realised that it is much easier to manage files with tags than to work with replicants and duplicates in DEVONthink – not because the concept is bad but because the implementation is poor (deleting all duplicates or showing replicants is almost impossible).

I like DT and hope that it will after all overcome its shortcomings.

All the best,
Maria

alexwein · June 18, 2007, 8:15pm

I once again find myself agreeing with everything Maria says!! It is interesting, because I am just now today re-visiting the ‘can I replace DT?’ question. And it is precisely this issue of tagging that is bringing this latest round to bear.

I love some things about DT–wikilinks, its power, stability, ability to handle massive amounts of information quickly, this user forum–but I also find myself constantly running up against its limitations in organizing large amounts of information. Yes, yes, I know I can use comments to tag, but that is extremely cumbersome for my uses. I have used replicants extensively, but I end up with a mess of duplicate file names and the system breaks down anyway due to the inability to quickly tag items. The program I’m trying out today (again) allows me to create nested smart folders and create a new document within them that automatically tags the file for me. I can drag a file directly into a smart folder and voilà, I’m done! Very simple and direct and when I want to find these files, I know exactly where they are in an instant.

So, I am shamelessly grabbing onto Maria’s coattails here in adding my voice to the ‘need for tagging’ as well as other features that I consider to be basics for any genuine information management program, such as ‘real’ smart folders, etc.

Alexandria

Bill_DeVille · June 18, 2007, 10:25pm

Hi, Maria and Alexandria.

I shamelessly admitted in my long post above that there are times when I do tagging. And I understand Maria’s need for tagging when one is working with multiple languages.

But the reason I’m such a fan of DT for researching large document collections is that its contextual analysis AI support and some of the other ‘word’ tools (like Option-clicking a selected word, or the Context lists available in Tools > Search) provide built-in and infinitely variable “tag clouds” that don’t require work on my part. I’m always trying to look at my references from new perspectives, which means manual tagging becomes an oxymoron.

Yes, additional support for tagging will be in version 2.0. So will true smart groups. And – importantly – much more powerful querying that can allow complex filtering when needed.

Maria, I’ve already told Alexandria this. I’ve bought a log cabin in Brown County, Indiana, which is one of the pleasant areas of the U.S. for scenery and seclusion. Will also dabble again in the groves of academe at Indiana University, as I’ll be doing some light-duty lectures and research. Will probably make the move in late July or August.

Maria · June 19, 2007, 12:42am

That sounds wonderful. Lectures and Research – that is fun. Although I do not know Indiana (and the rest of the US) it sounds like a nice place. Congratulations for such a decision. I wish you good luck!

Maria

alexwein · June 19, 2007, 3:05am

Yes, I told Bill that he is living my dream! Down to living in a log cabin (not the roughing it kind, of course!). It sounds quite lovely!

Alexandria

fharvey · June 19, 2007, 12:00pm

I’m with Bill on the contextual search capabilities of DTPro, but I also want to voice support for adding hierarchical tagging capability. Personally, while I benefit from contextual searches, much of my work is organised hierarchically in relationship to projects, teaching and research interests. I think tagging would help keep the hierarchy that corresponds to my work activities.

Since we’re talking about dreams: A bolder dream in my book would be the cross-platform integration of data from multiple knowledge management tools (e.g. DTPro and Endnote)

Only Bill may be close to realising his dreams in Indiana, yet since I’ve been there and know the scale of that dream, I hope that modest and bolder dreams may come true too.

Timotheus · July 14, 2007, 5:57am

Maria’s point about the relevance of keywords / tags for multilingual users of DT Pro is indeed of fundamental importance, and has already been stressed more than once on this forum.

Therefore it’s a great relief to hear from Bill that “additional support for tagging will be in version 2.0”. But … what exactly does “additional support” mean? Does it mean that DT Pro will have a standard keywords / tagging feature, like so many other programs, or does it hint at something else?

jwiegley · March 16, 2008, 8:01am

Hi Bill,

I find your comments on tagging of interest. You mentioned that tagging is another kind of directory structure; but can’t we also say that groups and hierarchies are simply another form of tagging? If I have one file in a “Philosophy” group, and another in a “Religion” group, there is really no difference between that, and tagging the files accordingly within a single group. Tags and groups both are forms of metadata – or external classification – which is separate from the content.

I’m sure you use groups a lot, so in a sense you are creating those same artificial tags which you rebelled against in your post. They just snuck in under the radar, because you were able to use drag-and-drop, and easily separate files since groups are clickable and individually browsable, whereas tags are (currently) very manual.

I’d wager that if tags could be applied just as easily – using drag-and-drop, with (semi to full) automatic creation of smart groups, and tag clouds for easily selecting unions and cross-sections – that you would be using tagging a whole lot more than you do now.

Sometimes the tool determines how we work.

John

ndouglas · March 16, 2008, 12:47pm

Ahh, reopening old wounds

Bill_DeVille · March 16, 2008, 7:12pm

Hi, John. You are absolutely correct. Classification into groups and adding keyword tags into the Comment field of documents are both forms of tagging.

And yes, I do some tagging by both methods, and indeed by some other methods including the way I name some documents or by creating hyperlink references between documents.

What I “rail against” is the idea that I must spend front end time and effort in assigning tags of any kind to most of the new content of a database at the time it’s added. I will strongly resist using any document management system that requires me to do that, and punishes me if I don’t observe its rules.

To start with, I hate working with paper. Paper books; paper journals, magazines, newspapers and reports; paper correspondence; paper notes and file cards; and xeroxes of some of the above, plus notes to myself about where I put something or should put it – most of the latter maintained in my head until I forget them. Even today, paper comes in to me in an unending stream. And in the past, when I was working on a publication project or managing a governmental project, I would be working with many thousands of paper objects.

Paper objects can be organized. Librarians and archive specialists undergo years of training on how to do that, and they spend much of their time implementing such organization. But my offices have always been famous for clutter. I’ve usually got stacks of paper, some on my desk, some on chairs, some on the floor and some on shelves. Yes, I’ve got bookshelves, file boxes and file cases. I’ve always had a pretty good memory that allowed me to remember where a certain paper object could be found in a stack. But when a secretary or assistant finally prevailed on me to allow stacks to be properly filed in a file case, well organized by some criterion or other, I found that I often had more difficulty in locating information for a new purpose. The persons doing the organization had tagged each object, usually with a procedure that seemed objective and logical.

The problem with tagging is that it doesn’t cover all the alternative contexts in which a particular object might turn out to be useful. No tagging system can do that a priori. Who could ever know what needs for particular bits of information might exist in the future, or how to tag each object so as to anticipate such needs, so that the tagging system could quickly aggregate the information contained in all those objects for a particular purpose? We don’t do that. It could take hours, days, weeks or years to try to figure out a comprehensive tagging for a single object, with a high risk that it wouldn’t really turn out to be comprehensive. The resulting tags would be so complex as to be effectively unusable. In practice, we make a quick pigeon-hole decision.

Related to that is the fact that most documents contain not just one piece of information, but many. No practical tagging system can tag all of the different information components that might be found. Instead, the tag(s) for that document will subordinate most of the information components to the one or few that are judged most important at the time.

Thus, any practical tagging system will be simplistic, choosing among many possible ways in which a tag might be assigned. It is therefore subjective, so is likely to be applied inconsistently over time, and among different objects to which it is to be applied.

Tagging can sometimes be a disaster. Example: a governmental agency decided to transfer all of its historical documents to scanned versions of them on a computer. Subsequent to scanning, the paper documents were boxed in no particular order and sent to archives, and have since been destroyed. A contractor was hired to perform computerization of the paper documents. Because many of them were handwritten field reports, or contained handwritten notes and drawings, the decision was made not to perform OCR. Because OCR was not to be performed, the decision was made to scan the documents at low resolution (so subsequent OCR of the scanned documents became impractical). As each document was scanned, the scanner operator assigned keywords intended to permit computer searching for that document. A list of possible keywords had been compiled and was provided to the scanner operators.

Problem: The scanner operators were minimum-wage employees of the contractor, with no familiarity about the agency operations, the content of the documents, or the relevance of most of the keywords to any document.

Result: Millions of pages of historical documents, many of them relevant to current regulatory issues, have been effectively lost. Although they exist on a computer system, incomplete and inconsistent use of keywords – the only means by which the documents can be searched for on the computer – makes it impossible to find important documents in contexts in which they might be important.

This is a worst-case example. True, a possible remedy might be to have persons familiar with the content and purpose of each document reconstruct the keywords list for that document. But one shudders at the time and cost, and the logical problems with keyword tagging would remain. Even that would be difficult in this case. I’ve looked at the display of some of the documents I myself had created. Because of the low scanning resolution, the images are so blurry that one often has to guess at the content of the document.

But I will argue that even if the images had remained legible, and even if the keywords had been applied by persons familiar with the content and purpose of each document, much of the information content of those millions of pages of records would have been lost, as much of the information content would be unsearchable in a context other than that established by setting the keywords.

By contrast, most of the content of my databases allows full-text searching. That’s important.

OK, I do use some organization of content into groups. I’ve got hundreds of groups. Some people complain that DEVONthink constrains one to use hierarchical organization of all of the content.

I refuse to be a slave to organization! I view groups as clusters or clouds of related documents. Some of those clusters are carefully designed and may be tightly hierarchical. The vast majority of my clusters of information are pretty loose and undefined. I routinely violate the rules of hierarchy; I don’t care. I’ve got a big group named Miscellaneous into which I dump items that are peripheral to the more defined interests for which I use my main database, but which are sometimes relevant. Even so, I usually find that if I select a new item that I would probably toss into that heap, the Classify operation will usually suggest the Miscellaneous group for it. So DEVONthink can be pretty smart. It looks at the content of the documents that I’ve tossed into Miscellaneous, and recognizes that new document as one that would probably fit there.

I’ve got a much larger group named Incoming. Because my main database is almost always the open one, I use it to collect data that I’ll eventually transfer to another database. So I’ve got a number of subgroups under Incoming that correspond to those other databases. But right now almost all of the content of Incoming consists of documents that I haven’t taken time to classify. How many? 8,411, to be precise.

So the only tag for those documents is that they are untagged, either by organization or by keywords. Once in a while I’ll take the time to whittle down the backlog of unclassified material, when I don’t have something better to do – which isn’t often.

Just like every office or study I’ve ever occupied, I can be accused of maintaining a cluttered, messy database. It’s true. But DEVONthink Pro Office takes that in stride and doesn’t slap my wrist.

When I take on a project, though, I will organize my work pretty well for that purpose. I’ll create a project group with subgroups containing my drafts, notes and reference materials (duplicated or replicated). I’ll use keywords, Labels, States, highlighting and hyperlinked notes to tag things in that project group, as much as is useful to me for the project. DEVONthink lets me find useful references through searches, See Also, and by playing around with terminology by clicking on the Spelling or Context buttons on the Search window, or by looking at the Words list in a document view.

I love the writing environment for drafting inside the database. If I need to look something up, it’s right there. If I want to see what someone else might have written that’s similar to the content of a couple of paragraphs I’ve just done, I select those paragraphs and choose the contextual menu option See Selected Text. So it’s a great environment for exploring ideas, and a great environment for breaking a writer’s block by finding something interesting.

Fortunately for me, DEVONthink’s searches, See Also and the other tools I mentioned operate just as well whether I’ve organized or otherwise tagged items, or not.

So while there are times I make use of a considerable amount of tagging, it’s post hoc, not a priori. It’s after the fact of my having defined specific tag references for a project, which is pretty easy to do. I understand the context in which material relates to that project.

About the only time I use keywords up front, when I add an item to the database, is for simple PIM stuff, or for something like an expense or other item related to my need to put together records for a tax report – I already know the context and purpose of such records.

But when I tag items for a project, I take great care not to have those tags spill over into the rest of my database. They are explicitly limited to that project. When I’m finished, I’ll export that project to its own database and remove everything except the finished product from my main database. Otherwise, those project-specific tags will be confusing next time I tackle a completely different project.

And I luxuriate in how much time and drudgery I’ve saved by not trying to define the most appropriate tag(s) for every new item I add to my databases.

jwiegley · March 16, 2008, 9:16pm

I really appreciate the thoughtfulness of your reply. I take many of your points to heart.

I too do not rely on tagging as a way of finding things. Rarely do I have the forethought and presence of mind to tag something in such a way that I can remember exactly how I had tagged it in order to find it! For discovering information, I rely on full-text searching and See Also.

But browsing information is something else entirely. For example, I have an “unread” tag/group (from this point on, I’ll dispense with the non-essential difference between tags and groups, and just call them all tags; the only real difference today is the way they’re used). I have tags for games, rpg games, python, erlang, etc.

When I want to do a search among all my python documents especially, I use a smart group to pull in everything that’s been tagged as relating to Python, and then focus my database search there. If that doesn’t yield what I want, I open it up to the full database. In this usage, tagging is a shortcut – like being able to walk into the Scifi section at your local bookstore is a shortcut. It facilitates browsing among objects that will likely all be directly related to your main topic of interest.
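The "search the tagged subset first, then widen" habit described above can be sketched in a few lines of Python. This is only a model of the workflow, not of how DEVONthink smart groups work internally; the record layout (`name`, `tags`, `text`) and the function names are assumptions made for the example.

```python
def search(records, term, tag=None):
    """Names of records whose text contains term, optionally limited to one tag."""
    term = term.lower()
    return [
        r["name"]
        for r in records
        if (tag is None or tag in r["tags"]) and term in r["text"].lower()
    ]


def scoped_search(records, term, tag):
    """Search the tagged subset first; fall back to every record if nothing is found."""
    return search(records, term, tag) or search(records, term)


# Hypothetical sample records.
records = [
    {"name": "decorators.txt", "tags": {"python"}, "text": "Notes on Python decorators"},
    {"name": "otp.txt", "tags": {"erlang"}, "text": "Erlang OTP supervisors"},
]

print(scoped_search(records, "decorators", "python"))   # found inside the tagged subset
print(scoped_search(records, "supervisors", "python"))  # falls back to the full set
```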

DEVONthink could make tagging really come alive if there was a checkbox on the Classify pane: Classify by tag. If unchecked (the default), the behavior would be the same as now. If checked, it would show me a list of possibly relevant tags, rather than groups, and if I selected a bunch of tags, it would add those tags to the document. Or maybe it could fold both tags and groups together, so that I could do everything in one go just by Command-clicking…

When a particular document fits into four or five categories (a real example from my filesystem: &games &rpg &apple2e &ultima &docs), tagging becomes a nice, lightweight way to relate like documents. The alternative is replicants and groups, but there is one other advantage to comment-based tagging: Spotlight can find them as well.
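For readers who want to process such comment-based tags outside DEVONthink, a minimal Python sketch follows. It assumes the convention shown in the example above (space-separated words, each prefixed with `&`); `parse_tags` is a hypothetical helper, not part of any tool mentioned in this thread.

```python
def parse_tags(comment):
    """Extract tag names from a comment such as '&games &rpg', ignoring other words."""
    return [word[1:] for word in comment.split() if word.startswith("&") and len(word) > 1]


print(parse_tags("&games &rpg &apple2e &ultima &docs"))
```

Words without the `&` prefix are skipped, so ordinary comment text can sit alongside the tags.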

Which brings me to another suggestion: In the preference pane for DEVONthink, make it possible to export the parent groups of an item to the kMDItemKeywords metadata for each .dtp1 metadata cache file. Thus, if my record were located in “/Clippings/Erlang/Server”, I would see this from mdls:

kMDItemKeywords = (
    Clippings,
    Erlang,
    Server
)

This would blur the distinction between grouping and tagging even more, and would make it possible to find DTP articles in the Finder by such means as:

threads kind:devon keyword:erlang

Of course, the ultimate for all this would be to eliminate the distinction altogether: groups would be tags and tags would be groups. When you replicate a record into another group, it now has keywords for those groups and their hierarchical ancestry apiece. If you move a record, the keywords change to the new location. If you set the tags manually (rather than dragging and dropping into groups), DTP would instantly create the necessary groups at top-level and would replicate the record into those groups. To specify nested groups during a manual tagging, something like Clippings/Erlang would be possible. Of course, I’d expect a pull-down menu of all the possibilities, and typeahead completion, just to coddle the keyboard lovers among us.
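The proposed mapping from group locations to keywords can be sketched as follows. This is a hypothetical Python model of the suggestion, not an existing DEVONthink feature; `keywords_from_location` and `keywords_for_record` are illustrative names, and a replicated record is modelled simply as a list of its group paths.

```python
def keywords_from_location(location):
    """Split a group path such as '/Clippings/Erlang/Server' into its group names."""
    return [part for part in location.split("/") if part]


def keywords_for_record(locations):
    """Keywords for a record replicated into several groups: every ancestor, listed once."""
    keywords = []
    for location in locations:
        for keyword in keywords_from_location(location):
            if keyword not in keywords:
                keywords.append(keyword)
    return keywords


print(keywords_from_location("/Clippings/Erlang/Server"))
print(keywords_for_record(["/Clippings/Erlang/Server", "/Projects/Erlang"]))
```

Moving a record would then simply mean recomputing its keywords from the new list of locations.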

There should always be a balance between automation and carefully executed thought. Never has either been a replacement for the other; but the union can be glorious.

John

jwiegley · March 16, 2008, 10:23pm

After a quick bit of thought, I realized I could implement a stop-gap for my suggestion (albeit much more manually) using AppleScript today. This script will take the current hierarchy for an item, and its current tags (set via comments), and will make sure that it’s replicated in a group named for each tag, and also that its comment reflects every group it’s in.

This script is really only appropriate if you use a single level of directory hierarchies. If I find the idea useful, I’ll extend it to include the notion of tag/group hierarchies.

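tell application "DEVONthink Pro"
	set theRecords to the selection
	repeat with theRecord in theRecords
		-- Collect candidate tags from the record's group path and from its comment
		set theTags to {}
		set theLoc to the location of theRecord
		set theTags to theTags & words of theLoc
		set theComment to the comment of theRecord
		set theTags to theTags & words of theComment
		
		set theNewTags to ""
		set seenTags to {}
		repeat with tagRef in theTags
			set theTag to contents of tagRef
			-- Skip a tag already handled, so the comment does not grow on every run
			if seenTags does not contain theTag then
				set end of seenTags to theTag
				if theNewTags is "" then
					set theNewTags to "&" & theTag
				else
					set theNewTags to theNewTags & " " & "&" & theTag
				end if
				
				-- Replicate the record into a top-level group named after the tag,
				-- creating that group first if it does not exist yet
				if (get record at ("/" & theTag & "/" & (name of theRecord))) is missing value then
					if (get record at ("/" & theTag)) is missing value then
						create record with {name:theTag, type:group} in get record at "/"
					end if
					replicate record theRecord to (get record at ("/" & theTag))
				end if
			end if
		end repeat
		
		-- Rewrite the comment so that it lists every tag/group exactly once
		set the comment of theRecord to theNewTags
	end repeat
end tell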

Bill_DeVille · March 16, 2008, 11:24pm

Good responses.

jwiegley · March 17, 2008, 1:34am

I came up with two scripts that do much better toward my desired working scenario. See the new threads I created in Tips & Tricks:

John