Revisiting Steven Berlin Johnson

I re-reread this article in response to this post in another thread, and my attention was inevitably drawn to the 3 reasons why Johnson thinks his system works well. I have no argument with 1 (DT is good at making connections), but I wonder what people think about 2 and 3: prefilter results by including only quotes that are of particular interest (presumably throwing away all the rest); and make each file about 50-500 words in length.

With plenty of time or a research assistant, this would presumably be a good way to go. But to what extent do people think it’s necessary in more typical scenarios? (Yesterday, elsewhere, I mentioned a case where splitting a PDF proved necessary: the PDF was more than 270 pages long and comprised a whole journal volume. But I had never really considered going further than breaking it up into whole articles.)


I wrote an entry on the Circus Ponies (Notebook) user forum that somewhat addresses the question of how to split PDF files. I have tried this on a few occasions and have mixed feelings about its usefulness, though I still have high hopes that the AI search capabilities of DevonThink will prove to me that the process is worthwhile in the end.

Dear Fellow users,

Many of you are likely familiar with Steven Johnson’s path-blazing article about using Mac software to enhance writing: … 00230.html

Essentially, the idea is to have a search engine go through clusters of text no longer than 50-500 words and show semantic relations between them, which might lead to discoveries that enhance a writer’s project. He calls the 50-500 word cluster the “sweet spot” because anything longer (like whole articles) is too much information to yield interesting thoughts or connections. Johnson fantasizes about software that might break up texts (books or articles) into clusters suitable for semantic searching. Here’s the quote:

“I wonder whether it might be possible to have software create those smaller clippings on its own: you’d feed the program an entire e-book, and it would break it up into 200-1000 word chunks of text, based on word frequency and other cues (chapter or section breaks perhaps.) Already Devonthink can take a large collection of documents and group them into categories based on word use, so theoretically you could do the same kind of auto-classification within a document. It still wouldn’t have the pre-filtered property of my curated quotations, but it would make it far more productive to just dump a whole eBook into my digital research library.”
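
A minimal sketch of the kind of automatic chunking Johnson describes, assuming plain text with blank lines between paragraphs. It is not anything DevonThink or Notebook provides: the function names and the greedy packing rule are illustrative assumptions, and it uses only paragraph breaks, not the word-frequency cues he mentions.

```python
import re


def split_paragraphs(text):
    """Split on blank lines and drop empty fragments."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def chunk_paragraphs(paragraphs, min_words=200, max_words=1000):
    """Greedily pack whole paragraphs into chunks of roughly min_words..max_words.

    Paragraphs are never cut in half, so a single very long paragraph
    can still produce a chunk larger than max_words.
    """
    chunks, current, count = [], [], 0
    for para in paragraphs:
        n = len(para.split())
        # Close the current chunk if this paragraph would overshoot the
        # limit and the chunk is already big enough to stand alone.
        if current and count + n > max_words and count >= min_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Calling `chunk_paragraphs(split_paragraphs(book_text))` returns a list of strings, each of which could be saved as its own clipping. Because it respects paragraph boundaries, the last chunk may fall below `min_words`.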

Well, it seems to me that Notebook can fulfill this function through its indexing feature. Paste or import a full article (say, ‘paste text as outline’ in the ‘Edit’ menu–which separates paragraphs as separate cells). Then, in the index, look for links to text that contain key words, words that you think might refer to an important or central idea. Then, go through the text and clip any relevant quote into your semantic search program (Johnson uses DevonThink).

I wonder if anyone has worked with this feature. Any success? Any ideas to modify the process?

Well, I don’t know if I would call SBJ’s article “path-blazing.” Illuminating an extant trail, yes. Blazing a new one, no.

And, to answer both the question he asks and the one raised here: I think what everyone, including the blazing SBJ, recognizes is that what makes his database so damned good is that the information in it has been pre-selected. It’s not regular grain, it’s fortified breakfast cereal. Context matters, and what makes a search, especially the fuzzy search of DT, work is that there is a rich contextual network. If you poured every part of an article in, you would weaken that network. You might still get interesting results. You might even get unanticipated results that led to fundamental insights and innovations, but what SBJ was getting back was his own mental processes, mirrored and sorted outside of his own mind. It’s sort of the ultimate version of Freud’s mirror stage: “Goodness, that’s me!” Except, it’s more like: “Goodness! That’s what I think!” And lest you think I’m making a mockery of all this: this is a very cool thing indeed. Witness my own use of DT Pro, and in a way very similar to SBJ’s:


////Montanari 1994
/////24 – Quotation or notes paraphrasing larger span of materials. Anywhere from 10-1000 words.

The “Montanari 1994” above is a text (as a folklorist I use Chicago style documentation, with the author-date option) and the “24” beneath it is a page number. Sometimes it can be a range: as small as one page break for a quotation that runs across two pages, but sometimes 10 pages, or even more, when I simply prefer to summarize a chunk of the text which is less useful in detail but still worth capturing in some fashion.

Labor intensive? Yes. Context rich – that is, a context of my choosing and so richly meaningful to me? Yes.

Would I be interested in a script that could take an RTF, DOC, or PDF and break it into paragraphs and make each paragraph a separate entry in DT? Sure. But I’m not sure the results would be as good.
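
A rough sketch of what such a script might look like, assuming the document has already been converted to plain text (RTF, DOC, or PDF would need a conversion step first, which is not shown). It only writes one text file per paragraph into a folder that could then be imported by hand; it does not talk to DT at all. The function name and the file-naming scheme are hypothetical.

```python
import re
from pathlib import Path


def paragraphs_to_files(text, out_dir, source_label):
    """Write each paragraph of `text` to its own numbered .txt file.

    Files land in out_dir/source_label/, named after the source so the
    origin of every entry stays visible after import.
    """
    out = Path(out_dir) / source_label
    out.mkdir(parents=True, exist_ok=True)
    # Treat one or more blank lines as a paragraph break.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    written = []
    for i, para in enumerate(paragraphs, start=1):
        path = out / f"{source_label} - {i:04d}.txt"
        path.write_text(para + "\n", encoding="utf-8")
        written.append(path)
    return written
```

For example, `paragraphs_to_files(article_text, "clippings", "Montanari 1994")` would produce `clippings/Montanari 1994/Montanari 1994 - 0001.txt` and so on. Nothing here does any pre-selection, which is exactly the property the curated approach has and this one lacks.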

Of course, I could be wrong and I would be happy to hear discussion to the contrary.

EDIT: I didn’t originally write “Goodness” in the quotations above. It was sh*t, but that got replaced with “Nuts” and that just didn’t sound like me. I have been known to replace almost any expletive with “goodness” when my one year old daughter is around, so I went back in and cleaned up the language my way. I had no idea that there was a language censor built into this thing. Also, spaces aren’t respected by this version of phpBB, and that’s why the directory structure represented above has multiple slashes to represent nested groups.

Wow, this is hot stuff. I’m going to try it.

I think almost everyone here would agree with that!

My first reaction is that the way I’m working now is labour-intensive, but not very effective. If we can find a way that is labor-intensive but works really well, it would be well worth the effort.

The results may not be as good, but once it’s broken up like this there’s nothing to stop us throwing away the entries we don’t need, or moving the ones we do need into a current work or “hot” database. It might be easier than using Circus Ponies as an intermediate step.

I don’t bother with emulating Johnson’s practice of breaking up books and documents into small bits.

I’ve got thousands of references, most longer than one page, some running hundreds of pages.

When I’m looking for something the real gem of information may be on page 473 of a book. But the odds are that DT Pro will find it for me anyway.

When I’m writing, though, I may break out a particular section or paragraph as a clipping into a new document, so that I can play See Also or Words for that specific topic. When I’m looking for ideas on that topic, I don’t want to look at the 2 or 3 other topics that might be discussed in my article.

Can you elaborate on exactly how that happens? I’m using the “See Also” feature, and it’s returning relevant results… but I can’t figure out where in the 270-page PDF it just returned the relevant material is.

I’m wrestling with whether or not I need to break these PDF books down into smaller chunks so that the results returned are manageable. Knowing that the 270-page book is relevant doesn’t help me find the quote I’m looking for. :frowning:


I second John Randall’s excellent question. Could you please answer it, Bill? It is really quite critical as it affects how we go about setting up our databases.

Nothing mysterious, really. I’ve got just a few hundred long documents, say over 200 pages, in my main database. I don’t pretend that I’ve read through all of them, but I chose to add them for a reason, because they contain material that interests me.

If a book-length document turns up in a See Also list, I’ll open it in its own window (or in DEVONthink 2, a new tab) and take a quick look at the table of contents. And/or click on the Word button to see if one of the terms I’m interested in shows up. And/or see if that document pops up on the results list if I do a quick search for some pertinent words.

If within 4 or 5 minutes max I haven’t found evidence that I want to spend more time on that book for the purpose at hand, I move on (often, in less than a minute). But if it does look interesting, I may spend a deal of time going through it, as much time as it takes. If it’s really good I may spend the rest of the day reading it and making (searchable!) notes. That’s called studying; I hope I’ll learn something. :slight_smile:

The idea of chopping up my collection of selected references into little bits and pieces makes me shudder. For if it’s worth having, chopping it up makes it harder to make sense of (how’s that for a dangling participle?). A well-written book or report has a flow, a meaningful sequence of facts and ideas. Often, alternative ways of looking at something are evaluated, then dismissed. Picking out one of those dismissed alternatives as a “snippet” has very little to do with the information in the book.

In short, I don’t try to produce a collection of Bartlett’s Quotations, pithy though many of those may be. I’ve never bought into Johnson’s premise of working with little chunks of words. I don’t believe he really thinks that way. IMHO, he’s culled out short extracts that remind him of their context, which is probably more important than the extracts themselves. And he used paid assistants to put together those collections for him.

Call me Scrooge. Bah, humbug to the idea of parsing everything into small pieces, like making a jigsaw puzzle out of a picture of Niagara Falls. Pick out one of those little pieces, and you don’t have much perspective about Niagara Falls.

Sure, See Also has more trouble with a 270-page book than with a hundred-word abstract (so do I, for that matter). That’s the way the world is. I still value See Also very highly. It has given me some very valuable tips about some of the books in my database. It also gives me some bummers. It’s up to me to follow up or dismiss those tips. Overall, See Also saves me a lot of time and effort.
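
For the narrower problem of finding where in a long document the relevant material sits, here is a small sketch that is not a DEVONthink feature and does not reproduce See Also. It assumes the book’s text is already available as a list of per-page strings (the extraction step is not shown) and simply counts how often the terms of interest appear on each page. The function name is hypothetical.

```python
import re
from collections import Counter


def rank_pages(pages, query_terms, top=3):
    """Return (page_number, hits) pairs for the pages mentioning the terms most often."""
    terms = {t.lower() for t in query_terms}
    scored = []
    for number, text in enumerate(pages, start=1):
        # Crude tokenizer: lowercase runs of letters and apostrophes.
        words = Counter(re.findall(r"[a-z']+", text.lower()))
        hits = sum(words[t] for t in terms)
        if hits:
            scored.append((number, hits))
    # Most hits first; ties broken by earlier page number.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:top]
```

This is plain exact-word counting, so it will miss the fuzzy, conceptual matches that make See Also useful. It only narrows down which pages to open first.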

Thanks for the explanation, Bill. I’m with you on the thought that breaking my database into little chunks “makes me shudder.” On the other hand, Johnson’s approach for adding new material from printed books that are not already digital makes a lot of sense and is useful. I suppose it is not an “either/or” situation – and I currently use both techniques.


Thanks again for the thorough response Bill.

…I need to chew on this for a while.

I’m not sure why people are treating the “chunked” approach and the “whole document” approach as mutually exclusive. My database contains both.

As noted, chunks are one way to handle documents that don’t exist in electronic form. Chunks also come in many shapes and sizes. I wouldn’t break a 400 page book into 500 word snippets, but I might break it into chapters or break a long chapter into sections.

It also depends on my need for the material. If I am doing my initial research on a topic, I’ll want to read long sections with full context. But if I’m double-checking references before submitting a paper, I’ll want to go directly to the section containing a particular fact.

I would agree with Bill, though, in that in my experience breaking all documents into chunks is a waste of time. Until I’ve worked with the document, I don’t know whether it’s useful at all, much less which pieces of it might make viable chunks.


Father Moses and Katherine, I completely agree that mixing whole documents and searchable notes about them (which may contain excerpts) can extend the power and usefulness of a database. I routinely do that.

And when I do so, I am thinking rather like Johnson. I’ve entered comments and selected excerpts, based on my own analysis of a document, that emphasize material of special importance to me.

Such notes may be associated with a document that’s in the database or, as Katherine suggests, may introduce information about a book that hasn’t been digitized.

There’s an important distinction between this approach, which reflects my thought process in making comments about a document, and some mechanical action that chops a document into a pile of “snippets”. I would place a much higher value on my notes than on a random, mechanically-derived “snippet”.

By the way, I make a lot of searchable notes when I’m working on a project. I don’t use Skim or any other single-filetype note-taking approach, as I want to make notes relevant to a project regardless of the filetype of the document that’s referenced. I have no problem identifying the referred to document in a note, nor any problem in identifying all of the notes associated with a particular document.

I’ve been working up an example database that illustrates the techniques I use for note-taking and association. Last night I emailed the first pass to Eric for comment.

This is a really great thread.

I have a handful of different usage scenarios, many of which involve note taking and chunking. Some of them work well. Others are downright clumsy and arduous. If anyone has any advice on how to better handle any of these situations… or if you think I’m barking up the wrong tree and have a completely better idea, please share.


  1. Reading a physical book– I scan in or re-type special passages. Anywhere from 20-500ish words. In each note, I put the page number. I also type in my reactions and thoughts in that same file. I put them all in one folder so I can have all the quotes from that reading in one place. I replicate them to other folders if need be. I also try to find a PDF of the book to store in a folder so I have it on hand later if I want further context for these quotes.

  2. Reading a PDF book– I like to take notes in Skim because A) reading books in Skim is a pleasure and B) the notes are directly contextualized, like scribbles in the margins. I’m not sure what the best way is to get these Skim notes into DEVONthink, but I’m not worrying about it until DT v2.0 comes out.

  3. A REALLY GREAT book– If there is a book I keep coming back to… one that is relevant to a lot of my research, so full of relevant passages that I find myself highlighting the whole book, I definitely find an electronic copy, preferably a PDF copy. (I have a couple of methods for breaking the DRM off of ebooks.) I store the entire PDF in DEVONthink, then I highlight each chapter or section and copy it into an RTF file. This helps me get to relevant passages (via See Also) quite quickly when I’m looking for them. The problem is, the resulting RTFs are ugly. I want them as PDFs and I want to be able to open them in Skim easily. I’m still looking for a decent automated way to chop up PDFs. There are ways to do it, using PDFpen or the like, but they are really clumsy and time consuming, and I don’t have the chops to script them.

Journal Articles
  4. PDF into DEVONthink– read in Skim. Notes in Skim. Again, not sure what the best way to index these notes in DT is. Waiting for version 2.0.

Web Pages
  5. Web Archive via bookmarklet–
Pro: Web archives are great because I can read them offline, and the URL is stored.
Con: A little slow to render. More importantly, they don’t play well with Bibliographic software.

  6. Print a PDF to DEVONthink–
    Pro: fast, preserves formatting. Can read in skim and add notes.
    Con: URL is lost

  7. Multiple page web pages– Websites that spread articles across multiple pages without offering a “print view” page that puts all the content onto one page are horrible. Sometimes I need to make 10 different web archives just to capture one article. Usually I print 10 PDFs, then consolidate them in PDFpen or similar. It is a really cumbersome process.

My Writing
  8. Final Draft goes into a folder. Final draft sometimes gets replicated elsewhere.

  9. Information rich drafting and organizing materials–
    Sometimes I have really great chunks of thought that don’t make it into a final draft. I don’t want to lose these chunks, so I print PDFs of my information rich drafting materials (usually from MindManager or OmniOutliner) into DEVONthink. They go into the folder with the final draft.

The problem is, these PDFs can be indexed, but they are fixed, and therefore aren’t as useful as the working documents themselves. So I wind up importing the original OmniOutliner and MindManager files alongside the PDF versions so I have them there for safekeeping. (They aren’t indexed and are just imported as URLs.) It’s a little cumbersome.

I’ve tried exporting into OOML language or XML– trying to find something that retains some of the functionality of the original doc while allowing DT to index them. So far, no luck.

This is a great thread.

John—I believe all file types with QuickLook plugins should be indexed by DT 2.0, so that should take care of your OmniOutliner and MindManager files, at least.

I can’t wait to have a look at your note taking example database… :slight_smile:

I’ve posted this in my Public folder.

URL: … US&lang=en

Download “Example Database - Associating Notes.dtBase2”. There’s also a version for DT Pro/DT Pro 1.x – note that the 2.0 version has the suffix “.dtBase2”.

NOTE: Example databases are not available for DEVONnote or DEVONthink Personal, as these are limited to a single database.

The download is 4 MB. When the download is complete, double-click on the file to decompress it, then double-click once more to open it under DT Pro 2.0 or DT Pro Office 2.0.

I used an OCRd copy of an old paper about an environmental science exchange with Egypt to illustrate association of notes with that PDF.

Demonstrates my personal biases and eccentricities, but I find this kind of workflow quick, intuitive and effective. As always, feel free to disagree or to modify my approach in any way you wish. :slight_smile:

Thanks again for sharing Bill.

Dear Bill,

I know it’s been a while, but I was wondering if I could also take a look at your example database (apparently I don’t have permission to download from the public site?). It sounds like exactly what I’m looking for.

Thank you -


There have been a number of enhancements to DT Pro/Office so I’ve started a redesign of it, but haven’t had time to finish it. Send me a PM in 2 or 3 weeks and I’ll try to finish it.