Indexing entire books: A meaningful strategy?

I have a ‘strategy’ question: My DT database consists largely of research notes and ideas, all organized as text ‘chunks’ and PDFs whose length generally ranges from a few lines to a couple of screenfuls.

Tempted by the enormous power of DT, I gradually became voracious and decided to OCR my entire library of scholarly papers and books, the intention being to have it indexed by DT at some point. Should I indeed proceed with this plan, my database would all of a sudden include some quite long documents, e.g. searchable PDFs hundreds of pages in length.

While my Core2Duo churns out the OCR’ed PDFs, I am beginning to question the meaningfulness of this strategy. DT is an AI-driven environment, and I wonder whether its algorithms (e.g. ‘see also’, concordance, etc.) are designed to cope with such long documents.

In fact, I recently ran across this (admittedly old) article: http://www.stevenberlinjohnson.com/movabletype/archives/000230.html
which claims that DT has a ‘sweet spot’ of approximately 500 words per note, above which the returns begin to diminish.

In principle, however, I find the ability to search my entire library of articles, ebooks and research notes at once quite irresistible.

Any comments would be appreciated. Thank you.

I once tried what you are considering and soon rethought the plan. Just the additional disk space required for these scanned books was going to be an issue at some point. What I have done instead is to take a copy of the book’s PDF, delete all pages except for the table of contents and the index, and then OCR these sections (sometimes I’ll even skip the index). Edit the book to replace the original sections with the scanned sections, or just leave the index and ToC in their own document with a link to the original book.
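
For anyone who wants to automate the page-extraction step, a minimal sketch might look like the following. It assumes the third-party pypdf library, and the file name and page ranges are hypothetical; they would have to be read off the actual book.

```python
# A minimal sketch of pulling just the ToC and index pages out of a book PDF
# so that only those pages need to go through OCR. Assumes the third-party
# pypdf library; file name and page ranges are hypothetical.

from pypdf import PdfReader, PdfWriter

SOURCE = "book.pdf"              # hypothetical input file
TOC_PAGES = range(4, 9)          # 0-based pages holding the table of contents
INDEX_PAGES = range(410, 432)    # 0-based pages holding the index

reader = PdfReader(SOURCE)
writer = PdfWriter()

# Copy only the pages worth OCR'ing into a new, much smaller document.
for i in list(TOC_PAGES) + list(INDEX_PAGES):
    writer.add_page(reader.pages[i])

with open("book-toc-and-index.pdf", "wb") as f:
    writer.write(f)
```

The resulting small PDF is what goes through OCR; afterwards it can be merged back into the book or kept alongside it with a link to the original.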

I’ve also experimented with creating the table of contents in a text document, with each entry linked to the actual corresponding page in the PDF, then replacing the non-OCR’d ToC in the PDF with the new ToC. Then I can not only search for the text in the PDF’s ToC, but also click on a link to go directly to that chapter in the book. This process may take more time than it adds in value for some books, but for books that are accessed on a semi-regular basis it can be quite valuable.

Many thanks, Greg.

So I take it that your reason for abandoning the original plan was disk space? This alone would not be a deterrent in my case, as storage is ample.

Your final idea of linking each entry in a text-based ToC with the corresponding pages in the non-OCR’ed ebook is appealing, but the workload involved seems truly prohibitive. And one would also want to do the same with the index, which would take more manpower than building the pyramids!

Correct: storage space was my primary consideration for abandoning the idea, and not having the processor horsepower you have available to speed up the OCR process was another concern. I did not experiment with the results to see if Johnson’s theory would have merit for scanned books. I may be mistaken, but I believe Bill D has explored Johnson’s idea to see if it has merit? Perhaps he will chime in if that is the case.


Steven Johnson is a prolific writer who has written about how he uses DEVONthink databases to support his work. He advocates entering text “snippets” extracted or summarized from books or articles, each of which concisely presents a single concept or fact (and includes source information). And he has concluded that such snippets should be no longer than about 500 words. When he’s writing a new book he finds that exploring the “snippets” database constructed for that project is very efficient, including his use of See Also.

I’m certain that approach works well for Johnson.

But my databases do not consist just of snippets, and never will. For one thing, Johnson uses paid research assistants to pore through libraries and journals and create those snippets to go into his databases. He’s a well-known author and can justify this as a business expense. I don’t have paid assistants, and I don’t have the time to go through the long documents in my databases and distill them into snippets.

Some have interpreted Johnson’s advocacy of snippets as a call to split long documents into “chunks” of 400 or 500 words, either manually or using a script to do that. I’m not going to do that, and I’m quite sure that’s not what Johnson expects to get from his paid research assistants, from whom he expects perhaps just one or a few snippets from a book or article. There’s no guarantee that such a mechanical approach would capture a “whole” fact or idea per snippet, and a high likelihood that the conceptual gem contained in the document will be split among two or three snippets, perhaps in a confusing way.
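
To make the objection concrete, a crude word-count splitter of that sort would look roughly like the sketch below (purely illustrative, not something I use; the function name is my own). It cuts wherever the counter says to, with no regard for where a thought begins or ends.

```python
# Purely illustrative sketch of a mechanical "chunker": split a plain-text
# document into consecutive pieces of at most ~500 words. The cut points fall
# wherever the counter lands, not where an idea ends.

def split_into_chunks(text: str, max_words: int = 500) -> list[str]:
    """Return consecutive chunks of at most max_words words each."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]

# A 1,200-word article becomes three chunks; whatever argument happens to
# straddle word 500 or word 1,000 is severed mid-thought.
```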

Finally, I’m a nut about not vandalizing my collections of references. I don’t split, highlight or mark up my documents. If I want to mark up a document, I’ll do that on a duplicate and then delete it when I’ve gotten what I want from it.

My approach, when a search or See Also list presents a long document that might be interesting, is to examine it for possible use. If it seems worthwhile, I’ll create a new rich text note (probably the “Annotation” smart template, though I may create others as well) in which I enter notes or brief excerpts, with links to a particular page in a PDF, or a “cue string” that allows me to do a Lookup search that will be highlighted in the search result.

Such notes become rather like Johnson’s snippets, as they condense and point to particular facts or concepts in my references, and I keep many of them in my databases for that reason. My main database contains about 25,000 references and about 5,000 notes. Many of those notes point to multiple references via hyperlinks.

In the past I’ve been a research assistant with duties like those of Johnson’s assistants, and later had my own research assistants. Nowadays, I maintain that DEVONthink is the best research assistant I’ve ever had, as well as the least expensive one. :slight_smile:

The collection of reference materials in my main database has been built up over the years and is dynamic, with new content added (perhaps as the result of DEVONagent searches for a topic, as well as through new articles, reports and books of interest from journals and other sources). From time to time I may weed out items, or even put them into groups labeled “Nutty” or “Junk Science”.

Thanks for this extensive response, Bill. It helps put Johnson’s ideas into technical perspective. My OCR batch job is estimated to complete in a couple of days and I will index my entire article and book library soon thereafter. I will keep you posted on this thread should any problems arise.

I took the plunge 6 months ago and scanned several hundred books into DTPO. I love having them there, especially for keyword and search purposes. I can search for a particular author or phrase and see not only the books that he or she has written, but also the other times he or she has been quoted in my library. It’s very useful and enlightening.

However, as Steven Berlin Johnson and the original poster mentioned, it is not very useful for the “See Also” function. “See Also” very often gives me a list of book PDFs without any way of knowing exactly what in the book triggered their inclusion. To learn more I have to open the PDF and do keyword searches within it. This can slow me down quite a bit.

Though over time, I suppose I can do what Bill mentioned below. This will help me to slowly build up a more meaningfully searchable database:

“My approach, when a search or See Also list presents a long document that might be interesting, is to examine it for possible use. If it seems worthwhile, I’ll create a new rich text note (probably the “Annotation” smart template, though I may create others as well) in which I enter notes or brief excerpts, with links to a particular page in a PDF, or a “cue string” that allows me to do a Lookup search that will be highlighted in the search result.”

Just thinking out loud here. Would enjoy knowing what others are doing with their books and how they search.

Thanks. Ryan

I found myself facing the same question and I tried digitizing some large books…

pro:

  • if OCR works well (which depends on the quality of the scanned PDF), you get full-text search and no longer have to rely on the book’s index or spend a long time hunting.
    With the right keyword, this can quickly lead to the interesting pages.

contra:

  • digitizing many pages is very time-consuming (is it worth it? maybe only for books you need often or with very interesting/difficult content)
  • requires disk space / backup space
  • enlarges your database, which might consume RAM and make the Mac slower (unfortunately my old MacBook is limited to 3 GB of RAM, and with my large DB I think that is a problem)

I don’t have much experience with “see also” in this case.
However, for some practical reasons (previewing/opening large scanned PDFs might take ages) I have many books saved as one PDF per page or double page.
If the PDF name contains the correct page number, this is good for citing, and See Also might make more sense, if the “<=500 words theory” is correct…
(However, page breaks mostly don’t coincide with the end or start of sections, so the resulting chunks of information are somewhat randomly distributed…)
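
In case it helps anyone, here is a minimal sketch of how that per-page split could be automated. It assumes the third-party pypdf library; the file name is hypothetical, and it naively assumes the printed page numbering starts on the first PDF page, which won’t be true of every book.

```python
# A minimal sketch of saving a scanned book as one PDF per page, with the page
# number in the file name so the file name doubles as the citation page.
# Assumes the third-party pypdf library; the input file name is hypothetical,
# and printed numbering is assumed to start on PDF page 1.

from pypdf import PdfReader, PdfWriter

SOURCE = "book.pdf"   # hypothetical input file

reader = PdfReader(SOURCE)
for i, page in enumerate(reader.pages, start=1):
    writer = PdfWriter()
    writer.add_page(page)
    # e.g. "book-p042.pdf"
    with open(f"book-p{i:03d}.pdf", "wb") as f:
        writer.write(f)
```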

an interesting topic to discuss…

@Bill: thanks for sharing your experience…
I disagree on one point: for me, PDFs that are already highlighted (I use Skim) make finding relevant information easier, at least if you’re always on one topic, as in my thesis. Maybe if you’re researching different aspects, highlighted sections from an earlier reading might be misleading…

:open_mouth: how much time did that take?

Elwood - Quite frankly, I wasn’t paying too much attention as I didn’t want to get discouraged. I took my books to a printer about 20 to 30 at a time. He cut off the bindings and I fed them through the sheet-feed scanner. I did it during the final two weeks before I left the U.S. for Mexico, so it was all a bit of a blur.

It was worth it, though!

- Ryan

@Ryan: OK, I see.
(Now I remember that you wrote that in another post; it seemed very cruel to me to destroy the books.) With single pages and a ScanSnap or similar, this is feasible…

Yes, cutting up those books got me (literally) sick to my stomach. I still feel it sometimes. Did not do it lightly. But having them in DTPO has been priceless.

I’m just about to do this, too. Is it possible to glue them back together? I was thinking that if not, I can have them spiral-bound instead.

At any rate, I don’t think I’ll feel too bad. All of the books that I really want in DTP are worth far more than 2x what I paid for them, so if I desperately need to have that paper copy for some reason then I can just buy another one.

Hi there - Yes, you can certainly make a few “spiral bound.” I was leaving the country and ended up throwing many of them away. I wish I had put them in storage and kept them, as I think I am going to replace many of them anyway…

Nothing beats having a real book in your hands. For me, the electronic version isn’t quite the same…

cheers - Ryan

I’m overdue to check on my book collection stored at a friend’s house. Some concern there may be water damage to the bottom box so hopefully the real keepers are higher in the stack.

I like electronic versions of reference books and others intended more for randomly browsing/searching than front-to-back reading.