Advice sought for setting up research project

I’m about to start a large writing project about a historical figure about whom much has been written. So I’m having some books and articles scanned and am trying to figure out the best way to set up my DTPO database. Here’s what I thought I’d do … maybe you can tell me if this makes the most sense.

I thought I’d set up folders for each book, and sub-folders for each chapter. Then, I’d like to assign portions of each chapter to various “Subject” folders. In other words, the portion of each book that deals with his birth goes into the Birth folder, the portion that deals with his teen years goes into the Teen folder, etc. This way, the same text will exist in more than one folder.

Does this make sense?

If so, I have a few questions:

  1. When I scan the books what are the PDF specs I should adhere to? How many dpi, etc.?

  2. In the structure I’ve outlined above, if the same text exists in more than one folder and I make a change to it, will that change be reflected in the other folder as well?

  3. One of my goals is to never separate text from some ID that tells me the source. Is there some way to automatically tag everything so that, even if I accidentally remove some text from the Book folder it comes from, I’ll always be able to identify the source?

Thanks in advance …

Organize your references and notes in ways that are efficient for you.

  1. Scans should be no less than 300 dpi for reasonable accuracy of OCR. Depending on the scanner, the fonts used in the copy material and their condition, you might go to 400 or 600 dpi.

Scan/OCR output should be saved as searchable PDFs, not text. Don’t expect perfect OCR accuracy. The advantage of the image layer of a PDF is that it is an accurate representation of the original, so it can serve as the authoritative copy if the text version contains OCR errors.

Once you have captured a searchable PDF into your database you may wish to make a rich or plain text copy. Select the PDF and choose Data > Convert > To plain (or rich) text. It’s much easier to chop up a text version of a book by chapter, sub-section, etc. than to split the PDF. The advantage of rich text is that you can create hyperlinks within your database, which isn’t possible with plain text.

  2. If you replicate chapters or other portions of text into different groups, any change you make to one replicant will be made to all other instances of that document. So if you plan to annotate the document, either in the body or in the Comment field, you would be better off duplicating rather than replicating extracts that are to be stored in more than one location.

  3. You will need to work out a scheme, as a document loses any identity with a group if it’s moved to a different group. My approach is to give an excerpt or comment note the same name as the source (“Reference Name”), then extend the name, e.g. “Reference Name - Chapter 1”, or “Reference Name - early life pp. 32-58”. The advantage of this approach is that one can select the “base” part of the name, “Reference Name” and do a Lookup. This will result in pulling together in the search results all portions of extracts and notes related to a particular source document – that’s often useful.
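For anyone who finds it easier to see the naming scheme spelled out, here is a rough sketch in Python. It only models the convention (source name first, detail after); it does not talk to DEVONthink, and the function names `excerpt_name` and `lookup` are made up for illustration.

```python
# Hypothetical helpers: they model the naming convention only and
# do not call DEVONthink or reproduce its Lookup feature.

def excerpt_name(reference_name, detail):
    """Build an excerpt name that always begins with its source's name."""
    return f"{reference_name} - {detail}"


def lookup(names, reference_name):
    """Return every name that starts with the source's base name."""
    return [name for name in names if name.startswith(reference_name)]


notes = [
    excerpt_name("Reference Name", "Chapter 1"),
    excerpt_name("Reference Name", "early life pp. 32-58"),
    excerpt_name("Other Source", "Chapter 4"),
]

# Only the two excerpts taken from "Reference Name" are kept.
matches = lookup(notes, "Reference Name")
```

Because the source name travels in the document's own name, it survives a move to any group, which is the point of the scheme.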

Another, often complementary, approach is to use hyperlinks in rich text documents. So in an excerpt or commentary note I can link to the source document, perhaps reminding myself in the note that my comment refers to paragraph 3 on page 154. (That’s useful fodder for footnotes or endnotes in the finished project.) Linking can also come in handy when you wish to relate an excerpt/note to multiple references.

And of course there’s the simple-minded (sometimes a good thing) approach of noting in the text document or in its Comment field which group it belongs to. :)

Great advice, Bill. Many thanks.

By the way, is there an effective limit to the size of a database? Is there a point - whether number of items, or overall size - at which things slow down appreciably? Any other potential pitfalls I should worry about?

Another simple-minded solution for working with a lot of material is to use DT for the “capture” stage and build a shorter, more flexible outline for each chapter in a different system. In addition to capturing in DT, I use index cards to prepare each chapter, and sometimes also mindmaps (in FreeMind).

But for me, the index card approach has worked best so far: As I browse my database, I write each idea for a chapter on an individual card. Just a few words to remind me. Then, I reorder the cards, and put them next to the computer. Writing the first draft is just following the ideas on the cards, looking up the information in DT as needed and, well, writing. This way, I can just rearrange the structure on the go, by reshuffling a few cards. I really love to work on my computer, but desk space is still cheaper than screen space, and I can look at the cards without switching applications. Just my 2 cents…