MS Word sections and imports

A topic that often comes up with DT is that while notes, clippings and even web pages are a good size for indexing and classification, whole documents typically are not. What’s needed is something that will break a big doc into 500-word chunks in ‘sensible’ places and import them separately.
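To make the idea concrete, here is a rough Python sketch of that kind of chunking. It assumes the document is available as plain text with paragraphs separated by blank lines, and it treats paragraph breaks as the ‘sensible’ places. The function name `chunk_paragraphs` and the `target_words` parameter are made up for illustration; nothing here talks to DT itself.

```python
def chunk_paragraphs(text, target_words=500):
    """Group blank-line-separated paragraphs into chunks of roughly target_words."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, count = [], [], 0
    for para in paragraphs:
        n = len(para.split())
        # Close the current chunk if this paragraph would push it past the target.
        if current and count + n > target_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    # Keep whatever is left over as the final chunk.
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Paragraphs are never split, so a single paragraph longer than the target simply becomes its own oversized chunk.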

I had a 15,000-word document in Word that I wanted to do this with, and the closest I came was using the Outline mode in Word. If the document is structured with headings and sub-headings, this can be a good start; it’s still a rather manual process, but it was worth it for me.

This is what I did:

  • Created a DT group for the imports & opened it in its own window
  • Selected outline mode in Word and collapsed it to show only level-1 headings
  • For each heading, I would:
      • double-click the little ‘+’ icon to expand the whole thing. This also selected all the text under that heading
      • drag the selected text into the DT window
      • close the heading and move on to the next one

I then had a DT folder with a separate entry for each top-level section in my original doc.

Unfortunately, DT seemed a bit buggy here: the text didn’t always get indexed as it was dragged in. I found two ways to solve this: exporting and re-importing the data, or double-clicking each section to open it in an edit window, making some small change and then saving. I wanted to do a little tidying anyway, so this was not too great a chore. Is there any other way to force a re-index?

If I can find some time to do some scripting at some point I’ll try and automate my process, unless anyone knows of other good solutions!
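As a starting point for that kind of script, here is a minimal Python sketch that splits a document at its top-level headings and writes one text file per section, ready to be imported. It rests on some loud assumptions: the Word document has first been saved as plain text, and each level-1 heading line has been marked with a `# ` prefix (Word does not do this by itself). The names `split_by_heading` and `write_sections` are hypothetical, and the script does not drive Word or DT.

```python
from pathlib import Path


def split_by_heading(text, marker="# "):
    """Return (title, body) pairs, one per heading line starting with marker."""
    sections = []
    title, body = "Preamble", []  # text before the first heading, if any
    for line in text.splitlines():
        if line.startswith(marker):
            # Save the section just finished, skipping an empty preamble.
            if title != "Preamble" or any(l.strip() for l in body):
                sections.append((title, "\n".join(body).strip()))
            title, body = line[len(marker):].strip(), []
        else:
            body.append(line)
    if title != "Preamble" or any(l.strip() for l in body):
        sections.append((title, "\n".join(body).strip()))
    return sections


def write_sections(sections, out_dir):
    """Write each section to its own numbered .txt file and return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, (title, body) in enumerate(sections, start=1):
        # Replace characters that are awkward in file names.
        safe = "".join(c if c.isalnum() or c in " -_" else "_" for c in title)
        path = out / f"{i:02d} {safe.strip() or 'section'}.txt"
        path.write_text(body + "\n", encoding="utf-8")
        paths.append(path)
    return paths
```

The numbered prefix keeps the files in document order, so the resulting folder mirrors the original outline when sorted by name.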

This also makes me think that sometimes it might be worth splitting PDF files into separate pages before importing them. Yes, the boundaries would be arbitrary, but the smaller chunks might still be useful. Should be easy to do with PDF Services…

Thanks for those tips.

I’m finding that text in a Rich Note that is a link (i.e. underlined) isn’t always getting indexed. Any ideas how to force it to get indexed? Non-linked text in the same document is indexed, however. Weird.