Large text library

Ward_Smith · August 25, 2005, 6:06pm

Hi –
I have a pretty large collection of texts that I want to analyse. I have them stored in Devonthink Pro, and they are basically html pages downloaded from the web.

I want to preserve the structure of the html pages – table of contents linking to pages and named anchors etc.

The problems I’m having are of two types: the idea of categorization, and the meanings of “Import Site,” “Capture” and “Capture as webarchive.”

CAPTURING, ARCHIVES, AND IMPORT SITE

What are the differences between these?

I import a site and it creates an “archive” which more or less mimics the structure of the site being captured. OK so far.

If I’m offline and click the back button on an article – which should take me to the archived table of contents page – I get a message that I’m offline. This tells me, as you know, that the page has not been captured into the database but is still referencing the original source. If I’m online, it opens the source page in Safari.

I then “capture” the page and delete the original link. If I do this to all the pages, it works OK.

The problem I have here is that it is A LOT of text, maybe 300 or so files. As far as I can see, it is impossible to automate the “Import-Capture-Delete Original” process, either in Automator or AppleScript.

If I ctrl-click in the list view, I cannot capture a group of pages at once; however, if I call up an individual page and ctrl-click on the page itself, I see the capture option available. Shouldn’t the same options be available in both of these situations?

So, it is quite a laborious process to have an internally referenced archive of the site.
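To make concrete what I mean by an “internally referenced” archive, here is a minimal Python sketch of the link rewriting I’d like to happen on import. It is not how DevonThink works internally and uses no DevonThink API; `localize_links`, `local_names` and the example.com URLs are all hypothetical names I made up for illustration. It only rewrites `href` values that point at pages I already have a local copy of, and keeps named anchors intact.

```python
import re
from urllib.parse import urljoin, urldefrag


def localize_links(html, page_url, local_names):
    """Rewrite href values that point at archived pages to local file names.

    local_names (hypothetical) maps an absolute URL without its fragment
    to the file name of the local copy. Links to pages that are not in
    the map are left untouched, so they still point at the live site.
    """
    def swap(match):
        quote, target = match.group(1), match.group(2)
        # Resolve relative links against the page they appear on,
        # then split off any named anchor (#fragment).
        absolute, fragment = urldefrag(urljoin(page_url, target))
        local = local_names.get(absolute)
        if local is None:
            return match.group(0)
        if fragment:
            local += "#" + fragment
        return "href=" + quote + local + quote

    # Simple pattern for quoted href attributes; a real tool would
    # probably want a proper HTML parser instead.
    return re.sub(r'href=(["\'])(.*?)\1', swap, html)


if __name__ == "__main__":
    page = '<a href="http://example.com/texts/index.html#ch1">Contents</a>'
    names = {"http://example.com/texts/index.html": "index.html"}
    print(localize_links(page, "http://example.com/texts/ch1.html", names))
```

With something like this run over all 300 or so files, the back link on an article would go to the archived table of contents rather than to the original site.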

CAPTURE/IMPORT/ARCHIVE: A SOLUTION?
It seems like it would be feasible, and a good idea, to have an option in the Import Site dialog box where you could set it to “Capture pages on import.” This would download a captured, internally referenced version of each page and eliminate a whole lot of work. I really don’t think this is an obscure request; a lot of people could benefit.

It would also help to make ctrl-clicking on an item in the list view and in the page view more consistent. Also, the terminology in DevonThink and DevonAgent is different. Does “archive” mean something different in DT than in DA?

SECOND PROBLEM: CATEGORIZATION

In this particular instance, since I am archiving a site, I don’t necessarily want to change the structure of it. However, I do want to categorize it!

As I understand it, there is only one classification that DevonThink can automatically place an item in.

Additionally, you can replicate or duplicate an item into a new group, creating multiple classifications for an item.

This tree/folder view, though, is pretty cumbersome for documents with multiple classifications. DT seriously needs metadata! I would really like to just have a tag cloud for my pages. Comments are far too cumbersome and limited to work with.
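Here is a rough Python sketch of the kind of tagging I have in mind, where one document carries several tags without being replicated into several groups. This is purely an illustration and not a DevonThink feature; `TagIndex` and its methods, and the file names, are names I invented for the example.

```python
from collections import defaultdict


class TagIndex:
    """Hypothetical tag store: each tag maps to the set of documents carrying it."""

    def __init__(self):
        self._docs_by_tag = defaultdict(set)

    def tag(self, doc, *tags):
        # Tags are case-insensitive; a document may carry any number of them.
        for t in tags:
            self._docs_by_tag[t.lower()].add(doc)

    def find(self, *tags):
        # Documents that carry ALL of the given tags.
        sets = [self._docs_by_tag.get(t.lower(), set()) for t in tags]
        return set.intersection(*sets) if sets else set()

    def cloud(self):
        # Tag -> number of documents, i.e. the weights for a tag cloud.
        return {t: len(docs) for t, docs in self._docs_by_tag.items()}


if __name__ == "__main__":
    index = TagIndex()
    index.tag("chapter1.html", "history", "Primary-Source")
    index.tag("chapter2.html", "history")
    print(sorted(index.find("history", "primary-source")))
    print(index.cloud())
```

The point is that the site structure stays exactly as imported, while the classification lives alongside it instead of in the folder tree.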

LAST THOUGHTS/SUGGESTIONS:

Why can’t you create a new browser in DevonThink? When I am collecting and archiving news clippings, I have to browse in DevonAgent, archive, then import to DT. This seems unnecessarily laborious.

CONCLUDING THOUGHTS:

I really like the basic idea of DT. I’ve tried just about everything! Tinderbox has some great ideas, but can’t handle big files well. Other programs have similar problems. Everybody seems to be missing a key piece of the puzzle, and there is no transparent way to make them work together.

I wish DT/DA had an Inspiration/OmniGraffle-like brainstorming and outlining component.