DT as a database of literary texts

Timotheus · June 28, 2005, 8:14pm

I intend to use DT, among other things, for the building of a large database of literary texts, in order to analyze both their content and their linguistic form. Is there among the users of DT anyone with some experience pertaining to a similar project, who perhaps could give some good advice?

howarth · June 28, 2005, 8:58pm

I’ve worked as a literary editor, so I’ll give what advice I can. These might be your initial steps:

collect literary editing resources, starting with this site at Emory University: web.library.emory.edu/subjects/h … index.html.
establish a copy-text standard: which online editions are most accurate and best reflect authorial intention?
put texts into a standard format for DTP to read. RTF is a minimum, XML if you want Web-publishable output.
input the texts and analyze with the tools DTP provides, using and adding scripts as needed.

It would be wise to start with a small body of texts and later build them into a larger database.

ChemBob · June 28, 2005, 9:57pm

This is an area with which I am relatively unfamiliar but I find very intriguing. What exactly are the tools that DTP provides that you use for these efforts, how do you use them and what is the final outcome upon having used them? What is the output and what does it tell you? What do you do with it?

ChemBob

howarth · June 29, 2005, 1:55am

I’d like to hear from Timotheus to learn about the scope of his project, what he wants to achieve, and what literary texts he wants to put into a database.

The historical period matters: many online resources are available for 1500-1930, but the last 75 years are not public domain, and before 1500 the online materials are scarcer.

I’ve not actually used DT in this manner, but I would say that a first stage would be experimenting with Search, Classify, See Also, and More, to see what kind of results appear.

Next, use the Concordance command to create an index of all words, and on any one example, Search to find the instances of that word or use Similar to get a graph of words most commonly appearing with that word.
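For readers unfamiliar with the idea, here is a minimal Python sketch of what a concordance and a "words appearing nearby" count amount to. It is only an illustration under my own assumptions, not how DT's Concordance or Similar commands are implemented; the function names (`tokenize`, `build_concordance`, `cooccurring`) and the window size are hypothetical.

```python
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and return its words (runs of letters, Unicode-aware)."""
    return re.findall(r"[^\W\d_]+", text.lower())

def build_concordance(tokens):
    """Map each word to the list of token positions where it occurs."""
    index = defaultdict(list)
    for pos, word in enumerate(tokens):
        index[word].append(pos)
    return index

def cooccurring(tokens, index, word, window=3):
    """Count the words found within `window` tokens of each occurrence of `word`."""
    counts = Counter()
    for pos in index.get(word, []):
        lo, hi = max(0, pos - window), pos + window + 1
        for neighbour in tokens[lo:pos] + tokens[pos + 1:hi]:
            counts[neighbour] += 1
    return counts

sample = "Nel mezzo del cammin di nostra vita mi ritrovai per una selva oscura"
tokens = tokenize(sample)
index = build_concordance(tokens)
neighbours = cooccurring(tokens, index, "selva", window=2)
```

Here `index["selva"]` lists the positions of the word, and `neighbours` counts the words within two tokens of it. On a real corpus the counts would only become meaningful over many occurrences.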

Several of the DT Scripts might also yield results, especially those in the folders named Comments, Data, and Smart Group. I’m not a scripter, but others in the forum might contribute suggestions, once we learn what T is after.

Finally, links could help make connections between various parts of the database. That’s all I can think of for now.

Timotheus · June 29, 2005, 11:06am

Well, my field of interest is not English literature (or literature in English), but classical Italian literature, both in Italian and in Latin. But for determining the strategy of the building of the database this shouldn’t make any difference.

Finding the texts themselves is not so difficult: there is much, very much material on the internet, if you know where to search, and there are CD-Roms with a rich choice of important literary works too. It must be said, however, that these CD-Roms are often devoid of search tools worth the name: usually, you can only read / search one text at a time, with very basic strategies. As research instruments, therefore, they are rather useless. And the CD-Roms with decent search tools I know of are all “Windows only”. And often very expensive.

No, my problems / questions are of a different kind. For instance:

what is the most suitable format (provided there is a choice) for import: TXT? RTF? PDF? Something else? And why?
is it better, in order to speed up the searching process, to break up a long text into many small documents, each containing a chapter, a paragraph, a canto, etc., or is it better not to do so?
which exactly are the Boolean operators DT puts at our disposal? And is there a way to enrich / refine them, should this prove desirable?
PDFs of old and rare texts are usually just photographic reproductions of old printed editions. In these editions, the form of certain letters is different from the modern form. Is there a way to teach DT to recognize these old letterforms, or is this only possible with OCR applications?
what is the best way to archive search results?

Provided the database is well organized and the Boolean operators are powerful enough, the object of the research can be almost anything: the semantic development of a word, syntactic structures, graphical conventions, flowers in Renaissance poetry, ideas about women in the Gothic novel, whatever you like.

howarth · June 29, 2005, 2:27pm

1. TXT, RTF, and PDF are all searchable, and the Concordance tool indexes them. They range in size from skinny (TXT) to fat (PDF), so an all-PDF collection would be hefty, and readability might be an issue (see answer 4). If you have offline documents, stored as URLs, use titles and comments so that Search will find them.
2. Search speed varies with the size of a database, but not the number of its units. I suggest making folders for major text divisions (volumes or chapters), and then having sub-sections (cantos or parts) as items within folders, each item containing several paragraphs or lines of text.
3. The DTP Tutorial explains how to set and limit searches, for All or Any words, and also Phrase or Wildcard (using * and ?). You may also search the entire database or just a selected group. It may be possible to enrich or refine searches through scripting; others may have ideas about that.
4. This would take some testing. An autograph manuscript would not index, though you could provide a transcript of it in a separate TXT or RTF document. A difficult early font (like “blackletter”) might be hard for DTP to read. You’d have to OCR it and translate it into a comparable Apple font.
5. Really good question. You may save found sets in DEVONagent, but not DTP, unless you export them as files and folders. You may duplicate, replicate, or group and ungroup found sets, and in the links, try Smart Group; others may have suggestions along these lines.
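If you do go with the folder-and-item layout suggested above, the splitting itself can be done before import. The following Python sketch cuts one long plain-text file into per-canto units. It assumes each canto starts with a heading line such as "CANTO I" on its own; that heading convention, the function name `split_into_units`, and the sample text are my assumptions, so adjust the pattern to whatever your source editions actually use.

```python
import re

def split_into_units(text, heading_pattern=r"^CANTO[ \t]+[IVXLC]+[ \t]*$"):
    """Split `text` at heading lines and return a list of (heading, body) pairs.

    Any text before the first heading is ignored.
    """
    heading = re.compile(heading_pattern, re.MULTILINE)
    matches = list(heading.finditer(text))
    units = []
    for i, m in enumerate(matches):
        # A unit runs from the end of its heading to the start of the next one.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        units.append((m.group().strip(), text[m.end():end].strip()))
    return units

sample = """CANTO I
Nel mezzo del cammin di nostra vita
mi ritrovai per una selva oscura

CANTO II
Lo giorno se n'andava, e l'aere bruno
"""
units = split_into_units(sample)
```

Each `(heading, body)` pair could then be written to its own TXT file, named after the heading, and imported as a separate item.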

Timotheus · June 29, 2005, 7:35pm

Thanks, howarth! Well, as far as I can see the standard Boolean operators of DTP are not particularly well suited for the kind of searching I have in mind. Wildcards like XXX*, *XXX (words that begin / end with XXX) and ? (which stands for any character) are applied only to (names of) documents, not to words within the documents. In other words: unless I’m missing something, for the kind of research I have in mind these wildcards are rather useless.
Moreover, it is not possible to define the maximum (or minimum) distance between two search terms, which undoubtedly is another important desideratum.
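To make the two desiderata concrete, here is a rough Python sketch of what I mean: wildcards applied to the words inside a text, combined with a maximum distance between two search terms. It works on plain text outside DT and says nothing about how DT or DEVONagent search internally; the function names `tokenize` and `near` are hypothetical, and the wildcard matching relies on the standard library's `fnmatch.fnmatchcase`.

```python
import re
from fnmatch import fnmatchcase

def tokenize(text):
    """Lowercase the text and return its words (runs of letters, Unicode-aware)."""
    return re.findall(r"[^\W\d_]+", text.lower())

def near(tokens, pattern_a, pattern_b, max_distance):
    """Return (i, j) position pairs where a word matching pattern_a and a word
    matching pattern_b lie at most max_distance tokens apart.

    Patterns use shell-style wildcards: * for any run of characters,
    ? for exactly one character.
    """
    hits_a = [i for i, t in enumerate(tokens) if fnmatchcase(t, pattern_a)]
    hits_b = [j for j, t in enumerate(tokens) if fnmatchcase(t, pattern_b)]
    return [(i, j) for i in hits_a for j in hits_b
            if i != j and abs(i - j) <= max_distance]

tokens = tokenize(
    "Nel mezzo del cammin di nostra vita mi ritrovai per una selva oscura"
)
matches = near(tokens, "sel*", "oscur?", max_distance=2)
```

Comparing every hit of one pattern with every hit of the other is quadratic, which is fine for a single text but would need an index for a large database.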

Bill_DeVille · June 29, 2005, 9:18pm

Timotheus:

Christian has noted that in a future version DT and/or DT Pro will have more comprehensive Boolean search operators, like those already present in DEVONagent.

Here’s a semi tongue-in-cheek suggestion, that may give additional logical capabilities for some of your purposes:

[1] Convert your DT Pro database to a Web site.

[2] Search/download that Web site using DEVONagent’s search capabilities. (Take a look at DA’s Help file for Formulating Queries: NOT, BEFORE, AFTER, NEAR, and other operators.)