jostwald here. People use DT many different ways. I would consider myself an extreme user who tries to wring every possible feature out of it, to replicate the relational database I’d created for note-taking 20 years ago. Whenever I come across a new source I throw it into its own hierarchical tag (primary source-newspapers-Daily Courant) and know exactly where it is if I need to look up by source (vs. by topic), plus it’s useful for browsing the tag hierarchy. If the source is full-text it will be searchable right away. Otherwise, when I have time, I can go through and take notes/transcribe the image PDFs in separate RTF files, add groups, etc.
My main database is 111 GB, 86 million words (some of them are French, though I have other databases with other languages too) and about 100,000 files, two-thirds RTFs and one-third PDFs (many of them image-only, including thousands of archival scans). This includes ~10,000 newspaper issues (English, c. 1700-1714) as single issue PDFs. Unfortunately all the newspaper files are image only (too poor quality to OCR), so I wish I had DanDT’s problem.
How much effort you put into grouping/tagging/filing depends on:
- how big of a project you have/how long your database will be used
- how granular your analysis needs to be (newspapers are very granular: I’d guess the usual level of analysis is the article, and there are dozens of articles on numerous topics in a single issue and hundreds of issues per year),
- how perfectionistic you are,
- how much time you have to spend on it.
As an academic historian I find myself on the extreme end of all of those spectra: my DTPO database will support several multi-year projects (a few books, several journal articles), and I need extreme granularity, e.g. I have tens of thousands of letters and each one-page letter might talk about 4 totally distinct subjects that need to be separated (about domestic politics, about logistics, about family affairs…).
So I process my documents more than others might, because 1) historians need to wallow in their original sources, and 2) I want to find and analyze together documents with multiple criteria (e.g. find all the documents written by Marlborough before 1705, with “Dutch” within 10 words of “suck”, and sort chronologically).
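To make that kind of multi-criteria query concrete, here is a minimal sketch in Python. It is not how DT does it internally and it doesn't touch DT at all: the `Note` record, its field names, and the `find_notes` helper are all made up for illustration. It just shows the logic of "filter by author and date, require two words within N words of each other, then sort chronologically".

```python
import re
from dataclasses import dataclass
from datetime import date


@dataclass
class Note:
    """Hypothetical stand-in for one document plus its metadata."""
    author: str
    written: date
    text: str


def within_n_words(text: str, a: str, b: str, n: int = 10) -> bool:
    """True if some occurrence of word a is at most n word positions from word b."""
    words = re.findall(r"[\w']+", text.lower())
    pos_a = [i for i, w in enumerate(words) if w == a.lower()]
    pos_b = [i for i, w in enumerate(words) if w == b.lower()]
    return any(abs(i - j) <= n for i in pos_a for j in pos_b)


def find_notes(notes, author, before, a, b, n=10):
    """Filter by author, date cutoff and word proximity, then sort by date."""
    hits = [
        note for note in notes
        if note.author == author
        and note.written < before
        and within_n_words(note.text, a, b, n)
    ]
    return sorted(hits, key=lambda note: note.written)
```

Calling `find_notes(notes, "Marlborough", date(1705, 1, 1), "Dutch", "suck")` on a list of such records would return the matching ones, oldest first. The point is that each criterion needs its own clean piece of data (author, date, text), which is why I keep going on about where to store the metadata.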
If DanDT only needs the newspapers to look up words (places, people, events) with the search feature or sort by date (in Spotlight Comments, a group/tag setup, or in the metadata fields), you probably don’t need to develop a sophisticated metadata/grouping/tagging system.
In fact, my pristine newspaper files are almost exactly like DanDT’s current setup - chrono tags with boring names like “1705.04.06 Flying Post”.
But I take notes in separate RTF documents, and those go in a topical group (actually I have them as replicants in both).
So if you want to do what I want to do, which includes using the AI to find similar newspaper articles on the same topic, and sorting a group/search list by one or more pieces of metadata (date, section of paper, author, byline…), then you should develop a more robust system and try to parse out the different bits of info among the various possibilities: file name, tags, groups, Spotlight Comments, other metadata fields, and the content of the file.
Most important, I’d recommend DanDT split up files into individual newspaper articles if he’s not already - do that as you are ‘clipping’ from the originals. That way an article on a local business fire won’t have its text combined with a story on the annual Easter Egg Hunt or local valedictorians - mixing unrelated stories in one file will make the AI less effective. As was mentioned already, you can combine parts a-c together (merge the PDFs or combine the texts together into a single RTF file). At the least, split up the pages so similar types of stories are in their own files - separate sections of the paper, distinguish ads from articles, etc. If you have all the articles separated out in individual files, you could then put them directly into topical groups, or use Classify or AutoClassify to do that for you.
Then comes the question of how to add additional metadata. Since article titles serve pretty well as a summary of their content, I’d probably start by naming each file after the article’s title and, if possible, its author. Admittedly a lot of work - an alternative could be to use AutoClassify and see how the AI groups all your articles.
Mine is a bit more complicated: since I only have image PDFs, I have a script that creates a linked RTF document for notes/transcription based off the original image PDF, links back to the original, copies the citation info from the title of the original to the Spotlight Comment of the new RTF, and I then summarize the content in that file’s title. For now (till DT adds more metadata features - please!), I put the citation info in Spotlight Comments (since that’s the name of the file originally), e.g. “1705.04.03 Flying Courant” and then I add p8 at the end for that particular note. That way, as DanDT notes, in a group or smart group or search list I can sort by that column to see them in chronological order. Ideally there’d also be separate batch-editable columns to record other metadata like author, source, date of article (not the same as date of newspaper issue), etc., so you can sort by chrono and also by author within that.
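If you ever want to work with those citation strings outside DT, here is a rough sketch in Python of how a comment in the “YYYY.MM.DD Source p#” pattern could be split into its parts. This is not my actual script (that one lives inside DT); the `Citation` type, `parse_citation` and `sort_key` are names I made up for the example, and it assumes the page suffix is always a lowercase p followed by digits.

```python
import re
from datetime import date
from typing import NamedTuple, Optional


class Citation(NamedTuple):
    """Hypothetical container for the parts of one citation comment."""
    issue_date: date
    source: str
    page: Optional[int]


# Assumed pattern: "1705.04.03 Flying Courant p8" (the page suffix is optional).
CITATION_RE = re.compile(r"^(\d{4})\.(\d{2})\.(\d{2})\s+(.+?)(?:\s+p(\d+))?$")


def parse_citation(comment: str) -> Citation:
    """Split a citation comment into issue date, source name and page number."""
    match = CITATION_RE.match(comment.strip())
    if match is None:
        raise ValueError(f"not in 'YYYY.MM.DD Source p#' form: {comment!r}")
    year, month, day, source, page = match.groups()
    return Citation(
        date(int(year), int(month), int(day)),
        source,
        int(page) if page else None,
    )


def sort_key(citation: Citation):
    """Order by issue date, then source, then numeric page (missing page first)."""
    page = citation.page if citation.page is not None else 0
    return (citation.issue_date, citation.source, page)
```

Sorting the raw strings already gives chronological order because the date is written year-first, but a plain text sort puts p10 before p8. Sorting with `key=lambda s: sort_key(parse_citation(s))` avoids that, and once the parts are separate you can sort by source or page on their own as well.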
If you have the provenance/citation info in the Spotlight Comment, you wouldn’t need to use the tags for that purpose, but I like a separate place in tags for the pristine original, as well as whatever transcripts/notes I take.
Including the date and page info in the file title is admittedly safer (I have the drive space so I import everything instead of just indexing), but depending on the size of your screen, you may need to resize columns in order to see the distinguishing info at the end (e.g. your John Doe obit part at the very end of the title). But if you split the citation info into a separate field (e.g. the Spotlight Comment) you can see both parts relatively easily in the columns. Plus, you could even begin all your file titles with standardized info that would allow you to sort your group/search lists by either column, rather than just one. (This is a reason why DT could really use more metadata columns to parse out all the info - for sorting by multiple criteria).
[FWIW, in my system, putting the provenance info in the Spotlight Comment could conceivably replace using the tags as provenance, but I want a place to browse the originals. Plus I don’t want the original source files cluttering up the topical groups, and the full-text originals would mess up the group AI since they cover many different topics.]
So you can create tags/groups for all sorts of things, but think about how you’ll be processing the results. You might, for example, want some of that metadata in a column that you can sort by. Do you want Obit as the first word in the file name, or maybe Obit should be a tag or group? If it’s a group, you can use the AI to find other obits. But if you only note the obits by file name, you can’t easily find others without going through every one (this is a little bit more efficient using See Also, but I still prefer to send them to permanent groups). That’s the power of AI.
Note that in korm’s example where he has chronological groups, you can use “See Also” to find documents similar to any document within any of those groups, but that doesn’t strike me as the most important use for groups. I’d think your results from Classify won’t make much sense, unless there’s some fundamental distinction between reporting for 1881 vs. 1885 (and you could get that by sorting your search results chronologically and looking for patterns). I’d assume you want to analyze things not by date, but by type of article, or something more substantive. His example of topical groups is what you should be using (you said you’re interested in themes after all), since groups are the only place where you can manually define relationships between documents, and those are permanently defined. But that leaves the question of where to put the provenance/citation info. I think Spotlight Comments and tags provide the most flexibility, but not every database needs that flexibility.
[Sidenote: I’ve always wondered how putting the same document within multiple, contradictory (or overlapping) grouping schemes influences the AI results. Does the AI get confused when document A is in the chronological 1881 group with documents B and C, and document A is also in the topical Obituaries group with documents D and E, when otherwise B/C and D/E really have nothing in common? Does the AI pick up on different word patterns, and are these meaningful associations? For example, I have narratives of an army’s movements in Italy, and another of an army’s movements in France. The place and person names are the most distinctive, but will a group on army movements (vs. other aspects of war) get confused with all those proper nouns? In any case, it seems like See Also wouldn’t be able to figure out that more general distinction of army movements, compared with the more obvious distinction between Italian and French place names. I tend to use tags for people particularly, since they could be in different countries from one year to the next.]
But back to the main point: Figure out what’s most important to you - analyzing the papers by date? by section of the newspaper? by type of article? by text within? What exactly do you want to do with the results?