How to organize database for archives of local newspaper?

DanDT · March 16, 2015, 7:04pm

I started a project to enter all available back issues of my town’s local newspaper into a DT database and would appreciate your advice as to the best way to do this.

Our newspaper is now available online and the service I am using allows us to clip articles (or even the whole page) and download them as a .pdf file. Then I perform OCR on the .pdf file and import it to DevonThink. (I cannot import a whole page at a time since the resulting image is too small for accurate OCR results.) So what I have been doing is clipping each column one third at a time, meaning each column has three parts: “a” the top, “b” the middle and “c” the end of the column. This results in pretty accurate OCR.

I have been organizing the clippings this way:

I am making a group for each year from 1881 - 1970 and naming the clippings using the scheme shown above in the screenshot. However, after reading this post in Skulking In Holes and Corners, I think I am grouping things incorrectly:

https://jostwald.wordpress.com/2013/07/21/organizing-with-devonthink/

Here is a quote from that blog post:

Most historical users of DT (at least those that blog) use Groups to record provenance, mimicking the file folder hierarchy where we used to store our documents. This has been indirectly encouraged by the DT developers, who repeat the “whatever works for you” line (you know how much I hate that), and who insist that Tags and Groups are essentially the same thing. DT may indeed treat them similarly behind the scenes, and they can both be organized into multi-level hierarchies.

But there’s one huge difference between the two: the Classify AI only works on Groups, not Tags. That means that we should organize our information by whichever Groups we want the AI’s help with. So we should make our Groups thematic if we want the AI to help us organize our documents thematically. But wait – don’t we naturally think of tags as being topical? Yes, and therein lies the problem. To get the most from DT, you have to upend the conventional thinking about groups and tags: Groups are for topics, Tags are for provenance. Not the other way around.

If I understand what he is saying, then I should be:

using groups in a topical manner. By that I mean each group should look something like this: Local news, National News, International News, Gossip, Obituaries, Prominent Citizens, Local Industry, Events. Those groups could have subgroups that then get more specific such as Coal Mining, Local Businesses etc.
using tags to indicate the year and date of publication.
appending keywords to the .pdf file name. An example file name might be 1881_05_23_pg4_column_1_Obituary_John_Doe
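A minimal sketch of how a naming scheme like the example above could be built and read back consistently. The function and class names are hypothetical, and the pattern assumes every clipping follows the date_page_column_keywords order of the example; none of this is part of DevonThink itself.

```python
import re
from dataclasses import dataclass
from datetime import date

# Assumed pattern for names like 1881_05_23_pg4_column_1_Obituary_John_Doe
NAME_RE = re.compile(
    r"^(?P<y>\d{4})_(?P<m>\d{2})_(?P<d>\d{2})"
    r"_pg(?P<page>\d+)_column_(?P<col>\d+)"
    r"(?:_(?P<keywords>.+))?$"
)


@dataclass(frozen=True)
class Clipping:
    published: date
    page: int
    column: int
    keywords: tuple


def build_name(published, page, column, keywords=()):
    """Build a file name (without extension) in the assumed scheme."""
    parts = [published.strftime("%Y_%m_%d"), f"pg{page}", "column", str(column)]
    # Spaces inside a keyword become underscores, as in "John_Doe".
    parts.extend(k.replace(" ", "_") for k in keywords)
    return "_".join(parts)


def parse_name(name):
    """Split a file name back into its parts; raise ValueError if it does not fit."""
    match = NAME_RE.match(name)
    if match is None:
        raise ValueError(f"name does not follow the scheme: {name!r}")
    raw_keywords = match["keywords"]
    # Underscores are ambiguous, so "John_Doe" comes back as two keywords.
    keywords = tuple(raw_keywords.split("_")) if raw_keywords else ()
    return Clipping(
        published=date(int(match["y"]), int(match["m"]), int(match["d"])),
        page=int(match["page"]),
        column=int(match["col"]),
        keywords=keywords,
    )
```

Because the date comes first and is zero-padded, sorting by name also sorts clippings chronologically, then by page and column (for single-digit page and column numbers).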

So, should I change my method of grouping, tagging and file naming? Any help would be greatly appreciated. I just want to make sure that the organizational methods I use fully utilize DevonThink’s AI capabilities.

korm · March 16, 2015, 9:07pm

I believe you can accomplish both things without using standard Tags. If your groups are enabled for tagging (see File > Database Properties for your working database) you can put your PDFs into a date hierarchy and tag them with the names of ordinary groups. Doing so automatically replicates the PDF to the “Topic” hierarchy. No standard Tags are created. See the image.

The AI will then operate against everything in the Year hierarchy and everything in the Topics hierarchy. My example is simplistic, of course; yours is more sophisticated.

(BTW, you can merge *a, *b, *c with DEVONthink’s Merge command in the contextual menu – though you might not want to do that.)

Bill_DeVille · March 17, 2015, 12:46am

I’ll stick with “use what works for you”.

Drag a tag outside the Tags group and it becomes a group. Uncheck the option to remove groups from tagging, and groups become tags (but not included in the Tags group).

Groups don’t necessarily need to be topical. Tags can be used for many purposes other than provenance.

The vast majority of the groups in my main research database are topical. That is, their contents each contain documents that contain a vocabulary and associations of words that represent a scientific or technical discipline, or tend to be about similar subject matter using similar vocabulary and associations of words. Classify is very useful.

The groups in my financial database are not topically based. Rather, the groups are based on characteristics of documents, such as contracts, invoices, receipts, statements, tax forms &c. – and contain subgroups by calendar year. The contents of each group tend to vary widely in terminology and associations of terms, so that Classify will not be useful. On the other hand, I don’t need Classify, as I know where the documents are to fit in that structure. Often, I’ll use a convention in naming documents that allows sorts, such as VendorName and Date (expressed as YYYYMMDD). I might use tags to distinguish costs by project.

DanDT · March 17, 2015, 2:35am

Thanks for your replies. I am so new to DevonThink that I don’t understand the terms “Replicant”, “Classify” and “See Also”, or quite know the difference between categories and tags yet. I am going through the manual though, and as I learn I will come back to post a few more questions. (I don’t want you to think I am ignoring your replies.) I guess I need to get up to speed with DevonThink to understand what questions I should ask!

In my previous post I forgot to include this screenshot:

The top part of the image shows search results that Skulking gets when he searches for “bribery” . Below that are my results when I search for “street”. When I compare the results, his are much more descriptive. My search results just show dates and column position which is much less helpful.

So as a first step, I think I need to make the file name more descriptive (although this takes much more work). I guess my file names should include the date, page and column position to make it easy to browse a newspaper page in order (a page usually includes around 15 individual files), but I will also append a few keywords at the end of the file name.

Under my current system none of the search results are descriptive of what the newspaper article actually contains. From what I understand, DevonThink tags aren’t displayed in the search result box whereas categories are displayed. So I guess I must write a much more descriptive file name for the search results to be useful at a glance. (And I am guessing that I almost have to treat categories as tags for them to appear in the search results?)

This is my thinking at the moment. I am sure things will change as I become more familiar with DevonThink. I just want to be careful not to get too far into the project before I realize there is a better way and I have to go back and change everything.

ostwaldj · March 17, 2015, 4:13pm

jostwald here. People use DT many different ways. I would consider myself an extreme user who tries to wring every possible feature out of it, to replicate the relational database I’d created for note-taking 20 years ago. Whenever I come across a new source I throw it into its own hierarchical tag (primary source-newspapers-Daily Courant) and know exactly where it is if I need to look up by source (vs. by topic), plus it’s useful for browsing the tag hierarchy. If the source is full-text it will be searchable right away. Otherwise, when I have time, I can go through and take notes/transcribe the image PDFs in separate RTF files, add groups, etc.

My main database is 111 GB, 86 million words (some of them are French, though I have other databases with other languages too) and about 100,000 files, two-thirds RTFs and one-third PDFs (many of them image-only, including thousands of archival scans). This includes ~10,000 newspaper issues (English, c. 1700-1714) as single issue PDFs. Unfortunately all the newspaper files are image only (too poor quality to OCR), so I wish I had DanDT’s problem.

How much effort you put into grouping/tagging/filing depends on:

how big of a project you have/how long your database will be used
how granular your analysis needs to be (newspapers are very granular since I’d guess the usual level of analysis is at the article level and there are dozens of articles on numerous topics in a single issue and there are hundreds of issues per year),
how perfectionistic you are,
how much time you have to spend on it.

As an academic historian I find myself on the extreme end of all of those spectra: my DTPO database will support several multi-year projects (a few books, several journal articles), and I need extreme granularity, e.g. I have tens of thousands of letters and each one-page letter might talk about 4 totally distinct subjects that need to be separated (about domestic politics, about logistics, about family affairs…).
So I process my documents more than others might, because 1) historians need to wallow in their original sources, and 2) I want to find and analyze together documents with multiple criteria (e.g. find all the documents written by Marlborough before 1705, with “Dutch” within 10 words of “suck”, and sort chronologically).

If DanDT only needs the newspapers to look up words (places, people, events) with the search feature or sort by date (in Spotlight Comments, a group/tag setup, or in the metadata fields), you probably don’t need to develop a sophisticated metadata/grouping/tagging system.
In fact, my pristine newspaper files are almost exactly like DanDT’s current setup - chrono tags with boring names like “1705.04.06 Flying Post”.
But I take notes in separate RTF documents, and those go in a topical group (actually I have them as replicants in both).

So if you want to do what I want to do, which includes using the AI to find similar newspaper articles on the same topic, and sort a group/search list by one or more pieces of metadata (date, section of paper, author, byline…), then you should develop a more robust system and try to parse out the different bits of info among the various possibilities: file name, tags, groups, Spotlight Comments, other metadata fields, and the content of file.

Most important, I’d recommend DanDT split up files into individual newspaper articles if he’s not already - do that as you are ‘clipping’ from the originals. That way an article on a local business fire won’t have its text combined with a story on the annual Easter Egg Hunt or local valedictorians - that will make the AI less effective. As was mentioned already, you can combine parts a-c together (merge the PDFs or combine the texts together into a single RTF file). At the least, split up the pages so similar types of stories are in their own files - separate sections of the paper, distinguish ads from articles, etc. If you have all the articles separated out in individual files, you could then put them directly into topical groups, or use Classify or AutoClassify to do that for you.

Then comes the question of how to add additional metadata. Since article names serve pretty well as a summary of their content, I’d probably start by naming each file the name of the article and its author if possible. Admittedly a lot of work - an alternative could be to use AutoClassify and see how the AI groups all your articles.
Mine is a bit more complicated: since I only have image PDFs, I have a script that creates a linked RTF document for notes/transcription based off the original image PDF, links back to the original, copies the citation info from the title of the original to the Spotlight Comment of the new RTF, and I then summarize the content in that file’s title. For now (till DT adds more metadata features - please!), I put the citation info in Spotlight Comments (since that’s the name of the file originally), e.g. “1705.04.03 Flying Courant” and then I add p8 at the end for that particular note. That way, as DanDT notes, in a group or smart group or search list I can sort by that column to see them in chronological order. Ideally there’d also be separate batch-editable columns to record other metadata like author, source, date of article (not the same as date of newspaper issue), etc., so you can sort by chrono and also by author within that.
If you have the provenance/citation info in the Spotlight Comment, you wouldn’t need to use the tags for that purpose, but I like a separate place in tags for the pristine original, as well as whatever transcripts/notes I take.

Including the date and page info in the file title is admittedly safer (I have the drive space so I import everything instead of just indexing), but depending on the size of your screen, you may need to resize columns in order to see the distinguishing info at the end (e.g. your John Doe obit part at the very end of the title). But if you split the citation info into a separate field (e.g. the Spotlight Comment) you can see both parts relatively easily in the columns. Plus, you could even begin all your file titles with standardized info that would allow you to search your group/search lists by either column, rather than just one. (This is a reason why DT could really use more metadata columns to parse out all the info - for sorting by multiple criteria.)

[FWIW, in my system, putting the provenance info in the Spotlight Comment could conceivably replace using the tags as provenance, but I want a place to browse the originals. Plus I don’t want the original source files cluttering up the topical groups, and the full-text originals would mess up the group AI since they cover many different topics.]

So you can create tags/groups for all sorts of things, but think about how you’ll be processing the results. You might, for example, want some of that metadata in a column that you can sort by. Do you want Obit as the first word in the file name, or maybe Obit should be a tag or group? If it’s a group, you can use the AI to find other obits. But if you only note the obits by file name, you can’t easily find others without going through every one (this is a little bit more efficient using See Also, but I still prefer to send them to permanent groups). That’s the power of AI.

Note that in korm’s example where he has chronological groups, you can use “See Also” to find documents similar to any document within any of those groups, but that doesn’t strike me as the most important use for groups. I’d think your results from Classify won’t make much sense, unless there’s some fundamental distinction between reporting for 1881 vs. 1885 (and you could get that by sorting your search results chronologically and looking for patterns). I’d assume you want to analyze things not by date, but by type of article, or something more substantive. His example of topical groups is what you should be using (you said you’re interested in themes after all), since groups are the only place where you can manually define relationships between documents, and those are permanently defined. But that leaves the question of where to put the provenance/citation info. I think Spotlight Comments and tags provide the most flexibility, but not every database needs that flexibility.
[Sidenote: I’ve always wondered how putting the same document within multiple, contradictory (or overlapping) grouping schemes influences the AI results. Does the AI get confused when document A is in the chronological 1881 group with documents B and C, and document A is also in the topical Obituaries group with documents D and E, when otherwise B/C and D/E really have nothing in common? Does the AI pick up on different word patterns, and are these meaningful associations? For example, I have narratives of an army’s movements in Italy, and another of an army’s movements in France. The place and person names are the most distinctive, but will a group on army movements (vs. other aspects of war) get confused with all those proper nouns? In any case, it seems like See Also wouldn’t be able to figure out that more general distinction of army movements, compared with the more obvious distinction between Italian and French place names? I tend to use tags for people particularly, since they could be in different countries from one year to the next.]

But back to the main point: Figure out what’s most important to you - analyzing the papers by date? by section of the newspaper? by type of article? by text within? What exactly do you want to do with the results?

ostwaldj · March 17, 2015, 5:16pm

DanDT’s second post prompts me to remind everyone that you don’t need to tailor your structure to be most useful in the Advanced Search window. There are different ways to “search”.
I tend to use the formal Search features for things that I haven’t already classified into groups (when I don’t know what I’ll find or I’m creating a new group and want to find docs to populate it), or when I’m searching by multiple criteria. I’ll add tags to new documents I discover in the search window (Tag Bar visible) as I go along.

On the other hand, I “search” (small s) for topics I know I’ve already set up by simply browsing straight to the tag or group. That’s more permanent than rerunning a search, and then you can use See Also on documents within that group to find more files with similar subjects, or discover additional terms to search for new sources with a formal Search.

You could also take advantage of smart groups (saved queries) so you’re not running the same searches over and over. Another way I “search” is, with the dates in the Spotlight Comments, to make a smart group for each Year.Month under a Chronology tag. All the files, regardless of the source, will show up there, where I can sort by chronological order or whatever and look through them. (Again, metadata would allow me to further sub-sort those easily.)
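To illustrate why a zero-padded date at the start of the Spotlight Comment makes Year.Month grouping straightforward, here is a minimal sketch outside DEVONthink. It assumes comments in the "YYYY.MM.DD Source" format quoted earlier in the thread; the function names and sample comments are illustrative only, and the smart groups themselves are set up inside DT, not with this code.

```python
from collections import defaultdict


def year_month(comment):
    """Return the 'YYYY.MM' prefix of a comment like '1705.04.03 Flying Courant p8'.

    Raises ValueError if the comment does not start with a dotted date.
    """
    date_part = comment.split(" ", 1)[0]
    year, month, _day = date_part.split(".")
    return f"{year}.{month}"


def group_by_year_month(comments):
    """Bucket comments by Year.Month, in chronological order."""
    buckets = defaultdict(list)
    # Zero-padded dates mean plain string order is also chronological order.
    for comment in sorted(comments):
        buckets[year_month(comment)].append(comment)
    return dict(buckets)


# Illustrative comments in the assumed format
sample = [
    "1705.04.06 Flying Post",
    "1704.12.30 Daily Courant",
    "1705.04.03 Flying Courant p8",
]
groups = group_by_year_month(sample)
```

The same property is what lets a "begins with 1705.04" style query collect a whole month regardless of source.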
That way, between tags, groups and smart groups, I have all my documents available by source, by date, and by topic. With more metadata I’d like to also have it by author, by recipient, by side of author…
You can also include smart groups under other tags/groups: say I’ve got a tag for documents on person X. I also create a sub-smart group that displays all documents that have that person’s name anywhere in the content. I can browse those whenever I want, or, I can go through those and weed out the false hits while adding the person tag to those that are relevant. When done, I just delete the smart group and leave the docs with the tag.
So you could, for example, make an Obit smart group if you put “obit” in the file name and just save a search for all file names with *obit in them. But again, that wouldn’t be as useful for the AI as if you made an Obit group.

Many ways to skin a cat, but some are much better than others for any given cat.

Frederiko · March 17, 2015, 6:04pm

@ostwaldj Those were two great posts with some fascinating insights. Thank you. 111 GB is a serious database!

I have just one comment about incorporating dates in meta-data. The most obvious place to store the date of the newspaper articles is in the date created field. This field is never altered by DT after the document is created and is a perfectly safe place to store information. It also means you can cease to worry about storing the pieces by date because at any time you can create a smart group to just display the relevant documents for a period in time. You can also turn the date created column on and off as you need to see it and save the valuable name space for something else.

For documents prepared outside of DT I prefix the name with the date in the form YYYYMMDD and then use a script, when they are imported, to strip the date from the filename and set the date created field.
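The parsing half of such a script might look roughly like the sketch below. This is not Frederiko's script: the function name is hypothetical, and the step that actually writes the date into DT's date created field is left out, since that depends on DEVONthink's own scripting interface.

```python
import re
from datetime import date

# Eight digits (YYYYMMDD) not followed by another digit, then optional separators.
PREFIX_RE = re.compile(r"^(\d{4})(\d{2})(\d{2})(?!\d)[ _-]*(.+)$")


def split_date_prefix(name):
    """Return (date, remaining_name).

    The date is None, and the name is returned unchanged, when the name has
    no YYYYMMDD prefix or the digits do not form a real calendar date.
    """
    match = PREFIX_RE.match(name)
    if match is None:
        return None, name
    year, month, day, rest = match.groups()
    try:
        return date(int(year), int(month), int(day)), rest
    except ValueError:
        # e.g. month 13 or day 32: leave the name alone rather than guess.
        return None, name
```

For example, `split_date_prefix("18810523 Obituary John Doe")` yields the date 1881-05-23 and the name "Obituary John Doe", which a script could then assign to the record's creation date and name respectively.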

And of course, I agree: please can we have fields to store meta-data, Christian? It would make importing citations, page references and all the other peculiar things you need to associate with a document so much easier.

Frederiko

ostwaldj · March 17, 2015, 8:29pm

Thanks Frederiko.
You can indeed use Date Created in that way, but it’s less than ideal IMHO.

It’s a kludge, replacing metadata X that might be useful with other data Y that might be useful in a different way. There should be a separate spot for each piece of useful data. Sometimes I might want to actually see when I created the document, not when the original author wrote the letter. To give just one non-hypothetical example: maybe I created a whole bunch of documents (notes) when I was in the archives on day X several years ago and now I want to find all the other docs from that day as well (they were from different collections, or maybe I even visited two different archives that day). My wife also does this kind of organizing/searching with photos all the time. I don’t want to lose one piece of metadata just to use another. Plus, I’d think this would be particularly problematic if you’re indexing your files and not importing them, since that changes the only copy of the file you may have.
Ideally we’d have other date metadata as well. Mine is probably an extreme case, but I have five potential dates for each letter: date written in Old Style calendar, date written equivalent in New Style calendar (in my old Access database in addition to DateWritten and OS/NS fields, I included a separate field that converted all dates to the standard Gregorian calendar), date letter was received in OS, date received in NS, and date of content discussed in document (I’d break those up into sub-records) in the standard NS calendar. The other date fields aren’t of use here: the Date Modified field gets overwritten all the time, as it should, and you might also want to sort records by Date Added. It’d be great if such date metadata fields maintained their date property for searching and mathematical manipulation, like relational databases do with field data type.
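For readers unfamiliar with the Old Style/New Style conversion mentioned above, here is a minimal sketch of the arithmetic. It converts a Julian (OS) calendar date to its Gregorian (NS) equivalent by way of the Julian Day Number. The function name is hypothetical, and the sketch assumes the year is already counted from 1 January; it does not handle the English Old Style convention of starting the year on 25 March.

```python
from datetime import date


def julian_to_gregorian(year, month, day):
    """Convert a Julian-calendar (Old Style) date to a Gregorian datetime.date."""
    # Julian calendar date -> Julian Day Number (standard integer formula).
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    jdn = day + (153 * m + 2) // 5 + 365 * y + y // 4 - 32083
    # JDN 1721426 is 0001-01-01 in the proleptic Gregorian calendar,
    # which datetime counts as ordinal 1.
    return date.fromordinal(jdn - 1721425)
```

For dates in the early 1700s the two calendars differ by 11 days (10 days before March 1700), so an OS date of 3 April 1705 corresponds to 14 April 1705 NS.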
Can you batch-edit that field? Spotlight Comment is so incredibly useful because you can batch-edit it within the 3-pane view (select a bunch with the same Spotlight Comment and type away), as well as use code to modify it. This is particularly important if somebody is importing in a whole bunch of records and would like to then batch-edit where they came from, or other metadata. I don’t think you can do that with the other OS X metadata fields (Author, Subject…). It looks like you can modify the Date fields with code (at least there’s Touch Creation Date), but it doesn’t look like you can do any of that manually (i.e. select a bunch of records, open up Get Info and change them all at once). Anyone importing a bunch of data in from elsewhere needs the ability to batch-edit the metadata fields, and Spotlight Comments seems to be the only field that you can do that with.

So we definitely need more custom metadata fields.