Archive and Source of Truth for Living Documents: DT to WIKI

Spivak · January 14, 2022, 5:50pm

Hi All,

I’m a member of a labor union that is governed by its constitution through its members. Our history goes back decades, and all of our agreements, rules, and procedures are the result of democratic processes. This means that, as a member, if you encounter a rule or contract clause, there was at least one meeting at which it was discussed and approved. As is human nature, we sometimes disagree amongst ourselves and with our employers over the source, wording, or meaning of a particular rule or procedure. When that happens, we can spend hours in the back office flipping through drawers of meeting minutes, agreements, and arbitrations to find the documents to support a resolution. With the exception of the rigorous democratic process, this scenario is probably not too different from what one may experience in any organization with a little history.

When this project started, we were convinced we needed to scan and OCR all of our meeting minutes and agreements to speed up our searches and feel a bit more confident that the search was exhaustive. That’s when I thought to dump our documents into DT, using it at first as a more powerful search engine to sift through them. This helped immensely, but we still had some problems. The first was OCR errors, and the possibility that something could easily be missed in a search. The second was our own evolving terminology. Over the decades, the lexicon of our organization has changed, and it became very easy to overlook a topic if you didn’t know what to search for. So I started thinking about classification.

Although our meeting minutes were the authoritative source of information, the subject matter of a single meeting varied so much that it was difficult to assemble a comprehensive thread on the history of an issue by just searching the PDFs. Since the individual motions approved at the meetings were where the magic happened, it was becoming clear that I should take the time to isolate and classify them. When researching different procedures I could use to do this, I discovered this wonderful script in the forums that could do almost exactly what I wanted. It has taken roughly a hundred hours of my free time over the last couple of years to go through 60 years of minutes, but I now have a beautiful collection of “note cards”, each of which contains exactly one motion, links to the original source, and is extensively categorized with tags. I was able to fix the OCR errors as I went along, and tags helped eliminate the issue of lexical inconsistency. It’s a great resource for me, but now it’s time to share this work with the rest of our organization.

Unfortunately, not everyone in our Union uses a Mac or has the time or tech savvy to use DT the way I do. So I am attempting to move my work into an online WIKI format where it can live on and be utilized for searches and further categorization by others. I’m in the early stages now, prototyping a proof-of-concept before I create the real thing.

I thought I’d share my experience here as an example of how DT has helped me do my work and what hurdles I may need to clear along the way. If it helps someone else conceptualize a similar solution to their own knowledge problems or inspires someone to take it in a surprising direction, I’ll feel like it was worth sharing. Please feel free to question my choices or offer suggestions along the way if you are interested. I have a tendency to mull things over too much and can easily get paralyzed in indecision. If I’ve learned one thing while learning to use DT, it is to just jump in and not worry too much about wasted effort once you get going. There’s always a way to pivot on to another track if you need to. Nothing is perfect, and no matter how good it is, there’s always something that could be improved. On to the project:

Platform Selection
When choosing a platform to make my work available to others, I considered many options: DT Server, a shared storage volume on the office network, a website on the office network, and a website on the internet. Each had drawbacks. I thought DT Server would be too fussy for my non-tech colleagues to use, and the potential for catastrophic screw-ups made me nervous, even with backups. The shared storage volume on the office network seemed a little safer, but ultimately lacked features for collaboration and organization. A website on the office network sounded like it could be customized a little more, but one on the internet would allow people to research and work from home or out in the field where they might need something to help them make an argument in a dispute on the job. So then, what kind of website?

My first instinct was to use a blog. Most good blogs support tags, search, and comments, but after spinning up an instance of Ghost on a Digital Ocean droplet, I soon realized it would require too many custom views and too much time to customize once I migrated the information into the system. Finally, I settled on a WIKI. I wouldn’t have to customize it as much, and I would get the benefit of page history to go along with tags, search, and comments. Most people know what a WIKI is, from the basic consumer of information to the contributors. I chose WIKI.JS since Digital Ocean has a one-click droplet to get you started and the style of the site is modern and straightforward. The jury is still out as to whether or not this is the best choice, but in the interest of moving forward, I’m going for it. I may end up using MediaWiki for its more familiar layout and mature code base, but I have my own preferences to indulge for the time being.

WIKI.JS
OK, I have a properly configured server with SSL and user permissions to keep away prying eyes. How do I get my stuff into it? I have a few goals: upload our individual motions as markdown pages with their tags, upload the original PDFs as a reference source, and upload the OCR’d text from the PDFs as markdown pages to be searched (and ultimately formatted and corrected when folks can get around to it). The PDFs and text sets of complete minutes will have to be uploaded first, as they are more static and will be referenced by the motion pages. Thanks to cgrunenberg’s script, I can see that I should be able to automate the tagging and upload of my individual motion files to the wiki without too much trouble. I’m still working out how to make the connection to the server to make requests via the GraphQL interface, but I’m hoping I can just use the script editor on my Mac. I’ve been remiss in learning this tool, and now may be the time.
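
For readers wondering what a request to the GraphQL interface might look like, here is a minimal sketch in Python rather than the Mac script editor. It is not a script from this thread. The `/graphql` path, the Bearer API key header, and the `pages { list }` query are assumptions based on how Wiki.js 2.x is commonly set up and should be checked against your own install; the host name and key are placeholders.

```python
import json
import urllib.request


def build_graphql_request(base_url, api_token, query, variables=None):
    """Build (but do not send) a POST request for a GraphQL endpoint."""
    payload = {"query": query, "variables": variables or {}}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/graphql",  # assumed endpoint path
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + api_token,  # assumed auth scheme
        },
        method="POST",
    )


def send(request):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Assumed query: list existing pages to confirm the connection works.
LIST_PAGES = "{ pages { list { id path title } } }"

if __name__ == "__main__":
    req = build_graphql_request(
        "https://wiki.example.org", "PASTE-API-KEY-HERE", LIST_PAGES
    )
    print(req.full_url)
    # print(send(req))  # uncomment once the URL and key are real
```

Building the request separately from sending it makes it easy to inspect the URL, headers, and body before anything touches the server.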

I like when a plan comes together, and I can already foresee adding to our repository of organizational knowledge after I get the meeting minutes up there. We are buried in documents and reference them constantly to guide our daily work life. It’s a real shot in the arm to know how you fit in an organization and how to change things when necessary, and I hope this resource will help others alleviate the feeling that it’s all this faceless bureaucracy that bears down on us imposing its inhuman will.

That’s where I’m at for now. I’ll add to the post or the thread as interesting things happen. I should have some useful scripts to share by the time I’m done. Thanks for reading this far.

Spivak · January 19, 2022, 1:20am

UPDATE

So I’ve been using a little trial and error (mostly error!) to answer some of my questions above. Here are some observations so far:

Uploading the PDFs

WIKI.js calls files to be downloaded or embedded in pages “assets”, and as far as I can tell, these need to be uploaded via the admin interface of the wiki. Due to my lack of experience configuring docker containers and a lack of interest in tracking down the appropriate Apache configs to alter the upload limits, I ended up spending about 45 minutes uploading about a thousand PDF files, ten at a time, to be the official reference used to back up any editable text one may encounter while browsing the wiki. They are all up there and can be referenced and linked to from within any page.

Uploading pages

Pages in WIKI.js are individual editable files (markdown in my case). In addition to the thousand pages of editable OCR’d PDF text, I have many thousands of individual motions to upload. Suffice it to say I have zero interest in doing this by hand. I’ve spent the last few days exploring the GraphQL API that WIKI.js exposes and some test versions of my files that I will upload via API. I made a Python 3 script that will take a folder full of .md files and run through it recursively to parse each file into a GraphQL Mutation that can be run via the API. I’ve run a test on a couple of nested sets of folders with .md documents in them and had the files appear as pages in the wiki. This is not the speediest operation, and I could see it taking an hour or two to crunch through all of my stuff when I do it for real. No worries though. I can relax and watch a movie while the script does the hard work.
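
The general shape of such a script can be sketched as follows. This is not the script described above, only an illustration of the approach: walk a folder recursively and turn each .md file into a GraphQL payload. The mutation text and its variable names are assumptions based on a reading of the Wiki.js 2.x schema and should be verified against your own server; the choices of `locale`, `title`, and an empty `tags` list are placeholders.

```python
from pathlib import Path

# Assumed shape of the Wiki.js 2.x page-creation mutation; verify before use.
CREATE_PAGE = """
mutation ($content: String!, $description: String!, $editor: String!,
          $isPublished: Boolean!, $isPrivate: Boolean!, $locale: String!,
          $path: String!, $tags: [String]!, $title: String!) {
  pages {
    create(content: $content, description: $description, editor: $editor,
           isPublished: $isPublished, isPrivate: $isPrivate, locale: $locale,
           path: $path, tags: $tags, title: $title) {
      responseResult { succeeded errorCode slug message }
      page { id }
    }
  }
}
"""


def page_variables(md_file, root):
    """Map one markdown file to the variables for the mutation."""
    relative = md_file.relative_to(root).with_suffix("")
    return {
        "content": md_file.read_text(encoding="utf-8"),
        "description": "",
        "editor": "markdown",
        "isPublished": True,
        "isPrivate": False,
        "locale": "en",               # placeholder locale
        "path": relative.as_posix(),  # folder layout becomes the wiki path
        "tags": [],                   # placeholder; real tags come from the file
        "title": md_file.stem,        # placeholder; file name used as the title
    }


def build_payloads(root):
    """Yield one GraphQL payload per .md file found under root."""
    root = Path(root)
    for md_file in sorted(root.rglob("*.md")):
        yield {"query": CREATE_PAGE, "variables": page_variables(md_file, root)}
```

Each payload would then be sent as the JSON body of a POST to the wiki's GraphQL endpoint, one request per page, which is consistent with the run being slow for many thousands of files.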

What’s next?

Now I get to run through my DT database and prep the files for processing. I need to convert all of my PDFs to markdown, exploring any way I might be able to do some basic formatting for headings and stuff without screwing things up too much. With decades of files, the formatting has changed along the way, making this perhaps a dicey proposition. Add OCR errors, and it could look like mulch.
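
One conservative way to approach heading detection is sketched below. It assumes, purely for illustration, that section headings in the OCR’d text appear as short lines in all capitals; that will not hold for every decade of minutes, so the rule is deliberately narrow and leaves everything else untouched. The function name and the `max_words` threshold are hypothetical.

```python
def promote_headings(text, max_words=8):
    """Turn short all-caps lines into markdown headings; leave the rest alone."""
    result = []
    for line in text.splitlines():
        stripped = line.strip()
        is_heading = (
            any(c.isalpha() for c in stripped)        # skip blank or numeric lines
            and stripped == stripped.upper()          # all capitals
            and len(stripped.split()) <= max_words    # short enough to be a title
            and not stripped.endswith((".", ",", ";"))
        )
        result.append("## " + stripped if is_heading else line)
    return "\n".join(result)
```

Because the function only ever adds a `## ` prefix, a wrong guess is easy to spot and undo later in the wiki editor.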

After that, I will need to use the cgrunenberg script to embed my tags right into the files, along with some other attributes that will be translated to the wiki, like links back to the original source PDFs and the editable versions of them on the wiki.
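
To illustrate what embedding tags and attributes in a file could look like, here is a minimal sketch that writes a small metadata header at the top of a markdown file and reads it back. The header layout and the key names (`tags`, `source_pdf`, `minutes_page`) are hypothetical, chosen only for this example; they are not what the cgrunenberg script produces. Tags containing commas would need a different separator.

```python
def add_header(body, tags, source_pdf, minutes_page):
    """Prepend a simple key: value header block to a markdown body."""
    lines = [
        "---",
        "tags: " + ", ".join(tags),
        "source_pdf: " + source_pdf,      # link to the original scanned PDF
        "minutes_page: " + minutes_page,  # link to the editable minutes page
        "---",
        "",
    ]
    return "\n".join(lines) + body


def split_header(text):
    """Return (metadata dict, body). Files without a header pass through."""
    if not text.startswith("---\n"):
        return {}, text
    header, separator, body = text[4:].partition("\n---\n")
    if not separator:
        return {}, text
    meta = {}
    for line in header.splitlines():
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    if "tags" in meta:
        meta["tags"] = [t.strip() for t in meta["tags"].split(",") if t.strip()]
    return meta, body
```

An upload script could call `split_header` on each file, pass the tags to the wiki separately, and store only the body as page content.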

As a bonus exercise, I am going to think on how to possibly embed links to the individual motions in the editable sets of minutes that they came from. Previously, I merely highlighted the relevant text in the PDF to indicate it had been atomized. A link in the md version would be really cool, but this may have to be inserted manually later.
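
A first pass at inserting those links automatically might look like the sketch below. It assumes the text of each motion card still appears verbatim in the markdown version of the minutes, which will often be false once OCR errors have been corrected on one side only, so it also reports the motions it could not place for manual follow-up. The function name and the wiki paths are hypothetical.

```python
def link_motions(minutes_md, motions):
    """Wrap the first exact occurrence of each motion excerpt in a markdown link.

    motions is a list of (excerpt, wiki_path) pairs.
    Returns (updated_text, list of wiki paths that could not be matched).
    """
    unmatched = []
    for excerpt, wiki_path in motions:
        if excerpt in minutes_md:
            link = "[" + excerpt + "](" + wiki_path + ")"
            minutes_md = minutes_md.replace(excerpt, link, 1)
        else:
            unmatched.append(wiki_path)  # left for manual linking later
    return minutes_md, unmatched
```

For example, calling it with one matching excerpt and one that does not appear returns the linked text plus a one-item list of paths to handle by hand.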

That’s it for now. Four days later, and all I have are a couple of hundred lines of Python 3 script to show for myself. I do have a job though, so not too bad I guess. I’ll share my work later, after it has gone off without a hitch and I’ve had a chance to add copious comments.