Automating DT with JavaScript: Splitting Markdown

chrillek · August 10, 2021, 12:38pm

Occasionally, people have asked for a way to split Markdown files into new documents. There are basically two methods. The simple one uses “split here” markers that are thrown away in the process, so you could insert something like “$$$” into your Markdown files wherever you want to split them. This is described in the first example.

The second one is a bit more complicated. It allows you to define the marker as a regular expression and generates new records at the points where this expression matches. An obvious example would be to split at a certain level of headlines.

Splitting Markdown, simple case

The next example splits Markdown records at a predefined marker. If the prefix variable is set to ‘’, the script will generate new records with the name of the original one, appending “-1”, “-2” etc. If prefix is set to something else, the script will generate the record names prefix-1, prefix-2 and so on.

If no (markdown) records are selected, the script will bail out with an error message. Also, if one of the selected records does not contain the marker at all, an alert is displayed and has to be acknowledged by the user.

Note that the marker disappears in the process; it is purely meant as a “cut here” indicator. See the next example for a “keep the marker” variant.


(() => {
const marker = '$$$' // Marker to split at. Should be on a single line.
/*
Prefix for new records. 
New records will be named 'prefix-1', 'prefix-2' and so on. 
Use '' to use original record's name as prefix
*/
const prefix = 'prefix';

const app = Application("DEVONthink 3");
app.includeStandardAdditions = true;
/*
* get all markdown records from selection
*/
const MDrecords = (app.selectedRecords()).filter(r => r.type() === "markdown");

if (MDrecords.length === 0) { // Abort if no MD records selected.
  app.displayAlert(
    `No Markdown documents selected`, {
    as:  "critical",
    buttons: ['OK'],
  });
  return;
}

// Loop over all Markdown records and split them

MDrecords.forEach(m => splitFile(m, marker, prefix === '' ? m.name() : prefix ));


/* 
Function to split document 'doc' at pattern 'at' into a bunch of new documents named 
'prefix-1', 'prefix-2', 'prefix-3' and so on
*/

function splitFile(doc, at, prefix) {
  const group = doc.parents[0]; /* get the group of the current MD document */
  const chunks = doc.plainText().split(at); /* get the Markdown's text and split it in chunks at the marker */
  if (chunks.length === 1) {
  /* Abort if only one chunk is found, since then there's no marker in it */
    app.displayAlert(
      `No matches found for ${at} in document "${doc.name()}"`, {
      as:  "critical",
      buttons: ['Cancel'],
    }) 
    return;
  }
  let counter = 1;
  chunks.forEach(c =>  newRecord(`${prefix}-${counter++}`, group, c));		
}

/*
Function to create a new markdown record 'recName' in 'inGroup' with plainText 'content'
*/
function newRecord(recName, inGroup, txt) {
  app.createRecordWith({name: recName, type: "markdown", "plain text": txt},
  {in: inGroup});
}

})()
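
To see what the splitting and naming does without touching DEVONthink, here is a minimal sketch of just that logic in plain JavaScript. The helper splitAtMarker and the sample text are assumptions for illustration; they are not part of the script above, which creates records instead of returning an array.

```javascript
// Hypothetical helper: returns the pieces instead of creating DEVONthink records.
function splitAtMarker(text, marker, prefix) {
  const chunks = text.split(marker); // the marker itself is dropped by split()
  if (chunks.length === 1) {
    return []; // marker not found: nothing to split
  }
  // Number the chunks starting at 1, like 'prefix-1', 'prefix-2', ...
  return chunks.map((content, i) => ({ name: `${prefix}-${i + 1}`, content }));
}

const sample = "first part\n$$$\nsecond part\n$$$\nthird part";
const parts = splitAtMarker(sample, "$$$", "notes");
console.log(parts.map(p => p.name)); // [ 'notes-1', 'notes-2', 'notes-3' ]
console.log(JSON.stringify(parts[1].content)); // "\nsecond part\n"
```

Note that the line breaks around the marker stay in the chunks. If that bothers you, calling trim() on each chunk before creating the record would remove them.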

Split Markdown records at headlines

If you want to split Markdown records at headlines, you’ll most probably want to keep those. That’s not possible with the preceding example, since it uses JavaScript’s split method which throws away the strings it splits at. So in order to split somewhere and keep that text, you need a different approach. To illustrate, let’s assume that you have a longish Markdown document that you want to split at the second level headlines. Those are indicated by ## at the beginning of a line.

So assuming you have a Markdown record like this

# Titel

introduction

## First headline

first paragraph

## Second headline

second paragraph

you’d get three new records: the first one containing everything from “# Titel” to just before “## First headline”, the second one everything from “## First headline” to just before “## Second headline”, and the last one everything from “## Second headline” to the end.

The previous script only needs minor modifications. Set the marker like so:
const marker = new RegExp(/(^##\s+.*$)/, "gm");
This defines a regular expression that matches two “#” signs at the start of a line, followed by at least one whitespace character, followed by anything up to the end of the line. In "gm", the “g” makes the expression global, and the “m” lets ^ and $ match the beginning and end of each line, respectively.
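
A quick way to check what the expression matches is to run it against the sample document in plain JavaScript. This is only a sketch: the sample text is retyped from the example above, and the expression is written as a literal instead of with new RegExp, which should be equivalent.

```javascript
// Same regular expression as in the post, written as a literal.
const marker = /(^##\s+.*$)/gm;

const sample = [
  "# Titel", "", "introduction", "",
  "## First headline", "", "first paragraph", "",
  "## Second headline", "", "second paragraph",
].join("\n");

// match() with a global expression returns all matching strings
const headlines = sample.match(marker);
console.log(headlines); // [ '## First headline', '## Second headline' ]

// Without "m", ^ and $ only match at the very start and end of the whole text
console.log(sample.match(/(^##\s+.*$)/g)); // null
```

The first level headline “# Titel” is not matched, because the expression requires two “#” signs followed by whitespace.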

In the function splitFile, change the lines
const chunks = doc … if (chunks.length === 1) {
to this:

const text = doc.plainText();
const matches = [...text.matchAll(at)]; /* get all matches into an array */
if (matches.length === 0) {

Here, you save the text of the record in its own variable text, which you’ll need later on. Then you get all matches for the regular expression (at) in the array matches. In order for matchAll to work, the regular expression needs to be defined as “global” as shown before, otherwise you’ll see an error.
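
The following sketch shows both behaviors of matchAll in isolation: the matches with their index when the expression is global, and the error when it is not. The sample text is made up for the illustration.

```javascript
const text = "# Titel\n\n## First headline\n\n## Second headline\n";

// With the "g" flag, matchAll returns an iterator over all matches;
// spreading it into an array gives one entry per match.
const matches = [...text.matchAll(/^##\s+.*$/gm)];
console.log(matches.map(m => [m.index, m[0]]));
// [ [ 9, '## First headline' ], [ 28, '## Second headline' ] ]

// Without "g", matchAll throws a TypeError
let errorName = null;
try {
  text.matchAll(/^##\s+.*$/m);
} catch (e) {
  errorName = e.name;
}
console.log(errorName); // TypeError
```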

Finally, you need to iterate over the matches to create the new records like so:

let start = 0;
matches.forEach(m => {
  newRecord(`${prefix}-${counter++}`, group, text.substr(start, m.index - start));
  start = m.index;
})
// handle last match
newRecord(`${prefix}-${counter++}`, group, text.substr(start, text.length - start));

Every element of matches is itself an array with a special property index. It contains the numerical position where this match starts. The first new record should comprise everything from the beginning of the text to just before the first headline, i.e. the first match. So the variable start is set to 0, and the function newRecord is passed the part of the text starting at 0 and consisting of all the characters before the first match (m.index - start). After that first step, the script sets start to the beginning of the first match, so that the next pass writes the text from the first headline to just before the second one, and so on.

You may have noticed that the first match is saved in the second new record. So at the end of the forEach loop, the text starting at the last match (i.e. the last headline) has not been written yet. That’s what the final line above takes care of.
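
The loop and the final line can be tried out without DEVONthink with a small sketch like the one below. The helper chunksAt is hypothetical and collects the pieces in an array instead of creating records. It uses slice(start, end) where the script uses substr(start, length); both should return the same text here.

```javascript
// Hypothetical helper: returns the pieces instead of creating records.
function chunksAt(text, regex) {
  const matches = [...text.matchAll(regex)];
  const chunks = [];
  let start = 0;
  for (const m of matches) {
    chunks.push(text.slice(start, m.index)); // everything before this match
    start = m.index;
  }
  chunks.push(text.slice(start)); // from the last match to the end
  return chunks;
}

const sample = "# Titel\n\nintroduction\n\n## First headline\n\nfirst paragraph\n\n## Second headline\n\nsecond paragraph\n";
const chunks = chunksAt(sample, /(^##\s+.*$)/gm);
console.log(chunks.length); // 3
console.log(JSON.stringify(chunks[0])); // "# Titel\n\nintroduction\n\n"
```

One thing this makes visible: if the text starts with a matching headline, the first chunk is an empty string, so a script might want to skip empty chunks before creating a record.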

Full script:

(() => {
    /*
	Regular expression to split at. You can also use a simple string like /## /, but that would match anywhere in the text, too. 
	*/
	const marker = new RegExp(/(^##\s+.*$)/, "gm"); 
	/*
	Prefix for new records. 
	New records will be named 'prefix-1', 'prefix-2' and so on. 
	Use '' to use original record's name as prefix
	*/
	const prefix = '';
const app = Application("DEVONthink 3");
app.includeStandardAdditions = true;
/*
* get all markdown records from selection
*/
const MDrecords = (app.selectedRecords()).filter(r => r.type() === "markdown");

if (MDrecords.length === 0) { // Abort if no MD records selected.
	app.displayAlert(
		`No Markdown documents selected`, {
			as:  "critical",
			buttons: ['OK'],
		});
	return;
}

// Loop over all Markdown records and split them

MDrecords.forEach(m => splitFile(m, marker, prefix === '' ? m.name() : prefix ));

/* 
Function to split document 'doc' at pattern 'at' into a bunch of new documents named 
'prefix-1', 'prefix-2', 'prefix-3' and so on
*/

function splitFile(doc, at, prefix) {
	const group = doc.parents[0]; /* get the group of the current MD document */
	const text = doc.plainText();
	const matches = [...text.matchAll(at)]; /* get all matches into an array */
	if (matches.length === 0) {
	/* Abort if no match is found, since then there's no headline to split at */
		app.displayAlert(
			`No matches found for ${at} in document "${doc.name()}"`, {
				as:  "critical",
				buttons: ['Cancel'],
			})
		return;
	}
	let counter = 1;
	let start = 0;
	matches.forEach(m => {
	  newRecord(`${prefix}-${counter++}`, group, text.substr(start, m.index - start));
	  start = m.index;
    })
	// handle last match
    newRecord(`${prefix}-${counter++}`, group, text.substr(start, text.length - start));
}

/*
Function to create a new markdown record 'recName' in 'inGroup' with plainText 'content'
*/
function newRecord(recName, inGroup, txt) {
	app.createRecordWith({name: recName, type: "markdown", "plain text": txt},
	 {in: inGroup});
}

})()

BLUEFROG · August 10, 2021, 1:51pm

Very nice automated approach.
Thanks for sharing (and explaining) it

Yuv · November 26, 2021, 7:58am

Hello, @chrillek,
I wish to use your first script, but my knowledge of javascript is zero. So, after I copy your script, what do I do then?
If you or anyone else here can guide me, it would be great.
Thank you,
Yuval

rmschne · November 26, 2021, 8:10am

A good first start in the world of automation with DEVONthink is to read the relevant portions of the “Automation” Appendix in the DEVONthink Handbook. Page 181 of Ver 3.8 of that outstanding document.

Yuv · November 26, 2021, 8:15am

Thank you, @rmschne,
I read that, but I still don’t know how to activate the script.
If it’s an AppleScript, I save it with the Mac’s built-in Script Editor and then put it in the scripts folder of DT to use it from there. It doesn’t seem to be the case here.
So, what should I do?

chrillek · November 26, 2021, 8:23am

What makes you think that? Did you copy / paste the code in Script Editor (changing its language to JavaScript) and save it to DT’s script folder from there? What happened? What did not happen? Any error messages?

“It doesn’t seem to be the case here” is unfortunately not a helpful problem description.

Yuv · November 26, 2021, 8:33am

@chrillek, that’s what I was missing. Now that I changed the language, everything is working great!

When I said, “that’s not the case,” I meant I don’t even know what is the right question.
Thank you very much for the script

rmschne · November 26, 2021, 8:39am

Also take a look at the blog post by @chrillek linked at DEVONtechnologies | How to Use JavaScript for Automation (which I found via Google for you).

Yuv · November 26, 2021, 9:08am

@rmschne, thank you for that link. I will check it out.

latl · December 12, 2023, 7:53pm

I’m digging up this old post because it mostly solves something I’d like to do: split very long markdown files of notes at headers and subheads into separate files. I’m cataloging some physical research materials and have been keeping my initial notes in roughly the following way:

Each box gets a group in DT titled with a pre-determined ID on the box (basically noting the box’s physical location in my storage, but it’s arbitrary just so I know what I’ve already reviewed).
I start a new markdown file for my initial survey of the box’s contents and take notes of everything in there (the group may also get scans of documents/images, photos of objects, audio from digitized recordings, etc) with the filename along the lines of BOXTITLE-INVENTORY or whatever.
Each folder or other subcontainer in the box gets a heading in the markdown file (indicated as #), followed by descriptive text forming that heading for the folder based on its label or other descriptive information
Each item in that subcontainer gets a subheading (##). If those items are also containers, for example a smaller envelope of photo prints, the additional container gets another subhead (i.e. ###).
I then describe the item (s) remaining in the container as text

This is working great for me overall, but I’d eventually like a quick way to split the note files and then have an associated markdown file at least for each subcontainer if not each item that I might use as a descriptive file to share with people I’m working with.

I’m only just reading up on scripting, batch processing, etc. I’ve successfully split an example note file into separate markdown files but I’m not quite digesting what I need to do next if I want those results to not just have a list of files titled prefix-1, prefix-2, etc. (in my case, BOXTITLE-INVENTORY-1, etc.).

What do I need to do next? Add to the script? Make a smart rule to run the script then do a regex renaming? something else?

It’s very possible I’m not making sense, but I suspect what I’m looking to do is possible, I just don’t usually do so much automation in my day-to-day so I don’t quite have the language to express what I’m doing (I like learning ways to automate a lot, just don’t always have much time to give to doing so or looking up/trying solutions when I’m trying to get work done).

Thanks!

BLUEFROG · December 12, 2023, 8:06pm

I’m curious why you’re not just creating smaller notes right now instead of imagining yourself splitting them later. It appears you have thought through a process of what your section headers, etc. mean, so… ?

latl · December 12, 2023, 8:22pm

Ah! A simple answer: I didn’t realize I’d need to until I’d made a bunch of these large files and realized they’re annoying at that length.

BLUEFROG · December 12, 2023, 8:32pm

While Zettelkasten doesn’t make sense to everyone (myself included), it seems approaching your notes in a bit more atomic way would be a good start.

And here’s a tip for you… You can actually collate separate Markdown documents in an ad-hoc manner using MultiMarkdown’s file transclusion feature in DEVONthink.

Here is a Markdown document with three Markdown documents transcluded into it…

latl · December 12, 2023, 9:04pm

This is great stuff, but I’m still hoping to figure out ways to break down the stuff I’ve already done (these aren’t the only such examples).

Also, for what it’s worth, my dabbling in transclusion hasn’t really worked for me and how I think. But I see how for some it could be really useful!

BLUEFROG · December 12, 2023, 10:44pm

Using Tools > Split Document will split a document at the insertion point in the source of a Markdown document, yielding a new document with everything past the insertion point.

latl · December 12, 2023, 11:11pm

Right. I have the splitting part done. It’s the renaming part. Specifically renaming multiple pieces (ie dozens per file) once I’ve split the doc. I am trying to avoid repeatedly splitting and splitting again, hence automation. But I want the renaming to come from headers in the document as described above.

I’ll accept if that’s not possible, but that’s what I was trying, again, as described above

cgrunenberg · December 13, 2023, 6:03am

Well, it’s not JavaScript but a simple AppleScript can handle this too:

-- Split Markdown document into sections

tell application id "DNtp"
	set theSeparator to ((ASCII character 10) & "# ") as string
	repeat with theRecord in selected records
		if type of theRecord is markdown then
			set theText to plain text of theRecord
			set {od, AppleScript's text item delimiters} to {AppleScript's text item delimiters, theSeparator}
			set theSections to text items of theText
			set theGroup to location group of theRecord
			create record with {name:paragraph 1 of (item 2 of theSections), content:(item 1 of theSections & theSeparator & item 2 of theSections), type:markdown} in theGroup
			repeat with i from 3 to count of theSections
				set theSection to item i of theSections
				create record with {name:paragraph 1 of theSection, content:(theSeparator & theSection), type:markdown} in theGroup
			end repeat
			set AppleScript's text item delimiters to od
		end if
	end repeat
end tell
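
For readers following the JavaScript side of this thread, here is a rough sketch of the same idea in plain JavaScript. It is an assumption-laden translation, not cgrunenberg’s code: sectionsAtTopLevel is a hypothetical helper that returns names and contents instead of creating records, and it takes the first line of each section as its name, which is what paragraph 1 appears to do in the AppleScript.

```javascript
// Hypothetical helper mirroring the AppleScript's text item delimiters approach.
function sectionsAtTopLevel(text) {
  const separator = "\n# "; // a first-level heading that follows a line break
  const items = text.split(separator);
  if (items.length < 2) return []; // no such heading: nothing to split
  const firstLine = s => s.split("\n")[0];
  // First record: the text before the first heading plus the first section
  const sections = [
    { name: firstLine(items[1]), content: items[0] + separator + items[1] },
  ];
  // Remaining sections get the separator put back in front
  for (const item of items.slice(2)) {
    sections.push({ name: firstLine(item), content: separator + item });
  }
  return sections;
}

const sample = "intro\n# Folder A\nnotes a\n# Folder B\nnotes b";
const sections = sectionsAtTopLevel(sample);
console.log(sections.map(s => s.name)); // [ 'Folder A', 'Folder B' ]
```

As in the AppleScript, a heading on the very first line of the document is not treated as a separator, because there is no line break in front of it.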

chrillek · December 13, 2023, 7:05am

It’s a bit unclear to me what you have working already, so I’m assuming. For example, what do you want to happen with text before the first heading?

I’ll refer to the script in its latest incarnation from this post.

To summarize:

The script works with all currently selected records in DT. As it’s written, it can’t work as a smart rule script. If you want that, minor modifications are needed.
It filters out all records that are not Markdown
With the rest, i.e. only the currently selected Markdown records it calls the function splitFile with the parameter marker.
marker is a regular expression defined at the top of the code that matches all lines with a second level headline (^##\s+). If you want it to split at first level headlines, remove one of the #.
If there’s no 2nd level header in a record, it will not be split (if (matches.length === 0) …): the function splitFile returns and the script continues with the next record
Otherwise, it calls the function newRecord, passing it three parameters for the name, the group, and the content of the new record.

Since you want to name your records after the headings, you’ll have to modify the first parameter in the call to newRecord, which is currently `${prefix}-${counter++}`. The heading ## heading is already in the variable m (matches.forEach(m => …). All you have to do is remove the leading stuff like so:

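let name = prefix; /* name for any text before the first heading */
matches.forEach(m => {
  /* the text before this heading belongs to the previous heading, so use its name */
  newRecord(name, group, text.substr(start, m.index - start));
  /* remove leading "##" and space(s) from heading and put in "name" */
  name = m[0].replace(/^##\s+/,"");
  start = m.index;
})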

That code replaces the complete matches.forEach… loop in the original script. I did not test it, though. Please use with caution.

I don’t know yet how to handle the last match, though (nor if there’s any special handling needed). But perhaps you can figure that out, if necessary.
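
One possible way to handle the last match is to remember each heading’s name until the text that belongs to it is written. The sketch below is plain JavaScript and only an illustration: splitWithHeadingNames and the fallback name are hypothetical, and it returns names and contents rather than calling newRecord.

```javascript
// Hypothetical helper: names every chunk after the heading it starts with.
function splitWithHeadingNames(text, regex, fallbackName) {
  const records = [];
  let start = 0;
  let name = fallbackName; // name for any text before the first heading
  for (const m of text.matchAll(regex)) {
    if (m.index > start) { // skip an empty chunk at the very beginning
      records.push({ name, content: text.slice(start, m.index) });
    }
    name = m[0].replace(/^#+\s+/, ""); // heading without leading "#" and spaces
    start = m.index;
  }
  // Last heading to the end of the text (or the whole text if nothing matched)
  records.push({ name, content: text.slice(start) });
  return records;
}

const sample = "intro\n\n## Folder A\n\nnotes a\n\n## Folder B\n\nnotes b\n";
const records = splitWithHeadingNames(sample, /(^##\s+.*$)/gm, "BOXTITLE-INVENTORY");
console.log(records.map(r => r.name));
// [ 'BOXTITLE-INVENTORY', 'Folder A', 'Folder B' ]
```

In the script, each entry could then be passed on as newRecord(r.name, group, r.content). Text before the first heading ends up in a record named after the fallback, which here stands in for the prefix.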

latl · December 15, 2023, 6:14pm

Thank you for this. Apple Script is fine (I mean, I’m just learning it, but I’ve used it for other things I need done). Apologies for the late reply but I’ve been busy on an unrelated project.

latl · December 15, 2023, 6:19pm

Thank you and sorry for the delayed reply. Been busy with something else.

I think your assumptions were pretty accurate. I wasn’t looking for a smart rule, so that’s fine. I’m expecting to do this with manually selected markdown files as needed, so while it’s great that this filters out non-markdown records I’ve already done so.

I’m not too concerned with what happens before the first headings as really there just isn’t much text in any of these before then and it’s easy enough for me to double check I haven’t missed anything.

The real golden piece I needed was that last bit about naming the records after the headings.

I haven’t tried this yet, but I will. Thanks for the help and the warning to take caution when I do.