Smart-rule action: split at delimiter

Hello guys,

here is a suggestion/request for another smart-rule action: split at delimiter and/or split using regex pattern. Not sure how feasible this would be, but it seems that most mechanisms are already in place.

I have been dealing with long Markdown files that need to be split, and doing so using shell scripts is a pain in the arse. Same goes for AppleScript.

Interesting concept; how would you set that up? Split at the first occurrence of the delimiter, or the last, or every one? Or would that be an optional choice within the rule?

This script maybe could do it. It needs a little modification, but it should work to split plain text.

You might want to try Perl or Python perhaps. At least Perl is quite apt at text processing.

Example text and an example of why it's not easily done via AppleScript could be helpful.

Tested the script with a 137,608-word Markdown record. It took 0.55 seconds to create 60 new records.

First and foremost, I give this suggestion because I think it would be a really cool addition to an already very powerful automation tool available in DT3 (one that many apps already have).

@chrillek, this is in my to do list, for sure (but impossible right now).

@Blanc, I think other apps already provide a nice example of how this could work. See, for instance, Scrivener and/or Tinderbox.

@pete31, thanks, I will take a look at it. Right now it seems that it would be a bit slow though. To update my wiki's glossary (which is not even close to the whole thing) I need something capable of splitting some 1M chars into 500-800 files/records. I will test and post back with the results.

@BLUEFROG, as to my case in particular: I have transferred most of my writing to Scrivener. In order to update my Markdown DT3 wiki, I now export the files as one long Markdown text and split it at h1 Markdown headers. With smart rules it is already possible to properly set the tags, aliases and the name of the records using the option to scan the text with regex. The name, for instance, is given by the pattern ^# (.+?)\n

As for splitting, this is the only action I need to perform outside of DT3. I have been using a shell script as a folder action:

for d in "$@"; do
    cd ~/'Databases/md/MD_Splitter' && csplit -k -n 4 "$d" '/^# /' {1000}
done

Like I said, it is a bit of a pain in the arse, as I am not very proficient at shell scripting. It invariably spits out a "no match found" error (but the task gets accomplished nevertheless). I also find it slightly annoying that the files are generated without an extension; and since the script throws an error, I needed to add another separate action to change the extension (since DT3 won't recognize the files as text files if they don't have one).
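One way to sidestep both the missing extension and the no-match error could be to let awk do the splitting, since it can name the output files itself. A minimal sketch, not tested against a real Scrivener export; the function name split_md and the part0001.md naming scheme are placeholders made up for illustration:

```shell
#!/bin/sh
# split_md FILE [OUTDIR]   (hypothetical helper, not part of DT3 or csplit)
# Writes one .md file per line starting with "# ".
# Any text before the first heading goes into part0000.md.
split_md() {
    src=$1
    outdir=${2:-.}
    mkdir -p "$outdir" || return 1
    awk -v dir="$outdir" '
        BEGIN { n = 0; out = sprintf("%s/part%04d.md", dir, n) }
        /^# / {
            close(out)                      # avoid running out of open files
            n++
            out = sprintf("%s/part%04d.md", dir, n)
        }
        { print > out }                     # the heading line stays in its own file
    ' "$src"
}
```

Calling split_md export.md ~/Databases/md/MD_Splitter would then leave part0001.md, part0002.md and so on in that folder, already carrying an extension.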

Depending on your setup, you could also run the shell script to split the file from a smart rule. I'd suggest a script like this one

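#!/bin/sh
# Split each argument at its h1 headers, then add an extension to the pieces.
cd ~/Databases/md/MD_Splitter && (
  for d in "$@" ; do
    csplit -k -n 4 "$d" '/^# /' {1000}
  done
  # csplit names its output xx0000, xx0001, ... by default
  for d in xx[0-9][0-9][0-9][0-9]; do
    mv "$d" "$d".md
  done
)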

This should take care of the renaming/extension issue (but beware: I didn't test it at all. Use at your own risk). Also, it changes into your working directory only once. You can of course choose another extension than "md" in the mv command.
Perl would actually be a bit of an overkill in this situation, since your condition to split is so simple.

Re your regular expression ^# (.+?)\n: I'd go for a $ instead of \n because it matches end of line regardless of the current character(s) used for it. This might be relevant if your file comes from an environment where \n is not used for end of line (Windows comes to mind). The ? shouldn't be necessary here because you want to gobble up all of your characters until the end of line anyway.

Thanks for the suggestion. Apparently, there is something buggy about the csplit that comes with macOS. I installed coreutils via Homebrew and it works without any problems.

There may be two spaces at the end of the line, so the pattern I am using is actually ^# (.+?)\h*\n, but I guess using $ instead of \n makes perfect sense.
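For anyone who wants to try the same idea outside DT3, here is a rough shell equivalent of that pattern. It is only a sketch: heading_title is a made-up name, and because sed (as far as I know) does not understand the \h shorthand, it uses the POSIX class [[:space:]] instead, which also swallows a stray carriage return from Windows line endings.

```shell
# heading_title (hypothetical helper): read text on stdin and print the
# title of the first "# " heading, without trailing whitespace.
heading_title() {
    sed -n -E '/^# /{
        s/^# //
        s/[[:space:]]*$//
        p
        q
    }'
}

printf '# My Title  \nbody text\n' | heading_title   # prints: My Title
```

Anchoring with $ rather than \n is what lets the trailing-whitespace cleanup work no matter how the line ends.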

I will see if I can fit it into a smart rule and how it would work. I can't remember right now if I can call a shell script directly from a smart rule or if I have to use either a folder action or an AppleScript.

There's no direct support for calling a shell script from smart rules; you could go through AppleScript and use do shell script.

For our shell script experts:

Is there a way to rename files using the first line of the text contents of the file (minus the #\h part)?

something like
name=`head -1 $file | sed -e 's/\h#//'`
Get the first line of the file with head and feed it to sed to remove what you don't want. As always, I didn't test it.

And assuming I want to apply this to all files within a given directory, would it be something like this:

cd blah && for d in *; do name=`head -1 $file | sed -e 's/\h#//'` done

Either d/$d or file/$file.
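Putting the pieces together, the loop could look roughly like the sketch below. It is untested against real data and makes a few assumptions that are not from this thread: the split files already end in .md, a slash in a title is replaced by a dash so the name stays valid, and existing files are never overwritten. The name rename_by_heading is made up.

```shell
# rename_by_heading DIR (hypothetical helper): rename every *.md file in DIR
# after its first line, minus the leading "# ". Runs in a subshell, so the
# caller's working directory is left alone.
rename_by_heading() (
    cd "$1" || exit 1
    for f in *.md; do
        [ -f "$f" ] || continue
        first=$(head -n 1 "$f")
        case $first in
            '# '*) ;;            # first line is an h1 header: go on
            *) continue ;;       # anything else: leave the file as it is
        esac
        name=$(printf '%s\n' "$first" |
            sed -e 's/^# *//' -e 's/[[:space:]]*$//' -e 's|/|-|g')
        [ -n "$name" ] || continue
        if [ -e "$name.md" ]; then
            continue             # do not overwrite an existing record
        fi
        mv -- "$f" "$name.md"
    done
)
```

Quoting "$f" and "$name.md" matters here, since the titles will usually contain spaces.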

Perfect. Thanks, @chrillek!

If it is #\h, then why is the regex \h#? Honest question, regex and I have not yet become friends :see_no_evil:

Typing error. :wink:

Right :see_no_evil:

If either of you have time, could you explain? name=`head -1 $d` I understand; the -e option for sed I'm not so sure about: it simply tells sed to execute the following script (or expression), correct? The following s is not part of the regex, but a command, I take it? Subtract, maybe? Then / marks the following \ as a literal character? The // at the end is there because?

In the shell script, what is the meaning of "&&"?

(I'm happy for you to say go away, this is not a script/regex learning place; I'm sitting in front of a number of websites, trying to teach myself what it is you have conjured up, but I'm not finding it exactly self evident. As I say, only if you have time on your hands...)
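A small annotated sketch may help with both questions. In short: a && b runs b only if a succeeded, -e introduces the sed script that follows, and s is sed's substitute command, written s/PATTERN/REPLACEMENT/. The slashes are just separators, so an empty replacement (the // at the end) deletes whatever matched. The function name strip_marker is made up, and the pattern uses a literal space rather than \h.

```shell
# "a && b": b runs only when a exits successfully.
true  && echo "runs, because true succeeded"
false && echo "never runs, because false failed"
# That is why "cd somewhere && ..." is safer than "cd somewhere; ...":
# if the cd fails, nothing after && touches the wrong directory.

# s/PATTERN/REPLACEMENT/ replaces the first match on each line.
# Here the pattern is "# " at the start of the line, the replacement is empty.
strip_marker() {
    sed -e 's/^# //'
}

echo '# Glossary' | strip_marker    # prints: Glossary
```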

Why don't you merely use a Change Name action with the Proposed Name placeholder?

For example...

You can see it stripped the header characters and renamed it properly.

You can easily test this in a Batch Process.

How boring :wink: