Smart-rule action: split at delimiter

Bernardo_V · September 13, 2020, 3:11pm

Hello guys,

here is a suggestion/request for another smart-rule action: split at delimiter and/or split using regex pattern. Not sure how feasible this would be, but it seems that most mechanisms are already in place.

I have been dealing with long markdown files that need to be split and doing so using shell scripts is a pain in the arse. Same goes for applescripts.

Blanc · September 13, 2020, 3:18pm

Interesting concept; how would you set that up? At first incidence of delimiter, or last, or every? Or would that be an optional choice within the rule?

pete31 · September 13, 2020, 3:52pm

This script maybe could do it. It needs a little modification, but it should work to split plain text.

chrillek · September 13, 2020, 4:44pm

You might want to try Perl or Python perhaps. At least Perl is quite apt at text processing.

BLUEFROG · September 13, 2020, 6:46pm

Example text and an example of why it’s not easily done via AppleScript could be helpful.

pete31 · September 13, 2020, 7:29pm

Tested the script with a markdown record of 137.608 words. It took 0.55 seconds to create 60 new records.

Bernardo_V · September 13, 2020, 9:21pm

First and foremost, I am making this suggestion because I think it would be a really cool addition to an already very powerful automation tool available in DT3 (and one that many apps already have).

@chrillek, this is on my to-do list, for sure (but impossible right now).

@Blanc, I think other apps already provide a nice example of how this could work. See, for instance, Scrivener and/or Tinderbox.

@pete31, thanks, I will take a look at it. Right now it seems that it would be a bit slow though. To update my Wikis glossary (which is not even close to the whole thing) I need something capable of splitting some 1M chars into 500-800 files/records. I will test and post back with the results.

@BLUEFROG, as to my case in particular: I have transferred most of my writing to Scrivener. In order to update my Markdown DT3 Wiki, I now export the files as one long markdown text and split it at h1 markdown headers. With smart-rules it is already possible to properly set the tags, aliases and the name of the records using the option to scan the text with regex. The name, for instance, is given by the pattern ^# (.+?)\n

As for splitting, this is the only action I need to perform outside of DT3. I have been using a shell script as a folder action:

for d in "$@"; do
    cd ~/'Databases/md/MD_Splitter' && csplit -k -n 4 "$d" '/^# /' {1000}
done

Like I said, it is a bit of a pain in the arse, as I am not very proficient at shell scripting. It invariably spits out a “no match found” error (but the task gets accomplished nevertheless). I also find it slightly annoying that the files are generated without an extension and, since the script throws an error, I needed to add another, separate action to change the extension (DT3 won’t recognize the files as text files if they don’t have an extension).
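
For the h1 split described here, one alternative that sidesteps both complaints is awk, which can name its output files itself and does not need a repeat count. This is only a minimal sketch, assuming a POSIX awk; the function name `split_at_h1`, the `part` prefix and the four-digit numbering are illustrative choices, not part of anyone's setup in this thread:

```shell
#!/bin/sh
# Sketch: split a Markdown file at every line starting with "# ",
# writing part0000.md, part0001.md, ... into an output directory.
# split_at_h1, the "part" prefix and the numbering are hypothetical choices.
split_at_h1() {
    input=$1
    outdir=$2
    mkdir -p "$outdir" || return 1
    awk -v dir="$outdir" '
        # A new h1 header closes the current output file and starts the next one.
        /^# / {
            if (out != "") close(out)
            out = sprintf("%s/part%04d.md", dir, n++)
        }
        # Text before the first header goes into its own file.
        out == "" { out = sprintf("%s/part%04d.md", dir, n++) }
        { print > out }
    ' "$input"
}
```

Usage would be something like `split_at_h1 export.md ~/Databases/md/MD_Splitter`. As with the csplit pattern, a line starting with `# ` inside a code block in the markdown would also trigger a split.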

chrillek · September 14, 2020, 7:31am

Depending on your setup, you could also run the shell script to split the file from a smart rule. I’d suggest a script like this one

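#!/bin/sh
# Split each input file at h1 headers, then add an extension to the pieces.
cd ~/Databases/md/MD_Splitter && (
  for d in "$@" ; do
    csplit -k -n 4 "$d" '/^# /' {1000}
  done
  # csplit names its output xx0000, xx0001, ...; rename only those files.
  for d in xx[0-9][0-9][0-9][0-9] ; do
    mv "$d" "$d.md"
  done
)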

This should take care of the renaming/extension issue (but beware: I didn’t test it at all. Use at your own risk). Also, it changes into your working directory only once. You can of course choose another extension than “md” in the mv command.
Perl would actually be a bit of an overkill in this situation, since your condition to split is so simple.

Re your regular expression ^# (.+?)\n: I’d go for a $ instead of \n because it matches end of line regardless of the current character(s) used for it. This might be relevant if your file comes from an environment where \n is not used for end of line (Windows comes to mind). The ? shouldn’t be necessary here because you want to gobble up all of your characters until the end of line anyway.
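
The end-of-line point can be tried outside DT3 as well. The following is a rough sketch using sed with extended regular expressions (`-E`), not the regex engine DT3 uses: sed has no `\h` and no lazy `+?`, so `[[:space:]]` stands in for the former and the trailing whitespace is trimmed in a separate step. sed also treats only `\n` as a line terminator, so carriage returns are removed first. The helper name `h1_title` is made up for illustration:

```shell
#!/bin/sh
# Sketch: print the title of the first h1 header in a Markdown file,
# without the leading "# " and without trailing whitespace.
# h1_title is a hypothetical helper name.
h1_title() {
    # tr drops carriage returns so Windows line endings do not end up in the title.
    tr -d '\r' < "$1" |
        sed -n -E '/^# /{
            s/^# +//
            s/[[:space:]]+$//
            p
            q
        }'
}
```

Anchoring the second substitution with `$` is what lets it ignore whatever follows the title on that line, which is the same idea as replacing `\n` with `$` in the smart-rule pattern.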

Bernardo_V · September 14, 2020, 12:51pm

Thanks for the suggestion. Apparently, there is something buggy about the csplit that comes with MacOS. I installed coreutils via homebrew and it works without any problems.
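
With GNU csplit, both the fixed repeat count and the missing extension can be handled by csplit itself. A sketch under the assumption that GNU csplit is available (Homebrew's coreutils usually installs it as `gcsplit`, hence the overridable `CSPLIT` variable); the `split_md` name and the `part` prefix are hypothetical:

```shell
#!/bin/sh
# Sketch, assuming GNU csplit (coreutils); BSD csplit lacks {*}, -b and -z.
# CSPLIT can be overridden, e.g. CSPLIT=gcsplit with Homebrew's coreutils.
CSPLIT=${CSPLIT:-csplit}

# split_md is a hypothetical helper: split $1 at h1 headers into directory $2.
split_md() {
    mkdir -p "$2" || return 1
    # -s          : do not print the size of each piece
    # -z          : skip empty pieces (e.g. when the file starts with a header)
    # -f, -b      : name the pieces <dir>/part0000.md, part0001.md, ...
    # '/^# /' {*} : split before every matching line, however many there are
    "$CSPLIT" -s -z -f "$2/part" -b '%04d.md' "$1" '/^# /' '{*}'
}
```

Because `{*}` repeats the pattern as often as it matches, there is no “no match” error from an overshooting count such as `{1000}`.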

Eventually, there will be two spaces at the end of the line, so the pattern I am using is actually ^# (.+?)\h*\n, but I guess using $ instead of \n makes perfect sense.

I will see if I can fit it into a smart-rule and how it would work. I can’t remember right now if I can call a shell script directly from a smart-rule or if I have to use either a folder action or an applescript.

chrillek · September 14, 2020, 1:23pm

There’s no direct support for calling a shell script from smart rules. You could go through AppleScript and use do shell script.

Bernardo_V · September 16, 2020, 12:05pm

For our shell script experts:

Is there a way to rename files using the first line of the text contents of the file (minus the #\h part)?

chrillek · September 16, 2020, 12:33pm

something like
name=`head -1 $file | sed -e 's/\h#//'`
Get the first line of the file with head and feed it to sed to remove what you don’t want. As always, I didn’t test it.

Bernardo_V · September 16, 2020, 12:38pm

And assuming I want to apply this to all files within a given directory, would it be something like this:

cd blah && for d in *; do name=`head -1 $file | sed -e 's/\h#//'` done

chrillek · September 16, 2020, 12:50pm

Either d/$d or file/$file.
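
Putting the pieces of this exchange together, a fuller version of the loop might look like the sketch below. It assumes every file starts with a `# Title` line and that the result should get an `.md` extension. Since sed does not know `\h`, a literal space is matched instead. The function name `rename_by_heading` is hypothetical, and replacing `/` with `-` is just one way to keep the title usable as a file name:

```shell
#!/bin/sh
# Sketch: rename every regular file in a directory after its first line,
# minus the leading "# ". rename_by_heading is a hypothetical helper name.
# The body runs in a subshell, so the caller's working directory is unchanged.
rename_by_heading() (
    cd "$1" || exit 1
    for f in *; do
        [ -f "$f" ] || continue
        # First line, without "# " and trailing whitespace; "/" cannot be part of a file name.
        name=$(head -n 1 "$f" | sed -e 's/^# *//' -e 's/[[:space:]]*$//' | tr '/' '-')
        # Leave the file alone if there is no usable title or the target exists.
        [ -n "$name" ] || continue
        [ -e "$name.md" ] && continue
        mv -- "$f" "$name.md"
    done
)
```

Called as `rename_by_heading blah`, this uses the same variable name in the `for` line and inside the loop, which is the point of the correction above.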

Bernardo_V · September 16, 2020, 12:50pm

Perfect. Thanks, @chrillek!

Blanc · September 16, 2020, 12:55pm

if it is #\h, then why is the regex expression \h#? Honest question, regex and I have not yet become friends

Bernardo_V · September 16, 2020, 12:56pm

Typing error.

Blanc · September 16, 2020, 1:18pm

Right

If either of you has time, could you explain? name=`head -1 $d` I understand; the -e option for sed I’m not so sure about: it simply tells sed to execute the following script (or expression), correct? The following s is not part of the regex, but a command, I take it? Subtract, maybe? Then / marks the following \ as a literal character? The // at the end is there because?

In the shell script, what is the meaning of “&&”?

(I’m happy for you to say go away, this is not a script/regex learning place; I’m sitting in front of a number of websites, trying to teach myself what it is you have conjured up, but I’m not finding it exactly self evident. As I say, only if you have time on your hands…)
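
For readers with the same questions, a small sketch may help. In sed, `-e` introduces a script, `s` is the substitute command in the form `s/pattern/replacement/`, and the slashes are only separators, so two slashes in a row mean an empty replacement. In the shell, `A && B` runs B only if A succeeded. The two function names below are made up for illustration, and a literal space is used where the thread writes `\h`:

```shell
#!/bin/sh
# s/PATTERN/REPLACEMENT/ : the first "/" starts the pattern, the second ends it,
# and the third ends the (here empty) replacement, so the match is deleted.
# strip_marker is a hypothetical helper name.
strip_marker() {
    printf '%s\n' "$1" | sed -e 's/^# //'
}

# A && B runs B only if A succeeded (exit status 0): ls is skipped when cd fails.
# The parentheses run both in a subshell, so the caller stays where it was.
# cd_then_list is a hypothetical helper name.
cd_then_list() {
    ( cd "$1" && ls )
}
```

This is why the scripts in this thread start with `cd ... &&`: if the directory does not exist, nothing after the `&&` is run in the wrong place.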

BLUEFROG · September 16, 2020, 1:20pm

Why don’t you merely use a Change Name action with the Proposed Name placeholder?

For example…

You can see it stripped the header characters and renamed it properly.

You can easily test this in a Batch Process.

chrillek · September 16, 2020, 1:25pm

How boring