Extract Aliases from second line of text


I probably have a very naive question here. In response to this fascinating post about using wikilinks as a translation tool, I decided to import my own text-based Latin dictionary into DEVONthink. I was able to split each definition into a different rich text file, but what I would like to do now is extract parts of the text within each file to use as aliases. Fortunately enough, the format of each file is pretty regular. Unfortunately, I don’t know how to use the Scan Text smart rule action with enough finesse to get the result I want.

Here’s what I’m dealing with. Most of my text files look like this:

i.e., with the word presented on the first line without macrons, then the word presented again with macrons followed by its gender, and then a definition in italics.

I would like to be able to extract all of the second line before the “(f)” and use that to create new aliases. I know how to use the Scan Text action to search for a string before “(f)”, and then use a placeholder to change the aliases to the results. What I don’t know is how to look for a string only on the second line. Any help would be appreciated!

Thanks in advance! (and thanks to DEVONthink for creating software that allows people with no experience in programming or the like to take their first steps in automation)

This Smart Rule script adds aliases from the second line.

Make sure to process each record only once.

-- Smart Rule - Add aliases from second paragraph

on performSmartRule(theRecords)
	tell application id "DNtp"
		try
			repeat with thisRecord in theRecords
				set theText to plain text of thisRecord
				if theText ≠ "" then
					set thisParagraph to paragraph 2 of theText
					-- strip the trailing " (f)" (four characters)
					set thisParagraph_clean to (characters 1 thru -5 in thisParagraph) as string
					set newAliases_list to my tid(thisParagraph_clean, ", ")
					set oldAliases to aliases of thisRecord
					if oldAliases ≠ "" then
						set oldAliases_list to my tid(oldAliases, {", ", "; ", ",", ";"})
					else
						set oldAliases_list to {}
					end if
					set allAliases to my tid(oldAliases_list & newAliases_list, ", ")
					set aliases of thisRecord to allAliases
				end if
			end repeat
		on error error_message number error_number
			if the error_number is not -128 then display alert "DEVONthink" message error_message as warning
		end try
	end tell
end performSmartRule

on tid(theInput, theDelimiter)
	set d to AppleScript's text item delimiters
	set AppleScript's text item delimiters to theDelimiter
	if class of theInput = text then
		set theOutput to text items of theInput
	else if class of theInput = list then
		set theOutput to theInput as text
	end if
	set AppleScript's text item delimiters to d
	return theOutput
end tid

Wow, thanks! Works like a charm!

Out of interest, could I have done this without scripting? I’m very grateful for this script, but there’s no way I could have come up with it, at least with my current level of expertise.

I guess I’m asking for more documentation as to how to use the scan name and scan text features.


You could probably get the desired part of the second paragraph with Scan Text and a regex but it seems there’s no way to split the result.

But in your case the desired string is already formatted with a comma, so no splitting or further processing is necessary; you could indeed use Scan Text and append the complete result string via Change Aliases. I hadn’t checked before whether that’s possible with default Smart Rule actions.

How would I do that? I’m sorry, I’ve tried to work out a way to scan for a string with a comma, but I really can’t figure it out. I’m quite illiterate in these matters.

No need to scan for a comma.

From help:

Regular Expression: Items in parentheses are captured; items outside parentheses are ignored. You can specify multiple captures in an expression. Using the captured text in subsequent actions is specified by using a backslash, \, and the number of the capture, starting at 1.

Here’s a working regex (?<=\n)(.*?)(?=\(f\))
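For anyone who wants to see what this regex does outside of DEVONthink, here’s a minimal JavaScript sketch (the sample record text is made up to mirror the dictionary-entry layout described above; it is not from the OP’s files):

```javascript
// The regex from above: a lookbehind anchors the match to a position right
// after a newline (i.e. the start of the second line), the lazy group
// captures as little as possible, and the lookahead stops before "(f)".
const text = "influentia\ninfluentia, influentiae (f)\nan inflowing";
const match = text.match(/(?<=\n)(.*?)(?=\(f\))/);
console.log(match[1]); // → "influentia, influentiae " (note the trailing space)
```

Because `.` does not match newlines, the lazy capture can never spill past the second line even if later lines also contain “(f)”.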

Change Name/Aliases/Comment/Label/Rating: Change the specific attribute of the matched file. For items with an existing attribute, e.g., a comment, a placeholder will preserve the existing value.

However with default Smart Rule actions it doesn’t seem to be possible to add the regex result to existing aliases in a clean way:

If aliases exist and we place the necessary comma between the Aliases placeholder and the regex result …

Aliases, \0

MyOldAlias, influentia, influentiae

… but if no aliases exist we would end up with …

, influentia, influentiae

… the comma at the start might not be a problem but it’s not what I personally would like to see when looking at a record’s aliases.

If you know that you’ll only be using the regex result as aliases (i.e. there are no aliases before you use the Smart Rule) you can of course omit the Aliases placeholder - if not then just use the script.

This RE starts at the beginning of the line (^), captures all non-commas ([^,]+), matches a comma plus an arbitrary number of spaces (possibly none), and then captures everything up to the opening parenthesis ([^(]*). I know, it looks as if a drunken monkey fell on the keyboard.

If the RE matches, it will “save” two capture groups: the first one for everything up to but excluding the comma, the second one everything after the comma, excluding optional space(s). These capture groups can then be accessed as \1 and \2 (or $1 and $2, depending on the RE dialect).
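Following that description, the RE would look like `^([^,]+), *([^(]*)` — a small JavaScript sketch of the two capture groups (the sample line is illustrative):

```javascript
// Group 1: everything before the comma (the nominative).
// Group 2: everything after the comma and optional spaces, up to the
// opening parenthesis of the gender marker (the genitive).
const line = "influentia, influentiae (f)";
const m = line.match(/^([^,]+), *([^(]*)/);
console.log(m[1]); // → "influentia"
console.log(m[2]); // → "influentiae " (trailing space, since [^(]* is greedy)
```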

As to the OP’s question:

I don’t see why this is necessary as long as the “second line” is always of the form described before: nominative, genitive (gender marker). As long as neither the third nor the first line contain two words separated by a comma and possibly a space, the RE will only match the 2nd line anyway.

What’s wrong with

? It does match the second line - the problem is not regex matching but adding aliases if one doesn’t know whether aliases exist or not. Or am I missing some advantage of your approach?

I didn’t really think about the aliases, only about the “how do I match the 2nd line” thingy. In fact, I wrote my post before I saw yours.

As to your regex:

  • Matching for a \n is ambitious :wink: Many RE engines work only with a single line unless told otherwise. I don’t know about DT in this context (which apparently uses the /i modifier without saying so explicitly). Instead, you could go for ^, which is always matched. With the (f) (see below) at the end of the line, you’re on the safe side, I’d say.
  • The (f) at the end is ok in this case. Given that this is about Latin, I’d go for (?=\(([fmn])\)). But that’s not really important.

The comma thingy in Aliases … yes. One would need scripting for that. In JavaScript, I’d do something like
aliases = [...oldAliases.split(','), ...match.split(',')].join(',');
Ah, the joys of arrays and string operators :wink:
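Spelled out as a runnable sketch, with the empty-aliases case from earlier in the thread handled so no leading comma appears (the function name and samples are illustrative, not DEVONthink API):

```javascript
// Merge existing aliases with a regex result into one comma-separated
// string, dropping empty fragments so "" old aliases don't produce ", x".
function mergeAliases(oldAliases, matchResult) {
  const parts = [
    ...oldAliases.split(",").map(s => s.trim()).filter(s => s !== ""),
    ...matchResult.split(",").map(s => s.trim()).filter(s => s !== "")
  ];
  return parts.join(", ");
}

console.log(mergeAliases("MyOldAlias", "influentia, influentiae"));
// → "MyOldAlias, influentia, influentiae"
console.log(mergeAliases("", "influentia, influentiae"));
// → "influentia, influentiae"
```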

I find it an interesting choice that DT implements a list of aliases as a string separated by commas/semicolons. Whereas tags, cells and columns are proper lists.

Thanks to you both! I’m responding late, because I wanted to follow along and it took me a while to get a handle on this Regex syntax. All of your suggestions are really helpful.

I was actually able to complete all of my nouns, thanks to the script @pete31 supplied (thanks again!), but I was looking to come up with a similar smart rule for different formats of words (i.e. adjectives, pronouns, etc.). I don’t think I have it in me to come up with a script to do the job, but I just might be able to come up with a regex myself, building from what you’ve suggested here. I’ll probably post again if I still need help.

Thanks again!


You don’t show an example with these different options. That would be helpful.

Pete, I wish I could see all your posts! You have so many helpful scripts. :slight_smile:

Use ‘script @pete31 after:2019-01-01 in:title’ to find script threads I created, i.e. scripts that are well tested.

Use ‘tell application @pete31 after:2019-01-01’ to find all AppleScripts I posted. Also tested but among these are questions and things that didn’t work.

Yep :grin:

I’m just cleaning up the next document right now. When it’s ready, I’ll post back here again.