AppleScript to Extract Domain for Custom Metadata in DEVONthink 4

Hello there,

I’m currently trying out DEVONthink 4 and I’m looking for an AppleScript that can remove the http/https/www parts and subdomains from a URL, then save just the domain as custom metadata when I add a new bookmark.

My goal is to sort bookmarks alphabetically by domain.

Any help would be appreciated. Thanks!

Welcome @Bahadir
This is fairly easily done with a smart rule or a batch process…

… and a little vanilla AppleScript…

on performSmartRule(theRecords)
	tell application id "DNtp"
		set od to AppleScript's text item delimiters -- Cache the delimiter
		set AppleScript's text item delimiters to {"https://", "http://", "www.", "/"} -- Set the URL delimiters
		repeat with theRecord in theRecords
			set recordURL to (URL of theRecord as string) -- Get the URL
			if recordURL contains "://" then -- Check for a protocol
				set urlParts to text items of recordURL
				if (item 2 of urlParts) is "" then -- the URL contained "www."
					set theDomain to (item 3 of urlParts)
				else -- the URL didn't contain "www."
					set theDomain to (item 2 of urlParts)
				end if
			else -- If no protocol exists, e.g., the URL is just "somewhere.com"
				set theDomain to text item 1 of recordURL -- but it also contains more info like "/objects", grab only the first part
			end if
			add custom meta data theDomain for "Origin URL" to theRecord
		end repeat
		set AppleScript's text item delimiters to od
	end tell
end performSmartRule

And vanilla in the sense it’s pure AppleScript, as we like to offer people.

And in the end…

It is working as intended, thank you very much.

Just for the fun of it and since the task was screaming “Regular Expression”: an implementation in JavaScript.

function performsmartrule(records) {
  /* Filter out all records without URL and with mailto URL */
  records.filter(r => r.url() !== '' && !/^mailto:/.test(r.url())).forEach(r => {
    
    /* Get the URL's host part, i.e. everything following '://' and ending before '/' */
    const host = r.url().match(/^(?:.*:\/\/)?([\w.]+)/)[1];

    /* Get the domain: at least one word character followed by a dot
       followed by at least one word character anchored at the end of the host name */
    const domain = host.match(/.*?(\w+\.\w+)$/)[1];
    app.addCustomMetadata(domain, {for: "OriginURL", to: r});
  })
}

I find that a tad more concise :wink:
Here’s what the regular expressions do:

  • r.url().match(/^(?:.*:\/\/)?([\w.]+)/): Match any number of characters (.*) at the beginning of the string (^), followed by a colon and two slashes (:\/\/). That part is the protocol (http:// etc.). Since it is optional, it is packed into a non-capturing group (?:...) that can occur at most once (?). All word characters and dots after that ([\w.]+) go into a capturing group. The [1] following the match call grabs this capturing group. It is the host name without a possible port number.
  • host.match(/.*?(\w+\.\w+)$/): This match looks for at least one word character (\w+) followed by a dot (\.) and at least one word character (\w+), anchored at the end of the string ($). \w+\.\w+ is stuffed into a capturing group. It contains the domain name and is accessed with [1] after the match call.
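
To try the two expressions outside DEVONthink, here is a minimal standalone sketch that applies them to plain strings. The function name extractDomain and the null guards are my additions for illustration, not part of the smart rule script above; the regular expressions themselves are unchanged.

```javascript
// Standalone sketch: the same two regular expressions on plain strings.
// extractDomain is a hypothetical helper name, not a DEVONthink API.
function extractDomain(url) {
  // Skip empty strings and mailto: URLs, as the smart rule script does.
  if (url === '' || /^mailto:/.test(url)) return null;

  // Optional protocol, then the run of word characters and dots (the host).
  const hostMatch = url.match(/^(?:.*:\/\/)?([\w.]+)/);
  if (!hostMatch) return null;

  // The last "word.word" pair of the host is taken as the domain.
  const domainMatch = hostMatch[1].match(/.*?(\w+\.\w+)$/);
  return domainMatch ? domainMatch[1] : null;
}

console.log(extractDomain('https://sub.domain.org/'));         // domain.org
console.log(extractDomain('www.example.com'));                 // example.com
console.log(extractDomain('http://example.com?query-string')); // example.com
console.log(extractDomain('https://example.org:81/'));         // example.org
```

The guards make a failed match return null instead of throwing, which the bare [1] access would do. That matters for hosts containing a hyphen, since \w does not match "-".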

IMO, the AppleScript code has some shortcomings (one of them being its length ;-). Running it on

{"https://sub.domain.org/", 
"www.example.com", 
"http://example.com?query-string", 
"https://example.org:81/",
"mailto:booking@information.lufthansa.com?subject=Re:%20Vielen%20Dank%20f"}

in Script Editor gives me

(*sub.domain.org*)
(**)
(*example.com?query-string*)
(*example.org:81*)
(*mailto:booking@information.lufthansa.com?subject=Re:%20Vielen%20Dank%20f*)

I.e., a subdomain is not removed, a URL without a protocol but with www is completely ignored, a query string is not removed, nor is a port. And a mailto URL is not dealt with at all. The JS code above gives me this output for these URLs:

domain.org
example.com
example.com
example.org

As it should, because the script skips mailto links completely. Perhaps the OP could advise what to do with them?

Also, both scripts fail with domains like sub.example.co.uk: AS returns it unchanged, JXA returns co.uk.
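
One possible way around the co.uk problem, as a rough sketch only: check the last two labels of the host against a list of known two-label suffixes and keep three labels when they match. The function name registrableDomain and the tiny hard-coded suffix set are assumptions for illustration; anything robust would need the full Public Suffix List rather than this handful of entries.

```javascript
// Hypothetical, deliberately tiny list of two-label suffixes.
// A real solution would consult the complete Public Suffix List.
const TWO_LABEL_SUFFIXES = new Set(['co.uk', 'org.uk', 'com.au', 'co.jp']);

// Takes a bare host name (no protocol, port or path) and returns
// the domain including a two-label suffix where one is recognised.
function registrableDomain(host) {
  const labels = host.split('.');
  if (labels.length < 2) return null; // e.g. "localhost"

  const lastTwo = labels.slice(-2).join('.');
  // Keep three labels if the last two form a known suffix.
  if (labels.length >= 3 && TWO_LABEL_SUFFIXES.has(lastTwo)) {
    return labels.slice(-3).join('.');
  }
  return lastTwo;
}

console.log(registrableDomain('sub.example.co.uk')); // example.co.uk
console.log(registrableDomain('sub.domain.org'));    // domain.org
```

The host passed in would be the first capturing group from the JS script above, so the protocol, port and query string are already gone at that point.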
I also asked Qwen to write the JS code. Even in the third iteration, it failed miserably because it tried to use the URL class, which is only available in some environments (notably browsers and Node.js). Here’s what it said: "So the earlier suggestion is indeed invalid in practice."
I just love it.
