AppleScript: How to query HTML?

pete31 · March 12, 2023, 8:26pm

Hi, first off I have no idea of HTML.

Currently to extract something from HTML I use regex, but afaik that’s not the correct way to do it.

What’s the best way to query HTML via AppleScript(Objective-C)?
Is it xPath?
Are there apps that help with learning to query HTML?
I do have Pathology: A Mac-native XPath Debugger and Visualizer, but either I’m doing something wrong or it can’t be used with HTML.

Again, no idea of HTML

Tagging @chrillek and @winter as I saw them in a thread about HTML.

DTLow · March 12, 2023, 8:47pm

Can you provide specific examples of what you’re trying to accomplish?
My default note format is formatted note (html)
and I often query and extract

chrillek · March 12, 2023, 9:12pm

In my opinion, xpath is like getting your teeth pulled out by a drunken barber. I’d rather write Assembler code than this mess. YMMV, of course.

On a more constructive note: I’d try to load the html in a WKWebView and then use JavaScript’s DOM methods to query it. Which is The Right Way(™)

What are you looking for? If we’re talking about DT, you can load the record in a viewer window and then run JavaScript there.

Regular expressions can sometimes help, but there are cases when they fail.

cgrunenberg · March 13, 2023, 10:55am

For certain common tasks (e.g. retrieving links or images) the script suite includes dedicated commands.

chrillek · March 13, 2023, 1:23pm

My initial idea to use WKWebView might work, but I can’t get it to do so with JavaScript/JXA (lack of support for code blocks). You might have more luck with AppleScript/ObjC.

If we’re talking about querying HTML from inside DT for a record stored in DT, this works:

(() => {
  const app = Application('DEVONthink 3');
  const rec = app.getRecordWithUuid('2F437490-F4EC-4E38-8747-8F4FCC073F86');
  const thinkWindow = app.openWindowFor({record:rec});
  const result = app.doJavaScript(`var headings = document.querySelectorAll('h1,h2,h3');JSON.stringify([...headings].map(h => h.innerText));`, {in: thinkWindow});
  console.log(result);
})()

As I said before, you can run JS code in a think window displaying an HTML record. The JavaScript code proper is this:

var headings = document.querySelectorAll('h1,h2,h3');
JSON.stringify([...headings].map(h => h.innerText));

It uses the DOM method querySelectorAll to find all headings of level 1, 2, and 3. The resulting value is not an Array but a NodeList, which [...headings] converts to an Array. Then map extracts the innerText property from each HTML element, which is just the text of the heading. The return value of map is again an Array. Since doJavaScript wants to return a string, JSON.stringify converts this Array into a string. In your calling JavaScript code, you can simply use const myArray = JSON.parse(result) to turn it back into an Array.
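The round trip described above can be tried outside DEVONthink. This is only a sketch: fakeNodeList and headingsToJson are made-up names, and a Set of plain objects stands in for the NodeList that querySelectorAll would return in a real page.

```javascript
// Hypothetical stand-in for a NodeList: any iterable of objects
// with an innerText property behaves the same in this pipeline.
const fakeNodeList = new Set([{ innerText: 'Intro' }, { innerText: 'Setup' }]);

// Same steps as in the post: spread to an Array, map to the text, stringify.
function headingsToJson(nodeList) {
  return JSON.stringify([...nodeList].map(h => h.innerText));
}

const result = headingsToJson(fakeNodeList); // what doJavaScript would hand back
const myArray = JSON.parse(result);          // back to an Array in the calling script
console.log(result);
console.log(myArray);
```

In a real script only the JSON.parse step runs on the calling side; the rest runs inside the page.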

querySelectorAll expects a CSS selector as its first parameter. That makes it fairly convenient (at least a lot easier to use than the wretched XPath grammar).

BLUEFROG · March 13, 2023, 2:22pm

What are you actually trying to accomplish?

pete31 · March 13, 2023, 11:51pm

Thanks everyone! Sorry I didn’t make it clear: there’s no specific task to solve (and if a browser would be necessary I would prefer Safari). I would just like to learn how to query HTML “the correct way”, as every now and then it would have been useful to know how to query HTML (instead of fiddling around with different apps, e.g. TextSoap). Extracting from HTML is something I currently avoid whenever possible (as I never got a grip on it, and actually never really tried) - but I know that it would be very useful to know how to do it.

If possible without learning JavaScript - simply because I can imagine that I would get confused quite often if I started to use AppleScript and JavaScript. It would probably be easier for me to learn xPath (or something else) than trying to fiddle with two scripting languages …

I couldn’t find a better example, so here’s a task that I already somehow solved via regex:

Given this URL and a list of e.g. {"Tagesthemen", "maischberger"}

How would one

test whether an item from the list appears on this page
if it appears: grab the start time

BLUEFROG · March 14, 2023, 2:51am

IMHO there isn’t a correct way to query HTML without using alternate languages. Also, I would not settle on RegEx for it. And again, IMHO you’re looking at one-off solutions here, not some universal snippet that will apply to all (or even a large minority of) sites.

If you’re going to do scraping, outside of some basic things like getting links from the source (which DEVONthink provides a command for), I’d suggest you dig into Python and likely Beautiful Soup.

chrillek · March 14, 2023, 6:53am

XPath as well as the DOM methods of JavaScript are not about strings appearing somewhere in the HTML, but about the structure. To use either, you must know something about this structure.
In your example, the structural correlation between the strings and the start time is not clear (unless one opens the HTML and investigates). These strings are simply the contents of HTML elements, and they could appear in principle anywhere. Therefore, using a RegEx might be the best approach.
Why you’d rather learn XPath (which is hardly breathing, afaict) than the proven and most used approach is beyond me.

pete31 · March 14, 2023, 7:12am

…

I can imagine that I would get confused quite often if I started to use AppleScript and JavaScript.

Maybe it’s time to tackle JavaScript, but I really don’t want to deal with another scripting language (I’m even confusing AppleScript and Tinderbox’s Action Code sometimes, really not keen to add more)

pete31 · March 14, 2023, 7:22am

Again, no idea of querying HTML, but the (example) task is pretty clear (I think).

chrillek · March 14, 2023, 8:14am

That’s not as simple as it seems. While XPath has an expression for that, there doesn’t seem to be a simple DOM approach (javascript - getElementsByTagName() equivalent for textNodes - Stack Overflow).
And what about multiple appearances of the text – which one are you looking for?

How? Where? Does it precede or follow the text? Is it part of the same text node or somewhere else entirely?

Frankly: a sufficiently tolerant regular expression might be your best choice here, (\d\d[.:-]\d\d.*item)|(item.*\d\d[.:-]\d\d)
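As a rough sketch of how that pattern could be filled in for a concrete item: findStartTime and escapeRegExp are hypothetical helper names, and the code assumes that the time and the item sit on the same line of text, since the dot does not match line breaks.

```javascript
// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Build the tolerant pattern for one item and return the time found
// before or after it on the same line, or null if there is no match.
function findStartTime(line, item) {
  const name = escapeRegExp(item);
  const re = new RegExp(
    `(\\d\\d[.:-]\\d\\d).*${name}|${name}.*?(\\d\\d[.:-]\\d\\d)`, 'i');
  const match = re.exec(line);
  return match ? (match[1] || match[2]) : null;
}

console.log(findStartTime('22:50 maischberger', 'maischberger'));
console.log(findStartTime('Tagesthemen 22:15', 'tagesthemen'));
```

Whether this is tolerant enough depends entirely on how the page lays out times and titles, so treat it as a starting point.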

Alternatively, in JavaScript you could do something like this (in the browser!):

const itemlist = ['Maischberger','Tagesthemen'];
const RE = new RegExp(itemlist.join('|'), 'i');
const itemNodes = [...document.querySelectorAll('span.title')].filter(element => RE.test(element.innerText));
itemNodes.forEach(item => {
  const timeNode = item.parentNode.querySelector('span.date');
  const time = timeNode.innerText;
})

This code

builds a Regular Expression from the strings in itemlist (Maischberger|Tagesthemen), ignoring capitalization ('i');
gets all span elements with a class of title from the document with querySelectorAll
converts this NodeList into a JavaScript Array with [...]
filters the array for those nodes whose innerText match the Regular Expression
thereby creating an array of nodes whose content matches one of your strings (itemNodes)
it then takes each of these nodes and finds the first span with class date with the same parent
and finally extracts the time from it

So it is feasible, it’s not too much code, but it is in JavaScript, and it needs the browser to run (no DOM methods in JXA). Or perhaps node.js, with the appropriate modules installed.

And that’s off-topic here, too. You know how to search Apple’s documentation, and they have a whole bunch of XML objects out there. Initialize an NSXMLDocument with an HTML document, get its rootElement, and throw an XPath at it with rootElement.nodesForXPath('//span[contains(text(), "maischberger")]', error). That’ll give you an NSArray of NSXMLNodes. For each of them, find the preceding-sibling with class date, extract its content, and you’re done.
You have to repeat that with each item, or you simply grab all spans with class title in an NSArray first and then filter that one for your items (which seems a more practical approach to me) using indexOfObjectPassingTest with a code block (possible with AS? I have no idea). Edit: An XPath expression with the appropriate match condition (see post here) would fetch all items matching one of your strings.

Yes, it can be done. And it can be done in AS. You have to learn XPath, instead of using CSS selectors. And you have to type a lot more. Suit yourself.

pete31 · March 14, 2023, 8:58am

Chrillek! I didn’t ask for a JavaScript solution. Come on, you’re telling me “while xPath has an expression for that” - I literally mentioned “xPath” in my initial question. Also mentioned that I don’t want to learn JavaScript (if there’s any way to do it via AppleScript). Did you actually read what I wrote or are you just posting?

The question was: how to query HTML via AppleScript(Obj-C).

pete31 · March 14, 2023, 9:17am

If anyone knows an answer to my initial question. Please let me know.

chrillek · March 14, 2023, 9:25am

I amended my previous post to spell it out in NSXMLDocument terms. The approach is really not that much different with XPath than with DOM methods – you just need to use a path instead of selectors. And you do not have regular expressions in XPath, which makes looking for several items a bit more complicated.
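One way to look for several items with a single path, without relying on regular expressions, is to join one contains() test per item with "or". A minimal sketch follows; xpathForItems is a made-up helper, and it assumes the items contain no double quotes.

```javascript
// Hypothetical helper: builds one XPath expression that matches any span
// whose text contains at least one of the given items.
function xpathForItems(items) {
  const tests = items.map(item => `contains(text(), "${item}")`);
  return `//span[${tests.join(' or ')}]`;
}

console.log(xpathForItems(['maischberger', 'Tagesthemen']));
```

The resulting string can then be handed to nodesForXPath as it is.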

Well, there might be other people interested in the question. And there’s no obligation to limit a response to exactly the terms the OP used. I don’t see a problem with showing how it could be done easily.

pete31 · March 14, 2023, 9:32am

Much appreciated! That’s something I can start from. Kudos!

chrillek · March 14, 2023, 9:38am

But be aware that HTML is not XML. You might well run into problems with creating an XML document from an HTML URL. I just fiddled around with Pathology and a perfectly fine HTML document that was not well-formed XML. The same could happen with any HTML.

pete31 · March 14, 2023, 9:51am

so …

… wasn’t my fault?! Anyway, thanks chrillek

chrillek · March 14, 2023, 11:20am

So, this is how it can be done in Another Language™. Since most of the stuff is ObjC anyway, you’ll get the meaning. Or head over to

and follow their lead.

// Load the page; NSXMLDocumentTidyHTML asks the parser to clean up the HTML first
const URL = $.NSURL.URLWithString($('https://programm.ard.de/TV/Programm/Sender?sender=28106&datum=14.03.2023'));
const error = Ref();
const XMLDoc = $.NSXMLDocument.alloc.initWithContentsOfURLOptionsError(URL, 
  $.NSXMLDocumentTidyHTML, error);
const root = XMLDoc.rootElement;
// Find all spans whose text matches one of the items
const itemNodes = root.nodesForXPathError('//span[matches(text(), "maischberger|Tagesthemen")]', error);
const result = [];
itemNodes.js.forEach(item => {
  // The start time sits in a preceding sibling span whose class contains "date"
  const timeNode = item.nodesForXPathError('preceding-sibling::span[contains(@class, "date")]', error);
  result.push({name: item.stringValue.js, time: timeNode.js[0].stringValue.js})
})
console.log(JSON.stringify(result));

Output

[
{"name":"Tagesthemen mit Wetter","time":"22:15 x"},
{"name":"maischberger","time":"22:50 x"},
{"name":"maischberger","time":"01:50 x"}
]

Yes, there is this funny “x” following the time. In the HTML, it is only a comment () which is probably there because … well, who knows. It’s completely pointless, but it is part of the node’s value, apparently. Easily removed, but wouldn’t cause any trouble if you were using the DOM methods.
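Removing the stray "x" could look like this. It is a sketch only: cleanTime is a hypothetical helper, and it assumes the time always has the form of two digits, a colon or dot, and two digits.

```javascript
// Hypothetical helper: keep only the hh:mm part of a value like "22:15 x".
// If no time is found, the trimmed input is returned unchanged.
function cleanTime(raw) {
  const match = /\d\d[.:]\d\d/.exec(raw);
  return match ? match[0] : raw.trim();
}

console.log(cleanTime('22:15 x'));
```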

The flag NSXMLDocumentTidyHTML is supposed to convert the HTML to XHTML, cleaning it up in the process. It might ensure that you can parse HTML as XML, but perhaps not in all cases. Also, XPath does support regular expressions, so I stand corrected on that point and will edit my previous post accordingly.

XPath also supports the preceding-sibling:: axis, which has no counterpart in CSS selectors. The test that class contains date is a bit lax, because spans with a class of, say, datetime would be matched as well.
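A stricter class test is possible with a common XPath idiom: pad the class attribute with spaces and look for the padded name. The sketch below only builds the expression as a string; classTest is a made-up helper name, and I have not run the result against the page above.

```javascript
// Hypothetical helper: XPath test that matches a whole class name only,
// so "date" matches class="date" or class="date bold" but not class="datetime".
function classTest(className) {
  return `contains(concat(" ", normalize-space(@class), " "), " ${className} ")`;
}

const xpath = `preceding-sibling::span[${classTest('date')}]`;
console.log(xpath);
```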