AppleScript: How to query HTML?

chrillek · March 14, 2023, 11:20am

So, this is how it can be done in Another Language™. Since most of the stuff is ObjC anyway, you’ll get the meaning. Or head over to

and follow their lead.

// Load the page; NSXMLDocumentTidyHTML asks for the HTML to be tidied into XHTML first.
const URL = $.NSURL.URLWithString($('https://programm.ard.de/TV/Programm/Sender?sender=28106&datum=14.03.2023'));
const error = Ref();
const XMLDoc = $.NSXMLDocument.alloc.initWithContentsOfURLOptionsError(URL,
  $.NSXMLDocumentTidyHTML, error);
const root = XMLDoc.rootElement;
// Find every span whose text matches one of the two programme names.
const itemNodes = root.nodesForXPathError('//span[matches(text(), "maischberger|Tagesthemen")]', error);
const result = [];
itemNodes.js.forEach(item => {
  // The broadcast time sits in an earlier sibling span whose class contains "date".
  const timeNode = item.nodesForXPathError('preceding-sibling::span[contains(@class, "date")]', error);
  result.push({name: item.stringValue.js, time: timeNode.js[0].stringValue.js});
});
console.log(JSON.stringify(result));

Output

[
{"name":"Tagesthemen mit Wetter","time":"22:15 x"},
{"name":"maischberger","time":"22:50 x"},
{"name":"maischberger","time":"01:50 x"}
]

Yes, there is this funny “x” following the time. In the HTML, it is only a comment, which is probably there because … well, who knows. It’s completely pointless, but it apparently ends up in the node’s value. It is easily removed, and it wouldn’t cause any trouble if you were using the DOM methods.
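Removing the stray marker can be done with a plain string replacement after the fact. This is only a minimal sketch, assuming the artefact is always a lone trailing "x" preceded by whitespace, as in the output above. cleanTime is a made-up helper name, not part of any API.

```javascript
// Hypothetical helper: strip the trailing " x" that leaks into the time value.
// Assumption: the artefact is a single "x" at the very end, preceded by whitespace.
function cleanTime(raw) {
  return raw.replace(/\s+x\s*$/, '').trim();
}

// Sample entries copied from the output above.
const sample = [
  {name: 'Tagesthemen mit Wetter', time: '22:15 x'},
  {name: 'maischberger', time: '22:50 x'},
];
const cleaned = sample.map(item => ({name: item.name, time: cleanTime(item.time)}));
console.log(JSON.stringify(cleaned));
```

In the script itself, the same call would wrap the value when the result is built, for example time: cleanTime(timeNode.js[0].stringValue.js).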

The flag NSXMLDocumentTidyHTML is supposed to convert the HTML to XHTML, cleaning it up in the process. That should let you parse HTML as XML, though perhaps not in all cases. Also, XPath does support regular expressions (through the matches() function, available since XPath 2.0), so I stand corrected on that point and will edit my previous post accordingly.
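To illustrate why "Tagesthemen mit Wetter" is found although the pattern only says "Tagesthemen": matches() is not anchored, so it succeeds if the pattern occurs anywhere in the text. Here is a rough plain-JavaScript analogue, assuming that JavaScript and XPath regular expressions behave alike for such a simple alternation. The title "Tagesschau" is made up for the example.

```javascript
// Plain-JavaScript analogue of matches(text(), "maischberger|Tagesthemen").
// Assumption: for this simple alternation both regex dialects behave the same.
const pattern = /maischberger|Tagesthemen/;

// The first two titles come from the output above; the third is invented.
const titles = ['Tagesthemen mit Wetter', 'maischberger', 'Tagesschau'];

// An unanchored test succeeds when the pattern occurs anywhere in the string.
const matched = titles.filter(title => pattern.test(title));
console.log(JSON.stringify(matched)); // ["Tagesthemen mit Wetter","maischberger"]
```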

XPath also supports the preceding-sibling:: axis, which is missing from CSS. The condition contains(@class, "date") is a bit lax, though: it is a substring test, so a span with a class of datetime would conceivably be matched as well.
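A stricter test would treat the class attribute as a whitespace-separated list of tokens. A common XPath idiom for that pads the normalized attribute with spaces and looks for the padded token. Below is a sketch of it. Both classTokenXPath and hasClassToken are hypothetical helpers, and the second one only mirrors the logic in plain JavaScript so the difference is visible without a document.

```javascript
// Hypothetical helper: build an XPath predicate that matches a whole class token.
function classTokenXPath(token) {
  return `contains(concat(" ", normalize-space(@class), " "), " ${token} ")`;
}

// The same logic in plain JavaScript, for illustration only.
function hasClassToken(classAttr, token) {
  return classAttr.trim().split(/\s+/).includes(token);
}

const strictXPath = `preceding-sibling::span[${classTokenXPath('date')}]`;
console.log(strictXPath);

console.log('datetime'.includes('date'));          // substring test: true
console.log(hasClassToken('datetime', 'date'));    // token test: false
console.log(hasClassToken('big date red', 'date')); // token test: true
```

The string in strictXPath could then replace the second XPath expression in the script above.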