I'm trying to scrape some data from LinkedIn pages and I'm having difficulty finding the right way to do it.
From Safari I save a URL as a web archive to DT.
Then I would like to automatically fill in the custom fields I've created (Category & Employees).
Is it possible to create a smart rule with a regular expression or script that would achieve this? I'm sharing a screenshot.
Any hints or suggestions? Many thanks in advance!
The web archive is fine; it contains the complete HTML and then some.
Since I’m not at my Mac right now (and won’t be for some time) I’ll only outline the approach:
open the web archive in DT
run a script in DT that executes doJavaScript with a script like this:
document.querySelector("<selector>").innerHTML;
where <selector> is a valid CSS selector, e.g. div.org-top-card-summary-info-list__info-item
The return value of doJavaScript() should be the content of the div addressed by the selector.
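A minimal sketch of the JavaScript side, assuming the class name above still matches LinkedIn's markup. The helper name `textOfFirstMatch` is hypothetical, and `root` is passed in only so the function can be tried outside a browser; inside the web archive it would be `document`.

```javascript
// Hypothetical helper: return the trimmed innerHTML of the first element
// matching `selector`, or an empty string when nothing matches.
function textOfFirstMatch(root, selector) {
  const el = root.querySelector(selector);
  // querySelector returns null when no element matches, so guard before reading.
  return el ? el.innerHTML.trim() : "";
}

// In the web archive this would be called as:
// textOfFirstMatch(document, "div.org-top-card-summary-info-list__info-item");
```

The null check matters because LinkedIn may change its class names, in which case reading `innerHTML` directly would throw instead of returning an empty result.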
This approach is already something I’m exploring but it requires a content record, i.e., you’re looking at the document. Therefore, this really isn’t suitable for a smart rule.
And @Pompano, does it actually need to be a smart rule? DEVONthink isn’t built as a web scraper.
PS: You included the geographic location in your magenta underline. I’d guess that’s not intended to be part of the Category. So is it actually important to capture or not?
tell application id "DNtp"
    -- Only run when a web archive, bookmark, or HTML document is being displayed.
    if not (exists (content record)) or (type of (content record) is not in {webarchive, bookmark, html}) then return
    set sel to content record
    set od to AppleScript's text item delimiters
    -- Collect the text of every info item; each entry gets a trailing space and toString() joins the entries with commas.
    set linkedIn to do JavaScript "var a=[];const cInfo=document.getElementsByClassName('org-top-card-summary-info-list__info-item'); for(const el of cInfo){a.push(el.innerText + ' ')};a.toString();" in think window 1
    -- Split on " ," (the trailing space plus the joining comma) to recover the individual entries.
    set AppleScript's text item delimiters to " ,"
    set {category, followers, employees} to {text item 1, text item -2, text item -1} of linkedIn
    add custom meta data category for "Category" to sel
    add custom meta data employees for "Employees" to sel
    -- Followers was added but not used in this example.
    set AppleScript's text item delimiters to od
end tell
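The index-based picking could also be done on the JavaScript side. This is only a sketch: `pickFields` is a hypothetical name, and it assumes, as the script above does, that the first info item is the category and the last two are followers and employees.

```javascript
// Hypothetical helper mirroring the AppleScript's text item 1, -2 and -1.
// `items` is an array of strings, e.g. the innerText of each info item.
function pickFields(items) {
  const cleaned = items.map((s) => s.trim());
  // Assumption: at least three items are needed for the indices to be distinct.
  if (cleaned.length < 3) return null;
  return {
    category: cleaned[0],
    followers: cleaned[cleaned.length - 2],
    employees: cleaned[cleaned.length - 1],
  };
}
```

Since the AppleScript above treats the result of do JavaScript as text, the object would probably need to be serialized (for example with `JSON.stringify`) before being handed back.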
…and running this on another selected webarchive from LinkedIn…
You’re right, I was thinking of NodeList. But Array.from(document.getElementsByClassName("…")).forEach... should do the trick and still avoid the additional array and the push.
Aside: NodeList (as returned by querySelectorAll()) does have a forEach method now, but none of the other array methods like map.
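A small sketch of the Array.from idea, with a hypothetical helper name `collectText`. Array.from accepts any array-like or iterable, so the same function should work for an HTMLCollection and a NodeList alike.

```javascript
// Hypothetical helper: `collection` is any array-like or iterable, e.g. the
// HTMLCollection returned by document.getElementsByClassName(...).
function collectText(collection) {
  // Array.from takes an optional mapping function as its second argument,
  // so no intermediate array or push() is needed.
  return Array.from(collection, (el) => el.innerText.trim());
}
```

The result is a real array, so map, filter and join are all available on it afterwards.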