Need help with a smart rule

Hello all!

I’m trying to scrape some data from LinkedIn pages and I’m having difficulty finding the right way to do it.

From Safari I save a URL as a webarchive to DT.
Then I would like to automatically fill in the custom fields I’ve created (Category & Employees).

Is it possible to create a smart rule with a regular expression or script that will achieve this goal? I’m sharing a screenshot.
Any hints or suggestions? Many thanks in advance!

Perhaps. But impossible to say without the complete HTML. The metadata from it is not enough.

Thank you for your prompt reply.
In that case, would I need to save the complete HTML, or would the webarchive be enough?


The web archive is fine; it is the complete HTML and then some.
Since I’m not at my Mac right now (and won’t be for some time), I’ll only outline the approach:

  • open the web archive in DT
  • run a script in DT that executes doJavaScript with a script like this
document.querySelector("<selector>").innerHTML;

Where the selector is a valid CSS selector, e.g. div.org-top-card-summary-info-list__info-item
The return value of doJavaScript() should be the content of the div addressed by the selector.
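
A minimal sketch of the JavaScript side of that outline. The helper name `extractInfoItem` and the null fallback are illustrative assumptions, not anything DEVONthink provides; it only guards against the selector matching nothing, in which case querySelector returns null.

```javascript
// Hypothetical helper: returns the inner HTML of the first element matching
// `selector`, or null when nothing matches.
function extractInfoItem(doc, selector) {
  const el = doc.querySelector(selector);
  return el ? el.innerHTML : null;
}

// In a browser or web view, `document` is the loaded page:
// extractInfoItem(document, "div.org-top-card-summary-info-list__info-item");
```

Without the guard, a page that lacks the element would raise a TypeError instead of returning an empty result.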

This approach is already something I’m exploring but it requires a content record, i.e., you’re looking at the document. Therefore, this really isn’t suitable for a smart rule.

And @Pompano, does it actually need to be a smart rule? DEVONthink isn’t built as a web scraper.
PS: You included the geographic location in your magenta underline. I’d guess that’s not intended to be part of the Category. So is it actually important to capture or not?

True. One could alternatively read the HTML and use a regex, but that’s more error-prone.
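
To illustrate why, here is a rough sketch of the regex route. The function name and the pattern are assumptions made for illustration only; real LinkedIn markup may differ, and the pattern already fails when a matching div contains a nested div or uses single-quoted attributes.

```javascript
// Illustrative only: pulls the text of every div carrying the given class
// out of an HTML string. A nested <div> inside a match would truncate it.
function infoItemsByRegex(html) {
  const pattern =
    /<div[^>]*class="[^"]*org-top-card-summary-info-list__info-item[^"]*"[^>]*>([\s\S]*?)<\/div>/g;
  // Strip any remaining tags from the captured content and trim whitespace.
  return Array.from(html.matchAll(pattern), (m) =>
    m[1].replace(/<[^>]+>/g, "").trim()
  );
}
```

A DOM query has none of these limitations, which is why it is the safer choice whenever the document can be loaded.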

Yeah, scraping HTML with regex is :flushed::grimacing:

@Pompano

tell application id "DNtp"
	-- Only continue when a web archive, bookmark, or HTML document is being viewed.
	if not (exists (content record)) or (type of (content record) is not in {webarchive, bookmark, html}) then return
	set sel to content record
	set od to AppleScript's text item delimiters
	
	-- Join the items with "   ," so that commas inside a value are not mistaken for separators.
	set linkedIn to do JavaScript "var a=[];const cInfo=document.getElementsByClassName('org-top-card-summary-info-list__info-item'); for(const el of cInfo){a.push(el.innerText)};a.join('   ,');" in think window 1
	
	set AppleScript's text item delimiters to "   ,"
	set {category, followers, employees} to {text item 1, text item -2, text item -1} of linkedIn
	add custom meta data category for "Category" to sel
	add custom meta data employees for "Employees" to sel
	-- Followers was added but not used in this example.
	-- Restore the original delimiters.
	set AppleScript's text item delimiters to od
end tell

…and running this on another selected webarchive from LinkedIn…

:slight_smile:
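
For readers less familiar with AppleScript’s text item delimiters, the splitting step in the script above can be sketched in JavaScript. The function name and the sample string are made up for illustration; the point is that splitting on three spaces plus a comma leaves commas inside a value alone.

```javascript
// Mirrors `text item 1`, `text item -2`, and `text item -1` from the script.
function splitInfo(joined) {
  const items = joined.split("   ,").map((s) => s.trim());
  return {
    category: items[0],
    followers: items[items.length - 2],
    employees: items[items.length - 1],
  };
}

// Hypothetical input; the comma inside "1,234" is not treated as a separator.
const info = splitInfo(
  "Software Development   ,Berlin   ,1,234 followers   ,51-200 employees   "
);
```

Counting the last two items from the end is what lets the script ignore the optional location entry in the middle.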

document.getElementsByClassName(...).forEach(el => el.innerText).toString();

is a bit more JavaScript’ish :wink: and shorter.

I was waiting for your revision :wink:
But since getElementsByClassName() returns an HTMLCollection, not an array, this doesn’t work.

You’re right, I was thinking of NodeList. But Array.from(document.getElementsByClassName("…")).map(el => el.innerText) should do the trick (map rather than forEach, since forEach returns undefined) and still avoid the additional array and the push.

Aside: NodeList (as returned by querySelectorAll()) does have a forEach method now, but none of the other array methods like map.
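
A short sketch of the conversion discussed here, with a hypothetical helper name. Array.from accepts any array-like or iterable value, so the same function should work for an HTMLCollection and a NodeList alike; its optional second argument applies a mapping function during the conversion.

```javascript
// Converts an array-like collection of elements into an array of their text.
function collectInnerText(collection) {
  return Array.from(collection, (el) => el.innerText);
}

// In a page:
// collectInnerText(document.getElementsByClassName("…")).toString();
```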

Absolutely amazing, thank you so so much for your kind help!!

I really appreciate your help!

You’re welcome :slight_smile: