Script help - Set website og:image to thumbnail of a bookmark record

Hi, I’m not a developer, and I need a script to set the og:image metatag of an article to the bookmark thumbnail in Devonthink.

My try:

function performsmartrule(records) {
	var app = Application("DEVONthink 3");
	app.includeStandardAdditions = true;

	records.forEach (r => {
		var url = r.url()
		var markup = app.downloadMarkupFrom(url);
		var thumb = markup.querySelectorAll("meta[property='og:image']")[0].content;
		r.thumbnail = thumb
	})
}

I appreciate any help. I want to view the bookmarks like cards in the Icons View mode (not page screenshot).

  • What URL are you trying to run this on?
  • Does it have the og:image property?

For example: New HomePod Mini Colors Now Available to Order and for Apple Store Pickup - MacRumors

In HTML source code I need to get the content of meta tag og:image:

<meta property="og:image" content="https://images.macrumors.com/t/OgNqWcRkwT45SHXera2qosdj1qs=/1600x/https://images.macrumors.com/article-new/2021/11/homepod-mini-color-bars.jpg" />

That will not work since querySelector etc are DOM methods. You have no DOM, because the HTML is not loaded in a browser. All downloadMarkupFrom gives you is text (as clearly stated in the documentation). Also, querySelectorAll would return a list of DOM nodes (were it working at all in this context), not a single item.

Use a regular expression to get the data instead. Or rather use several of them. First, you have to get the “og:image” meta element like so
<meta.*?property="og:image".*?>
and then fish out the content with
content="(.*?)"

These two steps are necessary because a meta element could contain the attributes in any sequence, so you can’t rely on property coming before content.
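
To see why the attribute order matters, here is a small sketch in plain JavaScript that can be run outside DEVONthink (in Node, for instance). The two sample tags, the example.com URLs and the `singleStep` expression are made up for illustration, and the sketch assumes double-quoted attributes.

```javascript
// Two hypothetical meta elements: same attributes, different order.
const propertyFirst = '<meta property="og:image" content="https://example.com/a.jpg" />';
const contentFirst = '<meta content="https://example.com/b.jpg" property="og:image" />';

// A single expression that assumes property comes before content.
const singleStep = /<meta[^>]*property="og:image"[^>]*content="(.*?)"/;

const hit = propertyFirst.match(singleStep);  // matches, URL is in hit[1]
const miss = contentFirst.match(singleStep);  // null, because content precedes property

// Two steps: find the element first, then look for content inside it.
const element = contentFirst.match(/<meta[^>]*property="og:image"[^>]*>/);
const content = element ? element[0].match(/content="(.*?)"/) : null;
```

With the two-step approach, `content[1]` holds the URL regardless of where `content` sits inside the tag.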

The image URL will then be in the RegEx variable $1.
However, setting the thumbnail of the record to this URL will probably be a terrible performance hit if you do it for many records. So you might want to consider downloading the image from the URL first and setting the thumbnail to that image.

Also, your headline and text describe the exact opposite of what your script attempts to do: It is setting the thumb from og:image, not the other way round.

Edit: Another possibility would be to let Safari load the original URL (r.url()) and then execute doJavaScript(...) with your querySelector(...) call (note that it is not querySelectorAll!). This will however introduce a lot of visual noise by loading the URLs etc.

Thanks for reply :slightly_smiling_face:

I modified one of your scripts, from Favicon as icon for bookmark/web archive? - #6 by chrillek

I tried many regular expressions, but without success. :sweat:

The first regex in the code gets an image from the record's URL, but it isn't the URL in the og:image content.

Can you help me?

I’m also trying to code in AppleScript, following the example code in Mac Automation Scripting Guide: Parsing HTML

Script:

function performsmartrule(records) {
	var app = Application("DEVONthink 3");
	app.includeStandardAdditions = true;
	

	records.forEach (r => {
	let thumb = "";
    let found = false;
    const URL = r.url();
    const domain = URL.split('/').slice(0,3).join('/');
	
	const HTML = app.downloadMarkupFrom(URL);
    const embImages = app.getEmbeddedImagesOf(HTML, { baseURL: URL});
	if (embImages.length > 0) {
      for (let img of embImages) {
	  found = RegExp('<meta.*?property="og:image".*?>' && 'content="(.*?)".*?').test(HTML);
	  /*found = RegExp('<meta.*?property="og:image".*?>' && '*content="([^"]+)".*\/>').test(HTML);*/
	  /*found = RegExp('<meta.*?property="og:image".*content="(.*?)"').test(HTML);*/
	  /*found = RegExp('<meta.*property="og:image".*content="(.*)".*\/>').test(HTML);*/
	  /*found = RegExp('<meta.*property="og:image(?::url)*".*content="([^"]+)".*\/>').test(HTML);*/
	  /*found = RegExp('<meta [^>]*property=[\"']og:image[\"'] [^>]*content=[\"']([^'^\"]+?)[\"'][^>]*>').test(HTML);*/
	  if (found) {
          thumb = img;
          break; /* exit the loop here */
        }
      }
      if (found && thumb !== "") {
        r.thumbnail = thumb;
      }
    }
  })
}

What is the purpose of this script?

That script was written for a completely different scenario (and unfortunately never worked very well). It actually retrieved the embedded images from an HTML source, but this is not what you intend to do: a meta element is not an embedded image, it is just that, a meta element.

Your new code does not make sense. HTML is the raw HTML from the webpage. getEmbeddedImagesOf gets the images from this HTML. Since a meta element is not an image, an og:image meta element will never be returned by this call. If it were (and it is not), there would be no point at all in going over it with a regular expression, because the method already returns the image URLs.

So, your regular expression has to work on HTML. And as said before, you have to use two regular expressions: one to get the meta element with the og:image property, and a second one that fishes the content URL out of the first one's match.

Now, you cannot string two regular expressions together with && in the call to RegExp. That simply makes no sense at all. && is a logical AND, and you're using it on two strings. What should be the outcome of ANDing two strings? Since both are truthy, the result is simply the second string, so RegExp only ever sees the content pattern and never looks for og:image at all. Also, test will only tell you if there was a match, but not where to find it. In this case you need the matching string, so test is useless.
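
A quick check in plain JavaScript shows what && does with the two pattern strings. This is only a sketch: the sample `html` string is invented, and it deliberately contains no og:image element at all.

```javascript
// && evaluates to its right-hand operand when the left-hand one is truthy.
const combined = '<meta.*?property="og:image".*?>' && 'content="(.*?)".*?';
// combined is now just the second string; the og:image part is gone.

const re = RegExp(combined);
const html = '<meta name="description" content="Not an image">';

const found = re.test(html);   // true, although there is no og:image in html
const match = html.match(re);  // match[1] holds whatever content="..." came first
```

So `found` only says that some `content` attribute exists somewhere in the HTML, which is true for nearly every web page.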

The basic procedure is this:

  • build two regular expressions re1 = /<meta.*?property="og:image".*?>/ and re2 = /content="(.*?)"/ (the /.../ literal is the equivalent of calling RegExp);
  • run the first one on HTML like so: const result1 = HTML.match(re1);
  • check that result1 is not null (null means there was no match, so no og:image in the HTML) and if so, continue with
  • running the second regular expression like so: const result2 = result1[0].match(re2);
  • if result2 is not null, result2[1] contains the URL of the og:image
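
The steps above can be sketched as a small standalone function. Treat it as an illustration under a few assumptions: `ogImageFrom` and `sampleHTML` are hypothetical names, attributes are double-quoted, and the first expression uses `[^>]*` rather than `.*?` so that a match cannot run across several tags (a variation, not the expression from the list).

```javascript
// re1 finds the meta element; [^>]* keeps the match inside a single tag.
const re1 = /<meta[^>]*property="og:image"[^>]*>/;
// re2 captures the value of the content attribute.
const re2 = /content="(.*?)"/;

// Hypothetical helper: returns the og:image URL, or null if there is none.
function ogImageFrom(html) {
  const result1 = html.match(re1);
  if (result1 === null) return null;      // no og:image meta element
  const result2 = result1[0].match(re2);  // search only inside that element
  return result2 === null ? null : result2[1];
}

const sampleHTML = '<head><meta charset="utf-8">' +
  '<meta property="og:image" content="https://example.com/card.jpg" /></head>';
const imageURL = ogImageFrom(sampleHTML);
```

Inside DEVONthink, the string passed to the function would be whatever `app.downloadMarkupFrom(...)` returned for the record's URL.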

I also suggest writing the script as a standalone script (i.e. outside of a smart rule). Put it in Script Editor, turn on its tracing features, select a record in DT and then start the script.

If you feel so inclined, you can of course try to parse the HTML data with AppleScript. This, however, is not for the faint of heart (as HTML parsing in general is not), and I'd strongly advise against it. Also, the example you linked to parses an HTML file, which you do not have here. It works with opening/closing tags, which meta elements do not have. And the example will probably also break with nested elements, like <div>...<div>...</div>...</div>. It's bad in many respects, and in my opinion you're better off ignoring it. But as I said before, you do not even need a parser here, because two regular expressions are sufficient.

I have bookmarks in Raindrop.io, I would like to import them into Devonthink, but for Devonthink the thumbnails are screenshots of the web page, not visual cards. I’m trying to reproduce the appearance of View Cards like Raindrop.io. I only need this to manage bookmarks in Devonthink. I think I’m a very visual person :slightly_smiling_face:

Lots of information, thank you very much. As I said before, I have no coding experience, but I am willing to learn. I'll try to write the script again, being careful about the difference between parsing a remote web page and parsing a local one.

You do not parse a remote web page, you just match two regular expressions against the HTML of a local one. Nothing remote is involved here.

I'm sorry, it's still a bit fuzzy to me, but I'll give it another try this afternoon and report back with the results.

Hi @chrillek, I rewrote the script and I can get it now. The regex /<meta.*?property="og:image".*?>/ was returning all <meta property tags, so I replaced it with a switch case according to the domain name in the URL. Would it be possible to have a regex that works for any domain? If not, I'll add more domains to the switch case as needed. Thank you so much, I'm almost there :slightly_smiling_face:

(() => {

//function performsmartrule(records) {
	

	
	var app = Application("DEVONthink 3");
	app.includeStandardAdditions = true;
	const records = app.selectedRecords.whose( {_match: [ObjectSpecifier().type, "bookmark"]})();

	records.forEach (r => {
    const URL = r.url();
	let domain = URL.replace('http://','').replace('https://','').split(/[/?#]/)[0];
    const HTML = app.downloadMarkupFrom(URL);
	
	switch (true) {
		case (domain == "medium.com"):
			re1 = /<meta data-rh="true" property="og:image".*?>/;
			break;
		default:
			re1 = /<meta property="og:image".*?>/;
	}
		
	//re1 = /<meta.*?property="og:image".*?>/
	//re2 = /content="((.*?)".*?)'/
	re2 = /content="(.*?)".*?/
	
  	result1 = HTML.match(re1);
	if (result1 != null) {
		const result2 = result1[0].match(re2);
		if (result2 !=null) {
			r.thumbnail = result2[1]	
		}

	}

  })
//}
})()

My attempt at the first RegEx was broken. Sorry for that. I've appended a version of the script that hopefully works, with comments.

(() => {
	/* var */ const app = Application("DEVONthink 3"); // do not use var unless the value is allowed to change
	/* app.includeStandardAdditions = true; */ // not needed here since you do not work with dialogs etc.

	const records = app.selectedRecords.whose( {_match: [ObjectSpecifier().type, "bookmark"]})();


	/* Move invariants OUTSIDE of the loop - there's no need to redefine the same thing over and over again.
	  Also, the REs do not change, so make them constants.
	*/
	const re1 = /<meta[^>]*property="og:image".*?>/;
	const re2 = /content="(.*?)"/; // a trailing lazy .*? would only ever match the empty string, so it is dropped

	records.forEach (r => {
      const URL = r.url();
      const HTML = app.downloadMarkupFrom(URL);
	
	  const result1 = HTML.match(re1); // declare it, otherwise it becomes an implicit global
	  if (result1 !== null) { // use !== and === to compare for identity, avoiding implicit casts
		const result2 = result1[0].match(re2);
		if (result2 !== null) {
			r.thumbnail = result2[1];
		}
	}
  })
})()

The whole domain...switch thingy is unnecessary if the first regular expression is correct. And your default... re1 would not work with any meta element where the property is not the first attribute (like <meta content="..." property="og:image"...>). The new re1 should take care of all versions of meta elements with a property="og:image" attribute.
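
For comparison, here is a sketch in plain JavaScript that runs the three variants of the first expression against one line of HTML. The `html` sample is invented and the variable names are only labels for this illustration.

```javascript
// Hypothetical one-line HTML with two meta elements; og:image has content first.
const html = '<meta name="author" content="Jane">' +
  '<meta content="https://example.com/x.png" property="og:image">';

const lazyDot = /<meta.*?property="og:image".*?>/;      // the broken first attempt
const withinTag = /<meta[^>]*property="og:image".*?>/;  // the new re1
const propertyFirstOnly = /<meta property="og:image".*?>/; // the default case of the switch

const a = html.match(lazyDot)[0];         // starts at the FIRST <meta and spans both tags
const b = html.match(withinTag)[0];       // only the og:image element
const c = html.match(propertyFirstOnly);  // null, property is not the first attribute

// Running the content expression on each result shows the consequence.
const fromA = a.match(/content="(.*?)"/)[1];  // picks up the author's content
const fromB = b.match(/content="(.*?)"/)[1];  // picks up the image URL
```

Because `.` also matches `>`, the lazy `.*?` in the first attempt happily crosses tag boundaries, while `[^>]*` cannot.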

One last thing: replace('http://','').replace('https://','') is not the most efficient way to remove the HTTP(S) protocol from a URL. replace(/https?:\/\//, '') does what you want in a single call to replace with a regular expression.
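
A minimal sketch of both variants in plain JavaScript, using a made-up example.com URL. The `^` anchor is my own addition, so that only a protocol at the very start of the string is removed.

```javascript
// Hypothetical URL, only for illustration.
const url = 'https://example.com/articles/1?ref=feed#top';

// Two chained calls, one per protocol.
const twoCalls = url.replace('http://', '').replace('https://', '');

// One call: "s?" makes the s optional, "\/\/" is the escaped "//".
const oneCall = url.replace(/^https?:\/\//, '');

// The host is everything before the first /, ? or #.
const domain = oneCall.split(/[/?#]/)[0];
```

Both variants produce the same string here; the regular expression simply covers http and https in a single pass.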

Thank you so much, it's amazing! Your tips are very helpful, and I'm thinking about actually learning JavaScript. The script works nicely for my case.