Script: Fetch metadata for astronomical papers from the NASA Astrophysics Data System

I want to post a script that I hope may be helpful for some other DT users out there. The script fetches bibliographic metadata for scientific papers from the NASA / Harvard Astrophysics Data System (ADS). If you are working in astronomy or related fields, even as a hobbyist, you will be very familiar with the ADS. For many historical papers, it is the only source for digital copies.

I found myself with a big collection of these historical PDFs and wanted to have an easy way to get bibliographic metadata as well as tags into my database. The attached script is tailored to exactly this situation: You have PDFs from the ADS and a list of custom metadata fields. The script even allows localized or user-defined field names – the downside is that you will have to edit two property fields before being able to use it, one for this fieldname mapping, the other for your API key (which you can get for free by creating an account at https://ui.adsabs.harvard.edu ).

I have to admit that I had no previous experience with AppleScript, so I used Gemini as an LLM-based coding assistant. That was quite helpful, because I found out along the way that AppleScript seemingly can’t handle JSON very well and can’t extract details from within PDFs. So in order to parse JSON within AppleScript and detect links in PDFs, the script uses some rather annoying and unreadable Objective-C-based tricks and became huge… Because of this, it is quite possible that there are errors or stylistic catastrophes left in the code. It works fine for its intended purpose, however.

I would be very happy about any feedback! However, since I have very little time currently, I can’t promise I’ll be able to incorporate any suggestions in a reasonable timeframe.

use framework "Foundation"
use framework "PDFKit"
use scripting additions


-- ---------------------------------------------------------------------------------

-- Customize these two properties to match your own setup.

(* This record maps the field names from the ADS API JSON response
to the custom metadata field names you have defined in DEVONthink.
The keys are the JSON field names as used in the API and documented at
https://ui.adsabs.harvard.edu/help/api/api-docs.html#servers, and 
the values are your localized / custom DEVONthink field names. 
*)
property fieldMapping : {¬
	{"title", "Originaltitel"}, ¬
	{"author", "Autoren"}, ¬
	{"pubdate", "Datum"}, ¬
	{"pub", "Erschienen in"}, ¬
	{"volume", "Volume"}, ¬
	{"year", "Erscheinungsjahr"}, ¬
	{"page", "Page"}, ¬
	{"doi", "DOI"}, ¬
	{"isbn", "ISBN"} ¬
}

(* Paste your API token into this string. *)
property api_token : ""

-- ---------------------------------------------------------------------------------

(*
=================================================================================
 SCRIPT DOCUMENTATION
=================================================================================

WHAT THIS SCRIPT DOES:
This script automates adding metadata to academic papers from the NASA Astrophysics Data System (ADS) 
within DEVONthink. It performs the following steps:
1.  It inspects the selected document to find its unique 19-character "bibcode". It can find the bibcode from the
    document's filename, from an ADS URL, or by scanning the content of a PDF for a watermark link.
2.  It uses this bibcode to query the official ADS API to retrieve detailed metadata about the paper.
3.  It parses the API's response and populates your custom / localized metadata fields in DEVONthink (as defined in the
    `fieldMapping` property).
4.  It sets the document's URL to the ADS abstract page and adds any keywords from the API as DEVONthink tags.

WHAT IS THE ADS?
The NASA Astrophysics Data System (ADS) is a digital library and online database of scientific papers with a strong 
focus on astronomy and astrophysics. See https://ui.adsabs.harvard.edu/ or, since the UI is currently transitioning
to a new version, https://scixplorer.org.

IMPORTANT LIMITATIONS:
This script is specifically designed to work with documents that have an ADS bibcode. It will **not** work for papers
downloaded directly from other sources like arXiv, journal publisher websites (e.g., Elsevier, Springer), or other
academic repositories, even if the papers are *also* in the ADS -- the script requires bibcodes to be present.

SAMPLE API RESPONSE:
The script expects a JSON response from the API. The `fieldMapping` property is used to map the keys from this JSON
(e.g., "title", "author") to your custom fields in DEVONthink. A typical response for a single document looks like this:

$ curl -H "Authorization: Bearer ..." "https://api.adsabs.harvard.edu/v1/search/query?q=bibcode:1995ApJ...438...62W&fl=title,issue,keyword,pub,title,volume,year,pubdate,author"
{
  "responseHeader":{
    "status":0,
    "QTime":6,
    "params":{
      "q":"bibcode:1995ApJ...438...62W",
      "fl":"title,issue,keyword,pub,title,volume,year,pubdate,author",
      "start":"0",
      "internal_logging_params":"X-Amzn-Trace-Id=Root=1-692213f1-0c85a8155426ae546f3e1840",
      "rows":"10",
      "wt":"json"}},
  "response":{"numFound":1,"start":0,"numFoundExact":true,"docs":[
      {
        "author":["Wilson, A. S.",
          "Colbert, E. J. M."],
        "keyword":["Active Galactic Nuclei",
          "Black Holes (Astronomy)",
          "Luminosity",
          "Radio Jets (Astronomy)",
          "Cosmology",
          "Interacting Galaxies",
          "Quasars",
          "Radio Astronomy",
          "Astrophysics",
          "BLACK HOLE PHYSICS",
          "GALAXIES: ACTIVE",
          "GALAXIES: INTERACTIONS",
          "GALAXIES: NUCLEI",
          "GALAXIES: QUASARS: GENERAL",
          "RADIO CONTINUUM: GALAXIES",
          "Astrophysics"],
        "pub":"The Astrophysical Journal",
        "pubdate":"1995-01-00",
        "title":["The Difference between Radio-loud and Radio-quiet Active Galaxies"],
        "volume":"438",
        "year":"1995"}]
  }}


*)



tell application id "DNtp"

	if api_token is "" then
		display dialog "No API Token Found" & return & return & ¬
			"This script queries the ADS / SciX API, which requires an API token to function. " & ¬
			"Register for a free account at https://scixplorer.org and create a token in your account settings. " & ¬
			"After that, edit this script and paste your token into the 'api_token' property." & return & return & ¬
			"The script is located at:" & return & (path to me as string) ¬
			buttons {"OK"} default button "OK" with icon stop
		return -- stop here: the API request cannot succeed without a token
	end if

	set theDocument to item 1 of (get selection)
	set theName to name of theDocument
	set detectedBibcode to ""
	
	-- Discover the Bibcode of the selected Document. There are three possibilities:
	-- 1. Downloaded documents will have a bibcode as their name. That's easy.
	if my isBibcode(theName) then
		set detectedBibcode to my decodeString(theName)
	-- 2. Files may have an ADS URL as their name.
	else if "adsabs.harvard.edu/" is in theName then
		set rawBibcode to my getRawBibcodeFromADSURL(theName)
		set detectedBibcode to my decodeString(rawBibcode)
	-- 3. the user seems to think that this object was downloaded from ADS nonetheless,
	-- so we'll scan for the watermark link that every modern ADS PDF has
	else if type of theDocument is PDF document then
		set foundURL to my scanPDFforADSLink(path of theDocument)
		if foundURL is not "" then
			set detectedBibcode to my getRawBibcodeFromADSURL(foundURL)
			set detectedBibcode to my decodeString(detectedBibcode)
		end if
	end if
	
	-- If we have a valid bibcode, query the API and set the metadata
	if my isBibcode(detectedBibcode) then
		set encodedBibcode to my encodeString(detectedBibcode)
		my updateMetadataFromAPI(encodedBibcode, theDocument)
	else
		display dialog "Could not find a valid Bibcode in the selected document."
	end if

end tell

(**
 * Checks if a given string is a valid 19-character Bibcode.
 * There is no single regex to check this due to the limitations in https://ui.adsabs.harvard.edu/help/actions/bibcode 
 * (and AppleScript's regex handling is not great), so let's just use super simple checks here -- 
 * worst case is an empty result set from the API.
 *
 * @param bibcodeString The string to validate.
 * @return true if the string is 19 characters long and contains only allowed characters, false otherwise.
 *)
on isBibcode(bibcodeString)
	
	-- First, check for the correct length. This is the fastest check.
	if (count of bibcodeString) is not 19 then
		log "Bibcode failed length check: " & bibcodeString & " with length: " & (count of bibcodeString) as string
		return false
	end if
	-- AppleScript does not seem to have simple regex handling, sooo....
	set allowedChars to "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.&+!-"
	repeat with aChar in (characters of bibcodeString)
		if aChar is not in allowedChars then
			log "Found invalid character in Bibcode: " & aChar
			return false
		end if
	end repeat
	return true

end isBibcode

(**
 * Extracts the raw bibcode (the last path component) from a full ADSABS URL.
 * The result may still be percent-encoded; the caller decodes it with decodeString and validates it with isBibcode.
 * There are two URL variants: Those starting with adsabs.harvard.edu/abs/ (which is used in the embedded
 * watermark-style links in PDFs) and those starting with adsabs.harvard.edu/pdf/ 
 * (when storing a PDF directly from the download page, DT uses this as the filename).
 *
 * @param urlStr The full URL string, e.g., "https://adsabs.harvard.edu/abs/2023A%26A...678A.135H".
 * @return The raw, possibly percent-encoded bibcode as a string, or an empty string if splitting the URL fails.
 *)
on getRawBibcodeFromADSURL(urlStr)
	
	-- To handle both "adsabs.harvard.edu/abs/" and "adsabs.harvard.edu/pdf/" URLs,
	-- we can split the URL by "/" and take the last item, which will be the bibcode.
	set oldDelimiters to AppleScript's text item delimiters
	set AppleScript's text item delimiters to "/"
	try
		set rawBibcode to the last text item of urlStr
	on error
		log "Error while splitting URL: " & urlStr
		set AppleScript's text item delimiters to oldDelimiters
		return ""
	end try
	set AppleScript's text item delimiters to oldDelimiters
	return rawBibcode

end getRawBibcodeFromADSURL

(**
 * Find an ADS watermark.
 * Watermarked documents have a vertical link beginning with "https://adsabs.harvard.edu/abs/" on every page.
 * This method looks at PDF link annotations and tries to find one starting with that string.
 *
 * @param pdfPath The full POSIX path to the PDF file (e.g., from DEVONthink's 'path of theDocument').
 * @return The found URL as a string, or an empty string if no matching link is found.
 *)
on scanPDFforADSLink(pdfPath)

	-- Create a PDFDocument object from the file path using PDFKit
	try
		set theURL to current application's NSURL's fileURLWithPath:pdfPath
		set thePDF to current application's PDFDocument's alloc()'s initWithURL:theURL
		
		if thePDF is missing value then
			log "Error: Could not open the PDF file at path: " & pdfPath
			return ""
		end if
	on error errMsg
		log "Error creating PDF object: " & errMsg
		return ""
	end try
	
	-- Check PDF for links via annotations 
	-- (only the first page -- if there is an ADS link block, it's present on every page)
	set pageCount to thePDF's pageCount()
	if pageCount > 0 then
		set thePage to (thePDF's pageAtIndex:0) -- Get the first page (index 0)
		set theAnnotations to thePage's annotations()
		-- Is one of the annotations a link to adsabs.harvard.edu/abs/? If so, return it
		repeat with anAnnotation in theAnnotations
			if (anAnnotation's type() as string) is equal to "Link" then
				set urlString to (anAnnotation's |URL|()'s absoluteString()) as string
				if "adsabs.harvard.edu/abs/" is in urlString then
					return urlString
				end if
			end if
		end repeat
	end if
	return ""

end scanPDFforADSLink

(**
 * Percent-encodes a string for safe use in URLs.
 *
 * @param theString The string to encode.
 * @return The URL-encoded string.
 *)
on encodeString(theString)

	set NSString to (current application's NSString's stringWithString:theString)
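	-- URLQueryAllowedCharacterSet alone leaves "&" and "+" unescaped. Both can occur in bibcodes
	-- (e.g. "2023A&A...678A.135H") and would break the query string, so remove them from the allowed set.
	set allowedChars to (current application's NSCharacterSet's URLQueryAllowedCharacterSet)'s mutableCopy()
	allowedChars's removeCharactersInString:"&+"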
	set encodedString to (NSString's stringByAddingPercentEncodingWithAllowedCharacters:allowedChars)
	return encodedString as string

end encodeString

(**
 * Decodes a percent-encoded string from a URL.
 *
 * @param theString The string to decode.
 * @return The decoded string, or an empty string on failure.
 *)
on decodeString(theString)

	set NSString to (current application's NSString's stringWithString:theString)
	set decodedString to (NSString's stringByRemovingPercentEncoding())
	if decodedString is missing value then
		return ""
	end if
	return decodedString as string

end decodeString

(**
 * Queries the ADS API with a given bibcode, parses the response, and updates a DEVONthink record with the retrieved metadata.
 *
 * This handler performs the core logic of fetching data from the ADS API and populating the fields of a DEVONthink record.
 * It constructs the API request, adds the necessary authorization header using the `api_token` property, and sends the request.
 * Upon receiving a valid response, it parses the JSON and iterates through the `fieldMapping` property to match API fields
 * to DEVONthink custom metadata fields. It handles multi-valued fields (like authors and keywords) by joining them into a
 * single string. Finally, it sets the record's main URL to the ADS abstract page and applies any found keywords as tags.
 *
 * @param theBibcode The URL-encoded 19-character bibcode string for the document.
 * @param theRecord A reference to the DEVONthink record that will be updated.
 *)
on updateMetadataFromAPI(theBibcode, theRecord)

	-- Construct the query URL from the bibcode
	set requestURLString to "https://api.adsabs.harvard.edu/v1/search/query?q=bibcode:" & theBibcode & "&fl=title,author,pubdate,pub,volume,year,page,doi,isbn,keyword"
	set requestURL to current application's NSURL's URLWithString:requestURLString

	-- Create a mutable request to add the authorization header
	set theRequest to current application's NSMutableURLRequest's requestWithURL:requestURL
	theRequest's setValue:("Bearer " & api_token) forHTTPHeaderField:"Authorization"

	-- Perform the request
	set {theData, theResponse, theError} to current application's NSURLConnection's sendSynchronousRequest:theRequest returningResponse:(reference) |error|:(reference)

	-- Parse the JSON response from the API and put metadata into DEVONthink
	if theData is not missing value then
		set jsonString to (current application's NSString's alloc()'s initWithData:theData encoding:(current application's NSUTF8StringEncoding)) as string
		try
			set theJSON to (current application's NSJSONSerialization's JSONObjectWithData:theData options:0 |error|:(missing value))
			set docsKey to (theJSON's valueForKeyPath:"response.docs")			
			if (docsKey's |count|()) > 0 then
				set theDocument to (docsKey's objectAtIndex:0)
				-- Extract keywords for later use as tags. This is an array.
				set theKeywords to (theDocument's valueForKey:"keyword")
				-- All other response fields go into custom metadata fields:
				repeat with aMapping in fieldMapping
					set jsonKey to item 1 of aMapping
					set theValue to (theDocument's valueForKey:jsonKey)
					set finalValue to ""
					set dtFieldName to item 2 of aMapping
					if theValue is not missing value then
						-- Arrays (like authors or keywords) are joined to serialize them into a String.
						if (theValue's isKindOfClass:(current application's NSArray's |class|())) then
							set finalValue to (theValue's componentsJoinedByString:"; ") as string
						else
							set finalValue to theValue as string
						end if
						
						-- Add the processed value to DEVONthink's custom metadata.
						tell application id "DNtp"
							add custom meta data finalValue for dtFieldName to theRecord
						end tell
					end if
				end repeat
				
				-- Set DEVONthink's URL field to the paper's ADS landing page as that is universal
				tell application id "DNtp"
					set (url of theRecord) to ("https://adsabs.harvard.edu/abs/" & theBibcode)
				end tell

				-- If we found keywords, set them as tags in DEVONthink.
				if theKeywords is not missing value then
					tell application id "DNtp"
						set (tags of theRecord) to (theKeywords as list)
					end tell
				end if
			end if

		on error errMsg
			log "JSON Parsing Error: " & errMsg
		end try
	end if

end updateMetadataFromAPI

Some comments:

The ADS is currently transitioning to a new UI under a new URL: https://scixplorer.org/ . The API will, for the foreseeable future, not change and will still be available under the old URL. Also, links to adsabs.harvard.edu should, according to the documentation, remain valid.
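
For reference, the request the script sends to that API endpoint can be sketched in JavaScript. This is only an illustration, not part of the script: the names `ADS_API_BASE` and `buildAdsQueryUrl` are made up for this example. The point it shows is that bibcodes may contain `&` (as in `2023A&A...678A.135H`), which has to be percent-encoded before it goes into the query string.

```javascript
// Hypothetical helper, for illustration only.
// The endpoint is the one used in the script above.
const ADS_API_BASE = "https://api.adsabs.harvard.edu/v1/search/query";

// Builds the search URL for one bibcode and a list of field names.
function buildAdsQueryUrl(bibcode, fields) {
  // encodeURIComponent escapes "&" and "+", which would otherwise be
  // interpreted as query-string syntax; "." is left untouched.
  const encodedBibcode = encodeURIComponent(bibcode);
  return ADS_API_BASE + "?q=bibcode:" + encodedBibcode + "&fl=" + fields.join(",");
}

console.log(buildAdsQueryUrl("2023A&A...678A.135H", ["title", "author"]));
// prints https://api.adsabs.harvard.edu/v1/search/query?q=bibcode:2023A%26A...678A.135H&fl=title,author
```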

This was my first experience with an LLM coding assistant. While I can write code to some extent (I would not consider myself a programmer, but I occasionally do code reviews in my job), I knew neither AppleScript nor Objective-C beforehand. So the assistance was extremely helpful when it came to syntax and those tricky conversions between the two languages, like

set urlString to (anAnnotation's |URL|()'s absoluteString()) as string

but at least for these lesser-used languages, we still seem to be very far from real “vibe coding”. Debugging and restructuring the script was still an almost completely manual task.


That’s a helpful example even for others not using that database.

Regarding AI-assisted AppleScript: Claude 4.5 does a pretty good job if you provide it with the DEVONthink 4 Script Definition File, which you can export from Script Editor.

Or you could use the script assistant of e.g. Data > New > Script… which does this and more.

What am I doing wrong here? I chose Claude Sonnet 4.5. Any script I request is blank.

Never tried such a prompt so far but the script assistant is especially (and more or less only) intended to write scripts of the selected type for DEVONthink, e.g. for smart rules or to process a selection. But I got this, the script tries to average numeric values found in the name of 3 selected records or as a fallback the ratings:

Just an aside: If you’d use JavaScript instead of AppleScript, some of the tasks would be a lot simpler. Notably, JSON parsing and unescaping a URL-encoded string.

function isBibcode(bibcodeString) {
  return /^[a-zA-Z0-9.&+!-]{19}$/.test(bibcodeString);
}
function getRawBibcodeFromADSURL(urlStr) {
  // The bibcode is the last path segment; tolerate a trailing slash.
  const match = urlStr.match(/\/([^\/]+)\/?$/);
  if (match) {
    return match[1]; // rawBibcode
  } else {
    throw new Error(`Error while splitting URL: ${urlStr}`);
  }
}
function decodeString(rawBibcode) {
  return decodeURIComponent(rawBibcode);
}
function encodeString(string) {
  return encodeURIComponent(string);
}
...
const JSONObj = JSON.parse(returnSyncRequest);
const docsObj = JSONObj.response.docs;
// An empty array is truthy in JavaScript, so check the length as well.
if (docsObj && docsObj.length > 0) {
  const document = docsObj[0];
  const keywords = document.keyword;
  // Assumes fieldMapping is a plain object: { jsonKey: "DEVONthink field name", ... }
  for (const key in fieldMapping) {
    const value = document[key];
    if (!value) {
      continue;
    }
    const finalValue = Array.isArray(value) ? value.join('; ') : value;
    Application("DEVONthink").addCustomMetaData(
      finalValue, {for: fieldMapping[key], to: record});
  }
}
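
To try the mapping logic outside DEVONthink, here is a minimal sketch that runs in Node.js on a trimmed copy of the sample response from the first post. The function name `mapFields`, the returned `{ fields, tags }` shape, and the object-shaped `fieldMapping` are assumptions for illustration; the original script stores the mapping as a list of pairs and writes directly to the record.

```javascript
// Hypothetical object-shaped version of the script's fieldMapping (trimmed).
const fieldMapping = {
  title: "Originaltitel",
  author: "Autoren",
  pub: "Erschienen in",
  year: "Erscheinungsjahr",
  doi: "DOI",
};

// Trimmed copy of the sample API response shown in the first post.
const sampleResponse = JSON.stringify({
  response: {
    numFound: 1,
    docs: [{
      author: ["Wilson, A. S.", "Colbert, E. J. M."],
      keyword: ["Active Galactic Nuclei", "Quasars"],
      pub: "The Astrophysical Journal",
      title: ["The Difference between Radio-loud and Radio-quiet Active Galaxies"],
      year: "1995",
    }],
  },
});

// Returns { fields, tags }: fields maps DEVONthink field names to strings,
// tags is the keyword array (empty if the API returned none).
function mapFields(responseText, mapping) {
  const docs = JSON.parse(responseText).response.docs;
  if (!Array.isArray(docs) || docs.length === 0) {
    return { fields: {}, tags: [] };
  }
  const doc = docs[0];
  const fields = {};
  for (const [jsonKey, dtField] of Object.entries(mapping)) {
    const value = doc[jsonKey];
    if (value === undefined || value === null) continue; // field not in response
    // Multi-valued fields (authors, title) are joined into a single string.
    fields[dtField] = Array.isArray(value) ? value.join("; ") : String(value);
  }
  return { fields, tags: doc.keyword || [] };
}

const result = mapFields(sampleResponse, fieldMapping);
console.log(result.fields);
console.log(result.tags);
```

Fields that are missing from the response (here `doi`) are simply skipped, which matches what the AppleScript version does with `missing value`.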

And some of the checks seem a bit over the top, as usual with AI-generated code. pagecount > 0 for a PDF document, for example. If it had no pages, it wouldn’t be a PDF. In JS, I’d do this for the annotations

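// JavaScript string comparison is case-sensitive (AppleScript's is not), so normalize the type.
const links = page.annotations.js.filter(a => a.type.js.toLowerCase() === 'link');
for (const l of links) {
  if (l.URL.absoluteString.js.includes("adsabs.harvard.edu/abs/")) {
    return l.URL.absoluteString.js;
  }
}
return "";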

Also, having a try block enclose not only the single line of JSON parsing but also all the interpretation of it and then throw a “JSON parsing error” seems weird. If an error happened in that part of the code, it might be anywhere, not only in the line parsing the JSON. I’d also rename “theJSON” to something like “JSObject” – JSON stands for “JavaScript Object Notation” and it is the format of the string you get back from the curl call. After parsing it, you have an object, not a string anymore.
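
A minimal sketch of what that narrower error handling could look like in JavaScript. The function name `parseAdsResponse` and the shape of its return value are made up for this example; only the `response.docs` path comes from the API response shown in the first post.

```javascript
// Hypothetical helper, for illustration only.
// Only the JSON.parse call sits inside the try block, so an
// "invalid JSON" result really means the text could not be parsed.
function parseAdsResponse(responseText) {
  let parsed;
  try {
    parsed = JSON.parse(responseText);
  } catch (err) {
    return { ok: false, reason: "invalid JSON: " + err.message, docs: [] };
  }
  // From here on we are dealing with an object, not with JSON text.
  const docs = parsed && parsed.response && parsed.response.docs;
  if (!Array.isArray(docs)) {
    return { ok: false, reason: "unexpected response shape", docs: [] };
  }
  if (docs.length === 0) {
    return { ok: false, reason: "no document found for this bibcode", docs: [] };
  }
  return { ok: true, reason: "", docs: docs };
}

console.log(parseAdsResponse("not json").reason);
console.log(parseAdsResponse('{"response":{"docs":[]}}').reason);
```

Errors in the later steps (setting metadata, URL, tags) would then surface with their own messages instead of being reported as parsing problems.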


Good points, @chrillek , thank you!

Thanks - something is still off in my setup.

I tried this:

and this is the result:

Are you pressing Return after typing the prompt?

OK, some things are embarrassingly simple: I assumed OK and Enter were equivalent. 3 demerits for me.

That said - are these instructions correct? I cannot figure out a way to edit an existing script except by copying/pasting the existing script into Script Assistant. The Help says that I can open a script in Script Editor and then use Script Assistant to prompt for edits - is that correct?

Similarly is there any way to test/execute the script from the Script Assistant? I understand I can run the final script in any of the standard ways - but it would be even more useful if this were integrated in a way that made it simpler to edit/iterate on an existing script and execute test versions of the script within that same environment.

I cannot figure out a way to edit an existing script except by copying/pasting the existing script into Script Assistant.

That’s not possible nor can you select an existing .scpt or .applescript file and load it via Data > New > Script (though I wish it would). Development would have to assess that.

The Help says that I can open a script in Script Editor and then use Script Assistant to prompt for edits - is that correct?

The documentation does not say that.

Similarly is there any way to test/execute the script from the Script Assistant?

No. And again, that’s something development would have to assess.

The DEVONthink manual does not say it. But the AI chat under Help > DEVONthink Help does say it (see screenshot above).

Other points noted - thanks for your help. And this is indeed a notable upgrade from earlier versions of Script Assistant that I had tried. I had not realized it had progressed to that extent.

While the LLM in the Help Assistant is being fed DEVONthink-specific information, those responses are still governed by the external AI and could be incorrect in some ways. Just as the Script Assistant is also given specific scripting information but the draft scripts will likely require some modifications.
