Where is the DOI saved in a PDF? Only in the PDF text?

Hi, most scientific articles contain a DOI in the PDF text itself. Is this DOI normally also stored somewhere else (a specific annotation or metadata field, or something like that), and can it be extracted directly from there?

AFAICT there’s no metadata field for DOI in the PDF. It’s a bit difficult to say because most information I found on PDF metadata is quite general, but at least according to

there’s no special field for DOI. You might be able to extract the DOI from a PDF and add it to a user metadata field in DT, though.

most scientific

I would say that’s a broad generalization. I’d say many files don’t contain this data.

Do you have an example to inspect?

Locating a DOI in an arbitrary document seems to be a fun exercise:

These are two different things:

  • I’d agree with the OP that most scientific articles nowadays do indeed have a DOI assigned (the quoted PDF is from 2015…)
  • which does not mean that the DOI is easily identifiable in the text/PDF itself, or even that it is part of the document.

Agree…I still get PDFs of current journal articles that don’t have the DOI embedded or printed in the document, but it is getting rarer.

Where are you getting the PDF? The DOI, if there is one, is usually part of the citation/metadata on the journal’s download page, but may not be part of the metadata for the PDF itself.

Searching for “DOI” in the PDF’s text layer will find it, too.

One of the PDFs in my library puts the DOI in the Subject field:

Title: Low Prevalence of Lactase Persistence in Bronze Age Europe Indicates Ongoing Strong Selection over the Last 3,000 Years
Subject: Current Biology, Corrected proof. doi:10.1016/j.cub.2020.08.033
Author: Joachim Burger
Creator: Elsevier
Producer: Acrobat Distiller 8.1.0 (Windows)

but this is an uncommon practice. It’s not as if there’s a dedicated DOI tag in the metadata.

Sure. But how do you find the DOI itself? Will it be everything up to the next space?
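
On the "up to the next space" question: roughly yes, since a DOI contains no whitespace, but punctuation from the surrounding sentence tends to stick to the end of the match. Here is a minimal Python sketch, assuming the text layer or a metadata string has already been extracted. The pattern follows the heuristic regular expression Crossref has published for modern DOIs and will miss some unusual older ones; `find_dois` is a name made up for this illustration.

```python
import re

# Heuristic, not a validator: "10.", a 4-9 digit registrant prefix, "/",
# then a suffix made of characters commonly seen in modern DOIs.
DOI_RE = re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Z0-9]+", re.IGNORECASE)


def find_dois(text: str) -> list[str]:
    """Return DOI-like strings in order of appearance, without duplicates."""
    found = []
    for match in DOI_RE.finditer(text):
        # Sentence punctuation is legal inside a DOI, so trim it only at the end.
        doi = match.group(0).rstrip(".,;:")
        # A closing parenthesis without an opening partner is probably prose.
        while doi.endswith(")") and doi.count("(") < doi.count(")"):
            doi = doi[:-1].rstrip(".,;:")
        if doi not in found:
            found.append(doi)
    return found


# The Subject line quoted earlier in this thread:
subject = "Current Biology, Corrected proof. doi:10.1016/j.cub.2020.08.033"
print(find_dois(subject))  # ['10.1016/j.cub.2020.08.033']
```

The same function can be pointed at any metadata field or at the text of the first page. It only finds strings that look like DOIs; it does not check that they resolve.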


Smart rules & batch processing (see Digital Object Identifier placeholder) and AppleScript (see digital object identifier property) might be able to retrieve and use the DOI. E.g. this works for the doi_handout.pdf.

This might be a bit off-topic, but it is still relevant if you want to extract the DOI automatically from any PDF: working with Bookends and its rather awesome import function, I noticed two problems that may also matter for automatic processing in DT.

To explain the import function briefly: if the paper/book etc. has a DOI, Bookends is able to grab it and fill in author/editor names, title, year, page numbers, journal etc. by itself. It then renames the file according to the rule I set up (e.g. from “9781529216097.pdf” to “williams2020.mapping good work.pdf”).
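
As a rough illustration of that last renaming step, here is a minimal Python sketch. Bookends’ actual rule syntax is not shown in this thread, so `reference_filename` and its pattern (lowercased surname, year, a dot, lowercased title) are assumptions reverse-engineered from the single example above, and the capitalised input values are guesses as well.

```python
import re


def reference_filename(surname: str, year: int, title: str, ext: str = ".pdf") -> str:
    """Build a file name like 'williams2020.mapping good work.pdf' (assumed pattern)."""
    # Replace characters that are awkward in file names, then collapse whitespace.
    clean_title = re.sub(r'[\\/:*?"<>|]', " ", title)
    clean_title = " ".join(clean_title.split()).lower()
    clean_surname = "".join(surname.split()).lower()
    return f"{clean_surname}{year}.{clean_title}{ext}"


print(reference_filename("Williams", 2020, "Mapping Good Work"))
# williams2020.mapping good work.pdf
```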

The problems:

  1. Sometimes there is a DOI, but it is not recognized because of how the PDF itself is built. In the last few days I had this problem mostly with articles from SAGE. If I select the DOI in the text and copy it manually, it just spits out garbage. This is not a big problem, because the whole process described above simply does not happen and I end up with a reference titled “Reference metadata not found online.” Then I just type in the DOI manually.

  2. Everything seems to work fine: Bookends recognizes a DOI, finds the corresponding metadata, and processes the paper. But the end result is still garbage. Why? Bookends did not grab the paper’s own DOI but a different one, and fetched the metadata for that. This can also happen if the paper itself has no DOI and a DOI from the reference list is grabbed instead. To be honest, this happens very rarely, but it happens. I am guessing this is also due to how the PDF itself is created (e.g. you highlight a passage and end up highlighting some other text at the other end of the page, too).

Back to automating DOI in DT: Problem 1 could be easily solved, but you have to be aware of Problem 2. You could accidentally end up with an incorrect reference and thus an incorrect citation.
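
One way to guard against Problem 2 in a script is to be conservative about which DOI gets trusted. The Python sketch below is only a heuristic under stated assumptions: it expects the text layer as one string per page, looks only at the first page before any "References"-style heading, and flags everything else for manual review. `guess_own_doi`, `DoiGuess` and the heading list are hypothetical names and choices, not part of Bookends or DT.

```python
import re
from typing import NamedTuple, Optional

# Heuristic DOI pattern; it will not catch every DOI ever issued.
DOI_RE = re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Z0-9]+", re.IGNORECASE)
# Assumed headings that mark the start of a reference list.
REFS_RE = re.compile(r"^\s*(references|bibliography|works cited)\s*$",
                     re.IGNORECASE | re.MULTILINE)


class DoiGuess(NamedTuple):
    doi: Optional[str]
    needs_review: bool


def guess_own_doi(pages: list[str]) -> DoiGuess:
    """Guess the document's own DOI, flagging anything ambiguous."""
    if not pages:
        return DoiGuess(None, True)
    first = pages[0]
    heading = REFS_RE.search(first)
    front_matter = first[:heading.start()] if heading else first
    hits = (m.group(0).rstrip(".,;:") for m in DOI_RE.finditer(front_matter))
    candidates = list(dict.fromkeys(hits))  # de-duplicate, keep order
    if len(candidates) == 1:
        return DoiGuess(candidates[0], False)
    if candidates:
        # Several different DOIs up front: take the first, but do not trust it.
        return DoiGuess(candidates[0], True)
    # DOIs on later pages most likely belong to cited works, so ignore them.
    return DoiGuess(None, True)
```

Anything returned with `needs_review` set would still need a human check, which is the point: a missing DOI is easier to fix than a wrong citation.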


Welcome @maikbode
Thanks for the information and cautions. I’m sure this can help people avoid some missteps.

Most academic journals now include the DOI. Some have retroactively assigned DOIs to older articles (e.g., from the 1960s), but not all journals have; some only started around 2000. If you need the metadata for a PDF article, you can go to the journal’s home page and search for the article. Regardless of whether there is a paywall (i.e., you have to pay to download the article), you should be able to download the citation for any article. This will give you the metadata that you need. I download the citation information into my EndNote app for every article that I download (just copy and paste into EndNote). There are a number of different formats for downloading the citation. The journals do not charge for downloading the citation metadata even if they charge for the article. You can also copy the metadata into DEVONthink if you wish to keep a copy with each PDF.

DOI is becoming the preferred standard, but unfortunately it is not yet industry-wide. I recently came across an interesting use of ‘augmented text’ (https://www.augmentedtext.info) that allows authors to simply highlight text in a PDF and copy it into their writing document, with a citation and reference provided automatically. It does so by converting a PDF with a DOI to ‘augmented text’. Of course, for researchers this is only one small step. Working with a multiplicity of resource types (books, chapters, reports, parliamentary debates, newspaper articles, websites, and tweets) means that citing and referencing is still labour intensive, even with high-end software like Mendeley or ReadCube Papers.