Where is the DOI saved in a PDF? Only in the pdf-text?

vinschger · November 21, 2021, 8:36am

Hi, most scientific articles contain a DOI number in the PDF text itself. Is this DOI normally also stored somewhere else (a specific annotation field or something like that), and can it be extracted directly from there?
Thanks.

chrillek · November 21, 2021, 9:24am

AFAICT there’s no metadata field for DOI in the PDF. It’s a bit difficult to say because most information I found on PDF metadata is quite general, but at least according to

there’s no special field for DOI. You might be able to extract the DOI from a PDF and add it to a user metadata field in DT, though.

BLUEFROG · November 21, 2021, 4:18pm

most scientific

I would say that’s a broad generalization. I’d say many files don’t contain this data.

Do you have an example to inspect?

chrillek · November 21, 2021, 4:49pm

Locating a DOI in an arbitrary document seems to be a fun exercise:

BLUEFROG · November 21, 2021, 4:53pm

Wheeeee…

And to underscore my earlier point…

chrillek · November 21, 2021, 5:06pm

These are two different things:

I’d agree with the OP that most scientific articles nowadays do indeed have a DOI assigned (the quoted PDF is from 2015…)
which does not mean that the DOI is easily identifiable in the text/PDF itself, nor that it is even part of the document.

rpallred · November 21, 2021, 6:17pm

Agree…I still get PDFs of current journal articles that don’t have the DOI embedded or printed in the document, but it is getting rarer.

kewms · November 21, 2021, 7:06pm

Where are you getting the PDF? The DOI, if there is one, is usually part of the citation/metadata on the journal’s download page, but may not be part of the metadata for the PDF itself.

Searching for “DOI” in the PDF’s text layer will find it, too.

jerwin · November 21, 2021, 7:14pm

One of the PDFs in my library puts doi in the subject tags.

Title: Low Prevalence of Lactase Persistence in Bronze Age Europe Indicates Ongoing Strong Selection over the Last 3,000 Years
Subject: Current Biology, Corrected proof. doi:10.1016/j.cub.2020.08.033
Author: Joachim Burger
Creator: Elsevier
Producer: Acrobat Distiller 8.1.0 (Windows)

but this is an uncommon practice. It’s not as if there’s a doi tag in the metadata.
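
For the cases where a DOI does sit in a metadata field like that, a small Python sketch can look through the fields. This assumes the metadata has already been read into a plain dictionary by some other tool; the function name `doi_from_metadata` and the pattern are illustrative only, not part of any PDF standard.

```python
import re

# Assumption: a DOI looks like "10.", a 4-9 digit registrant code, "/",
# and then a suffix that runs up to the next whitespace.
DOI_IN_FIELD = re.compile(r'\b10\.\d{4,9}/\S+')

def doi_from_metadata(fields):
    """Return (field_name, doi) for the first field holding a DOI, else None."""
    for name, value in fields.items():
        match = DOI_IN_FIELD.search(str(value))
        if match:
            # Trim punctuation that often trails a DOI inside a sentence.
            return name, match.group(0).rstrip('.,;')
    return None

fields = {
    "Title": "Low Prevalence of Lactase Persistence in Bronze Age Europe",
    "Subject": "Current Biology, Corrected proof. doi:10.1016/j.cub.2020.08.033",
    "Creator": "Elsevier",
}
print(doi_from_metadata(fields))  # ('Subject', '10.1016/j.cub.2020.08.033')
```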

chrillek · November 21, 2021, 8:14pm

Sure. But how do you find the DOI itself? Will it be everything up to the next space?

kewms · November 21, 2021, 8:29pm

Probably.
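
For what it's worth, here is a minimal Python sketch of the "everything up to the next space" idea. The pattern and the function name `find_dois` are assumptions of this sketch, not something DT or the DOI specification prescribes. It also trims trailing punctuation, which could clip the rare DOI that really ends in a bracket or a period.

```python
import re

# Assumption: "10.", a 4-9 digit registrant code, "/", then everything
# up to the next whitespace character.
DOI_PATTERN = re.compile(r'10\.\d{4,9}/\S+')

def find_dois(text):
    """Return DOI-like strings in order of appearance, without duplicates."""
    found = []
    for match in DOI_PATTERN.finditer(text):
        # Drop punctuation that belongs to the sentence, not to the DOI.
        doi = match.group(0).rstrip('.,;:)]}>"\'')
        if doi not in found:
            found.append(doi)
    return found

print(find_dois("Corrected proof. doi:10.1016/j.cub.2020.08.033 Author"))
# ['10.1016/j.cub.2020.08.033']
```

This only works when the text layer is intact; if copying the DOI by hand yields garbage, the pattern will not match either.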

cgrunenberg · November 22, 2021, 10:27am

Smart rules & batch processing (see Digital Object Identifier placeholder) and AppleScript (see digital object identifier property) might be able to retrieve and use the DOI. E.g. this works for the doi_handout.pdf.

maikbode · November 24, 2021, 8:45am

This might be a bit off-topic, but it is still relevant if you want to extract DOIs automatically from any PDF. Working with Bookends and its rather awesome import function, I noticed two problems that may also matter for automatic processing in DT.

To explain the import function briefly: if the paper, book, etc. has a DOI, Bookends is able to grab it and fill in author/editor names, title, year, page numbers, journal, etc. by itself. It then renames the file according to the rule I set up (e.g. from “9781529216097.pdf” to “williams2020.mapping good work.pdf”).

The problems:

1. Sometimes there is a DOI, but it is not recognized because of the PDF itself. In the last few days I had this problem mostly with articles from SAGE pub. If I select the DOI in the text and copy it manually, it just spits out garbage. This is not a big problem, because the whole process described above simply does not happen and I end up with a reference that is titled: . Reference metadata not found online. Then I just type in the DOI manually.
2. Everything seems to be working fine: Bookends recognizes a DOI, finds the corresponding metadata and processes the paper. But the end result is still garbage. Why? Bookends did not grab the paper's own DOI but a different one, and found the metadata for that. This can also happen if the paper itself has no DOI but a DOI from the reference list is grabbed. To be honest, this happens very rarely, but it happens. I am guessing this is also due to how the PDF itself was created (e.g. you highlight a passage and end up highlighting some other text at the other end of the page, too).

Back to automating DOI in DT: Problem 1 could be easily solved, but you have to be aware of Problem 2. You could accidentally end up with an incorrect reference and thus an incorrect citation.
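
One possible guard against Problem 2, as a rough Python sketch: accept a DOI only when the first page contains exactly one distinct candidate, and leave everything else for manual review. The first-page heuristic, the function name `pick_article_doi` and the input format (a list of page texts, first page first) are assumptions of this sketch, not how Bookends or DT actually work.

```python
import re

# Assumption: "10.", a 4-9 digit registrant code, "/", then a suffix
# that runs up to the next whitespace.
DOI_RE = re.compile(r'10\.\d{4,9}/\S+')

def pick_article_doi(pages):
    """Return the article's DOI, or None if the choice is not clear-cut.

    pages: list of page texts, first page first.
    """
    if not pages:
        return None
    # Only the first page is trusted; DOIs in the reference list are ignored.
    candidates = {m.group(0).rstrip('.,;:)') for m in DOI_RE.finditer(pages[0])}
    if len(candidates) == 1:
        return candidates.pop()
    return None  # zero or several candidates: better checked by hand

pages = [
    "Some Article Title\ndoi:10.1234/abc.1\nAbstract ...",
    "References\n[1] Another paper, doi:10.9999/other.2",
]
print(pick_article_doi(pages))  # 10.1234/abc.1
```

Returning nothing in the ambiguous cases trades convenience for safety: a missing DOI is obvious, a wrong one is not.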

Cheers.

BLUEFROG · November 24, 2021, 1:14pm

Welcome @maikbode
Thanks for the information and cautions. I’m sure this can help people avoid some missteps.

jrthpt · November 27, 2021, 4:49pm

Most academic journals now include the DOI number. Some have retroactively assigned DOI numbers to older articles (e.g., from the 1960s), but not all journals have. Some started around 2000. If you need the metadata for a PDF article, you can go to the journal home page and search for the article. Regardless of whether there is a paywall (payment required to download the article), you should be able to download the citation for any article. This will give you the metadata that you need. I download the citation information into my EndNote app for all articles that I download (just copy and paste into EndNote). There are a number of different formats for downloading the citation. The journals do not charge for downloading the citation metadata even if they charge for the article. You can also copy the metadata into DEVONthink if you wish to keep a copy with each PDF.

DarylAdair · November 28, 2021, 1:11am

DOI is becoming the preferred standard, but unfortunately it is not yet industry-wide. I recently came across an interesting use of ‘augmented text’ that allows authors to simply highlight text in a PDF and copy it into their writing document, with a citation and reference provided automatically. It does so by converting a PDF with a DOI to ‘augmented text’. Here’s the link: https://www.augmentedtext.info. Of course, for researchers this is only one small step. Working with a multiplicity of resource types (books, chapters, reports, parliamentary debates, newspaper articles, websites, and tweets) means that citing and referencing is still labour intensive, even with high-end software like Mendeley or ReadCube Papers.