Identifying and extracting multiple dates from PDFs

Hi, request for help here.

What I’d like to do is have DEVONthink scan a database of PDFs (many, highly varied, unstructured), extract any dates or possible dates it finds in them, and save those as metadata. A user could then search for, say, all documents referring to a specific date, month, or year and, if possible, see the relevant passage of each matching document.

So one document, say, might include a date when it was created, a date when it was declassified, and may in the body include numerous strings that might be dates – a day of the week, a month, a year, a decade etc – which DEVONthink could identify and store as separate pieces of custom metadata.

Searching that metadata later, the user could ask for all documents that refer to a specific date or range of dates, and see the matching documents with the relevant sections highlighted or otherwise identified.

Hopefully the code would be able to make some assumptions. For example, if the text mentions a day, “the 23rd”, that is not contiguous with but near a month, say June, which is itself not far from a year, say 1987, then the code could assume that “the 23rd” means June 23, 1987.
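To make the idea concrete, here is a rough Python sketch of that proximity heuristic. It runs outside DEVONthink on plain text that has already been extracted from a PDF. The function names (guess_dates, nearest), the 60-character window, and the restriction to ordinal days like “23rd” and English month names are all my own assumptions for illustration, not anything DEVONthink provides.

```python
import re
from datetime import date

# English month names only; "May" will also match the capitalized verb,
# so expect some false positives in real documents.
MONTHS = {name: i for i, name in enumerate(
    ["January", "February", "March", "April", "May", "June", "July",
     "August", "September", "October", "November", "December"], start=1)}

MONTH_RE = re.compile(r"\b(" + "|".join(MONTHS) + r")\b")
# Days are only recognized with an ordinal suffix ("23rd", "1st") to avoid
# treating every small number as a day of the month.
DAY_RE = re.compile(r"\b([12]\d|3[01]|0?[1-9])(?:st|nd|rd|th)\b", re.IGNORECASE)
# Assumed year range: 1800-2099.
YEAR_RE = re.compile(r"\b(1[89]\d\d|20\d\d)\b")


def nearest(pattern, text, pos, window):
    """Return the int value of the match closest to pos within window chars, or None."""
    lo, hi = max(0, pos - window), min(len(text), pos + window)
    best = None
    for m in pattern.finditer(text, lo, hi):
        dist = abs(m.start() - pos)
        if best is None or dist < best[0]:
            best = (dist, int(m.group(1)))
    return best[1] if best else None


def guess_dates(text, window=60):
    """For each month name, pair it with the nearest ordinal day and year.

    Returns a list of (date, offset_of_month_in_text) tuples. The offset can
    be used later to show the passage around the guessed date.
    """
    results = []
    for m in MONTH_RE.finditer(text):
        month = MONTHS[m.group(1)]
        day = nearest(DAY_RE, text, m.start(), window)
        year = nearest(YEAR_RE, text, m.start(), window)
        if day is None or year is None:
            continue  # not enough nearby context to assume a full date
        try:
            results.append((date(year, month, day), m.start()))
        except ValueError:
            continue  # impossible combination, e.g. "31st" near "February"
    return results


sample = ("The meeting on the 23rd was postponed. In June the committee "
          "met again, and by 1987 the file was closed.")
print([d.isoformat() for d, _ in guess_dates(sample)])  # ['1987-06-23']
```

This only illustrates the guessing step. It will produce wrong guesses on dense text, and getting the results back into DEVONthink as custom metadata would still need a separate script.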

I’ve seen a few other threads not far from this idea, but not seen any that refer to extracting multiple dates, and kinds of dates, per document. And I have no Regex knowledge or experience to speak of, so perhaps my request is far too ambitious. But I’d welcome any advice.

There is no such functionality built into DEVONthink, and parsing unformatted relative dates is beyond the abilities of any consumer- or prosumer-grade application I am aware of.

Well, classifying the dates might be a stretch, but you can use an “AI” tool like PDF Pals to help discover the dates a document might contain.

@GordonMeyer, thanks, much appreciated. I’ll give that a try. I have played around with most of the GAI offerings to see if they might be able to help, but your idea is a nice simple step which might throw up more ideas about trying to make this work.

The script property “all document dates” might be useful for this task.