Date Highlighting in PDFs

I’m using that batch tool to set the Creation Date from the date found in the content of the PDF. It’s a dream tool. I love it.

I go through large numbers of PDFs to confirm that the date is accurate, knowing that it’s impossible for the tool to always get it right.

It would be extremely helpful if there were a way to automatically highlight the dates in the PDF so I can see them quickly. Looking for the date manually really slows down the process, and it would go so much faster if the tool highlighted what it thought were dates.

It could also be useful if there were keywords it could look for (and highlight) in the PDF. For example, grocery receipts come in very frequently, so the name of the grocery store could be a keyword it’s looking for.

You could use the toolbar search; DEVONthink automatically jumps to the first match in each document.

Example queries:

[0-9][0-9].[0-9][0-9].[0-9][0-9][0-9][0-9]

–> “10.10.2020”

[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]

–> “2020-10-10”

You can chain queries with a |

[0-9][0-9].[0-9][0-9].[0-9][0-9][0-9][0-9]|[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]

–> “10.10.2020” and “2020-10-10”
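For anyone who wants to try the same two date shapes outside DEVONthink, here is a minimal Python sketch. It assumes the OCR text is already available as a string, and the name find_date_candidates is made up for illustration. Python's re syntax is not necessarily the same as DEVONthink's search syntax: in Python the dot has to be escaped to match a literal period.

```python
import re

# The two date shapes from the example queries: DD.MM.YYYY and YYYY-MM-DD.
# \b keeps the pattern from matching inside longer runs of digits.
DATE_PATTERN = re.compile(r"\b\d{2}\.\d{2}\.\d{4}\b|\b\d{4}-\d{2}-\d{2}\b")


def find_date_candidates(text):
    """Return every substring of text that has one of the two date shapes."""
    return DATE_PATTERN.findall(text)


sample = "Invoice 10.10.2020, due 2020-10-24, ref 123456789"
print(find_date_candidates(sample))  # ['10.10.2020', '2020-10-24']
```

This only checks the shape of the text, not whether it is a valid calendar date, so something like 99.99.9999 would still be reported as a candidate.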

Since DEVONthink 3.6 it’s also possible to do this kind of search in Inspector > Search with option Enable Wildcards and Operators.

If you add your keywords to the query they’ll be highlighted.

In these cases I use an AppleScript to change the creation date to the selected text. It creates a temp record in order to let DEVONthink parse the date for me. However, sometimes this won’t work due to bad document quality. When it fails, copy the selected text and paste it somewhere; you’ll see that there’s actually no date in the OCR layer where you see one in the document.

-- Set creation date to selected text (via temp record)

tell application id "DNtp"
	try
		set theRecords to selected records
		if theRecords = {} then error "Nothing selected"
		if (count theRecords) > 1 then error "Please select a date in a single record"
		set theRecord to item 1 of theRecords
		
		try
			set selectedText to selected text of window 1 & "" as string
		on error
			error "No text selected"
		end try
		
		-- Create a temporary record so DEVONthink parses the date from the selected text
		set theTempRecord to create record with {name:"Temp - " & selectedText, type:text, plain text:selectedText} in incoming group
		set theDates to all document dates of theTempRecord
		-- Delete the temp record right away so it is not left behind when no date is found
		delete record theTempRecord
		
		if theDates ≠ {} then
			set theDate to item 1 of theDates
			set creation date of theRecord to theDate
			display notification "Creation date via temp record"
		else
			error "No date found"
		end if
		
	on error error_message number error_number
		if the error_number is not -128 then display alert "DEVONthink" message error_message as warning
		return
	end try
end tell

It’s a good idea to also manually scan the text as DEVONthink can only find what’s in the OCR layer.

That’s tedious, I know, but you only have to verify the date. When I first posted in this forum there was no automatic document date extraction built in.

Here’s my first post

When I finally had cobbled together my own date extraction script (after years …) DEVONthink introduced the document date feature. Countless hours wasted …

However

that’s true :smile:


Thanks for this info! I’ll parse through it and give it a try.

I was actually using a Python CLI that I’d been slowly building. I point the .py at a directory and it goes through each file and presents me with a list of the dates it found, which I can select from. Then it checks a database of grouped keywords I’ve come across in the past. So, for example, if the current PDF’s OCR’d text contains “Ralph” and “Groceries”, it can assume it’s a receipt for Ralph’s and label it accordingly.
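The grouped-keyword idea could be sketched roughly like this. This is not the actual CLI, just a minimal illustration: the KEYWORD_GROUPS dictionary, the label text and the guess_labels function are all hypothetical, and a plain dict stands in for the database.

```python
# Hypothetical keyword groups: a label applies only if ALL of its keywords occur.
KEYWORD_GROUPS = {
    "Ralph's receipt": ["ralph", "groceries"],
}


def guess_labels(ocr_text, groups=KEYWORD_GROUPS):
    """Return the labels whose keywords all appear in the OCR text (case-insensitive)."""
    lowered = ocr_text.lower()
    return [
        label
        for label, words in groups.items()
        if all(word in lowered for word in words)
    ]


print(guess_labels("RALPHS Fresh Groceries  Total 23.10"))  # ["Ralph's receipt"]
```

Requiring every keyword in a group keeps a single common word such as "groceries" from triggering a label on its own.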

It’s one of those things that starts off slow but I’ve beefed up the database over time and it was getting pretty fast. I even had it analyzing dimensions so it can guess when something is a receipt or an 8.5x11 piece of paper.
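A size-based guess along those lines might look like the sketch below. The thresholds are assumptions picked for illustration (a receipt is taken to be at most 4 inches wide and at least twice as long as it is wide), and guess_paper_kind is a made-up name. It expects the page size in PDF points, where 72 points equal one inch.

```python
POINTS_PER_INCH = 72  # default PDF unit: 1 point = 1/72 inch


def guess_paper_kind(width_pt, height_pt, tolerance_in=0.25):
    """Guess 'letter', 'receipt' or 'unknown' from a page size given in points."""
    # Sort the sides so that portrait and landscape pages are treated alike.
    short_side, long_side = sorted(
        (width_pt / POINTS_PER_INCH, height_pt / POINTS_PER_INCH)
    )
    if abs(short_side - 8.5) <= tolerance_in and abs(long_side - 11) <= tolerance_in:
        return "letter"
    # Assumed heuristic: receipts are narrow strips.
    if short_side <= 4 and long_side >= 2 * short_side:
        return "receipt"
    return "unknown"


print(guess_paper_kind(612, 792))  # letter
print(guess_paper_kind(226, 600))  # receipt
```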

It worked well, but I’m no good at graphical interfaces, and I kind of just assumed there had to be something already out there that covered at least a good chunk of what I was trying to accomplish.


Since version 3.6 you could also use the document search after enabling operators & wildcards in the inspector.