Ditching the few remaining Hazel rules and moving them to DEVONthink: advice on reading inside PDFs

hi all

i want to ditch hazel as it has never been as reliable as DEVONthink for me. i have a few rules there that use text and dates inside the pdfs to allow for better renaming, but it seems like in DEVONthink it's much more difficult to understand how to extract data such as the 3rd date found, how to deal with password-protected pdfs (such as utility bills), etc.

here is an example pdf rule i want to move over to DEVONthink. how does one match a custom date found inside a pdf?

just to be clear i did search help first for both pdf content and pdf dates without much success :slight_smile:

best

Z

How is customdate defined in Hazel?

There is the smart rule action “Scan text” that takes different parameters, also regular expressions. Difficult to tell you what to do if you don’t show how customdate is defined.

thx both, ahh this is excellent, i think the scan text could work

lets say i have this

which regex/scan method can i use to get the date that's under advice date, and then how does one use that in the smart rule as an ISO date, ie

thx a lot! will set aside a few hours today to move from hazel completely :slight_smile:

This depends on the actual text layer in the document. What you see and how the underlying text is structured is not necessarily the same. Also, the text layer can vary based on the operating system.

Select the PDF and choose Data > Convert > To Plain Text. Examine the plain text to see where 06/06/25 occurs.

Well, what did you do in Hazel to make it find that? Hazel is working with the very same underlying text, so you should use similar patterns in DT.
If the date follows the ‘Advice Date’ string in the PDF’s text (see the post of @BLUEFROG), you could use Advice Date* in your Scan Text Date action.

got it thx again both

seems like after converting to text the actual dates are on another line below

Employee Name P/P Begin P/P End Advice Date Advice No Emp Num Location ER 403B
FIRST FAM 05/18/25 05/31/25 06/06/25 468446577

should i use a regex for the 3rd occurrence of a date?
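To see which occurrence the Advice Date actually is in the converted text, a quick check outside DEVONthink can help. The following is only a sketch in Python, whose `re` module is not the ICU engine DEVONthink uses; the `DATE` pattern and variable names are assumptions made for illustration, and the text is the two-line sample posted above.

```python
import re

# Plain-text layer as posted above: a header line, then a values line.
text = (
    "Employee Name P/P Begin P/P End Advice Date Advice No Emp Num Location ER 403B\n"
    "FIRST FAM 05/18/25 05/31/25 06/06/25 468446577"
)

# Assumed date shape for this document: two digits / two digits / two digits.
DATE = r"\b\d{2}/\d{2}/\d{2}\b"

# Collect every date in reading order and print its position.
dates = re.findall(DATE, text)
for position, value in enumerate(dates, start=1):
    print(position, value)
```

For this sample the Advice Date is the third date on the values line, which is why a "third occurrence" rule is one way to reach it.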

hmmm so played with regex in the scanned document

how does one refer to what was found as a variable when you go over to rename?

thx!

Why don’t you tell us how you defined customdate in your Hazel rule? It’s the third time I’m asking you for that information, which would probably simplify answering you.

Posting an image of a regular expression with an arrow going right through it is not particularly helpful, either.

sorry about that @chrillek , dropped the ball on that request

here is how it's defined in hazel

where in this case it's matching on the 1st date (via hazel's gui) but for other rules it's matching on the 2nd, 3rd, or even 4th occurrence

also here is the full regex in text in case that's needed /^(?:.*?\b\d{2}\/\d{2}\/\d{2}\b){2}.*?\K\d{2}\/\d{2}\/\d{2}/

still unclear to me how the scanned text can be used in the rename part
Apologies again

Z

For this particular document, try ([\d\/]{8})(?=\s\d{9}$) to capture the date as shown. If sufficient, that should likely work with other such salary documents from the same source.

You would then use \1 in subsequent actions.
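The suggested expression can be tried against the posted line outside DEVONthink. This is a minimal sketch in Python, not DEVONthink's own engine; the `%m/%d/%y` format string is an assumption based on the sample dates (05/18/25 can only be month/day/year), and the conversion to an ISO date is done here in Python purely for illustration.

```python
import re
from datetime import datetime

# Values line from the converted text posted earlier in the thread.
line = "FIRST FAM 05/18/25 05/31/25 06/06/25 468446577"

# The regex from the post above: eight characters of digits and slashes,
# followed by whitespace and a nine-digit number at the end of the line.
pattern = re.compile(r"([\d\/]{8})(?=\s\d{9}$)", re.MULTILINE)

match = pattern.search(line)
advice_date = match.group(1)  # what \1 would refer to

# Assumed input format mm/dd/yy, rewritten as an ISO date (yyyy-mm-dd).
iso_date = datetime.strptime(advice_date, "%m/%d/%y").date().isoformat()
print(advice_date, iso_date)
```

The lookahead means only the date directly in front of the nine-digit number is captured, so the two earlier dates on the line are skipped without counting occurrences.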

That’s not the definition, but the dialog used to define it. Something like this would have been the definition:

Anyway, let’s start with the RE:

I’d try something like this instead
/(?:\d\d\/\d\d\/\d\d\s+){2}.*?(\d\d\/\d\d\/\d\d)/gm
I’m not sure if the enclosing slashes are needed, though.
In any case, that defines a capturing group (the last sequence of \d\d\/s). And you refer to that capturing group in the “Change Name” action as \1. As the documentation describes:

Regular Expression: Items in parentheses are captured; items outside parentheses are ignored. You can specify multiple captures in an expression. Using the captured text in subsequent actions is specified by using backslash, \ , and the number of the capture, starting at 1 . Note we use Apple’s NSRegularExpression which supports the ICU regular expression syntax.

I have no idea what the \K in your RE is about, so I left that out. I tested the RE with regex101.com and it seemed to work as expected.
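The same check can be reproduced offline. The sketch below uses Python's `re` module, which is an assumption of convenience and not the ICU engine behind NSRegularExpression; the enclosing slashes and the `gm` flags are dropped because Python takes the bare pattern. `\K` is a PCRE feature (it drops everything matched so far from the reported match) that Python's `re` does not accept, so the capture group does that job here. The name prefix `Advice` is hypothetical.

```python
import re

# Converted text as posted earlier in the thread.
text = (
    "Employee Name P/P Begin P/P End Advice Date Advice No Emp Num Location ER 403B\n"
    "FIRST FAM 05/18/25 05/31/25 06/06/25 468446577"
)

# Two dates, each followed by whitespace, then the captured third date.
pattern = re.compile(r"(?:\d\d\/\d\d\/\d\d\s+){2}.*?(\d\d\/\d\d\/\d\d)", re.MULTILINE)

match = pattern.search(text)
captured = match.group(1)

# Python's analogue of writing \1 in a later action: expand a template.
new_name = match.expand(r"Advice \1")
print(captured, "->", new_name)
```

Only the parenthesised part is captured; the two leading dates sit in a non-capturing group, so they are matched but never numbered.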

If you have the same string of digits following the relevant date as in your example (i.e. 468446577), you could use that as a postfix in a simple Scan Text Date action, instead.

Thx both again

does not seem like the smart rule actually renames the file. i tried both of your regexes and in both cases it just moved the file without any renaming

this is the current smart rule state, am i missing anything in my syntax?

Really appreciate both your help!

Z

Mine worked as I said it would…

You use neither of the REs that @bluefrog or I suggested (his is better than mine, provided that your date is always followed by a space and nine digits). So, yes, you are missing something.

I suggest you slow down a bit. Perhaps have a read of the manual and of all the posts, and then take a look at your smart rule again.

thx

i did of course try with multiple REs before, sadly without any success. i even tried 1:1 what @BLUEFROG suggested

i also tried running this over a text file

each time the file is just moved as is without any renaming to the target folder..

happy to further debug if im missing anything. for sure will keep reading the documentation and posts. worst case ill just leave it be with hazel

thx

Z

Just verified it again in 3.9.11 on macOS 15.5.
Is the PDF marked as locked or read-only?

thx @BLUEFROG, sorry, should have said i'm on the latest DT4 beta and macOS 15.3.2 (can upgrade)

pdf is not locked or read only

thx

Z

You should stay current with the OS point releases, i.e., 15.5. Not necessarily upgrades.

I’ve loved Hazel for many years but since a lot of the time it’s processing files before upload to DT, this thread has caught my attention. Is there perhaps a good way to replicate Hazel’s “match the nth instance of this text” (i.e. in the example above, the third date) ?

I know you can often find consistent characteristics in the source text to avoid this construction (e.g. in Bluefrog’s example, the date must be followed by a nine-digit number and the end of the line), and obviously you could do something like (?:[\d\/]{8}.*?){2}([\d\/]{8})(?=\s\d{9}$) but perhaps there’s a better way.
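One way to generalise the "nth instance" idea is to generate the pattern from the occurrence number: skip n-1 dates in a non-capturing group, then capture the next one. The sketch below is Python and only illustrates the construction; `DATE`, `nth_date_pattern`, and `nth_date` are hypothetical names, and whether the generated pattern behaves identically under ICU in DEVONthink is an assumption that would need testing there.

```python
import re

# Assumed date shape: two digits / two digits / two digits.
DATE = r"\d{2}/\d{2}/\d{2}"

def nth_date_pattern(n: int) -> str:
    """Build a regex whose group 1 is the n-th date on a line (n starts at 1)."""
    if n < 1:
        raise ValueError("n must be 1 or greater")
    # Skip n-1 dates (each followed by a lazy filler), then capture the next.
    return rf"(?:{DATE}.*?){{{n - 1}}}({DATE})"

def nth_date(text: str, n: int):
    """Return the n-th date found on one line of text, or None if absent."""
    match = re.search(nth_date_pattern(n), text)
    return match.group(1) if match else None

line = "FIRST FAM 05/18/25 05/31/25 06/06/25 468446577"
print(nth_date_pattern(3))
print(nth_date(line, 3))
```

Because `.` does not cross line breaks by default, the count is per line: the pattern finds the first line that holds at least n dates, which fits the payslip layout above but not a document whose dates are spread over several lines.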