I am attempting to rename a PDF+Text document (a bank statement) so that the name includes the latest date in the document, which is the end date of the period covered by the statement.
However, the rule is picking up an earlier date and returning it as the latest date, because it is swapping the month and day.
Here are the two pieces of text in the document that I am referring to:
“31 January 2022” - the correct latest date
“11/01/22” - this means 11 January 2022 and is not the latest date, but the rule is returning 01 November 2022 as the latest date based on this text.
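To make the ambiguity concrete, here is a small Python sketch. It is only an illustration of the parsing problem, not DEVONthink's own date scanner: the same text yields two different dates depending on which field order is assumed.

```python
from datetime import datetime

raw = "11/01/22"

# Read as DD/MM/YY (what the bank statement intends).
day_first = datetime.strptime(raw, "%d/%m/%y").date()

# Read as MM/DD/YY (what the rename rule appears to be doing).
month_first = datetime.strptime(raw, "%m/%d/%y").date()

print(day_first.isoformat())    # 2022-01-11
print(month_first.isoformat())  # 2022-11-01
```

Under the month-first reading, 1 November 2022 is later than 31 January 2022, which would explain why it is picked as the latest date.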
There is a thread from 2018 that highlighted a similar issue and it was noted that DT does not reference the system setting for region or date and it was being looked into.
Is there any way around this problem?
(I would like to avoid using Hazel first as it adds manual steps. I have close to 500 statements across more than 30 accounts to rename each year and have not yet been able to utilise DT to achieve this. I have considered AI but am concerned about the confidential info in the statements.)
To be honest, as much as I use DEVONthink, I’ve found it much easier to use Hazel to automatically pre-process incoming financial statements. The PDFs created by or from these institutions can be tricky and fiddly when it comes to matching up a label with its info, e.g. date or account. How the text is displayed in the PDF is not always how it lines up internally. Hazel has a pretty good interface to help figure that out. I don’t know what you mean by manual steps, as I’ve been able to fully automate it.
As I deal with 30+ bank accounts, with Hazel I believe I would have to set up 30+ separate routines as each file name should include the bank account name (in an abbreviated format I use).
In DT4, I add each PDF to the relevant group based on which bank account it is. DT automatically adds a tag to PDFs in each group with the tag being that abbreviated format bank account name. The smart rule in DT adds that tag to the document name.
The problem I have is that:
Hazel can do dates well, but seemingly cannot easily manage bank account naming. It could if a Hazel rule allowed nested if statements, i.e. if this customer number appears, include this abbreviated bank account name in the file name; if not, then if that customer number appears, include that abbreviated bank account name in the file name; and so on. I don’t believe it has that capability, but I would be pleased if there is an easy way to handle this in Hazel; and
DT4’s smart rule will happily include a tag in the name (thus bank account name solved), but I cannot determine how to achieve the dating accurately, due to dates being interpreted as being in a different format to what they are. Hence the call for advice on getting around the date recognition problem I am experiencing.
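The nested-if idea in the first point amounts to a lookup table from customer number to abbreviated account name. As a rough Python sketch, outside of both Hazel and DT: the customer numbers, abbreviations, and the function name `account_abbreviation` are all made up for illustration.

```python
# Hypothetical customer numbers and abbreviations, for illustration only.
ACCOUNT_BY_CUSTOMER_NUMBER = {
    "12345678": "BankA-Chq",
    "87654321": "BankB-Sav",
}


def account_abbreviation(statement_text):
    """Return the abbreviation for the first known customer number found."""
    for number, abbreviation in ACCOUNT_BY_CUSTOMER_NUMBER.items():
        if number in statement_text:
            return abbreviation
    return None  # no known customer number appears in the text


print(account_abbreviation("Customer number: 87654321"))  # BankB-Sav
```

One table with 30+ entries replaces 30+ separate rules, and adding an account means adding one line.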
It will be interesting to see if system language is a factor following @cgrunenberg’s question.
When I run this smart rule, I get the wrong value: 12 Jan 2022.
However when I change the date format to the following:
and run the rule again, I get the value 1 Dec 2022.
In other words the scanner follows the system settings. If you have mixed values you will have to use regular expressions to capture the individual components and use those to manually craft the proper date.
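As a rough illustration of that regular-expression approach, here is a Python sketch rather than a DEVONthink smart rule or script. It assumes the statement only uses the two formats from the opening post (DD/MM/YY and "31 January 2022"), and the function names are made up for the example.

```python
import re
from datetime import date

MONTHS = {name: number for number, name in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"], start=1)}

# DD/MM/YY or DD/MM/YYYY, always read as day first.
NUMERIC = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4}|\d{2})\b")
# e.g. "31 January 2022"
LONG = re.compile(r"\b(\d{1,2})\s+(" + "|".join(MONTHS) + r")\s+(\d{4})\b",
                  re.IGNORECASE)


def find_dates(text):
    """Collect every date in the text, built from the captured components."""
    found = []
    for day, month, year in NUMERIC.findall(text):
        full_year = int(year) if len(year) == 4 else 2000 + int(year)
        try:
            found.append(date(full_year, int(month), int(day)))
        except ValueError:
            pass  # not a real day-first date, e.g. 01/13/22
    for day, month_name, year in LONG.findall(text):
        try:
            found.append(date(int(year), MONTHS[month_name.lower()], int(day)))
        except ValueError:
            pass  # e.g. 31 February
    return found


def latest_date(text):
    """Return the latest date found, or None if there are no dates."""
    dates = find_dates(text)
    return max(dates) if dates else None


sample = "Period ending 31 January 2022\n11/01/22 Card payment\n13/01/22 Transfer"
print(latest_date(sample).isoformat())  # 2022-01-31
```

Because the day, month and year are captured separately and assembled by hand, the result no longer depends on the system date format.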
Hi there. I had similar challenges with dates. I too was very heavily reliant on Hazel for date extraction in the file renaming process. Over the recent holiday, I had time to convert my Hazel rules to DT Smart Rules. I strongly recommend converting the OCR’d document to Plain Text within DT and inspecting that file to see how the data was OCR’d. If the DT-supplied date variables do not produce the expected results, try using RegEx with the ScanText function. You’ll get there.
Agree. May well engage there depending on the outcome of DT investigations.
With my past use of Hazel, I have also been able to extract things like financial institution (insto) name and account number.
My concern with using Hazel is the need to set up a different rule for each account (over 30), and the way Hazel finds the data (such as account number) seems to depend on the format of the statement (not so with DT’s smarts, I believe). I am regularly faced with an insto changing the format of its statements, and then it’s back to Hazel to fix. I’m not trying to find a solution to this Hazel issue here and will take it to the Hazel forum if I conclude it is worth trying, but I thought DT with use of non-cloud AI might be able to find this info without setting up separate rules for each insto or account. That is what I will now explore with help from the fab community here and my review of DT4’s ability to work with non-cloud AI such as Ollama, as documented in DT’s excellent help and articles.
Could one of the following be possible to recognise DD/MM/YY:
a user option to favour interpretation of one or the other; or
DT be made smart enough to interpret what 11/02/22 means. This could be possible in some cases (and not all, so not an optimal solution), such as the document I am using in the example where there is a long list of transactions including dates such as: 13/01/22, 14/01/22, (which surely provides a strong clue as to the intended meaning of 11/01/22)?
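The second idea can be sketched in a few lines of Python. This is only an illustration of the heuristic, not something DT does as far as this thread shows, and the function name `infer_day_first` is made up: any field above 12 cannot be a month, so one unambiguous date in the document settles the order for the rest.

```python
import re

NUMERIC = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4}|\d{2})\b")


def infer_day_first(text):
    """Return True for DD/MM, False for MM/DD, None if undecidable."""
    first_over_12 = False
    second_over_12 = False
    for first, second, _year in NUMERIC.findall(text):
        if int(first) > 12:
            first_over_12 = True   # first field cannot be a month
        if int(second) > 12:
            second_over_12 = True  # second field cannot be a month
    if first_over_12 and not second_over_12:
        return True
    if second_over_12 and not first_over_12:
        return False
    return None  # every date is ambiguous, or the clues conflict


print(infer_day_first("11/01/22 13/01/22 14/01/22"))  # True
print(infer_day_first("11/01/22"))                    # None
```

As noted, this only works when at least one date in the document has a day above 12, so it is a fallback rather than a complete solution.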
I don’t have scripting skills and my experience to date is limited to modifying others’ (Apple Script) scripts kindly posted to this forum, but I’m willing to learn.
This produced the awry results outlined in the opening post.
As per your hint, I changed the date format to:
After restarting DT and applying the smart rule to my document, 11/01/22 was no longer picked as the latest date; 31 January 2022 was correctly identified instead, and the renamed file contained 2022-01-31 (which is the result I wanted, in the format I wanted!) rather than the erroneous 2022-11-01 from before.
So does this mean DT will correctly interpret dates as long as the Date Format setting in System Settings matches the format in which the dates appear in the subject document, rather than supporting only MM/DD/YY as @cgrunenberg indicated in an earlier post?
I would prefer my system setting to be YYYY-MM-DD as I find in an ordered list of dates it is logical to my eye and will sort correctly when read as text (and it’s the ISO standard). However, to make running this rule work I could flip this setting to DD/MM/YYYY and change back later.
Thank you for showing the significance of the Date Format in System Settings to this outcome. It solves my problem for this particular financial institution’s statement.
PS Just renamed 45 statements for this account in seconds!
Good suggestion. I took a look and it seemed to be correctly OCR’d.
Thanks for your encouragement. As noted in a prior post, my scripting capability is near zilch, but I will investigate RegEx if I run into difficulties, noting that I have solved the problem with the statement in my opening post. I am confident there will be more issues to overcome and will keep your advice in mind.
RegEx can seem daunting but it is simple given some time. Use ChatGPT to get the regular expression syntax for the information you are looking to grab. Use Regex101.com to test. Be patient, good luck and continue to leverage this great forum.