PDF Smart Rule Date Problem

authsec · June 11, 2019, 8:52pm

Hi,

I am using the new Smart Rule feature available in DT3.

I was wondering if you could implement more precise control over which date is extracted from a PDF document. In the account statement below, for example, there are dates all over the place, and selecting the newest one in the document is not always correct.

For example, there may be descriptive text that says “For the planned trip on 01.01.2020”, which the algorithm would then detect as the newest date.

Maybe it is possible to give it a location on the paper to look for a date? E.g. on the first half of the first page find me a date that looks like “DD. Month YYYY”. Also in my case the document language is German while my Mac is running English as the preferred language.

Hazel seems to have something similar implemented where you can specify the format of the placeholder the algorithm is looking for, but maybe there are better options. Maybe the location/format is also something that can be hinted to the A.I./Machine Learning feature so it learns which date is the correct one?
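To make the request concrete, here is a minimal Python sketch of the kind of format hint described above. It is only an illustration, not how DT3 or Hazel work internally. The names `GERMAN_MONTHS`, `DATE_PATTERN` and `first_date_in_top_half` are hypothetical, and "first half of the first page" is approximated by the first half of the page's plain text, since a real layout position would need coordinates that plain text does not carry. The month names are matched from a fixed table, so the result does not depend on the Mac's preferred language.

```python
import re
from datetime import date

# German month names mapped to month numbers (no dependence on the system locale).
GERMAN_MONTHS = {
    "Januar": 1, "Februar": 2, "März": 3, "April": 4, "Mai": 5, "Juni": 6,
    "Juli": 7, "August": 8, "September": 9, "Oktober": 10,
    "November": 11, "Dezember": 12,
}

# Matches "DD. Month YYYY", e.g. "31. Mai 2018".
DATE_PATTERN = re.compile(
    r"\b(\d{1,2})\.\s*(" + "|".join(GERMAN_MONTHS) + r")\s+(\d{4})\b"
)


def first_date_in_top_half(page_text):
    """Return the first 'DD. Month YYYY' date in the first half of the text, or None."""
    # Assumption: the first half of the characters stands in for the top half of the page.
    top_half = page_text[: len(page_text) // 2]
    for match in DATE_PATTERN.finditer(top_half):
        day, month_name, year = match.groups()
        try:
            return date(int(year), GERMAN_MONTHS[month_name], int(day))
        except ValueError:
            continue  # skip impossible dates such as "31. Februar 2018"
    return None
```

Called with the plain text of the first page, this would return the statement date near the top and ignore a date mentioned further down, as long as the later date really falls in the second half of the text.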

cgrunenberg · June 12, 2019, 6:59am

The algorithm (no AI, no machine learning, no regex, just old-fashioned but fast & flexible code) doesn’t depend on the layout, document format or system language. It scans only the plain text of the document. Therefore…

…this might be an option in the future. But right now I’m wondering which date you would actually prefer in this case. 31. Mai 2018? And which placeholder do you currently use?

Stephen_C · June 12, 2019, 6:59am

I agree it would be good if DT3 could be a little more “Hazel like” in detecting dates in PDF files. I have the same problem with some of them as you do.

Stephen

authsec · June 13, 2019, 3:39pm

Understood. However, if you ever need a reason to fiddle with ML, this might be it.

That sounds promising, hopefully in the not too distant future. In this case I want it to use 31. Mai 2018 as the documentDate. Currently I am using the newestDate option, as this gives me the best results. However, every third month the statement contains some text that references a date further in the future, which unfortunately breaks it.

cgrunenberg · June 13, 2019, 3:41pm

Simply skipping future dates while scanning the document should probably fix this.
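As a rough illustration of that idea only (this is not DEVONthink's code, which, as noted earlier in the thread, does not use regex), a Python sketch could collect every date in the plain text, drop the ones later than today, and keep the newest of the rest. The names `NUMERIC_DATE` and `newest_non_future_date` are hypothetical, and only numeric "DD.MM.YYYY" dates are handled to keep the example short.

```python
import re
from datetime import date

# Numeric "DD.MM.YYYY" dates only; a stand-in for a real date scanner.
NUMERIC_DATE = re.compile(r"\b(\d{1,2})\.(\d{1,2})\.(\d{4})\b")


def newest_non_future_date(text, today=None):
    """Return the newest date in text that is not after today, or None."""
    today = today or date.today()
    candidates = []
    for day, month, year in NUMERIC_DATE.findall(text):
        try:
            found = date(int(year), int(month), int(day))
        except ValueError:
            continue  # ignore impossible dates such as 31.02.2018
        if found <= today:  # skip dates that lie in the future
            candidates.append(found)
    return max(candidates, default=None)
```

With a statement that mentions a trip planned for 01.01.2020, a scan run in June 2019 would ignore that date and fall back to the newest past date in the text.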

authsec · June 13, 2019, 3:49pm

WOW, you are fast!

I’m not sure how I can implement that. Can you give me a hint on how to do this? Do I need to write a script or something?

BLUEFROG · June 13, 2019, 4:14pm

He’s referring to the underlying process that scrapes the data for dates.