[DT3 Beta1] "Document date" placeholder in smart rule not always working

chrillek · May 10, 2019, 7:35am

I’m trying to automatically rename German Telekom’s bills with a smart rule. It’s fairly easy to recognize the PDF as a bill. However, the “document date” placeholder in a smart rule is unreliable. Sometimes it retrieves the full document date, sometimes it doesn’t, and I have no idea why or when. The month and year parts of the document date, however, never seem to work. Even if the document date as a whole is recognized correctly (e.g. 2019-05-06), the month and year parts are always zero (or empty; in any case they’re displayed as zeroes).
In addition, double-clicking a placeholder in the smart rule does not open the corresponding setting (like the date format for a “document date” placeholder) but converts the field into something like %sortabledocumentdate% in the rule, which then doesn’t work at all. It might be a good idea to either ignore double clicks on the placeholder, do something reasonable, or make sure that the %…% part is converted back into something that works.

cgrunenberg · May 10, 2019, 9:09am

Is the date in the text after converting the PDF to text via Data > Convert?

chrillek · May 10, 2019, 9:34am

Yes:
IHRE DETAILLIERTE FESTNETZ-RECHNUNG FÜR MAI 2019 02.05.2019

I tried the original process again with a new rule. This time, it worked like a charm. However, I’m quite certain that yesterday I had the problem with the document date’s year/month parts that I mentioned in the post. As I said in the subject line: “not always”.

cgrunenberg · May 10, 2019, 9:51am

In case of reproducible issues, a screenshot of the rule would be great.

chrillek · May 10, 2019, 10:12am

OK. I’ve set up a new rule for mobile phone bills (they differ from the landline ones in that they are PDF only, not PDF+Text). I’ll include a screenshot of the rule. It is working partially:

OCR is done on the original file (however, the original is not moved to the trash, as per the OCR global settings)
The name of the file is changed, but the document date is not inserted.
I suppose (!) that in this case the document date is not available to the rules engine?

Bildschirmfoto 2019-05-10 um 12.09.08.png (1246×600, 149 KB)

If need be, I’ll send you the original PDF via direct e-mail.

cgrunenberg · May 10, 2019, 10:15am

Smart rules don’t depend on preferences. But using the action “OCR > Anwenden” instead should make this work.

chrillek · May 12, 2019, 4:50pm

It makes this work “sort of”. In fact, there’s now a document date available when I use “OCR > Anwenden” (apply) instead of “OCR > searchable PDF”. However, the date is not what it should be: instead of the billing date, I get the date when the amount will be deducted from my account. To clarify, I’ve attached the relevant page (somewhat redacted for privacy reasons). As you can see, the document date is in the upper right, whereas the “deduction date” is in the lower left.
However, if I convert the OCR’ed document to pure text, the latter date appears before the former one.
Rechnung_2019_04_25104137000771 Kopie.pdf (469.1 KB)
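
The behaviour described above is what one would expect if the document date were simply the first date found in the extracted text. A minimal Python sketch of such a heuristic follows. It is not DEVONthink's actual algorithm (which is not documented in this thread); the function names and the sample text are made up for illustration.

```python
import re
from datetime import date

# German-style dates such as 02.05.2019 (day.month.year).
DATE_RE = re.compile(r"\b(\d{2})\.(\d{2})\.(\d{4})\b")

def dates_in_text_order(text):
    """Return every valid dd.mm.yyyy date in the order it appears in the text."""
    found = []
    for day, month, year in DATE_RE.findall(text):
        try:
            found.append(date(int(year), int(month), int(day)))
        except ValueError:
            # Skip digit groups that only look like a date, e.g. 99.99.2019.
            continue
    return found

def first_date(text):
    """Hypothetical 'document date' heuristic: the first date in text order."""
    dates = dates_in_text_order(text)
    return dates[0] if dates else None

# Made-up sample text: the deduction date precedes the billing date,
# as it can in the text layer of an OCR'ed PDF.
sample = "Abbuchung am 21.05.2019 ... Rechnungsdatum 06.05.2019"
print(first_date(sample))  # 2019-05-21, the deduction date
```

With a "first match wins" rule like this, the result depends entirely on the order of the text layer, not on where the dates sit on the page.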

Bozol · May 12, 2019, 6:24pm

Hi, I have the same problem. For example, I want OCR’ed PDFs that are already in DT3 to be renamed to the document date plus the group name. Sometimes the document date is not recognized at all, or a second (wrong) date in the document is selected. From time to time, future dates are produced as well. The file names of documents where recognition failed then start with 0-00-00.

Unfortunately, it is not possible to undo a rule with CMD-Z when it did not run as expected, so the old state cannot be restored.

The routine for recognizing the document date also seems to have trouble when the date in the document is followed by the time at which it was printed.
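
As an illustration of the kind of guard that would avoid names starting with 0-00-00, here is a small Python sketch. It assumes dates in dd.mm.yyyy form and treats future dates as implausible for a document date; both are assumptions for this example, and `plausible_date` and `new_name` are hypothetical helpers, not part of DEVONthink.

```python
import re
from datetime import date

# dd.mm.yyyy; a trailing print time such as "14:32" is simply left unmatched.
DATE_RE = re.compile(r"\b(\d{2})\.(\d{2})\.(\d{4})\b")

def plausible_date(text, today):
    """Return the first valid, non-future date in the text, or None."""
    for match in DATE_RE.finditer(text):
        day, month, year = (int(group) for group in match.groups())
        try:
            candidate = date(year, month, day)
        except ValueError:
            continue  # not a real calendar date
        if candidate <= today:
            return candidate
    return None

def new_name(old_name, text, group, today):
    """Build '<date> <group>', or keep the old name if no date was found."""
    found = plausible_date(text, today)
    if found is None:
        return old_name  # avoids names like '0-00-00 ...'
    return f"{found.isoformat()} {group}"

today = date(2019, 5, 12)
print(new_name("Scan 0001", "Gedruckt am 02.05.2019 14:32", "Telekom", today))
print(new_name("Scan 0001", "kein Datum im Text", "Telekom", today))
```

The first call prints `2019-05-02 Telekom`; the second keeps `Scan 0001` because no date was recognized.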

What do you think about a preview function for rules that lets you check the expected result for possible errors in advance?

Greetings
Fred

BLUEFROG · May 13, 2019, 2:02am

chrillek wrote: “However, if I convert the OCR’ed document to pure text, the latter date appears before the former one.”

Note: The underlying text in a PDF does not necessarily match what you’re looking at onscreen. PDFs aren’t built like page layout or word processing application documents.

The date I’m getting is 2019-05-09. This date also appears first in the text after doing OCR (German language, but also tested with English) and then converting to plain text.
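
To illustrate the note about stored text order versus what appears onscreen, here is a simplified Python sketch. The fragments and their coordinates are invented, and `y` is measured from the top of the page for readability (real PDF coordinates start at the bottom left). It only models the idea and does not read a PDF.

```python
from typing import NamedTuple

class Fragment(NamedTuple):
    text: str
    x: float  # distance from the left edge of the page
    y: float  # distance from the top edge of the page

# Hypothetical fragments in the order they are stored in the file.
stored_order = [
    Fragment("Deduction date", x=50, y=700),  # lower left on the page
    Fragment("Billing date", x=400, y=80),    # upper right on the page
]

def visual_order(fragments):
    """Sort fragments top to bottom, then left to right."""
    return sorted(fragments, key=lambda f: (f.y, f.x))

print([f.text for f in stored_order])                # stored order
print([f.text for f in visual_order(stored_order)])  # reading order
```

In this example the stored order lists the deduction date first, while the reading order lists the billing date first, which is why a plain-text conversion can differ from what the page shows.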

chrillek · May 13, 2019, 9:23am

Hi,
I’m aware that a PDF is not a page layout. I tried again with the original version of the document and got 21-05-2019 as the document date (German formatting, obviously). I converted the original PDF to PDF+Text with German as the main language.
If you’re interested, I could send you the original document in a personal e-mail.

BLUEFROG · May 14, 2019, 3:43pm

Hold the Option key and choose Help > Report bug to start a support ticket and attach any documents you feel would be helpful.