Help with very basic RegEx

Connor · November 11, 2023, 10:27am

Dear,

I am currently trying to get a grip on my documents using smart rules, as I realise that I cannot classify them all manually.

One of my regular invoices looks like this:

[Screenshot: Bildschirmfoto 2023-11-11 um 11.17.47]

It is not an OCRed document, but a directly downloaded PDF that already contains its data stream. I want to extract the invoice number (or Rechnungsnummer in this German example). To do so, I tried a RegEx that looks like this:

^R[0-9]{13}$

and the full smart rule looks like this:

[Screenshot: Bildschirmfoto 2023-11-11 um 11.23.39]

And while the RegEx was validated on several RegEx checker sites with the structure of the invoice number, I cannot get DevonThink to write the invoice number in the corresponding field.

So - what am I missing here?

Thank you for your support,
regards

Connor

chrillek · November 11, 2023, 10:59am

Your RE is probably too restrictive. To make sure, you should convert the PDF to plain text in DT (which basically extracts the text layer) and check that against your RE.
I’d forego the start/end anchors in the RE and simply use R\d{13}. Or perhaps use word delimiters \bR\d{13}\b
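
To illustrate the point about anchors, here is a minimal sketch in Python. It assumes the extracted text layer puts the invoice number on a line together with other text; the sample text is hypothetical, and Python's `re` module only stands in for whatever regex engine DEVONthink uses, which may differ in details.

```python
import re

# Hypothetical text layer: the invoice number shares its line with other words.
text = (
    "easybell GmbH\n"
    "Rechnungsnummer R2023007700824 Kundenreferenz\n"
    "Kundennummer 38XXXXX\n"
)

anchored = re.compile(r"^R[0-9]{13}$")                 # whole text must be the number
anchored_per_line = re.compile(r"^R[0-9]{13}$", re.M)  # a whole line must be the number
unanchored = re.compile(r"R\d{13}")                    # number may appear anywhere
word_bounded = re.compile(r"\bR\d{13}\b")              # anywhere, but as a separate word


def first_match(pattern, s):
    """Return the first match of pattern in s, or None if there is none."""
    m = pattern.search(s)
    return m.group(0) if m else None


for name, pattern in [
    ("anchored", anchored),
    ("anchored_per_line", anchored_per_line),
    ("unanchored", unanchored),
    ("word_bounded", word_bounded),
]:
    print(name, first_match(pattern, text))
```

In this sample the two anchored patterns find nothing, because no line consists of the invoice number alone, while the unanchored and word-bounded patterns both find it. A regex checker fed only the bare invoice number would validate all four, which may explain why the original pattern looked correct there.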

Unrelated: Why are you duplicating information into custom metadata? If you have “Rechnung” in the file name, what’s the point of having “Rechnung” in the “Dokumentenart” as well? Same for the “Unternehmen” and “Belegdatum” (the latter is actually triplicated: It’s the “Dokumentdatum”, it’s in the file name, and in a custom metadata field).
The more duplicate data you have, the more difficult it is to keep them in sync.

BLUEFROG · November 11, 2023, 3:33pm

Why are you using a regular expression here?

Does Kundennummer appear in more than one place in the document?
If not, just use Scan Text > String > Kundennummer * and the String placeholder.

chrillek · November 11, 2023, 3:57pm

The question was about Rechnungsnummer. And the issue with many PDFs (at least German ones) is that their text layer is weirdly organized in columns, much like ABBYY likes to do, too. If you have a simple layout like

Date            Amount
2023-01-07      12.34
2022-07-02      56.78

they’ll provide text as

Date
2023-01-07
2022-07-02
Amount
12.34
56.78

No fun to work with that – in the example, “Kundennummer” would be followed by “05.06.2021”.

In the current case, however, the “Rechnungsnummer” is an R followed by 13 digits, so a simple RE should match.
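
A small Python sketch of why the column-ordered text layer breaks a "label followed by value" lookup while a pattern on the value itself still works. The text layer below is made up for illustration (labels first, then values), and both helper functions are hypothetical; nothing here is DEVONthink's own API.

```python
import re

# Hypothetical column-ordered text layer: all labels first, then all values.
column_ordered = "\n".join([
    "Rechnungsdatum",
    "Rechnungsnummer",
    "Kundennummer",
    "05.06.2021",
    "R2023007700824",
    "38XXXXX",
])


def value_after_label(text, label):
    """Naive lookup: return the line directly following the label line."""
    lines = text.splitlines()
    if label not in lines:
        return None
    idx = lines.index(label)
    return lines[idx + 1] if idx + 1 < len(lines) else None


def invoice_number(text):
    """Pattern lookup: find 'R' plus 13 digits wherever it sits in the text."""
    m = re.search(r"\bR\d{13}\b", text)
    return m.group(0) if m else None


print(value_after_label(column_ordered, "Kundennummer"))     # a date, not the customer number
print(value_after_label(column_ordered, "Rechnungsnummer"))  # the next label, not the number
print(invoice_number(column_ordered))
```

The label-based lookup depends on the order of the text layer, whereas the pattern only depends on the shape of the value, which is why a regular expression is the more robust choice for this kind of PDF.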

BLUEFROG · November 11, 2023, 4:05pm

Gotcha. It would be interesting to see a PDF or two.

Connor · November 11, 2023, 5:48pm

A lot of action here - thank you both for your insights. I will try to answer your points one by one.

@chrillek : Your RE is probably too restrictive. To make sure, you should convert the PDF to plain text in DT (which basically extracts the text layer) and check that against your RE.
I’d forego the start/end anchors in the RE and simply use R\d{13} . Or perhaps use word delimiters \bR\d{13}\b

I tried your different suggestions, but it simply did not pick up the expression. Therefore, I am following your suggestion to re-OCR the file, which makes it possible to use a “simple” string text scan. However, I noticed that this creates a second file, which I still need to deal with.

@BLUEFROG, on why I used a regular expression: as @chrillek already explained, because of the structure of the text. However, I learned that I need to add a closing term as well, because now the result of Rechnungsnummer * is

R2023007700824 Kundenreferenz
Kundennummer 38XXXXX
Verbrauchsabhängige Kosten wie Telefon-Verbindungen werden für den zurückliegenden Kalendermonat berechnet.
Ankündigung zur SEPA-Lastschrift: Der Rechnungsbetrag in Höhe von 4,57 € wird mit Fälligkeit zum 11.09.2023 von Ihrem Konto abgebucht. Sofern Sie nicht selbst die Kontoinhaberin / der Kontoinhaber sind, möchten wir Sie bitten, die Rechnung an die entsprechende Person weiterzuleiten.
Sofern nicht innerhalb der vereinbarten Frist von 8 Wochen nach Rechnungszugang eine begründete Einwendung in Textform eingeht, gilt die Rechnung als genehmigt.
Fragen zu Ihrer Rechnung?
Telefon: +49 (30) 80951020 - kostenlos aus dem easybell Festnetz
E-Mail: billing@easybell.de
Servicezeiten von Mo.-Fr., 8:00-20:00 Uhr & Sa., 9:00-18:00 Uhr
easybell GmbH
Brückenstraße SA 10179 Berlin
Tel +49 (30) 8095 1020 fax +49 (30) 8095 1009 billing@easybell.de
Commerzbank AG
IBAN: DE35 4804 0035 0760 6890 00 BIC: COBAOEffXXX
Geschäftsführer: Dr. Andreas Bahr Steffen Hensche Martin Huth
Joris Van Rymenant
Amtsgericht Berlin-Charlottenburg HRB 137060
UST-ID 0E249984363
Einheiten
20 Min 2 Stk 1 Monat
Betrag (brutto)
0,00 € 1,58 € 2,99 €
3,84 € 0,73 €
4,57 €

Which is a little too much information for the invoice number. So I changed the condition to Rechnungsnummer * Kundenreferenz
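
For readers who think in regex terms, the effect of adding a closing term can be approximated in Python as follows. This is only a sketch: how DEVONthink actually evaluates the `*` wildcard is an assumption here, `scan_between` is a hypothetical helper, and the sample text is a shortened version of the output above.

```python
import re


def scan_between(text, start, end=None):
    """Approximate a 'start * end' wildcard scan with a regular expression.

    Without an end term the capture runs to the end of the text;
    with an end term it stops at the first occurrence of that term.
    """
    if end is None:
        pattern = re.escape(start) + r"\s*(.*)"
    else:
        # Non-greedy capture so the match stops at the first closing term.
        pattern = re.escape(start) + r"\s*(.*?)\s*" + re.escape(end)
    m = re.search(pattern, text, re.DOTALL)
    return m.group(1) if m else None


# Shortened sample of the re-OCRed text layer.
ocr_text = "Rechnungsnummer R2023007700824 Kundenreferenz\nKundennummer 38XXXXX\n"

print(repr(scan_between(ocr_text, "Rechnungsnummer")))
print(repr(scan_between(ocr_text, "Rechnungsnummer", "Kundenreferenz")))
```

The open-ended call returns everything after the label, and the bounded call returns only the invoice number. The trade-off is that the bounded version relies on “Kundenreferenz” always following the number in the text layer.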

@BLUEFROG: the PDFs are on their way to you via chat. Since they contain my postal address, I would rather not post them in the forum.

@chrillek, about the duplicated metadata: you are perfectly right. Maybe it is my inner Monk, but I do not like to have technical PDF names in my DMS. By my “logic”, I wanted a readable filename when I open the corresponding group, but I also know that the names are not really good for actual searching. This is why these things are at least duplicated. As for the “Belegdatum”, I simply had no idea that there is something like the “Dokumentdatum”; I only learned about it yesterday. Maybe I will take this as a reason to rethink my metadata structure.

Putting it all together: @BLUEFROG and @chrillek : Thank you both for your immediate support!

Regards

Connor