Extract Date from Regular Expression to Rename PDF file

arne · July 18, 2020, 1:02pm

OK, I apologise in advance, but I am a smart rule newbie:

From my credit card invoices, I am trying to extract the invoice date (“Rechnungsdatum”), which is displayed in the format DD.MM.YYYY, and use it to rename the file to “YYYY-MM-DD Credit card invoice”. For the life of me, I cannot extract the right date and convert it to the right format for naming the file.

chrillek · July 18, 2020, 1:11pm

There’s a thread about exactly this question in the main forum: How to specify exact date-form with additional characters (e.g. for Batch Processing?)
Maybe that helps.
Hint: I’d go for a regexp like Rechnungsdatum\s+\d{2}\.\d{2}\.\d{4}
Not tested, though.
If you want to split the parts of the date for renaming, you have to group the day, month and year using parentheses. Again, see the other thread.
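To illustrate the grouping idea outside of DEVONthink, here is a minimal Python sketch. It only assumes the label and date format described in the first post; the function name `invoice_name` and the sample text are made up for illustration, and this is not how a smart rule works internally.

```python
import re

# The suggested pattern, with day, month and year each wrapped in
# parentheses so they can be reused individually.
INVOICE_DATE = re.compile(r"Rechnungsdatum\s+(\d{2})\.(\d{2})\.(\d{4})")


def invoice_name(text, suffix="Credit card invoice"):
    """Return 'YYYY-MM-DD <suffix>', or None if no invoice date is found.

    Hypothetical helper, not part of DEVONthink.
    """
    match = INVOICE_DATE.search(text)
    if match is None:
        return None
    day, month, year = match.groups()
    return f"{year}-{month}-{day} {suffix}"


# Made-up sample text standing in for the PDF's text layer.
sample = "Kreditkartenabrechnung\nRechnungsdatum 18.07.2020\nBetrag 123,45 EUR"
print(invoice_name(sample))
```

The three groups come back in the order they appear in the text (day, month, year), so reordering them into the file name is just a matter of putting them together differently.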

arne · July 18, 2020, 1:28pm

Thank you, @chrillek!
That was one of the threads I looked at, and I didn’t understand anything…
I guess I have to go back to square one and first try to understand the syntax better…

BLUEFROG · July 18, 2020, 7:32pm

chrillek wrote:
Hint: I’d go for a regexp like Rechnungsdatum\s+\d{2}\.\d{2}\.\d{4}
Not tested, though.

And this may not work if the file has been run through OCR.

People need to be aware that what they see in a PDF is not necessarily how the text is structured in the text layer from OCR. In these situations, it can be useful to convert the PDF to plain text via the Data > Convert menu to assess the text.

wmc · July 18, 2020, 10:28pm

Amen. I have also seen the text structure change on occasion when scanning invoices from the same vendor that are visually in the same format from month to month.

chrillek · July 19, 2020, 8:14am

So one can only hope for the best. If one can’t rely on “Rechnungsdatum” preceding the desired date, one could extract all dates and use some heuristics to find the desired one. Here it would possibly be the second newest one. But this also works only if at least the dates are stored as consecutive strings in the text layer after OCR.
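A rough Python sketch of that heuristic, assuming the dates survive as intact DD.MM.YYYY strings in the text layer. The function names and the sample text are hypothetical, and “second newest” is only a guess that would have to be checked against real invoices.

```python
import re
from datetime import date

# Any DD.MM.YYYY date, regardless of the label in front of it.
ANY_DATE = re.compile(r"\b(\d{2})\.(\d{2})\.(\d{4})\b")


def all_dates(text):
    """Return every distinct valid date found in the text, oldest first."""
    found = set()
    for day, month, year in ANY_DATE.findall(text):
        try:
            found.add(date(int(year), int(month), int(day)))
        except ValueError:
            continue  # skip impossible dates such as 31.02.2020
    return sorted(found)


def second_newest(text):
    """Heuristic from the post: pick the second newest date, if there is one."""
    dates = all_dates(text)
    if len(dates) < 2:
        return None
    return dates[-2]


# Made-up text layer: billing period, invoice date, due date.
sample = "Zeitraum 19.06.2020 - 17.07.2020 Datum 18.07.2020 Zahlbar bis 03.08.2020"
print(second_newest(sample))
```

In the made-up sample the newest date is the due date, so the second newest is the invoice date. An invoice with a different set of dates would break this, which is why it is only a fallback.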

wmc · July 19, 2020, 10:15am

In my experience, the text structure does change, but not all that often; just be aware that you may occasionally see a file name result that is not what you expected. @BLUEFROG’s tip of converting to plain text is a good one to see what your algorithm is evaluating and what may need to change. When it happens, I either rescan the document to see if the text changes (it will, more than you might expect), or correct the file name manually and see if the problem persists with the next invoice.

chrillek · July 19, 2020, 10:23am

In your experience, is the text structure closer to the visual one in a PDF document that contains a text layer from the start (like when you print a Pages document to PDF or so) as opposed to the layer generated by OCR later?

wmc · July 19, 2020, 10:28am

Good question! More common in scanning to OCR, but that is also by far the majority case that I deal with.
edit: should have gone to @BLUEFROG.

wmc · July 19, 2020, 10:32am

One more comment: If you are scanning to OCR, you may get more consistent results by using a shorter string, e.g. “gsdatum” or whatever portion of the word is unique and preceding a date within the document. The more characters you are trying to match, the greater the chance of a scanning/OCR error causing an issue.
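A small Python sketch of why the shorter anchor can be more forgiving. The OCR error below is invented for illustration (an “e” read as “c”), and `find_date` is a hypothetical helper.

```python
import re

DATE_PART = r"\s+(\d{2})\.(\d{2})\.(\d{4})"

# Full label versus a shorter anchor taken from the end of the word.
FULL_ANCHOR = re.compile(r"Rechnungsdatum" + DATE_PART)
SHORT_ANCHOR = re.compile(r"gsdatum" + DATE_PART)


def find_date(pattern, text):
    """Return (day, month, year) as strings, or None if nothing matches."""
    match = pattern.search(text)
    return match.groups() if match else None


# Hypothetical OCR result with a misread letter early in the word.
ocr_text = "Rcchnungsdatum 18.07.2020"

print(find_date(FULL_ANCHOR, ocr_text))   # the full label no longer matches
print(find_date(SHORT_ANCHOR, ocr_text))  # the short anchor still finds the date
```

The trade-off is that a shorter anchor is also more likely to match some other word ending in the same letters, so it is worth checking that the fragment really is unique in the document.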

chrillek · July 19, 2020, 12:26pm

You’re right, of course. But we still don’t know if @arne is talking about a scanned document or not. Many of the receipts I get from German companies are already PDF+Text, so I don’t have to OCR them. With the notable exception of Deutsche Telekom: they send me one bill as PDF only, the other one as PDF+Text. Can’t remember which one is for mobile and which one for internet/fixed line.

wmc · July 19, 2020, 12:45pm

Good point; I missed that.

arne · July 19, 2020, 6:02pm

I do not scan the document but download the credit card invoice from the internet portal as a PDF+Text document.
I have indeed seen the layout change over time, but one thing was constant: the invoice date was always called “Rechnungsdatum”.
Not sure what to do now…

wmc · July 19, 2020, 6:45pm

I created a file with the information in your original post. The following smart rules work here. The first uses regular expressions; the second uses built-in smart rule processing with placeholders. You will of course need to modify the “Search in” field, and add/modify conditions as needed. As shown, the file name will not be modified, but an alert will appear showing the result.

When you are satisfied with the result, copy the entry field after “Display Alert”, change the action from “Display Alert” to “Change Name”, delete the “Name” placeholder that is automatically added, and paste in the name template that you copied from the Display Alert action. You can also change “On Demand” to “On Import” or whatever trigger you would like.

To add the placeholder in the second example, right-click in the field following “Display Alert”, choose Insert Placeholder > Document Date > (select the format you want at the top of the menu), then type in the text “Credit card invoice”.

EDIT: I just noticed an error in the regex line in the first example…I didn’t escape the dots, which are special characters for “match any character”. It will work, but will also find 2170272018! So please use this, for accuracy’s sake:
Rechnungsdatum\s+(\d{2})\.(\d{2})\.(\d{4})
@chrillek got it right in his example above.
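To see the difference, here is a quick Python check. The unescaped pattern is an assumption about what the first version looked like, reduced to just the date part.

```python
import re

# Assumed shape of the first attempt: the dots are unescaped,
# so each one matches any single character.
UNESCAPED = re.compile(r"\d{2}.\d{2}.\d{4}")
# Corrected version: \. matches only a literal dot.
ESCAPED = re.compile(r"\d{2}\.\d{2}\.\d{4}")

for candidate in ("21.07.2018", "2170272018"):
    print(
        candidate,
        "unescaped:", bool(UNESCAPED.fullmatch(candidate)),
        "escaped:", bool(ESCAPED.fullmatch(candidate)),
    )
```

Both patterns accept the real date, but only the unescaped one also accepts the ten-digit number, because the 7s in positions three and six are swallowed by the dots.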

arne · July 20, 2020, 7:37am

Wow, thanks a lot!
I re-created both Smart Rules and both find 23 documents. I do not get any display alert, but both find the same number of documents.
Can you explain for a newbie like me what the advantages/disadvantages of the different rules would be, please?

Thank you very much!

arne

wmc · July 20, 2020, 10:11am

As the rules are written, you need to apply them manually. Highlight one of the found documents and right-click, then Apply Rules > (name of rule). That should give you an alert with the new name.

Simply different approaches to the same problem, one using regular expressions, one using the built-in function of smart rules. As long as the result is set to “display alert” you can’t do any harm (it won’t rename the file so if something is wrong you have a chance to correct it). Experiment with both to see how they work…consider it a learning opportunity.

arne · July 20, 2020, 4:54pm

I think when you said “display alert” I expected a pop-up window with “ALERT” in big letters.
Indeed, the smart rules spat out the right documents.

I duplicated my database and ran the smart rules: both had the same effect. Since I am blond, I will stick with your second example as it seems much easier to remember…

Thank you, and all the others, again for helping me out here and showing me the awesome possibilities of DevonThink. I have been using it for more than 10 years, but to be honest I have completely neglected the automation options so far. That might change now…

arne

BLUEFROG · July 20, 2020, 5:31pm

And the automation options in DEVONthink 3 far exceed what could be done in DEVONthink 2!

arne · July 20, 2020, 7:41pm

That might be true, @BLUEFROG, but I am still struggling with whether to upgrade or not…
The absence of a constantly available sorter is really turning me off. I ran DevonThink only when needed, since I only use it to keep and sort my private documents. Whenever I got an email with an attachment, I could simply drag and drop the attachment into the sorter and delete the email. At the weekend (or on Friday afternoons), I started DevonThink to sort the documents, and that was that. To get something into DevonThink 3, I now need to start the program each and every time or keep it running in the background, which I am also not interested in. Unless I am missing something, this breaks my workflow and the automation I now use to clean a couple of things up.
But I guess that is all material for a different thread…

BLUEFROG · July 20, 2020, 10:22pm

Bear in mind DEVONthink 2.x is out of development and automation is not the only improvement. We do not suggest planning on staying with the 2.x line for the long term unless you don’t plan on doing anything more with it, including syncing (which is improved (and improving) in 3).