Whitespace in RegEX

astro · July 17, 2021, 9:50pm

My bank statement is constructed as:

A1 A2
B1 B2

When I try to capture A2 with the pattern

A1\s([A-Za-z0-9,]+).*

it grabs B1 instead.

I tried the other whitespace variants from the regex cheat sheet.

None of them works. Any hints for this?

DrJJWMac · July 18, 2021, 12:41am

Presuming that none of the values A1, A2, B1, B2 contains whitespace, the second group of this expression should return what you need:

(\S+)\s(\S+)

If A1 … B2 can contain whitespace, you will need further qualifying information about those fields (e.g. their length or a common start or end pattern).
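
A minimal sketch of that expression in Python, used here only as a stand-in regex engine (the thread itself is about DEVONthink smart rules). The sample row is the placeholder from the question, and the variable names are illustrative.

```python
import re

# Placeholder row from the question; the fields are assumed to contain no whitespace.
line = "A1 A2"

# Two runs of non-whitespace separated by exactly one whitespace character.
match = re.fullmatch(r"(\S+)\s(\S+)", line)
first_field, second_field = match.groups()
print(second_field)  # A2
```

If a field itself contains a space, this pattern no longer lines up with the fields, which is the caveat above.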

–
JJW

astro · July 18, 2021, 2:19am

I apologize for my rather short description above, but it is a bank statement, so it is a bit tricky to share.
A1 and B1 stand for the text fields and their respective positions when marked/selected.
These are the rows with name, account number, IBAN, amount etc. and their respective values.
(Generated directly by the bank and already searchable, I guess.)
When copy/pasted into an editor, it shows this:
A1
B1
A2
B2

So I ran OCR on the PDF in DT again. It grew 2.5 times in size and got blurry, but the text fields stayed neatly in their positions.
The smart rule’s regular expression produced exactly the same result.
So I guess the regular expression sees A1 and B1 as adjacent items and reads the text column-wise from top to bottom.
The same effect can be reproduced by selecting table-like text in a PDF.
But I can’t reproduce it with my pattern-checking editor.
My best hope was that this phenomenon of “jumping” to the next line would be a known issue.

chrillek · July 18, 2021, 5:58am

This has not much to do with regular expressions or DT.
Apparently, your bank uses a peculiar method to generate the text layer of the account statements. That’s not a rare thing, I’ve seen that too. OCRing the document again will presumably not change that.
If A1 is followed by B1 in the text layer (even on a new line), there’s nothing an RE can do. It does of course not look at the visual appearance of the PDF, only at the text layer.
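
To illustrate, here is a small Python sketch that assumes the text layer stores the first column before the second (the order reported after copy/paste). It runs the pattern from the first post against that string; Python's `re` is only a stand-in for whatever engine DT uses.

```python
import re

# Assumed text layer: column one (A1, B1) is stored before column two (A2, B2).
text_layer = "A1\nB1\nA2\nB2"

# \s also matches a newline, so the group captures whatever follows A1
# in the text layer, not what sits next to it on the page.
match = re.search(r"A1\s([A-Za-z0-9,]+).*", text_layer)
captured = match.group(1)
print(captured)  # B1
```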

Frankly, I’d not try to solve this (unsolvable) problem. Rather, I’d use banking software to get at the data. Or download account statements as CSV, if that’s an option.

brookter · July 18, 2021, 6:56am

With some diffidence, because I’m not an expert… Assuming that you have the text

A1
B1
A2
B2
A3
B3

then the regex

%s/^\(.*\)\n\(.*\)/\1@\2/

turns it into

A1@B1
A2@B2
A3@B3

Which I think is what you’re after? (I’m using @ to differentiate between the two fields for clarity; obviously you can use whatever is appropriate. E.g. if you wanted, you could easily turn this into a markdown table with %s/^\(.*\)\n\(.*\)/| \1 | \2 |/)

| A1 | B1 |
| A2 | B2 |
| A3 | B3 |

All the regex is doing is capturing everything in the first line, then the return character, then the next line, before stripping out the return characters.

Caveat: I used Vim to do this as it’s the regex version I know best. Other versions may need slightly different syntax. I’m not wholly sure how you’d do this in native DT3. My approach would be to convert the PDF to plain text, open it in Vim, then run this regex on the file. If you wanted to do this a lot then I’d have a go at an AppleScript using a sed expression, but that’s a bit above my comfort level…
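
For comparison, a rough Python equivalent of the Vim substitution above. This is a sketch, not a DT3 recipe: it assumes the converted plain text is already available as a string, and the names `pair`, `joined` and `table` are made up for the example.

```python
import re

text = "A1\nB1\nA2\nB2\nA3\nB3"

# One line, a newline, then the next line.
# re.MULTILINE lets ^ match at the start of every line.
pair = re.compile(r"^(.*)\n(.*)", re.MULTILINE)

joined = pair.sub(r"\1@\2", text)       # A1@B1, A2@B2, A3@B3
table = pair.sub(r"| \1 | \2 |", text)  # markdown table rows
print(joined)
print(table)
```

Python does not need the backslashes before the parentheses that Vim's default regex mode requires.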

Blanc · July 18, 2021, 7:10am

If you have a known before and after the block of text which you are trying to capture then you can cheat using something along the lines of the following script:

set documentText to plain text of theRecord
set t to offset of "this is the first text" in documentText
set t to t + 23 # length of first text + 1
set tt to offset of "this is the second text" in documentText
set tt to tt - 1
set theResult to texts t thru tt of documentText

It’s worth looking at the plain text of the document yourself, as it may deviate from what you see in the PDF. But I’ve used the above method on various documents and always been successful after some playing.
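
The same idea as a Python sketch, for readers who want to try it outside DT. The helper name `text_between` is hypothetical. Unlike the AppleScript above, it looks for the second marker only after the first one and trims surrounding whitespace instead of adding fixed offsets.

```python
def text_between(document_text, before, after):
    """Return the text between two known markers, or None if either is missing."""
    start = document_text.find(before)
    if start == -1:
        return None
    start += len(before)
    # Search for the closing marker only after the opening one.
    end = document_text.find(after, start)
    if end == -1:
        return None
    return document_text[start:end].strip()

# Placeholder document using the marker strings from the script above.
sample = "this is the first text\nA2\nthis is the second text"
result = text_between(sample, "this is the first text", "this is the second text")
print(result)  # A2
```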

brookter · July 18, 2021, 7:13am

Thanks! I’ll have to study that — my AppleScript isn’t that great, but it looks useful.

chrillek · July 18, 2021, 8:55am

While this approach is interesting, it only works if the text following the first match is exactly the same and the OCR recognizes it as such. Since the OP is not giving a lot of detail here, it is difficult to judge if this approach could be successful most of the time.

Another approach might be to convert the PDF to JPEG and OCR that. In that case, there’s no text layer at all, so the OCR should (!) just work from left to right, which seems to be what the OP is after.

@astro: While I understand that you do not want to post details of your account statements, I do not understand why you can’t use random, but meaningful data. REs are meant to capture (nearly) arbitrarily complex patterns – reducing reality and complexity to a trivial case like here makes it a lot harder than necessary to solve the problem at hand.

Also, using \s to match a space is probably not a good idea in this case, since you can’t be sure if the OCR recognizes one or two or even three spaces. \s+ is a lot more robust.
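
A quick way to see the difference, sketched in Python with a made-up OCR line that has three spaces between the two placeholder fields:

```python
import re

# Hypothetical OCR output: three spaces between the two fields.
ocr_line = "A1   A2"

strict = re.search(r"A1\s(\S+)", ocr_line)   # exactly one whitespace character
robust = re.search(r"A1\s+(\S+)", ocr_line)  # one or more whitespace characters

print(strict)           # None
print(robust.group(1))  # A2
```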

AppleScript is ok, but it has no idea of REs (unless you go the extra mile and wed it with Objective-C). I’d rather go with JavaScript in this case (unless you’re aiming for a smart rule: in that case, it’s not yet possible to use external JS scripts).

Blanc · July 18, 2021, 11:37am

Agreed; in this case I understood the OP to be saying that the document comes with a text layer and is highly standardised, which makes it the ideal candidate for my approach. It doesn’t even require brackets. It’s an approach I have used numerous times (admittedly also due to my incompetence re RegEx).

chrillek · July 18, 2021, 12:30pm

Yes, if and only if the first text is always the same, this approach works. We don’t know if that’s the case, though.
Apart from that: can anything without brackets ever be useful at all?

rkaplan · July 18, 2021, 1:15pm

Or perhaps do a “Print to PDF” of the PDF so that you flatten the file into only one text layer.

BLUEFROG · July 18, 2021, 3:42pm

This is always true if you’re going to try to do automation on PDFs.
PDFs of the same kind, like bank statements, should be conforming unless the layout changes at some point.
PDFs from new sources should be converted to plain text and the text examined for the real underlying structure.

astro · July 18, 2021, 9:29pm

Thank you very much for your responses.

I started with the first tip: converting it to .jpg and then running OCR again.

The result is the same. I added a screenshot of a piece of the document with no personal information.

I also tried to capture the effect when selecting the text. If I want to select the two items in the first row, all the items in the left column are selected before the one in the second column turns blue.

This is something I see often, so I thought it would be a common problem.

Just to give you some context: I am a pretty lazy person when it comes to organizing stuff. But I really wanted to get to know DT better and work on my workflows during the summer break.

So I thought I’d push myself to start learning about automation on the most unimportant, simplest and most standardized docs I have: the bank statements.

Little did I know about my fate…

I need to check if and how the script works. Reporting back.

Blanc · July 19, 2021, 4:42am

So, start by examining the plain text of your document; I would use the original (as it comes from the bank) to start off with, as if that works we can avoid all the other conversion steps.

This script will create a plain text copy of your document in your inbox.

tell application id "DNtp"
	set theRecords to selected records
	repeat with theRecord in theRecords
		set documentText to plain text of theRecord
		set theTitle to name of theRecord
		create record with {name:theTitle, type:txt, content:documentText} in incoming group
	end repeat
end tell

Copy the script to Script Editor, select the record in DT and then run the script from Script Editor.

The resulting plain text is what any further script (or your RegEx commands) sees when working with the original file in question. See whether you can find the information you are trying to extract. See what other information it is enclosed in. Convert another document, and see whether the pattern is the same. If so, my approach as posted above will work. You may actually even be able to use your RegEx approach, depending on the pattern you find.
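
If the exported text is hard to read by eye, one option is to print each line with its whitespace made visible. A small Python sketch, assuming the text has been pasted into a string (the sample content is a placeholder with a deliberate trailing space):

```python
# Placeholder for the exported plain text; note the trailing space after B1.
plain_text = "A1\nB1 \nA2\nB2"

# repr() shows trailing spaces, tabs and other invisible characters explicitly.
numbered = [(number, repr(line))
            for number, line in enumerate(plain_text.splitlines(), start=1)]
for number, shown in numbered:
    print(number, shown)
```

Printed this way, it is easier to tell whether a label and its value are separated by a space, several spaces or a line break.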

BLUEFROG · July 19, 2021, 3:44pm

This is a good example of how a computer isn’t a human …

What you see as BIC being on the same line as NORSDE71XXX, OCR detected and put on separate lines. And this is an example of a conversion that is more closely related to the visual appearance. Many OCR’d documents’ text layer varies drastically from the visual presentation.

So in your case, there is no BIC then a space then some value. They’re on separate lines.

You could use a newline as in this example…

Note that the filename has been changed here to the correct value (ignoring that the OCR may have detected an O as a zero)…
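
A minimal sketch of a newline-based match, written in Python rather than as a DT smart rule. It assumes the text layer holds the label BIC on one line and the value on the next, as described above; the exact pattern used in the screenshot is not reproduced here.

```python
import re

# Assumed text layer: label and value on separate lines.
text_layer = "BIC\nNORSDE71XXX\n"

# Match the label at the start of a line, then a newline, then the value.
match = re.search(r"^BIC\n(\S+)", text_layer, re.MULTILINE)
bic = match.group(1)
print(bic)  # NORSDE71XXX
```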

rkaplan · July 19, 2021, 3:56pm

You can flatten the PDF by doing “print to PDF” for the PDF.

Then these issues should be mostly eliminated.

BLUEFROG · July 19, 2021, 3:57pm

What issues?

astro · July 19, 2021, 4:02pm

May I ask how the order in the picture came about?

In my PDF it puts column 2 below column 1. That’s why I got the problem.
The values are not next to each other (above or below) but in totally different positions.

I thought that was because this is the way they are selected.

rkaplan · July 19, 2021, 4:14pm

Different visual layers create a gap between what you expect from OCR and what you get.

BLUEFROG · July 19, 2021, 4:21pm

Post a screencap of your converted text file.