Is there any way to match text using regular expressions in the content of an OCRed PDF? I presume one difficulty could be the way the OCR engine handles line breaks.
This would specifically help with renaming a file based on a content match.
The simplest regex is of course a plain text search for a word, which works great. If I now add, for example, a receipt from a restaurant or a store, a smart rule I have set up in DT3 automatically renames the file with the matched text (the restaurant or store name).
I am aware of the regex rename script, but that takes the filename as a source. Any way to use the content?
For those who are interested in regex beyond the link above, check out this link for example. Also very handy in iOS Shortcuts by the way.
What I mean is whether or not the output of the OCR engine inserts a line break character you could match with a \n pattern. That can help in identifying a logical text unit when matching the text, but it is not a prerequisite of course.
Another option I considered was using the OCR output (content) as input for the existing RegEx script. Any idea whether this would be feasible?
Also note that the output of an OCR engine is inconsistent enough from document to document that regex would likely be of little help.
Well, you’re more familiar with the back-end of course. But I would have guessed a similar document would result in a more-or-less similar OCR output. Or at least similar enough to do some rough automatic renaming. But the question remains whether the content is available for a script in the first place.
What I mean is whether or not the output of the OCR engine inserts a line break character you could match with a \n pattern.
That is impossible to tell. OCR does not necessarily produce a text layer that mimics any of the layout of the page. It may, but I wouldn’t say it will be consistent enough to guarantee.
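Since line breaks in the text layer cannot be relied on, one way around the question is to write patterns that do not care where the breaks fall. Here is a minimal sketch in Python (for illustration only; the thread itself uses AppleScript and sed). The sample text and the store name "Corner Bistro" are made up.

```python
import re

# Hypothetical OCR output: the store name happens to be split across lines.
ocr_text = "Thank you for visiting\nCorner\nBistro\nTotal 12.50"

# \s+ matches spaces, tabs and newlines alike, so the pattern does not
# depend on where the OCR engine decided to break the line.
STORE_RE = re.compile(r"Corner\s+Bistro", re.IGNORECASE)


def mentions_store(text: str) -> bool:
    """Return True if the store name occurs, however it is wrapped."""
    return STORE_RE.search(text) is not None


print(mentions_store(ocr_text))
```

The same idea should carry over to other regex flavours: match "any whitespace" between words instead of a literal space or a literal \n.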
Another option I considered was using the OCR output (content) as input for the existing RegEx script. Any idea whether this would be feasible?
If you have OCR’d a PDF, you could try this construct on it…
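tell application id "DNtp"
	-- Loop over the records currently selected in DEVONthink
	repeat with thisRecord in (selection as list)
		-- The OCR text layer is available as the record's plain text
		set theText to plain text of thisRecord
		-- theText can now be passed on to a regex match
	end repeat
end tell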
Again, you’d have to determine the quality of the text produced by OCR.
But I would have guessed a similar document would result in a more-or-less similar OCR output.
If you’re referring to two different documents, similar could be very subjective.
On a side note: a capture with an iOS device is less likely to produce a good result than an actual scanner. This is often due to poor lighting and a lack of contrast in the picture. (And yes, I refuse to call it a ‘scan’ in iOS because it’s not scanning anything. Pedantic? After 32 years in graphic arts and printing, Yes, yes I am on this one.)
If you have OCR’d a PDF, you could try this construct on it…
Thanks! I’ll give it a go. In MoSCoW terminology this is definitely a C for me.
And yes, I refuse to call it a ‘scan’ in iOS because it’s not scanning anything. Pedantic? After 32 years in graphic arts and printing, Yes, yes I am on this one.
Hmm, this brings back a memory about a mortgage application I went through, where one of many documents was refused by the bank as it wasn’t a “proper” scan, but an iOS photo. They were thorough, I must give them that.
After some internal sighs, and futile attempts to get a definition of what they deemed a scanner, I dragged myself toward a likely “proper” scanner and scanned the document, after which it was accepted. So… besides the graphic arts and printing branch, this particular bank also would have agreed with you.
It may, but I wouldn’t say it will be consistent enough to guarantee.
Well, it might be the performance of my scanner, the OCR engine or the combination of both, but the output I get (for receipts at least) is pretty consistent. Sure, there are some that are skewed, misread etc., but those are exceptions in my case.
If you have OCR’d a PDF, you could try this construct on it…
In order to have the script actually match the text correctly, I first had to dive into the sed syntax, which appears to be a thing of its own in combination with AppleScript.
But with generous parts of the “Rename using RegEx” script and the “Append Selected Text” script I was able to automatically append the total paid amount printed on most of my receipts. Yay!
So now when I scan a receipt DEVONthink will automatically:
rename the file based on the company/restaurant/store name
append the total amount paid printed on the receipt
archive the receipt in the correct folder
The possibilities are endless of course. One could add the purchase date and time, items, specific numbers or whatever. It just takes you some time detailing the RegEx pattern. As long as the document has some expected layout.
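To give an idea of what such a pattern can look like, here is a rough sketch in Python (not the sed/AppleScript combination I actually use). It assumes the receipt has a line with the word "Total" followed by an amount, and it hard-codes a "$" sign; the names TOTAL_RE and receipt_name are made up for this example.

```python
import re

# Assumed layout: the word "Total" and an amount such as "5.00" or "5,00"
# on the same line. \b keeps "Subtotal" from matching, and [^\d\n] keeps
# the match from running on into the next line.
TOTAL_RE = re.compile(r"\btotal\b[^\d\n]{0,20}(\d+[.,]\d{2})", re.IGNORECASE)


def receipt_name(store: str, ocr_text: str) -> str:
    """Build a name like 'Receipt (A) $5.00' from the OCR text."""
    match = TOTAL_RE.search(ocr_text)
    if match is None:
        # No recognisable total: fall back to the store name only.
        return f"Receipt ({store})"
    return f"Receipt ({store}) ${match.group(1)}"


sample = "Store A\nSubtotal 4.50\nTotal: $5.00\nThank you"
print(receipt_name("A", sample))
```

Receipts that print the total differently (another word, another currency, the amount on the next line) would need their own pattern.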
Hi, I am looking for a similar solution (I hope I understand this thread correctly).
I would like to rename OCR-processed files with a single smart rule.
The smart rule should be able to read a list of the shops I buy from.
I would set up a little database with names like amazon, thalia, hugendubel etc. It would be great to have a second list containing terms like rechnung (invoice), lieferschein (delivery note), kontoauszug (account statement) etc. If the smart rule detects one of these patterns in the OCR text layer, it should put that into the file name. Would this be possible?
It’s like here on the screenshot…but for several matches.
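Independent of how it would be wired into a smart rule, the two-list lookup described above could be sketched like this in Python. This is only an illustration of the logic, not something DEVONthink runs as-is; the function names are made up, and the lists reuse the examples from the post.

```python
from typing import List, Optional

# Example lists taken from the post; extend as needed.
STORES = ["amazon", "thalia", "hugendubel"]
DOC_TYPES = ["rechnung", "lieferschein", "kontoauszug"]


def first_match(text: str, keywords: List[str]) -> Optional[str]:
    """Return the first keyword (in list order) found in the text."""
    lowered = text.lower()
    for keyword in keywords:
        if keyword in lowered:
            return keyword
    return None


def suggest_name(ocr_text: str) -> Optional[str]:
    """Combine the matched store and document type into a name."""
    parts = [first_match(ocr_text, STORES), first_match(ocr_text, DOC_TYPES)]
    found = [part.capitalize() for part in parts if part]
    return " ".join(found) if found else None


print(suggest_name("Amazon EU S.a.r.l.\nRechnung Nr. 123"))
```

A plain substring test is used here on purpose: for fixed names there is no need for a regular expression at all.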
I think changing “All” into “Any” and adding the various “content matches” rules for the stores should work.
This is somewhat different from my RegEx solution, as that actually matches the amount paid from the receipt (most of the time) and appends it to the store name.
So if I scan a receipt from store A and paid $5.00, the file renames automatically to:
Receipt (A) $5.00
But before you get into that, please take into account the fairly complex regular expression syntax sed uses. Once you get into it, it’s a pretty fun (though sometimes frustrating) activity.
I’m not sure if I understood you correctly, but I thought you only wanted to append the store name of multiple stores to a file. I think you can simply use the contentmatches query and use “Any” in a smart rule to do that.
If you want to learn about sed and regular expressions, there are several tutorials you can find online.
A file comes into the inbox, then the smart rule detects the new PDF from the scanner, which has a file name like Scan-16012020–001.pdf
After that the smart rule should start (I am able to set the creation date at the moment) -> run OCR and add the output to the PDF -> the file then gets a visible stamp showing when it was scanned -> the file name gets the date, like 2020-01-16
Now it would be great if the smart rule went through the scanned text and looked for the company name and possibly for the document type (invoice etc.). I would keep these hints in a file, Excel list, database etc. and then add the matches to the file name, I hope.
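The date step on its own is a small pattern-matching job. As a sketch in Python, and assuming the scanner always writes the date as DDMMYYYY right after "Scan-" (inferred from the single example above), it could look like this; SCAN_RE and date_prefix are made-up names.

```python
import re
from datetime import datetime
from typing import Optional

# Assumed scanner naming scheme: "Scan-" + DDMMYYYY + dash(es) + counter.
# Both a plain hyphen and an en dash are accepted before the counter.
SCAN_RE = re.compile(r"^Scan-(\d{8})[-–]+\d+\.pdf$")


def date_prefix(filename: str) -> Optional[str]:
    """Turn 'Scan-16012020–001.pdf' into '2020-01-16', or None."""
    match = SCAN_RE.match(filename)
    if match is None:
        return None
    try:
        parsed = datetime.strptime(match.group(1), "%d%m%Y")
    except ValueError:
        # Eight digits that do not form a valid date.
        return None
    return parsed.strftime("%Y-%m-%d")


print(date_prefix("Scan-16012020–001.pdf"))
```

If the scanner uses a different order (for example YYYYMMDD), only the format string "%d%m%Y" needs to change.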
At least for the document amount (e.g. $5.00) regular expressions are usually not necessary, there’s already a placeholder scanning the (OCRed) text for this information.
Wouldn't it also be cool to be able to search the text with RegEx in the conditions themselves? As a way of selecting documents, so to speak.
Unfortunately, regular expressions are far too slow for the fast searching of large amounts of data that smart groups and smart rules require. There are customers with databases far beyond the recommended size, i.e. with millions of items and billions of words.
It is therefore more practical to first narrow down the documents to be searched with simple conditions and then search the name or text with a regular expression. This also has the advantage that several queries and actions can run in one smart rule. If a query finds no result but one of the following actions expects one, those actions are simply skipped.
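The "narrow down first, regex second" idea can be illustrated outside of DEVONthink as well. The following Python sketch is only an analogy, not how smart rules are implemented: a cheap substring test stands in for the simple condition, and the regular expression only runs on documents that pass it. The invoice-number pattern and the sample documents are made up.

```python
import re
from typing import Dict

# Hypothetical pattern: "Invoice No. 4711" and close variants.
INVOICE_NO_RE = re.compile(r"\binvoice\s+no\.?\s*(\d+)", re.IGNORECASE)


def find_invoice_numbers(docs: Dict[str, str]) -> Dict[str, str]:
    """Map document name to invoice number for documents that have one."""
    results: Dict[str, str] = {}
    for name, text in docs.items():
        # Stage 1: cheap condition that rules out most documents.
        if "invoice" not in text.lower():
            continue
        # Stage 2: the regular expression runs only on the remainder.
        match = INVOICE_NO_RE.search(text)
        if match is not None:
            results[name] = match.group(1)
        # No match: nothing is recorded, i.e. later steps are skipped.
    return results


docs = {
    "a.pdf": "Invoice No. 4711",
    "b.pdf": "Delivery note 12",
    "c.pdf": "Invoice without a number",
}
print(find_invoice_numbers(docs))
```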
And finally, the Scan Text action mentioned above has further options, e.g. to narrow down a date or an amount in a document if the automatic detection does not deliver the desired result.
True enough. I'll take another look at it from that angle.
Are there any plans to allow rules to be grouped? I can imagine that quite a lot of rules can pile up, so grouping would help, if only to avoid creating monster rules.
What about dependencies between rules? Does that make sense from your point of view?