I have a small but persistently annoying problem.
Several months back my bank changed some aspect of their bank statements. It seems that OCR does not recognize the account number on a first pass, but when the file is OCRed a second time, the account number is recognized.
The result in DEVONthink is that the smart rule does not immediately recognize the account number (one of the criteria within the smart rule). After import, the smart rule lists the file as available for processing (if I click on the smart rule, the file is sitting there unprocessed), yet the file is never processed, even though the smart rule has the “On Import”, “On Scan” and “On OCR” triggers.
Any suggestions? This behavior only happens with the two statements I receive from my bank. All my other statements (credit cards and other bills) are processed without a problem. The smart rules I have set up for the bank statements worked until 4-5 months ago when the bank obviously reconfigured the PDF to make it harder to scan for the account number.
Are you sure the bank doesn’t send PDFs with a text layer anyway, so that OCR is not needed at all?
Also, what do you see in the resulting text file when you convert such a PDF (the original) to text in DT?
Run a test before putting bank statements into DT: open one of them in Preview, select some text, then copy and paste it into a plain text file. Do you see syllables instead of entire words? If you see “Wall mart pur cha se of a ti pi ti ki ta vi” instead of “Wallmart purchase of a tipikitavi”, they are using soft/hard hyphenation, one of the zillion things the macOS PDF framework does not support. Another cause could be ligatures: “con dent” instead of “confident”.
Thank you for the feedback. Here’s the experiment I ran:
Turned off the Smart Rule.
Opened PDF from the bank website in Preview. Copied some text to a text editor. All text appears normal; however, I am unable to select the 15-digit account number.
Moved PDF to DEVONthink Inbox. Kind is “PDF+Text”; however, I am unable to select the account number in the DEVONthink PDF editor.
Smart Rule (triggers still turned off) does not recognize the file as eligible.
Performed an “OCR to Searchable PDF” on the file in the Inbox (even though it’s listed as PDF+Text).
After the “OCR to Searchable PDF” action is performed, the account number is now selectable. The Smart Rule now recognizes the file as eligible.
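The manual check in the steps above (is the account number present in the text layer or not?) could be sketched in Python roughly as follows. This is only an illustration and has nothing to do with how DEVONthink evaluates smart rules. It assumes the text has already been copied out of the PDF, and that the account number is exactly 15 consecutive digits with no spaces or dashes. The sample strings and the name `needs_ocr` are made up.

```python
import re

# Assumption: the account number is exactly 15 consecutive digits.
# The lookarounds reject longer digit runs such as 16-digit card numbers.
ACCOUNT_RE = re.compile(r"(?<!\d)\d{15}(?!\d)")


def needs_ocr(text_layer: str) -> bool:
    """Return True when no 15-digit account number is found in the text."""
    return ACCOUNT_RE.search(text_layer) is None


# Hypothetical text copied from the original PDF (account number not selectable)
before = "Statement period 01/03 to 31/03\nAccount: \nBalance 1,234.56"
# Hypothetical text after "OCR to Searchable PDF" (account number now present)
after = "Statement period 01/03 to 31/03\nAccount: 123456789012345\nBalance 1,234.56"

print(needs_ocr(before))  # True
print(needs_ocr(after))   # False
```

If the real statements print the number with spaces or dashes, the pattern would have to be loosened accordingly.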
As I mentioned in my initial post, the file appears as eligible in the Smart Rule soon after it is imported; however, the extra time (microseconds) required to recognize the account number means it misses the triggers I have established (e.g., “On Import”).
Running the Smart Rule every minute would be a waste of resources, given that I access this file only once a month. Any other advice? Is there a way to establish a delay for the Smart Rule to allow the account number to be recognized before the file is processed?
I guess that they provide the account number as an image. Which is dumb. Perhaps the value is also part of the PDF file name, so you can grab it from there?
Alternatively, if you always download it manually from their website, you could put it in a particular folder that you index in DT. Create a smart rule that watches this folder and does whatever it needs with the file – you know the correct account number because the file arrives in this folder…
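If the account number does turn out to be part of the file name, pulling it out is straightforward. Here is a minimal Python sketch of the idea. The file names are invented and the 15-digit format is taken from the description earlier in the thread, so the real naming scheme may well differ.

```python
import re
from pathlib import Path
from typing import Optional

# Assumption: the file name embeds the 15-digit account number somewhere.
ACCOUNT_IN_NAME = re.compile(r"(?<!\d)(\d{15})(?!\d)")


def account_from_filename(path: str) -> Optional[str]:
    """Return the 15-digit account number found in the file name, if any."""
    match = ACCOUNT_IN_NAME.search(Path(path).stem)
    return match.group(1) if match else None


# Hypothetical file names, for illustration only
print(account_from_filename("Statement_123456789012345_2024-03.pdf"))
print(account_from_filename("Statement_2024-03.pdf"))
```

The first call returns the number as a string and the second returns `None`, which would be the signal to fall back to OCR or to the watched-folder approach.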
If these are coming from the same source (the bank), why aren’t you using data about that – name, URL, … – as a criterion for matching the files, then just doing OCR automatically? It may then be possible to use the Apply Rule action and pass the document to the other smart rule you’re using.
Here is a document with 1234 in the content (for OCR) and named 1234 (as a criterion for detecting the document)…
Since the document matches the second rule, it is also executed. If the document wasn’t a match, the second rule would not run.
So replace Name matches 1234 with whatever identifier you have for this bank. It should do OCR then pass the document to the second rule. If the account number is matched in the second rule, it should execute.
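To make the control flow of the two chained rules easier to follow, here is a toy model in Python. It is not DEVONthink scripting and uses no DEVONthink API. `Document`, `run_ocr`, `first_rule` and `second_rule` are hypothetical stand-ins, and `hidden_text` simply simulates text that only becomes available after an OCR pass.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Document:
    name: str
    text: str                      # text layer available right after import
    hidden_text: str = ""          # stand-in for text only OCR would reveal
    log: List[str] = field(default_factory=list)


def run_ocr(doc: Document) -> None:
    """Simulate an OCR pass by exposing the hidden text."""
    doc.text += doc.hidden_text
    doc.log.append("ocr")


def second_rule(doc: Document, account: str) -> bool:
    """Act only when the account number is found in the content."""
    if account not in doc.text:
        return False
    doc.log.append("filed")
    return True


def first_rule(doc: Document, identifier: str, account: str) -> bool:
    """Match on the name, run OCR, then hand the document to the second rule."""
    if identifier not in doc.name:
        return False               # not from the bank: nothing happens
    run_ocr(doc)
    return second_rule(doc, account)


bank = Document("1234 statement.pdf", "Balance 10.00", " Account 1234")
other = Document("utility bill.pdf", "Balance 20.00")

print(first_rule(bank, "1234", "1234"), bank.log)    # True ['ocr', 'filed']
print(first_rule(other, "1234", "1234"), other.log)  # False []
```

The point of the ordering is that the second rule only sees the document after OCR has run, so the account-number criterion is evaluated against the refreshed text.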
I wasn’t aware I could create a second rule that would run as the result of the first.
Can you explain how the “On Demand” trigger works? I assumed it only ran a Smart Rule by manually executing the rule. Your solution suggests otherwise.