Zonal OCR to use for renaming of files

sebsez · June 29, 2019, 11:41am

Hi All,

I was wondering if anyone is doing the below (or similar) and can point me in the right direction:

1. Set up OCR zones for several different document types so that each predefined zone on a given document type captures specific data, e.g. the date of the document.
2. Zones would capture the following data: Date, Document Subject, Document Author, Document Recipient.
3. Automatically OCR documents.
4. Once OCR is complete, process documents against the OCR zone templates so that the correct zone template is applied to each document.
5. Extract the above-mentioned data (in point 2) from each document.
6. Rename the document based on a predefined naming convention such as: YYYY.MM.DD - Doc_Recipient - Doc_Subject - Doc_Author.pdf
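
To make the last step concrete, here is a minimal Python sketch of the renaming convention only. It assumes the four fields have already been pulled out of the OCR zones by some other tool; the function names and the sample values are hypothetical, and replacing unsafe characters with spaces is just one possible choice.

```python
import re
from datetime import date


def clean(part: str) -> str:
    """Replace characters that are unsafe in file names, then collapse whitespace."""
    part = re.sub(r'[\\/:*?"<>|]', " ", part)
    return re.sub(r"\s+", " ", part).strip()


def build_filename(doc_date: date, recipient: str, subject: str, author: str) -> str:
    """Return 'YYYY.MM.DD - Doc_Recipient - Doc_Subject - Doc_Author.pdf'."""
    stamp = doc_date.strftime("%Y.%m.%d")
    parts = [stamp, clean(recipient), clean(subject), clean(author)]
    return " - ".join(parts) + ".pdf"


# Hypothetical values standing in for the four extracted zones.
example = build_filename(
    date(2019, 6, 29), "J. Smith", "Insurance  renewal: 2019/20", "Acme Ltd"
)
print(example)
```

Putting the date first in a fixed-width form means the files sort chronologically in a plain alphabetical listing.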

There seem to be a number of commercially available applications that would do this, but they are all aimed at enterprise-level businesses, not at personal use.

Essentially I’m trying to save time by automating my document renaming process.

Has anyone done, or is anyone doing, anything similar? If not, how are other people automating their document naming?

Thanks & Regards

Seb

mbbntu · June 29, 2019, 2:51pm

I would OCR the whole document first, then use Hazel: https://www.noodlesoft.com. It’s a pretty remarkable utility, and not expensive.

sebsez · June 29, 2019, 9:22pm

Hi mbbntu,

Yes, I already use Hazel for a number of tasks, but as far as I know Hazel is not able to extract data from the OCR layer of a document and use it to rename the document.

Thanks

Seb

pvonk · June 30, 2019, 12:59am

Oh, Hazel can extract data from an OCRed PDF and rename the file and/or move it to an indexed folder. I used to have a complex system set up with Hazel that parsed bills and filed them away in appropriate folders. Have a look here… (click on the link, don’t just read what’s displayed below)

mbbntu · June 30, 2019, 5:34pm

Hazel certainly can extract text from the OCR layer, as pvonk states. I use that ability to find my bank account number, which allows me to identify PDFs that are bank statements, rename them with the date extracted from the OCR layer, and file each PDF in the appropriate folder. This feature has been available for some time.

BLUEFROG · June 30, 2019, 5:54pm

I use that ability to find my bank account number,

By what mechanism is this done?

mbbntu · June 30, 2019, 6:29pm

I was guilty of being a bit too terse in what I wrote. Hazel watches my downloads folder and processes various things that land in it. For example, PDFs are automatically checked to see if they need OCR. If they do, they are sent off to PDFpen Pro to have that done (I find it gives the best OCR results of the various options available to me). If they do not need OCR, they are scanned for data that will identify what the PDF is: a bill, a receipt, a bank statement, etc. A bank statement will have my account number and the sort code of the bank in it (I’m in the UK, so UK terminology), plus various other identifiers that tell Hazel that it is not, for example, a receipt with my account number in it. The PDFs are then renamed, beginning with the ISO date so that they sort properly. The date of the item is extracted automatically from the text in the PDF. Hazel can do some pattern matching and manipulation of data, so a date in standard UK/European format in the text of the PDF can be turned into an ISO date for renaming very easily.
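
For anyone who wants to see the logic spelled out, here is a rough Python sketch of the same idea. It is not Hazel’s rule syntax (Hazel rules are built in its own interface); it only illustrates the two steps described above: identify a bank statement by its account number and sort code, then turn the first valid UK-format date in the text into an ISO date for the new name. The account number, sort code, sample text and function names are all made up for illustration.

```python
import re
from datetime import datetime
from typing import Optional

# Hypothetical identifiers; a real rule would use your own account details.
ACCOUNT_NUMBER = "12345678"
SORT_CODE = "12-34-56"

# Day, month and four-digit year separated by "/", "." or "-".
UK_DATE = re.compile(r"\b(\d{1,2})[/.-](\d{1,2})[/.-](\d{4})\b")


def is_bank_statement(text: str) -> bool:
    """Treat the text as a bank statement only if both identifiers appear."""
    return ACCOUNT_NUMBER in text and SORT_CODE in text


def first_iso_date(text: str) -> Optional[str]:
    """Return the first valid day-month-year date in the text as YYYY-MM-DD."""
    for match in UK_DATE.finditer(text):
        day, month, year = (int(group) for group in match.groups())
        try:
            return datetime(year, month, day).date().isoformat()
        except ValueError:
            continue  # e.g. 31/02/2019 is not a real date; keep looking
    return None


def statement_name(text: str) -> Optional[str]:
    """Build an ISO-date-first file name, or None if the text does not qualify."""
    iso = first_iso_date(text)
    if not is_bank_statement(text) or iso is None:
        return None
    return f"{iso} - Bank Statement.pdf"


# Made-up OCR text for demonstration.
sample = "Sort code 12-34-56  Account 12345678  Statement date 28/06/2019"
print(statement_name(sample))
```

Requiring both identifiers is a simple way to avoid misfiling a receipt that merely mentions the account number.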

These are not new techniques. Macsparky has been using them for years and has written about them extensively in his field guides.

pvonk · June 30, 2019, 6:36pm

Right. I think my link above (Revisiting Hazel, click on the link, don’t just read what you see above) goes into many details. The post in that thread by Greg_Jones gives screenshots of rules and shows ways of picking up text in the PDF, assigning it to a variable and later using it to rename the file.

mbbntu · June 30, 2019, 7:09pm

Yeah, and I see that thread is from 2012, so Hazel has been capable of reading text in a pdf for at least the past seven years. It pays to check what a program will actually do! (Not that I always do …)

pvonk · June 30, 2019, 7:30pm

" It pays to check what a program will actually do! (Not that I always do …)"

I hear you. I have many apps that I’ve used for years. Once I’ve determined a workflow for a given app that works for me, I can easily miss some of the new features that are added subsequently. For example DTP 3 is one I’ll really have to focus on. I see a few new features (esp. metadata) that I find intriguing, but there will be others that slide in under the radar. That’s why I regularly follow forums for my important apps to learn new tricks. Even then, I’ll run into a post that opens my eyes.

sebsez · July 6, 2019, 10:35pm

Many thanks, guys. I was overthinking and overcomplicating things while a far simpler solution was staring me right in the face!

@pvonk - Point well made about “It pays to check what a program will actually do!”

I’ve now created rules using the custom date attribute capability in Hazel for a couple dozen document types and they all work perfectly!