After OCR rename file based on certain words or parts in the PDF?

Doyanole · October 3, 2019, 12:20pm

Hi

Is it possible to rename a searchable document (PDF) based on certain parts of that particular document?
For example, the name is on the right side and the contract number is at the top left, and I would like to rename the document to: Contract - Name - Contractnumber

Thanks in advance

cgrunenberg · October 3, 2019, 12:27pm

DEVONthink doesn’t support this on its own; it’s only able to recognize document dates & amounts. Therefore the only options are AppleScript or third-party tools.

Doyanole · October 3, 2019, 12:37pm

Do you know of any third-party tools, by any chance? I don’t think AppleScript can do this.

cgrunenberg · October 3, 2019, 12:40pm

Maybe Hazel.

wmc · October 3, 2019, 1:16pm

Hazel can indeed do this very well, depending of course on the accuracy of the OCR.

BLUEFROG · October 3, 2019, 2:52pm

Can you provide an example of such a rule?

wmc · October 3, 2019, 3:15pm

Yes, but will be later as I’m away from my computer.

The basic idea is to provide a bit of context found in the document, such as “Due date:” which is adjacent to the date you want to capture. You can specify the expected date format, or let Hazel try to interpret. A date such as “October 3, 2019” found in the content could then be reformatted as 2019-10-03 and used as a component in renaming the file.
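Outside Hazel, the same idea can be sketched in a few lines of Python. This is only an illustration of "find the label, capture the adjacent date, reformat it", not how Hazel works internally; the function name `date_after_label` and the sample text are made up for the example, and it assumes English month names in the text layer.

```python
import re
from datetime import datetime

def date_after_label(text, label="Due date:"):
    """Return the date that follows `label` as YYYY-MM-DD, or None if not found.

    Assumes the date is written like "October 3, 2019" (full English month name).
    """
    pattern = re.escape(label) + r"\s*([A-Z][a-z]+ \d{1,2}, \d{4})"
    match = re.search(pattern, text)
    if match is None:
        return None
    parsed = datetime.strptime(match.group(1), "%B %d, %Y")
    return parsed.strftime("%Y-%m-%d")

# Hypothetical text layer of a scanned bill
sample = "Invoice 123\nDue date: October 3, 2019\nTotal: 42.00"
print(date_after_label(sample))
```

If the label is missing or the OCR garbles the date, the function returns None rather than guessing, which is roughly what a non-matching rule does.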

BLUEFROG · October 3, 2019, 3:19pm

Thanks!
I suspected as much. This definitely requires uniformity in the text content, as well as a good text layer in OCR’d documents.

Doyanole · October 3, 2019, 3:26pm

@wmc if you could show the rule later on, it would be much appreciated. I kind of struggle there.

wmc · October 3, 2019, 3:27pm

Yes, especially the latter. I use Hazel to rename PDFs of bills, and have a number of rules to match each vendor. Once created, they all act on a particular folder so I don’t need to select the specific rule needed…only the correct rule will match its corresponding file.
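As a rough picture of why no manual rule selection is needed: each rule carries its own match condition, and a file is only handled by the rule whose condition fits. A minimal Python sketch of that idea follows; the vendor names and patterns are invented for illustration and are not my actual Hazel conditions.

```python
import re

# Hypothetical per-vendor conditions: (vendor label, pattern expected in the text layer)
VENDOR_RULES = [
    ("Verizon Wireless", re.compile(r"Verizon Wireless")),
    ("Example Electric", re.compile(r"Example Electric Co\.")),
]

def matching_vendor(ocr_text):
    """Return the first vendor whose pattern appears in the text, else None."""
    for vendor, pattern in VENDOR_RULES:
        if pattern.search(ocr_text):
            return vendor
    return None

print(matching_vendor("Verizon Wireless\nBilling period Account number"))
print(matching_vendor("Some unrelated document"))
```

A file that matches no pattern is simply left alone, which is also the safe outcome when OCR quality is poor.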

wmc · October 3, 2019, 3:27pm

Will do.

wmc · October 3, 2019, 3:31pm

If you’re interested, Hazel’s documentation is quite comprehensive

Hazel Manual – Noodlesoft

Doyanole · October 3, 2019, 3:34pm

Thanks. I already tried that one earlier… for me that’s quite a challenge, I have to say.

wmc · October 3, 2019, 6:34pm

@Doyanole Short description of my workflow. The example is a Verizon Wireless bill. ExactScan scans the document, gives it a temporary name starting with VZW to identify it later for processing, and puts it into a Hazel watched folder. A Hazel rule recognizes that it needs OCR, adds the tag “NeedsOCR”, then moves it to the DEVONthink Inbox. A DEVONthink smart rule performs OCR, removes that tag and replaces it with “OCR by DT3” (so Hazel won’t send the file into an infinite loop), then exports to the same Hazel folder. Now Hazel can read the text layer and renames the file to include the document date (in this case, the year/month of the bill). The renamed file is returned to DEVONthink’s Inbox, where a smart rule moves it to its final group.
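The tag swap is what keeps the round trip from looping. A small Python model of the decision, purely as a sketch (the function names are made up, and the real logic lives in the Hazel rules and the DEVONthink smart rule):

```python
NEEDS_OCR = "NeedsOCR"
OCR_DONE = "OCR by DT3"

def hazel_action(tags):
    """What the watched-folder pass does with a file carrying `tags`."""
    if OCR_DONE in tags:
        return "rename"       # text layer present: read the date and rename
    return "send_to_ocr"      # no OCR yet: tag NeedsOCR and move to the Inbox

def after_ocr(tags):
    """Tag swap done by the OCR step: drop NeedsOCR, add the done marker."""
    return (set(tags) - {NEEDS_OCR}) | {OCR_DONE}

fresh_scan = set()
print(hazel_action(fresh_scan))                 # first pass
print(hazel_action(after_ocr({NEEDS_OCR})))     # second pass, after OCR
```

Because the second pass sees “OCR by DT3”, the file is renamed instead of being sent for OCR again.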
The Hazel rule that reads the date looks for a line that on the printed copy looks like

Billing Period Aug 16, 2019 to Sep 15, 2019

but due to OCR formatting reads

Billing period Account number Invoice number
Aug 16, 2019 to Sep 15, 2019

(The Billing period, Account Number and Invoice number entries are in a column with the respective data in the next column. The OCR reads down one column then over to and down the next).
I don’t need the start of the billing period, as all I’m after is the closing date to use as the bill date, so my rule looks for “period” followed by anything until seeing a date of the form “to Mmm dd, yyyy”
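For anyone who thinks in regular expressions, the match is roughly equivalent to the Python sketch below. This is an approximation for illustration, not an export of the Hazel rule; the output file name format ("VZW 2019-09.pdf") is an assumption, and the function names are made up.

```python
import re
from datetime import datetime

# "period", then anything (including line breaks), then "to Mmm dd, yyyy"
CLOSING_DATE = re.compile(
    r"period.*?\bto\s+([A-Z][a-z]{2} \d{1,2}, \d{4})", re.DOTALL
)

def closing_date(ocr_text):
    """Return the billing period's closing date, or None if the pattern fails."""
    match = CLOSING_DATE.search(ocr_text)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%b %d, %Y").date()

def bill_name(ocr_text, prefix="VZW"):
    """Build a file name from the closing date's year and month (assumed format)."""
    closed = closing_date(ocr_text)
    if closed is None:
        return None
    return f"{prefix} {closed:%Y-%m}.pdf"

# Text layer as the OCR actually delivers it, read column by column
ocr_text = (
    "Billing period Account number Invoice number\n"
    "Aug 16, 2019 to Sep 15, 2019\n"
)
print(bill_name(ocr_text))
```

The lazy `.*?` together with `re.DOTALL` is what lets the match skip over the column headings and the line break that the OCR puts between “period” and the dates.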

The field after Contents > Contain Match expands as shown here:

and the Date attribute looks like this:

Finally the renaming action uses the “Date” attribute but changes the format:

Note in the first screenshot a Preview button. This shows whether each condition matches, and in the case of looking at contents lets you see what the text layer looks like so you can spot errors or anomalies like the data reading down columns instead of across.

Sometimes this can be quite exacting, but once set up works well (until the vendor changes the bill format, or you get a bad OCR for some reason)…the latter is why I run the file into DT3 for OCR, then export for Hazel to do its work, then back to DT3…the ABBYY engine in DT3 is more consistent than the one in ExactScan.

BLUEFROG · October 3, 2019, 10:15pm

Thanks for the interesting post and…

(until the vendor changes the bill format, or you get a bad OCR for some reason)

Yep. That’s definitely the issue people have to accept: non-uniform data. If you can get things set up for one, there’s no guarantee it will work for anything but that set of criteria, so you have to evolve it, create different versions, and error-trap more.

But I applaud you taking on the challenge. It’s fun stuff

wmc · October 4, 2019, 12:12am

Once you understand the process, it is fairly easy to duplicate the rule and make changes for different vendors. But it’s definitely one of those programming decisions where you weigh the time in setting it up once, and adapting when something in the target document changes, against the time saved over processing each instance manually.

Edit: Yeah, it’s also fun.

galsom · October 12, 2019, 12:48pm

My setup has been like yours for many years in combining Hazel and DT. I just upgraded to DEVONthink 3 Pro and would like to use its OCR engine as well rather than a separate one.

Hence I’m trying to figure out how to move the OCRed files out of DT so that Hazel can rename them, and then add them back afterwards. How did you accomplish this particular step of exporting in your setup?

wmc · October 12, 2019, 2:13pm

Welcome aboard. I think you’ll find this community very helpful, and the DT staff support is superb.

First pass with Hazel recognizes a new scan in folder ~/Devonthink Scans, adds tag “NeedsOCR” and moves it to the DT3 inbox.

Then this DT3 smart rule OCRs and moves back to same Hazel watched folder for renaming and moving back to DT3.

The embedded script does the export:

Hope this helps.

galsom · October 12, 2019, 2:35pm

This helps a lot indeed. I didn’t get the part that it was an AppleScript. I’ll try to implement it as soon as possible.

This allows me to really capture all my PDFs to the Global inbox using DT and DTTG and make sure they are OCRed and renamed correctly.

BLUEFROG · October 12, 2019, 3:36pm

I think this can be made even simpler.

I’m curious what the Hazel rule is. Can you provide a screencap or export the rule for me?