Auto-renaming PDF after OCR based on content

bosie · December 16, 2020, 5:30pm

Yup, exactly this. I would be quite interested in that perl/apple script if you would be willing to share it?

eburgwedel · December 16, 2020, 6:30pm

Sure: https://github.com/eburgwedel/devonthink-smart-rename

I’m not a developer, so please forgive the ugliness of my code. You need a few Perl packages for Unicode and date handling, and the AppleScript for DNtp still requires a hardcoded path to the Perl script. I’m sure there is a better way, advice welcome.

Let me know if you have any questions.

BLUEFROG · December 16, 2020, 6:38pm

This is my favorite line…

# Change "May" to "5" -> crazy

bosie · December 16, 2020, 8:08pm

for not being a developer you sure picked the two most convoluted programming languages to get it done

so if i understand it correctly, you are importing it into DT and then select the row, execute the AS which in turn changes both the name of the record as well as the actual file on disk?

eburgwedel · December 17, 2020, 10:31am

Yes, that’s how it works. DT thankfully makes it super easy.

As for the languages, I did have my fair share of pain, indeed. I had never written Perl before, nor AppleScript in any relevant way. I do marvel at Perl’s elegant and concise ways for certain tasks, but even the smallest snag could turn into hours of research. And don’t get me started on Unicode. But there is nothing you cannot copy-paste from Stack Overflow.

@BLUEFROG Seriously, if you just read the next line, you have no idea what this is doing. ‘Implicit Power Cast’ might describe it somewhat more euphemistically.

bosie · December 17, 2020, 1:43pm

Wait a minute… this is your first script with perl? that is hilarious. i thought perl is your weapon of choice since the 80s or something along those lines.

as for: https://github.com/eburgwedel/devonthink-smart-rename/blob/main/smartrename.applescript#L10 i found this: https://stackoverflow.com/questions/48915301/how-to-pass-multiple-arguments-to-osascript
that way you might get around hard coding the path in the osascript
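
For what it's worth, a rough Python sketch of that idea, with Python standing in for whatever launcher you use: the caller hands the Perl script's location to `osascript` as an argument, and the AppleScript would read it in an `on run argv` handler instead of a hardcoded path. The function names and both paths are hypothetical placeholders, not taken from the repository.

```python
import subprocess


def build_osascript_command(applescript_path, perl_script_path):
    """Return the argument list for running an AppleScript with one argument.

    Inside the AppleScript, the value would arrive as `item 1 of argv`
    in an `on run argv ... end run` handler.
    """
    return ["osascript", applescript_path, perl_script_path]


def run_smart_rename(applescript_path, perl_script_path):
    # macOS only: raises CalledProcessError if osascript exits non-zero.
    result = subprocess.run(
        build_osascript_command(applescript_path, perl_script_path),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

The install location then lives in one place (the caller's configuration) rather than inside the script itself.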

eburgwedel · December 18, 2020, 10:11am

I guess I used Perl because research indicated a superior handling of regular expressions and dates. I wouldn’t do it again, to be honest, the learning curve left a lasting impression. I wonder how a Perl buff would have done it.

As for the tip, thanks — where (and how) would you pass the install location to the script? I’m probably missing something here. How is deployment of OSA and additional dependencies ideally done?

bosie · December 18, 2020, 11:09am

I put stuff like this in my keyboard maestro config and then the KM just uses it when calling the bash or applescripts.

rmschne · December 18, 2020, 11:33am

If you have the urge to experiment with regular expressions, or just plain string processing, give Python a shot. IMHO more intuitive and useful. Perl was a mess for me and I gave up (15 years ago).

chrillek · December 18, 2020, 12:32pm

I’d have simplified the regular expressions somewhat. I think that three basic types should suffice:

yyyy#mm#dd
dd#mm#yyyy
dd#M#yyyy

Where yyyy could be only two digits, # stands for one of [-/._ ] and M for the name of a month (this last one is the worst because of abbreviations and language dependency).
I didn’t really read all of it but was impressed by the perlishness of it. You apparently dived in quite deep.
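
The three patterns above can be sketched in Python roughly like this. It is only an illustration, not the logic of the Perl script, and it makes several assumptions: English month names matched on their first three letters, two-digit years mapped to 20yy, and two-digit years accepted only in the day-first forms (in the year-first form they would be ambiguous). The names `PATTERNS` and `find_date` are made up for this example.

```python
import re
from datetime import date

SEP = r"[-/._ ]"  # the separator written as '#' above

# Assumption: English month names only, matched on the first three letters.
MONTHS = {
    "jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
    "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12,
}

PATTERNS = [
    # yyyy#mm#dd (four-digit year only, to avoid clashing with dd#mm#yy)
    re.compile(rf"(?<!\d)(?P<y>\d{{4}}){SEP}(?P<m>\d{{1,2}}){SEP}(?P<d>\d{{1,2}})(?!\d)"),
    # dd#mm#yyyy or dd#mm#yy
    re.compile(rf"(?<!\d)(?P<d>\d{{1,2}}){SEP}(?P<m>\d{{1,2}}){SEP}(?P<y>\d{{4}}|\d{{2}})(?!\d)"),
    # dd#M#yyyy, with an optional dot after an abbreviated month name
    re.compile(rf"(?<!\d)(?P<d>\d{{1,2}}){SEP}(?P<m>[A-Za-z]{{3,}})\.?{SEP}(?P<y>\d{{4}}|\d{{2}})(?!\d)"),
]


def find_date(text):
    """Return the first valid date found in text, or None."""
    for pattern in PATTERNS:
        for match in pattern.finditer(text):
            year = int(match["y"])
            if year < 100:
                year += 2000  # assumption: two-digit years mean 20yy
            raw_month = match["m"]
            if raw_month.isdigit():
                month = int(raw_month)
            else:
                month = MONTHS.get(raw_month[:3].lower())
            if month is None:
                continue  # unknown month name
            try:
                return date(year, month, int(match["d"]))
            except ValueError:
                continue  # impossible date such as 31.02.2020
    return None
```

Patterns are tried in order, so a year-first date anywhere in the text wins over a day-first one. Other languages would need their own month table, which is exactly the language dependency mentioned above.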

eburgwedel · December 18, 2020, 3:50pm

I indeed started with simple regex, but whenever I thought I had caught them all, some new exception would come up and somehow I ended up with this. Somewhere midway I hated my code and threw it all away. I also didn’t manage to make it more concise, because I would no longer be able to understand the implications of my own code, like an Excel table with too many overly long formulas.

My takeaway from this is that DT is one of the most powerful applications I know and that I would love a native plugin structure to do crazy things like this. I guess there could be a whole marketplace around it and I certainly would pay almost any amount for a trainable plugin to classify and rename my documents.

@rmschne I guess my hunger for regex is somewhat satisfied for now, but yes, Perl no more.
@bosie Thanks, that should certainly work!

jambamana · February 28, 2021, 3:14pm

I too am looking for auto-classification of documents. It seems this is being worked on at the enterprise level but I am eager for a consumer solution and DT would be a great candidate for this feature. Here’s an article about it: Automatic Document Classification with Machine Learning and AI

BLUEFROG · February 28, 2021, 8:31pm

Welcome @jambamana

DEVONthink already has some ability to auto-classify documents. It learns as you file things manually, gradually making better suggestions over time.

jambamana · March 1, 2021, 3:51am

Yes, but the files I’m looking to classify are PDFs from financial institutions, invoices, etc. The naming of these files from their sources is often awful. I’m looking for a way to get them named well before or as I’m importing them into DEVONthink.

rmschne · March 1, 2021, 6:57am

FYI, I either rename files manually before importing or, if I can figure out the pattern and automation is warranted, I use Hazel to do it. I find Hazel easier to work with for this sort of thing. After interrogating the file for information relevant to the renaming, Hazel moves it into the DEVONthink Global Inbox and DEVONthink then imports it. Rules inside of DEVONthink move such files into the preferred locations.

bosie · March 1, 2021, 9:28am

do you mind sharing some non-confidential hazel rules for renaming?

rmschne · March 1, 2021, 10:01am

No problem. See DEVONthink and Hazel | Musings on Interesting Things for the overall description of what I do.

All the incoming documents go to one folder called “~/Dropbox/Scanner Output” from the scanner, from a Hazel rule, or from me just putting them there. It’s a Dropbox folder as I use multiple devices from which files originate. I don’t want Hazel looking everywhere. Hazel does the interrogation and redistribution. On that target folder I have about 15 different Hazel rules set up to detect files for which I have discerned a pattern and which I handle often enough to overcome my personal inertia against automating. Too much automation is a problem in itself, sometimes. The screenshot below is the Hazel rule, including looking for the statement date, for AMEX credit card statements. I have just one AMEX account so I’m not extracting the account number to put into the file name, but I do that for bank accounts.

I add “000000” after the statement date simply because my standard file naming convention for these sorts of files for archiving (which I will probably never look at again) is YYYYMMDDHHMMSS so that I can easily sort. For this document, the time stamp is of no relevance.
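
Outside of Hazel, the same naming convention could be sketched in Python as below. This is only an illustration: the function names, the label text, and the space between stamp and label are hypothetical, not what the Hazel rule produces.

```python
from datetime import date


def archive_stamp(statement_date):
    """Format a date as YYYYMMDDHHMMSS with the time part zeroed out."""
    return statement_date.strftime("%Y%m%d") + "000000"


def archive_name(statement_date, label, extension=".pdf"):
    # Assumption: stamp first, then a free-text label, so names sort by date.
    return f"{archive_stamp(statement_date)} {label}{extension}"


names = [
    archive_name(date(2021, 2, 28), "AMEX Statement"),
    archive_name(date(2020, 12, 31), "AMEX Statement"),
]
print(sorted(names))  # plain string sorting is also chronological
```

Because every field is zero-padded and ordered from year down to second, sorting the names as plain strings puts them in date order.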

I’m sure DEVONthink can do all this plus more, but for dealing with files before they hit DEVONthink I just have found it simpler to use Hazel. To do DEVONthink stuff, I use DEVONthink, of course (manual and some Smart Rules).

Your mileage may differ.

bosie · March 1, 2021, 12:15pm

thank you very much. that is interesting. so you have 1 rule per specific document (type of document + company etc)? have you tried to do it less specific (i.e. one rule covering all of AMEX)?

rmschne · March 1, 2021, 12:42pm

This is a rule covering all of my AMEX files. Yes, I have basically one Hazel rule per document for which I have discerned the pattern. To my knowledge (though it’s been a while since I read the Help), Hazel does not have “if” statements to allow different actions based on differing filters (All or Any). I can think how to do that in something even more sophisticated, e.g. Python, but … Hazel is what I use and I keep it simple with a rule for each type of document.

To be honest having multiple rules in Hazel is really no different than having multiple “if’s” using something else.
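
As a rough illustration of that equivalence, here is a minimal Python sketch of a flat rule list where the first matching rule wins. The patterns and labels are made-up examples, not my actual Hazel rules, and `classify` is a hypothetical name.

```python
import re

# Each rule: (pattern that must appear in the document text, label to use).
# Assumption: these patterns and labels are invented for illustration.
RULES = [
    (re.compile(r"American Express", re.IGNORECASE), "AMEX Statement"),
    (re.compile(r"Account\s+Statement", re.IGNORECASE), "Bank Statement"),
]


def classify(text):
    """Return the label of the first rule whose pattern matches, else None."""
    for pattern, label in RULES:
        if pattern.search(text):
            return label
    return None  # no rule matched: leave the file for manual handling
```

Adding a document type means appending one entry to the list, which is much the same as adding one more rule in Hazel.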

jambamana · March 1, 2021, 5:35pm

I’ve been playing with Hazel based on this thread and found the same. It’s pretty good. It would be better if it allowed you to nest if’s inside and’s or vice-versa. That would allow for some more sophisticated discernment inside a single rule. But as @rmschne says, it’s not much different than a flat list of if statements for pattern matching in any other language. Still quite helpful.