Auto-renaming PDF after OCR based on content

bosie · December 11, 2020, 12:24pm

that’s one type with 3 variants. type being (presumably) Invoice

BLUEFROG · December 11, 2020, 12:26pm

How is the software supposed to discern what is important without you explicitly telling it, especially in a sea of variable text?
It’s only a computer, not a human being.

bosie · December 11, 2020, 12:45pm

By identifying the type of document and then figuring out the important pieces. similar to my bank figuring out the important pieces of the invoice simply by taking a photo of the invoice. works well enough.
and on the odd documents where it can’t figure it out, a simple user interaction to identify the important bits and pieces is OK too. from that it can learn for the future.

chrillek · December 11, 2020, 1:01pm

That’s called a DWIM system (Do What I Mean). Developers have been working on it for ages, but it still has some rough edges to be polished.
In the meantime, you might try to figure out what characterizes your documents (aka “important pieces”) and script accordingly.
Oh, and your bank’s system works with invoices only, I suppose. In my experience, with quite different results depending on light, font size and the position of the moon. YMMV, of course.

bosie · December 11, 2020, 1:18pm

my bank’s system is indeed only invoices but so far, every single one of my invoices has worked. haven’t manually entered data for wire transfers in ages.

but for documents that i scan, the number of types of documents is limited anyways. the ‘search space’ is much more limited.
as for entities, presumably DT is already doing something similar for search and the graphs/word lists.

rmschne · December 11, 2020, 1:47pm

@bosie. You may wish to consider using Hazel for all this “smart” file naming prior to importing into DEVONthink. Hazel is well more sophisticated and has features you’ll like.

bosie · December 11, 2020, 1:54pm

i haven’t found those features in hazel. mind naming some keywords i could google for please?

rmschne · December 11, 2020, 1:56pm

I overstated saying “smart”. I meant “sophisticated”. There are no “computer does the thinking” anywhere that I know. Sorry to have misled. But it is easier to set up complex rules in Hazel (just my opinion). Your mileage may differ.

Edit a few minutes later: The only personal computer program that I know of that did “smart” interpretation (not “learning”, though) of text was Lotus Agenda from the 1980’s. Product died in early 1990’s if not before. Oh well.

bosie · December 11, 2020, 2:27pm

There are no “computer does the thinking” anywhere that I know.

machine learning? DT’s search would have to do some thinking though. if DT extracts the most valuable entities out of a document (similar to the occurrence list thingie they already have) and you could use that, it might be interesting to use in the renaming process. very hard to do with rules that are based on simple text scans IMO.

i don’t see how you could have complex hazel rules doing this sort of stuff. i tried that already but i genuinely failed to do it. if i need to create one rule per type/variant, i end up with hundreds of rules that i have to manually curate.

rmschne · December 11, 2020, 2:31pm

Yes, of course you are correct. Recommend you buy and use one of those “machine learning” products.

For me, Hazel works on the dozen or so “types” of incoming invoices/statements/etc. that I have. The rules were “sophisticated” and it was fun setting up, and now I reap the benefits.

bosie · December 11, 2020, 2:41pm

i haven’t found those machine learning products just yet, hence i was asking if someone can recommend them (for non-corporate usage that is)

chrillek · December 11, 2020, 2:50pm

I’m afraid you’re falling for a buzzword here. Machine learning needs training, much like humans do. Throwing an arbitrary document at it and expecting it to find what’s “important” (to you, no less) is like asking a three-year-old about partial differential equations. Though you might have some luck there.
As long as you’re not able to express in human terms what “important piece” means for you, software won’t be able to help you.

BLUEFROG · December 11, 2020, 2:58pm

Also to further @chrillek’s comments, machine learning uses many thousand (and more!) tests to learn from. It gets answers wrong at a very high percentage until it’s sufficiently trained. And the training is also focused like, “Is this a picture of a cat?” not “Identify any animal”. That kind of fidelity is achieved only after testing specifics and again, a very large set of data being used.

bosie · December 11, 2020, 3:22pm

thanks for the feedback, i work in machine learning and build/evaluate models. i don’t necessarily fall for the buzzword here

That kind of fidelity is achieved only after testing specifics and again, a very large set of data being used.

sure, so what? that’s what it means to build a ML product

BLUEFROG · December 11, 2020, 3:39pm

DEVONthink is not an ML product so you’re talking about something beyond the scope of the application.

bosie · December 11, 2020, 5:08pm

maybe one day it can add ML and boost its search?
but yes, that’s why i started the thread asking for a recommendation for another application.

rmschne · December 11, 2020, 5:22pm

Lot easier for all if up front you had indicated your expertise in machine learning and led us into giving ideas for an app that does machine learning given your expertise …

mjnnyc · December 16, 2020, 1:51am

So, putting aside the ML digression, getting to your original goal: would you be able to check, say, “x # of words” from the start of your OCR-ed doc against a table of, say, company names, and if it finds a match there then use it to populate the “COMPANYNAMEFIELD” portion of your filename?

You could, in theory, do the same for “DOCTYPE,” then have variable lengths after DOCTYPE triggering an import of any data relevant to DOCTYPE. Like, e.g., if it finds “Recipe” for DOCTYPE it then pulls in any following words prior to a carriage return and appends that to DOCTYPE.

This checking against tables might get you part of the way to the automation you desire.
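
The table check suggested above can be sketched in a few lines of Python. This is only an illustration under assumptions: the `COMPANIES` and `DOCTYPES` tables, the 50-word window, and the function names are all made up for the example, and the matching is a crude case-insensitive substring test rather than anything DEVONthink or Hazel provides.

```python
import re

# Hypothetical lookup tables; replace with your own names and types.
COMPANIES = ["Acme Energy", "Example Bank", "City Lab"]
DOCTYPES = ["Invoice", "Statement", "Recipe"]


def first_words(text, limit):
    """Return the first `limit` whitespace-separated words as one string."""
    return " ".join(text.split()[:limit])


def find_entry(haystack, table):
    """Return the first table entry contained in haystack (case-insensitive)."""
    lowered = haystack.lower()
    for entry in table:
        if entry.lower() in lowered:
            return entry
    return None


def rest_of_line_after(text, keyword):
    """Return whatever follows `keyword` on the same line, or ''."""
    pattern = re.escape(keyword) + r"[ \t:]*([^\r\n]*)"
    match = re.search(pattern, text, re.IGNORECASE)
    return match.group(1).strip() if match else ""


def suggest_name(ocr_text, limit=50):
    """Build 'COMPANY - DOCTYPE[ - extra].pdf' from the start of the OCR text."""
    head = first_words(ocr_text, limit)
    company = find_entry(head, COMPANIES) or "UNKNOWN"
    doctype = find_entry(head, DOCTYPES) or "UNKNOWN"
    parts = [company, doctype]
    if doctype == "Recipe":
        # Pull in the words between the doctype and the next line break.
        title = rest_of_line_after(ocr_text, doctype)
        if title:
            parts.append(title)
    return " - ".join(parts) + ".pdf"


print(suggest_name("Acme Energy\nInvoice No. 123\nTotal 45 EUR"))
print(suggest_name("Recipe: Lemon Cake\nIngredients: flour, sugar"))
```

Anything the tables do not cover falls through as `UNKNOWN`, so the approach only helps for senders and document types you have listed in advance.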

bosie · December 16, 2020, 9:26am

no, it unfortunately isn’t that easy. Where the relevant information sits in the document is basically unknown, and just taking ‘x words after triggerword’ doesn’t work.
having a company list like you suggest is cumbersome as i don’t deal with most companies that much. And the doctype is surprisingly difficult. Financial and medical services have all sorts of doctypes and it might even be a one-off. I.e. i get a referral from a doctor to run some lab tests. adding the doctor and lab name to my company list is just too fiddly (for me at least).

eburgwedel · December 16, 2020, 3:09pm

I’ve been trying to solve this problem for years. I’ve written hundreds of Hazel rules, only to find myself in the same old ditch every other week. A new document type, or something in the known ones changed: a space turned into a CR in the PDF, the month became “March” instead of “Mar”, the currency suddenly became “EUR” instead of “€”. A nightmare.
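
Some of this drift (a CR where a space used to be, “Mar” versus “March”, “€” versus “EUR”) can be absorbed by normalizing the OCR text before any rule looks at it. A minimal Python sketch, assuming English month abbreviations and treating the `MONTHS` table and the `normalize` function as hypothetical names for the example. It will also expand standalone words that merely look like month abbreviations (the name “Jan”, for instance), so treat it as a starting point only.

```python
import re

# Assumed English abbreviations; "May" needs no entry.
MONTHS = {
    "jan": "January", "feb": "February", "mar": "March", "apr": "April",
    "jun": "June", "jul": "July", "aug": "August", "sep": "September",
    "sept": "September", "oct": "October", "nov": "November", "dec": "December",
}
# Whole-word abbreviations only, with an optional trailing dot.
MONTH_RE = re.compile(
    r"\b(jan|feb|mar|apr|jun|jul|aug|sept|sep|oct|nov|dec)\b\.?",
    re.IGNORECASE,
)


def normalize(text):
    """Map common OCR/layout variants onto one canonical spelling."""
    # Spell the currency out so both "€45" and "45 €" end up with "EUR".
    text = text.replace("€", " EUR ")
    # Expand month abbreviations; full month names are left untouched.
    text = MONTH_RE.sub(lambda m: MONTHS[m.group(1).lower()], text)
    # Collapse CR, LF, tabs and repeated spaces into single spaces.
    return re.sub(r"\s+", " ", text).strip()


print(normalize("Total:\r45 €\nDate: 3 Mar 2020"))
```

Rules written against the normalized text then only need to handle one spelling of each variant, which does not remove the need for rules but should make them break less often.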

I gave up at some point and wrote myself a Perl/AppleScript script, which reformats and cleans up the name and prepends dates found in the name, or its creation date as a fallback. It also appends things I select in the document.

It’s not perfect, but saved me hours of typing in the long run. If you are interested, happy to share.
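
For readers who want to try the same idea, here is a rough Python sketch of that kind of clean-up. It is not the script mentioned above: the function name `tidy_name`, the accepted date formats, and the output layout are all assumptions made for the example. The fallback date is passed in by the caller, since how to read a file’s creation date depends on the platform.

```python
import re
from datetime import date

# Assumed formats: 2020-12-16, 2020_12_16, 2020.12.16 or 20201216.
DATE_RE = re.compile(r"(?<!\d)(\d{4})[-_.]?(\d{2})[-_.]?(\d{2})(?!\d)")


def tidy_name(filename, fallback, selection=""):
    """Return 'YYYY-MM-DD cleaned name [selection].ext'."""
    stem, dot, ext = filename.rpartition(".")
    if not dot:  # no extension present
        stem, ext = filename, ""
    found = None
    match = DATE_RE.search(stem)
    if match:
        try:
            found = date(int(match[1]), int(match[2]), int(match[3]))
            # Cut the date out of the name so it is not repeated.
            stem = stem[:match.start()] + " " + stem[match.end():]
        except ValueError:
            found = None  # e.g. month 13: not a real date, keep the text
    stamp = (found or fallback).isoformat()
    # Turn underscores and runs of whitespace into single spaces.
    words = re.sub(r"[\s_]+", " ", stem).strip(" -")
    parts = [stamp, words, selection.strip()]
    name = " ".join(p for p in parts if p)
    return name + (dot + ext if dot else "")


print(tidy_name("scan_2020-12-16_invoice__acme.pdf", date(2020, 1, 1)))
print(tidy_name("Scan 0042.pdf", date(2020, 12, 16), "Lab referral"))
```

The `selection` argument stands in for text picked by hand in the document, and the caller decides where `fallback` comes from, for example the file’s creation or modification time.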