Auto-renaming PDF after OCR based on content

bosie · March 1, 2021, 5:49pm

what do you mean?

@rmschne that’s why i was wondering if you could get away with it by using regex to extract the information and have multiple types of docs covered by a single rule.

if this rule covers all your AMEX files, how do you name AMEX files that are informing you of an increase of fees or something?

chrillek · March 1, 2021, 5:54pm

From what I’ve seen, I doubt that it does – seems to be geared towards statements. If you’d want to process all AMEX documents (are they so many apart from statements?), you could use cascading rules: the first triages the documents (setting a temporary tag, for example), the others are triggered on these tags and act accordingly.
But that’s not really a DT topic.
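
A minimal sketch of the cascading idea, written in Python rather than as Hazel or DEVONthink rules. The `Document` class, the tag names, and the keywords used for triage are all assumptions made up for illustration; real statements would need their own markers.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class Document:
    text: str                      # OCR text of the scanned PDF
    name: str = "scan.pdf"
    tags: Set[str] = field(default_factory=set)


def triage(doc: Document) -> None:
    """First rule: only decides what kind of AMEX document this is."""
    lowered = doc.text.lower()
    if "american express" not in lowered:
        return  # not an AMEX document, leave it for other rules
    # Hypothetical marker: assumes statements mention a closing balance.
    if "closing balance" in lowered:
        doc.tags.add("tmp-amex-statement")
    else:
        doc.tags.add("tmp-amex-other")


def rename_statement(doc: Document) -> None:
    """Second rule: triggered only by the temporary statement tag."""
    if "tmp-amex-statement" in doc.tags:
        doc.name = "Amex - Card Statement.pdf"
        doc.tags.discard("tmp-amex-statement")


def rename_other(doc: Document) -> None:
    """Third rule: triggered only by the temporary catch-all tag."""
    if "tmp-amex-other" in doc.tags:
        doc.name = "Amex - Letter.pdf"
        doc.tags.discard("tmp-amex-other")


def run_rules(doc: Document) -> Document:
    # Order matters: triage first, then the rules that act on its tags.
    for rule in (triage, rename_statement, rename_other):
        rule(doc)
    return doc
```

The point of the split is that the triage step is the only place that has to understand the document text; the later steps just react to tags.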

rmschne · March 1, 2021, 6:00pm

This only covers AMEX statements. Any other AMEX document would never pass the rules. If I had multiple accounts with the same formatted statement, then I could put in a rule which looks for the Account Number, puts it into a variable, and then adds that to the file name. That’s what I do with Bank statements (checking and savings … same bank, different account numbers).
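
For anyone who prefers to see it as code: a rough Python sketch of the "find the account number, put it in a variable, add it to the file name" step. This is not Hazel's rule syntax. The `Account Number:` and `Statement Date:` labels, the date format, and the resulting file name layout are assumptions; an actual statement will be worded differently.

```python
import re
from typing import Optional

# Hypothetical patterns: assume the OCR text contains lines such as
# "Account Number: 1234-567890" and "Statement Date: 2021-02-28".
ACCOUNT_RE = re.compile(r"Account\s+Number[:\s]+([\d\- ]*\d)", re.IGNORECASE)
DATE_RE = re.compile(r"Statement\s+Date[:\s]+(\d{4})-(\d{2})-(\d{2})", re.IGNORECASE)


def statement_filename(ocr_text: str, bank: str) -> Optional[str]:
    """Build a file name from the OCR text, or return None if the
    document does not look like a statement (so the rule does not fire)."""
    account = ACCOUNT_RE.search(ocr_text)
    date = DATE_RE.search(ocr_text)
    if not account or not date:
        return None
    # Keep only the digits and use the last four to tell accounts apart.
    digits = re.sub(r"\D", "", account.group(1))
    last4 = digits[-4:]
    return f"{'-'.join(date.groups())} {bank} Statement {last4}.pdf"
```

A document without both fields returns `None`, which mirrors the behaviour described above: anything that is not a statement never passes the rule.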

REGEX hurts my brain, frankly. Always did and probably always will.

But as @chrillek says, this is getting off topic. If you want more Hazel discussion, they have a forum for that.

chrillek · March 1, 2021, 6:32pm

Don’t give up just yet. There certainly are very dark and murky corners in some RE dialects (like the recursive code-executing weirdness in Perl), but the basics are way cool and not too complicated. And they come in handy quite often (at least for me).

bosie · March 1, 2021, 6:39pm

rmschne · March 1, 2021, 6:47pm

Yea. I have done a bit of REGEX in Python over the years. Works well. And this is built into Macs already, yet few know about it and exploit it.

jasonekratz · March 1, 2021, 9:01pm

My solution to this problem years ago was on a weekly basis tagging files as I scanned them. Then Hazel would move stuff around based on the tags. There simply is no easy way to have the tools do all sorts of “magic”, they just don’t do that. I keep my folder/group structure simple because search indexing and searching is WAY easier than some complex folder/group structure. For the most part all of my files are OCRed PDF files. In the case given above for AMEX stuff I’d put everything in one AMEX folder and use search when I needed to find something.

bosie · March 1, 2021, 9:10pm

that’s what i have been doing too but search isn’t useful to me anymore. i cannot figure out from the title what i am looking at, which makes going through the search list in DT cumbersome. how are you getting around that if all your files are named with timestamps for example?

jasonekratz · March 1, 2021, 9:47pm

Depends. For some things I end up renaming the files to something meaningful. But I’m always using some combination of the tags, the location of the file, and the dates to find stuff in the search results. Or I just go right to a given group and just look around.

In your example above of “if this rule covers all your AMEX files, how do you name AMEX files that are informing you of an increase of fees or something?” the file is a searchable PDF I assume. How many AMEX files do you have that say “increasing fee” that are relevant? I mean if you wanted to find all the times you had a fee increase among a list of files simply named with a timestamp you could search for that text and get the results. If you just wanted to know the latest one, you search for the text and take the most recent result.

I guess my problem in the past has always been overengineering this kind of thing because “what if…”. But the “what if…” never happens. Then I went back and really simplified it based on just thinking through the reality of how often I need to really get back to stuff. This system is simple and hasn’t failed me once. I’ve ALWAYS been able to find what I need even if it requires a bit of clicking around in different files in the search results.

In the time since, I even stopped using Hazel, quite honestly. These days I just shove everything in the global inbox in DT (I scan directly to it) and once a week just move stuff manually into the same set of groups. It’s part of my weekly review. Sometimes I rename files, sometimes not.

Part of the problem in this thread is it’s hard to determine why you’re doing what you’re doing and what the volume of documents is. Are you just storing stuff for potential future searching? How often are you searching? How many documents do you have coming in on a daily basis? My system works great for me but my volume of stuff is, I think, relatively low and I hardly ever need to look back at most of it (like bank statements: I look at them once, usually, and once in a blue moon have to go back to them for something).

rkaplan · March 1, 2021, 10:16pm

I think most of us start out with some very specific tagging or classification system like this.

Over time we realize that only the very big picture is needed - like a group for “Bills” or “Invoices” or “Tax Returns”

Beyond that, smart rules work stunningly well - and can evolve on the fly retrospectively as the information is needed.

The main need to tag or group documents meticulously is if there is a legal or medical or fiduciary interest involved, i.e. a lawyer or doctor keeping records for a client or patient or some other professional activity where you need to be able to identify documents for the entire “case.”

For almost all personal/household tasks, smart rules do just fine.

jasonekratz · March 1, 2021, 10:39pm

Yes. Totally agree. The last paragraph is why I was asking what the use case is. Could be vastly different than mine but it’s a bit necessary to get more information to make any suggestions truly useful.

bosie · March 2, 2021, 12:01am

it is purely to find something and record keeping. Searching/retrieving is the main activity i do in DT i guess (but I guess that isn’t too different from anyone else because why else would you use DT?). My volume is low, maybe 5 a day. Though might not be about volume.
Also about error rates and work. If i don’t do it for 2 weeks, i have zero will to do 50 or 80 on a saturday morning.

How many AMEX files do you have that say “increasing fee” that are relevant?

semantically or that phrase specifically? situation i find myself more and more in is that i search for ‘increasing fee’ and get 0 results or the wrong results just to realise after 5 minutes that it wasn’t that phrase specifically. title might get around that and normalise it across all financial institutes. though that’s an issue with the search in DT I am trying to solve. Would have the advantage of being portable to DTTG which has a much weaker search system.

BLUEFROG · March 2, 2021, 1:56am

Why put that info in the filename?
A filename is not supposed to be a full summary. You can use tags, Finder Comments, custom metadata, etc. to house such arbitrary comments.

bosie · March 2, 2021, 9:38am

sure, i could use tags. custom metadata is probably not portable and finder comments are terrible imo anyways. easiest is still just naming it, easiest to grep on the command line, easiest to spot, no third party tool required to extract it.

anyways, problem still remains…

chrillek · March 2, 2021, 10:07am

In this context (“naming and importing files into DT”) grep on the command line does not make a lot of sense, I think. As @BLUEFROG said: stuffing the file name with a bunch of metadata (which is “metadata” for a reason) just makes it less legible. And inside of DT you do not really need to grep because you have full text search.

It is not quite clear (to me, at least) what exactly your problem is. If you want to search for “fee increasing” but there’s no mention of these words in the documents at all, how would automatically naming the documents help? The algorithm would have to search for something else, but then you could simply search for this something as well whenever you need it. But if you don’t know what it is you’re looking for, how could DT?

bosie · March 2, 2021, 10:30am

i do process my files outside of DT too, though. and going forward the next app i use might not be able to use tags. not sure if linux can even read tags?

the problem is that searching in DT has become cumbersome. i never find the right documents quickly. having at least some help in the search panel would be useful. i thought it is naming. i don’t care if it is naming or tags. it still needs to be named or tagged properly.
sub problem is that i don’t want to click around when i have a search list in front of me. i should be able to determine the correct file with a single click.

The algorithm would have to search for something else, but then you could simply search for this something as well whenever you need it.

obviously not. that is like knowing the unknown. i don’t want to search for specific words but semantically similar meaning. otherwise i would have to be able to remember exact phrasings in 10s of thousands of documents.

But if you don’t know what it is you’re looking for, how could DT?

I do know what i am looking for, i just don’t know the specific phrasing that is in the document. i even do this for photos in lightroom. searching for forest gives me results for tree, leaf, leaves, woodlands etc.

ok, guys, let’s turn the tables. i might be extra thick but from all the replies i still don’t understand how you guys are actually naming things. it seems to be something like 189389238932.pdf because naming things is useless anyways, right? all hail the tags.
how are you finding ANYTHING if all you have is amex-letter1.pdf and amex-letter2.pdf ?

chrillek · March 2, 2021, 11:39am

I doubt that you’ll find any desktop software providing just that. Even the huge search backends (like Amazon’s or those of grocery stores) find far too many wrong matches (at least for my taste).

grep is obviously completely out of the picture here: even egrep can only search regexes, not semantics. You wouldn’t get anything more with them than you get already with DT.

As to your example with photos: if I search for “Wald” (forest in German) in Apple’s software, I get some pictures of forests. And I get my husband between two rocks (no trees, no green) and him alone somewhere. No tree, no leaf, no branch. If I search for “Ast” (aka branch), I get pictures of flowers. Far off the mark. Maybe Apple’s algorithm works better in English or Lightroom’s is simply brilliant? But then, LR relies on the cloud. DT does not. And I’m wondering if this algorithm relies on predefined terms (as does Apple’s) or if it really “learns”, e.g. from keywords that you define for your photos. And while that may work, it still requires you to store your photos in LR/cloud. Which might not necessarily be what you’d want to do with your credit card statements.

Nobody said that they’re using this kind of dumbed-down naming. One possibility would be to have groups for credit card statements and for documents. If necessary, add a subgroup for credit cards to the latter. Name your files like “yyyy-mm-dd Amex” and sort them into either the first or the second group. Then, if you need to find your fee increase letters, you know that you can limit your search to the files in the group documents/credit cards with “Amex” in their name. And in case you do that often, just use the tag “fee” for them.

Blanc · March 2, 2021, 11:53am

I use Source - Topic, so in your example it would be Amex - Card Statement or Amex - Rejected charge etc… I adjust the creation date of the document to the date printed on the document.

I would say that most of the time, the document name is only useful to me if I am literally looking for something. If I am searching for it, the content of the document is generally what provides the results.

I have smart rules for every document I receive at all regularly (so that’s bills, income, statements, insurance…); those rules rename the document, adjust the creation date, put the document where it belongs, mark it read & locked, and display a notification. For those documents which are not subject to rules, I often put them where they belong using AI, then open the folder the document has gone to (I have a script in the toolbar which opens the folder of the last document and highlights that document), and then rename it based on the convention I find I used for similar documents in the folder (e.g., Amex - Card Statement could just as easily be Amex - Statement or Amex - Credit Card Statement).
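
The "rename it based on the convention I used for similar documents" step could in principle be automated. A small Python sketch, assuming names follow the Source - Topic pattern and that the list of existing names in the folder is already available (`suggest_name` is a made-up helper, not a DEVONthink function):

```python
from collections import Counter
from typing import Iterable, Optional


def suggest_name(existing_names: Iterable[str], source: str) -> Optional[str]:
    """Suggest the most frequently used 'Source - Topic' name for a source,
    based on the names already present in a folder."""
    prefix = f"{source} - ".lower()
    topics = Counter(
        name[len(prefix):].strip()
        for name in existing_names
        if name.lower().startswith(prefix)
    )
    if not topics:
        return None  # nothing from this source yet, so no convention to follow
    topic, _count = topics.most_common(1)[0]
    return f"{source} - {topic}"
```

This only picks the dominant existing variant; it does not look at the document content, so it cannot tell a statement from a letter on its own.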

bosie · March 2, 2021, 3:54pm

Not sure i follow. of course i am searching the entire document but the search workflow (which i assume you are talking about in the second sentence) is heavily influenced by the name of the document (i might switch to a partial name/tag workflow though)?

though even that is not true i think as one of the two DT devs mentioned at some point that the name influences the score, hence the rank, heavily. might have misunderstood and @BLUEFROG can chime in whether or not tags get the same (or potentially higher?) priority.

For those documents which are not subject to rules, I often put them where they belong using AI

what does that mean? you move it based on the first recommendation in the See Also list?

bosie · March 2, 2021, 4:01pm

Amazon … wrong matches

Not sure I agree with you on this. if i search for ‘forest’ but don’t get woodlands by DT, DT is providing wrong matches anyways. It just goes in the other direction. Recall with DT is relatively low because of a very specific and narrow search.

As for lightroom, i should have explained better, sorry. I set up the synonym lists myself. LR does not rely on the cloud though. Data is not stored in the cloud (at least not if you use the Classic version).
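
To make the synonym-list idea concrete, here is a minimal Python sketch of the same approach applied to document text. It is not how Lightroom or DT work internally; the `SYNONYMS` table, the `search` function, and the example phrases are all assumptions for illustration, and the lists have to be maintained by hand.

```python
import re
from typing import Dict, List, Set

# Hand-maintained synonym lists (hypothetical entries).
SYNONYMS: Dict[str, Set[str]] = {
    "forest": {"woodland", "woodlands", "tree", "trees"},
    "increasing fee": {"fee increase", "higher fee", "fee will rise"},
}


def expand(query: str, synonyms: Dict[str, Set[str]]) -> Set[str]:
    """Return the query plus every synonym registered for it."""
    key = query.lower()
    return {key} | {s.lower() for s in synonyms.get(key, set())}


def search(query: str, documents: Dict[str, str],
           synonyms: Dict[str, Set[str]] = SYNONYMS) -> List[str]:
    """Return names of documents whose text contains the query
    or any of its synonyms as whole words or phrases."""
    terms = expand(query, synonyms)
    pattern = re.compile(
        "|".join(r"\b" + re.escape(t) + r"\b" for t in sorted(terms)),
        re.IGNORECASE,
    )
    return sorted(name for name, text in documents.items() if pattern.search(text))
```

This raises recall only for phrasings someone has already thought to list; it is plain term expansion, not semantic search.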

Nobody said that they’re using this kind of dumbed-down naming. One possibility would be to have groups for credit card statements and for documents. If necessary, add a subgroup for credit cards to the latter. Name your files like “yyyy-mm-dd Amex”

I know nobody said it, I was half-joking about it. Hilariously you suggested an even more dumbed-down naming strategy? I don’t see how “amex-letter1.pdf” is any different than your suggestion of “yyyy-mm-dd Amex” except you wouldn’t even know if it is a letter or card statement. Not trying to be difficult but this is a genuine question, how often do you need to click around to find anything if you name things this way? How often is the top scored result what you want?