So far, I don’t have a universal workflow. It’s somewhat automated, but PDFs from different sources are handled differently.
When I scan, PDFs go into my DTPO inbox straight from the ScanSnap, and DTPO OCRs them. Eventually, when I process them, I rename them and sort them into the appropriate databases. I scan a lot of different types of things, so I don’t have an automated way of handling them. Magazines get renamed to the magazine name and the volume number. Bills get renamed to the utility and the billing date (not the date of the scan). Recipes, etc. get renamed depending on what they are.
I have some automation for PDF bills that I regularly download (check out the Hazel rule export attached). For these, it was worth coming up with Hazel rules that see which bill it is (usually by checking the source URL, sometimes the text), scan the text for the bill date (how to find it depends on which utility it is; I haven’t found a generic way yet), and rename things appropriately. Downloaded PDFs rarely require OCR, so I don’t have that in the rules.
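If you'd rather see the logic than open the Hazel export, here is roughly the same idea as a small Python sketch. This is not what Hazel runs and not a copy of my rules: the utility names, the URL fragment, and the date patterns are all made up for illustration, and each real utility would need its own entry.

```python
import re
from datetime import datetime

# Hypothetical per-utility rules (names, URL fragment, and patterns are invented).
# Each rule says how to recognise the bill and how to find its date in the text.
RULES = [
    {
        "name": "Example Electric",
        "url_contains": "example-electric.test",   # identify by source URL
        "text_contains": None,
        "date_regex": r"Billing Date:\s*(\d{2}/\d{2}/\d{4})",
        "date_format": "%m/%d/%Y",
    },
    {
        "name": "Example Water",
        "url_contains": None,
        "text_contains": "Example Water District",  # identify by text instead
        "date_regex": r"Statement date\s+(\d{4}-\d{2}-\d{2})",
        "date_format": "%Y-%m-%d",
    },
]


def match_rule(source_url, text):
    """Return the first rule whose URL fragment or text marker matches, else None."""
    for rule in RULES:
        if rule["url_contains"] and rule["url_contains"] in source_url:
            return rule
        if rule["text_contains"] and rule["text_contains"] in text:
            return rule
    return None


def bill_filename(source_url, text):
    """Build 'Utility YYYY-MM-DD.pdf' from the bill date, or None if unsure."""
    rule = match_rule(source_url, text)
    if rule is None:
        return None  # unknown bill: leave the file alone
    found = re.search(rule["date_regex"], text)
    if found is None:
        return None  # date not found: leave the file alone
    try:
        bill_date = datetime.strptime(found.group(1), rule["date_format"])
    except ValueError:
        return None  # matched something that is not a real date
    return f"{rule['name']} {bill_date:%Y-%m-%d}.pdf"
```

Returning None whenever anything is uncertain is deliberate: a file that doesn't match a known utility stays untouched instead of getting a wrong name.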
Other PDFs I download are renamed on an ad hoc basis or left alone. I download a lot of product manuals, books, and roleplaying materials, and I mostly keep their filenames as is.
I also create PDFs in a number of ways, and they generally go straight in. For example, when capturing an article from the web these days, I like to go into Reader mode in Safari, then print as “PDF to DTPO”. Those articles need to get renamed, but not OCRed.
When thinking of all the different cases I’ve got, it seems like putting together automation that would correctly handle all the exceptions would be a lot of work. Putting together automation which blindly OCRed and renamed files would cause a lot of damage I’d have to undo later, so it doesn’t seem like a win to me.
Downloads.zip (6.15 KB)