How much automation with OCRed-PDFs?

gestyle · August 5, 2013, 9:03am

Hello,

I’d like to index PDFs (receipts) coming from ExactSCAN Pro. The PDFs are already readable, and after saving them on my external disk as “0000-00-00_0000_ACCOUNT_00,00_OBJECT_MERCHANT_.00.0.pdf”,
I have two options:

a) renaming them with an application I haven’t found yet, then indexing the PDF files in DT-Pro.

b) using a script in order to extract the information I’d like to appear in the name, then indexing the PDF-files in DT-Pro. As an example, a filename would become: “2013-08-05_0135_SOFTWARE_99,95_DEVONtech_DEVONthinkPro.33.1.pdf”

Is it possible to extract the expenditure amount (99,95) from the readable PDF so that the filename is updated with this information?

And is there any solution, scripted or not, that could assist with the renaming job?

Any hints would be much appreciated

G

Greg_Jones · August 5, 2013, 9:16am

Hazel will automate all of this for you, although it comes with a price ($28 USD). There is a recent topic here about using Hazel to do pretty much exactly what you are looking to do. To my knowledge, there is no other solution out there that will automate the naming of documents based on a date in the document.

gestyle · August 5, 2013, 12:24pm

Thank you, Greg!

Hazel seems to open many doors. I registered on their forum and am quite curious to see what answers come back.

The whole article you mentioned is interesting, as it shows that a full-index-process really works. Glad to have this kind of info.

I’ll drop a few more lines as soon as I get some feedback.

G

BLUEFROG · August 5, 2013, 8:23pm

```
echo "2013-08-05_0135_SOFTWARE_99,95_DEVONtech_DEVONthinkPro.33.1.pdf" | sed -E 's/.*_([0-9]*,[0-9]*).*$/\1/'
>> 99,95
```

8)

gestyle · August 5, 2013, 11:43pm

Sorry BLUEFROG, for spreading the wrong pricing – I just chose any price as an example – but I think it could work too …with a few nice functions added.

Are these two lines of code able to extract all the info out of a readable PDF – no, really?

BLUEFROG · August 6, 2013, 2:26am

Scraping the full contents of a PDF is a different matter than your example. Your example was a filename with a very consistent structure, so any filename built this way would return the price with that expression.

For example, if I run the expression on: 12110_1991,23_some purchase.pdf it will return 1991,23.
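
For anyone who wants to reuse the expression in a script, here is a minimal sketch of it wrapped in a shell function. The function name `extract_amount` and the sample name `notes.pdf` are made up for illustration. It also differs from the one-liner above in two ways that are my own assumptions about what is wanted: it uses `+` instead of `*` so that at least one digit is required on each side of the comma, and it uses `sed -n` with the `p` flag so that a filename without an amount prints nothing instead of being echoed back unchanged.

```shell
#!/bin/sh
# Hypothetical helper: print the amount embedded in a receipt filename.
# Prints nothing and returns a non-zero status if no amount is found.
extract_amount() {
    # -n plus the trailing "p" flag prints only when the substitution matched.
    amount=$(printf '%s\n' "$1" | sed -nE 's/.*_([0-9]+,[0-9]+).*$/\1/p')
    [ -n "$amount" ] && printf '%s\n' "$amount"
}

extract_amount "2013-08-05_0135_SOFTWARE_99,95_DEVONtech_DEVONthinkPro.33.1.pdf"
extract_amount "12110_1991,23_some purchase.pdf"
extract_amount "notes.pdf" || echo "no amount found"
```

Run as written, this should print 99,95, then 1991,23, then "no amount found". It still only looks at the filename; it does not read anything from inside the PDF.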