I am trying to extract data from a bunch of PDFs files that I processed in DT3 with OCR. The files are essentially my downloaded credit card statements from the past few years.
I want to run an extract routine that pulls specific dates, vendor information, and charged amounts from each monthly statement and then makes that data available in an easy to read Numbers or Excel file.
I prefer Numbers because that is what I have installed, but if required I can get a copy of Excel to do this as well. Just wondering if there is an easy way to extract the data from the OCR documents so that I don’t have to manually cut and paste this from each PDF. I know I can search then and when I click on each of the search results it does pull up the PDF files with that data I am searching, like a specific Vendor name and it highlights it, but I am not clear on how to extract this information other than manually cutting and pasting it.
If you OCRd these statements: I doubt that extracting amounts, vendors etc can be done reliably. There’s no way to tell
if the OCR retrieved adjacent parts of the PDF as adjacent texts
if it even recognised all the information you need/want
Even if all that worked flawlessly: how would you go about telling any algorithm how to fish out vendor, date and amount data from the statements?
I don’t quite follow. The main point of Excel/Numbers and such is not “easy to read” but “calculation”. PDF is already ready to read, in my opinion.
In any case, that’s not an easy task, I think. What are the benefits of a Numbers version of you CC statements you’re looking for?
The main goal is to create a spreadsheet with each vendor that was paid, the date they were paid and the amount they were paid. It is multiple credit card statements over many years. The goal being to calculate in the end the total spend with each vendor on an annual basis and possibly to run several pivots against the data. For instance, once the data is extracted I could manually categorize each vendor into categories like restaurants, clothing, plumbing, auto and so forth and then would be able to run a pivot by month, year, category, or vendor to show totals where needed.
I am looking at some alternative applications right now like mac-pdf-converter which, I can purchase to do this, but was really hoping there were available routines within Devon or MacOS Automator to make this possible, but so far no luck with that.
As you have stated, I would need to develop an algorithm to look at each document, identify and extract the date of purchase, tell it to look for a vendor name following that and then look for the amount. Unfortunately, when using automator, the dates, the vendor names, and the amounts all end up in different locations in the resulting text or rich text documents. So, the data elements are in no way related to each other. Hoping to find a better tool or option to do this.
This not a task for DEVONthink. If you are able to discern the pattern (and I fear this not a simple task), you could (or have someone) write in Python the code to do this. Python has terrific libraries for reading/writing PDF and Excel, and text searching. Relatively easy to get the PDF file in, and relatively easy to output to CSV or and Excel file, both of which are readable by Numbers. The hard part is to figure out the search pattern in each or all of the files. Perhaps better to do a one-time manual effort?