Data Extraction to Numbers or Excel with DEVONthink

Pondleaper · February 22, 2021, 1:47pm

I am trying to extract data from a bunch of PDFs files that I processed in DT3 with OCR. The files are essentially my downloaded credit card statements from the past few years.

I want to run an extract routine that pulls specific dates, vendor information, and charged amounts from each monthly statement and then makes that data available in an easy to read Numbers or Excel file.

I prefer Numbers because that is what I have installed, but if required I can get a copy of Excel to do this as well. Just wondering if there is an easy way to extract the data from the OCR documents so that I don’t have to manually cut and paste this from each PDF. I know I can search then and when I click on each of the search results it does pull up the PDF files with that data I am searching, like a specific Vendor name and it highlights it, but I am not clear on how to extract this information other than manually cutting and pasting it.

cgrunenberg · February 22, 2021, 1:56pm

DEVONthink is only able to extract the document date & amount (see placeholders in smart rules & batch processing or AppleScript suite), extracting the vendor would require your own script.

You could then e.g. use the extracted data to create a sheet (CSV) which can be exported to Numbers or Excel.

chrillek · February 22, 2021, 2:43pm

If you OCRd these statements: I doubt that extracting amounts, vendors etc can be done reliably. There’s no way to tell

if the OCR retrieved adjacent parts of the PDF as adjacent texts
if it even recognised all the information you need/want

Even if all that worked flawlessly: how would you go about telling any algorithm how to fish out vendor, date and amount data from the statements?

I don’t quite follow. The main point of Excel/Numbers and such is not “easy to read” but “calculation”. PDF is already ready to read, in my opinion.

You could use a script and regular expressions. I’d employ JavaScript for that, but others here (notably @pete31) have forced AppleScript into submission for this kind of task.
In any case, that’s not an easy task, I think. What are the benefits of a Numbers version of you CC statements you’re looking for?

Pondleaper · February 22, 2021, 3:10pm

Thank you for the quick responses.

The main goal is to create a spreadsheet with each vendor that was paid, the date they were paid and the amount they were paid. It is multiple credit card statements over many years. The goal being to calculate in the end the total spend with each vendor on an annual basis and possibly to run several pivots against the data. For instance, once the data is extracted I could manually categorize each vendor into categories like restaurants, clothing, plumbing, auto and so forth and then would be able to run a pivot by month, year, category, or vendor to show totals where needed.

I am looking at some alternative applications right now like mac-pdf-converter which, I can purchase to do this, but was really hoping there were available routines within Devon or MacOS Automator to make this possible, but so far no luck with that.

As you have stated, I would need to develop an algorithm to look at each document, identify and extract the date of purchase, tell it to look for a vendor name following that and then look for the amount. Unfortunately, when using automator, the dates, the vendor names, and the amounts all end up in different locations in the resulting text or rich text documents. So, the data elements are in no way related to each other. Hoping to find a better tool or option to do this.

rmschne · February 22, 2021, 3:30pm

This not a task for DEVONthink. If you are able to discern the pattern (and I fear this not a simple task), you could (or have someone) write in Python the code to do this. Python has terrific libraries for reading/writing PDF and Excel, and text searching. Relatively easy to get the PDF file in, and relatively easy to output to CSV or and Excel file, both of which are readable by Numbers. The hard part is to figure out the search pattern in each or all of the files. Perhaps better to do a one-time manual effort?

DTLow · February 22, 2021, 4:02pm

I use applescript to extract DTP data to a spreadsheet
Spreadsheets allow much better presentation options

Not sure about extracting data from a PDF
I see two problems

Access to the OCR’d data
The data is not in a standard format with specific fields
I doubt you can create accurate algorithms for this
I do this manually when adding the data to DTP

mailbuoy · February 23, 2021, 12:46pm

If DT won’t do exactly what you need, you might try PDFZone. Would have the extra step of extracting data to a .cvs and then importing into numbers.

chrillek · February 23, 2021, 1:15pm

That’s simple with a script: access the plain text property in AppleScript or plainText in JavaScript.

Exactly. No guarantee that the OCR is doing “the right thing”. Even in the case of machine generated PDFs (like account statements) the text layer can be wide off any human expectation.