Extracting bank statement information

p_mitchell · November 25, 2025, 10:55pm

This may not be strictly a DEVONthink issue but because the OCR is so good, and now that AI is enabled, floating the question on the forum seems worthwhile.

I am a lawyer. A client has given me a lot of old bank statements. They are no longer available from the bank in CSV or a similar format. I am wondering if, after being OCR’d in DEVONthink, one of the AI engines can be used to extract the data into the respective columns (Date/Description/Debit/Credit/Balance) in CSV or similar so that a reconciliation can be prepared.
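Once a statement has been extracted into those five columns, the arithmetic itself can be used to check the extraction before any reconciliation is built on it. The following is a minimal sketch, not a finished tool: it assumes the CSV header is exactly Date,Description,Debit,Credit,Balance, that amounts use a period as the decimal separator, and that the opening balance is known. The function names and the sample rows are invented for illustration.

```python
import csv
import io
from decimal import Decimal


def to_amount(cell: str) -> Decimal:
    """Turn a cell like '1,250.00' or '' into a Decimal (empty means zero)."""
    cleaned = cell.replace(",", "").strip()
    return Decimal(cleaned) if cleaned else Decimal("0")


def find_balance_breaks(csv_text: str, opening_balance: Decimal) -> list:
    """Return the 1-based data-row numbers whose Balance does not follow
    from the previous balance minus Debit plus Credit."""
    breaks = []
    running = opening_balance
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=1):
        expected = running - to_amount(row["Debit"]) + to_amount(row["Credit"])
        stated = to_amount(row["Balance"])
        if stated != expected:
            breaks.append(row_number)
        # Carry on from the stated balance so a single misread is flagged once.
        running = stated
    return breaks


# Invented sample data; row 3 has two digits transposed in the balance.
sample = """Date,Description,Debit,Credit,Balance
2019-03-01,Rent,"1,200.00",,3800.00
2019-03-04,Salary,,2500.00,6300.00
2019-03-05,Groceries,85.40,,6124.60
"""

print(find_balance_breaks(sample, Decimal("5000")))
```

Any row the check flags is a candidate for a misread digit and is worth comparing against the scan by hand. Decimal is used rather than float so that cents compare exactly.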

I read an article in TidBITS yesterday suggesting that ChatGPT Atlas might be a solution, but I would be interested to know if the collective wisdom of this group has any suggestions.

BLUEFROG · November 26, 2025, 12:04am

If the AI engine and model are up to it, I’d say it’s likely to be feasible.
And if the text layer is good, even better.

Here is a simple prompt: “Extract the exact information in the table on page 1 of the selected document as a sheet.” Logically, you should be as specific as you can be for your documents and situation. Claude 4.5 Sonnet did a bang-up job on a hotel invoice I tested.

cgrunenberg · November 26, 2025, 6:54am

Most modern AI engines do not even need the OCR layer and can process images (like scans) too, in some cases with better results.

kewms · November 26, 2025, 7:02am

What does your code of legal ethics say about sharing sensitive client data with the Internet?

If you decide you need a server model, rather than a local model, I’d advise reading the terms of service very, very carefully.

cgrunenberg · November 26, 2025, 7:06am

A local model like Gemma 3:27b or Qwen3-VL:32b should be sufficient.
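For anyone who wants to try a local model outside DEVONthink, here is a rough sketch of sending one scanned page to an Ollama server running on the same machine, so nothing leaves the computer. It assumes Ollama is installed and listening on its default local port, and that the model tag is spelled "gemma3:27b" in the Ollama library (check the exact tag before pulling). The prompt wording and the function names are made up for illustration.

```python
import base64
import json
import urllib.request

# Default endpoint of a locally running Ollama server (assumption: default port).
OLLAMA_URL = "http://localhost:11434/api/generate"

# Illustrative prompt; adapt it to the layout of the actual statements.
PROMPT = (
    "Extract every transaction on this bank statement page as CSV with the "
    "columns Date,Description,Debit,Credit,Balance. Output only the CSV."
)


def build_request(image_bytes: bytes, model: str = "gemma3:27b") -> dict:
    """Build the JSON body for Ollama's generate endpoint with one page image."""
    return {
        "model": model,
        "prompt": PROMPT,
        # Ollama expects images as base64-encoded strings.
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        # Ask for one complete answer instead of a token stream.
        "stream": False,
    }


def extract_page(image_path: str, model: str = "gemma3:27b") -> str:
    """Send one scanned page to the local server and return the model's text."""
    with open(image_path, "rb") as handle:
        payload = build_request(handle.read(), model)
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

Calling extract_page("page-001.png") (a hypothetical file name) would return whatever text the model produced, which still needs checking against the scan before it is relied on.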

p_mitchell · November 26, 2025, 8:08am

There is an overriding obligation of confidentiality over the client’s materials. Part of the materials have already been disclosed in a Court hearing, although I doubt that the materials have found their way onto the Web.

Currently investigating several options to protect the client’s confidentiality and I see that Christian Grunenberg has also commented.

chrillek · November 26, 2025, 8:53am

Only tangentially related: OCR on my bank statements usually creates the text by columns. So, in the text layer, all dates of a page come first, then all descriptions, followed by all amounts. I never found that very useful except if I wanted to search for a particular amount or description.
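If the text layer really is laid out column by column, the rows can in principle be rebuilt without any AI, as long as the number of transactions on the page is known. This is only a sketch under a strong assumption: every cell occupies exactly one line of text, so a description that wraps onto a second line would break it. The function name and sample lines are invented.

```python
def rows_from_columns(lines, rows_per_page):
    """Regroup column-ordered OCR lines into one tuple per transaction.

    `lines` holds the whole first column, then the whole second column, and
    so on; `rows_per_page` is how many transactions the page contains.
    """
    if rows_per_page <= 0 or len(lines) % rows_per_page != 0:
        raise ValueError("line count is not a whole number of columns")
    # Cut the flat list into equal-sized columns ...
    columns = [
        lines[start:start + rows_per_page]
        for start in range(0, len(lines), rows_per_page)
    ]
    # ... then transpose the columns back into rows.
    return list(zip(*columns))


# Invented sample: two transactions, three columns, in column order.
ocr_lines = [
    "01 Mar", "04 Mar",       # all dates first
    "Rent", "Salary",         # then all descriptions
    "-1200.00", "2500.00",    # then all amounts
]

rows = rows_from_columns(ocr_lines, rows_per_page=2)
for row in rows:
    print(row)
```

The length check is there because a page whose line count does not divide evenly almost certainly has a wrapped or missing cell, and silently pairing the wrong date with the wrong amount would be worse than failing.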

Perhaps an AI is more “intelligent” in that respect than the OCR.

Phileosophos · November 27, 2025, 10:55pm

I’ll add to that: I never share information with the big tech bros through AI. I host all my own models locally, which I should imagine would serve for your needs as well. It’s a lot simpler than one might expect, though it can require some significant computing horsepower to make it performant.

p_mitchell · November 28, 2025, 5:50am

Out of interest, do you recommend any models that may be worth considering?

Phileosophos · November 28, 2025, 6:09am

I don’t honestly know what might be best for your use case. I saw cgrunenberg posted a couple you might try. For my part I’m using GandalfBaum/llama3.2-claude3.7:latest and granite3.1-dense, but I’m doing software development. You can find a list of Ollama supported models here FYI: Ollama Search