Find Last Amount in PDF

Hi,

Could anyone tell me if there’s a way to find the last amount (i.e. the one nearest the end of the document) in a PDF containing multiple amounts using the Scan Text action in a Smart Rule, please?

I assume that by “amount” you mean a numeric token. But is it a currency value or any number? And what if the last number is part of a telephone number or a zip code, but you want a dollar value? Would the value have a prefix like $?

Thanks for the quick response and sorry my question was a bit vague.

I have a document with multiple instances of differing positive numeric values in the format 1,234.56 (they are currency values but aren’t prefixed with a currency symbol). I was curious to know if there was a way to capture the value matching that pattern (I’m thinking RegEx \d+,\d+.\d\d) which appears nearest to the end of the document?

Depending on what you want to do with the found amount and when you need to find it, Hazel could do what you want. I don’t believe that DEVONthink could (at least easily) do what you want, although I stand to be corrected by someone (and there are many) more expert.

I think it might anyway be helpful, in terms of getting more useful responses than this, if you were to explain exactly what you need to achieve in terms of a final outcome.

Edit: apologies, reply should have been to @RogueWolf.

Stephen


Short answer: Not reliably, no.

Longer answer: you could run a regular expression (RE) against the text of a PDF that has a text layer using JavaScript (also with AppleScript, but that’s a bit more convoluted).

But “the end” might be problematic, depending on how the text layer was generated. If it is the result of OCR, you might be lucky in simply choosing the last match. However, if the text layer was generated programmatically, all bets are off: they might have written all text first and the numbers later or the other way round or the left column first or …
If you were hunting for the largest amount, that would be a lot easier.
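To illustrate why the largest amount is the easier target: the maximum does not depend on the order in which the text layer was written. Here is a minimal sketch in plain JavaScript, assuming US-style amounts with exactly two decimals. The names `LARGEST_RE`, `toNumber` and `findLargestAmount` are made up for this example, and the text would have to come from the record's plain text.

```javascript
// Sketch only: these names are illustrative, not part of DEVONthink.
// Assumes US-style amounts with exactly two decimals, e.g. 1,234.56 or 12.00.
const LARGEST_RE = /\b(?:\d{1,3}(?:,\d{3})+|\d+)\.\d{2}\b/g;

// Strip thousands separators and convert to a number for comparison.
function toNumber(amount) {
  return Number(amount.replace(/,/g, ""));
}

// Return the largest amount as it appears in the text, or null if none is found.
function findLargestAmount(text) {
  const matches = text.match(LARGEST_RE) || [];
  let best = null;
  for (const m of matches) {
    if (best === null || toNumber(m) > toNumber(best)) best = m;
  }
  return best;
}
```

Whether the largest amount is actually the one you want (a total, say) depends on the documents, of course.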

As @chrillek noted, YMMV.

Due to the way PDFs are generated, what you see isn’t necessarily how the underlying text layer is constructed. So your idea of “the one nearest the end of the document” is only relative to your human perception of the document’s structure. In the internals of the PDF, that value may be found much earlier in the text.

You can use Data > Convert > to Plain Text and examine it to see where it falls. This is no guarantee of results with another type of document, but it may help with files from a particular source.
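If eyeballing the converted text gets tedious, a small script could list every amount together with its character offset and the text just before it. This is only a sketch: `listAmountPositions` is a made-up name, the pattern assumes US-style amounts with two decimals, and the input is assumed to be the plain text of the document.

```javascript
// Sketch only: listAmountPositions is an illustrative name, not a DEVONthink function.
// Lists each amount with its offset in the text and up to 20 preceding characters.
function listAmountPositions(text) {
  const re = /\b(?:\d{1,3}(?:,\d{3})+|\d+)\.\d{2}\b/g; // the g flag is required by matchAll
  return [...text.matchAll(re)].map((m) => ({
    amount: m[0],
    offset: m.index,
    // Collapse line breaks so the context fits on one line.
    context: text.slice(Math.max(0, m.index - 20), m.index).replace(/\s+/g, " "),
  }));
}
```

Comparing the offsets across a few files from the same source should show quickly whether "last in the text layer" matches "last on the page".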

Indeed it can, with a little JavaScript:

const app = Application("DEVONthink 3");
const rec = (app.selectedRecords())[0]; /* get first selected record */
const txt = rec.plainText(); /* get the record's plain text */
const matches = [...txt.matchAll(/\d+,\d+.\d\d/g)]; /* get all amounts; matchAll needs the g flag. See below for the RE */
const lastMatch = matches.pop(); /* get last match */
console.log(lastMatch); /* might want to do something useful with it */

As always, this is also feasible with AppleScript, needing some more bells and whistles.
Re the RE: \d+,\d+.\d\d is not a sensible expression for arbitrary amounts. I simply took it from the OP’s post.

  • it finds only values > 999 (because of the leading \d+,)
  • it finds no values without a decimal part
  • the dot (.) has to be escaped (\.) to match a literal dot; unescaped, it matches any character.

A more robust alternative might be (\d+,)?\d+(\.\d\d)?
This, of course, still works only for US-American (and British?) amounts, not for German (dot is the thousands separator, comma the decimal separator) or Swiss (’ as thousands separator) values. And it does not handle amounts with more than one thousands separator, such as 1,234,567.89, which it splits into two matches. But hey, REs can almost always be improved.
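For what it's worth, here is one way such an improved RE might look, as a sketch in plain JavaScript. It assumes US-style amounts and, unlike the expression above, requires exactly two decimals so that zip codes and phone numbers are not picked up. `AMOUNT_RE` and `findLastAmount` are names invented for this example.

```javascript
// Sketch only: AMOUNT_RE and findLastAmount are illustrative names.
// Matches 12.00, 1234.56, 1,234.56 and 1,234,567.89 (US-style, two decimals required).
const AMOUNT_RE = /\b(?:\d{1,3}(?:,\d{3})+|\d+)\.\d{2}\b/g;

// Return the amount nearest the end of the text, or null if there is none.
function findLastAmount(text) {
  const matches = text.match(AMOUNT_RE); // array of strings, or null without a match
  return matches ? matches[matches.length - 1] : null;
}
```

"Last" still means last in the text layer, so the caveats about how the PDF was generated apply unchanged.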

That is not what I call “easy”. :grinning: I am a simple man and “a little JavaScript” is beyond me I’m afraid!

(Not being rude to you at all and deferring to your programming ability!)

Stephen


Many thanks to all of you for your thoughtful and detailed replies. I was starting to come to the conclusion that @BLUEFROG outlined. I’ve examined a few of the files I was hoping to process and the text layer seems to be too randomly structured to reliably allow extraction of the value I’m after. OCRing the files again in DEVONthink seems to produce a more reliable output but this is probably a step too far for what I was hoping to achieve.

In this instance, I’m happy to accept defeat but I’ve learned a lot from your answers so a different victory has been achieved!

Thanks again for your guidance.

Is the text before/after the amount always the same?

It is (“Total Amount” always appears before the value when viewing the PDF) but the structure of the text layer isn’t consistent between PDFs so that text isn’t always before the value in the text layer.

If I OCR the PDF again in DT, the text layer seems to be more consistent (I get “Total Amount\n1,234.56”, for example) but I’m not sure capturing the value is important enough to me to do that…

At least for these PDF documents an action Scan Text > Amount searching for Total Amount* should work. Or…

Total Amount
*

…if there’s a line break.
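In a script, the same idea (anchor on the label, then take the amount that follows) might be sketched like this. It assumes a text layer in which “Total Amount” directly precedes the value, separated only by whitespace or a line break, as in the re-OCRed files. `findAmountAfterLabel` is a made-up name and the amount pattern assumes US-style values with two decimals.

```javascript
// Sketch only: findAmountAfterLabel is an illustrative name, not a DEVONthink function.
// Returns the amount that follows the label, or null if the label or amount is missing.
function findAmountAfterLabel(text, label = "Total Amount") {
  // Escape any regex metacharacters in the label so it is matched literally.
  const escaped = label.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  // \s* covers both a space and a line break between label and value.
  const re = new RegExp(escaped + "\\s*((?:\\d{1,3}(?:,\\d{3})+|\\d+)\\.\\d{2})\\b");
  const m = text.match(re);
  return m ? m[1] : null;
}
```

For example, `findAmountAfterLabel("Total Amount\n1,234.56")` picks up the value after the line break, while a text layer that puts the value before the label yields `null`.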
