OCR only within specific areas/zones of a pdf, dealing with letterhead

mjnnyc · March 6, 2021, 3:04am

Hi everyone,

I deal with a lot of scanned letters on letterhead from attorneys. This invariably makes for a PITA OCR experience. Is there some way to set boundaries for each page of a PDF and then send it to the OCR engine so that it only OCRs that particular area?

Hopefully I have managed to attach an example letter and then the area I would like to limit the OCR to, effectively cropping out the surrounding areas.

And then how could I automate this for all pages within a document?

Thanks very much!

P.S. Just edited the title to clarify my intent.

Example letterhead letter.pdf (122.3 KB) Example letterhead letter cropped for OCR.pdf (122.6 KB)

BLUEFROG · March 6, 2021, 4:06am

What am I missing? These files appear the same.

And no, you can’t set zones for OCR.
Obviously this also means you can’t automate it.

mjnnyc · March 6, 2021, 4:54am

They are not the same – you will notice that the file “cropped for OCR” has the letterhead at the top of the page and the date stamp removed, leaving only the body of the letter remaining to be OCR-ed.

This is a massively common need; I’d be shocked if I’m the only one. OCR-ing the other file yields a lot of useless data. It’s similar to clipping a letterboxed film, removing the black bordering areas above and below the picture. I’m sorry if this is the wrong terminology as I don’t know what the right term is.

Thanks, @BLUEFROG !

BLUEFROG · March 6, 2021, 5:51am

You’re welcome!

I understand what you’re saying but it’s also the first request I recall seeing.

Development will have to assess this request. Thanks for your patience and understanding.

Solar-Glare · March 6, 2021, 6:14am

Couldn’t this be worked around by splitting off the first page of the document, OCRing that page, and joining it with the rest?

Written out:

Original document with pages 1, 2… n
Split document leaving two documents: A (page 1) and B (pages 2…n)
OCR document A
Join A with B

Would that be scriptable, you think?

Without explaining the PITA (I know that), what exactly makes it so troublesome for you?

Is the letterhead on each page? Why does that matter to you? I.e. what do you use the OCR for in a following step?

mjnnyc · March 6, 2021, 6:21am

This is confusing to me, because the doc is cropped on my computer. Is this better? I don’t know how to make it show up as an image in the forum…

I am patient! This is such a long-standing need for me. I am sure others have a similar need but it is hard to articulate (and lawyers are pretty tech-averse, generally…).

Example letterhead letter cropped for OCR.pdf (402.8 KB)

mjnnyc · March 6, 2021, 6:23am

@Solar-Glare , thanks, but I might not have been clear. Every page has block text that should be ignored, while inside of that is relevant text that I’d like to OCR. It’s basically a template (irrelevant) with text placed within it (relevant).

EDIT: I didn’t notice the rest of your comment.

The letterhead is lots of repetitive, useless text. I only want the relevant part of the doc, which is the letter the correspondent has written. So, for example, I have an 8-1/2" x 11" page but only want the center 8-1/2" x 7" zone of the page OCR-ed. Every page of the doc is basically the same, but every doc has a different zone that is relevant (because every letterhead is different).

Does that help clarify?
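
To put numbers on that zone: PDF page boxes are measured in points (72 per inch) from the bottom-left corner. The following is a minimal Python sketch of the arithmetic, assuming the zone spans the full page width and is vertically centered. The function name `centered_zone` is made up for illustration.

```python
POINTS_PER_INCH = 72  # default PDF user-space units per inch


def centered_zone(page_w_in, page_h_in, zone_h_in):
    """Return (left, bottom, right, top) in points for a full-width,
    vertically centered zone of the given height (all inputs in inches)."""
    if zone_h_in > page_h_in:
        raise ValueError("zone is taller than the page")
    margin_in = (page_h_in - zone_h_in) / 2  # equal strip cut from top and bottom
    left = 0.0
    bottom = margin_in * POINTS_PER_INCH
    right = page_w_in * POINTS_PER_INCH
    top = (margin_in + zone_h_in) * POINTS_PER_INCH
    return (left, bottom, right, top)


# US Letter page with a 7-inch-tall zone in the middle
print(centered_zone(8.5, 11, 7))  # (0.0, 144.0, 612.0, 648.0)
```

In other words, the example amounts to trimming 2 inches (144 points) from both the top and the bottom of each page.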

Solar-Glare · March 6, 2021, 6:29am

Right, I misread it the other way around (a request to OCR the letterhead and not the rest).

I’ve updated my first comment and tagged you. Why is the letterhead so troublesome to you? What do you use the OCR for in a following step?

Solar-Glare · March 6, 2021, 6:33am

We’re ‘out of sync’ with our comments

But what do you subsequently do with the OCR content that makes you want to use a selective part for OCR? Is it a copy/paste issue?

mjnnyc · March 6, 2021, 9:06am

You’re right to get right at the functionality I’m looking for. It’s copy/paste and search. C/P is an obvious frustration here, but search is the hidden PITA. Names, titles, page numbers, and whatever else people put in headers/footers repeat and clog up search results (and C/P).

There might be another angle, e.g. what scholars might use, cropping out/in footnotes. Really what I’m trying to achieve is that: just being able to automatically crop a specified amount from the top/bottom/sides would get me 90% of the way there. It’s functionally unrelated to OCRing, and more an image-prepping need.

I’m sorry, I know nothing of how to script. But hopefully this is a pretty standard procedural need…
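
For anyone who does script, trimming a fixed amount from the top/bottom/sides of every page might be sketched in Python as follows. This assumes the third-party pypdf package, which nobody in this thread has mentioned or tested; `shrink_box` and `crop_pdf` are hypothetical helper names. It only sets each page's crop box, so the trimmed areas are hidden rather than deleted, and whether a given OCR engine honors the crop box is untested here.

```python
def shrink_box(box, top=0.0, bottom=0.0, left=0.0, right=0.0):
    """Trim the given amounts (in points) off each side of a
    (left, bottom, right, top) box and return the smaller box."""
    l, b, r, t = box
    new = (l + left, b + bottom, r - right, t - top)
    if new[0] >= new[2] or new[1] >= new[3]:
        raise ValueError("margins leave nothing of the page")
    return new


def crop_pdf(src_path, dst_path, **margins):
    """Write a copy of src_path with the same margins trimmed from every page.
    Assumes the third-party pypdf package is installed."""
    # Imported lazily because pypdf is not part of the standard library.
    from pypdf import PdfReader, PdfWriter
    from pypdf.generic import RectangleObject

    writer = PdfWriter()
    for page in PdfReader(src_path).pages:
        mb = page.mediabox
        box = (float(mb.left), float(mb.bottom), float(mb.right), float(mb.top))
        # The crop box only changes the visible region; the scan itself stays in the file.
        page.cropbox = RectangleObject(shrink_box(box, **margins))
        writer.add_page(page)
    with open(dst_path, "wb") as fh:
        writer.write(fh)
```

A call such as `crop_pdf("letter.pdf", "letter-cropped.pdf", top=144, bottom=144)` (file names are placeholders) would write a separate cropped copy and leave the original untouched.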

Solar-Glare · March 6, 2021, 9:31am

Not a definite solution and I’m on an iOS device currently, but automatic cropping of PDFs appears to be possible in Shortcuts. Based on that, my preliminary assumption would be that it’s also possible in AppleScript. The idea is that you crop a document with the letterhead and/or footer and save that as the input for OCR.

Not ideal as you lose information or end up with ‘duplicates’ (a cropped version and an original version), but it might be better than waiting for a crop solution that needs to be developed and thus takes time.

Below is just a proof of principle and requires an iOS device:

(if necessary) install Shortcuts
create a shortcut
rename it and enable sharing in the share sheet
add action: mask image (choose some mask, like ellipse)
add action: quick look

Now share a small multipaged PDF to that shortcut from DTTG or any other app with a PDF.

If it works as intended, you’ll see the PDF displayed, but cropped with the ellipse shape.

BLUEFROG · March 6, 2021, 9:31am

Are you scanning these files yourself or receiving them?

Blanc · March 6, 2021, 10:11am

Just an idea: Adobe Acrobat is pretty scriptable, and a quick search revealed this little script off hand. That approach would almost certainly allow you to set up a smart rule which sent the file you received in the Global Inbox to Acrobat for cropping, running OCR on the result. (You could actually set up different smart rules for different documents, incorporating different crop zones.) I personally would set up a second database which mirrored my first; in one, I would put the originals, in the other I would put the cropped files. That way I could exclude the originals from search easily.

chrillek · March 6, 2021, 10:31am

In a perfect world. Shortcuts has (un?)fortunately nothing to do with AppleScript, which is not even available on iOS. It comes from a different company that Apple bought some years ago.

Apple’s Preview app is hardly scriptable at all. As @Blanc pointed out, Acrobat is, but that’s also quite costly afaik. A cheaper option might be PDFPen: it permits scripting, thus creating an “imprint” which could perhaps be a white box covering the letterhead. And then run OCR (either in PDFPen or in DT).
Perhaps it is possible to define an imprint in DT as well and apply it to (copies of) those documents by using a white text on a white background?

Solar-Glare · March 6, 2021, 10:43am

My assumption wasn’t based on the correlation between frameworks (which is indeed 0), but on the aspect that cropping is fairly simple to do on iOS. That makes the chance of it being possible somehow on macOS fairly high.

Blanc · March 6, 2021, 10:59am

I think I’m going with @chrillek on this one; document manipulation options with onboard tools are pretty basic, because Preview is pretty much unscriptable. I didn’t find any simple options when I searched earlier on, or when we had the problem that OCR was changing the page size and I wanted to automate a solution.

That is a damn good idea; just tried it though, it seems not to be possible in one swoop (it’s not possible to define an imprint of suitable size). It might be possible if you defined say 5 imprints and applied them all though.

chrillek · March 6, 2021, 11:02am

It doesn’t seem to be possible with the tools provided by Apple. Automator allows cropping but only for images and to fixed formats – I doubt that this action accepts PDFs. GraphicsConverter has a similar function, but again only to fixed formats. Not sure about PDFs there.
One could of course convert the PDF to an image first, then crop that and OCR the image.
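
If one goes the image route, the crop zone has to be translated from PDF points (origin at the bottom-left) into pixels (origin at the top-left) for whatever resolution the page is rendered at. A minimal Python sketch of that conversion, assuming the page box starts at (0, 0); `pdf_box_to_pixels` is a made-up name, not part of any tool mentioned here:

```python
def pdf_box_to_pixels(box_pt, page_h_pt, dpi):
    """Convert a (left, bottom, right, top) box in PDF points into a
    (left, upper, right, lower) pixel box for a page rendered at dpi."""

    def px(points):
        return round(points * dpi / 72)  # 72 points per inch

    left, bottom, right, top = box_pt
    return (
        px(left),
        px(page_h_pt - top),     # flip the y axis: image rows count down from the top
        px(right),
        px(page_h_pt - bottom),
    )


# Middle 7 inches of a US Letter page (792 points tall), rendered at 300 dpi
print(pdf_box_to_pixels((0, 144, 612, 648), page_h_pt=792, dpi=300))
# (0, 600, 2550, 2700)
```

The (left, upper, right, lower) order matches what image libraries such as Pillow expect for a crop box, so the result could be handed to a crop call before the image goes to OCR.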

Slightly OT: Apple’s automation “strategy” is completely derailed (if it ever was on anything like a rail). They have Shortcuts for iOS and Automator for macOS with similar, but not identical functions. They have AppleScript for macOS but no scripting at all for iOS. And many Apple programs on macOS nowadays have broken scripting support anyway, cf. Notes, Preview, Reminders.
So it’s no wonder that something seemingly so obvious and simple as cropping a PDF is not possible on the Mac.

chrillek · March 6, 2021, 11:08am

Using text consisting of several lines? I get a new line with Ctrl-Ret in the imprint’s textfield. However, I’m too dumb to set the fill colour to anything at all. All I ever get is text printed over the original (in the preview, didn’t try it on a document).

Blanc · March 6, 2021, 11:58am

Together that makes us quite clever; I was not aware of the Ctrl-Ret. new line in imprinter. You don’t see the fill in preview, but so long as you select a border and a fill, it will be present in the document when you imprint. And it works, too - I can cover the top section of a document with an imprinter. @mjnnyc I think @chrillek has cracked it - in preferences, set up an imprint which uses white text colour, is as wide as a page, and as many lines high as you need (see above re. new lines). Set a border. Set the border and fill to white. Imprint your documents (decide first whether to duplicate at the same time; you cannot later remove an imprint). OCR the imprinted document.

Better still: imprinting is a smart rule action, so you could automate the whole lot

mjnnyc · March 6, 2021, 12:22pm

Both, but mostly receiving. (Gay? lol) And it would be awesome to go back and process older files, even if it was a bit tedious.

I’m guessing that you are asking because there is maybe an easy way to do this if I’m doing the scanning…? I have a ScanSnap S1500M that I scan with. Maybe there’s a simple cropping procedure that I don’t know about…?

You guys are amazing. Thanks for all your interest in looking at this.