OCR only within specific areas/zones of a pdf, dealing with letterhead

Hi everyone,

I deal with a lot of scanned letters on letterhead from attorneys. This invariably makes for a PITA OCR experience. Is there some way to set boundaries for each page of a PDF and then send it to the OCR engine so that it only OCRs that particular area?

Hopefully I have managed to attach an example letter and then the area I would like to limit the OCR to, effectively cropping out the surrounding areas.

And then how could I automate this for all pages within a document?

Thanks very much!

P.S. Just edited the title to clarify my intent.

Example letterhead letter.pdf (122.3 KB) Example letterhead letter cropped for OCR.pdf (122.6 KB)

What am I missing? These files appear the same.

And no, you can't set zones for OCR.
Obviously this also means you can't automate it.

They are not the same: you will notice that the file "cropped for OCR" has the letterhead at the top of the page and the date stamp removed, leaving only the body of the letter remaining to be OCR-ed.

This is a massively common need; I'd be shocked if I'm the only one. OCR-ing the other file yields a lot of useless data. It's similar to clipping a letterboxed film, removing the black bordering areas above and below the picture. I'm sorry if this is the wrong terminology, as I don't know what the right term is.

Thanks, @BLUEFROG !

You're welcome!

I understand what you're saying but it's also the first request I recall seeing.

Development will have to assess this request. Thanks for your patience and understanding.

Couldn't this be worked around by splitting off the first page of the document, OCR-ing that first page, and joining it with the original?

Written out:

Original document with pages 1, 2 … n
Split the document, leaving two documents: A (page 1) and B (pages 2 … n)
OCR document A
Join A with B

Would that be scriptable, do you think?

Without explaining the PITA (I know that), what exactly makes it so troublesome for you?

Is the letterhead on each page? Why does that matter to you? I.e. what do you use the OCR for in a following step?

This is confusing to me, because the doc is cropped on my computer. Is this better? I don't know how to make it show up as an image in the forum…

I am patient! This is such a long-standing need for me. I am sure others have a similar need but it is hard to articulate (and lawyers are pretty tech-averse, generally…).

Example letterhead letter cropped for OCR.pdf (402.8 KB)

@Solar-Glare, thanks, but I might not have been clear. Every page has block text that should be ignored, while inside of that is relevant text that I'd like to OCR. It's basically a template (irrelevant) with text placed within it (relevant).

EDIT: I didn't notice the rest of your comment.

The letterhead is lots of repetitive, useless text. I only want the relevant part of the doc, which is the letter the correspondent has written. So, for example, I have an 8-1/2" x 11" page but only want the center 8-1/2" x 7" zone of the page OCR-ed. Every page of a doc is basically the same, but every doc has a different zone that is relevant (because every letterhead is different).

Does that help clarify?
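
For anyone who ends up scripting this, the zone described above is just arithmetic on the page size. Here is a minimal Python sketch, assuming PDF points (72 per inch) with the origin at the bottom-left corner, and assuming the 7-inch zone is centred so that 2 inches come off the top and 2 off the bottom (the post does not say how the trim is split). The function name `crop_box_points` is made up for illustration; it only computes the rectangle and does not open a PDF or run OCR.

```python
POINTS_PER_INCH = 72  # default PDF unit: 1 point = 1/72 inch


def crop_box_points(page_w_in, page_h_in,
                    top_in=0.0, bottom_in=0.0, left_in=0.0, right_in=0.0):
    """Return (x0, y0, x1, y1) in points for the zone left after trimming.

    Coordinates use a bottom-left origin, so y0 is the bottom edge of the
    kept zone and y1 is its top edge. Hypothetical helper, not part of any
    library.
    """
    x0 = left_in * POINTS_PER_INCH
    y0 = bottom_in * POINTS_PER_INCH
    x1 = (page_w_in - right_in) * POINTS_PER_INCH
    y1 = (page_h_in - top_in) * POINTS_PER_INCH
    if x1 <= x0 or y1 <= y0:
        raise ValueError("margins leave no area to OCR")
    return (x0, y0, x1, y1)


# US Letter page, keeping a centred 8.5" x 7" zone
# (the 2"/2" split is an assumption, adjust per letterhead).
box = crop_box_points(8.5, 11, top_in=2, bottom_in=2)
print(box)
```

Since every letterhead differs, the trim values would have to be worked out once per sender and then reused for every page of that sender's documents.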

Right, I misread it the other way around (a request to OCR the letterhead and not the rest).

I've updated my first comment and tagged you. Why is the letterhead so troublesome to you? What do you use the OCR for in a following step?

We're 'out of sync' with our comments :grinning:

But what do you subsequently do with the OCR content that makes you want to use only a selected part for OCR? Is it a copy/paste issue?

You're right to get right at the functionality I'm looking for. It's copy/paste and search. C/P is an obvious frustration here, but search is the hidden PITA. Names, titles, page numbers, and whatever else people put in headers/footers repeat and clog up search results (and C/P).

There might be another angle, e.g. what scholars might use, cropping out/in footnotes. Really what I'm trying to achieve is that: just being able to automatically crop a specified amount from the top/bottom/sides would get me 90% of the way there. It's functionally unrelated to OCRing, and more an image-prepping need.

I'm sorry, I know nothing of how to script. But hopefully this is a pretty standard procedural need…:crossed_fingers:t3:

Not a definite solution, and I'm on an iOS device currently, but automatic cropping of a PDF appears to be possible in Shortcuts. Based on that, my preliminary assumption would be that it's also possible in AppleScript. The idea is that you crop a document with the letterhead and/or footer and save that as the input for OCR.

Not ideal, as you lose information or end up with 'duplicates' (a cropped version and an original version), but it might be better than waiting for a crop solution that needs to be developed and thus takes time.

Below is just a proof of principle and requires an iOS device:

  • (if necessary) install Shortcuts
  • create a shortcut
  • rename it and enable sharing in the share sheet
  • add action: mask image (choose some mask, like ellipse)
  • add action: quick look

Now share a small multipaged PDF to that shortcut from DTTG or any other app with a PDF.

If it works as intended, you'll see the PDF displayed, but cropped with the ellipse shape.

Are you scanning these files yourself or receiving them?

Just an idea: Adobe Acrobat is pretty scriptable, and a quick search revealed this little script off hand. That approach would almost certainly allow you to set up a smart rule which sent the file you received in the Global Inbox to Acrobat for cropping, running OCR on the result. (You could actually set up different smart rules for different documents, incorporating different crop zones.) I personally would set up a second database which mirrored my first; in one, I would put the originals, in the other I would put the cropped files. That way I could exclude the originals from search easily.
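
The "different crop zones for different documents" part can be sketched as a simple lookup table. This is a rough Python illustration only: the sender keys, the margin numbers, and the idea of matching on the file name are all made up here, and nothing in it talks to Acrobat or to smart rules.

```python
# Trim margins in inches, one entry per letterhead.
# Every key and number below is a placeholder, not real data.
CROP_PROFILES = {
    "smith-jones": {"top": 2.5, "bottom": 1.0},
    "acme-legal": {"top": 1.75, "bottom": 1.5},
}

# Unknown sender: trim nothing, so no text is lost by accident.
DEFAULT_PROFILE = {"top": 0.0, "bottom": 0.0}


def profile_for(filename):
    """Pick the crop profile whose key appears in the file name.

    Matching on the file name is an assumption for this sketch; any other
    property that identifies the sender would work the same way.
    """
    name = filename.lower()
    for key, margins in CROP_PROFILES.items():
        if key in name:
            return margins
    return DEFAULT_PROFILE


print(profile_for("2021-03-04 Smith-Jones re settlement.pdf"))
print(profile_for("scan0001.pdf"))
```

Falling back to "trim nothing" for unrecognised senders keeps the originals searchable until a profile has been set up for them.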

In a perfect world. Shortcuts has (un?)fortunately nothing to do with AppleScript, which is not even available on iOS. It comes from a different company that Apple bought some years ago.

Apple's Preview app is hardly scriptable at all. As @Blanc pointed out, Acrobat is, but that's also quite costly afaik. A cheaper option might be PDFpen: it permits scripting, thus creating an "imprint" which could perhaps be a white box covering the letterhead. And then run OCR (either in PDFpen or in DT).
Perhaps it is possible to define an imprint in DT as well and apply it to (copies of) those documents by using a white text on a white background?

My assumption wasn't based on the correlation between frameworks (which is indeed 0), but on the aspect that cropping is fairly simple to do on iOS. That makes the chance of it being possible somehow on macOS fairly high.

I think I'm going with @chrillek on this one; document manipulation options with onboard tools are pretty basic, because Preview is pretty much unscriptable. I didn't find any simple options when I searched earlier on, or when we had the problem that OCR was changing the page size and I wanted to automate a solution.

That is a damn good idea; just tried it though, and it seems not to be possible in one swoop (it's not possible to define an imprint of suitable size). It might be possible if you defined, say, 5 imprints and applied them all though.

It doesn't seem to be possible with the tools provided by Apple. Automator allows cropping, but only for images and to fixed formats; I doubt that this action accepts PDFs. GraphicConverter has a similar function, but again only to fixed formats. Not sure about PDFs there.
One could of course convert the PDF to an image first, then crop that and OCR the image.
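
If you go the image route, the crop zone has to be expressed in pixels instead of inches. Here is a small Python sketch of that conversion, assuming a 300 dpi page image and a top-left origin. The returned order (left, upper, right, lower) is the one Pillow's `Image.crop` takes, should you use that library for the actual cropping; the function name `crop_box_pixels` is invented for this example, and the code itself neither rasterises the PDF nor runs OCR.

```python
def crop_box_pixels(width_px, height_px, dpi,
                    top_in=0.0, bottom_in=0.0, left_in=0.0, right_in=0.0):
    """Return (left, upper, right, lower) in pixels, origin at the top-left.

    Margins are given in inches and converted with the image resolution.
    Hypothetical helper for illustration only.
    """
    left = round(left_in * dpi)
    upper = round(top_in * dpi)
    right = width_px - round(right_in * dpi)
    lower = height_px - round(bottom_in * dpi)
    if right <= left or lower <= upper:
        raise ValueError("margins leave no area to OCR")
    return (left, upper, right, lower)


# A US Letter page rendered at 300 dpi is 2550 x 3300 pixels.
# Trimming 2" from the top and 2" from the bottom is an assumed example.
box = crop_box_pixels(2550, 3300, dpi=300, top_in=2, bottom_in=2)
print(box)
```

Note that the vertical axis runs downwards here, the opposite of PDF page coordinates, so "top" margins map to the small y values.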

Slightly OT: Apple's automation "strategy" is completely derailed (if it ever was on anything like a rail). They have Shortcuts for iOS and Automator for macOS with similar, but not identical, functions. They have AppleScript for macOS but no scripting at all for iOS. And many Apple programs on macOS nowadays have broken scripting support anyway, cf. Notes, Preview, Reminders.
So it's no wonder that something seemingly so obvious and simple as cropping a PDF is not possible on the Mac.

Using text consisting of several lines? I get a new line with Ctrl-Return in the imprint's text field. However, I'm too dumb to set the fill colour to anything at all. All I ever get is text printed over the original (in the preview, didn't try it on a document).

Together that makes us quite clever; I was not aware of the Ctrl-Return new line in the imprinter. You don't see the fill in the preview, but so long as you select a border and a fill, it will be present in the document when you imprint. And it works, too: I can cover the top section of a document with an imprinter. @mjnnyc I think @chrillek has cracked it. In preferences, set up an imprint which uses white text colour, is as wide as a page, and is as many lines high as you need (see above re. new lines). Set a border. Set the border and fill to white. Imprint your documents (decide first whether to duplicate at the same time; you cannot later remove an imprint). OCR the imprinted document.

Better still: imprinting is a smart rule action, so you could automate the whole lot :slight_smile:

Both, but mostly receiving. (Gay? lol) And it would be awesome to go back and process older files, even if it was a bit tedious.

I'm guessing that you are asking because there is maybe an easy way to do this if I'm doing the scanning…? I have a ScanSnap S1500M that I scan with. Maybe there's a simple cropping procedure that I don't know about…?

You guys are amazing. Thanks for all your interest in looking at this.