I deal with a lot of scanned letters on letterhead from attorneys. This invariably makes for a PITA OCR experience. Is there some way to set boundaries for each page of a PDF and then send it to the OCR engine so that it only OCRs that particular area?
Hopefully I have managed to attach an example letter and then the area I would like to limit the OCR to, effectively cropping out the surrounding areas.
And then how could I automate this for all pages within a document?
They are not the same: you will notice that the file "cropped for OCR" has the letterhead at the top of the page and the date stamp removed, leaving only the body of the letter remaining to be OCR-ed.
This is a massively common need; I'd be shocked if I'm the only one. OCR-ing the other file yields a lot of useful data. It's similar to clipping a letterboxed film, removing the black bordering areas above and below the picture. I'm sorry if this is the wrong terminology, as I don't know what the right term is.
This is confusing to me, because the doc is cropped on my computer. Is this better? I don't know how to make it show up as an image in the forum…
I am patient! This is such a long-standing need for me. I am sure others have a similar need but it is hard to articulate (and lawyers are pretty tech-averse, generally…).
@Solar-Glare , thanks, but I might not have been clear. Every page has block text that should be ignored, while inside of that is relevant text that I'd like to OCR. It's basically a template (irrelevant) with text placed within it (relevant).
EDIT: I didn't notice the rest of your comment.
The letterhead is lots of repetitive, useless text. I only want the relevant part of the doc, which is the letter the correspondent has written. So, for example, I have an 8-1/2" x 11" page but only want the center 8-1/2" x 7" zone of the page OCR-ed. Every page of the doc is basically the same, but every doc has a different zone that is relevant (because every letterhead is different).
You're right to get right at the functionality I'm looking for. It's copy/paste and search. C/P is an obvious frustration here, but search is the hidden PITA. Names, titles, page numbers, and whatever else people put in headers/footers repeat and clog up search results (and C/P).
There might be another angle, e.g. what scholars might use, cropping out/in footnotes. Really what I'm trying to achieve is that: just being able to automatically crop a specified amount from the top/bottom/sides would get me 90% of the way there. It's functionally unrelated to OCRing, and more an image-prepping need.
I'm sorry, I know nothing of how to script. But hopefully this is a pretty standard procedural need…
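For anyone who does script: the crop described here comes down to simple arithmetic. Below is a minimal sketch, not a finished tool, that turns "trim this many inches from each edge" into a crop rectangle in PDF points. It assumes the default PDF unit of 72 points per inch with the origin at the bottom-left of the page; the function name `crop_box` and its parameters are made up for illustration, and the resulting rectangle would still have to be handed to whatever PDF tool actually performs the crop.

```python
POINTS_PER_INCH = 72  # default PDF user-space unit: 1/72 inch


def crop_box(page_w_in, page_h_in, top_in=0.0, bottom_in=0.0,
             left_in=0.0, right_in=0.0):
    """Return (x0, y0, x1, y1) in PDF points, origin at the bottom-left.

    Hypothetical helper: each *_in argument is the amount to trim
    from that edge of the page, in inches.
    """
    x0 = float(left_in * POINTS_PER_INCH)
    y0 = float(bottom_in * POINTS_PER_INCH)
    x1 = float((page_w_in - right_in) * POINTS_PER_INCH)
    y1 = float((page_h_in - top_in) * POINTS_PER_INCH)
    if x0 >= x1 or y0 >= y1:
        raise ValueError("the margins leave nothing to OCR")
    return (x0, y0, x1, y1)


# Letter page, keeping the center 8-1/2" x 7" zone:
# trim 2" from the top and 2" from the bottom.
print(crop_box(8.5, 11, top_in=2, bottom_in=2))
```

Since every letterhead is different, only the four trim values would change from one correspondent to the next.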
Not a definite solution and I'm on an iOS device currently, but automatic cropping of PDF appears to be possible in Shortcuts. Based on that, my preliminary assumption would be that it's also possible in AppleScript. The idea is that you crop a document with the letterhead and/or footer and save that as the input for OCR.
Not ideal as you lose information or end up with "duplicates" (a cropped version and an original version), but it might be better than waiting for a crop solution that needs to be developed and thus takes time.
Below is just a proof of principle and requires an iOS device:
(if necessary) install Shortcuts
create a shortcut
rename it and enable sharing in the share sheet
add action: mask image (choose some mask, like ellipse)
add action: quick look
Now share a small multipaged PDF to that shortcut from DTTG or any other app with a PDF.
If it works as intended, you'll see the PDF displayed, but cropped with the ellipse shape.
Just an idea: Adobe Acrobat is pretty scriptable, and a quick search revealed this little script off hand. That approach would almost certainly allow you to set up a smart rule which sent the file you received in the Global Inbox to Acrobat for cropping, running OCR on the result. (You could actually set up different smart rules for different documents, incorporating different crop zones.) I personally would set up a second database which mirrored my first; in one, I would put the originals, in the other I would put the cropped files. That way I could exclude the originals from search easily.
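To illustrate the "different crop zones for different documents" idea outside of smart rules, here is a small sketch of a lookup table that maps a letterhead to its trim margins. Everything in it is hypothetical: the profile keys, the margin values, and the assumption that the sender can be recognised from the file name. In DEVONthink the matching would be done by the smart rule conditions themselves; this only shows the shape of the data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CropProfile:
    """Inches to trim from each edge (hypothetical structure)."""
    top_in: float = 0.0
    bottom_in: float = 0.0
    left_in: float = 0.0
    right_in: float = 0.0


# Made-up example profiles, one per letterhead.
PROFILES = {
    "smith-jones": CropProfile(top_in=2.0, bottom_in=2.0),
    "example-llp": CropProfile(top_in=2.5, bottom_in=1.0, left_in=1.5),
}
DEFAULT_PROFILE = CropProfile()  # no cropping for unknown senders


def profile_for(filename):
    """Pick a profile by looking for a known key in the file name."""
    name = filename.lower()
    for key, profile in PROFILES.items():
        if key in name:
            return profile
    return DEFAULT_PROFILE


print(profile_for("2021-03-01 Smith-Jones letter.pdf"))
print(profile_for("unknown sender.pdf"))
```

Falling back to a no-crop default means an unrecognised letter is OCR-ed in full rather than having body text cut off by the wrong profile.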
In a perfect world. Shortcuts has (un?)fortunately nothing to do with AppleScript, which is not even available on iOS. It comes from a different company that Apple bought some years ago.
Apple's Preview app is hardly scriptable at all. As @Blanc pointed out, Acrobat is, but that's also quite costly afaik. A cheaper option might be PDFpen: it permits scripting, thus creating an "imprint" which could perhaps be a white box covering the letterhead. And then run OCR (either in PDFpen or in DT).
Perhaps it is possible to define an imprint in DT as well and apply it to (copies of) those documents by using a white text on a white background?
My assumption wasn't based on the correlation between frameworks (which is indeed 0), but on the aspect that cropping is fairly simple to do on iOS. That makes the chance of it being possible somehow on macOS fairly high.
I think I'm going with @chrillek on this one; document manipulation options with onboard tools are pretty basic, because Preview is pretty much unscriptable. I didn't find any simple options when I searched earlier on, or when we had the problem that OCR was changing the page size and I wanted to automate a solution.
That is a damn good idea; just tried it though, it seems not to be possible in one swoop (it's not possible to define an imprint of suitable size). It might be possible if you defined say 5 imprints and applied them all though.
It doesn't seem to be possible with the tools provided by Apple. Automator allows cropping but only for images and to fixed formats; I doubt that this action accepts PDFs. GraphicsConverter has a similar function, but again only to fixed formats. Not sure about PDFs there.
One could of course convert the PDF to an image first, then crop that and OCR the image.
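If you go the image route, the crop zone has to be expressed in pixels instead. A rough sketch of that conversion follows, assuming the page is rasterised at a known resolution (300 dpi here is only an example) and that the image tool wants a (left, upper, right, lower) box measured from the top-left corner, which is the convention Pillow's `Image.crop` uses. The function name `pixel_crop_box` and its parameters are invented for this example; the rasterising, cropping and OCR steps themselves are not shown.

```python
def pixel_crop_box(page_w_in, page_h_in, dpi, top_in=0.0, bottom_in=0.0,
                   left_in=0.0, right_in=0.0):
    """Return (left, upper, right, lower) in pixels, origin at the top-left.

    Hypothetical helper: each *_in argument is the amount to trim
    from that edge of the page, in inches.
    """
    left = round(left_in * dpi)
    upper = round(top_in * dpi)
    right = round((page_w_in - right_in) * dpi)
    lower = round((page_h_in - bottom_in) * dpi)
    if left >= right or upper >= lower:
        raise ValueError("the margins leave nothing to OCR")
    return (left, upper, right, lower)


# Letter page at 300 dpi, trimming 2" from the top and 2" from the bottom.
print(pixel_crop_box(8.5, 11, 300, top_in=2, bottom_in=2))
```

Note the flipped vertical axis compared with PDF coordinates: in an image, the top margin becomes the `upper` value, whereas in a PDF the y axis runs upward from the bottom of the page.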
Slightly OT: Apple's automation "strategy" is completely derailed (if it ever was on anything like a rail). They have Shortcuts for iOS and Automator for macOS with similar, but not identical functions. They have AppleScript for macOS but no scripting at all for iOS. And many Apple programs on macOS nowadays have broken scripting support anyway, cf. Notes, Preview, Reminders.
So it's no wonder that something seemingly so obvious and simple as cropping a PDF is not possible on the Mac.
Using text consisting of several lines? I get a new line with Ctrl-Ret in the imprint's text field. However, I'm too dumb to set the fill colour to anything at all. All I ever get is text printed over the original (in the preview, didn't try it on a document).
Together that makes us quite clever; I was not aware of the Ctrl-Ret. new line in imprinter. You don't see the fill in preview, but so long as you select a border and a fill, it will be present in the document when you imprint. And it works, too - I can cover the top section of a document with an imprinter. @mjnnyc I think @chrillek has cracked it - in preferences, set up an imprint which uses white text colour, is as wide as a page, and as many lines high as you need (see above re. new lines). Set a border. Set the border and fill to white. Imprint your documents (decide first whether to duplicate at the same time; you cannot later remove an imprint). OCR the imprinted document.
Better still: imprinting is a smart rule action, so you could automate the whole lot
Both, but mostly receiving. (Gay? lol) And it would be awesome to go back and process older files, even if it was a bit tedious.
I'm guessing that you are asking because there is maybe an easy way to do this if I'm doing the scanning…? I have a ScanSnap S1500M that I scan with. Maybe there's a simple cropping procedure that I don't know about…?
You guys are amazing. Thanks for all your interest in looking at this.