Script to OCR PDFs with the latest FineReader

Silverstone · November 14, 2019, 12:00pm

Do you mean the script or the rule?
In the script, you just need to find one line of code and replace it with the one given above, using any script editor.
The meaning of this change: if you leave the “delete” version, the script will delete the original, and there will be no way to undo the OCR, especially if you are not satisfied with the results. If you replace “delete” with “move”, you’ll be able to find the original scan in the Trash folder after OCR, move it back, and re-OCR if needed. It will also flag PDFs after OCR, so you can easily find what this automation did to your files while you were away drinking coffee. That’s it.

In the rule, I think your settings are good enough.
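
To illustrate why “move” is safer than “delete”, here is a minimal Python sketch of the same idea outside DEVONthink. It is not the script from this thread: the function name `retire_original`, the `keep_recoverable` flag, and the trash directory are hypothetical and only model the difference between the two variants.

```python
import shutil
from pathlib import Path
from typing import Optional


def retire_original(original: Path, trash_dir: Path,
                    keep_recoverable: bool = True) -> Optional[Path]:
    """Get the original scan out of the way after OCR.

    keep_recoverable=True models the "move" variant: the file lands in
    trash_dir and can be moved back and re-OCRed later.
    keep_recoverable=False models the "delete" variant: the file is gone.
    """
    if keep_recoverable:
        # Create the (hypothetical) trash folder if it does not exist yet.
        trash_dir.mkdir(parents=True, exist_ok=True)
        target = trash_dir / original.name
        shutil.move(str(original), str(target))
        return target
    # Permanent removal: nothing to restore if the OCR result is bad.
    original.unlink()
    return None
```

With the default setting the function returns the new location of the original, so a bad OCR run can be undone by moving that file back.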

Silverstone · November 14, 2019, 12:05pm

Meanwhile, instead of drinking coffee, you can set up reminders for this scan, make annotations, and be sure that you will not lose them when the OCR is done. The built-in OCR function currently does not recover reminders and annotations.

roads · January 1, 2020, 4:42pm

Thanks for the script. The file is converted as pdf.pdf, which I understand cannot be changed.
What is strange: after the scan by FineReader, I cannot mark text in the converted PDF. Each page is a single block, so I can only select the whole page, not a single word. Can anyone explain what is happening?
I am using the trial period and set langList to { German , English }

Silverstone · January 3, 2020, 10:58pm

About the “pdf.pdf” issue: when you rename a PDF item (with the “show extensions” option turned on), DT3 suggests renaming only the base name by selecting it without the “.pdf”. If you change only the selected text, you will get “changed name.pdf.pdf” as the “filename” property, and the “Name” property will read “changed name.pdf” instead of just “changed name”. How to avoid it: (1) delete the extension every time you rename; or (2) turn off the “show extensions” option.
About the “blocks”: this may not be a script issue but a matter of recognition parameters. Try changing them. You may want to OCR the PDF manually with FR and see what you get.
Did you change the LangList in the script to your languages?
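
If doubled extensions have already crept into exported files, they can be cleaned up outside DT. The following is only a sketch in plain Python, not part of the OCR script; the helper name `collapse_doubled_extension` is made up for illustration.

```python
import os


def collapse_doubled_extension(filename: str) -> str:
    """Turn 'changed name.pdf.pdf' into 'changed name.pdf'.

    Only an exactly repeated final extension is collapsed (compared
    case-insensitively); names like 'report.v2.pdf' are left alone.
    """
    stem, ext = os.path.splitext(filename)
    _, inner_ext = os.path.splitext(stem)
    if ext and ext.lower() == inner_ext.lower():
        return stem
    return filename


print(collapse_doubled_extension("changed name.pdf.pdf"))  # changed name.pdf
```

The check is deliberately narrow, so a file that legitimately has two different suffixes keeps its name.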

roads · January 4, 2020, 7:33am

Thanks for the reply. I will try to OCR manually and see what happens. Yes, I changed the language. Watching the OCR run, I see it moving through the text areas perfectly; it is only when it’s done that I cannot mark words or sentences.

roads · January 4, 2020, 7:52am

You are right, it has nothing to do with your script. Really strange: after recognition I can copy the text and paste it into TextEdit, for example, but after export I cannot mark anything. I will have to write to FineReader support, as I cannot find this problem on Google.

Found it: it’s Catalina Preview that cannot select the text. Acrobat Reader does not have this problem.

jerwin · January 4, 2020, 12:35pm

This may help. Not having Catalina, I can’t comment further.

BLUEFROG · January 4, 2020, 5:06pm

it’s Catalina Preview that cannot select the text. Acrobat Reader does not have this problem

Bear in mind, Adobe created the PDF format and do things their own way. Preview uses Apple’s PDFKit (as does DEVONthink), so the behavior can definitely vary.

roads · January 4, 2020, 5:34pm

It really seems the three fonts are missing. I just can’t copy them from font/supplement to /font. Catalina won’t allow it, even as admin.

roads · January 5, 2020, 5:11am

Ah, I misread: the files are to be copied to /Library/Fonts, not /System/Library/Fonts. That works, and so does the fix. Thanks for the help, and sorry for the off-topic. Thanks again for this amazing script.

Silverstone · January 6, 2020, 8:40am

You are welcome! )
Glad you solved the issue

Chazzo · March 12, 2020, 6:15pm

Silverstone, thank you for this excellent script. One small point: when using this in a Smart Rule, is there a way to refer to the newly OCR’d document in subsequent processing steps?

For instance, I have a Smart Rule named “OCR and rename”, with two steps:

OCR > Apply
Change Name to Sortable Document Date | Proposed Name

The renaming step doesn’t work with your script, presumably because DT has lost track of the new document. If there’s a way round this it would be great.

On a wider point, does anyone have a sense of where FineReader 12 outperforms the version built into DT? I bought it for a local history project, where it does significantly better on poor-quality scans of old newspapers. On my day-to-day 300 dpi monochrome scans from my Brother scanner I’ve not seen much difference.

Silverstone · March 12, 2020, 6:56pm

Of course you can rename the OCRed file however you want. You can do it in the block where you set the cloning options; no problem there.
As for v12 vs v11, you can find my thoughts in the main post. The script gives you more control over all possible OCR and export options. Note, e.g., that if you turn on MRC, all barcodes will be blurred.

Chazzo · March 14, 2020, 10:45pm

Thanks @Silverstone. For point 1 I meant that I’d like to use the built-in “Change Name to Proposed Name” command (and perhaps other built-in commands) instead of trying to replicate it in a script. That command works when it comes after the built-in OCR command, but not after the “Execute Script” command – I’m guessing because the file reference has changed. Is there any way to modify your script so that once it finishes and returns control to the Smart Rule, the next step recognises the reference to the OCR’d file we have just created?

For point 2, yes, I read your list of advantages. At this point I am mostly interested in speed, OCR accuracy and file size. It seems to me that, compared to FineReader 11 with equivalent settings, FR 12 sometimes creates smaller files and sometimes not. I just wondered if there is any pattern to this (colour/greyscale/mono, resolution…).

Silverstone · March 15, 2020, 7:08am

Is there a pattern for how you want to rename your OCRed files?
I’m not sure about exact cases, but in my experiments, when I tried the same settings for the 11 and 12 engines, the latter gave better results overall.

Chazzo · March 15, 2020, 2:53pm

Good question. Up till now I’ve been happy enough with DT’s “Proposed Name”, which just seems to be the first few words of the document (and often IN CAPS, if it’s a company letterhead). That’s not ideal, but I can’t think of a better answer*. Also key is the ability to add the date using the “Sortable Document Date” variable, which generally works very well.

I used to have a script that tried to do all that, but it was complex (especially to cover all possible date formats) and not very successful. To me it makes more sense to have the OCR script return a reference to the new document, so that we can pass it on to the following steps in the Smart Rule. Is that not possible?

*unless you could take advantage of DT’s ability to summarise documents. It’s obviously very hard to distill a whole document into just four or five words, and for all I know the built-in Smart Rule already uses this approach.

I looked to see what ABBYY had said about this but wasn’t sure what applies to the Mac version. I’m guessing FineReader Engine 12 is the latest. For most western languages they don’t seem to say that character recognition has improved per se, but there’s a bunch of other stuff that could be helpful, depending on the document type.

Silverstone · March 15, 2020, 4:58pm

I’m not sure you can do it in one Smart Rule, but if you like these built-in naming options, you can still use them.

Add custom metadata such as “OCR Status” in DT, with values like “Recognized” and “Not for OCR”.
Somewhere in the “cloning block” of the script, use the “add custom metadata…” function to add the “Recognized” value to the OCRed document.
You can now use another Smart Rule to select “Recognized” documents and apply whatever you need to them.

You can use the other value to filter out the documents that you do not want to OCR (add an appropriate “if” block at the beginning of the script).

That’s it
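
The logic of the steps above can be sketched in plain Python. This is only a model of the workflow, not DEVONthink’s scripting API: the `Document` class and the function names are hypothetical, and only the “OCR Status” field and its two values come from the steps above.

```python
from dataclasses import dataclass, field

OCR_STATUS = "OCR Status"      # custom metadata field
RECOGNIZED = "Recognized"      # set on the OCRed document
NOT_FOR_OCR = "Not for OCR"    # marks documents the script should skip


@dataclass
class Document:
    name: str
    metadata: dict = field(default_factory=dict)


def should_ocr(doc: Document) -> bool:
    # The "if" block at the beginning of the script.
    return doc.metadata.get(OCR_STATUS) != NOT_FOR_OCR


def run_ocr_step(docs: list) -> list:
    """Model of the first rule: OCR eligible documents and flag the results."""
    results = []
    for doc in docs:
        if not should_ocr(doc):
            continue
        # Stand-in for the real OCR: the new document gets the status value.
        ocr_copy = Document(name=doc.name, metadata=dict(doc.metadata))
        ocr_copy.metadata[OCR_STATUS] = RECOGNIZED
        results.append(ocr_copy)
    return results


def select_recognized(docs: list) -> list:
    """Model of the second Smart Rule: pick up only flagged documents."""
    return [d for d in docs if d.metadata.get(OCR_STATUS) == RECOGNIZED]
```

The second rule never needs a reference returned by the script; it finds the new documents through the metadata value alone.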

Chazzo · March 15, 2020, 8:33pm

Ah, cool idea. Thank you!

Silverstone · March 16, 2020, 5:33am

You’re welcome

Silverstone · March 17, 2020, 6:22am

I’ve already written somewhere that the 11th engine has problems with multi-oriented text. If you have vertical text along with horizontal text on the same page (complex graphics, pivot tables, captions, etc.), it will not OCR the vertical text correctly. The 12th engine handles it as expected.