Difference in results of OCR: Smart Rule Action vs. Context Menu

Background: A standard test within Hazel to determine whether a file has been OCR’d is to run a shell script that greps for the string “Font”, an internal data item that is present in a PDF that has been OCR’d and missing in one that has not.
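
For readers who want to try the same check outside Hazel, here is a minimal Python sketch. It assumes the test is nothing more than a plain byte search for "Font" in the raw file, as grep would do; the function names `has_font_marker` and `file_has_font_marker` are made up for illustration and are not part of Hazel or DEVONthink.

```python
from pathlib import Path


def has_font_marker(data: bytes) -> bool:
    """Return True if the raw PDF bytes contain the string "Font"."""
    # Plain substring search on the file's bytes, with no parsing of the
    # PDF structure (the same idea as grepping the file for "Font").
    return b"Font" in data


def file_has_font_marker(path: str) -> bool:
    """Read a file from disk and apply the same raw byte search."""
    return has_font_marker(Path(path).read_bytes())
```

A rule could treat a `False` result as "not OCR’d yet", but that is only as reliable as the assumption that OCR output always carries a font entry.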

A PDF OCR’d within DT3.0 using the context menu OCR > To searchable PDF contains that string.
A PDF OCR’d by a smart rule with either OCR > Apply or OCR > To searchable PDF does not.
Using an embedded AppleScript to perform the OCR within the smart rule yields the same result as the smart rule actions: the string is missing.

I used a very short original document (one line of text) as a test.
When the two are compared in BBEdit, the file produced via the context menu is longer by about 150 lines and contains much more internal data than the smart rule product.

Can the Smart Rule actions (Edit: and via AppleScript) be modified to perform the same action as the context menu?

There are two routes through the OCR: i) converting an existing record to a searchable PDF and ii) importing an image or document to be converted to a searchable PDF. A smart rule with OCR > Apply will use option ii, and OCR > To searchable PDF will use option i. An AppleScript could use either, depending on which commands you are using.

In either case the actual OCR of the image/document is the same, and the text layer of the OCR should be identical (unless the dpi of the original image is low). Small differences in the final output size of the PDF file can occur when the OCR file has been re-saved in post-processing (such as adding/transferring metadata).

If the text layer is different between the two files please raise a bug report and attach a copy of both files.

Thanks for getting back to me, Alan.

The options you mention (i and ii), plus AppleScript using the “ocr file” command, all yield virtually the same result, which does NOT include the “Font” term. That term is searchable in the raw file but is not part of the visible text layer.

The option I raised as preferable, which you do not mention, is using the context menu "OCR → To searchable PDF", which yields more complete results in terms of internal data.

An example of what is missing in the automated methods but appears in a “complete” scanned PDF via the contextual menu:

<</Length1 36447/Filter/FlateDecode/Type/Font/Length 23556>>

This data is not visible in a normal PDF viewer such as Preview, but it can be searched for in BBEdit and, for my purposes, by Hazel rules to determine whether the file has been OCRed.
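
The line quoted above contains a `/Type/Font` entry. A somewhat stricter search could look for that entry rather than the bare word “Font”. The following is only a sketch: it assumes the relevant dictionaries are stored uncompressed in the file (anything packed into a compressed stream would not be visible to a raw search), and `count_font_objects` is a hypothetical name.

```python
import re

# Matches "/Type/Font" or "/Type /Font", but not "/Type /FontDescriptor".
FONT_TYPE = re.compile(rb"/Type\s*/Font\b")


def count_font_objects(data: bytes) -> int:
    """Count "/Type /Font" entries found in the raw bytes of a PDF."""
    return len(FONT_TYPE.findall(data))
```

A count of zero would correspond to the “Font is missing” case described in this thread.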

I will submit a bug report later today with sample files, but please note that what I am concerned about is the different output between automated methods (Smart rules and AppleScript) and the manual method of using the contextual menu.

As explained previously, there can be differences in size using the different options, and it is the text layer that you should be comparing, not the contents or size of the whole PDF file.

As explained previously, the difference is the inclusion or exclusion of searchable data that can be used to automate my workflow. The contextual menu and the scriptable commands yield different results, and the latter results are not usable for workflow purposes.

Why do the script commands raise a different process that leaves out the data quoted above? The text layer is the same but that is not my issue.

I will look at the files when I receive the bug report.

Thanks. Sent, assigned ticket #812394.

Appreciate your insight as to why the contextual menu result is complete and the smart rule result is not.

The presence of Font in a PDF does not indicate that OCR has been done on a file. It could be present in a file that was captured or printed to PDF by normal means.

PS: I just converted several images via Convert > to Searchable PDF in the contextual menu. None of them includes a specific Font reference.

The Hazel forum folks recommend the grep search for “Font” as a test for OCR; I’ve been using it in Hazel rules for some time and have never run into an exception…until now. And the exception is a false “not OCRed” result, rather than a false “OCRed” result from finding “Font” in a non-OCRed PDF.

Do you mean “OCR–>to searchable PDF”? I don’t show that as a Convert option.

How are you searching for “Font”? Search using grep or within BBEdit, or, if you have Hazel, try this, which returns a match if the file is not OCRed:

(This discussion is now more academic than practical, as Alan has indicated that the smart rule process won’t allow the grep test, so I’m using tags instead. Still gets the workflow accomplished.)


Do you mean “OCR–>to searchable PDF”? I don’t show that as a Convert option.

Yes. Sorry about that.

BBEdit at the moment.

Can you access the two PDF files in ticket #812394? If so, load them into BBEdit and do a Compare…you’ll see the difference. One was converted with the contextual menu, the other with a smart rule.

The contextual menu version contains the following at line 18, which is not present in the smart rule version (there are similar ones further down in the file; this is the first):

<< /ProcSet [ /PDF /Text /ImageB /ImageC /ImageI ] /Font << /TT2 9 0 R /TT1
8 0 R >> /XObject << /Im1 10 0 R >> >>
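
For anyone without BBEdit, the comparison described above can be approximated in a script. This is a minimal sketch, assuming both PDFs can be read line by line as Latin-1 text and that listing the lines unique to the first file is enough; `lines_only_in_first` is a hypothetical helper, not a DEVONthink or BBEdit feature.

```python
import difflib
from typing import List


def lines_only_in_first(first: bytes, second: bytes, needle: str = "/Font") -> List[str]:
    """List lines containing `needle` that appear in `first` but not in `second`."""
    # Latin-1 maps every byte to a character, so binary data decodes without errors.
    a = first.decode("latin-1").splitlines()
    b = second.decode("latin-1").splitlines()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    unique = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # "delete" and "replace" blocks hold lines that exist only in the first file.
        if tag in ("delete", "replace"):
            unique.extend(line for line in a[i1:i2] if needle in line)
    return unique
```

Passing the contextual-menu file as `first` and the smart rule file as `second` should list entries like the one quoted above, if the difference is as described. On large PDFs the line-by-line comparison can be slow.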

(Edit…PS: I could have been more specific in my post 9…I’m referring to PDF products of document scanning, and I realize there are many more types. Regret the confusion.)