OCR images to make them searchable, but leave them as images?

stuartro · November 4, 2023, 3:42pm

Is it possible to have DT OCR images so that one might search for, say, “kind:image <text>”, where <text> is text extracted / “indexed” from the OCR results?

If the answer is “yes”, then I will have to upgrade to DT Pro this afternoon. In hope that the answer is “yes”, one more quick question: Are there any issues with viewing / searching emails (i.e. .eml files) in DT these days? I recall there being an issue with Catalina that broke the built-in macOS preview of emails. Does DT Pro rely on the built-in macOS functionality or have dedicated email previewing functionality?

DTLow · November 4, 2023, 3:48pm

AFAIK, these are two separate actions: OCR and Index.
I’m wondering where the text would be stored after the OCR action.
The image format doesn’t support a text layer.

BLUEFROG · November 4, 2023, 4:07pm

Text isn’t extracted from OCR and it’s always indexed.
And yes, you can create a smart rule to OCR incoming documents, including images. In fact, this is exactly what OCR is built for. It’s not a PDF document processor. It just looks that way on the outside.
This has been discussed many times and there is even a blog post on the subject…

Are there any issues with viewing / searching emails (i.e. .eml files) in DT these days?

None I’m aware of.

Does DT Pro rely on the built-in macOS functionality or have dedicated email previewing functionality?

Yes and yes. You should read Help > Tutorials > Handling Email.

stuartro · November 4, 2023, 5:57pm

@BLUEFROG, so at the risk of being pedantic, this means that if I add a PNG file showing the text “Hello World”, provided an appropriate smart rule to OCR incoming images, I should be able to search for “Hello” and the image will show up in the results list?

If I click on the image in the results list, is the word “Hello” highlighted in the image?

chrillek · November 4, 2023, 6:25pm

At the risk of stating the obvious: OCRing an image must create a file that contains text, e.g. a PDF. A PNG does not. So, do an “OCR to PDF” and the PDF will come up when you search for “hello”.
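
To make that concrete, here is a toy Python sketch of the idea. It is not DEVONthink's API or its actual indexing logic: `Record`, `ocr_to_pdf`, and `search` are made-up names for illustration, and the recognized text is supplied by hand rather than by a real OCR engine. It only models the claim above, that the search hits whichever file carries a text layer.

```python
from dataclasses import dataclass


@dataclass
class Record:
    name: str
    kind: str        # "image" or "pdf" (illustrative labels only)
    text: str = ""   # searchable text layer; empty for a plain image


def ocr_to_pdf(image: Record, recognized_text: str) -> Record:
    """Hypothetical helper: return a new PDF record carrying the recognized text."""
    stem = image.name.rsplit(".", 1)[0]
    return Record(name=f"{stem}.pdf", kind="pdf", text=recognized_text)


def search(records, term):
    """Case-insensitive substring match against each record's text layer."""
    term = term.lower()
    return [r.name for r in records if term in r.text.lower()]


png = Record(name="screenshot.png", kind="image")
pdf = ocr_to_pdf(png, "Hello World")  # OCR output is assumed, not computed

print(search([png], "hello"))       # the plain image has no text to match
print(search([png, pdf], "hello"))  # the PDF with the text layer is found
```

In this model the original image never changes; the search result comes from the new PDF record, which is why the PDF, not the PNG, shows up for “hello”.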

stuartro · November 4, 2023, 6:40pm

Thanks @chrillek. I understand fully now. Just upgraded to DT Pro and feeling like a jedi master with newfound powers

BLUEFROG · November 4, 2023, 7:49pm

Yes, indeed @chrillek’s response was

stuartro · November 4, 2023, 8:49pm

As part of trying out my new OCR powers, I created a little “Experiments” database, added a single PNG screenshot to it, and created a Smart Rule as follows:

However… if I right-click on the new Smart Rule and click Apply Rule, DEVONthink Pro hangs with a spinning pizza icon. I waited for more than 5 minutes (thinking perhaps it was downloading some OCR component or something) and then eventually had to force-quit DT. This is repeatable — i.e. same hang every time I “Apply Rule”.

Any ideas what might be wrong?

Running DT Pro 3.9.4 on macOS Ventura 13.6 (on 5K iMac 27" 2019, 72GB RAM, lots of free disk space).

stuartro · November 4, 2023, 8:50pm

The same hang occurs if I right-click on the PNG file and directly select “OCR to searchable PDF”.

chrillek · November 4, 2023, 9:06pm

Is the OCR stuff installed? Search for ABBYY in the forum to find pointers. I’m away from my Mac, so I can’t say more than that.

stuartro · November 4, 2023, 9:12pm

I just rebooted and the problem seems to be gone.

Now, however, I have another question: After OCRing the PNG file to a searchable PDF, is it possible to have some link automatically created between the original PNG image file and the output PDF?

I see when I “OCR to annotation”, the generated annotation RTF files have incoming links (from the OCR’d image file)—though, for the life of me I can’t find out where the outgoing link (on the OCR’d image) is stored. That is, displaying the “Outgoing links” column in the file list view shows “1”, but the Links section (in the sidebar) shows nothing.

DTLow · November 4, 2023, 9:20pm

Use the Annotations Inspector
and right-click > Reveal

chrillek · November 4, 2023, 9:54pm

Why? The PDF contains the same image data as the PNG, so what do you need the PNG for anymore?