Make PDF searchable without explicitly using OCR

MauriceK · August 24, 2025, 4:30pm

I have a DT3 database with approximately 10,000 PDF documents. Almost all of them are without OCR. When opening the database with DT4, are these files automatically made searchable (without OCR) within DT4? If not, what do I need to do to achieve this?

chrillek · August 24, 2025, 4:37pm

How do you imagine searching for text without any text (layer) being present?

MauriceK · August 24, 2025, 4:42pm

Feature Description of DT4:
“PDFs added to your database can be made searchable without explicitly using OCR (though OCR is still a powerful and useful feature).”

chrillek · August 24, 2025, 4:59pm

That’s referring to Apple’s Vision framework, I suppose. I believe there were some posts here discussing that. Searching the forum should reveal what has been said about that.

E.g., Re-Indexing Imported PDF Documents - #4 by BLUEFROG

MauriceK · August 24, 2025, 5:06pm

That doesn’t answer my question about whether indexing happens automatically during the upgrade (when opening an old database in DT4 for the first time). If it happens at all. I couldn’t find anything about this in the forum either. That’s why I posted.

BLUEFROG · August 24, 2025, 5:16pm

PDFs without a text layer, or with a broken text layer, may be processed with a function similar to Apple’s Live Text. However…

It is optional and needs to either be done on a document-by-document basis or enabled via the Files > Import > Recognition: Transcribe PDF documents setting.
This does not replace traditional OCR as it does not add a text layer to the document. It adds the indexed text to the database’s index, making it searchable without explicitly doing OCR.
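To illustrate the distinction drawn above, here is a minimal Python sketch of a search index that stores recognized text separately from the file. It is not DEVONthink's implementation: `SearchIndex`, `add_transcription`, and `search` are hypothetical names, and the transcribed string stands in for whatever the recognition step would return.

```python
from collections import defaultdict


class SearchIndex:
    """Toy inverted index mapping lowercase words to document IDs (hypothetical)."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add_transcription(self, doc_id, text):
        # Only the index changes. The PDF file itself is never rewritten,
        # so no text layer is added and the scan is not recompressed.
        for word in text.lower().split():
            self._postings[word.strip(".,;:!?")].add(doc_id)

    def search(self, word):
        # Return matching document IDs in a stable, sorted order.
        return sorted(self._postings.get(word.lower(), set()))


pdf_bytes = b"%PDF-1.4 (image-only scan)"  # stand-in for a file on disk
index = SearchIndex()
index.add_transcription("invoice.pdf", "Invoice 2025, total due: 120 EUR")

print(index.search("invoice"))  # the document is now findable by content
```

In this model the document becomes searchable while its bytes stay untouched, which is also why the file cannot be searched by tools that only look at the PDF's own text layer.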

MauriceK · August 24, 2025, 5:36pm

Thanks for the clarification. I’ll give it a try.

BLUEFROG · August 24, 2025, 5:44pm

You’re welcome.

CAE · August 24, 2025, 7:51pm

@MauriceK just curious: why would you not want to use OCR?

MauriceK · August 25, 2025, 9:45am

I don’t want to add a text layer. In the past, there have been problems when recompressing the scanned image (CCITT G4). The quality of the scan was significantly degraded. I don’t want to take that risk anymore.

CAE · August 25, 2025, 5:21pm

Ah, I see. Luckily, I have not had that issue.

MauriceK · August 25, 2025, 6:12pm

The results of my test are disappointing. Recognition is worse than with OCR, but also worse than Live Text performed manually in the Preview app. In addition, the documents are then labelled as ‘PDF+TEXT’ if recognition was explicitly started. When recognition is performed during import, the label remains ‘PDF document’. Furthermore, after recognition, the number of words is initially 0, but after closing and reopening the database, the number is > 0. This means that I can no longer distinguish between PDF documents with and without a text layer.

BLUEFROG · August 25, 2025, 6:13pm

What are your OCR settings?

“The results of my test are disappointing.”

We have no idea about your “test”, the actual documents, or the OCR settings I just asked about. More information, methodology, materials, etc. would be useful. You can send them in a support ticket if you don’t want to share publicly.

MauriceK · August 25, 2025, 6:20pm

Not necessary. I’ve already spent too much time wondering whether upgrading to DT4 is worth it for me. I’ve decided to wait and see before investing any more time in this issue.

BLUEFROG · August 25, 2025, 6:26pm

As you wish

chrillek · August 25, 2025, 6:52pm

AFAIK, the OCR in DT4 has not changed from DT3; it is still ABBYY’s library. And the text recognition that is not ABBYY’s OCR is, I think, just Apple’s Vision framework, which has, IMO, some shortcomings.

cornchip · August 25, 2025, 7:02pm

It’d be nice to be able to lift the ABBYY SDK limits on documents that can be OCRed per month and in parallel.

chrillek · August 25, 2025, 7:25pm

I’m sure that’s just a matter of license fees, which would increase the cost for DT. I never ran into any limitations in that context until now, though.

macula · August 25, 2025, 7:52pm

“Recognition is worse than with OCR, but also worse than Live Text performed manually in the Preview app.”

This is curious. Is Apple’s Vision algorithm in Preview.app superior to the one in the macOS SDK?

chrillek · August 25, 2025, 8:00pm

I wouldn’t put it beyond Apple to use a different algorithm in their own products. When I fiddled around with Vision, running it on screenshots, it sometimes output the words on one line in the wrong order, for example. Haven’t seen that yet with Preview etc.
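For anyone hitting the same word-order problem, one possible workaround is to re-sort the recognized words by position before joining them. The following is only a sketch under stated assumptions: each word arrives as a `(text, x, y)` tuple in normalized coordinates with the origin at the bottom-left, and `reading_order` and `line_tolerance` are hypothetical names, not part of any Apple API.

```python
def reading_order(words, line_tolerance=0.02):
    """Sort (text, x, y) tuples top-to-bottom, then left-to-right.

    Assumes normalized coordinates with the origin at the bottom-left,
    so a larger y means higher on the page.
    """
    lines = []
    for word in sorted(words, key=lambda w: -w[2]):
        # Join the current line if the vertical position is close enough
        # to the first word of that line.
        if lines and abs(lines[-1][0][2] - word[2]) <= line_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Within each line, order the words by their horizontal position.
    return [" ".join(w[0] for w in sorted(line, key=lambda w: w[1])) for line in lines]


# "world" is listed before "Hello" although it sits further right on the same line.
words = [("world", 0.40, 0.90), ("Hello", 0.10, 0.91), ("again", 0.10, 0.80)]
print(reading_order(words))
```

The tolerance is a guess and would need tuning for skewed scans or multi-column layouts, where a simple top-to-bottom sort is not enough.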