Make PDF searchable without explicitly using OCR

I have a DT3 database with approximately 10,000 PDF documents. Almost all of them are without OCR. When opening the database with DT4, are these files automatically made searchable (without OCR) within DT4? If not, what do I need to do to achieve this?

How do you imagine searching for text without any text (layer) being present?

Feature Description of DT4:
“PDFs added to your database can be made searchable without explicitly using OCR (though OCR is still a powerful and useful feature).”

That’s referring to Apple’s Vision framework, I suppose. I believe there were some posts here discussing that. Searching the forum should reveal what has been said about that.

E.g., Re-Indexing Imported PDF Documents - #4 by BLUEFROG

That doesn’t answer my question about whether indexing happens automatically during the upgrade (when opening an old database in DT4 for the first time). If it happens at all. I couldn’t find anything about this in the forum either. That’s why I posted.

PDFs without a text layer, or with a broken text layer, may be processed with a function similar to Apple’s Live Text. However…

  1. It is optional and needs to be done either on a document-by-document basis or enabled via the _Files > Import > Recognition: Transcribe PDF documents_ setting.
  2. This does not replace traditional OCR as it does not add a text layer to the document. It adds the indexed text to the database’s index, making it searchable without explicitly doing OCR.
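
The distinction in point 2 can be illustrated with a minimal sketch. This is not DEVONthink's implementation; `SidecarIndex`, the file name, and the recognized text are hypothetical stand-ins. It only shows the general idea of keeping recognized text in a separate search index while the PDF itself stays untouched:

```python
import hashlib


class SidecarIndex:
    """Toy search index that stores recognized text beside, not inside, the files."""

    def __init__(self):
        self._words = {}  # word -> set of document names

    def add(self, name, recognized_text):
        # Index every lower-cased word of the recognized text under the document name.
        for word in recognized_text.lower().split():
            self._words.setdefault(word, set()).add(name)

    def search(self, word):
        return sorted(self._words.get(word.lower(), set()))


# Stand-in for a scanned PDF without a text layer (hypothetical content).
pdf_bytes = b"%PDF-1.4 image-only scan"
checksum_before = hashlib.sha256(pdf_bytes).hexdigest()

index = SidecarIndex()
# Hypothetical recognizer output; a real transcription step would supply this.
index.add("invoice.pdf", "Invoice 2024 Total 120 EUR")

hits = index.search("invoice")
checksum_after = hashlib.sha256(pdf_bytes).hexdigest()
```

In this sketch the document is found by a search, yet its bytes (and therefore any scan compression) are unchanged, which is the property that matters if adding a text layer is unwanted.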

Thanks for the clarification. I’ll give it a try.

You’re welcome.

@MauriceK just curious: why would you not want to use OCR?

I don’t want to add a text layer. In the past, there have been problems when recompressing the scanned image (CCITT G4). The quality of the scan was significantly degraded. I don’t want to take that risk anymore.

Ah, I see. Luckily, I have not had that issue.

The results of my test are disappointing. Recognition is worse than with OCR, but also worse than Live Text performed manually in the Preview app. In addition, the documents are then labelled as ‘PDF+TEXT’ if recognition was explicitly started. When recognition is performed during import, the label remains ‘PDF document’. Furthermore, after recognition, the number of words is initially 0, but after closing and reopening the database, the number is > 0. This means that I can no longer distinguish between PDF documents with and without a text layer.

What are your OCR settings?

The results of my test are disappointing.

We know nothing about your “test”, the actual documents, or the OCR settings I just asked about. More information, methodology, materials, etc. would be useful. You can send them in a support ticket if you don’t want to share publicly.

Not necessary. I’ve already spent too much time wondering whether upgrading to DT4 is worth it for me. I’ve decided to wait and see before investing any more time in this issue.

As you wish :)

AFAIK, the OCR in DT4 has not changed from DT3; it is still ABBYY’s library. And the text recognition that is not ABBYY’s OCR is, I think, just Apple’s Vision framework, which has, IMO, some shortcomings.

It’d be nice to be able to lift the ABBYY SDK limits on documents that can be OCRed per month and in parallel.

I’m sure that’s just a matter of license fees. Which would increase the cost for DT. I never ran into any limitations in that context until now, though.

Recognition is worse than with OCR, but also worse than Live Text performed manually in the Preview app.

This is curious. Is Apple’s Vision algorithm in Preview.app superior to the one in the macOS SDK?

I wouldn’t put it past Apple to use a different algorithm in their own products. When I fiddled around with Vision, running it on screenshots, it sometimes output the words on one line in the wrong order, for example. Haven’t seen that yet with Preview etc.
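
For what it’s worth, a word-order problem like that can sometimes be worked around after recognition. The following is only a rough sketch under assumptions: `Word`, `reading_order`, the tolerance value, and the top-left coordinate convention are all hypothetical and not part of Vision’s API, and nothing here says how Preview or DT4 handle it. It groups recognized words into lines by vertical position and sorts each line left to right:

```python
from typing import List, NamedTuple


class Word(NamedTuple):
    text: str
    x: float  # left edge, normalized 0..1
    y: float  # vertical centre, normalized 0..1, top-left origin (assumption)


def reading_order(words: List[Word], line_tolerance: float = 0.01) -> List[str]:
    """Group words into lines by similar y, then sort each line left to right."""
    lines: List[List[Word]] = []
    for w in sorted(words, key=lambda w: w.y):
        # A word joins the current line if it is vertically close to that line's first word.
        if lines and abs(lines[-1][0].y - w.y) <= line_tolerance:
            lines[-1].append(w)
        else:
            lines.append([w])
    return [" ".join(w.text for w in sorted(line, key=lambda w: w.x)) for line in lines]


# Words arrive in scrambled order, as a recognizer might deliver them.
sample = [
    Word("searchable", 0.40, 0.101),
    Word("Make", 0.10, 0.100),
    Word("PDF", 0.25, 0.102),
    Word("without", 0.10, 0.200),
    Word("OCR", 0.30, 0.199),
]
result = reading_order(sample)
```

This simple grouping breaks down on skewed scans or multi-column layouts, where a fixed tolerance is not enough.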