I have a DT3 database with approximately 10,000 PDF documents. Almost all of them are without OCR. When opening the database with DT4, are these files automatically made searchable (without OCR) within DT4? If not, what do I need to do to achieve this?
How do you imagine searching for text without any text (layer) being present?
Feature Description of DT4:
“PDFs added to your database can be made searchable without explicitly using OCR (though OCR is still a powerful and useful feature).”
That’s referring to Apple’s Vision framework, I suppose. I believe there were some posts here discussing that. Searching the forum should reveal what has been said about that.
That doesn’t answer my question about whether indexing happens automatically during the upgrade (when opening an old database in DT4 for the first time). If it happens at all. I couldn’t find anything about this in the forum either. That’s why I posted.
PDFs without a text layer, or with a broken text layer, may be processed with a function similar to Apple’s Live Text. However…
- It is optional and needs to either be done on a document-by-document basis or enabled via the _Files > Import > Recognition: Transcribe PDF documents_ setting.
- This does not replace traditional OCR as it does not add a text layer to the document. It adds the recognized text to the database’s index, making it searchable without explicitly doing OCR.
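To illustrate the second point, here is a minimal sketch of the general idea of keeping recognized text in a separate index instead of writing it into the file. It is not DEVONthink's implementation; the `TextIndex` class, the file names, and the sample text are all made up for the example.

```python
from collections import defaultdict
import re


class TextIndex:
    """Toy search index that lives next to the files, not inside them."""

    def __init__(self):
        # word -> set of document names containing that word
        self._postings = defaultdict(set)

    def add(self, doc_name, recognized_text):
        # Store only the words; the PDF file itself is never modified.
        for word in re.findall(r"\w+", recognized_text.lower()):
            self._postings[word].add(doc_name)

    def search(self, query):
        # Return the documents that contain every word of the query.
        words = re.findall(r"\w+", query.lower())
        if not words:
            return set()
        hits = [self._postings.get(w, set()) for w in words]
        return set.intersection(*hits)


index = TextIndex()
# Hypothetical output of a recognition step for two image-only scans.
index.add("invoice-2021.pdf", "Invoice No. 4711 for scanner paper")
index.add("letter.pdf", "Dear Sir, thank you for the invoice")

print(sorted(index.search("invoice")))
print(sorted(index.search("scanner invoice")))
```

In this model the scan stays byte-for-byte unchanged, which is why no recompression of the image can occur. The flip side is that the text exists only in the index, so other apps opening the PDF will not find it.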
Thanks for the clarification. I’ll give it a try.
You’re welcome.
@MauriceK just curious: why would you not want to use OCR?
I don’t want to add a text layer. In the past, there have been problems when recompressing the scanned image (CCITT G4). The quality of the scan was significantly degraded. I don’t want to take that risk anymore.
Ah, I see. Luckily, I have not had that issue.
The results of my test are disappointing. Recognition is worse than with OCR, but also worse than Live Text performed manually in the Preview app. In addition, the documents are then labelled as “PDF+TEXT” if recognition was explicitly started. When recognition is performed during import, the label remains “PDF document”. Furthermore, after recognition, the number of words is initially 0, but after closing and reopening the database, the number is > 0. This means that I can no longer distinguish between PDF documents with and without a text layer.
What are your OCR settings?
The results of my test are disappointing.
We have no idea about your “test”, the actual documents, or the OCR settings I just asked about. More information, methodology, materials, etc. would be useful. You can send them in a support ticket if you don’t want to share publicly.
Not necessary. I’ve already spent too much time wondering whether upgrading to DT4 is worth it for me. I’ve decided to wait and see before investing any more time in this issue.
As you wish.
AFAIK, the OCR in DT4 has not changed from DT3; it is still ABBYY’s library. And the text recognition that is not ABBYY’s OCR is, I think, just Apple’s Vision framework. Which has, IMO, some shortcomings.
It’d be nice to be able to lift the ABBYY SDK limits on documents that can be OCRed per month and in parallel.
I’m sure that’s just a matter of license fees. Which would increase the cost for DT. I never ran into any limitations in that context until now, though.
Recognition is worse than with OCR, but also worse than Live Text performed manually in the Preview app.
This is curious. Is Apple’s Vision algorithm in Preview.app superior to the one in the macOS SDK?
I wouldn’t put it beyond Apple to use a different algorithm in their own products. When I fiddled around with Vision, running it on screenshots, it sometimes output the words on one line in the wrong order, for example. Haven’t seen that yet with Preview etc.
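For what it's worth, word order within a line is something one could try to repair after the fact when each recognized fragment comes with a bounding box. A rough sketch follows, assuming normalized coordinates with the origin at the lower left (the convention Vision uses for bounding boxes). The `reading_order` function, the tolerance value, and the sample fragments are invented for illustration; this says nothing about what Preview or DEVONthink actually do.

```python
def reading_order(fragments, line_tolerance=0.01):
    """Sort (text, x, y) fragments top-to-bottom, then left-to-right.

    x and y are assumed to be normalized coordinates with the origin at the
    lower left, so a larger y means higher up on the page.
    """
    lines = []  # each entry: [reference_y, [fragments on that line]]
    for frag in sorted(fragments, key=lambda f: -f[2]):
        # Fragments whose y is close to the current line's y share that line.
        if lines and abs(lines[-1][0] - frag[2]) <= line_tolerance:
            lines[-1][1].append(frag)
        else:
            lines.append([frag[2], [frag]])
    ordered = []
    for _, line in lines:
        # Within a line, order fragments by their left edge.
        ordered.extend(sorted(line, key=lambda f: f[1]))
    return [text for text, _, _ in ordered]


# Hypothetical fragments, deliberately out of order.
fragments = [
    ("world", 0.30, 0.902),
    ("Hello", 0.10, 0.900),
    ("line", 0.25, 0.850),
    ("Second", 0.10, 0.851),
]
print(" ".join(reading_order(fragments)))
```

This only handles simple single-column layouts. Skewed scans or multi-column pages would need something smarter than a fixed vertical tolerance.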