OCR for PDF that is already half readable

tamara6 · March 30, 2025, 9:01pm

I have a PDF document (a dissertation downloaded from ProQuest) that seems to be partly an image and partly a text layer. I am a new DEVONthink (DT) user (I have the Pro edition). DT tells me that the document is PDF+Text, and it gives me a word count. I had DT convert it to a text file. The first part of the text file is gibberish, but the second part (which seems to be an appendix to the dissertation) is readable. Words in this second part can also be searched and found using the in-document search function (in the inspector on the right side of the window). Words before the appendix (such as “abstract”), which I can see on the page, are not searchable.

If I open the downloaded PDF in Preview, the same thing happens - I can search for words in the appendix, but not in the rest of the document.

Since most of the document is clearly not readable, I chose Data > OCR > to searchable PDF. It appears to process until it reaches the pages that are already readable, and then it stops without saving a version that is readable for all pages.

So… I’m not sure what to do. I’m sure I could actually print (on paper) the whole dissertation, scan it in, and then do an OCR on it. But that wastes a lot of paper. Is there another way? Is there a way to delete the text layer that is already there? Or have Preview save an “image only” file, that then DT could do an OCR on to get a text layer for the whole document?

(I’m not sure if I have permission to upload the document here, since I downloaded it from behind a password-protected site, but I’m happy to post the link if anyone thinks they can get to it and wants it.)

I should add that I’m on an M3 MacBook Air running Sonoma, and have DT 3.9.8.

Thanks!

BLUEFROG · March 30, 2025, 9:52pm

Welcome @tamara6

Hold the Option key and choose Help > Report bug to start a support ticket.

kewms · March 31, 2025, 12:54am

Split the PDF at the last unreadable page, and then OCR the first half?

tamara6 · March 31, 2025, 11:29pm

That is exactly what I did, and it worked well.
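
For anyone with a longer document, the split point suggested above can be located programmatically instead of by paging through the file. The following is only a rough sketch in Python, assuming the text of each page has already been extracted into a list of strings by some other tool. The function names `looks_readable` and `find_split_index` are hypothetical, and the "mostly ordinary printable ASCII" test is an assumed heuristic for spotting a gibberish text layer, not anything DT itself does. It may need a different threshold or test for non-English documents.

```python
import string

# Characters expected in ordinary English prose (an assumption of this sketch).
_EXPECTED = set(string.ascii_letters + string.digits + string.punctuation + " \n\t")


def looks_readable(text, threshold=0.9):
    """Return True if most characters in text are ordinary printable ASCII."""
    stripped = text.strip()
    if not stripped:
        return False  # an empty text layer counts as unreadable
    expected = sum(1 for ch in stripped if ch in _EXPECTED)
    return expected / len(stripped) >= threshold


def find_split_index(page_texts, threshold=0.9):
    """Return the 0-based index of the first page of the readable tail.

    Pages before this index need OCR; pages from it onward already have
    a usable text layer. Returns len(page_texts) if no tail is readable.
    """
    split = len(page_texts)
    # Walk backwards from the last page until an unreadable page is hit.
    for i in range(len(page_texts) - 1, -1, -1):
        if looks_readable(page_texts[i], threshold):
            split = i
        else:
            break
    return split


if __name__ == "__main__":
    # Made-up page texts standing in for a real extraction.
    pages = ["\u25a1\u25a0\u2021\u00a7\u00b6\u2020", "", "Appendix A: survey questions", "Table 1. Responses"]
    print(find_split_index(pages))
```

The returned index is 0-based, so add 1 to get the page number at which to split the PDF. Everything before that page goes through OCR, and the readable tail can be left alone.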