OCR issues with PDF > PDF+text, again

tharpold · April 3, 2024, 12:06pm

I’ve recently gathered a large number of scanned copies of old journals (PDF only, no text layer) used in my research and am in the process of gradually adding them to my DT3 databases. (3.9.6, macOS 14.4.1). As I copy them over to DT3, I manually execute the “OCR to searchable text” command for each file.

My workflow in this situation is to add maybe a half-dozen or so journal issues at a time, queue up the OCR jobs, verify the results when completed, then delete the old PDF-only files and move the new PDF+text files to their proper location in the database. As I’m short on time these days, I’m doing this only perhaps once or twice a day. Thus I’m running the OCR conversions between six and twelve times daily, on files that are about 60-90 MB without a text layer and 120-180 MB after the conversion.

I’m once again running into an issue I’ve encountered before: after repeated successes, DT3’s built-in OCR (ABBYY FineReader) just stops working. I select the “OCR to searchable text” command and nothing happens. Error messages are rarely generated in these situations and nothing pops up in the Activity window to indicate that the menu command was even registered. Repeated efforts to execute it don’t work. Usually restarting DT3 fixes the problem; sometimes (this happened this AM) I have to reboot my computer to get this feature of DT3 back. It’s as if ABBYY reaches some threshold for OCR, either a bug or some kind of counter, and just won’t budge after that. After restarting/rebooting, the documents in question OCR correctly, as if nothing had happened.

The issue is difficult to reproduce – I do wish I could be more detailed in this report – and I can’t identify the conditions under which it occurs. But it does happen, nearly predictably, almost every time I try to run a lot of OCR jobs. And it’s the only significant glitch I’ve encountered with DT3.

So… I’m registering an observation that something is amiss…?

BLUEFROG · April 3, 2024, 12:52pm

This is the kind of thing to be reported in our support ticket system. Hold the Option key and choose Help > Report bug to start a support ticket.
Thanks!

tharpold · April 3, 2024, 12:56pm

Hmm. I just did that, but my default email program (Outlook 16.8.3) did not include the email address for support in the To: field. What is the correct email address to generate a support ticket?

BLUEFROG · April 3, 2024, 2:56pm

Send it to support@devontechnologies.com.
The email should have some technical info and at least two log files for us to inspect.

tharpold · April 3, 2024, 3:25pm

Well, then, hmm. That’s not working as it should. No logs attached, just a long string of stats on my databases. Is there a way to force the bug report to open in an email program of the user’s choice? (On this computer, for job-related reasons, my default email app is, ugh, Outlook.)

BLUEFROG · April 3, 2024, 3:37pm

Sorry to hear you’re stuck with Outlook.
Just open a support ticket and we’ll go from there.

chrillek · April 3, 2024, 4:00pm

The problem might be that only Apple’s Mail is easily scriptable. And there’s a plethora of mail programs out there, not to mention mail apps in the browser.

signsinthedust · April 3, 2024, 6:51pm

This is also an issue I have encountered when trying to OCR book PDFs.

rmschne · April 3, 2024, 6:53pm

Permissions on the book PDFs in the way? Aren’t book PDFs already OCRed?

signsinthedust · April 4, 2024, 12:14am

Not all of my books are OCR’d, at least.

rmschne · April 4, 2024, 5:45am

But I’m wondering whether the publisher (whoever created the PDF) set the file’s permissions in a way that allows someone to OCR, i.e. “change”, it.

When creating PDFs, there are a number of permission options that can be placed on the created file to allow, for example, opening, reading, printing, editing, … That’s in the ISO spec for PDF (ISO 32000) and implemented in PDF-creating software.

Try: “print” a new PDF from the book with macOS Preview app (if the publisher allows the file to be printed) then OCR the new one.
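
A rough way to test the permissions theory before re-printing anything: in the PDF format, permission restrictions are stored in the file's encryption dictionary, which the trailer points to with an /Encrypt entry. The sketch below is only a heuristic that scans the raw bytes for that entry; it is not a real PDF parser, and the function names `looks_restricted` and `check_file` are made up for illustration.

```python
import re
import sys
from pathlib import Path

# An /Encrypt key followed by an indirect reference ("5 0 R") or an inline
# dictionary ("<<"). Requiring one of those avoids matching /EncryptMetadata.
_ENCRYPT_KEY = re.compile(rb"/Encrypt\s*(\d+\s+\d+\s+R|<<)")


def looks_restricted(pdf_bytes: bytes) -> bool:
    """Heuristic: True if the raw bytes contain an /Encrypt entry.

    A PDF without an encryption dictionary cannot carry permission flags,
    so False means permissions are unlikely to be the cause.
    """
    return _ENCRYPT_KEY.search(pdf_bytes) is not None


def check_file(path: str) -> bool:
    """Read a file from disk and apply the heuristic (hypothetical helper)."""
    return looks_restricted(Path(path).read_bytes())


if __name__ == "__main__":
    # Usage: python check_pdf.py file1.pdf file2.pdf ...
    for name in sys.argv[1:]:
        status = "has /Encrypt entry" if check_file(name) else "no /Encrypt entry"
        print(f"{name}: {status}")
```

If a file reports no /Encrypt entry, permissions are probably not what is blocking OCR, and the problem is more likely on the application side.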

signsinthedust · April 4, 2024, 5:51am

I have tried that, actually. I still experience the same issue as @tharpold.

rmschne · April 4, 2024, 5:57am

DEVONthink will need to address this. They’ll want to know the macOS version, the DEVONthink version, and perhaps a sample of an offending PDF.