OCR via DTPro creates enormous page sizes?

BLUEFROG · June 2, 2021, 2:34pm

Why did you do this?

cmedy · June 2, 2021, 2:53pm

For years now I have been taking research trips to archives, where I’d take thousands of photos of historical documents and then use Acrobat Pro to convert all the files to PDFs and OCR them that way. After I started using DT (in 2011 I believe), I would import them as already OCR-ed PDFs.

At some point (last year or the year before?) I discovered that DT’s OCR capabilities had improved: it not only did a better job recognizing text but also made the files smaller. I started going back and OCR-ing old files with DT.

This process worked fine until February, when I started on a new project and OCRed tons of older files that I hadn’t touched in several years.

I do have JPGs and can go in and redo the process for all of the files that somehow became enlarged. But that means going through a few thousand files that I’d already sorted, tagged, etc. And I can make sure to process all JPGs in DT from the start.

But I know it happened with files that I downloaded as PDFs or received as PDFs from manuscript collections, too. And so it would be great to figure out why it happens so I can ensure it doesn’t happen again.

A couple of possible clues, maybe: the change might have coincided with DT moving from ABBYY 11 to ABBYY 12? And, if it makes any difference to how PDF services work in DT, I do own ABBYY 12, though I rarely use it.

I’m on a MacBook Pro 2020, Big Sur (I waited to install Big Sur until maybe a month ago, but I’ve been using the same Mac since May 2020).

Thank you.

BLUEFROG · June 2, 2021, 2:56pm

If you’re doing OCR in DEVONthink or DEVONthink To Go, there’s no need to convert them to PDFs beforehand.

Also, you shouldn’t run OCR on every PDF you acquire. You should determine if they need OCR first.

cmedy · June 2, 2021, 3:02pm

Yes. I understand both of those points, and I tried to explain my reasoning for having done so above while also assuring you that I’ve learned my lesson.

However, the problem still remains that I have a number of PDFs that do need to be OCRed that I did not acquire as JPEGs. And often these files, too, have been problematic. At the moment I don’t have the originals of those and so I can’t recreate the problem. But when/if it happens again – after I have received a PDF in need of OCRing for which I cannot access a JPG file – I’ll send those along to Alan as well.