OCR without compression

Using DT3 Pro, is it possible to run OCR on a PDF document without compressing it at the same time? I’ve unchecked Compress PDF under Preferences > OCR, but that doesn’t change this behaviour; it seems only to affect scans made within DT3 itself.

Welcome @Ruth

Why are you asserting that unchecking the Compress PDF option is doing nothing?

Hi @BLUEFROG

I am sending scans to DT3 Pro from the macOS app PDFScanner. Although the latter has OCR functionality, I prefer not to use it, as DT3’s OCR is more accurate. I then right-click on the PDF in my DT3 inbox and select OCR to searchable PDF. I then see two files: one is the imported PDF document, the other a PDF+Text. The latter tends to be about a third of the size of the former, is slightly fuzzier, and the black/grey is lighter. Unchecking Compress PDF in the OCR preferences does not change anything. Hope this is clear.

I have a similar situation here. I want DEVONthink to just add the OCR text to the PDF, not have the text or images changed in any way, as the original scan is already high quality. Even when setting the DPI to 300 in DEVONthink and deselecting "Compress PDF", the text ends up looking slightly fuzzy and less readable on high-DPI devices like the iPad.

I have had this problem as well, so I’ve been doing my OCR outside of DT, but I would prefer not to jump through that extra hoop. More options (I could choose a level of compression appropriate for my use case) or just a no compression option (probably the best fit for me) would be appreciated.

You can already disable compression. However, each page image is still processed by the OCR engine, page by page.

Why is that, Jim? So - on a technical level - why is it not possible to extract the image from a PDF and put exactly that very same image back into the new PDF (or: why is the image processed at all when compression is off)? (The question, I’m sure, reveals my complete lack of understanding of the process - so I’m only asking to expand my knowledge.)
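
To make the question concrete, here is roughly what I naively picture, sketched with the Python library PyMuPDF (the file names are made up, and I’m sure this is not what DEVONthink or ABBYY actually do internally). What I don’t understand is why the OCR step can’t leave the image stream alone like this and merely add a hidden text layer:

```python
# A naive sketch, NOT DEVONthink's/ABBYY's actual pipeline: copy each page's
# scanned image into a new PDF without touching it, then lay invisible text
# on top. Requires PyMuPDF (pip install pymupdf); file names are hypothetical.
import fitz  # PyMuPDF

src = fitz.open("scan.pdf")
out = fitz.open()

for page in src:
    # A typical scanner-produced page contains a single image XObject.
    xref = page.get_images(full=True)[0][0]
    img = src.extract_image(xref)          # the raw compressed bytes (e.g. JPEG)

    new_page = out.new_page(width=page.rect.width, height=page.rect.height)
    # Re-embed the extracted stream; nothing is rescaled or re-rendered here.
    new_page.insert_image(new_page.rect, stream=img["image"])
    # An OCR engine would supply the recognised words and their positions;
    # render_mode=3 writes them as invisible, searchable text.
    new_page.insert_text((72, 72), "recognised text would go here", render_mode=3)

out.save("scan-ocr.pdf")
```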

@aedwards would be the expert on this particular topic regarding how the ABBYY engine processes the files.

I think I misstated my problem. I want to perform optical character recognition on my PDF files while keeping the amount of image degradation to a minimum. I am not familiar with the specifics of how the ABBYY engine performs the process, but the results have not been able to meet my expectations. For example:

I have a 39-page file of newspaper clippings (scanned using my iPhone, so relatively poor quality to begin with) from four different sources (Japanese and English) that I attempted to process using Japanese + English as the primary language settings without compression and with the highest dpi allowed (300). The result was a file of only 9 pages (pp. 1–30 disappeared) with text that is significantly less legible (even illegible in places) and photos that are blurry. Where did the thirty missing pages go? Why is 300 dpi the best the OCR can do?

Ideally, the file would not lose pages in the OCR process, and the DPI would be high enough that any degradation is imperceptible. This is especially true for materials with very high-quality images in them. Concretely speaking, I want to obtain results similar to those I get from Adobe Acrobat Pro X: no reduction in image quality and excellent text recognition, without losing nearly 75% of the pages. Would this be possible?

Here are some examples.

(Original File)

(ABBYY)


(Adobe Acrobat Pro X)
*No downsampling, so at least to my naked eye, there is no perceptible difference between the original and the OCR’d version


Because, afaik, more DPI does not mean better OCR. Given that newspapers print at about 75 DPI, using more than that in your case will not help (it will just add visual noise instead).

Frankly, the quality of your “scans” is so bad that I don’t see a point in wondering about the poor quality of the OCR. Especially with complicated glyphs like Japanese where every detail counts.

You could try (!) running your images through some kind of enhancement process first (edge detection, increased contrast, reduction to black-and-white – not necessarily in this sequence). Or see if you can copy/paste the text in Apple’s Photos app.
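
Something along these lines, for example. This is only a rough sketch with the Python imaging library Pillow; the file names are placeholders and the threshold is a guess you would have to tune per document, so whether it actually helps your material is something to test:

```python
# Rough pre-processing sketch with Pillow (pip install pillow).
# File names and the threshold value (160) are placeholders to adjust.
from PIL import Image, ImageFilter, ImageOps

img = Image.open("clipping.jpg").convert("L")    # convert to grayscale
img = ImageOps.autocontrast(img)                 # stretch the contrast range
img = img.filter(ImageFilter.SHARPEN)            # sharpen edges a little
bw = img.point(lambda p: 255 if p > 160 else 0)  # crude black-and-white threshold
bw.save("clipping-prepped.png")                  # feed this to the OCR step
```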

I had two points to make earlier: (1) pages from my PDF get deleted and (2) the file images are degraded through the process.

As you noted, the quality of the original scans I shared was not especially good, but that is not the point. The problem is that the images are degraded by the OCR process. At the other end of the spectrum, I often make high-quality scans of books and so forth with gorgeous images of artwork or manuscripts that get mangled by the image degradation. Sure, the text in the explanations and commentaries becomes searchable through OCR, but it becomes harder to read, the photos of the items themselves are ruined in the process, and pages are deleted, so I avoid using DT’s OCR. The trade-off isn’t worth it, and I can avoid all of these problems with Adobe. In other words, the technology to do OCR without degrading the images exists, so I would like to see it implemented in DT’s OCR feature.

As for the quality of the OCR, I didn’t have any comment to make earlier about it, but if we are talking about that, I would say (in my experience) that ABBYY seems to be noticeably better on some things than Adobe and slightly worse on others. I would need to do more testing, but overall I might actually prefer ABBYY if the quality of the images could be maintained and we could avoid data loss.

I would love a setting (even a hidden setting) which completely disabled downsampling on OCR and left the image resolution the same as the incoming document. Didn’t DTPO 2 have that setting once upon a time? I realize it can make for huge files, but I’d love the ability to make that choice for myself.

and left the image resolution the same as the incoming document.

Considering there are plenty of times where people scan with far more resolution than is needed (and even 600 dpi is rarely needed), I still think an upper bound should be in place.

Maybe there could be an upper bound, with an option to always resample to a given DPI (for those who want that) and an option not to resample as long as the source is below the limit?

Even if I could set the existing DPI field to something like 600 and be sure that DT won’t resample UP if the existing DPI was lower than that, I think it would satisfy nearly all my needs. And when I really needed to be picky, I could instead OCR it with a tool like Acrobat with more knobs to twiddle.
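
In code terms, the rule I’m hoping for amounts to nothing more than this; it’s just a sketch of the behaviour I’d like, not anything DEVONthink actually implements:

```python
# Sketch of a "never resample up" rule: only downsample when the source
# exceeds the chosen ceiling, otherwise leave the resolution untouched.
def target_dpi(source_dpi: int, ceiling: int = 600) -> int:
    return min(source_dpi, ceiling)

assert target_dpi(300) == 300    # below the ceiling: left alone
assert target_dpi(1200) == 600   # above the ceiling: downsampled to the ceiling
```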

Perhaps.
@aedwards would have a more authoritative word on this.

The current DPI settings are a workaround for an issue with ABBYY’s processing of PDF files. We are expecting an update to ABBYY very soon which should remove the need for this workaround and allow the output PDF to be generated at the same DPI as the input.

This improvement would be wonderful. Thank you.

As far as opinions about upper limits on the resolution go, I can’t comment on other people’s work, but for my own, there are certainly cases when I want the best resolution available. Of course, there are others when I frankly do not mind a lower resolution, but it depends on a lot of factors, so a one-size-fits-all approach simply won’t work for me.

At any rate, in most cases, I just want the processed file to come out looking just like the original (no downsampling?), so if that’s an option, I’ll be satisfied.

That’s great news, thank you!