OCR via DTPro creates enormous page sizes?

cmedy · May 31, 2021, 7:54pm

Lately, and bizarrely, when I OCR a PDF via DT3, I end up with pages that are up to 56 inches wide. They say they are produced via the ABBYY 12 plug-in, and I do own ABBYY 12, so I don’t know if that is the problem or how to get DT to go back to OCR-ing files without changing the page size. In preferences, I have the OCR resolution set at 200, but the only change I can see is that now the files (in properties) say they were created using the ABBYY FineReader 12 engine rather than 11. I can go in and use the preflight options in Adobe Pro to change the page sizes back to U.S. Letter (8.5 x 11), but that is a time-consuming, tiresome process. Thank you for any help you can provide.

Blanc · May 31, 2021, 8:11pm

That is interesting; that problem did occur with a previous version of the OCRhelper (see here for example). I have not come across it recently; none of the files I have just checked after recently OCRing them using DT are showing incorrect page sizes. They all show ABBYY FineReader Engine 12 as the creator. (As an aside: all my scans are ISO sizes, e.g. A4, A3.)

Could you pls post the version number of this file: /Users/yourusername/Library/Application Support/DEVONthink 3/Abbyy/DTOCRHelper.app on your Mac?

Mine is 1.1.17, dated 09.02.2021.

cmedy · May 31, 2021, 8:15pm

Thank you. Mine is also 1.1.17, also dated Feb. 9, and coincidentally all of the files that are problematic were created after February. I hadn’t used the database as much and didn’t pick up on what was causing problems until this week, but any file that I OCRed in DT before February is the correct size.

cmedy · May 31, 2021, 8:20pm

Here’s an example. I had a PDF of a single page that was about 8.5 inches x 11 inches (U.S. letter), and probably about 1 MB. Then I OCRed it. Now it is 60.1 inches by 73.3 inches and 7.5 MB. It also looks far worse than it did before.

Blanc · May 31, 2021, 8:21pm

I think @aedwards does the OCR bit of DEVONthink; this is the first report I have read of this problem recurring, so perhaps Alan will have some questions or input.

How are the documents arriving in DEVONthink? Does the page size get changed regardless of whether you OCR a scan or a document from a different source (e.g. print to PDF)? Does the same happen to ISO defined type pages (e.g. A4 size)?

Blanc · May 31, 2021, 8:22pm

That’s a different problem from the previous one, if I remember correctly: in that earlier case the file size did not change because, whilst the page size changed, the DPI changed proportionally.
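
For reference, the page size a PDF reports for a scanned image is just the pixel dimensions divided by the resolution the image is tagged with. That is why the page size can change while the pixel data (and so the file size) stays the same. A minimal sketch of that arithmetic; the 4032 x 3024 pixel image and both dpi values are assumptions chosen purely for illustration, not values taken from the files in this thread:

```python
def page_size_inches(width_px, height_px, dpi):
    """Return (width, height) in inches for an image tagged with the given dpi."""
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    return (width_px / dpi, height_px / dpi)


# Hypothetical image: same pixels, two different resolution tags.
print(page_size_inches(4032, 3024, 72))   # (56.0, 42.0)
print(page_size_inches(4032, 3024, 288))  # (14.0, 10.5)
```

If an OCR step instead re-renders the page at a new resolution without scaling the page box to match, both the page size and the file size can grow, which would be a different failure from the earlier one.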

cmedy · May 31, 2021, 8:24pm

It doesn’t seem to matter because I did a few hundred and they all came in different ways. Some I had previously imported as PDFs, others as JPGs. Some I had clipped, some I had printed to DT. (Yes, I should probably come up with a best practice rather than so many methods, but…).

I had done tons of files this way in Nov/Dec, with no problems at all. But all of the files I have OCRed since then have this same problem.

I don’t know how to figure out which pages were A4 size to begin with, but I’ll try to figure that out.
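
One rough way to sort pages by size in bulk is to read each page's /MediaBox, which PDF stores in points (1/72 inch by default), and compare it with the nominal paper sizes. The sketch below is an assumption-laden shortcut, not a DEVONthink or ABBYY feature: the helper names `classify` and `page_sizes` are made up for this example, and the regular expression only sees /MediaBox entries that are stored uncompressed, so a proper PDF library would be more reliable.

```python
import re

POINTS_PER_INCH = 72.0  # default PDF user-space unit is 1/72 inch

# Nominal portrait sizes in points (width, height).
KNOWN_SIZES = {
    "US Letter": (612.0, 792.0),   # 8.5 x 11 in
    "A4": (595.28, 841.89),        # 210 x 297 mm
    "A3": (841.89, 1190.55),       # 297 x 420 mm
}

NUM = rb"(-?\d+(?:\.\d+)?)"
MEDIABOX_RE = re.compile(
    rb"/MediaBox\s*\[\s*" + NUM + rb"\s+" + NUM + rb"\s+" + NUM + rb"\s+" + NUM + rb"\s*\]"
)


def classify(width_pt, height_pt, tolerance_pt=3.0):
    """Name the paper size, ignoring orientation, or return 'other'."""
    short, long_ = sorted((width_pt, height_pt))
    for name, (w, h) in KNOWN_SIZES.items():
        if abs(short - w) <= tolerance_pt and abs(long_ - h) <= tolerance_pt:
            return name
    return "other"


def page_sizes(pdf_bytes):
    """Yield (width_in, height_in, label) for each uncompressed /MediaBox found."""
    for match in MEDIABOX_RE.finditer(pdf_bytes):
        x0, y0, x1, y1 = (float(v.decode("ascii")) for v in match.groups())
        w, h = abs(x1 - x0), abs(y1 - y0)
        yield (w / POINTS_PER_INCH, h / POINTS_PER_INCH, classify(w, h))


# Made-up fragment: one Letter page and one page of 60.1 x 73.3 inches.
sample = (b"<< /Type /Page /MediaBox [0 0 612 792] >> "
          b"<< /Type /Page /MediaBox [0 0 4327.2 5277.6] >>")
for width_in, height_in, label in page_sizes(sample):
    print(f"{width_in:.1f} x {height_in:.1f} in -> {label}")
```

For a real file, something like `page_sizes(open(path, "rb").read())` would apply the same check, with the caveat above about compressed objects.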

Blanc · May 31, 2021, 8:28pm

I have set Resolution: 150 dpi in the preferences. Perhaps try a different number from the one you are currently using (obviously a lower resolution will create a smaller file, but the question that is actually relevant is whether the page size still changes, which it shouldn’t, of course). Obviously setting 200 isn’t wrong; I’m just trying to figure out why you are experiencing a problem which I am not.

cmedy · May 31, 2021, 8:44pm

OK, I might have figured out the problem. I tried it on two files with the resolution set at 100 – one that I’d previously (as in, more than 6 months ago) OCRed with no trouble and another that had already blown up to an enormous size. On the latter, the page size stayed enormous and the file size increased significantly.

Here is the difference that I can see between the two files – the problematic ones (EVEN after OCRing them the same way) seem to say “Creator: Adobe Acrobat 19.10” and “Producer: macOS Version 11.4 (Build 20F71) Quartz PDFContext.” The others just say “Creator ABBYY FineReader Engine 12.”

Again, that’s what both files say after just OCRing them five minutes ago. Is there some file or plug-in or some step I need to delete to fix this? I already searched for info on “Acrobat 19.10” to no avail.
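
To compare those Creator and Producer values across many files without opening each one, a crude scan of the raw PDF bytes can work. This is only a sketch under stated assumptions: the helper names `read_literal_string` and `info_field` are invented for this example, and it only handles info-dictionary entries stored as plain, uncompressed literal strings. It does not cover hex or UTF-16 strings, encrypted files, XMP metadata, or escape sequences other than escaped characters such as parentheses.

```python
import re


def read_literal_string(data, start):
    """Parse a PDF literal string whose opening '(' is at data[start]."""
    depth = 0
    out = bytearray()
    i = start
    while i < len(data):
        c = data[i:i + 1]
        if c == b"\\" and i + 1 < len(data):
            # Simplification: keep the escaped byte as-is, drop the backslash.
            out += data[i + 1:i + 2]
            i += 2
            continue
        if c == b"(":
            depth += 1
            if depth > 1:          # nested parentheses are part of the text
                out += c
        elif c == b")":
            depth -= 1
            if depth == 0:         # the outermost ')' ends the string
                return bytes(out)
            out += c
        else:
            out += c
        i += 1
    raise ValueError("unterminated literal string")


def info_field(pdf_bytes, key):
    """Return the first '/key (...)' value found, or None if absent."""
    pattern = rb"/" + re.escape(key.encode("ascii")) + rb"\s*\("
    match = re.search(pattern, pdf_bytes)
    if match is None:
        return None
    return read_literal_string(pdf_bytes, match.end() - 1).decode("latin-1")


# Made-up fragment using the two values quoted in the post above.
sample = (b"<< /Creator (Adobe Acrobat 19.10) "
          b"/Producer (macOS Version 11.4 \\(Build 20F71\\) Quartz PDFContext) >>")
print(info_field(sample, "Creator"))
print(info_field(sample, "Producer"))
```

The parenthesis tracking matters here because the Producer text itself contains parentheses, which a simple "up to the next closing bracket" match would cut short.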

Blanc · May 31, 2021, 9:01pm

How are you performing OCR (i.e. which steps are you taking to OCR a file)?

cmedy · May 31, 2021, 9:06pm

Data >> OCR >> to searchable PDF.

Blanc · May 31, 2021, 9:11pm

Oh well, I was kind of hoping you were doing “activate script which sends file to Adobe Cloud through some obscure mechanism”. Do you have Adobe Acrobat on your Mac? And can you maybe keep an eye open to see whether the files are marked as “Creator: Adobe Acrobat 19.10” and “Producer: macOS Version 11.4 (Build 20F71) Quartz PDFContext.” before you first OCR them? Are they from the same source maybe, or otherwise similar?

aedwards · June 1, 2021, 7:40am

Could you send me a copy of the original file and the OCR’d file and I will look to see what is causing the issue.

Could you also turn on OCR logging? To do this:

1. Quit DEVONthink 3.
2. In Finder, select the menu Go > Go to Folder, paste the line below, and press Go:
   ~/Library/Application Support/DEVONthink 3/Abbyy
3. Copy the file OCR.plist (274 Bytes) to this folder.
4. OCR a document and, if the sizing issue occurs, send a copy of the OCRLog.txt file that will have been created in the Abbyy folder.

cmedy · June 1, 2021, 2:07pm

Hey. I can’t seem to recreate the problem, and I don’t have the original files from those that I converted between February and April because when I use DT3 to convert to OCR it replaces the original file. I will email you some of the files that are problematic soon.

(I thought I’d recreated the problem yesterday but I was wrong, which I can explain later but I’m rushing now.)

It looks like if I reimport the original files as JPGs and then convert them via OCR it works fine now, so again, I don’t know what happened in there or why but I’ll send you the files.

And I’ll put on OCR logging so that I have that if I have trouble in the future. Thank you!

cmedy · June 1, 2021, 5:53pm

I was able to recreate the problem after all, and I emailed you a link to the files (because the OCRed files were huge, 75-165 MB, though they had started out at <2 MB). I sent the text file, too. Thank you!

BLUEFROG · June 1, 2021, 9:11pm

Where did the original 2MB file originate?
If it was downloaded from somewhere, is it publicly available for download?

cmedy · June 1, 2021, 9:42pm

It was a jpg image that I took with my iPhone and then converted to a pdf using Acrobat Pro DC.

aedwards · June 2, 2021, 8:39am

Thanks for sending a copy of the files. I can see that both OCR’d files are large in both file size and page dimensions. I have tried various options when OCRing the original file but haven’t been able to reproduce the issue; each of the generated files has the correct dimensions.

These are the OCR settings I used. Do you have anything different, apart from Move to Trash?

cmedy · June 2, 2021, 11:09am

Thank you. Those are the same settings I used. Did the text file offer any insight?

aedwards · June 2, 2021, 11:35am

Unfortunately the log file didn’t show any errors, so I’m just trying to track down whether the error is in ABBYY’s generation of the PDF or at a point prior to that.