OCR via DTPro creates enormous page sizes?

Lately, and bizarrely, when I OCR a PDF via DT3, I end up with pages that are up to 56 inches wide. The file properties say they are produced via the ABBYY 12 plug-in, and I do own ABBYY 12, so I don’t know if that is the problem or how to get DT to go back to OCRing files without changing the page size. In preferences, I have the OCR resolution set at 200, but the only change I can see is that the files (in properties) now say they were created using the ABBYY FineReader 12 engine rather than 11. I can go in and use the preflight options in Adobe Pro to change the page sizes back to U.S. Letter (8.5 x 11), but that is a time-consuming, tiresome process. Thank you for any help you can provide.

That is interesting; that problem did occur with a previous version of the OCRhelper (see here for example). I have not come across it recently; none of the files I have just checked after recently OCRing them using DT are showing incorrect page sizes. They all show ABBYY FineReader Engine 12 as the creator. (As an aside: all my scans are ISO sizes, e.g. A4, A3.)

Could you pls post the version number of this file: /Users/yourusername/Library/Application Support/DEVONthink 3/Abbyy/DTOCRHelper.app on your Mac?
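
For anyone who would rather check this from a script than from Finder's Get Info, here is a minimal sketch. It assumes the helper is a standard macOS app bundle with a `Contents/Info.plist` that carries the usual `CFBundleShortVersionString` key; that layout is an assumption, not something confirmed in this thread. The function name `bundle_version` is made up for illustration.

```python
import plistlib
from pathlib import Path

# Path quoted in the post above. The Contents/Info.plist layout is the standard
# macOS bundle layout and is assumed here, not confirmed for this helper.
HELPER_APP = Path.home() / "Library/Application Support/DEVONthink 3/Abbyy/DTOCRHelper.app"


def bundle_version(app_path):
    """Return CFBundleShortVersionString from an app bundle, or None if unavailable."""
    info_plist = Path(app_path) / "Contents" / "Info.plist"
    if not info_plist.is_file():
        return None
    with info_plist.open("rb") as handle:
        info = plistlib.load(handle)
    return info.get("CFBundleShortVersionString")


if __name__ == "__main__":
    print(bundle_version(HELPER_APP))
```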

Mine is 1.1.17, dated 09.02.2021.

Thank you. Mine is also 1.1.17, also dated Feb. 9, and coincidentally all of the files that are problematic were created after February. I hadn’t used the database as much and didn’t pick up on what was causing problems until this week, but any file that I OCRed in DT before February is the correct size.

Here’s an example. I had a PDF of a single page that was about 8.5 inches x 11 inches (U.S. letter), and probably about 1 MB. Then I OCRed it. Now it is 60.1 inches by 73.3 inches and 7.5 MB. It also looks far worse than it did before.
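
To spot affected files without opening each one, a rough sketch like the following may help. It scans the raw PDF bytes for `/MediaBox` entries and converts them from PDF points (72 per inch) to inches. This is a crude approach under stated assumptions: it only sees page objects that are stored uncompressed, and it ignores inherited boxes and `/UserUnit` scaling, so a proper PDF library would be more reliable. The name `page_sizes_in_inches` is hypothetical.

```python
import re

POINTS_PER_INCH = 72.0  # PDF user-space units per inch at the default scale

# Matches entries such as "/MediaBox [0 0 612 792]" in uncompressed page objects.
MEDIABOX = re.compile(
    rb"/MediaBox\s*\[\s*([-\d.]+)\s+([-\d.]+)\s+([-\d.]+)\s+([-\d.]+)\s*\]"
)


def page_sizes_in_inches(pdf_bytes):
    """Return (width, height) in inches for every /MediaBox found in the raw bytes."""
    sizes = []
    for match in MEDIABOX.finditer(pdf_bytes):
        x0, y0, x1, y1 = (float(value.decode("ascii")) for value in match.groups())
        width = round(abs(x1 - x0) / POINTS_PER_INCH, 2)
        height = round(abs(y1 - y0) / POINTS_PER_INCH, 2)
        sizes.append((width, height))
    return sizes
```

A U.S. Letter page shows up as `[0 0 612 792]`, while a 60.1 x 73.3 inch page would correspond to roughly `[0 0 4327.2 5277.6]`.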

I think @aedwards does the OCR bit of DEVONthink; this is the first report I have read of this problem recurring, so perhaps Alan will have some questions or input.

How are the documents arriving in DEVONthink? Does the page size get changed regardless of whether you OCR a scan or a document from a different source (e.g. print to PDF)? Does the same happen to ISO defined type pages (e.g. A4 size)?

That’s a different problem to the previous one, if I remember correctly. In the earlier case the file size did not change because, whilst the page size changed, the DPI changed proportionally.
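
The relationship between page size and DPI mentioned here can be sketched with simple arithmetic: the physical size of an image-only page is its pixel dimensions divided by the resolution it is declared at, so the same pixels at a lower DPI give a larger page with no change in file size. The pixel dimensions below are illustrative assumptions, not measurements from the files in this thread.

```python
def page_size_inches(width_px, height_px, dpi):
    """Physical page size of an image rendered at the given resolution."""
    return (width_px / dpi, height_px / dpi)


# A hypothetical 1700 x 2200 pixel scan declared at 200 dpi is U.S. Letter.
print(page_size_inches(1700, 2200, 200))  # (8.5, 11.0)

# The same pixels declared at 72 dpi cover a much larger page.
print(page_size_inches(1700, 2200, 72))

# 4032 x 3024 is a common 12-megapixel phone photo size (an assumption here);
# treated as 72 dpi it would be 56 inches wide. Whether that is what happened
# to the files in this thread is not established.
print(page_size_inches(4032, 3024, 72))  # (56.0, 42.0)
```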

It doesn’t seem to matter because I did a few hundred and they all came in different ways. Some I had previously imported as PDFs, others as JPGs. Some I had clipped, some I had printed to DT. (Yes, I should probably come up with a best practice rather than so many methods, but…).

I had done tons of files this way in Nov/Dec, with no problems at all. But all of the files I have OCRed since then have this same problem.

I don’t know how to figure out which pages were A4 size to begin with, but I’ll try to figure that out.

I have set Resolution: 150 dpi in the preferences. Perhaps try a number other than the one you are currently using (obviously a lower resolution will create a smaller file, but the relevant question is whether the page size still changes, which it shouldn’t, of course). Setting 200 isn’t wrong; I’m just trying to figure out why you are experiencing a problem which I am not.

OK, I might have figured out the problem. I tried it on two files with the resolution set at 100: one that I’d previously (as in, more than 6 months ago) OCRed with no trouble, and another that had already blown up to an enormous size. On the latter, the page size stayed enormous and the file size increased significantly.

Here is the difference that I can see between the two files – the problematic ones (EVEN after OCRing them the same way) seem to say “Creator: Adobe Acrobat 19.10” and “Producer: macOS Version 11.4 (Build 20F71) Quartz PDFContext.” The others just say “Creator ABBYY FineReader Engine 12.”
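
To compare the Creator and Producer fields across many files before and after OCR, a rough sketch along these lines could be used. It searches the raw PDF bytes for literal-string `/Creator` and `/Producer` entries. This is an assumption-laden shortcut: it will miss metadata stored as hex strings or inside compressed object streams, and a real PDF library would be the more robust choice. The function name `creator_and_producer` is invented for illustration.

```python
import re

# Matches literal-string entries such as "/Creator (Adobe Acrobat 19.10)".
# Allows backslash escapes and one level of nested parentheses inside the value.
INFO_ENTRY = re.compile(
    rb"/(Creator|Producer)\s*\(((?:[^()\\]|\\.|\([^()\\]*\))*)\)",
    re.DOTALL,
)


def creator_and_producer(pdf_bytes):
    """Return the first /Creator and /Producer literal strings found in the bytes."""
    found = {}
    for key, value in INFO_ENTRY.findall(pdf_bytes):
        # Undo the PDF escapes for parentheses and backslashes.
        unescaped = re.sub(rb"\\([()\\])", rb"\1", value)
        found.setdefault(key.decode("ascii"), unescaped.decode("latin-1"))
    return found
```

Running it on a file before and after OCR would show whether the Acrobat and Quartz entries were already present before DEVONthink touched the file.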

Again, that’s what both files say after just OCRing them five minutes ago. Is there some file or plug-in or some step I need to delete to fix this? I already searched for info on “Acrobat 19.10” to no avail.

How are you performing OCR (i.e. which steps are you taking to OCR a file)?

Data >> OCR >> to searchable PDF.

Oh well, I was kind of hoping you were doing “activate script which sends file to Adobe Cloud through some obscure mechanism” :stuck_out_tongue_winking_eye: Do you have Adobe Acrobat on your Mac? And can you maybe keep an eye open to see whether the files are marked as “Creator: Adobe Acrobat 19.10” and “Producer: macOS Version 11.4 (Build 20F71) Quartz PDFContext.” before you first OCR them? Are they from the same source maybe, or otherwise similar?

Could you send me a copy of the original file and the OCR’d file and I will look to see what is causing the issue.

Could you also turn on OCR logging? To do this:

  • Quit DEVONthink 3
  • In Finder select the menu Go->Go to Folder, copy and paste the line below and press Go.
    ~/Library/Application Support/DEVONthink 3/Abbyy
  • Copy the file OCR.plist to this folder.
  • OCR a document and if the sizing issue occurs could you send a copy of the OCRLog.txt file that will have been created in the Abbyy folder.
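
Once logging is on, a small sketch like this can confirm that the log was actually written and show its last few lines before sending it. The folder and the `OCRLog.txt` name come from the steps above; the log's encoding and format are unknown, so the code reads it leniently. The helper name `tail_of_log` is hypothetical.

```python
from pathlib import Path

# Folder named in the steps above.
ABBYY_DIR = Path.home() / "Library/Application Support/DEVONthink 3/Abbyy"


def tail_of_log(log_path, max_lines=20):
    """Return the last max_lines lines of the log, or an empty list if it is missing."""
    path = Path(log_path)
    if not path.is_file():
        return []
    # The encoding is an assumption; undecodable bytes are replaced, not fatal.
    text = path.read_text(encoding="utf-8", errors="replace")
    return text.splitlines()[-max_lines:]


if __name__ == "__main__":
    for line in tail_of_log(ABBYY_DIR / "OCRLog.txt"):
        print(line)
```
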

Hey. I can’t seem to recreate the problem, and I don’t have the original files for those I converted between February and April, because when I use DT3 to OCR a file it replaces the original. I will email you some of the problematic files soon.

(I thought I’d recreated the problem yesterday but I was wrong, which I can explain later but I’m rushing now.)

It looks like if I reimport the original files as JPGs and then convert them via OCR it works fine now, so again, I don’t know what happened in there or why but I’ll send you the files.

And I’ll put on OCR logging so that I have that if I have trouble in the future. Thank you!

I was able to recreate the problem after all, and I emailed you a link to the files (because the OCRed files were huge, 75-165 MB, though they had started out under 2 MB). I sent the text file, too. Thank you!

  • Where did the original 2MB file originate?
  • If it was downloaded from somewhere, is it publicly available for download?

It was a jpg image that I took with my iPhone and then converted to a pdf using Acrobat Pro DC.

Thanks for sending a copy of the files. I can see that the OCR’d files are large in both file size and page dimensions. I have tried various options when OCRing the original file but haven’t been able to reproduce the issue; each of the generated files has the correct dimensions.

These are the OCR settings I used. Do you have anything different, apart from Move to Trash?

Thank you. Those are the same settings I used. Did the text file offer any insight?

Unfortunately the log file didn’t show any errors, so I’m just trying to track down whether the error is in ABBYY’s generation of the PDF or at a point prior to that.