I’m attempting to OCR a 532-page (129.6 MB) PDF scan of a book. The PDF is indexed (not imported) in my DT3 database.
After selecting Data > OCR > to Searchable PDF, the activity message at the bottom of the sidebar briefly said “Adding Document,” then switched to “Loading Document.” There was no progress indicator, but Activity Monitor showed DTOCRHelper writing huge amounts of data to disk: roughly 18 GB over about 20 minutes.
At that point I had only 17 GB of free disk space left, so I clicked the X next to the “Loading Document” message, assuming it was a Cancel button that would abort the OCR process. However, DTOCRHelper continued to run and write data even after I quit DT3, until I finally killed the process in Activity Monitor.
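For anyone else stuck with a helper that keeps writing after a cancel, here is a minimal Terminal sketch; the exact process name is an assumption based on what Activity Monitor displayed for me (DTOCRHelper):

```shell
# Send SIGTERM to the helper by exact process name.
# "DTOCRHelper" is an assumption; use whatever name Activity Monitor shows.
pkill -x DTOCRHelper || echo "DTOCRHelper is not running"
# If it ignores SIGTERM, escalate with:  pkill -9 -x DTOCRHelper
```

This is just the command-line equivalent of Force Quit in Activity Monitor, so it carries the same caveat: the helper gets no chance to clean up its temp files.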
I had to hunt for where this data was written, because I didn’t get the space back when I killed the process. It was buried five folders deep under /private, in an ABBYY FineReader Engine folder that contained 468 subfolders of roughly 46 MB each. The ABBYY FineReader Engine folder disappeared after I rebooted, and my disk space was available again.
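If anyone needs to do the same hunt, a rough sketch for locating runaway temp data from Terminal follows. The glob is an assumption: /private/var/folders is macOS’s per-user temporary area, but the ABBYY folder’s exact location may differ by version.

```shell
# List the largest items in the macOS per-user temp area, biggest first
# (sizes in KB). Adjust the glob if the ABBYY folder lives elsewhere.
du -sk /private/var/folders/*/*/T/* 2>/dev/null | sort -rn | head -20
```

Once the offending folder is identified (and the helper process is dead), it can be deleted directly rather than waiting for a reboot to reclaim the space.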
(This was actually the second time I went through this; the first time, I didn’t notice the disappearing disk space until another app warned me that my startup disk was full. Thinking it was a random glitch, I tried again after a reboot restored my disk space, watching Activity Monitor for trouble, with the same results.)
Is this normal behavior for the OCR engine? It doesn’t seem reasonable that over 20 GB of files (even if temporary) should be produced to OCR a 130 MB file. (I’m guessing that each 46 MB folder probably represents output from one page of the PDF.) If it is normal, then I think a warning in the manual, or in-program, about the potential disk space requirements would be in order.
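For what it’s worth, the folders I found do account for the missing space:

```shell
# 468 temp folders at ~46 MB each:
echo "$((468 * 46)) MB"   # 21528 MB, i.e. roughly 21 GB
```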
Aside from that, DTOCRHelper appears to have simply ignored my attempt to abort the process by clicking the X, which seems like a bug. And if the OCR engine just can’t handle a PDF of this size, I again believe an explanation of the program’s limitations belongs in the manual.