I’m attempting to OCR a 532-page (129.6 MB) PDF scan of a book. The PDF is indexed (not imported) in my DT3 database.
After selecting Data > OCR > to Searchable PDF, the activity message at the bottom of the sidebar briefly said “Adding Document,” then switched to “Loading Document.” There was no progress indicator, but Activity Monitor showed DTOCRHelper writing huge amounts of data to disk: roughly 18 GB over about 20 minutes.
At that point I had only 17 GB of free disk space left, so I clicked the X next to the “Loading Document” message, assuming it was a Cancel button that would abort the OCR process. However, DTOCRHelper continued to run and write data even after I quit DT3, until I finally killed the process in Activity Monitor.
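For anyone else stuck with a helper that keeps writing after a cancel, here is a minimal Terminal sketch; the exact process name is an assumption based on what Activity Monitor displayed for me (DTOCRHelper):

```shell
# Send SIGTERM to the helper by exact process name.
# "DTOCRHelper" is an assumption; use whatever name Activity Monitor shows.
pkill -x DTOCRHelper || echo "DTOCRHelper is not running"
# If it ignores SIGTERM, escalate with:  pkill -9 -x DTOCRHelper
```

This is just the command-line equivalent of Force Quit in Activity Monitor, so it carries the same caveat: the helper gets no chance to clean up its temp files.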
I had to hunt for where this data was written, because I didn’t get the space back when I killed the process. It was buried five folders deep under /private, in an ABBYY FineReader Engine folder that contained 468 subfolders of roughly 46 MB each. The ABBYY FineReader Engine folder disappeared after I rebooted, and my disk space was available again.
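If anyone needs to do the same hunt, a rough sketch for locating runaway temp data from Terminal follows. The glob is an assumption: /private/var/folders is macOS’s per-user temporary area, but the ABBYY folder’s exact location may differ by version.

```shell
# List the largest items in the macOS per-user temp area, biggest first
# (sizes in KB). Adjust the glob if the ABBYY folder lives elsewhere.
du -sk /private/var/folders/*/*/T/* 2>/dev/null | sort -rn | head -20
```

Once the offending folder is identified (and the helper process is dead), it can be deleted directly rather than waiting for a reboot to reclaim the space.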
(This was actually the second time I went through this; the first time, I didn’t notice the disappearing disk space until another app warned me that my startup disk was full. Thinking it was a random glitch, I tried again after a reboot restored my disk space, watching Activity Monitor for trouble, with the same results.)
Is this normal behavior for the OCR engine? It doesn’t seem reasonable that over 20 GB of files (even if temporary) should be produced to OCR a 130 MB file. (I’m guessing that each 46 MB folder probably represents output from one page of the PDF.) If it is normal, then I think a warning in the manual, or in-program, about the potential disk space requirements would be in order.
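For what it’s worth, the folders I found do account for the missing space:

```shell
# 468 temp folders at ~46 MB each:
echo "$((468 * 46)) MB"   # 21528 MB, i.e. roughly 21 GB
```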
Aside from that, DTOCRHelper appears to have simply ignored my attempt to abort the process by clicking the X, which seems like a bug. And if the OCR engine just can’t handle a PDF of this size, I again believe an explanation of the program’s limitations belongs in the manual.