Script: Perform OCR on PDF and Preserve Table of Contents

If you use DEVONthink’s OCR feature on a PDF that has an existing Table of Contents, you have probably noticed that the TOC is missing in the processed file. It appears this is a bug in the underlying ABBYY FineReader OCR engine that DEVONthink uses.

I’ve been working around this by using other PDF software to perform OCR on any PDFs that had a TOC I wanted to preserve. But I just haven’t gotten the same quality OCR results as using DEVONthink and the ABBYY engine. So I finally set about finding a solution that would let me use DT’s OCR and not lose my TOCs.

This script relies on a Python package called pdf.tocgen. It will perform OCR, export the TOC from the original document, and then import the TOC into the new document. You’ll need to install the pdf.tocgen package following its instructions. Then update the script to point to it (mine is installed in /usr/local/bin).
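If you’re not sure where pip put the pdftocio command, a quick check in Terminal can tell you what to paste into the script’s PDFTOCIO property. This is only a sketch: `find_tool` is a helper name I made up for illustration, and it assumes the install location is on your shell’s PATH, which may not be true for every Python setup.

```shell
#!/bin/sh
# find_tool is a hypothetical helper (not part of pdf.tocgen):
# print the full path of a command, or complain on stderr if it is not on PATH.
find_tool() {
    command -v "$1" || {
        echo "$1 not found on PATH" >&2
        return 1
    }
}

# Look up pdftocio; the printed path is what the PDFTOCIO property should point to.
find_tool pdftocio || echo "Install pdf.tocgen first, or check your Python bin directory."
```

If nothing is found, the command may live in a per-user Python bin directory that Terminal does not search by default.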

I’ve tried this on a few different PDFs with good results, but I have not performed exhaustive testing. The error handling and the robustness to different file types could also be improved. It defaults to a 1 hour (3600 second) timeout when running the OCR, which is plenty of time for the files I work with, but you can modify it as needed.

This script should be non-destructive to your original PDF files, but of course I recommend testing it in a test database or on test documents first, and by using it you assume all risk to your own data.

-- Performs OCR via DEVONthink, then restores any Table of Contents that got stripped out during OCR.

-- This script relies on the pdftocio Python script provided by https://pypi.org/project/pdf.tocgen/.
-- Before running, please ensure that pdftocio is installed (e.g., via: python -m pip install -U pdf.tocgen)

-- Point this to wherever pdftocio is installed
property PDFTOCIO : "/usr/local/bin/pdftocio"
property TIMEOUT_SECS : 3600

tell application id "DNtp"
	repeat with theSourceRecord in selected records
		-- First try to OCR the file, which will create a new file sans TOC
		set theSourcePath to theSourceRecord's path
		with timeout of TIMEOUT_SECS seconds
			set theOCRedRecord to ocr file theSourcePath to theSourceRecord's location group
		end timeout
		delay 5 -- DEVONthink might change the path of the output record on us, so give it a few seconds
		set theOCRedPath to theOCRedRecord's path
		
		try
			-- Extract the TOC from the original (pre-OCR) file
			-- Write the TOC to the OCRed file; this in turn creates a third output file
			set theQuotedSourcePath to quoted form of theSourcePath
			set theQuotedOCRedPath to quoted form of theOCRedPath
			set theQuotedOutputPath to quoted form of (theOCRedPath & ".out")
			tell me to do shell script PDFTOCIO & " -p " & theQuotedSourcePath & " | " & PDFTOCIO & " -o " & theQuotedOutputPath & " " & theQuotedOCRedPath
			
			-- Move the new output file (OCRed and with TOC) over the top of the OCRed file without TOC.
			tell me to do shell script "mv -f " & theQuotedOutputPath & " " & theQuotedOCRedPath
		on error errStr number errorNumber
			if errStr contains "no table of contents found" then
				-- This is expected if the source file had no TOC, in which case there's nothing more to do
				tell me to display notification "No table of contents found in: " & theSourcePath
			else
				error errStr number errorNumber
			end if
		end try
	end repeat
end tell

If you make any improvements to this script, please share!

I will be testing this. Thank you for sharing!

Thank you. I’ve discovered this bug after processing many files. It would be nice if the DTP team added a warning to the OCR function…