Script: Perform OCR on PDF and Preserve Table of Contents

clang · September 16, 2022, 9:59pm

If you use DEVONthink’s OCR feature on a PDF that has an existing Table of Contents, you have probably noticed that the TOC is missing in the processed file. It appears this is a bug in the underlying ABBYY FineReader OCR engine that DEVONthink uses.

I’ve been working around this by using other PDF software to perform OCR on any PDFs that had a TOC I wanted to preserve. But I just haven’t gotten the same quality OCR results as using DEVONthink and the ABBYY engine. So I finally set about finding a solution that would let me use DT’s OCR and not lose my TOCs.

This script relies on a Python package called pdf.tocgen. It will perform OCR, export the TOC from the original document, and then import the TOC into the new document. You’ll need to install the pdf.tocgen package following its instructions. Then update the script to point to it (mine is installed in /usr/local/bin).
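If you are not sure where pdftocio ended up after installation, a quick Python check along these lines may help you find the path to put into the script's PDFTOCIO property. This is only a sketch: `find_tool` is a hypothetical helper, not part of pdf.tocgen, and the extra directories are assumed common install locations that you may need to change. An absolute path matters because AppleScript's `do shell script` runs with a minimal PATH.

```python
import shutil
from typing import Optional, Sequence


def find_tool(
    name: str = "pdftocio",
    extra_dirs: Sequence[str] = ("/usr/local/bin", "/opt/homebrew/bin"),
) -> Optional[str]:
    """Return the full path of an executable called `name`, or None if not found."""
    # First look on the PATH of the current shell session.
    found = shutil.which(name)
    if found:
        return found
    # Then try a few extra directories (assumed locations; adjust for your system).
    for directory in extra_dirs:
        found = shutil.which(name, path=directory)
        if found:
            return found
    return None


if __name__ == "__main__":
    print(find_tool() or "pdftocio not found; is pdf.tocgen installed?")
```

Whatever path it prints is the value to use for PDFTOCIO in the script.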

I’ve tried this on a few different PDFs with good results, but I have not performed exhaustive testing. The error handling and robustness to different file types, etc., could also be improved. It defaults to a 1-hour (3600-second) timeout when running the OCR, which is plenty of time for the files I work with, but you can modify it as needed.

This script should be non-destructive to your original PDF files, but of course I recommend testing it in a test database or on test documents first, and by using it you assume all risk to your own data.

-- Performs OCR via DEVONthink, then restores any Table of Contents that got stripped out during OCR.

-- This script relies on the pdftocio Python script provided by https://pypi.org/project/pdf.tocgen/.
-- Before running, please ensure that pdftocio is installed (e.g., via: python -m pip install -U pdf.tocgen)

-- Point this to wherever pdftocio is installed
property PDFTOCIO : "/usr/local/bin/pdftocio"
property TIMEOUT_SECS : 3600

tell application id "DNtp"
	repeat with theSourceRecord in selected records
		-- First try to OCR the file, which will create a new file sans TOC
		set theSourcePath to theSourceRecord's path
		with timeout of TIMEOUT_SECS seconds
			set theOCRedRecord to ocr file theSourcePath to theSourceRecord's location group
		end timeout
		delay 5 -- DEVONthink might change the path of the output record on us, so give it a few seconds
		set theOCRedPath to theOCRedRecord's path
		
		try
			-- Extract the TOC from the original (pre-OCR) file
			-- Write the TOC to the OCRed file; this in turn creates a third output file
			set theQuotedSourcePath to quoted form of theSourcePath
			set theQuotedOCRedPath to quoted form of theOCRedPath
			set theQuotedOutputPath to quoted form of (theOCRedPath & ".out")
			tell me to do shell script PDFTOCIO & " -p " & theQuotedSourcePath & " | " & PDFTOCIO & " -o " & theQuotedOutputPath & " " & theQuotedOCRedPath
			
			-- Move the new output file (OCRed and with TOC) over the top of the OCRed file without TOC.
			tell me to do shell script "mv -f " & theQuotedOutputPath & " " & theQuotedOCRedPath
		on error errStr number errorNumber
			if errStr contains "no table of contents found" then
				-- This is expected if the source file had no TOC, in which case there's nothing more to do
				tell me to display notification "No table of contents found in: " & theSourcePath
			else
				error errStr number errorNumber
			end if
		end try
	end repeat
end tell
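
To try the TOC transfer step on its own, outside DEVONthink (for example on copies of a source PDF and an already-OCRed PDF), the same two pdftocio calls can be sketched in Python. The `-p` and `-o` flags are taken directly from the script above; the function names `toc_commands` and `transfer_toc` are hypothetical, and the sketch assumes pdftocio reads the TOC from standard input, as the pipe in the script implies.

```python
import os
import subprocess
from typing import List, Tuple

# Same default location as in the AppleScript above; adjust as needed.
PDFTOCIO = "/usr/local/bin/pdftocio"


def toc_commands(
    source_pdf: str, ocred_pdf: str, output_pdf: str, pdftocio: str = PDFTOCIO
) -> Tuple[List[str], List[str]]:
    """Build the export and import commands used by the script's shell pipeline."""
    export_cmd = [pdftocio, "-p", source_pdf]             # print the TOC of the original
    import_cmd = [pdftocio, "-o", output_pdf, ocred_pdf]  # write the TOC into a new file
    return export_cmd, import_cmd


def transfer_toc(source_pdf: str, ocred_pdf: str, pdftocio: str = PDFTOCIO) -> None:
    """Copy the TOC from source_pdf into ocred_pdf, replacing ocred_pdf on success."""
    output_pdf = ocred_pdf + ".out"
    export_cmd, import_cmd = toc_commands(source_pdf, ocred_pdf, output_pdf, pdftocio)
    # check=True raises CalledProcessError if either pdftocio call fails.
    exported = subprocess.run(export_cmd, capture_output=True, check=True)
    subprocess.run(import_cmd, input=exported.stdout, check=True)
    # Equivalent of the script's "mv -f": replace the OCRed file with the new output.
    os.replace(output_pdf, ocred_pdf)
```

Passing the arguments as lists avoids shell quoting, so paths containing spaces need no special handling. Calling `transfer_toc("original.pdf", "ocred.pdf")` on test copies would then mirror what the script does after the OCR step.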

If you make any improvements to this script, please share!

AW2307 · September 17, 2022, 5:06pm

I will be testing this. Thank you for sharing!

SeferTapuach · August 20, 2023, 6:53pm

Thank you. I’ve discovered this bug after processing many files. It would be nice if the DTP team added a warning to the OCR function…

paranoiduser · July 20, 2024, 9:13am

I signed up to thank you! This script works well on my files.

cgrunenberg · July 23, 2024, 10:10am

The latest version of the OCR engine actually retains the table of contents.

porcupine945 · February 25, 2025, 11:10am

I’ve just tried to OCR a PDF with a table of contents but unfortunately it’s stripped the table of contents out. I’ve tried it on two occasions on the same PDF and the same thing happened each time. Do you know if there is still a bug?

cgrunenberg · February 25, 2025, 11:26am

Which version of DEVONthink and macOS do you use? Intel or Apple chip?

porcupine945 · February 25, 2025, 11:48am

DEVONthink 3.9.8. macOS 15.3.1 with an M2 Pro chip.

aedwards · February 25, 2025, 1:02pm

Unfortunately this is a known issue with the ABBYY OCR and we are waiting for them to release an update.

aedwards · February 25, 2025, 2:38pm

Sorry, I forgot we added a workaround that should transfer the table of contents to the new PDF. It is working on the couple of documents I have just tested. Are you using a script or a smart rule? Does it work if you OCR via the Data->OCR menu?

porcupine945 · February 26, 2025, 10:11am

I just used the Data->OCR menu and it didn’t work. How do I use the workaround?

aedwards · February 26, 2025, 10:44am

The workaround is already in the current version of DT. Can you send me a copy of the original PDF, prior to OCR, that still contains the TOC?