If you use DEVONthink’s OCR feature on a PDF that has an existing Table of Contents (TOC), you have probably noticed that the TOC is missing from the processed file. This appears to be a bug in the underlying ABBYY FineReader OCR engine that DEVONthink uses.
I’ve been working around this by using other PDF software to perform OCR on any PDFs that had a TOC I wanted to preserve. But the OCR results have never been as good as what I get from DEVONthink and the ABBYY engine. So I finally set about finding a solution that would let me use DT’s OCR without losing my TOCs.
This script relies on a Python package called pdf.tocgen, which provides the pdftocio command-line tool. The script performs OCR in DEVONthink, uses pdftocio to export the TOC from the original document, and then imports that TOC into the new OCRed document. You’ll need to install the pdf.tocgen package following its instructions, then update the PDFTOCIO property at the top of the script to point to wherever pdftocio was installed (mine is in /usr/local/bin).
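To find the path to put in the script, a quick check in Terminal can help. This is a minimal sketch: the `find_tool` helper name is made up for illustration, and it only searches the PATH of your interactive shell. AppleScript's `do shell script` runs with a much shorter default PATH, which is why the script uses an absolute path rather than a bare `pdftocio`.

```shell
#!/bin/sh
# Hypothetical helper: print where the shell finds a command, or fail if it is not on PATH.
find_tool() {
    command -v "$1" 2>/dev/null || {
        echo "$1 not found on PATH" >&2
        return 1
    }
}

# Print the path to paste into the PDFTOCIO property (e.g. /usr/local/bin/pdftocio).
find_tool pdftocio || echo "Install it first: python -m pip install -U pdf.tocgen"
```

If this prints a path, copy it into the PDFTOCIO property as-is.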
I’ve tried this on a few different PDFs with good results, but I have not performed exhaustive testing. The error handling and the robustness to different file types could also be improved. The script defaults to a one-hour (3600-second) timeout when running the OCR. That is plenty of time for the files I work with, but you can change TIMEOUT_SECS as needed.
This script should be non-destructive to your original PDF files, but of course I recommend testing it in a test database or on test documents first, and by using it you assume all risk to your own data.
-- Performs OCR via DEVONthink, then restores any Table of Contents that got stripped out during OCR.
-- This script relies on the pdftocio Python script provided by https://pypi.org/project/pdf.tocgen/.
-- Before running, please ensure that pdftocio is installed (e.g., via: python -m pip install -U pdf.tocgen)
-- Point this to wherever pdftocio is installed
property PDFTOCIO : "/usr/local/bin/pdftocio"
property TIMEOUT_SECS : 3600
tell application id "DNtp"
    repeat with theSourceRecord in selected records
        -- First try to OCR the file, which will create a new file sans TOC
        set theSourcePath to theSourceRecord's path
        with timeout of TIMEOUT_SECS seconds
            set theOCRedRecord to ocr file theSourcePath to theSourceRecord's location group
        end timeout
        delay 5 -- DEVONthink might change the path of the output record on us, so give it a few seconds
        set theOCRedPath to theOCRedRecord's path
        try
            -- Extract the TOC from the original (pre-OCR) file
            -- Write the TOC to the OCRed file; this in turn creates a third output file
            set theQuotedSourcePath to quoted form of theSourcePath
            set theQuotedOCRedPath to quoted form of theOCRedPath
            set theQuotedOutputPath to quoted form of (theOCRedPath & ".out")
            tell me to do shell script PDFTOCIO & " -p " & theQuotedSourcePath & " | " & PDFTOCIO & " -o " & theQuotedOutputPath & " " & theQuotedOCRedPath
            -- Move the new output file (OCRed and with TOC) over the top of the OCRed file without TOC.
            tell me to do shell script "mv -f " & theQuotedOutputPath & " " & theQuotedOCRedPath
        on error errStr number errorNumber
            if errStr contains "no table of contents found" then
                -- This is expected if the source file had no TOC, in which case there's nothing more to do
                tell me to display notification "No table of contents found in: " & theSourcePath
            else
                error errStr number errorNumber
            end if
        end try
    end repeat
end tell
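The two `do shell script` calls in the script amount to a short shell routine, which can be handy for trying the TOC transfer by hand on a single pair of files before running it on a database. This is only a sketch under a few assumptions: the `restore_toc` function name and the `PDFTOCIO` environment override are illustrative, the file names in the usage comment are placeholders, and it goes through a temporary file instead of a pipe so that a failed TOC export stops the run before anything is overwritten.

```shell
#!/bin/sh
# Path to pdftocio; defaults to the same location the AppleScript uses.
PDFTOCIO="${PDFTOCIO:-/usr/local/bin/pdftocio}"

# restore_toc SOURCE_PDF OCRED_PDF
# Copies the TOC from SOURCE_PDF into OCRED_PDF, mirroring the script's shell steps.
restore_toc() {
    src=$1
    ocred=$2
    toc=$(mktemp) || return 1

    # Step 1: dump the TOC of the original PDF (the `pdftocio -p` half of the pipeline).
    if ! "$PDFTOCIO" -p "$src" > "$toc"; then
        rm -f "$toc"
        return 1
    fi

    # Step 2: write that TOC into a new copy of the OCRed PDF, then move the
    # copy over the OCRed file, as the script does with `mv -f`.
    if "$PDFTOCIO" -o "$ocred.out" "$ocred" < "$toc"; then
        mv -f "$ocred.out" "$ocred"
        status=0
    else
        status=1
    fi

    rm -f "$toc"
    return $status
}

# Usage (placeholder file names):
#   restore_toc original.pdf ocred.pdf
```

As in the AppleScript, the source PDF is only read; the OCRed file is the one that gets replaced.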
If you make any improvements to this script, please share!