It's possible that the PDF documents are not supported by the PDF engine of macOS (as it doesn't support the complete PDF specification). Which PDF version do the documents use?
I see. But where do I find the specification used in the PDF file I am dealing with? Is there a certain section in the info section of a PDF file that tells me which specification is used?
Assuming that you have Acrobat, simply open the document in Acrobat and press ⌘D to see the document properties, where under Description you will find an entry PDF Version, which will look something like 1.1 (Acrobat 2.x).
Version and specification are used synonymously here, afaik.
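If Acrobat isn't at hand, the version is also recorded in the file itself: a PDF starts with a header line of the form %PDF-1.4. The following is only a minimal sketch in Python, and the function name pdf_header_version is made up for illustration. It looks at the header only; a document can override that value with a Version entry in its catalog, which this sketch does not check, so treat the result as a first hint rather than the final word.

```python
import re
from typing import Optional


def pdf_header_version(path: str) -> Optional[str]:
    """Return the version from the %PDF-x.y header, or None if none is found.

    Hypothetical helper for illustration: it reads only the header and
    ignores a possible /Version override in the document catalog.
    """
    # The header normally sits at byte 0, but some writers put junk before
    # it, so search the first 1024 bytes instead of only the first line.
    with open(path, "rb") as handle:
        head = handle.read(1024)
    match = re.search(rb"%PDF-(\d\.\d)", head)
    return match.group(1).decode("ascii") if match else None
```

Calling pdf_header_version on one of the bank statements should give the same number Acrobat shows under PDF Version, unless the file carries a catalog override.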
I don't know of any publication which specifically lists the capabilities of Apple PDFKit, although I would guess it would be compatible with version 1.4, which seems to be what your document is using. I wonder whether the documents are using an extension of some kind. Equally, perhaps the creating software doesn't fully comply with the specification (disclaimer: I say that without any knowledge of that software and not as an accusation or even a fact, but rather as a general consideration). Acrobat also treats document security differently when compared to other software.
Your solution may be to OCR your files when you import them, thus creating a new text layer.
Well, it's a bank statement, so I guess it might have been originally created digitally.
And as I wrote before: The document is perfectly readable and the OCR-Layer is correct once the document is opened in Acrobat.
Isn't there really anyone at @Devontechnologies (@cgrunenberg, @BLUEFROG) who has come across this issue of an OCRed PDF document that isn't searchable in DEVONthink, which is what I originally created this topic for?
I understand that, but I repeat that you may find re-OCRing the documents on import is the simplest solution. You might try it. If you are prepared to part with one of the documents, I expect Criss or Jim might take a look at it. But again, if this is a PDFKit problem, I think you would be better off taking a solution-oriented approach.
Probably. But others may wish to present ideas on this. What I would do to automate this would be something like the following:
1. Use Hazel to detect a new download in a particular folder.
2. If it is a PDF, run some code (Python, AppleScript [never figured that out], or maybe another program on the Mac that can be executed from a shell command line, if there is one) to make a new PDF without OCR (or test this method and see if the OCR created by this method works as you expect).
3. Move the new PDF into the folder in the macOS file system that is designated in DEVONthink as the Global Inbox.
4. Make a rule in DEVONthink that detects the incoming PDF and runs OCR on it (if needed, see item 2) using the engine in DEVONthink. In this rule, move the file to the group you want.
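The "move the new PDF" step could be sketched in Python roughly like this. It is only an illustration under assumptions: the function name move_pdfs_to_inbox and both folder arguments are hypothetical, and you would have to pass the folder that your own DEVONthink setup actually watches as its inbox. It does not touch the OCR layer at all; it only moves files and avoids overwriting anything already waiting in the target folder.

```python
import shutil
from pathlib import Path
from typing import List


def move_pdfs_to_inbox(download_dir: str, inbox_dir: str) -> List[Path]:
    """Move every PDF in download_dir into inbox_dir and return the new paths.

    Both folders are placeholders chosen by the caller; nothing here knows
    where DEVONthink really keeps its inbox.
    """
    source = Path(download_dir)
    target = Path(inbox_dir)
    target.mkdir(parents=True, exist_ok=True)

    moved = []
    for pdf in sorted(source.iterdir()):
        if not pdf.is_file() or pdf.suffix.lower() != ".pdf":
            continue  # skip folders and non-PDF downloads
        destination = target / pdf.name
        counter = 1
        # Never overwrite a file that is already waiting in the inbox.
        while destination.exists():
            destination = target / f"{pdf.stem}-{counter}{pdf.suffix}"
            counter += 1
        shutil.move(str(pdf), str(destination))
        moved.append(destination)
    return moved
```

Hazel (or any other folder watcher) would then only need to run this against the download folder; the smart rule in DEVONthink takes over once the file lands in the inbox.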
All that said, unless there are dozens or hundreds of files like this to handle, I'd probably just use macOS's Preview to open the file, then "Print" as a PDF direct to DEVONthink. Maybe even consolidate all the incoming files into one bigger PDF.
From DT, you'd have to use a script (AppleScript) along these lines:
To print to a PDF, use something like what is described here. Personally, I'd go with the cups solution, because I hate to depend on menu positions, but that's up to you, of course.
Import the new PDF into DT (you have to import before you can OCR!) like so:
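tell application id "DNtp"
	-- replace the placeholder with the POSIX path of the newly printed PDF
	set newPDF to import "<POSIX path to new PDF>"
	-- untested: check the parameters of ocr against the DT dictionary
	set newRecord to ocr newPDF type pdf
end tell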
This is not tested in any way, though. You could specify the target group for the newRecord in ocr and also the name of newPDF in import. For more options, see the DT dictionary in Apple's Script Editor.
There are other options as well, e.g. you could use PDFpen for OCR.
After reading @rmschne's post, I amend my suggestion: You'll probably need the script only to print to a new PDF. Like he said, target a directory that's DT's designated global inbox and have DT do the rest with a smart rule. That's easier than wading through the AppleScript dictionary.