Today I imported 20 bank statements as PDF documents.
They are perfectly searchable in Acrobat or Acrobat Reader.
But they are not searchable in either DEVONthink or Preview.app.
Copying the content and pasting it into, say, Pages just renders gibberish.
Is this a macOS problem?
Help highly appreciated.
It’s possible that the PDF documents are not supported by the PDF engine of macOS (as it doesn’t support the complete PDF specification). Which PDF version is used by the documents?
Thanks for getting back on this.
- Where do I find the PDF specification?
- And where in DEVONthink do I find the PDF specification?
Eagerly awaiting word
The PDF specification is the definition of the PDF file format by Adobe (see https://en.wikipedia.org/wiki/PDF).
I see. But where do I find the specification used by the PDF file I am dealing with? Is there a particular field in the info section of a PDF file that states which specification it uses?
Assuming that you have Acrobat, simply open the document in Acrobat and press ⌘D to see the document properties. Under Description you will find an entry called PDF Version, which will look something like 1.1 (Acrobat 2.x).
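If Acrobat isn't at hand, the version can usually be read from the file itself, because a PDF normally begins with a header line such as %PDF-1.4. Here is a minimal Python sketch of that check. The function name pdf_header_version is made up for illustration, and the sketch only looks at the header: a document may declare a later version in its catalog, which this does not inspect.

```python
import re
import sys
from typing import Optional

# The header looks like "%PDF-1.4" and sits at (or very near) the start of the file.
HEADER_RE = re.compile(rb"%PDF-(\d+\.\d+)")


def pdf_header_version(path) -> Optional[str]:
    """Return the version named in the %PDF-x.y header, or None if no header is found.

    Hypothetical helper for illustration; it does not parse the document catalog.
    """
    with open(path, "rb") as fh:
        head = fh.read(1024)  # only the first bytes are needed
    match = HEADER_RE.search(head)
    return match.group(1).decode("ascii") if match else None


if __name__ == "__main__":
    # Usage: python pdf_version.py statement1.pdf statement2.pdf
    for name in sys.argv[1:]:
        print(name, pdf_header_version(name))
```

Running it over all 20 statements would show quickly whether they share one version.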
Thanks @Blanc for your reply.
- But is the version (here: 1.1) identical to the specification?
- Which version/specification does a PDF document have to comply with in order to be fully usable under macOS?
Version and specification are used synonymously here, afaik.
I don’t know of any publication which specifically lists the capabilities of Apple PDFKit - although I would guess it would be compatible with version 1.4, which seems to be what your document is using. I wonder whether the documents are using an extension of some kind. Equally, perhaps the creating software doesn’t fully comply with the specification (disclaimer: I say that without any knowledge of that software and not as an accusation or even a fact, but rather as a general consideration). Acrobat also treats document security differently when compared to other software.
Your solution may be to OCR your files when you import them, thus creating a new text layer.
Thanks for your Feedback …
Well, it’s a bank statement – so, I guess, it might have been originally created digitally.
And as I wrote before: The document is perfectly readable and the OCR-Layer is correct once the document is opened in Acrobat.
Is there really no one @Devontechnologies (@cgrunenberg @BLUEFROG) who has come across the issue I created this topic for: an OCRed PDF document that isn’t searchable in DEVONthink?
Suggestion: “Print” the document to a new PDF. Import into DEVONthink then OCR it.
I understand that - but I repeat that you may find re-OCRing the documents on import is the simplest solution. You might try it. If you are prepared to part with one of the documents, I expect Criss or Jim might take a look at it - but again, if this is a PDFKit problem I think you would be better off taking a solution-oriented approach.
Is there a way of automating this, all in one workflow?
- print to a new PDF
- OCR the PDF
- import it into DEVONthink
Probably. But others may wish to present ideas on this. What I would do to automate it would be something like this:
- use Hazel to detect a new download in a particular folder.
- if it is a PDF, run some code (Python, AppleScript [I never figured that out], or maybe another program on the Mac that can be run from a shell command line, if there is one) to make a new PDF without OCR (or test this method and see whether the OCR it produces works as you expect).
- move the new PDF into the folder in the macOS file system that is designated in DEVONthink as the Global Inbox.
- make a rule in DEVONthink that detects the incoming PDF and runs OCR on it (if needed, see item 2) using the engine in DEVONthink. In this rule, move the file to the group you want.
All that said, unless there are dozens or hundreds of files like this to handle, I’d probably just use Preview to open the file, then “Print” it as a PDF directly to DEVONthink. Maybe even consolidate all the incoming files into one bigger PDF.
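The folder-based steps above could be sketched in Python roughly as follows. This is only an outline under stated assumptions: the names process_downloads and plain_copy are hypothetical, the folder paths are whatever you pass in, and the "make a new PDF" step is a pluggable function that defaults to a plain copy, because which tool should do the re-printing is still an open question here. OCR is left to the DEVONthink rule.

```python
import shutil
from pathlib import Path
from typing import Callable, List


def plain_copy(src: Path, dst: Path) -> None:
    """Placeholder for the 'make a new PDF' step: it only copies the file.

    Swap in a function that really re-prints the PDF once a tool is chosen.
    """
    shutil.copy2(src, dst)


def process_downloads(
    watch_dir,
    inbox_dir,
    rewrite: Callable[[Path, Path], None] = plain_copy,
) -> List[Path]:
    """Pass every PDF in watch_dir through `rewrite` and place the result in inbox_dir.

    Returns the files written. Existing inbox files are never overwritten,
    and the originals are left where they are.
    """
    watch = Path(watch_dir)
    inbox = Path(inbox_dir)
    inbox.mkdir(parents=True, exist_ok=True)
    delivered = []
    for src in sorted(watch.iterdir()):
        # Step 1: only look at PDF files
        if not src.is_file() or src.suffix.lower() != ".pdf":
            continue
        dst = inbox / src.name
        if dst.exists():
            continue  # already delivered on an earlier run
        # Steps 2 and 3: create the new PDF directly inside the inbox folder
        rewrite(src, dst)
        delivered.append(dst)
    return delivered
```

Hazel (or any scheduler) would call process_downloads with the download folder and the folder DEVONthink uses as its Global Inbox. Running it twice is harmless, since files already present in the inbox are skipped.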
From DT, you’d have to use a script (AppleScript) along these lines:
- to print to a PDF, use something like what is described here. Personally, I’d go with the CUPS solution, because I hate to depend on menu positions, but that’s up to you, of course.
- import the new PDF into DT (you have to import before you can OCR!) like so:
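tell application id "DNtp"
	-- Import first; replace the placeholder with the real POSIX path
	set newPDF to import "<Posix Path to new pdf>"
	-- Untested: check the exact form of the ocr command in the DT dictionary
	set newRecord to ocr newPDF type pdf
end tell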
This is not tested in any way, though. You could specify the target group for the ocr and also the import. For more options, see the DT dictionary in Apple’s Script Editor.
There are other options as well, e.g. you could use PDFpen for OCR.
After reading @rmschne’s post, I amend my suggestion: You’ll probably need the script only to print to a new PDF. Like he said, target a directory that’s DT’s designated global inbox and have DT do the rest with a smart rule. That’s easier than wading through the AppleScript dictionary.