I’m working through my library to identify files that have either (a) no OCR text at all or (b) some text that is clearly not aligned with the actual document. The latter is fairly easy to find with a smart search that picks out PDF files with both a Word Count below a certain level (say 1000 words) and a length above a certain level (say over 5 pages). This gets around the file-size problem: many older PDFs are built on underlying TIFF images, while newer ones have no raster data at all, so a 50-page PDF can be either quite small or quite large. However, I’ve run into a number of PDF files that report a word count of “0” but have clearly had some OCR run on them, as I can select text in the file. Has anyone else run into this kind of behaviour? I could just re-run OCR on those files, but I’m wondering whether there’s a bug or some kind of substandard PDF spec implementation in the background that I should be aware of. Happy to share sample files or screenshots if that’s useful.
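For anyone who wants to apply the same word-count and page-count heuristic outside a smart search, here is a minimal Python sketch. It assumes you have already exported each file's name, word count, and page count by some other means. The `PdfRecord` type and the `classify` and `needs_review` helpers are hypothetical names made up for illustration, not part of any application's API, and the thresholds are simply the example values from the post.

```python
from dataclasses import dataclass


@dataclass
class PdfRecord:
    # Hypothetical record; fill it from whatever metadata export you have.
    name: str
    word_count: int
    page_count: int


def classify(record, max_words=1000, min_pages=5):
    """Sort a PDF into one of three buckets using the heuristic from the post."""
    if record.word_count == 0:
        # Case (a): the reported word count is zero, so no usable text layer.
        return "no text"
    if record.word_count < max_words and record.page_count > min_pages:
        # Case (b): some text, but far too little for the number of pages.
        return "suspect OCR"
    return "ok"


def needs_review(records, max_words=1000, min_pages=5):
    """Return (name, bucket) pairs for every record that is not 'ok'."""
    flagged = []
    for record in records:
        bucket = classify(record, max_words, min_pages)
        if bucket != "ok":
            flagged.append((record.name, bucket))
    return flagged


if __name__ == "__main__":
    library = [
        PdfRecord("scan-1987.pdf", 0, 50),
        PdfRecord("report.pdf", 240, 50),
        PdfRecord("paper.pdf", 8200, 12),
    ]
    for name, bucket in needs_review(library):
        print(f"{name}: {bucket}")
```

Files that land in the "no text" bucket but still let you select text are the ones this thread is about. The sketch only separates them from the low-word-count files and does not try to explain why the text is selectable.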
This might just be “Live Text” on modern macOS versions, in which case there’s indeed no text layer and therefore no search index.
Hijacking this thread to ask a related question:
When I highlight some of this “live text”, the highlight shows up in Inspector > Annotations as it should, e.g. Page: 10, Type: Yellow, Content: some text. However, if I close the PDF document and reopen it later, the Content of the highlight changes to nothing. Is this intended behavior?
That’s due to the missing text layer: highlights don’t have any content of their own.
See this blog post…