Identifying PDF files with bad OCR: can a file have selectable text but a zero word count?

kidwellj · April 18, 2024, 7:07am

I’m working through my library to identify files which have either (a) no OCR text at all or (b) some text which is clearly not aligned with the actual document. The latter is pretty easy to find with a smart search that picks out PDF files which have both a Word Count below a certain level (let’s say 1000 words) and a length above a certain level (let’s say over 5 pages). This gets around the file size issue: many older PDFs are based on underlying TIFF images while newer ones don’t have any raster data, so a 50-page PDF can be either quite small or quite large.

However, I’ve run into a range of PDF files which report a word count of “0” but have clearly had some OCR run on them, as I can select text in the file. Has anyone else run into this kind of behaviour? I could just re-run OCR on those files, but I’m wondering if there’s a bug or some kind of substandard PDF spec implementation that I should be aware of in the background… Happy to share sample files or screenshots if that’s useful.
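
For anyone who wants to apply the same word-count versus page-count rule outside a smart search, here is a minimal sketch in Python. It assumes you have already exported each file's page count and word count somehow; the `PdfStats` type, the `classify` function, and the three labels are hypothetical names made up for illustration, and the thresholds are just the example values from the post (1000 words, 5 pages).

```python
from dataclasses import dataclass


@dataclass
class PdfStats:
    """Hypothetical record holding the two metadata values used by the rule."""
    name: str
    pages: int
    word_count: int


def classify(stats: PdfStats, min_pages: int = 5, max_words: int = 1000) -> str:
    """Label a PDF as 'no-text', 'suspect', or 'ok'.

    Assumption: a zero word count is treated as "no indexed text at all",
    regardless of page count. Otherwise a file is 'suspect' when it is longer
    than min_pages but has fewer than max_words words.
    """
    if stats.word_count == 0:
        return "no-text"
    if stats.pages > min_pages and stats.word_count < max_words:
        return "suspect"
    return "ok"


if __name__ == "__main__":
    # Made-up sample values, purely for demonstration.
    library = [
        PdfStats("scan-without-ocr.pdf", pages=50, word_count=0),
        PdfStats("misaligned-ocr.pdf", pages=50, word_count=300),
        PdfStats("good-ocr.pdf", pages=50, word_count=20000),
        PdfStats("short-note.pdf", pages=3, word_count=200),
    ]
    for item in library:
        print(f"{item.name}: {classify(item)}")
```

Files that come back as "no-text" but still let you select text are the odd cases the question is about.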

cgrunenberg · April 18, 2024, 7:16am

This might just be “Live Text” on modern macOS versions: the text is selectable, but there’s indeed no text layer and therefore no search index.

meowky · April 18, 2024, 1:10pm

Hijacking this thread to ask a related question:

When I highlight some of this “live text”, the highlight shows up in Inspector > Annotations as it should, e.g. Page: 10, Type: Yellow, Content: some text. However, if I close the PDF document and reopen it later, the Content of the highlight is empty. Is this intended behavior?

cgrunenberg · April 18, 2024, 1:43pm

That’s due to the missing text layer; highlights don’t have content of their own.

BLUEFROG · April 18, 2024, 1:48pm

See this blog post…