DT4 – PDF+Text search oddity

Not sure whether there’s a bug at the bottom of this, but I had an odd experience just now: purchased an e-book on Google Play, stripped the DRM (via the Adobe Digital Editions + Calibre with DeDRM route) to make the PDF accessible to and searchable in DT, and ran a search on terms that occurred multiple times in the text, but got one or zero hits. If I scrolled to a page the term appeared on, that occurrence showed as a hit, but no others.

I then ran the PDF through DT’s ABBYY OCR, and all the occurrences showed, and also the original version was still visible in DT for comparison. (If this is an intentional change in DT4, I heartily approve.) The original continues to show only one or zero hits, but the re-OCR’d version is fine (three times the size even after post-OCR compression, but that’s how it often is). I wondered whether it was an Apple Vision thing, but both versions show in DT as PDF+Text.

Does any of this sound like anything that makes sense to anyone here? For DT team use I could send the two files (not publicly, obviously, as the DRM removal was strictly for personal use) if that might be useful in figuring out whether there’s a bug or something weirder here.

Stupid question: Are you sure the original PDF contains a text layer? And I mean a real text layer, not something created on the spot by Apple Vision. Should be easy to see by converting the e-book to text within DT. Or by opening the PDF in Preview and seeing what that finds.

I’ve seen that, too, with a PDF that was definitely never OCR’d because I’d forgotten to install Abbyy OCR with DT4.

Thanks – not a stupid question at all! That was my first thought, but I was able to select text in Preview. On the other hand, I’ve now opened it in NitroPDF, and that was unable to find an OCR layer (or at least offered to create one); and conversion to text (thanks for that suggestion, which I hadn’t thought of!) failed. So I think Preview must have been applying Vision to make the text selectable. Normally Play Books purchases come with a text layer, but I couldn’t swear that that’s always been the case, and it does look as though what’s happening here is that the PDF came without one and DT has labelled it as PDF+Text because Vision can process it.

If that’s the case, it would be great if there was a way to distinguish between an imported/indexed PDF with an existing text layer and one dependent on Vision for (reduced) searchability. But so far this is a one-off, so I’m not sure I’ve decisively nailed what’s going on here.

That’s what it does nowadays. Muddying things up.

The word count property should help.

Was OCR necessary? I mean, did your DRM removal process produce a raster PDF?

Also, DEVONthink 4 indexes PDFs without a text layer using a method similar to Apple’s Live Text. That does not mean OCR has been superseded or is unnecessary. It does mean previously unsearchable PDFs should now be searchable without explicitly doing OCR.

Thanks! Yes, I’d initially thought it was a vector PDF because I could select text in Preview, but it seems as if Vision has made that an unreliable indicator of anything; on zooming it does appear that the text is bitmapped. The thing that puzzles me, on a first encounter with how DT4 handles such PDFs, is that search behaves oddly: it only finds results on pages that have been viewed since the PDF was last opened in the (DT) Preview pane, which suggests that the contents have not been indexed.

I’m still wrapping my head round how all this works. Is there an at-a-glance way to tell in DT4 whether a newly imported or indexed PDF is searchable or not? @chrillek usefully suggests word count, which you’d think would work, but my original shows 37,343 words and the ABBYY-OCR’d version 79,091 words. We probably need other users to report on any parallel experiences to figure out what’s happening in this case. Or maybe I should just go shopping again for Play Books.

I refer to the Concordance most often as it also provides a view of the actual indexed text.

Ha, curiouser and curiouser! 6598 unique words in the original, 9389 in the OCR’d version. If I search for a common word (e.g. “when”) in the original, I get zero hits against a Concordance frequency of 159, unless a page containing that word is showing in the Preview pane; the OCR’d version finds 270 occurrences (Concordance frequency 271).
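
For scale, the figures quoted in this thread can be turned into rough ratios. This is only a minimal sketch in Python: the names `reported` and `share` are made up for illustration, and the numbers are simply the ones reported in the posts above, not anything measured independently.

```python
# Figures as reported in this thread: (original PDF, ABBYY-OCR'd copy).
reported = {
    "word count": (37_343, 79_091),
    "unique words": (6_598, 9_389),
    'Concordance frequency of "when"': (159, 271),
}


def share(original: int, ocr: int) -> float:
    """Return the original's figure as a percentage of the OCR'd figure."""
    return 100 * original / ocr


for label, (original, ocr) in reported.items():
    pct = share(original, ocr)
    print(f"{label}: {original} vs {ocr} ({pct:.1f}% of the OCR'd figure)")
```

On these numbers the original carries a little under half the total words of the OCR’d copy but about 70% of its unique words, so the gap is not uniform across the vocabulary.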

Criss would have to talk about the deeper technical details, if he chooses to, but I wouldn’t expect parity between OCR and the “Live Text”. It could easily be broken words versus full ones. Only examining the Concordance or a plain text conversion would get closer to a specific answer.
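
One way to do that comparison outside DT is to convert both versions to plain text and compare their word frequencies. The following is a minimal sketch in Python, assuming the two conversions are already available as strings. The function names `concordance` and `compare` are hypothetical, the tokenizer is deliberately crude, and none of this reflects how DEVONthink actually builds its Concordance.

```python
import re
from collections import Counter


def concordance(text: str) -> Counter:
    """Case-insensitive word frequencies (letters only; apostrophes split words)."""
    return Counter(re.findall(r"[^\W\d_]+", text.lower()))


def compare(live_text: str, ocr_text: str) -> dict:
    """Compare two plain-text conversions of the same document."""
    live, ocr = concordance(live_text), concordance(ocr_text)
    # Words the OCR'd text has but the Live Text-style text lacks entirely.
    missing = sorted(w for w in ocr if w not in live)
    # Words present in both, but seen less often in the Live Text-style text.
    undercounted = sorted(w for w in ocr if 0 < live[w] < ocr[w])
    # Words only in the Live Text-style text that sit inside a missing word:
    # a hint at "broken words versus full ones".
    fragments = sorted(
        w for w in live
        if w not in ocr and len(w) >= 2 and any(w in m for m in missing)
    )
    return {"missing": missing, "undercounted": undercounted, "fragments": fragments}


# Tiny made-up sample, not taken from the actual files discussed here.
result = compare(
    "when the docu ment was indexed",
    "when the document was indexed when",
)
print(result)
```

If `fragments` is long, broken words are a likely cause of the differing counts; if `missing` dominates instead, whole passages were probably not recognized at all.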

Vision supports fewer languages and fonts. But it’s able to recognize handwritten notes. Depending on your handwriting, of course :grinning_face: