Detect PDFs with garbage instead of text

rfog · July 4, 2020, 10:25am

Some third-party PDFs look fine visually, with good font rendering, but the actual text layer is garbage. Some web scraping tools produce the same result, as do some Print-to-PDF tools.

DT/DTTG marks those documents as “PDF+Text”, but they really contain nonsense. I’m not referring to the occasional OCR error, which is normal; I’m referring to complete gibberish.

Is there any way to detect those PDFs with a smart rule/folder/search? One way is to select List of Words in the Inspectors and go through the PDFs one by one, looking at the list. If you don’t see real words, that is one of the affected PDFs. However, I wonder if there is an automated way to do it and locate all such documents at once.

rfog · July 4, 2020, 10:29am

Right after posting here, I found the reason: DRM. If a PDF has DRM or some kind of encryption, it is not indexed completely, only some parts…

Any way to see if a PDF has DRM?
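
One rough way to check this outside DT, sketched in Python: a PDF that uses the standard PDF encryption normally carries an `/Encrypt` key in its trailer (or cross-reference stream dictionary), so scanning the raw bytes for that key gives a quick hint. The function names here are made up for illustration, and this is only a heuristic: it will not notice DRM schemes that do not use PDF encryption, and it can give a false positive if those bytes happen to appear elsewhere in the file.

```python
import re

# Assumption: an encrypted PDF references its encryption dictionary with an
# /Encrypt key. The \b keeps /EncryptMetadata from matching on its own.
ENCRYPT_KEY = re.compile(rb"/Encrypt\b")


def looks_encrypted(pdf_bytes: bytes) -> bool:
    """Heuristic: True if the raw PDF bytes contain an /Encrypt key."""
    return ENCRYPT_KEY.search(pdf_bytes) is not None


def file_looks_encrypted(path: str) -> bool:
    """Read a file from disk and apply the same heuristic."""
    with open(path, "rb") as fh:
        return looks_encrypted(fh.read())
```

Running `file_looks_encrypted` over a folder of exported PDFs would give a first list of candidates to inspect by hand.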

anon6914418 · July 4, 2020, 10:35am

I guess you could use a regex in a SmartRule to search for the existence of frequently occurring words, or for words of a certain (unrealistic) length.

The trick is to think through how your brain separates random characters from actual words, as you apparently can do that.
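
A minimal sketch of that idea in Python, assuming the text of each PDF has already been extracted to a string (how to get it out of DT is not covered here). It combines the two signals mentioned above: the share of very common words and the share of unrealistically long words. The word list, the length cut-off, and the thresholds are all illustrative assumptions, not tested values.

```python
import re

# Hypothetical, deliberately tiny English word list; a real check would use
# a much larger list in the language of the documents.
COMMON_WORDS = {
    "the", "of", "and", "to", "a", "in", "is", "that", "it", "for",
    "with", "as", "on", "this", "be", "are", "or", "by", "not", "from",
}
MAX_WORD_LENGTH = 25  # assumed cut-off for an "unrealistic" word

TOKEN = re.compile(r"[^\W\d_]+")  # runs of letters, in any alphabet


def gibberish_signals(text: str) -> dict:
    """Return the token count plus the ratios of common and overlong words."""
    tokens = [t.lower() for t in TOKEN.findall(text)]
    if not tokens:
        return {"tokens": 0, "common_ratio": 0.0, "long_ratio": 0.0}
    common = sum(1 for t in tokens if t in COMMON_WORDS)
    too_long = sum(1 for t in tokens if len(t) > MAX_WORD_LENGTH)
    return {
        "tokens": len(tokens),
        "common_ratio": common / len(tokens),
        "long_ratio": too_long / len(tokens),
    }


def looks_like_gibberish(text: str, min_tokens: int = 20,
                         min_common_ratio: float = 0.1,
                         max_long_ratio: float = 0.05) -> bool:
    """Flag text with too few common words or too many overlong words."""
    s = gibberish_signals(text)
    if s["tokens"] < min_tokens:
        return False  # too little text to judge either way
    return (s["common_ratio"] < min_common_ratio
            or s["long_ratio"] > max_long_ratio)
```

Real prose in almost any language is dominated by a small set of short function words, so a text layer where they are nearly absent is a strong hint that it is not real text.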

Could you post an example or snippet of text?

rfog · July 5, 2020, 7:05am

First of all, this is not a DT issue but a way to try to do interesting things with DT.

It is a little more complex than that with recent DT versions. I remember seeing a lot of long nonsense words, but now, if a PDF has DRM, DT shows only the non-DRM words that are in the dictionary. I agree with that, since it avoids storing zillions of nonsense words.

I’ve done some experiments with a PDF with DRM (it has a copy restriction and a password). I used a tool to remove the DRM, then ran OCR on it with Abbyy Pro (it refused to OCR the password-protected file), and the result was poorer than the original: fewer recognized words and a very poor visual appearance (I think it is an Abbyy anti-DRM protection feature: it does its work, but very badly).

“Mi gozo en un pozo” (roughly, “my joy down a well”), as we say in Spanish.

PS: I’m completely against DRM. I purchased a PDF for my own use. It is supposed to be reference material, but if you cannot search it (or index it), it is worth absolutely nothing. I’m going to contact the publisher and ask them how non-searchable reference material is supposed to be useful… and cancel my subscription and get my money back. And the worst part is that the copies you can find “in the wild” have no DRM and are fully searchable…