Can Devonsphere search for files which don't contain certain words or phrases?

Marja376 · April 29, 2020, 4:24pm

I often have to work with pdfs. Yes, I know the advice, just avoid pdf. But I don’t have that choice.

Sometimes scanned pdfs are missing text.

Sometimes pdf-born-pdfs have corrupt text, or lose their text after processing in Ghostscript.

So far, my only option is to open each pdf in turn and check the text. I’ve tried variations on wc -w in the command line, but they often report a word count for pdfs which have lost all words. For English-language pdfs, a quick search for pdfs which don’t contain “the” should find any which lack text, or have severely corrupted text. So can Devonsphere search for files which don’t contain a given term?

BLUEFROG · April 29, 2020, 4:39pm

Technically, it’s possible but I don’t suggest it since it will return tons of results, not just PDFs.

Where are you searching and what do you intend to do with the files?

Marja376 · April 29, 2020, 5:14pm

To begin with, I’d like to search my Calibre libraries, and then some other folders with files I’m working with.

I want to ensure these pdfs are searchable, and find and fix the ones which aren’t.

I have an old Kindle so it can’t read newer pdfs. I reprocess scanned pdfs with Willus’s k2pdfopt, and pdf-born-pdfs with Ghostscript with compatibility level 1.4 and sometimes compression options, to create old-Kindle-compatible versions. I can’t use tablets or other touch devices.

I hope to find the pdfs which are missing text so I can either (a) ocr them or (b) if the originals have text but Ghostscript reprocessed ones lose text, run the originals through the Quartz gray filter and then Ghostscript reprocess them again.

I have reported the bug on Ghostscript’s side, but they don’t consider it a bug there. For whatever reason Quartz + Ghostscript avoids that bug in just Ghostscript.

Unfortunately, Quartz alone can’t convert newer pdfs to old-Kindle-compatible versions.

BLUEFROG · April 29, 2020, 5:31pm

cd into the directory of your choice and run this command:

grep -ai -L font "$PWD"/*.pdf

This should only report PDFs with no text layer. You can obviously modify this to suit your needs.
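For anyone unfamiliar with the flags: -a treats binary files as text, -i ignores case, and -L lists only the files with no matching line. A minimal sketch of that behaviour follows, using throwaway stand-in files rather than real PDFs; the file names and contents are made up for illustration.

```shell
# Throwaway stand-ins for PDFs (names and contents are hypothetical).
demo_dir=$(mktemp -d)
printf 'BT /F1 12 Tf (hello) Tj ET\n/Font << /F1 5 0 R >>\n' > "$demo_dir/has-text.pdf"
printf 'image stream only\n' > "$demo_dir/scan-only.pdf"

# -a: read binary as text, -i: ignore case, -L: list files with NO match.
# The subshell keeps the cd from affecting the calling shell.
missing=$(cd "$demo_dir" && grep -ai -L font *.pdf || true)
echo "$missing"
```

Only the file that never mentions a font is listed, which is why this check says nothing about text layers that exist but are corrupt.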

Marja376 · April 29, 2020, 5:37pm

Thank you, but that doesn’t find files with corrupt text layers, e.g. losing everything except punctuation.

BLUEFROG · April 29, 2020, 5:39pm

You can modify it, e.g., by looking for “the”, as you mentioned previously.

Marja376 · April 29, 2020, 6:40pm

Thank you.

grep -aid recurse -L the *.pdf

turns up one affected file and some false positives.

grep -aid recurse -L and *.pdf

one affected file and fewer false positives

but oddly, trying a nor search…

grep -aid recurse -L "and|the" *.pdf

or

grep -aid recurse -L and *.pdf | grep -ai -L the *.pdf

adds more false positives.

P.S. using -E allows the | for OR, and

grep -aiE -d recurse -L -e "and|the" *.pdf

Gets this down to 2 good results and 1 false positive in my test folder.
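A likely explanation for the extra false positives: without -E, grep uses basic regular expressions, where a bare | is a literal character, so "and|the" only matches that exact string and nearly every file gets listed. In the piped version, the second grep is given file names, so it ignores the first grep's output and simply lists files lacking "the". A small sketch with made-up text files (not real PDFs) shows the difference:

```shell
# Three throwaway files: one with "the", one with "and", one with neither.
demo_dir=$(mktemp -d)
printf 'the quick fox\n'   > "$demo_dir/a.txt"
printf 'salt and pepper\n' > "$demo_dir/b.txt"
printf '.,;:!?\n'          > "$demo_dir/c.txt"

# Basic regex: | is literal, nothing matches, so every file is listed.
bre=$(cd "$demo_dir" && grep -ai -L 'and|the' *.txt | tr '\n' ' ')

# Extended regex: | means OR, so only the file with neither word is listed.
ere=$(cd "$demo_dir" && grep -aiE -L 'and|the' *.txt | tr '\n' ' ')

echo "without -E: $bre"
echo "with -E:    $ere"
```

Note that a plain substring match on "the" or "and" will also hit stray bytes inside binary streams, which may account for some of the remaining noise.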

Marja376 · April 29, 2020, 11:50pm

P.S. Better results with pdfgrep -ir -L "the|and" (folder)

BLUEFROG · April 30, 2020, 5:03am

Good to hear but remember pdfgrep is not part of the base installation of macOS.

Marja376 · April 30, 2020, 10:50pm

Yes, I use Homebrew for that.

For a shell script in Automator:

Application, Shell, Bash. Pass input as arguments.

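for f in "$@"
do
suffix="-list.txt"
# base and outputfile are leftovers from other scripts; the redirect below doesn't use them
base=$(basename "$f")
outputfile="$base$suffix"
# List pdfs under the dropped folder that contain neither "the" nor "and"
/usr/local/bin/pdfgrep -ir -L "the|and" "$f" > "$f$suffix"
done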

Drop the appropriate folder onto the app icon, and it should output a list of pdfs with either unreadable text or non-English text. Add common words from other languages you use.

It probably includes some extra cruft since I’ve adapted it from other scripts. If you use another package manager, you may need another path.
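If the hard-coded path is a concern, here is a rough sketch of a variant that looks for pdfgrep before running. The function names (list_name_for, find_pdfgrep, scan_folders) are my own inventions, and the two fallback paths are assumptions about common Homebrew prefixes; adjust them for your setup.

```shell
#!/bin/bash
# Sketch only: same idea as the Automator script, without a fixed pdfgrep path.

# Build the report name beside the dropped folder: "Books/" -> "Books-list.txt".
list_name_for() {
    printf '%s-list.txt' "${1%/}"
}

# Try PATH first, then two common Homebrew locations (assumed, not guaranteed).
find_pdfgrep() {
    command -v pdfgrep && return 0
    for candidate in /usr/local/bin/pdfgrep /opt/homebrew/bin/pdfgrep; do
        if [ -x "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# For each folder, list the pdfs that contain neither "the" nor "and".
scan_folders() {
    pdfgrep_bin=$(find_pdfgrep) || { echo "pdfgrep not found" >&2; return 1; }
    for f in "$@"; do
        "$pdfgrep_bin" -ir -L "the|and" "$f" > "$(list_name_for "$f")"
    done
}

# Automator passes the dropped folders as arguments.
if [ "$#" -gt 0 ]; then
    scan_folders "$@"
fi
```

Stripping a trailing slash in list_name_for keeps the list file next to the folder rather than inside it.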