Impact of Massive Concordance Exclusion (10M+ words)

chrillek · April 22, 2026, 7:18am

Most probably just the Vision framework. Not very secret, not bad, but far from perfect: it often gets the order of text chunks on the same line wrong.

chrillek · April 22, 2026, 7:23am

If that’s the Textify available on GitHub, it is but an interface to Tesseract. I didn’t have the time to look into it. As @cgrunenberg and @MsLogica said, it may well be that it does not add a text layer to the PDF but just extracts the OCR’d text from the PDF to be used somewhere else.

Check out “Use Tesseract to add a text layer to a PDF via OCR” on GitHub for how to add a text layer to the PDF with Tesseract. Perhaps that gets you a better result.

Alternatively, there is OCRmyPDF: see “Installing OCRmyPDF” in the ocrmypdf 17.4.2 documentation.
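For what it's worth, here is a rough sketch of how an OCRmyPDF run for Arabic could be scripted from Python. It assumes the `ocrmypdf` command is on your PATH and that the Tesseract Arabic language data (`ara`) is installed. The function names and file names are made up for illustration. I have not run this against your files, so treat it as a starting point rather than a recipe.

```python
import shutil
import subprocess
from pathlib import Path


def build_ocrmypdf_command(source: Path, target: Path, language: str = "ara") -> list[str]:
    """Return the argument list for an OCRmyPDF run; nothing is executed here."""
    # -l selects the Tesseract language pack; "ara" is Tesseract's code for Arabic.
    return ["ocrmypdf", "-l", language, str(source), str(target)]


def add_text_layer(source: Path, target: Path, language: str = "ara") -> bool:
    """Run OCRmyPDF if it is installed; return True when it exits successfully."""
    if shutil.which("ocrmypdf") is None:
        # Assumption: the tool is expected on PATH; without it we do nothing.
        return False
    result = subprocess.run(build_ocrmypdf_command(source, target, language))
    return result.returncode == 0
```

Calling `add_text_layer(Path("scan.pdf"), Path("scan-ocr.pdf"))` would write a new file and leave the original untouched, so you can compare the two in DEVONthink before deciding which to keep.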

FarisNajem1 · April 23, 2026, 8:16pm

That’s strange. I can copy correct Arabic text from the file, and that’s what it means to me: that it’s “searchable.”

And yes, it seems I had the option “Settings > Files > Import > Recognition > Make text in PDF documents searchable” enabled. I’ve now disabled it, and I’ll wait for the results in future tests.

Thank you.

FarisNajem1 · April 23, 2026, 8:46pm

Yes, that seems to be what’s actually happening, and I didn’t know about it! I assumed that if a file can be searched in Preview, it is “searchable” in every other application!

Thank you so much for taking the time to search for tools that can help me. I will try them when I have free time over the next two weeks and will report the results here to share the experience.

FarisNajem1 · April 23, 2026, 9:15pm

I will try following the steps in the two links you provided, hoping to achieve the desired results.

Thank you.

FarisNajem1 · April 23, 2026, 10:58pm

I tested this, and after excluding more than 1.5 million words, the preferences file reached 50MB. As you predicted, DEVONthink became very slow and sluggish.

I have now restored my old preferences, and the app is back to its normal speed. I agree: massive exclusion is not the way to go. I will focus on better OCR for my Arabic files instead.