Impact of Massive Concordance Exclusion (10M+ words)

FarisNajem1 · April 20, 2026, 8:32am

Hi everyone,

I am seeking technical insight into the architectural limits of the Concordance and its relationship with the See Also & Classify engine, especially when dealing with large-scale “noise” in the index.

I have a database with a significant amount of Arabic OCR content. Due to poor OCR quality in many files, my Concordance has ballooned to approximately 10 million words/tokens. About 70% of these are gibberish or noise tokens.

To refine the database’s “intelligence,” I have been attempting to Exclude these noise words manually. My goal is to isolate a “clean” glossary of about 150,000 meaningful terms by excluding the remaining millions of garbage tokens.

I would appreciate any “under the hood” perspective on whether this exclusion strategy actually helps the AI “focus” on the clean terms, or if the noise remains a factor in the background calculations.

Best regards,

Faris
MacBook Pro 16″, M1 Pro, 16GB RAM, 1TB SSD, macOS Tahoe 26.4.1
DEVONthink 4.2

cgrunenberg · April 20, 2026, 8:42am

This only hides the words in the Concordance inspectors. It doesn’t affect See Also, classifying, or searching.

FarisNajem1 · April 20, 2026, 12:33pm

Thank you for the clarification. Since you mentioned that this exclusion doesn’t impact ‘See Also’ or ‘Classifying,’ it seems my strategy won’t help refine the AI’s intelligence as I hoped.

However, I have a follow-up question regarding performance: Does maintaining an exclusion list of millions of words create any overhead for DEVONthink? Specifically, I’m concerned about whether this could lead to slower response times, increased cache usage, or general sluggishness in the application’s performance.

cgrunenberg · April 20, 2026, 12:42pm

It would definitely bloat DEVONthink’s preferences, and this might impact both DEVONthink’s and the system’s performance. I would definitely not recommend such a huge number of exclusions.

FarisNajem1 · April 20, 2026, 2:41pm

Then is there a way to get useful results from “See Also & Classify” when the indexed words (as shown in the Concordance) are this cluttered with noise? Any advice would be helpful in this situation.

cgrunenberg · April 20, 2026, 2:43pm

The only possibility currently is to exclude items from see also and/or classification.

troejgaard · April 20, 2026, 3:04pm

Another option is to see if you can improve OCR results.
Did you OCR the files in DEVONthink?

FarisNajem1 · April 20, 2026, 7:38pm

Thank you for your advice. Actually, I used an external Python script via Alfred to convert the PDFs into searchable files, as DEVONthink’s built-in OCR engine does not support Arabic.

In this case, since the OCR was done externally, is there a possibility to improve or correct the OCR results from within the app? Or are there specific settings you recommend to help DEVONthink index these searchable PDFs more effectively?

FarisNajem1 · April 20, 2026, 7:50pm

Since excluding words doesn’t improve “See Also” or “Classifying”, is there a possibility to edit or replace the OCR text layer of a PDF within DEVONthink? Or must this ‘cleaning’ happen entirely outside the app before re-indexing the files?

Note: I converted the PDFs into searchable files using an external Python script.

BLUEFROG · April 20, 2026, 9:13pm

You could do OCR in DEVONthink Pro/Server. However, as there isn’t support for Arabic in the OCR options we have, this wouldn’t prove especially useful.

What python script and what OCR engine did you use?

kewms · April 21, 2026, 5:33am

I would guess probably not. DT doesn’t “speak” Arabic, so it has no way to recognize “good” or “bad” readings.

If it were me, I’d be thinking about ways to get better recognition of the original source materials, and therefore reduce the noise in the text layer that DT is working with.

FarisNajem1 · April 21, 2026, 5:35am

Thanks for your reply! The script I use was built with the help of Gemini Pro. It relies on the Python-based ocrmypdf (which uses Tesseract) and is automated with Bash scripts triggered via Alfred.

I mainly use two scripts for my workflow:

Main OCR Script: To get accurate Arabic text and strictly prevent Tesseract from hallucinating numbers or Latin characters.
Flattening Script: Used occasionally when the first script fails due to a pre-existing corrupted OCR layer, stripping it out so the engine can process the PDF fresh as plain images.

Searching inside Arabic PDFs in DEVONthink has become much more reliable using this method. Right now, my only remaining problem is the massive amount of garbled, hallucinated words that are currently cluttering the database’s concordance.
Here are the scripts for reference:

1. Main OCR Script:

# Define the command script path
TMP_SCRIPT="/tmp/alfred_researcher_ocr_devonthink.command"

echo '#!/bin/bash' > "$TMP_SCRIPT"
echo 'export PATH="/usr/local/bin:/opt/homebrew/bin:$PATH"' >> "$TMP_SCRIPT"
echo 'clear' >> "$TMP_SCRIPT"
echo 'echo "🚀 Starting scanning and processing (with bilingual support to prevent text hallucination)..."' >> "$TMP_SCRIPT"

# Create the report file with a new header
echo 'REPORT_FILE="/tmp/ocr_summary_report.txt"' >> "$TMP_SCRIPT"
echo 'echo "Files that failed and require manual processing:" > "$REPORT_FILE"' >> "$TMP_SCRIPT"
echo 'echo "==========================================" >> "$REPORT_FILE"' >> "$TMP_SCRIPT"

for input_pdf in "$@"; do
    dir_name=$(dirname "$input_pdf")
    base_name=$(basename "$input_pdf" .pdf)
    
    # Check if the output file exists and append a numeric sequence to prevent overwriting
    output_pdf="$dir_name/${base_name}_Pro.pdf"
    counter=1
    while [ -f "$output_pdf" ]; do
        output_pdf="$dir_name/${base_name}_Pro_$counter.pdf"
        ((counter++))
    done
    
    # Sanitize file paths
    safe_input=$(printf '%q' "$input_pdf")
    safe_output=$(printf '%q' "$output_pdf")

    echo "echo '-----------------------------------'" >> "$TMP_SCRIPT"
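    # %q-quote the name so apostrophes in file names cannot break the generated script
    echo "echo '📄 Currently processing:' $(printf '%q' "$base_name")" >> "$TMP_SCRIPT"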

    # 1. Read the "Finder comment"
    echo "COMMENT=\$(osascript -e 'on run {f}' -e 'tell application \"Finder\" to get comment of (POSIX file f as alias)' -e 'end run' $safe_input)" >> "$TMP_SCRIPT"

    # Temporary log file for filtering
    echo "LOG_FILE=\"/tmp/ocr_log.txt\"" >> "$TMP_SCRIPT"

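    # 2. Core conversion (using ara+eng to prevent text hallucination, while maintaining robust settings)
    # Note: per the ocrmypdf documentation, --image-dpi only applies when the input is an image rather than a PDF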
    echo "script -q \"\$LOG_FILE\" bash -c \"ocrmypdf -l ara+eng --redo-ocr -O 3 --image-dpi 150 --max-image-mpixels 1000 --output-type pdf $safe_input $safe_output\"" >> "$TMP_SCRIPT"
    
    # --- Hidden smart filtering section ---
    echo "if grep -qia \"cannot be mapped\\|consider using --force-ocr\" \"\$LOG_FILE\"; then" >> "$TMP_SCRIPT"
    echo "    HAS_CORRUPTION=true" >> "$TMP_SCRIPT"
    echo "else" >> "$TMP_SCRIPT"
    echo "    HAS_CORRUPTION=false" >> "$TMP_SCRIPT"
    echo "fi" >> "$TMP_SCRIPT"

    # 3. Decision making process
    echo "if [ -f $safe_output ] && [ \"\$HAS_CORRUPTION\" = false ]; then" >> "$TMP_SCRIPT"
    echo "    if [ -n \"\$COMMENT\" ] && [ \"\$COMMENT\" != \"missing value\" ]; then" >> "$TMP_SCRIPT"
    echo "        osascript -e 'on run {f, c}' -e 'tell application \"Finder\" to set comment of (POSIX file f as alias) to c' -e 'end run' $safe_output \"\$COMMENT\"" >> "$TMP_SCRIPT"
    echo "        echo '📝 Your summary (Finder comment) was successfully transferred to the new file!'" >> "$TMP_SCRIPT"
    echo "    fi" >> "$TMP_SCRIPT"
    echo "    out_base=\$(basename $safe_output)" >> "$TMP_SCRIPT"
    echo "    echo \"✅ Completed! Saved as: \$out_base\"" >> "$TMP_SCRIPT"
    echo "else" >> "$TMP_SCRIPT"
    # %q-quote the name and path so quotes or \$ in file names cannot break the generated script
    echo "    echo '❌ File:' $(printf '%q' "$base_name") >> \"\$REPORT_FILE\"" >> "$TMP_SCRIPT"
    echo "    echo 'Path:' $safe_input >> \"\$REPORT_FILE\"" >> "$TMP_SCRIPT"
    echo "    echo \"------------------------------------------\" >> \"\$REPORT_FILE\"" >> "$TMP_SCRIPT"
    echo "    rm -f $safe_output" >> "$TMP_SCRIPT"
    echo "    echo '❌ File creation stopped: contains corrupted pages or watermarks.'" >> "$TMP_SCRIPT"
    echo "fi" >> "$TMP_SCRIPT"
    
    # Clean up the log file for each processed PDF
    echo "rm -f \"\$LOG_FILE\"" >> "$TMP_SCRIPT"
done

echo 'echo "-----------------------------------"' >> "$TMP_SCRIPT"
echo 'echo "✨ All files have been processed."' >> "$TMP_SCRIPT"

# Check if the report contains only the header (two lines)
echo 'if [ $(wc -l < "$REPORT_FILE") -le 2 ]; then' >> "$TMP_SCRIPT"
echo '    echo "🎉 All files processed successfully! No failed files." > "$REPORT_FILE"' >> "$TMP_SCRIPT"
echo 'fi' >> "$TMP_SCRIPT"

# === Display the floating window for the summary ===
echo 'cat << "EOF" > /tmp/show_report.scpt' >> "$TMP_SCRIPT"
echo 'set f to POSIX file "/tmp/ocr_summary_report.txt"' >> "$TMP_SCRIPT"
echo 'set reportText to read f as «class utf8»' >> "$TMP_SCRIPT"
echo 'tell application "Terminal"' >> "$TMP_SCRIPT"
echo '    activate' >> "$TMP_SCRIPT"
echo '    display dialog reportText with title "Failed Files Report" buttons {"OK, Understood"} default button 1' >> "$TMP_SCRIPT"
echo 'end tell' >> "$TMP_SCRIPT"
echo 'EOF' >> "$TMP_SCRIPT"

# Execute the report window script
echo 'osascript /tmp/show_report.scpt' >> "$TMP_SCRIPT"

# Cleanup and exit
echo 'rm -f /tmp/show_report.scpt' >> "$TMP_SCRIPT"
echo 'rm -f "$REPORT_FILE"' >> "$TMP_SCRIPT"
echo "osascript -e 'tell application \"Terminal\" to close front window'" >> "$TMP_SCRIPT"

# Execute the generated script
chmod +x "$TMP_SCRIPT"
open "$TMP_SCRIPT"

2. Flattening Script:

# Define the command script path
TMP_SCRIPT="/tmp/alfred_flatten_pdf.command"

echo '#!/bin/bash' > "$TMP_SCRIPT"
echo 'export PATH="/usr/local/bin:/opt/homebrew/bin:$PATH"' >> "$TMP_SCRIPT"
echo 'clear' >> "$TMP_SCRIPT"
echo 'echo "🚀 Flattening PDF files (converting to pure images to bypass font encodings)..."' >> "$TMP_SCRIPT"

for input_pdf in "$@"; do
    dir_name=$(dirname "$input_pdf")
    base_name=$(basename "$input_pdf" .pdf)
    
    # Check if the output file exists and append a numeric sequence to prevent overwriting
    output_pdf="$dir_name/${base_name}_Flattened.pdf"
    counter=1
    while [ -f "$output_pdf" ]; do
        output_pdf="$dir_name/${base_name}_Flattened_$counter.pdf"
        ((counter++))
    done
    
    # Sanitize file paths
    safe_input=$(printf '%q' "$input_pdf")
    safe_output=$(printf '%q' "$output_pdf")

    echo "echo '-----------------------------------'" >> "$TMP_SCRIPT"
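    # %q-quote the name so apostrophes in file names cannot break the generated script
    echo "echo '📄 Currently processing:' $(printf '%q' "$base_name")" >> "$TMP_SCRIPT"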
    
    # Core flattening process using Ghostscript
    echo "gs -q -dSAFER -dBATCH -dNOPAUSE -sDEVICE=pdfimage24 -r300 -sOutputFile=$safe_output $safe_input" >> "$TMP_SCRIPT"
    
    # Update completion message to reflect the new file name
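    # %q-quote the name so apostrophes in file names cannot break the generated script
    out_base=$(basename "$output_pdf")
    echo "echo '✅ Flattening complete! Saved as:' $(printf '%q' "$out_base")" >> "$TMP_SCRIPT"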
done

echo 'echo "-----------------------------------"' >> "$TMP_SCRIPT"
echo 'echo "✨ All files have been flattened and are now ready for the OCR script."' >> "$TMP_SCRIPT"
echo 'read -p "Press Enter to close the window..."' >> "$TMP_SCRIPT"
echo "osascript -e 'tell application \"Terminal\" to close front window'" >> "$TMP_SCRIPT"

# Execute the generated script
chmod +x "$TMP_SCRIPT"
open "$TMP_SCRIPT"

MsLogica · April 21, 2026, 6:04am

I might not be understanding the problem but I like a puzzle so I thought I’d look at it anyway.

Is there not a dedicated app/service that can provide reliable Arabic OCR for PDFs? If the hallucinations are being caused by OCR (that’s what I’ve read the thread to mean?), using software that developers have already refined specifically to deal with this issue would seem like the quickest way to solve the problem. A quick search online suggests a couple of apps are already available, although I don’t speak Arabic so I don’t know what the apps offer.

If you run a dedicated app for this OCR, all you then need to do is script something that will re-do the OCR of your existing files using this app (and set something up to trigger this in future for new files), and you can then index/import your PDFs with a “clean” OCR layer that DT can play with.

Your problem seems to me like a “bad data in = bad data out” issue and instead of trying to fix the “out” side of the problem, you need to address the “in” side by getting the best OCR layer you can for your use case.

(Although I’ve not seen many posts in the forum about Arabic OCR, there are a few historians around dealing with historic manuscripts that don’t OCR well, so this “bad OCR data” issue has come up broadly before and using an OCR service that is used to handling the specifications of your files seems to be the way to go.)

chrillek · April 21, 2026, 7:13am

If I understand other posts here correctly, there might be better AI code generators (Claude?).

The problem with the stuff you posted is that it’s badly written and hard to understand. It may work or it may contain errors. But it is so terribly convoluted that it would take (at least me) far too much time to unravel and understand.

What I’d try to do if I were you:

Install Tesseract locally with the necessary Arabic language support.
Run it on a selection of PDFs, say five or ten, outside DT. You can export files from DT to a folder, for example.
Create a test database in DT.
Import the files processed by Tesseract into this database.
See what the concordance etc. in this database tells you.
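
The "run it on a selection of PDFs" step above could be scripted roughly like this. It is only a minimal sketch under stated assumptions: ocrmypdf and Tesseract's Arabic data (`ara`) are installed, and the function name `ocr_folder` and the folder paths in the usage comment are made up for illustration. It uses `--force-ocr` so that any earlier text layer is thrown away and each page is recognized from the image alone.

```shell
#!/bin/sh
# Hypothetical helper: OCR every PDF in one folder into another folder.
# Assumes ocrmypdf and the Tesseract "ara" language data are installed.
ocr_folder() {
    in_dir=$1
    out_dir=$2
    mkdir -p "$out_dir" || return 1
    failed=0
    for pdf in "$in_dir"/*.pdf; do
        # Skip the unexpanded pattern when the folder contains no PDFs.
        [ -e "$pdf" ] || continue
        name=$(basename "$pdf")
        # --force-ocr rasterizes each page and discards any existing text layer.
        if ocrmypdf -l ara --force-ocr "$pdf" "$out_dir/$name"; then
            echo "ok: $name"
        else
            echo "failed: $name" >&2
            failed=$((failed + 1))
        fi
    done
    # Succeed only if every file was processed.
    [ "$failed" -eq 0 ]
}

# Example call (folder names are placeholders):
# ocr_folder "$HOME/Desktop/ocr-test-in" "$HOME/Desktop/ocr-test-out"
```

The output folder can then be imported into the test database. Swapping `-l ara` for `-l ara+eng` in a second run would show whether the bilingual setting changes the amount of noise.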

If that is any better than what you have now, we can think about setting up an automated process. With HI 1.0, so with code that humans write, comment, and can maintain.

FarisNajem1 · April 21, 2026, 6:03pm

Thank you for the advice — it makes sense.

I already have Tesseract installed with Arabic support, but I’m not sure about a simple, practical way to test it on 10–15 PDFs outside DEVONthink.

Could you suggest a straightforward step-by-step method to run this test and produce output suitable for re-import into a DEVONthink test database?

My coding experience is very limited, so any guidance would be appreciated.

FarisNajem1 · April 21, 2026, 6:19pm

@MsLogica

Thanks — this is a very helpful way to think about the problem, and I think you are right that the main issue is probably “bad OCR in → bad results out”.

What I am trying to figure out now is whether there is a reliable OCR tool for Arabic that can be used as a first step before DEVONthink, especially for scanned or older documents where the quality is not good, since even Adobe’s OCR does not seem to give satisfactory results for Arabic PDFs so far.

Your comment also makes me think that the solution may not be just adjusting settings, but finding a workflow that already works well for this kind of material.

I would be interested to know if you have seen any tools or approaches that handle Arabic texts like this in a good way.

FarisNajem1 · April 21, 2026, 6:24pm

Yes, that makes sense in principle — improving OCR at the source is the right approach.

I’m now trying to find which OCR tools or workflows can actually work reliably for Arabic before DEVONthink.

So I agree with your point, I’m just still looking for what “better recognition” looks like in practice for this type of material.
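
One crude way to put a number on "better recognition" when comparing OCR tools is to count how many tokens in the extracted text contain ASCII letters or digits, since those are mostly noise in a purely Arabic document. This is a rough sketch, not a real quality metric: the function name `ascii_token_share` is invented here, and the count will be inflated by documents that legitimately contain Latin words or Western numerals.

```shell
#!/bin/sh
# Hypothetical helper: reads plain text on stdin and prints
# "<total tokens> <tokens containing ASCII letters or digits> <percent>".
ascii_token_share() {
    # LC_ALL=C makes tr and awk work on bytes, so the UTF-8 bytes of
    # Arabic letters can never match the ASCII ranges below.
    LC_ALL=C tr -s '[:space:]' '\n' | LC_ALL=C awk '
        NF { total++; if ($0 ~ /[A-Za-z0-9]/) noisy++ }
        END {
            pct = (total > 0) ? int(100 * noisy / total) : 0
            printf "%d %d %d\n", total, noisy, pct
        }'
}

# Example call, assuming Poppler's pdftotext is installed
# (the file name is a placeholder):
# pdftotext some_file_Pro.pdf - | ascii_token_share
```

Running it on the same pages processed by two different OCR tools gives a quick, if imperfect, comparison before anything is imported into DEVONthink.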

FarisNajem1 · April 21, 2026, 7:26pm

@BLUEFROG @chrillek

OCR Outside DEVONthink.pdf (4.1 MB)

Interesting finding: Arabic OCR works well, but DEVONthink’s display layer seems to distort the text

I’ve made some progress worth sharing.

I tested an Arabic PDF using Textify (local OCR on Mac, no internet needed). The OCR result is very good — I can copy text from Preview with less than 1% distortion.

However, after importing the same OCR-processed file into DEVONthink, the text becomes heavily corrupted in the Occurrence view (Latin-like broken characters), even though search still works correctly.

To make the test more controlled, I also disabled all OCR-related options in DEVONthink settings to ensure they are not affecting or reprocessing the text in any way.

This suggests a mismatch between the stored index and the display layer inside DEVONthink.

To help clarify the issue, I have attached the OCR-processed PDF as it was before being imported into DEVONthink.

Hopefully this can help identify where the transformation issue happens.

cgrunenberg · April 22, 2026, 5:30am

This PDF doesn’t seem to have a text layer at all. Is the option Settings > Files > Import > Recognition > Make text in PDF documents searchable enabled or disabled?

MsLogica · April 22, 2026, 6:10am

I’ll second Christian here, the PDF file you’ve shared hasn’t been OCR’d and doesn’t have a text layer.

If you were using the text recognition in Preview, which it sounds like you were, this is Apple magic and involves in-app processing using whatever secret stuff Apple runs. It’s basically very sophisticated image recognition. It only works in the app while the file is open, and doesn’t save a text layer or alter the underlying file.

PDFs are basically an image file, with a text layer optionally also saved that a computer can “read”. When you create a new PDF, most apps save a text layer automatically. Scans and old PDFs tend not to have a text layer. This means a human can read the image, but the computer can’t interact with it. This is where OCR apps come in. They can “read” the image in the PDF and write a text layer to the PDF file so that it is fully searchable and you can interact with the text.

DT Pro can OCR documents, but ABBYY (the software DT uses to do that) isn’t very good at Arabic (you’d mentioned this but I’ve also looked at their list of supported languages).

Out of interest I started looking on Reddit for how others were handling this. Turns out, OCR support for Arabic is generally quite poor (I doubt it’s the only language suffering from this). The following get recommended a couple of times as having fairly reliable OCR for Arabic:

Tesseract (recommended the most)
Chandra (datalab-to/chandra on GitHub): OCR model that handles complex tables, forms, and handwriting with full layout; built specifically to support complex languages and maths
EasyOCR (JaidedAI/EasyOCR on GitHub): ready-to-use OCR with 80+ supported languages and all popular writing scripts including Latin, Chinese, Arabic, Devanagari, and Cyrillic; also built specifically to support non-European languages
https://scribetools.com/ - Developed by an Arabic student for Arabic OCR originally

That was about it for options that have been recommended more than once by users. Obviously I’ve not tested any of these, and I suspect there are more options available if you search on the Arabic web. There’s no reason for Arabic apps to be documented or reviewed in English, after all!