Impact of Massive Concordance Exclusion (10M+ words)

Hi everyone,

I am seeking technical insight into the architectural limits of the Concordance and its relationship with the See Also & Classify engine, especially when dealing with large-scale “noise” in the index.

I have a database with a significant amount of Arabic OCR content. Due to poor OCR quality in many files, my Concordance has ballooned to approximately 10 million words/tokens. About 70% of these are gibberish or noise tokens.

To refine the database’s “intelligence,” I have been attempting to Exclude these noise words manually. My goal is to isolate a “clean” glossary of about 150,000 meaningful terms by excluding the remaining millions of garbage tokens.

I would appreciate any “under the hood” perspective on whether this exclusion strategy actually helps the AI “focus” on the clean terms, or if the noise remains a factor in the background calculations.

Best regards,

Faris
MacBook Pro 16″, M1 Pro, 16GB RAM, 1TB SSD, macOS Tahoe 26.4.1
DEVONthink 4.2

This excludes the words only in the Concordance inspectors; it doesn’t affect, e.g., See Also, classifying, or searching.

Thank you for the clarification. Since you mentioned that this exclusion doesn’t impact ‘See Also’ or ‘Classifying,’ it seems my strategy won’t help refine the AI’s intelligence as I hoped.

However, I have a follow-up question regarding performance: Does maintaining an exclusion list of millions of words create any overhead for DEVONthink? Specifically, I’m concerned about whether this could lead to slower response times, increased cache usage, or general sluggishness in the application’s performance.

It would definitely bloat DEVONthink’s preferences, and this might impact both DEVONthink’s and the system’s performance. I would definitely not recommend such a huge number of exclusions.

Then, is there a way to make use of “See Also & Classify” when the Concordance is cluttered with garbled words? Any advice would be helpful in this situation.

The only possibility currently is to exclude items from see also and/or classification.

Another option is to see if you can improve OCR results.
Did you OCR the files in DEVONthink?

Thank you for your advice. Actually, I used an external Python script via Alfred to convert the PDFs into searchable files, as DEVONthink’s built-in OCR engine does not support Arabic.

In this case, since the OCR was done externally, is there a possibility to improve or correct the OCR results from within the app? Or are there specific settings you recommend to help DEVONthink index these searchable PDFs more effectively?

Since excluding words doesn’t improve “See Also” or “Classifying”, is there a possibility to edit or replace the OCR text layer of a PDF within DEVONthink? Or must this ‘cleaning’ happen entirely outside the app before re-indexing the files?

Note: I converted the PDFs into searchable files using an external Python script.

You could do OCR in DEVONthink Pro/Server. However, as there isn’t support for Arabic in the OCR options we have, this wouldn’t prove especially useful.

What python script and what OCR engine did you use?

I would guess probably not. DT doesn’t “speak” Arabic, so it has no way to recognize “good” or “bad” readings.

If it were me, I’d be thinking about ways to get better recognition of the original source materials, and therefore reduce the noise in the text layer that DT is working with.
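
One rough way to put a number on that noise, before and after re-running OCR, is to compare vocabulary statistics of the extracted text. The sketch below is only an illustration: `vocab_stats` is a made-up helper name, and it assumes Poppler's `pdftotext` is installed (for example via Homebrew), which nobody in this thread has mentioned using. It counts total tokens, distinct tokens, and tokens that occur exactly once. OCR garbage tends to show up as a large share of one-off tokens, though rare real words land there too.

```shell
# vocab_stats: read plain text on stdin and print three numbers:
#   total tokens, distinct tokens, tokens that occur exactly once.
# Tokens are split on ASCII whitespace only (LC_ALL=C keeps tr/awk byte-based,
# so Arabic UTF-8 bytes pass through untouched).
vocab_stats() {
    LC_ALL=C tr -s '[:space:]' '\n' | LC_ALL=C awk '
        NF { count[$0]++; total++ }          # skip empty lines, tally each token
        END {
            for (t in count) {
                unique++
                if (count[t] == 1) once++
            }
            printf "%d %d %d\n", total, unique, once
        }'
}

# Usage (assumes Poppler's pdftotext; the file name is a placeholder):
#   pdftotext -enc UTF-8 "some_file_Pro.pdf" - | vocab_stats
```

Running this on the same file processed by two OCR setups gives a quick comparison without importing anything into DT. It splits on whitespace only, so it is not the same tokenization DEVONthink uses for its Concordance.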


Thanks for your reply! The script I use was built with the help of Gemini Pro. It relies on the Python-based ocrmypdf (which uses Tesseract) and is automated with Bash scripts triggered via Alfred.

I mainly use two scripts for my workflow:

  1. Main OCR Script: To get accurate Arabic text and strictly prevent Tesseract from hallucinating numbers or Latin characters.

  2. Flattening Script: Used occasionally when the first script fails due to a pre-existing corrupted OCR layer, stripping it out so the engine can process the PDF fresh as plain images.
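
To check how well the first script's goal holds for a given file, counting the tokens in the extracted text that contain ASCII letters or digits may help. This is a minimal sketch and not part of the two scripts: `latin_token_share` is a hypothetical helper, and it assumes Poppler's `pdftotext` is available to dump the text layer. Since the OCR runs with `ara+eng`, some Latin tokens can be legitimate, so the number is only a hint.

```shell
# latin_token_share: read plain text on stdin and print
#   "<tokens containing ASCII letters or digits> <total tokens>".
# LC_ALL=C makes the character class match ASCII bytes only, so Arabic
# letters (multi-byte UTF-8) never count as Latin.
latin_token_share() {
    LC_ALL=C tr -s '[:space:]' '\n' | LC_ALL=C awk '
        NF {
            total++
            if ($0 ~ /[A-Za-z0-9]/) latin++
        }
        END { printf "%d %d\n", latin, total }'
}

# Usage (assumes Poppler's pdftotext; the file name is a placeholder):
#   pdftotext -enc UTF-8 "some_file_Pro.pdf" - | latin_token_share
```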

Searching inside Arabic PDFs in DEVONthink has become much more reliable using this method. Right now, my only remaining problem is the massive amount of garbled, hallucinated words that are currently cluttering the database’s concordance.
Here are the scripts for reference:

1. Main OCR Script:

# Define the command script path
TMP_SCRIPT="/tmp/alfred_researcher_ocr_devonthink.command"

echo '#!/bin/bash' > "$TMP_SCRIPT"
echo 'export PATH="/usr/local/bin:/opt/homebrew/bin:$PATH"' >> "$TMP_SCRIPT"
echo 'clear' >> "$TMP_SCRIPT"
echo 'echo "🚀 Starting scanning and processing (with bilingual support to prevent text hallucination)..."' >> "$TMP_SCRIPT"

# Create the report file with a new header
echo 'REPORT_FILE="/tmp/ocr_summary_report.txt"' >> "$TMP_SCRIPT"
echo 'echo "Files that failed and require manual processing:" > "$REPORT_FILE"' >> "$TMP_SCRIPT"
echo 'echo "==========================================" >> "$REPORT_FILE"' >> "$TMP_SCRIPT"

for input_pdf in "$@"; do
    dir_name=$(dirname "$input_pdf")
    base_name=$(basename "$input_pdf" .pdf)
    
    # Check if the output file exists and append a numeric sequence to prevent overwriting
    output_pdf="$dir_name/${base_name}_Pro.pdf"
    counter=1
    while [ -f "$output_pdf" ]; do
        output_pdf="$dir_name/${base_name}_Pro_$counter.pdf"
        ((counter++))
    done
    
    # Sanitize file paths
    safe_input=$(printf '%q' "$input_pdf")
    safe_output=$(printf '%q' "$output_pdf")

    echo "echo '-----------------------------------'" >> "$TMP_SCRIPT"
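    # Shell-escape the name so file names containing quotes do not break the generated script
    echo "echo \"📄 Currently processing: \"$(printf '%q' "$base_name")" >> "$TMP_SCRIPT"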

    # 1. Read the "Finder comment"
    echo "COMMENT=\$(osascript -e 'on run {f}' -e 'tell application \"Finder\" to get comment of (POSIX file f as alias)' -e 'end run' $safe_input)" >> "$TMP_SCRIPT"

    # Temporary log file for filtering
    echo "LOG_FILE=\"/tmp/ocr_log.txt\"" >> "$TMP_SCRIPT"

    # 2. Core conversion (using ara+eng to prevent text hallucination, while maintaining robust settings)
    echo "script -q \"\$LOG_FILE\" bash -c \"ocrmypdf -l ara+eng --redo-ocr -O 3 --image-dpi 150 --max-image-mpixels 1000 --output-type pdf $safe_input $safe_output\"" >> "$TMP_SCRIPT"
    
    # --- Hidden smart filtering section ---
    # -E uses extended regex alternation (|), which behaves the same in BSD and GNU grep
    echo "if grep -qiaE \"cannot be mapped|consider using --force-ocr\" \"\$LOG_FILE\"; then" >> "$TMP_SCRIPT"
    echo "    HAS_CORRUPTION=true" >> "$TMP_SCRIPT"
    echo "else" >> "$TMP_SCRIPT"
    echo "    HAS_CORRUPTION=false" >> "$TMP_SCRIPT"
    echo "fi" >> "$TMP_SCRIPT"

    # 3. Decision making process
    echo "if [ -f $safe_output ] && [ \"\$HAS_CORRUPTION\" = false ]; then" >> "$TMP_SCRIPT"
        echo "    if [ -n \"\$COMMENT\" ] && [ \"\$COMMENT\" != \"missing value\" ]; then" >> "$TMP_SCRIPT"
        echo "        osascript -e 'on run {f, c}' -e 'tell application \"Finder\" to set comment of (POSIX file f as alias) to c' -e 'end run' $safe_output \"\$COMMENT\"" >> "$TMP_SCRIPT"
        echo "        echo '📝 Your summary (Finder comment) was successfully transferred to the new file!'" >> "$TMP_SCRIPT"
        echo "    fi" >> "$TMP_SCRIPT"
        echo "    out_base=\$(basename $safe_output)" >> "$TMP_SCRIPT"
        echo "    echo \"✅ Completed! Saved as: \$out_base\"" >> "$TMP_SCRIPT"
    echo "else" >> "$TMP_SCRIPT"
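        # Shell-escaped names keep quotes, $ and backticks in file names from breaking the report lines
        echo "    echo \"❌ File: \"$(printf '%q' "$base_name") >> \"\$REPORT_FILE\"" >> "$TMP_SCRIPT"
        echo "    echo \"Path: \"$safe_input >> \"\$REPORT_FILE\"" >> "$TMP_SCRIPT"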
        echo "    echo \"------------------------------------------\" >> \"\$REPORT_FILE\"" >> "$TMP_SCRIPT"
        echo "    rm -f $safe_output" >> "$TMP_SCRIPT"
        echo "    echo '❌ File creation stopped: contains corrupted pages or watermarks.'" >> "$TMP_SCRIPT"
    echo "fi" >> "$TMP_SCRIPT"
    
    # Clean up the log file for each processed PDF
    echo "rm -f \"\$LOG_FILE\"" >> "$TMP_SCRIPT"
done

echo 'echo "-----------------------------------"' >> "$TMP_SCRIPT"
echo 'echo "✨ All files have been processed."' >> "$TMP_SCRIPT"

# Check if the report contains only the header (two lines)
echo 'if [ $(wc -l < "$REPORT_FILE") -le 2 ]; then' >> "$TMP_SCRIPT"
echo '    echo "🎉 All files processed successfully! No failed files." > "$REPORT_FILE"' >> "$TMP_SCRIPT"
echo 'fi' >> "$TMP_SCRIPT"

# === Display the floating window for the summary ===
echo 'cat << "EOF" > /tmp/show_report.scpt' >> "$TMP_SCRIPT"
echo 'set f to POSIX file "/tmp/ocr_summary_report.txt"' >> "$TMP_SCRIPT"
echo 'set reportText to read f as «class utf8»' >> "$TMP_SCRIPT"
echo 'tell application "Terminal"' >> "$TMP_SCRIPT"
echo '    activate' >> "$TMP_SCRIPT"
echo '    display dialog reportText with title "Failed Files Report" buttons {"OK, Understood"} default button 1' >> "$TMP_SCRIPT"
echo 'end tell' >> "$TMP_SCRIPT"
echo 'EOF' >> "$TMP_SCRIPT"

# Execute the report window script
echo 'osascript /tmp/show_report.scpt' >> "$TMP_SCRIPT"

# Cleanup and exit
echo 'rm -f /tmp/show_report.scpt' >> "$TMP_SCRIPT"
echo 'rm -f "$REPORT_FILE"' >> "$TMP_SCRIPT"
echo "osascript -e 'tell application \"Terminal\" to close front window'" >> "$TMP_SCRIPT"

# Execute the generated script
chmod +x "$TMP_SCRIPT"
open "$TMP_SCRIPT"

2. Flattening Script:

# Define the command script path
TMP_SCRIPT="/tmp/alfred_flatten_pdf.command"

echo '#!/bin/bash' > "$TMP_SCRIPT"
echo 'export PATH="/usr/local/bin:/opt/homebrew/bin:$PATH"' >> "$TMP_SCRIPT"
echo 'clear' >> "$TMP_SCRIPT"
echo 'echo "🚀 Flattening PDF files (converting to pure images to bypass font encodings)..."' >> "$TMP_SCRIPT"

for input_pdf in "$@"; do
    dir_name=$(dirname "$input_pdf")
    base_name=$(basename "$input_pdf" .pdf)
    
    # Check if the output file exists and append a numeric sequence to prevent overwriting
    output_pdf="$dir_name/${base_name}_Flattened.pdf"
    counter=1
    while [ -f "$output_pdf" ]; do
        output_pdf="$dir_name/${base_name}_Flattened_$counter.pdf"
        ((counter++))
    done
    
    # Sanitize file paths
    safe_input=$(printf '%q' "$input_pdf")
    safe_output=$(printf '%q' "$output_pdf")

    echo "echo '-----------------------------------'" >> "$TMP_SCRIPT"
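    # Shell-escape the name so file names containing quotes do not break the generated script
    echo "echo \"📄 Currently processing: \"$(printf '%q' "$base_name")" >> "$TMP_SCRIPT"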
    
    # Core flattening process using Ghostscript
    echo "gs -q -dSAFER -dBATCH -dNOPAUSE -sDEVICE=pdfimage24 -r300 -sOutputFile=$safe_output $safe_input" >> "$TMP_SCRIPT"
    
    # Update completion message to reflect the new file name
    out_base=$(basename "$output_pdf")
    echo "echo '✅ Flattening complete! Saved as: $out_base'" >> "$TMP_SCRIPT"
done

echo 'echo "-----------------------------------"' >> "$TMP_SCRIPT"
echo 'echo "✨ All files have been flattened and are now ready for the OCR script."' >> "$TMP_SCRIPT"
echo 'read -p "Press Enter to close the window..."' >> "$TMP_SCRIPT"
echo "osascript -e 'tell application \"Terminal\" to close front window'" >> "$TMP_SCRIPT"

# Execute the generated script
chmod +x "$TMP_SCRIPT"
open "$TMP_SCRIPT"

I might not be understanding the problem but I like a puzzle so I thought I’d look at it anyway.

Is there not a dedicated app/service that can provide reliable Arabic OCR for PDFs? If the hallucinations are being caused by OCR (that’s what I’ve read the thread to mean?), using software that developers have already refined specifically to deal with this issue would seem like the quickest way to solve the problem. A quick search online suggests a couple of apps are already available, although I don’t speak Arabic so I don’t know what the apps offer.

If you run a dedicated app for this OCR, all you then need to do is script something that will re-do the OCR of your existing files using this app (and set something up to trigger this in future for new files), and you can then index/import your PDFs with a “clean” OCR layer that DT can play with.

Your problem seems to me like a “bad data in = bad data out” issue and instead of trying to fix the “out” side of the problem, you need to address the “in” side by getting the best OCR layer you can for your use case.

(Although I’ve not seen many posts in the forum about Arabic OCR, there are a few historians around dealing with historic manuscripts that don’t OCR well, so this “bad OCR data” issue has come up broadly before and using an OCR service that is used to handling the specifications of your files seems to be the way to go.)


If I understand other posts here correctly, there might be better AI code generators (Claude?).

The problem with the stuff you posted is that it’s badly written and hard to understand. It may work or it may contain errors. But it is so terribly convoluted that it would take (at least me) far too much time to unravel and understand.

What I’d try to do if I were you:

  • Install tesseract locally with the necessary Arabic language support.
  • Run it on a selection of PDFs, say five or ten, outside DT. You can export files from DT to a folder, for example.
  • Create a test database in DT.
  • Import the files processed by tesseract into this database.
  • See what the concordance etc. in this database tells you.

If that is any better than what you have now, we can think about setting up an automated process. With HI 1.0, so with code that humans write, comment, and can maintain.


Thank you for the advice — it makes sense.

I already have Tesseract installed with Arabic support, but I’m not sure about a simple, practical way to test it on 10–15 PDFs outside DEVONthink.

Could you suggest a straightforward step-by-step method to run this test and produce output suitable for re-import into a DEVONthink test database?

My coding experience is very limited, and any guidance would be appreciated.
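
A minimal sketch of such a test run, assuming `ocrmypdf` is on the PATH together with Tesseract's Arabic data (`ara`). The function name `ocr_batch` and the folder names are placeholders, and `--force-ocr` is chosen here so that any existing (possibly corrupted) text layer is replaced rather than kept; that is a judgment call for a clean comparison, not a recommendation for every file.

```shell
# ocr_batch IN_DIR OUT_DIR
# OCR every *.pdf in IN_DIR with Arabic language data and write the result,
# under the same file name, into OUT_DIR. The originals are left untouched.
ocr_batch() {
    local in_dir="$1" out_dir="$2" pdf name
    mkdir -p "$out_dir" || return 1
    for pdf in "$in_dir"/*.pdf; do
        [ -e "$pdf" ] || continue            # the folder contained no PDFs
        name=$(basename "$pdf")
        # --force-ocr rasterizes each page and discards any existing text layer
        if ocrmypdf -l ara --force-ocr "$pdf" "$out_dir/$name"; then
            echo "OK:     $name"
        else
            echo "FAILED: $name"
        fi
    done
}

# Usage (hypothetical folder names):
#   ocr_batch "$HOME/Desktop/ocr-test-in" "$HOME/Desktop/ocr-test-out"
```

The workflow around it would be: export 10 to 15 PDFs from DEVONthink into the input folder, paste the function into Terminal, run the usage line, then import the output folder into a new test database and look at its Concordance.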

Thanks — this is a very helpful way to think about the problem, and I think you are right that the main issue is probably “bad OCR in → bad results out”.

What I am trying to figure out now is whether there is a reliable OCR tool for Arabic that can be used as a first step before DEVONthink, especially for scanned or older documents where the quality is not good, since even Adobe’s OCR does not seem to give satisfactory results for Arabic PDFs so far.

Your comment also makes me think that the solution may not be just adjusting settings, but finding a workflow that already works well for this kind of material.

I would be interested to know if you have seen any tools or approaches that handle Arabic texts like this in a good way.

Yes, that makes sense in principle — improving OCR at the source is the right approach.

I’m now trying to find which OCR tools or workflows can actually work reliably for Arabic before DEVONthink.

I agree with your point; I’m just still looking for what “better recognition” looks like in practice for this type of material.

@BLUEFROG @chrillek

OCR Outside DEVONthink.pdf (4.1 MB)

Interesting finding: Arabic OCR works well, but DEVONthink’s display layer seems to distort the text

I’ve made some progress worth sharing.

I tested an Arabic PDF using Textify (local OCR on Mac, no internet needed). The OCR result is very good — I can copy text from Preview with less than 1% distortion.

However, after importing the same OCR-processed file into DEVONthink, the text becomes heavily corrupted in the Occurrence view (Latin-like broken characters), even though search still works correctly.

To make the test more controlled, I also disabled all OCR-related options in DEVONthink settings to ensure they are not affecting or reprocessing the text in any way.

This suggests a mismatch between the stored index and the display layer inside DEVONthink.

To help clarify the issue, I am attaching the OCR-processed PDF as it was before import into DEVONthink.

Hopefully this can help identify where the transformation issue happens.

This PDF doesn’t seem to have a text layer at all. Is the option Settings > Files > Import > Recognition > Make text in PDF documents searchable enabled or disabled?

I’ll second Christian here, the PDF file you’ve shared hasn’t been OCR’d and doesn’t have a text layer.

If you were using the text recognition in Preview, which it sounds like you were, this is Apple magic and involves in-app processing using whatever secret stuff Apple runs. It’s basically very sophisticated image recognition. It only works in the app while the file is open, and doesn’t save a text layer or alter the underlying file.

A scanned PDF is basically an image file, optionally with a text layer saved alongside it that a computer can “read”. When you create a new PDF, most apps save a text layer automatically. Scans and old PDFs tend not to have a text layer. This means a human can read the image, but the computer can’t interact with it. This is where OCR apps come in. They can “read” the image in the PDF and write a text layer to the PDF file so that it is fully searchable and you can interact with the text.
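
A quick way to check from Terminal whether a file actually carries a text layer is to try extracting text from it. The following is a sketch under assumptions: `has_text_layer` is a hypothetical helper, and it relies on Poppler's `pdftotext`, which is not something DT ships or that has come up in this thread. It only tells you whether any text is stored in the file, not whether that text is any good.

```shell
# has_text_layer FILE
# Succeeds (exit status 0) if pdftotext can extract at least one
# non-whitespace character from FILE, and fails otherwise. It also fails
# if pdftotext is missing or cannot read the file.
has_text_layer() {
    local pdf="$1" chars
    chars=$(pdftotext -enc UTF-8 "$pdf" - 2>/dev/null \
        | LC_ALL=C tr -d '[:space:]' \
        | wc -c \
        | tr -d ' ')
    [ "${chars:-0}" -gt 0 ]
}

# Usage (the file name is a placeholder):
#   if has_text_layer "scan.pdf"; then echo "text layer found"; else echo "image only"; fi
```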

DT Pro can OCR documents, but ABBYY (the software DT uses to do that) isn’t very good at Arabic (you’d mentioned this but I’ve also looked at their list of supported languages).

Out of interest I started looking on Reddit for how others were handling this. Turns out, OCR support for Arabic is generally quite poor (I doubt it’s the only language suffering from this). The following get recommended a couple of times as having fairly reliable OCR for Arabic:

That was about it for options that have been recommended more than once by users. Obviously I’ve not tested any of these, but also I suspect there are more options available if you search on the Arabic web - no reason for Arabic apps to write or get reviewed in English after all!
