Import vs Index: Performance, AI Integration, Search Speed, and Stability for Large Archives

emrullah · July 8, 2025, 1:19pm

Hello,

I’m a new DEVONthink Pro 4 user (MacBook Pro M3, 16 GB RAM), and I will use the application for both long-term personal archiving and AI-assisted document analysis, summarization, and generation.

My document archive has now grown to over 100,000 files (PDF, Word, ePub, etc.) totaling close to 200 GB in size.

I’d like to better understand the real-world technical and performance differences between the two primary approaches DEVONthink offers:

Import (store files inside the database) vs Index (reference external folders)

I’m especially interested in how these approaches differ in terms of performance, AI compatibility, search behavior, long-term data stability, and efficient management at this scale.

⸻

Performance and Speed (100k files – ~200 GB)
• With such a large archive, would importing all documents cause noticeable slowdowns in:
    • Database launch times
    • Searching
    • “See Also” suggestions
    • AI-based tools like summarization or tagging?
• Does DEVONthink load and monitor indexed folders continuously, or only on demand?
• Which approach offers better overall responsiveness and reliability for very large datasets?

⸻

Functionality Differences: Import vs Index
• What functional differences exist between imported and indexed content?
• Are features like OCR, “See Also & Classify”, Smart Rules, AI tagging, etc. fully available for indexed documents?
• How well does DEVONthink track changes to indexed files made outside the app (e.g. via Finder or external editors)?
• What happens if an indexed file or folder is renamed, moved, or deleted outside DEVONthink?

⸻

AI Workflow Integration
• I want to integrate my archive with external AI tools (OpenAI GPT-4, Claude 3, Ollama, LangChain, etc.) for tasks like retrieval-based question answering and content generation.
• Does using indexed files provide a clearer advantage here, since the files remain directly accessible via the filesystem?
• Is accessing imported files from outside (e.g., via Python scripts) significantly more complex or limiting?

⸻

Search Behavior and Boolean Queries
• Are there differences in search speed, precision, or completeness between indexed and imported files?
• Specifically:
    • Full-text search performance
    • Boolean queries using AND, OR, NOT, NEAR
    • Smart Group responsiveness
    • “See Also” and semantic similarity features
• At 100,000+ documents, are these differences more pronounced?
• Does DEVONthink optimize search indexing differently based on import vs index?

⸻

Long-Term Management and Data Stability
• With a .dtBase2 database approaching or exceeding 200 GB, are there increased risks related to:
    • Corruption
    • Slower backups (e.g. with Time Machine)
    • Performance degradation over time?
• Is there a recommended upper size limit for imported databases, beyond which indexing is safer?

⸻

Would a Hybrid Strategy Make Sense?

⸻

I’d really appreciate insights from power users and DEVONtechnologies staff—especially anyone managing high-volume archives with external AI workflows.

Thanks in advance.

chrillek · July 8, 2025, 1:34pm

Perhaps you could investigate all that yourself with a subset of your data, like 50GB.

emrullah · July 8, 2025, 1:41pm

Thanks. Yes, 200 GB is the total, but I will split it into several smaller databases. My main question is about the differences between importing (files stored inside the database) and indexing (files referenced externally).

BLUEFROG · July 8, 2025, 1:53pm

Welcome @emrullah
I would suggest you first read the In & Out > Importing & Indexing section of the built-in Help and manual.

With regard to performance, the number of words, unique words, and total items matters more than the size on disk.
16GB RAM is pretty limited for such a large volume of data so I would definitely recommend creating multiple databases that can be opened and closed as needed.
Indexed files are treated the same as imported ones, in terms of available functionality like tagging, searching, See Also, etc.
If you move files from the indexed location, they will be reported as missing. Files will also be missing if you rename the indexed parent folder in the Finder. It is possible to redirect the path for an indexed parent folder or file, but indexing is ideally used with folders that are fairly static in their name and location.
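As a rough illustration of why moved or renamed files show up as missing, here is a minimal Python sketch that compares two snapshots of a folder tree. It runs entirely outside DEVONthink and is not how the app tracks indexed items; the function names `snapshot` and `diff_snapshots` are made up for this example.

```python
from pathlib import Path


def snapshot(root):
    """Return the set of file paths under root, relative to root."""
    root = Path(root)
    return {p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()}


def diff_snapshots(before, after):
    """Return (missing, added): paths that vanished and paths that are new."""
    return sorted(before - after), sorted(after - before)


# Hypothetical usage: take one snapshot, reorganize files in the Finder,
# take a second snapshot, then compare.
# before = snapshot("/path/to/indexed/folder")
# ... files get moved or renamed ...
# after = snapshot("/path/to/indexed/folder")
# missing, added = diff_snapshots(before, after)
```

A renamed file appears once in `missing` (old path) and once in `added` (new path), which is the situation a path-based reference cannot resolve on its own.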
There is no inherent AI advantage but if you’re going to try to use an AI app like Elephas to process DEVONthink documents, I would recommend that be indexed data instead of getting into the database’s internal structure.
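For the Python-script case raised in the question, a minimal sketch of collecting documents from an indexed folder might look like the following. It assumes the indexed folder is an ordinary directory on disk; the extension list and the name `list_documents` are assumptions for illustration, and nothing here touches DEVONthink or its database package.

```python
from pathlib import Path

# Assumption: extensions of interest, based on the file types mentioned in the thread.
DOC_EXTENSIONS = {".pdf", ".doc", ".docx", ".epub"}


def list_documents(root, extensions=DOC_EXTENSIONS):
    """Yield files under root whose suffix matches, skipping dot-files."""
    root = Path(root)
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        if path.name.startswith("."):
            continue  # skip hidden files such as .DS_Store
        if path.suffix.lower() in extensions:
            yield path


# Hypothetical usage: hand each path to whatever external tool does the processing.
# for doc in list_documents("/path/to/indexed/folder"):
#     print(doc)
```

The script should only read the files. Moving or renaming them from a script would cause the same missing-file problem as doing so in the Finder.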
A larger database isn’t inherently more susceptible to corruption, but it could logically take longer to back up and sync, depending on the situation. Performance would be limited mainly by machine resources, particularly the limited RAM.
A hybrid database can certainly be used but I see no distinct advantage from what you’ve talked about.

cgrunenberg · July 8, 2025, 2:02pm

The only major difference is the location of the files. Everything else is more or less identical. But exporting a database archive (see File > Export > Database Archive… or Scripts > Export > Daily Backup Archives) is a lot faster for indexed items, as the backup does not include the indexed files themselves.