Hello,
I’m a new DEVONthink Pro 4 user (MacBook Pro M3, 16 GB RAM), and I will use the application for both long-term personal archiving and AI-assisted document analysis, summarization, and generation.
My document archive has now grown to over 100,000 files (PDF, Word, ePub, etc.) totaling close to 200 GB in size.
I’d like to better understand the real-world technical and performance differences between the two primary approaches DEVONthink offers:
Import (store files inside the database) vs Index (reference external folders)
I’m especially interested in how these approaches differ in terms of performance, AI compatibility, search behavior, long-term data stability, and efficient management at this scale.
⸻
- Performance and Speed (100k files – ~200 GB)
• With such a large archive, would importing all documents cause noticeable slowdowns in:
  • Database launch times
  • Searching
  • “See Also” suggestions
  • AI-based tools like summarization or tagging?
• Does DEVONthink load and monitor indexed folders continuously, or only on demand?
• Which approach offers better overall responsiveness and reliability for very large datasets?
⸻
- Functionality Differences: Import vs Index
• What functional differences exist between imported and indexed content?
• Are features like OCR, “See Also & Classify”, Smart Rules, AI tagging, etc. fully available for indexed documents?
• How well does DEVONthink track changes to indexed files made outside the app (e.g. via Finder or external editors)?
• What happens if an indexed file or folder is renamed, moved, or deleted outside DEVONthink?
⸻
- AI Workflow Integration
• I want to integrate my archive with external AI tools (OpenAI GPT-4, Claude 3, Ollama, LangChain, etc.) for tasks like retrieval-based question answering and content generation.
• Does using indexed files provide a clearer advantage here, since the files remain directly accessible via the filesystem?
• Is accessing imported files from outside (e.g., via Python scripts) significantly more complex or limiting?
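To make the question concrete, the kind of external access I have in mind looks roughly like the sketch below. It assumes the indexed folder is just an ordinary directory on disk. The `list_documents` helper, the suffix list, and the folder argument are placeholders of my own, not anything DEVONthink provides.

```python
from pathlib import Path

# Assumed set of document types; adjust to whatever the archive contains.
DOC_SUFFIXES = {".pdf", ".docx", ".epub"}


def list_documents(root, suffixes=DOC_SUFFIXES):
    """Return a sorted list of files under `root` whose suffix is in `suffixes`.

    The suffix comparison is case-insensitive, so "Report.PDF" also matches.
    """
    root = Path(root).expanduser()
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in suffixes
    )


if __name__ == "__main__":
    import sys

    # Hypothetical usage: pass the indexed folder as the first argument.
    folder = sys.argv[1] if len(sys.argv) > 1 else "."
    docs = list_documents(folder)
    print(f"{len(docs)} documents found under {folder}")
```

With indexed files, a script like this could hand the resulting paths straight to a text extractor or embedding pipeline. What I don't know is whether something equally simple is workable for imported files.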
⸻
- Search Behavior and Boolean Queries
• Are there differences in search speed, precision, or completeness between indexed and imported files?
• Specifically:
  • Full-text search performance
  • Boolean queries using AND, OR, NOT, NEAR
  • Smart Group responsiveness
  • “See Also” and semantic similarity features
• At 100,000+ documents, are these differences more pronounced?
• Does DEVONthink optimize search indexing differently based on import vs index?
⸻
- Long-Term Management and Data Stability
• With a .dtBase2 database approaching or exceeding 200 GB, are there increased risks related to:
  • Corruption
  • Slower backups (e.g. with Time Machine)
  • Performance degradation over time?
• Is there a recommended upper size limit for imported databases, beyond which indexing is safer?
⸻
- Would a Hybrid Strategy Make Sense?
⸻
I’d really appreciate insights from power users and DEVONtechnologies staff, especially anyone managing high-volume archives with external AI workflows.
Thanks in advance.