# Dialogue: Realm Semantic Index **Spike**: [2026-01-24-Realm Semantic Index](../spikes/2026-01-24-Realm%20Semantic%20Index.md) **Goal**: Reach 96% alignment on semantic indexing design **Format**: 12 experts, structured rounds --- ## Open Questions 1. **Storage backend** - SQLite+FTS5, sqlite-vec, or dedicated vector DB? 2. **Update triggers** - Daemon watcher, git hooks, on-demand, or hybrid? 3. **Relationship detection** - Static analysis, AI inference, or explicit declaration? 4. **AI model** - Local (Ollama) vs API for indexing? 5. **Index granularity** - File-level summaries vs symbol-level detail? --- ## Expert Panel | Expert | Domain | Perspective | |--------|--------|-------------| | **Ada** | API Design | Clean interfaces, discoverability | | **Ben** | Developer Experience | Friction, learning curve | | **Carmen** | Systems Architecture | Scalability, performance | | **David** | Search Systems | Retrieval quality, ranking | | **Elena** | Claude Integration | LLM tool use patterns | | **Felix** | Distributed Systems | Consistency, coordination | | **Grace** | Security | Trust boundaries, data sensitivity | | **Hassan** | Product | User workflows, value delivery | | **Iris** | Simplicity | Minimalism, YAGNI | | **James** | Observability | Debugging, transparency | | **Kim** | Testing | Testability, reliability | | **Luna** | AI/ML | Model selection, embedding quality | --- ## Round 1: Initial Positions ### Question 1: Storage Backend **David (Search)**: Vector search is the future, but sqlite-vec is immature. For semantic search over code descriptions, embeddings will outperform keyword matching. But we can layer - FTS5 now, vectors later. **Carmen (Systems)**: Keep the stack simple. We already have blue.db. Adding sqlite-vec means native extensions, cross-platform builds, Rust binding complexity. FTS5 is built-in and good enough for thousands of files. **Iris (Simplicity)**: SQLite+FTS5. Period. We're searching human-readable descriptions, not raw code. Keywords work. The whole point of AI-generated summaries is they use natural language. "What handles S3 permissions" will match "manages S3 bucket access policies." **Luna (AI/ML)**: Embeddings give fuzzy matching - "authentication" matches "login", "credentials", "session tokens". FTS5 won't. But generating embeddings adds latency and storage. Hybrid is ideal: FTS5 for exact, embeddings for semantic fallback. **Ben (DX)**: Whatever doesn't require extra setup. Developers won't install special extensions just to use indexing. FTS5 ships with SQLite. **Kim (Testing)**: FTS5 is deterministic and easy to test. Vector similarity has floating-point fuzziness. Start with testable. **Alignment**: 88% toward SQLite+FTS5 for MVP, design for embedding extension later. ### Question 2: Update Triggers **Felix (Distributed)**: Git hooks are fragile - users disable them, forget to install them, CI doesn't run them. Daemon watcher is reliable but adds always-on process. Best: on-demand with staleness detection. **Carmen (Systems)**: We have daemon infrastructure already. File watcher is cheap. Index on file save, async in background. User never waits. **James (Observability)**: Whatever we choose, need clear visibility into "is my index fresh?" `blue index status` should show staleness per file. **Ben (DX)**: On-demand is safest. I run `blue index` when I need to search. No magic, no surprises. If daemon is available, it can pre-warm, but never required. **Hassan (Product)**: The killer use case is impact analysis before a change. User is about to modify a file, asks "what depends on this?" Index needs to be fresh. Daemon watching makes this instant. **Iris (Simplicity)**: On-demand only. Daemon watching is scope creep. User changes file, runs `blue index --file foo.rs`, searches. Simple mental model. **Grace (Security)**: Daemon watching means reading all files continuously. For repos with sensitive code, that's a concern. On-demand gives user control. **Alignment**: 75% - split between daemon-assisted and pure on-demand. Need to reconcile. ### Question 3: Relationship Detection **Luna (AI/ML)**: AI inference is the only practical option for polyglot codebases. Static analysis means parsers for every language. AI can read Python, Rust, TypeScript and understand "this file imports that one." **Ada (API Design)**: Explicit declaration is most reliable. Like the existing Binding/ExportBinding structure. Developer says "this file provides X, that file consumes X." AI can suggest, human confirms. **Elena (Claude Integration)**: AI should suggest relationships during indexing. "I see this file imports domain.rs, they have a uses relationship." Store as tentative until confirmed. Over time, learn which suggestions are right. **Kim (Testing)**: AI-inferred relationships are non-deterministic. Same file might get different relationships on re-index. Hard to test, hard to trust. **Iris (Simplicity)**: Skip relationships for MVP. File summaries and symbol descriptions are enough. Relationships add complexity. Search "Domain struct" and you'll find both the definition and usages. **Felix (Distributed)**: Relationships are critical for impact analysis. That's the whole point. But Kim is right about determinism. Solution: cache AI suggestions, only re-analyze on significant change. **David (Search)**: Relationships improve search ranking. "Files related to X" is a better query than "files mentioning X". But explicit > inferred for reliability. **Alignment**: 70% - tension between AI inference and explicit declaration. Need synthesis. ### Question 4: AI Model for Indexing **Luna (AI/ML)**: Local models (Ollama) for privacy and cost. Indexing happens frequently; API costs add up. Quality difference is narrowing. Llama 3.2 or Qwen 2.5 can summarize code well. **Carmen (Systems)**: Local means requiring Ollama installed and running. Not everyone has that. Need graceful degradation - use API if local unavailable. **Ben (DX)**: Make it configurable. Some teams have API keys, some run local. Default to local if Ollama detected, fall back to "index not available." **Grace (Security)**: Local keeps code on-device. Important for proprietary codebases. API means sending code snippets to third party. Local should be default. **Hassan (Product)**: API gives consistent quality. Local varies by hardware. But the privacy story matters. Local-first, API opt-in. **Iris (Simplicity)**: Require Ollama for now. We already integrated it for `blue agent`. Don't add API complexity. If someone wants API, they can run Ollama with API backend. **Elena (Claude Integration)**: For Claude Code integration, the AI doing the work IS the API. When user asks to index, Claude can do it inline. No separate model needed. **Alignment**: 82% toward local-first (Ollama), with inline-Claude option for MCP context. ### Question 5: Index Granularity **David (Search)**: Symbol-level is necessary for useful search. "Find the function that validates S3 paths" needs to match `validate_s3_path` at line 47, not just "this file does S3 stuff." **Iris (Simplicity)**: File-level summaries first. Symbol extraction is expensive and language-specific. A good file summary mentions key functions: "Defines Domain struct (line 13) and Binding struct (line 76)." **Carmen (Systems)**: Symbol-level means more rows, more storage, more indexing time. For a 10,000 file realm, that's 50,000+ symbol entries. Worth it? **Luna (AI/ML)**: AI can extract symbols naturally. "List the main components in this file with line numbers." One prompt, structured output. Not that expensive. **Ada (API Design)**: Symbol-level enables richer queries: "Find all functions that return Result" vs just "files about domains." Worth the complexity. **Ben (DX)**: Impact analysis needs symbol-level. "What calls this function?" requires knowing what functions exist and where. File-level is just better grep. **Kim (Testing)**: Symbol extraction can be validated - run on known files, check expected symbols appear. More testable than pure summaries. **Hassan (Product)**: Users think in symbols: functions, classes, types. Not files. Index what users think about. **Alignment**: 85% toward symbol-level indexing with structured extraction. --- ## Round 2: Convergence ### Reconciling Question 2: Update Triggers **Felix**: Proposal: *tiered freshness*. On-demand is always available. Daemon watching is enhancement. MCP tools report staleness. ``` Index freshness: - 3 files stale (modified since last index) - Last full index: 2 hours ago ``` User can ignore staleness for quick searches, or run `blue index` when precision matters. **Carmen**: I can accept that. Daemon is optional optimization. Core functionality works without it. **Iris**: If daemon is optional and clearly optional, I'm in. No invisible magic. **Ben**: Add `--watch` flag to explicitly start watching. Default is on-demand. **James**: Staleness in every search result. "This file was indexed 3 hours ago, file has changed since." User knows to re-index if needed. **Hassan**: This works for the "impact before change" story. User sees staleness, re-indexes the files they care about, gets fresh results. **Alignment**: 92% toward tiered freshness with optional daemon. ### Reconciling Question 3: Relationships **Elena**: Synthesis: *AI-suggested, query-time materialized.* Don't store relationships persistently. When user asks "what depends on X?", AI analyzes X's symbols and searches for usages across the index. Results are relationships, computed on demand. **David**: This is how good code search works. You don't precompute all relationships - you find them at query time. The index gives you fast symbol lookup, the query gives you relationships. **Luna**: This avoids the determinism problem. Each query is a fresh analysis. If the index is fresh, relationships are fresh. **Kim**: Much easier to test. Query "depends on Domain" should return files containing "Domain" in their symbol usages. Deterministic given the index. **Iris**: I like this. No relationship storage, no relationship staleness. Index symbols well, derive relationships at query time. **Ada**: We could cache frequent queries. "Depends on auth.rs" gets cached until auth.rs changes. Optimization, not architecture. **Felix**: Cache is good. Query-time computation with LRU cache. Cache invalidates when any involved file changes. **Alignment**: 94% toward query-time relationship derivation with optional caching. --- ## Round 3: Final Positions ### Consolidated Design **Storage**: SQLite + FTS5, schema designed for future embedding column. **Update Triggers**: On-demand primary, optional daemon watching with `--watch`. Staleness always visible. **Relationships**: Query-time derivation, not stored. Optional caching for frequent queries. **AI Model**: Local (Ollama) primary, inline-Claude when called from MCP. Configurable. **Granularity**: Symbol-level with file summary. Structured extraction: name, kind, lines, description. ### Final Alignment Scores | Question | Alignment | |----------|-----------| | Storage backend | 88% | | Update triggers | 92% | | Relationship detection | 94% | | AI model | 82% | | Index granularity | 85% | | **Overall** | **88%** | ### Remaining Dissent **Luna (8%)**: Embeddings should be MVP, not "future." Keyword search will disappoint users expecting semantic matching. **Iris (4%)**: Symbol-level is over-engineering. Start with file summaries, add symbols when proven needed. **Grace (5%)**: Local-only is too restrictive. Some teams can't run Ollama. Need API option from day one. --- ## Round 4: Closing the Gap ### Addressing Luna's Concern **David**: Counter-proposal: support *optional* embedding column from day one. If user has embedding model configured, populate it. Search uses embeddings when available, falls back to FTS5. **Luna**: That works. Embeddings are enhancement, not requirement. Users who care can enable them. **Carmen**: Minimal code change - add nullable `embedding BLOB` column to schema. Search checks if populated. **Alignment**: Luna satisfied. +4% → 92% ### Addressing Iris's Concern **Ben**: What if symbol extraction is *optional*? Default indexer produces file summary only. `--symbols` flag enables deep extraction. **Iris**: I can accept that. Users who want symbols opt in. Default is simple. **Hassan**: Disagree. Symbols are the product. We shouldn't hide the value behind a flag. **Kim**: Compromise: extract symbols by default, but don't fail if extraction fails. Some files might only get summaries. **Iris**: Fine. Best-effort symbols, graceful degradation to summary-only. **Alignment**: Iris satisfied. +2% → 94% ### Addressing Grace's Concern **Elena**: We already support `blue agent --model provider/model`. Same pattern for indexing. Default to Ollama, `--model anthropic/claude-3-haiku` works too. **Grace**: That's acceptable. Local is default, API is opt-in with explicit flag. **Ben**: Document the privacy implications clearly. "By default, code stays local. API option sends code to provider." **Alignment**: Grace satisfied. +3% → 97% --- ## Final Alignment: 97% ### Consensus Design 1. **Storage**: SQLite + FTS5, optional embedding column for future/power users 2. **Updates**: On-demand primary, optional `--watch` daemon, staleness always shown 3. **Relationships**: Query-time derivation from symbol index, optional LRU cache 4. **AI Model**: Ollama default, API opt-in with `--model`, inline-Claude in MCP context 5. **Granularity**: Symbol-level by default, graceful fallback to file summary ### Remaining 3% Dissent **Iris**: Still think we're building too much. But I'll trust the process. --- ## Round 5: Design Refinements New questions surfaced during RFC drafting: 1. **Update triggers revised** - Git pre-commit hook instead of daemon? 2. **Relationships revised** - Store AI descriptions at index time instead of query-time derivation? 3. **Model sizing** - Which Qwen model balances speed and quality for indexing? ### Question 6: Git Pre-Commit Hook **Felix (Distributed)**: I take back my earlier concern about hooks. Pre-commit is reliable because it's tied to an action the developer already does. Post-save watchers are invisible; pre-commit is explicit. **Ben (DX)**: `blue index --install-hook` is one command. Developer opts in consciously. Hook runs on staged files only — fast, focused. **Carmen (Systems)**: Hook calls `blue index --diff`, indexes only changed files. No daemon process. No file watcher. Clean. **James (Observability)**: Hook should be non-blocking. If indexing fails, warn but don't abort commit. Developers will disable blocking hooks. **Iris (Simplicity)**: Much better than daemon. Git is already the source of truth for changes. Hook respects that. I'm fully on board. **Kim (Testing)**: Easy to test: stage files, run hook, verify index updated. Deterministic. **Hassan (Product)**: Need `blue index --all` for bootstrap. First clone, run once, then hooks maintain it. **Alignment**: 98% toward git pre-commit hook with `--all` for bootstrap. ### Question 7: Stored Relationships **Luna (AI/ML)**: Storing relationships at index time is better for search quality. Query-time derivation means another AI call per search. Slow. Stored descriptions are instant FTS5 matches. **David (Search)**: Agree. The AI writes natural language: "Uses Domain from domain.rs for state management." That's searchable. "What uses Domain" hits it directly. **Kim (Testing)**: Stored is more deterministic. Same index = same search results. Query-time AI adds variability. **Elena (Claude Integration)**: One AI call per file at index time, zero at search time. Much better UX. Search feels instant. **Iris (Simplicity)**: I was wrong earlier. Stored relationships are simpler operationally. No AI inference during search. Just text matching. **Carmen (Systems)**: Relationships field is just another TEXT column in file_index. FTS5 includes it. Minimal schema change. **Felix (Distributed)**: When file A changes, we re-index A. A's relationships update. Files depending on A don't need re-indexing — their descriptions still say "uses A". Search still works. **Alignment**: 96% toward AI-generated relationships stored at index time. ### Question 8: Qwen Model Size for Indexing **Luna (AI/ML)**: The task is structured extraction: summary, relationships, symbols with line numbers. Not creative writing. Smaller models excel at structured tasks. Let me break down the options: | Model | Size | Speed (tok/s on M2) | Quality | Use Case | |-------|------|---------------------|---------|----------| | Qwen2.5:0.5b | 0.5B | ~200 | Basic | Too small for code understanding | | Qwen2.5:1.5b | 1.5B | ~150 | Good | Fast, handles simple files | | Qwen2.5:3b | 3B | ~100 | Very Good | Sweet spot for code analysis | | Qwen2.5:7b | 7B | ~50 | Excellent | Overkill for structured extraction | | Qwen2.5:14b | 14B | ~25 | Excellent | Way too slow for batch indexing | **Carmen (Systems)**: For batch indexing hundreds of files, speed matters. 3B at 100 tok/s means a 500-token file takes 5 seconds. 7B doubles that. **Ben (DX)**: Pre-commit hook needs to be fast. Developer commits 5 files, waits... how long? At 3B, maybe 25 seconds total. At 7B, 50 seconds. 3B is the limit. **David (Search)**: Quality requirements: can it identify the main symbols? Can it describe relationships accurately? 3B Qwen2.5 handles this well. I've tested it on code summarization. **Elena (Claude Integration)**: Qwen2.5:3b is specifically tuned for code. The :coder variants are even better but same size. For structured extraction with a good prompt, 3B is sufficient. **Grace (Security)**: Smaller model = smaller attack surface, less memory, faster. Security likes smaller when quality is adequate. **Iris (Simplicity)**: 3B. It's the middle path. Not too slow, not too dumb. **Hassan (Product)**: What about variable sizing? Use 3B for most files, 7B for complex/critical files? **Luna (AI/ML)**: Complexity detection adds overhead. Start with 3B uniform. If users report quality issues on specific file types, add heuristics later. **James (Observability)**: Log model performance per file. We'll see patterns: "Rust files take 2x longer" or "3B struggles with files over 1000 lines." **Kim (Testing)**: 3B is testable. Run on known files, verify expected symbols extracted. If tests pass, quality is sufficient. **Alignment check on model size:** | Model | Votes | Alignment | |-------|-------|-----------| | Qwen2.5:1.5b | 1 (Iris fallback) | 8% | | Qwen2.5:3b | 10 | 84% | | Qwen2.5:7b | 1 (Luna for quality) | 8% | **Luna**: I'll concede to 3B for MVP. Add `--model` flag for users who want 7B quality and have patience. **Alignment**: 94% toward Qwen2.5:3b default, configurable via `--model`. --- ## Round 6: Final Refinements ### Handling Large Files **Carmen (Systems)**: What about files over 1000 lines? 3B context is 32K tokens, but very long files might need chunking. **Luna (AI/ML)**: Chunk by logical units: functions, classes. Index each chunk. Reassemble into single file entry. **Iris (Simplicity)**: Or just truncate. Index the first 500 lines. Most important code is at the top. Pragmatic. **David (Search)**: Truncation loses symbols at the bottom. Chunking is better. But adds complexity. **Elena (Claude Integration)**: Proposal: for files under 1000 lines (95% of files), index whole file. For larger files, summarize with explicit note: "Large file, partial index." **Ben (DX)**: I like Elena's approach. Don't over-engineer for edge cases. Note the limitation, move on. **Alignment**: 92% toward whole-file indexing with "large file" warning for 1000+ lines. ### Prompt Engineering **Luna (AI/ML)**: The indexing prompt is critical. Needs to be: - Structured output (YAML or JSON) - Explicit about line numbers - Focused on relationships ``` Analyze this source file and provide: 1. A one-sentence summary of what this file does 2. A paragraph describing relationships to other files (imports, exports, dependencies) 3. A list of key symbols (functions, classes, structs, enums) with: - name - kind (function/class/struct/enum/const) - start and end line numbers - one-sentence description Output as YAML. ``` **Kim (Testing)**: Prompt should be versioned. If we change the prompt, re-index everything. **Ada (API Design)**: Store prompt version in file_index. `prompt_version INTEGER`. When prompt changes, all entries are stale. **Alignment**: 96% toward structured prompt with versioning. --- ## Final Alignment: 96% ### Updated Consensus Design 1. **Storage**: SQLite + FTS5, optional embedding column 2. **Updates**: Git pre-commit hook (`--diff`), bootstrap with `--all` 3. **Relationships**: AI-generated descriptions stored at index time 4. **AI Model**: Qwen2.5:3b default (Ollama), configurable via `--model` 5. **Granularity**: Symbol-level with line numbers, whole-file for <1000 lines 6. **Prompt**: Structured YAML output, versioned ### Final Alignment Scores | Question | Alignment | |----------|-----------| | Storage backend | 92% | | Update triggers (git hook) | 98% | | Relationships (stored) | 96% | | AI model (Qwen2.5:3b) | 94% | | Index granularity | 92% | | Large file handling | 92% | | Prompt design | 96% | | **Overall** | **96%** | ### Remaining 4% Dissent **Luna (2%)**: Would prefer 7B for quality, but accepts 3B with `--model` escape hatch. **Hassan (2%)**: Wants adaptive model selection, but accepts uniform 3B for MVP. --- *"Twelve voices, refined twice. That's how you ship."* — Blue