Spike investigation into AI-maintained semantic indexing for realms. 12-expert dialogue refined through 6 rounds to 96% alignment.

Key decisions:
- Storage: SQLite + FTS5, relationships stored at index time
- Triggers: Git pre-commit hook on diff, `--all` for bootstrap
- Model: Qwen2.5:3b via Ollama (speed/quality sweet spot)
- Granularity: Symbol-level with line numbers

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
# Dialogue: Realm Semantic Index

Spike: 2026-01-24-Realm Semantic Index
Goal: Reach 96% alignment on semantic indexing design
Format: 12 experts, structured rounds
## Open Questions
- Storage backend - SQLite+FTS5, sqlite-vec, or dedicated vector DB?
- Update triggers - Daemon watcher, git hooks, on-demand, or hybrid?
- Relationship detection - Static analysis, AI inference, or explicit declaration?
- AI model - Local (Ollama) vs API for indexing?
- Index granularity - File-level summaries vs symbol-level detail?
## Expert Panel
| Expert | Domain | Perspective |
|---|---|---|
| Ada | API Design | Clean interfaces, discoverability |
| Ben | Developer Experience | Friction, learning curve |
| Carmen | Systems Architecture | Scalability, performance |
| David | Search Systems | Retrieval quality, ranking |
| Elena | Claude Integration | LLM tool use patterns |
| Felix | Distributed Systems | Consistency, coordination |
| Grace | Security | Trust boundaries, data sensitivity |
| Hassan | Product | User workflows, value delivery |
| Iris | Simplicity | Minimalism, YAGNI |
| James | Observability | Debugging, transparency |
| Kim | Testing | Testability, reliability |
| Luna | AI/ML | Model selection, embedding quality |
## Round 1: Initial Positions

### Question 1: Storage Backend
David (Search): Vector search is the future, but sqlite-vec is immature. For semantic search over code descriptions, embeddings will outperform keyword matching. But we can layer - FTS5 now, vectors later.
Carmen (Systems): Keep the stack simple. We already have blue.db. Adding sqlite-vec means native extensions, cross-platform builds, Rust binding complexity. FTS5 is built-in and good enough for thousands of files.
Iris (Simplicity): SQLite+FTS5. Period. We're searching human-readable descriptions, not raw code. Keywords work. The whole point of AI-generated summaries is they use natural language. "What handles S3 permissions" will match "manages S3 bucket access policies."
Luna (AI/ML): Embeddings give fuzzy matching - "authentication" matches "login", "credentials", "session tokens". FTS5 won't. But generating embeddings adds latency and storage. Hybrid is ideal: FTS5 for exact, embeddings for semantic fallback.
Ben (DX): Whatever doesn't require extra setup. Developers won't install special extensions just to use indexing. FTS5 ships with SQLite.
Kim (Testing): FTS5 is deterministic and easy to test. Vector similarity has floating-point fuzziness. Start with testable.
Alignment: 88% toward SQLite+FTS5 for MVP, design for embedding extension later.
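The FTS5-over-natural-language idea can be sketched with Python's stdlib `sqlite3` (table and column names here are illustrative, not the actual blue.db schema):

```python
import sqlite3

# In-memory DB for illustration; the real index would live in blue.db.
conn = sqlite3.connect(":memory:")

# FTS5 virtual table over AI-generated, human-readable descriptions.
conn.execute("CREATE VIRTUAL TABLE file_index USING fts5(path, summary)")
conn.execute(
    "INSERT INTO file_index VALUES (?, ?)",
    ("src/s3.rs", "Manages S3 bucket access policies and permission checks"),
)

# Because summaries are natural language, plain keyword search matches intent.
rows = conn.execute(
    "SELECT path FROM file_index WHERE file_index MATCH ?",
    ("permissions OR policies",),
).fetchall()
print(rows)  # [('src/s3.rs',)]
```

This is Iris's point in miniature: the AI summary, not the raw code, is what gets searched, so FTS5 keyword matching goes a long way before embeddings are needed.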
### Question 2: Update Triggers
Felix (Distributed): Git hooks are fragile - users disable them, forget to install them, CI doesn't run them. Daemon watcher is reliable but adds always-on process. Best: on-demand with staleness detection.
Carmen (Systems): We have daemon infrastructure already. File watcher is cheap. Index on file save, async in background. User never waits.
James (Observability): Whatever we choose, we need clear visibility into "is my index fresh?". A `blue index status` command should show staleness per file.
Ben (DX): On-demand is safest. I run blue index when I need to search. No magic, no surprises. If daemon is available, it can pre-warm, but never required.
Hassan (Product): The killer use case is impact analysis before a change. User is about to modify a file, asks "what depends on this?" Index needs to be fresh. Daemon watching makes this instant.
Iris (Simplicity): On-demand only. Daemon watching is scope creep. User changes file, runs blue index --file foo.rs, searches. Simple mental model.
Grace (Security): Daemon watching means reading all files continuously. For repos with sensitive code, that's a concern. On-demand gives user control.
Alignment: 75% - split between daemon-assisted and pure on-demand. Need to reconcile.
### Question 3: Relationship Detection
Luna (AI/ML): AI inference is the only practical option for polyglot codebases. Static analysis means parsers for every language. AI can read Python, Rust, TypeScript and understand "this file imports that one."
Ada (API Design): Explicit declaration is most reliable. Like the existing Binding/ExportBinding structure. Developer says "this file provides X, that file consumes X." AI can suggest, human confirms.
Elena (Claude Integration): AI should suggest relationships during indexing. "I see this file imports domain.rs, they have a uses relationship." Store as tentative until confirmed. Over time, learn which suggestions are right.
Kim (Testing): AI-inferred relationships are non-deterministic. Same file might get different relationships on re-index. Hard to test, hard to trust.
Iris (Simplicity): Skip relationships for MVP. File summaries and symbol descriptions are enough. Relationships add complexity. Search "Domain struct" and you'll find both the definition and usages.
Felix (Distributed): Relationships are critical for impact analysis. That's the whole point. But Kim is right about determinism. Solution: cache AI suggestions, only re-analyze on significant change.
David (Search): Relationships improve search ranking. "Files related to X" is a better query than "files mentioning X". But explicit > inferred for reliability.
Alignment: 70% - tension between AI inference and explicit declaration. Need synthesis.
### Question 4: AI Model for Indexing
Luna (AI/ML): Local models (Ollama) for privacy and cost. Indexing happens frequently; API costs add up. Quality difference is narrowing. Llama 3.2 or Qwen 2.5 can summarize code well.
Carmen (Systems): Local means requiring Ollama installed and running. Not everyone has that. Need graceful degradation - use API if local unavailable.
Ben (DX): Make it configurable. Some teams have API keys, some run local. Default to local if Ollama detected, fall back to "index not available."
Grace (Security): Local keeps code on-device. Important for proprietary codebases. API means sending code snippets to third party. Local should be default.
Hassan (Product): API gives consistent quality. Local varies by hardware. But the privacy story matters. Local-first, API opt-in.
Iris (Simplicity): Require Ollama for now. We already integrated it for blue agent. Don't add API complexity. If someone wants API, they can run Ollama with API backend.
Elena (Claude Integration): For Claude Code integration, the AI doing the work IS the API. When user asks to index, Claude can do it inline. No separate model needed.
Alignment: 82% toward local-first (Ollama), with inline-Claude option for MCP context.
### Question 5: Index Granularity
David (Search): Symbol-level is necessary for useful search. "Find the function that validates S3 paths" needs to match validate_s3_path at line 47, not just "this file does S3 stuff."
Iris (Simplicity): File-level summaries first. Symbol extraction is expensive and language-specific. A good file summary mentions key functions: "Defines Domain struct (line 13) and Binding struct (line 76)."
Carmen (Systems): Symbol-level means more rows, more storage, more indexing time. For a 10,000 file realm, that's 50,000+ symbol entries. Worth it?
Luna (AI/ML): AI can extract symbols naturally. "List the main components in this file with line numbers." One prompt, structured output. Not that expensive.
Ada (API Design): Symbol-level enables richer queries: "Find all functions that return Result" vs just "files about domains." Worth the complexity.
Ben (DX): Impact analysis needs symbol-level. "What calls this function?" requires knowing what functions exist and where. File-level is just better grep.
Kim (Testing): Symbol extraction can be validated - run on known files, check expected symbols appear. More testable than pure summaries.
Hassan (Product): Users think in symbols: functions, classes, types. Not files. Index what users think about.
Alignment: 85% toward symbol-level indexing with structured extraction.
## Round 2: Convergence

### Reconciling Question 2: Update Triggers
Felix: Proposal: tiered freshness. On-demand is always available. Daemon watching is an enhancement. MCP tools report staleness:

```
Index freshness:
- 3 files stale (modified since last index)
- Last full index: 2 hours ago
```
User can ignore staleness for quick searches, or run blue index when precision matters.
Carmen: I can accept that. Daemon is optional optimization. Core functionality works without it.
Iris: If daemon is optional and clearly optional, I'm in. No invisible magic.
Ben: Add --watch flag to explicitly start watching. Default is on-demand.
James: Staleness in every search result. "This file was indexed 3 hours ago, file has changed since." User knows to re-index if needed.
Hassan: This works for the "impact before change" story. User sees staleness, re-indexes the files they care about, gets fresh results.
Alignment: 92% toward tiered freshness with optional daemon.
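The staleness check behind tiered freshness reduces to comparing each file's on-disk mtime against its last index time. A minimal sketch (the function name and dict-based inputs are hypothetical, chosen to keep the example pure):

```python
def stale_files(indexed_at: dict[str, float], mtimes: dict[str, float]) -> list[str]:
    """Paths whose on-disk mtime is newer than their last index time.
    Never-indexed paths (absent from indexed_at) are always stale."""
    return [path for path, mtime in mtimes.items() if mtime > indexed_at.get(path, 0.0)]

report = stale_files(
    indexed_at={"src/domain.rs": 100.0, "src/lib.rs": 100.0},
    mtimes={"src/domain.rs": 250.0, "src/lib.rs": 90.0, "src/new.rs": 50.0},
)
print(report)  # ['src/domain.rs', 'src/new.rs']
```

A status command would render this as James suggests: per-file staleness plus the time of the last full index.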
### Reconciling Question 3: Relationships
Elena: Synthesis: AI-suggested, query-time materialized.
Don't store relationships persistently. When user asks "what depends on X?", AI analyzes X's symbols and searches for usages across the index. Results are relationships, computed on demand.
David: This is how good code search works. You don't precompute all relationships - you find them at query time. The index gives you fast symbol lookup, the query gives you relationships.
Luna: This avoids the determinism problem. Each query is a fresh analysis. If the index is fresh, relationships are fresh.
Kim: Much easier to test. Query "depends on Domain" should return files containing "Domain" in their symbol usages. Deterministic given the index.
Iris: I like this. No relationship storage, no relationship staleness. Index symbols well, derive relationships at query time.
Ada: We could cache frequent queries. "Depends on auth.rs" gets cached until auth.rs changes. Optimization, not architecture.
Felix: Cache is good. Query-time computation with LRU cache. Cache invalidates when any involved file changes.
Alignment: 94% toward query-time relationship derivation with optional caching.
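Query-time derivation with an LRU cache can be sketched in a few lines; the `SYMBOL_USAGES` map stands in for whatever the real index stores about which symbols each file references:

```python
from functools import lru_cache

# Hypothetical symbol index: file -> symbols its indexed description mentions using.
SYMBOL_USAGES = {
    "src/cli.rs": ["Domain", "Binding"],
    "src/store.rs": ["Domain"],
    "src/util.rs": [],
}

@lru_cache(maxsize=128)
def dependents_of(symbol: str) -> tuple[str, ...]:
    """Derive 'what depends on X?' at query time; nothing is stored persistently."""
    return tuple(path for path, syms in SYMBOL_USAGES.items() if symbol in syms)

print(dependents_of("Domain"))  # ('src/cli.rs', 'src/store.rs')
```

Felix's invalidation rule maps onto `dependents_of.cache_clear()` whenever an involved file is re-indexed: the cache is an optimization layered on deterministic lookups, not stored state.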
## Round 3: Final Positions

### Consolidated Design

- Storage: SQLite + FTS5, schema designed for future embedding column.
- Update Triggers: On-demand primary, optional daemon watching with `--watch`. Staleness always visible.
- Relationships: Query-time derivation, not stored. Optional caching for frequent queries.
- AI Model: Local (Ollama) primary, inline-Claude when called from MCP. Configurable.
- Granularity: Symbol-level with file summary. Structured extraction: name, kind, lines, description.
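The consolidated design implies a schema shaped roughly like the following. This is a sketch only; the actual blue.db tables and names may differ:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical shape of the semantic index.
CREATE TABLE file_index (
    path        TEXT PRIMARY KEY,
    summary     TEXT NOT NULL,
    indexed_at  INTEGER NOT NULL,
    embedding   BLOB              -- nullable: populated only when embeddings enabled
);
CREATE TABLE symbol (
    path        TEXT NOT NULL REFERENCES file_index(path),
    name        TEXT NOT NULL,
    kind        TEXT NOT NULL,    -- function / class / struct / enum / const
    start_line  INTEGER NOT NULL,
    end_line    INTEGER NOT NULL,
    description TEXT NOT NULL
);
""")
```

The nullable `embedding` column is the "design for extension" decision from Question 1: search checks whether it is populated and otherwise falls back to FTS5.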
### Final Alignment Scores
| Question | Alignment |
|---|---|
| Storage backend | 88% |
| Update triggers | 92% |
| Relationship detection | 94% |
| AI model | 82% |
| Index granularity | 85% |
| Overall | 88% |
### Remaining Dissent
Luna (8%): Embeddings should be MVP, not "future." Keyword search will disappoint users expecting semantic matching.
Iris (4%): Symbol-level is over-engineering. Start with file summaries, add symbols when proven needed.
Grace (5%): Local-only is too restrictive. Some teams can't run Ollama. Need API option from day one.
## Round 4: Closing the Gap

### Addressing Luna's Concern
David: Counter-proposal: support optional embedding column from day one. If user has embedding model configured, populate it. Search uses embeddings when available, falls back to FTS5.
Luna: That works. Embeddings are enhancement, not requirement. Users who care can enable them.
Carmen: Minimal code change - add nullable embedding BLOB column to schema. Search checks if populated.
Alignment: Luna satisfied. +4% → 92%
### Addressing Iris's Concern
Ben: What if symbol extraction is optional? Default indexer produces file summary only. --symbols flag enables deep extraction.
Iris: I can accept that. Users who want symbols opt in. Default is simple.
Hassan: Disagree. Symbols are the product. We shouldn't hide the value behind a flag.
Kim: Compromise: extract symbols by default, but don't fail if extraction fails. Some files might only get summaries.
Iris: Fine. Best-effort symbols, graceful degradation to summary-only.
Alignment: Iris satisfied. +2% → 94%
### Addressing Grace's Concern
Elena: We already support blue agent --model provider/model. Same pattern for indexing. Default to Ollama, --model anthropic/claude-3-haiku works too.
Grace: That's acceptable. Local is default, API is opt-in with explicit flag.
Ben: Document the privacy implications clearly. "By default, code stays local. API option sends code to provider."
Alignment: Grace satisfied. +3% → 97%
## Final Alignment: 97%

## Consensus Design

- Storage: SQLite + FTS5, optional embedding column for future/power users
- Updates: On-demand primary, optional `--watch` daemon, staleness always shown
- Relationships: Query-time derivation from symbol index, optional LRU cache
- AI Model: Ollama default, API opt-in with `--model`, inline-Claude in MCP context
- Granularity: Symbol-level by default, graceful fallback to file summary
## Remaining 3% Dissent
Iris: Still think we're building too much. But I'll trust the process.
## Round 5: Design Refinements
New questions surfaced during RFC drafting:
- Update triggers revised - Git pre-commit hook instead of daemon?
- Relationships revised - Store AI descriptions at index time instead of query-time derivation?
- Model sizing - Which Qwen model balances speed and quality for indexing?
### Question 6: Git Pre-Commit Hook
Felix (Distributed): I take back my earlier concern about hooks. Pre-commit is reliable because it's tied to an action the developer already does. Post-save watchers are invisible; pre-commit is explicit.
Ben (DX): blue index --install-hook is one command. Developer opts in consciously. Hook runs on staged files only — fast, focused.
Carmen (Systems): Hook calls blue index --diff, indexes only changed files. No daemon process. No file watcher. Clean.
James (Observability): Hook should be non-blocking. If indexing fails, warn but don't abort commit. Developers will disable blocking hooks.
Iris (Simplicity): Much better than daemon. Git is already the source of truth for changes. Hook respects that. I'm fully on board.
Kim (Testing): Easy to test: stage files, run hook, verify index updated. Deterministic.
Hassan (Product): Need blue index --all for bootstrap. First clone, run once, then hooks maintain it.
Alignment: 98% toward git pre-commit hook with --all for bootstrap.
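A pre-commit hook along these lines could be a small script that indexes only staged files and, per James's requirement, never blocks the commit. The function names and the indexer call site are hypothetical:

```python
import subprocess

def staged_paths(diff_output: str) -> list[str]:
    """Parse `git diff --cached --name-only` output into a list of staged files."""
    return [line.strip() for line in diff_output.splitlines() if line.strip()]

def pre_commit_index() -> int:
    """Index staged files; warn on failure but always exit 0 so the commit proceeds."""
    out = subprocess.run(
        ["git", "diff", "--cached", "--name-only"],
        capture_output=True, text=True,
    ).stdout
    for path in staged_paths(out):
        try:
            pass  # invoke the indexer here, e.g. the equivalent of `blue index --file <path>`
        except Exception as exc:
            print(f"warning: index failed for {path}: {exc}")
    return 0  # non-blocking by design

if __name__ == "__main__":
    raise SystemExit(pre_commit_index())
```

`blue index --install-hook` would drop an equivalent into `.git/hooks/pre-commit`; `blue index --all` covers the bootstrap case after a fresh clone.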
### Question 7: Stored Relationships
Luna (AI/ML): Storing relationships at index time is better for search quality. Query-time derivation means another AI call per search. Slow. Stored descriptions are instant FTS5 matches.
David (Search): Agree. The AI writes natural language: "Uses Domain from domain.rs for state management." That's searchable. "What uses Domain" hits it directly.
Kim (Testing): Stored is more deterministic. Same index = same search results. Query-time AI adds variability.
Elena (Claude Integration): One AI call per file at index time, zero at search time. Much better UX. Search feels instant.
Iris (Simplicity): I was wrong earlier. Stored relationships are simpler operationally. No AI inference during search. Just text matching.
Carmen (Systems): Relationships field is just another TEXT column in file_index. FTS5 includes it. Minimal schema change.
Felix (Distributed): When file A changes, we re-index A. A's relationships update. Files depending on A don't need re-indexing — their descriptions still say "uses A". Search still works.
Alignment: 96% toward AI-generated relationships stored at index time.
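Carmen's point, that stored relationships are just another searchable text column, can be shown with an FTS5 column filter (schema and contents are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# relationships is just another indexed text column alongside the summary.
conn.execute("CREATE VIRTUAL TABLE file_index USING fts5(path, summary, relationships)")
conn.execute(
    "INSERT INTO file_index VALUES (?, ?, ?)",
    ("src/state.rs",
     "State management for realms",
     "Uses Domain from domain.rs for state management"),
)

# FTS5 column filter: match only within the relationships column.
hits = conn.execute(
    "SELECT path FROM file_index WHERE file_index MATCH ?",
    ("relationships: Domain",),
).fetchall()
print(hits)  # [('src/state.rs',)]
```

This is why "What uses Domain" becomes a plain text match at search time, with no AI call in the query path.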
### Question 8: Qwen Model Size for Indexing
Luna (AI/ML): The task is structured extraction: summary, relationships, symbols with line numbers. Not creative writing. Smaller models excel at structured tasks.
Let me break down the options:
| Model | Size | Speed (tok/s on M2) | Quality | Use Case |
|---|---|---|---|---|
| Qwen2.5:0.5b | 0.5B | ~200 | Basic | Too small for code understanding |
| Qwen2.5:1.5b | 1.5B | ~150 | Good | Fast, handles simple files |
| Qwen2.5:3b | 3B | ~100 | Very Good | Sweet spot for code analysis |
| Qwen2.5:7b | 7B | ~50 | Excellent | Overkill for structured extraction |
| Qwen2.5:14b | 14B | ~25 | Excellent | Way too slow for batch indexing |
Carmen (Systems): For batch indexing hundreds of files, speed matters. 3B at 100 tok/s means a 500-token file takes 5 seconds. 7B doubles that.
Ben (DX): Pre-commit hook needs to be fast. Developer commits 5 files, waits... how long? At 3B, maybe 25 seconds total. At 7B, 50 seconds. 3B is the limit.
David (Search): Quality requirements: can it identify the main symbols? Can it describe relationships accurately? 3B Qwen2.5 handles this well. I've tested it on code summarization.
Elena (Claude Integration): Qwen2.5:3b is specifically tuned for code. The :coder variants are even better but same size. For structured extraction with a good prompt, 3B is sufficient.
Grace (Security): Smaller model = smaller attack surface, less memory, faster. Security likes smaller when quality is adequate.
Iris (Simplicity): 3B. It's the middle path. Not too slow, not too dumb.
Hassan (Product): What about variable sizing? Use 3B for most files, 7B for complex/critical files?
Luna (AI/ML): Complexity detection adds overhead. Start with 3B uniform. If users report quality issues on specific file types, add heuristics later.
James (Observability): Log model performance per file. We'll see patterns: "Rust files take 2x longer" or "3B struggles with files over 1000 lines."
Kim (Testing): 3B is testable. Run on known files, verify expected symbols extracted. If tests pass, quality is sufficient.
Alignment check on model size:
| Model | Votes | Alignment |
|---|---|---|
| Qwen2.5:1.5b | 1 (Iris fallback) | 8% |
| Qwen2.5:3b | 10 | 84% |
| Qwen2.5:7b | 1 (Luna for quality) | 8% |
Luna: I'll concede to 3B for MVP. Add --model flag for users who want 7B quality and have patience.
Alignment: 94% toward Qwen2.5:3b default, configurable via --model.
## Round 6: Final Refinements

### Handling Large Files
Carmen (Systems): What about files over 1000 lines? 3B context is 32K tokens, but very long files might need chunking.
Luna (AI/ML): Chunk by logical units: functions, classes. Index each chunk. Reassemble into single file entry.
Iris (Simplicity): Or just truncate. Index the first 500 lines. Most important code is at the top. Pragmatic.
David (Search): Truncation loses symbols at the bottom. Chunking is better. But adds complexity.
Elena (Claude Integration): Proposal: for files under 1000 lines (95% of files), index whole file. For larger files, summarize with explicit note: "Large file, partial index."
Ben (DX): I like Elena's approach. Don't over-engineer for edge cases. Note the limitation, move on.
Alignment: 92% toward whole-file indexing with "large file" warning for 1000+ lines.
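Elena's rule is a one-line threshold check. A sketch, with the constant and function name chosen for illustration:

```python
LARGE_FILE_LINES = 1000  # threshold agreed above

def index_mode(source: str) -> str:
    """Whole-file indexing below the threshold; partial (flagged as
    'Large file, partial index') at or above it."""
    line_count = source.count("\n") + 1
    return "whole" if line_count < LARGE_FILE_LINES else "partial"
```

Around 95% of files take the `whole` path; the rest get a summary with an explicit partial-index note rather than silent truncation.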
### Prompt Engineering
Luna (AI/ML): The indexing prompt is critical. Needs to be:
- Structured output (YAML or JSON)
- Explicit about line numbers
- Focused on relationships
The draft prompt:

```
Analyze this source file and provide:
1. A one-sentence summary of what this file does
2. A paragraph describing relationships to other files (imports, exports, dependencies)
3. A list of key symbols (functions, classes, structs, enums) with:
   - name
   - kind (function/class/struct/enum/const)
   - start and end line numbers
   - one-sentence description
Output as YAML.
```
Kim (Testing): Prompt should be versioned. If we change the prompt, re-index everything.
Ada (API Design): Store prompt version in file_index. prompt_version INTEGER. When prompt changes, all entries are stale.
Alignment: 96% toward structured prompt with versioning.
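Ada's `prompt_version` column makes prompt changes a second staleness trigger alongside file changes. A minimal sketch of the check (names are illustrative):

```python
PROMPT_VERSION = 2  # bump whenever the indexing prompt changes

def needs_reindex(entry_prompt_version: int, file_changed: bool) -> bool:
    """An entry is stale if the file changed OR it was built with an older prompt."""
    return file_changed or entry_prompt_version < PROMPT_VERSION
```

On startup the indexer can sweep for entries with `prompt_version < PROMPT_VERSION` and queue them, which gives Kim the "change the prompt, re-index everything" guarantee without any manual step.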
## Final Alignment: 96%

## Updated Consensus Design

- Storage: SQLite + FTS5, optional embedding column
- Updates: Git pre-commit hook (`--diff`), bootstrap with `--all`
- Relationships: AI-generated descriptions stored at index time
- AI Model: Qwen2.5:3b default (Ollama), configurable via `--model`
- Granularity: Symbol-level with line numbers, whole-file for <1000 lines
- Prompt: Structured YAML output, versioned
## Final Alignment Scores
| Question | Alignment |
|---|---|
| Storage backend | 92% |
| Update triggers (git hook) | 98% |
| Relationships (stored) | 96% |
| AI model (Qwen2.5:3b) | 94% |
| Index granularity | 92% |
| Large file handling | 92% |
| Prompt design | 96% |
| Overall | 96% |
## Remaining 4% Dissent
Luna (2%): Would prefer 7B for quality, but accepts 3B with --model escape hatch.
Hassan (2%): Wants adaptive model selection, but accepts uniform 3B for MVP.
> "Twelve voices, refined twice. That's how you ship."
>
> — Blue