Adds the foundation for AI-maintained semantic file indexing: Schema (v4 migration): - file_index table with summary, relationships, prompt_version - symbol_index table with name, kind, line numbers, description - FTS5 virtual tables for full-text search CLI commands (blue index): - --all: Bootstrap full index - --diff: Index staged files (for pre-commit hook) - --file: Single file indexing - --refresh: Re-index stale entries - --install-hook: Install git pre-commit hook - status: Show index freshness MCP tools: - blue_index_status: Get index stats - blue_index_search: FTS5 search across files/symbols - blue_index_impact: Analyze change blast radius - blue_index_file: Store AI-generated index data - blue_index_realm: List all indexed files Remaining work: Ollama integration for actual AI indexing. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
9.8 KiB
RFC 0010: Realm Semantic Index
| Status | In Progress |
| Date | 2026-01-24 |
| Source Spike | Realm Semantic Index |
| Dialogue | realm-semantic-index.dialogue.md |
| Alignment | 97% |
Summary
An AI-maintained semantic index for files within a realm. Each file gets a summary and symbol-level descriptions with line references. Enables semantic search for impact analysis: "what depends on this file?" and "what's the blast radius of this change?"
Problem
When working across repos in a realm:
- No quick way to know what a file does without reading it
- No way to find files related to a concept ("authentication", "S3 access")
- No impact analysis before making changes
- Existing search is keyword-only, misses semantic matches
Proposal
Index Structure
Each indexed file contains:
file: src/realm/domain.rs
last_indexed: 2026-01-24T10:30:00Z
file_hash: abc123
summary: "Domain definitions for cross-repo coordination"
relationships: |
Core types used by service.rs for realm state management.
Loaded/saved by repo.rs for persistence.
Referenced by daemon/client.rs for cross-repo messaging.
symbols:
- name: Domain
kind: struct
lines: [13, 73]
description: "Coordination context between repos with name, members, timestamps"
- name: Binding
kind: struct
lines: [76, 143]
description: "Declares repo exports and imports within a domain"
- name: ImportStatus
kind: enum
lines: [259, 274]
description: "Binding status: Pending, Current, Outdated, Broken"
Storage: SQLite + FTS5
Use existing blue.db with full-text search:
-- File-level index
CREATE TABLE file_index (
id INTEGER PRIMARY KEY,
realm TEXT NOT NULL,
repo TEXT NOT NULL,
file_path TEXT NOT NULL,
file_hash TEXT NOT NULL,
summary TEXT,
relationships TEXT, -- AI-generated relationship descriptions
indexed_at DATETIME DEFAULT CURRENT_TIMESTAMP,
prompt_version INTEGER DEFAULT 1, -- Invalidate on prompt changes
embedding BLOB, -- Optional, for future vector search
UNIQUE(realm, repo, file_path)
);
-- Symbol-level index
CREATE TABLE symbol_index (
id INTEGER PRIMARY KEY,
file_id INTEGER REFERENCES file_index(id) ON DELETE CASCADE,
name TEXT NOT NULL,
kind TEXT NOT NULL,
start_line INTEGER,
end_line INTEGER,
description TEXT
);
-- FTS5 for search
CREATE VIRTUAL TABLE file_search USING fts5(
file_path,
summary,
relationships,
content=file_index,
content_rowid=id
);
CREATE VIRTUAL TABLE symbol_search USING fts5(
name,
description,
content=symbol_index,
content_rowid=id
);
Update Triggers: Git-Driven
Primary: Pre-commit hook on diff
# .git/hooks/pre-commit (installed by blue index --install-hook)
#!/bin/sh
blue index --diff
The hook runs blue index --diff which:
- Gets staged files from
git diff --cached --name-only - Indexes only those files
- Commits include fresh index entries
Bootstrap: Full index from scratch
# First time setup - index everything
blue index --all
# Or index specific directory
blue index --all src/
On-demand: Single file or refresh
# Re-index specific file
blue index --file src/domain.rs
# Refresh stale entries (re-index files where hash changed)
blue index --refresh
MCP inline: When called from Claude, can index files during conversation.
Staleness Detection
blue index status
Index status:
Total files: 147
Indexed: 142 (96%)
Stale: 3 (hash mismatch)
Unindexed: 2 (new files)
Stale:
- src/realm/domain.rs
- src/realm/service.rs
Unindexed:
- src/new_feature.rs
- tests/new_test.rs
Relationships: AI-Generated at Index Time
When indexing a file, AI generates a concise relationships description alongside the summary:
file: src/realm/service.rs
summary: "RealmService coordinates cross-repo state and notifications"
relationships: |
Uses Domain and Binding from domain.rs for state representation.
Calls RepoConfig from config.rs for realm settings.
Provides notifications consumed by daemon/server.rs.
Tested by tests/realm_service_test.rs.
symbols:
- name: RealmService
kind: struct
lines: [15, 89]
description: "Main service coordinating realm operations"
The relationships field is a natural language description — searchable via FTS5:
Query: "what uses Domain"
→ Matches service.rs: "Uses Domain and Binding from domain.rs..."
Query: "what provides notifications"
→ Matches service.rs: "Provides notifications consumed by..."
AI does the relationship analysis once during indexing. Search is just text matching over stored descriptions. Fast and deterministic.
AI Model: Qwen2.5:3b via Ollama
Recommended: qwen2.5:3b — optimal balance of speed and quality for code indexing.
| Model | Speed (M2) | Quality | Verdict |
|---|---|---|---|
| qwen2.5:1.5b | ~150 tok/s | Basic | Too shallow for code analysis |
| qwen2.5:3b | ~100 tok/s | Very Good | Sweet spot — fast, accurate |
| qwen2.5:7b | ~50 tok/s | Excellent | Too slow for batch indexing |
At 3b, a 500-token file indexes in ~5 seconds. A 5-file commit takes ~25 seconds — acceptable for pre-commit hook.
Model priority:
1. Ollama qwen2.5:3b (default) - fast, local, private
2. --model flag - explicit override (e.g., qwen2.5:7b for quality)
3. Inline Claude - when called from MCP, use active model
Privacy: code stays local by default. API requires explicit opt-in.
Large File Handling
Files under 1000 lines: index whole file. Files over 1000 lines: summarize with warning "Large file, partial index."
No chunking for MVP. Note the limitation, move on.
Indexing Prompt
Versioned prompt for structured extraction:
Analyze this source file and provide:
1. A one-sentence summary of what this file does
2. A paragraph describing relationships to other files (imports, exports, dependencies)
3. A list of key symbols (functions, classes, structs, enums) with:
- name
- kind (function/class/struct/enum/const)
- start and end line numbers
- one-sentence description
Output as YAML.
Store prompt_version in file_index. When prompt changes, all entries are stale.
CLI Commands
# Bootstrap: index everything from scratch
blue index --all
# Install git pre-commit hook
blue index --install-hook
# Index staged files (called by hook)
blue index --diff
# Index single file
blue index --file src/domain.rs
# Refresh stale entries
blue index --refresh
# Check index freshness
blue index status
# Search the index
blue search "S3 permissions"
blue search --symbols "validate"
# Impact analysis
blue impact src/domain.rs
MCP Tools
| Tool | Purpose |
|---|---|
blue_index_realm |
Index all files in current realm |
blue_index_file |
Index a single file |
blue_index_status |
Show index freshness |
blue_index_search |
Search across indexed files |
blue_index_impact |
Show files depending on target |
Non-Goals
- Cross-realm search (scope to single realm for MVP)
- Automatic relationship storage (query-time only)
- Required embeddings (FTS5 is sufficient, embeddings are optional)
- Language-specific parsing (AI inference works across languages)
Test Plan
- Schema created in blue.db on first index
blue index --allindexes all files in realm, extracts symbolsblue index --diffindexes only staged filesblue index --fileindexes single file, updates existing entryblue index --install-hookcreates valid pre-commit hookblue index --refreshre-indexes stale entries onlyblue index statusshows staleness accuratelyblue searchreturns relevant files ranked by match qualityblue impactshows files with symbols referencing target- Staleness detection works (file hash comparison)
- Prompt version tracked; old versions marked stale
- Qwen2.5:3b produces valid YAML output
- Large files (>1000 lines) get partial index warning
- Ollama integration works for local indexing
--modelflag allows override to different model- MCP tools available and functional
- FTS5 search handles partial matches
- Pre-commit hook runs without blocking commit on failure
- Relationships field searchable via FTS5
Implementation Plan
- Add schema to blue.db (file_index, symbol_index, FTS5 tables)
- Create versioned indexing prompt for structured YAML extraction
- Implement Ollama integration with qwen2.5:3b default
- Implement
blue index --allfor bootstrap - Implement
blue index --difffor staged files - Implement
blue index --filefor single-file updates - Implement
blue index --install-hookfor git hook setup - Implement
blue index --refreshfor stale entry updates - Implement
blue index statusfor freshness reporting - Add large file handling (>1000 lines warning)
- Implement
blue searchwith FTS5 backend - Implement
blue impactfor dependency queries - Add MCP tools (5 tools)
- Add
--modelflag for model override - Optional: embedding column support
Open Questions (Resolved)
| Question | Resolution | Alignment |
|---|---|---|
| Storage backend | SQLite + FTS5, optional embedding column | 92% |
| Update triggers | Git pre-commit hook on diff, --all for bootstrap |
98% |
| Relationships | AI-generated descriptions stored at index time | 96% |
| AI model | Qwen2.5:3b via Ollama, --model for override |
94% |
| Granularity | Symbol-level with line numbers | 92% |
| Large files | Whole-file <1000 lines, warning for larger | 92% |
| Prompt design | Structured YAML, versioned | 96% |
"Index the realm. Know the impact. Change with confidence."
— Blue