Spike: Realm Semantic Index
| Status | In Progress |
| Date | 2026-01-24 |
| Time Box | 4 hours |
Question
How can we create an AI-maintained semantic index of files within a realm that tracks what each file does (with line references) and how it relates to other files, and that enables semantic search for change impact analysis?
Context
Realms coordinate across repos. Domains define relationships (provider/consumer, exports/imports). But when a file changes, there's no quick way to know:
- What does this file actually do?
- What other files depend on it?
- What's the blast radius of a change?
We want an AI-maintained index that answers these questions via semantic search.
Design Space
What Gets Indexed
For each file in a realm:
```yaml
file: src/realm/domain.rs
last_indexed: 2026-01-24T10:30:00Z
hash: abc123  # for change detection
summary: "Domain definitions for cross-repo coordination. Defines Domain, Binding, ExportBinding, ImportBinding types."
symbols:
  - name: Domain
    kind: struct
    lines: [13, 73]
    description: "A coordination context between repos with name, description, creation time, and member list"
  - name: Binding
    kind: struct
    lines: [76, 143]
    description: "Declares what a repo exports or imports in a domain"
  - name: ImportStatus
    kind: enum
    lines: [259, 274]
    description: "Status of an import binding: Pending, Current, Outdated, Broken"
relationships:
  - target: src/realm/service.rs
    kind: used_by
    description: "RealmService uses Domain and Binding to manage cross-repo state"
  - target: src/realm/repo.rs
    kind: used_by
    description: "Repo operations load/save Domain and Binding files"
```
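As a rough illustration of how an entry like the one above could be held in memory, here is a minimal Rust sketch. The names `Symbol`, `FileEntry`, `symbol_at`, and `example_entry` are assumptions made for this example, not existing Blue types, and relationships are left out to keep it short.

```rust
/// One indexed symbol, mirroring an item in the `symbols` list.
#[derive(Debug, Clone, PartialEq)]
pub struct Symbol {
    pub name: String,
    pub kind: String,
    /// Inclusive (start, end) line range within the file.
    pub lines: (u32, u32),
    pub description: String,
}

/// One indexed file (hypothetical shape; relationships omitted).
#[derive(Debug, Clone, PartialEq)]
pub struct FileEntry {
    pub file: String,
    pub hash: String,
    pub summary: String,
    pub symbols: Vec<Symbol>,
}

impl FileEntry {
    /// Returns the symbol whose line range contains `line`, if any.
    pub fn symbol_at(&self, line: u32) -> Option<&Symbol> {
        self.symbols
            .iter()
            .find(|s| s.lines.0 <= line && line <= s.lines.1)
    }
}

/// Builds the example entry from the text, with shortened descriptions.
pub fn example_entry() -> FileEntry {
    let sym = |name: &str, kind: &str, lines: (u32, u32), description: &str| Symbol {
        name: name.to_string(),
        kind: kind.to_string(),
        lines,
        description: description.to_string(),
    };
    FileEntry {
        file: "src/realm/domain.rs".to_string(),
        hash: "abc123".to_string(),
        summary: "Domain definitions for cross-repo coordination.".to_string(),
        symbols: vec![
            sym("Domain", "struct", (13, 73), "A coordination context between repos"),
            sym("Binding", "struct", (76, 143), "Declares what a repo exports or imports in a domain"),
            sym("ImportStatus", "enum", (259, 274), "Status of an import binding"),
        ],
    }
}
```

Storing line ranges per symbol means a changed line from a diff can be mapped back to the symbol it touches, which is the starting point for impact analysis.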
Storage Options
| Option | Pros | Cons |
|---|---|---|
| SQLite + FTS5 | Already have blue.db, full-text search built-in | No semantic/vector search |
| SQLite + sqlite-vec | Vector similarity search, keeps single DB | Requires extension, Rust bindings unclear |
| Separate JSON files | Human-readable, git-tracked | Slow to search at scale |
| Embedded vector DB (lancedb) | Purpose-built for semantic search | Another dependency |
Recommendation: Start with SQLite + FTS5 for keyword search. Add embeddings later if needed.
Index Update Triggers
- On-demand - `blue index` command regenerates
- Git hook - Post-commit hook calls `blue index --changed`
- File watcher - Daemon watches for changes (already have daemon infrastructure)
- MCP tool - `blue_index_file` for AI agents to update during work
We likely want a combination: the daemon's file watcher plus on-demand refresh.
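Whichever trigger fires, the indexer has to decide which files actually need re-indexing. A minimal sketch of hash-based change detection follows. The function names `content_hash` and `stale_files` are hypothetical, and `DefaultHasher` is only a dependency-free stand-in: its algorithm is not guaranteed stable across Rust releases, so a persisted index would more likely use a cryptographic digest.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Stand-in content hash (assumption: a real index would use a stable digest).
pub fn content_hash(content: &str) -> String {
    let mut hasher = DefaultHasher::new();
    content.hash(&mut hasher);
    format!("{:016x}", hasher.finish())
}

/// Returns the paths that need re-indexing: files that are new, or whose
/// current hash differs from the one stored in the index.
/// The result is sorted so the output is stable.
pub fn stale_files(
    indexed: &HashMap<String, String>, // path -> hash stored in the index
    current: &HashMap<String, String>, // path -> hash of the working tree
) -> Vec<String> {
    let mut stale: Vec<String> = current
        .iter()
        .filter(|(path, hash)| indexed.get(*path) != Some(*hash))
        .map(|(path, _)| path.clone())
        .collect();
    stale.sort();
    stale
}
```

This sketch ignores files that were deleted since the last run; those would need a separate pass that removes index rows with no matching path.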
Semantic Search Approaches
Phase 1: Keyword + Structure
- FTS5 for text search across summaries and descriptions
- Filter by file path, symbol kind, relationship type
- Good enough for "find files related to authentication"
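To make the Phase 1 query shape concrete, here is an in-memory stand-in written in plain Rust. It is not FTS5 and does no ranking or tokenization beyond whitespace splitting. `SymbolRow` and `search` are hypothetical names chosen for this sketch.

```rust
/// A flattened symbol row, roughly what a search would run over.
#[derive(Debug, Clone, PartialEq)]
pub struct SymbolRow {
    pub file_path: String,
    pub name: String,
    pub kind: String,
    pub description: String,
}

/// Returns rows whose name or description contains every whitespace-separated
/// query term (case-insensitive), optionally restricted to one symbol kind.
/// An empty query matches every row that passes the kind filter.
pub fn search<'a>(rows: &'a [SymbolRow], query: &str, kind: Option<&str>) -> Vec<&'a SymbolRow> {
    let terms: Vec<String> = query.split_whitespace().map(|t| t.to_lowercase()).collect();
    rows.iter()
        // Structural filter: symbol kind.
        .filter(|row| kind.map_or(true, |k| row.kind == k))
        // Keyword filter: all terms must appear somewhere in name + description.
        .filter(|row| {
            let haystack = format!("{} {}", row.name, row.description).to_lowercase();
            terms.iter().all(|t| haystack.contains(t.as_str()))
        })
        .collect()
}
```

A query such as `search(&rows, "import status", Some("enum"))` combines both filters. In the real index the keyword part would be handled by FTS5 and the structural part by ordinary `WHERE` clauses.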
Phase 2: Embeddings
- Generate embeddings for each symbol description
- Store in sqlite-vec or similar
- Query: "what handles S3 bucket permissions" → vector similarity
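The vector-similarity step in Phase 2 usually comes down to cosine similarity between a query embedding and each stored embedding. The sketch below shows that ranking in plain Rust. It assumes embeddings are already available as `Vec<f32>`; producing them and storing them in sqlite-vec are out of scope here. `cosine_similarity` and `top_k` are illustrative names.

```rust
/// Cosine similarity between two vectors.
/// Returns None if the lengths differ or either vector has zero magnitude.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}

/// Ranks (label, embedding) pairs by similarity to `query`, best first,
/// keeping at most `k` results. Entries that cannot be compared are skipped.
pub fn top_k<'a>(
    query: &[f32],
    items: &'a [(String, Vec<f32>)],
    k: usize,
) -> Vec<(&'a str, f32)> {
    let mut scored: Vec<(&str, f32)> = items
        .iter()
        .filter_map(|(label, emb)| cosine_similarity(query, emb).map(|s| (label.as_str(), s)))
        .collect();
    // Sort descending by score; total_cmp gives a total order over f32.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.truncate(k);
    scored
}
```

A linear scan like this is fine for a few thousand symbols. Past that, an indexed vector store is the reason to reach for sqlite-vec or similar.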
Relationship Detection
AI needs to identify relationships. Approaches:
- Static analysis - Parse imports/uses (language-specific, complex)
- AI inference - "Given file A and file B, describe their relationship"
- Explicit declarations - Like current ExportBinding/ImportBinding
- Hybrid - AI suggests, human confirms
Recommendation: AI inference with caching. When indexing file A, ask AI to describe relationships to files it references.
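One possible shape for the caching half of that recommendation is sketched below: results are keyed by the content hashes of both files, so an unchanged pair never triggers a second inference call. `RelationshipCache` and `get_or_infer` are hypothetical names, and the `infer` closure stands in for whatever AI call is eventually used.

```rust
use std::collections::HashMap;

/// Caches relationship descriptions keyed by (source_hash, target_hash).
#[derive(Default)]
pub struct RelationshipCache {
    entries: HashMap<(String, String), String>,
}

impl RelationshipCache {
    /// Returns the cached description for the hash pair, or calls `infer`
    /// once and stores its result. `infer` is a placeholder for the AI call.
    pub fn get_or_infer<F>(&mut self, source_hash: &str, target_hash: &str, infer: F) -> &str
    where
        F: FnOnce() -> String,
    {
        self.entries
            .entry((source_hash.to_string(), target_hash.to_string()))
            .or_insert_with(infer)
            .as_str()
    }

    /// Number of file pairs currently cached.
    pub fn cached_pairs(&self) -> usize {
        self.entries.len()
    }
}
```

Keying on hashes rather than paths means the cache invalidates itself: when either file changes, its hash changes and the old entry is simply never looked up again.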
Proposed Schema
```sql
-- File-level index
CREATE TABLE file_index (
    id INTEGER PRIMARY KEY,
    realm TEXT NOT NULL,
    repo TEXT NOT NULL,
    file_path TEXT NOT NULL,
    file_hash TEXT NOT NULL,
    summary TEXT,
    indexed_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(realm, repo, file_path)
);

-- Symbol-level index
CREATE TABLE symbol_index (
    id INTEGER PRIMARY KEY,
    file_id INTEGER REFERENCES file_index(id),
    name TEXT NOT NULL,
    kind TEXT NOT NULL,  -- struct, fn, enum, class, etc.
    start_line INTEGER,
    end_line INTEGER,
    description TEXT
);

-- Relationships between files
CREATE TABLE file_relationships (
    id INTEGER PRIMARY KEY,
    source_file_id INTEGER REFERENCES file_index(id),
    target_file_id INTEGER REFERENCES file_index(id),
    kind TEXT NOT NULL,  -- uses, used_by, imports, exports, tests
    description TEXT
);

-- FTS5 virtual table for search.
-- symbol_names and symbol_descriptions are aggregated from symbol_index,
-- so this table stores its own copy of the text instead of using
-- content=file_index: external-content mode requires every FTS column
-- to exist in the content table, and file_index lacks the symbol columns.
-- Insert rows with rowid = file_index.id to join back to the file.
CREATE VIRTUAL TABLE file_search USING fts5(
    file_path,
    summary,
    symbol_names,
    symbol_descriptions
);
```
Proposed MCP Tools
| Tool | Purpose |
|---|---|
| `blue_index_realm` | Index all files in a realm |
| `blue_index_file` | Index a single file (for incremental updates) |
| `blue_index_search` | Semantic search across the index |
| `blue_index_impact` | Given a file, show what depends on it |
| `blue_index_status` | Show indexing status and staleness |
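The impact lookup is essentially a reverse-dependency traversal over the relationship data. A minimal sketch, assuming relationships are available as `(source, target)` pairs meaning "source uses target", is shown below. `impacted_files` is a hypothetical helper, not the tool's actual implementation.

```rust
use std::collections::{BTreeSet, HashMap, VecDeque};

/// Given (source, target) pairs meaning "source uses target", returns every
/// file that depends on `changed`, directly or transitively.
pub fn impacted_files(uses: &[(&str, &str)], changed: &str) -> BTreeSet<String> {
    // Reverse adjacency: target -> files that use it.
    let mut used_by: HashMap<&str, Vec<&str>> = HashMap::new();
    for (source, target) in uses {
        used_by.entry(*target).or_default().push(*source);
    }

    // Breadth-first walk outward from the changed file.
    let mut seen: BTreeSet<String> = BTreeSet::new();
    let mut queue: VecDeque<&str> = VecDeque::from([changed]);
    while let Some(file) = queue.pop_front() {
        for dependent in used_by.get(file).into_iter().flatten() {
            // `insert` returns true only the first time a file is seen,
            // which also keeps the walk from looping on cycles.
            if *dependent != changed && seen.insert(dependent.to_string()) {
                queue.push_back(*dependent);
            }
        }
    }
    seen
}
```

For example, if `src/realm/service.rs` and `src/realm/repo.rs` both use `src/realm/domain.rs`, and some other file uses `service.rs`, then changing `domain.rs` returns all three. The size of that set is a first approximation of the blast radius.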
Open Questions
- Which AI model for indexing? Local (Ollama) for cost, or API for quality?
- How to handle large files? Chunk by function/class? Summary only?
- Cross-realm relationships? Index within realm first, cross-realm later?
- Embedding model? If we go vector route, which embedding model?
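On the large-file question, one option is to reuse the symbol line ranges already stored in the index and send the model one symbol at a time. The sketch below extracts the text for a line range. `chunk_for_lines` is a hypothetical helper, and whether chunking by symbol is the right strategy remains open.

```rust
/// Extracts the text for a 1-based, inclusive line range.
/// Returns None if the range is empty, starts at 0, or runs past the end
/// of the source.
pub fn chunk_for_lines(source: &str, start: usize, end: usize) -> Option<String> {
    if start == 0 || end < start {
        return None;
    }
    let wanted = end - start + 1;
    let lines: Vec<&str> = source.lines().skip(start - 1).take(wanted).collect();
    if lines.len() != wanted {
        return None; // range extends beyond the file
    }
    Some(lines.join("\n"))
}
```

With the example entry, `chunk_for_lines(&source, 13, 73)` would yield just the `Domain` struct, keeping each request small regardless of total file size.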
Next Steps
If this spike looks good:
- Create RFC for the full design
- Start with SQLite schema + FTS5
- Add `blue_index_file` tool that takes AI-generated index data
- Add daemon file watcher for auto-indexing
Investigation notes by Blue