Every document filename now mirrors its lifecycle state with a status suffix (e.g., .draft.md, .wip.md, .accepted.md). No more bare .md for tracked document types. Also renamed all from_str methods to parse to avoid FromStr trait confusion, introduced StagingDeploymentParams struct, and fixed all 19 clippy warnings across the codebase. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
180 lines
5.4 KiB
Markdown
180 lines
5.4 KiB
Markdown
# Spike: Realm Semantic Index
|
|
|
|
| | |
|
|
|---|---|
|
|
| **Status** | In Progress |
|
|
| **Date** | 2026-01-24 |
|
|
| **Time Box** | 4 hours |
|
|
|
|
---
|
|
|
|
## Question
|
|
|
|
How can we create an AI-maintained semantic index of files within a realm, tracking what each file does (with line references), its relationships to other files, and enabling semantic search for change impact analysis?
|
|
|
|
---
|
|
|
|
## Context
|
|
|
|
Realms coordinate across repos. Domains define relationships (provider/consumer, exports/imports). But when a file changes, there's no quick way to know:
|
|
- What does this file actually do?
|
|
- What other files depend on it?
|
|
- What's the blast radius of a change?
|
|
|
|
We want an AI-maintained index that answers these questions via semantic search.
|
|
|
|
## Design Space
|
|
|
|
### What Gets Indexed
|
|
|
|
For each file in a realm:
|
|
|
|
```yaml
|
|
file: src/realm/domain.rs
|
|
last_indexed: 2026-01-24T10:30:00Z
|
|
hash: abc123 # for change detection
|
|
|
|
summary: "Domain definitions for cross-repo coordination. Defines Domain, Binding, ExportBinding, ImportBinding types."
|
|
|
|
symbols:
|
|
- name: Domain
|
|
kind: struct
|
|
lines: [13, 73]
|
|
description: "A coordination context between repos with name, description, creation time, and member list"
|
|
|
|
- name: Binding
|
|
kind: struct
|
|
lines: [76, 143]
|
|
description: "Declares what a repo exports or imports in a domain"
|
|
|
|
- name: ImportStatus
|
|
kind: enum
|
|
lines: [259, 274]
|
|
description: "Status of an import binding: Pending, Current, Outdated, Broken"
|
|
|
|
relationships:
|
|
- target: src/realm/service.rs
|
|
kind: used_by
|
|
description: "RealmService uses Domain and Binding to manage cross-repo state"
|
|
|
|
- target: src/realm/repo.rs
|
|
kind: used_by
|
|
description: "Repo operations load/save Domain and Binding files"
|
|
```
|
|
|
|
### Storage Options
|
|
|
|
| Option | Pros | Cons |
|
|
|--------|------|------|
|
|
| **SQLite + FTS5** | Already have blue.db, full-text search built-in | No semantic/vector search |
|
|
| **SQLite + sqlite-vec** | Vector similarity search, keeps single DB | Requires extension, Rust bindings unclear |
|
|
| **Separate JSON files** | Human-readable, git-tracked | Slow to search at scale |
|
|
| **Embedded vector DB (lancedb)** | Purpose-built for semantic search | Another dependency |
|
|
|
|
**Recommendation:** Start with SQLite + FTS5 for keyword search. Add embeddings later if needed.
|
|
|
|
### Index Update Triggers
|
|
|
|
1. **On-demand** - `blue index` command regenerates
|
|
2. **Git hook** - Post-commit hook calls `blue index --changed`
|
|
3. **File watcher** - Daemon watches for changes (already have daemon infrastructure)
|
|
4. **MCP tool** - `blue_index_file` for AI agents to update during work
|
|
|
|
Likely want combination: daemon watches + on-demand refresh.
|
|
|
|
### Semantic Search Approaches
|
|
|
|
**Phase 1: Keyword + Structure**
|
|
- FTS5 for text search across summaries and descriptions
|
|
- Filter by file path, symbol kind, relationship type
|
|
- Good enough for "find files related to authentication"
|
|
|
|
**Phase 2: Embeddings**
|
|
- Generate embeddings for each symbol description
|
|
- Store in sqlite-vec or similar
|
|
- Query: "what handles S3 bucket permissions" → vector similarity
|
|
|
|
### Relationship Detection
|
|
|
|
AI needs to identify relationships. Approaches:
|
|
|
|
1. **Static analysis** - Parse imports/uses (language-specific, complex)
|
|
2. **AI inference** - "Given file A and file B, describe their relationship"
|
|
3. **Explicit declarations** - Like current ExportBinding/ImportBinding
|
|
4. **Hybrid** - AI suggests, human confirms
|
|
|
|
**Recommendation:** AI inference with caching. When indexing file A, ask AI to describe relationships to files it references.
|
|
|
|
## Proposed Schema
|
|
|
|
```sql
|
|
-- File-level index
|
|
CREATE TABLE file_index (
|
|
id INTEGER PRIMARY KEY,
|
|
realm TEXT NOT NULL,
|
|
repo TEXT NOT NULL,
|
|
file_path TEXT NOT NULL,
|
|
file_hash TEXT NOT NULL,
|
|
summary TEXT,
|
|
indexed_at DATETIME DEFAULT CURRENT_TIMESTAMP,
|
|
UNIQUE(realm, repo, file_path)
|
|
);
|
|
|
|
-- Symbol-level index
|
|
CREATE TABLE symbol_index (
|
|
id INTEGER PRIMARY KEY,
|
|
file_id INTEGER REFERENCES file_index(id),
|
|
name TEXT NOT NULL,
|
|
kind TEXT NOT NULL, -- struct, fn, enum, class, etc.
|
|
start_line INTEGER,
|
|
end_line INTEGER,
|
|
description TEXT
|
|
);
|
|
|
|
-- Relationships between files
|
|
CREATE TABLE file_relationships (
|
|
id INTEGER PRIMARY KEY,
|
|
source_file_id INTEGER REFERENCES file_index(id),
|
|
target_file_id INTEGER REFERENCES file_index(id),
|
|
kind TEXT NOT NULL, -- uses, used_by, imports, exports, tests
|
|
description TEXT
|
|
);
|
|
|
|
-- FTS5 virtual table for search
|
|
CREATE VIRTUAL TABLE file_search USING fts5(
|
|
file_path,
|
|
summary,
|
|
symbol_names,
|
|
symbol_descriptions,
|
|
content=file_index
|
|
);
|
|
```
|
|
|
|
## Proposed MCP Tools
|
|
|
|
| Tool | Purpose |
|
|
|------|---------|
|
|
| `blue_index_realm` | Index all files in a realm |
|
|
| `blue_index_file` | Index a single file (for incremental updates) |
|
|
| `blue_index_search` | Semantic search across the index |
|
|
| `blue_index_impact` | Given a file, show what depends on it |
|
|
| `blue_index_status` | Show indexing status and staleness |
|
|
|
|
## Open Questions
|
|
|
|
1. **Which AI model for indexing?** Local (Ollama) for cost, or API for quality?
|
|
2. **How to handle large files?** Chunk by function/class? Summary only?
|
|
3. **Cross-realm relationships?** Index within realm first, cross-realm later?
|
|
4. **Embedding model?** If we go vector route, which embedding model?
|
|
|
|
## Next Steps
|
|
|
|
If this spike looks good:
|
|
1. Create RFC for the full design
|
|
2. Start with SQLite schema + FTS5
|
|
3. Add `blue_index_file` tool that takes AI-generated index data
|
|
4. Add daemon file watcher for auto-indexing
|
|
|
|
---
|
|
|
|
*Investigation notes by Blue*
|