blue/.blue/docs/spikes/2026-01-24T0800Z-realm-semantic-index.wip.md

Spike: Realm Semantic Index

| | |
|---|---|
| Status | In Progress |
| Date | 2026-01-24 |
| Time Box | 4 hours |

Question

How can we create an AI-maintained semantic index of the files within a realm that tracks what each file does (with line references) and how it relates to other files, and that enables semantic search for change impact analysis?


Context

Realms coordinate across repos. Domains define relationships (provider/consumer, exports/imports). But when a file changes, there's no quick way to know:

  • What does this file actually do?
  • What other files depend on it?
  • What's the blast radius of a change?

We want an AI-maintained index that answers these questions via semantic search.

Design Space

What Gets Indexed

For each file in a realm:

file: src/realm/domain.rs
last_indexed: 2026-01-24T10:30:00Z
hash: abc123  # for change detection

summary: "Domain definitions for cross-repo coordination. Defines Domain, Binding, ExportBinding, ImportBinding types."

symbols:
  - name: Domain
    kind: struct
    lines: [13, 73]
    description: "A coordination context between repos with name, description, creation time, and member list"

  - name: Binding
    kind: struct
    lines: [76, 143]
    description: "Declares what a repo exports or imports in a domain"

  - name: ImportStatus
    kind: enum
    lines: [259, 274]
    description: "Status of an import binding: Pending, Current, Outdated, Broken"

relationships:
  - target: src/realm/service.rs
    kind: used_by
    description: "RealmService uses Domain and Binding to manage cross-repo state"

  - target: src/realm/repo.rs
    kind: used_by
    description: "Repo operations load/save Domain and Binding files"
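The entry shape above maps onto a handful of record types. A minimal sketch in Python, assuming SHA-256 for the `hash` field (the algorithm is not decided in this spike); `content_hash` and `is_stale` are hypothetical helper names:

```python
from dataclasses import dataclass, field
import hashlib


@dataclass
class Symbol:
    name: str
    kind: str                 # struct, fn, enum, ...
    lines: tuple[int, int]    # (start_line, end_line)
    description: str


@dataclass
class Relationship:
    target: str               # path of the related file
    kind: str                 # e.g. used_by
    description: str


@dataclass
class FileEntry:
    file: str
    hash: str
    summary: str
    symbols: list[Symbol] = field(default_factory=list)
    relationships: list[Relationship] = field(default_factory=list)


def content_hash(content: bytes) -> str:
    """Hex SHA-256 of the file content (the hash algorithm is an assumption)."""
    return hashlib.sha256(content).hexdigest()


def is_stale(entry: FileEntry, content: bytes) -> bool:
    """True when the file content no longer matches the hash stored at index time."""
    return entry.hash != content_hash(content)
```

Comparing hashes rather than timestamps keeps change detection stable across checkouts and rebases, where modification times change but content does not.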

Storage Options

| Option | Pros | Cons |
|---|---|---|
| SQLite + FTS5 | Already have blue.db, full-text search built-in | No semantic/vector search |
| SQLite + sqlite-vec | Vector similarity search, keeps single DB | Requires extension, Rust bindings unclear |
| Separate JSON files | Human-readable, git-tracked | Slow to search at scale |
| Embedded vector DB (lancedb) | Purpose-built for semantic search | Another dependency |

Recommendation: Start with SQLite + FTS5 for keyword search. Add embeddings later if needed.

Index Update Triggers

  1. On-demand - the blue index command regenerates the index
  2. Git hook - a post-commit hook calls blue index --changed
  3. File watcher - the daemon watches for changes (we already have daemon infrastructure)
  4. MCP tool - blue_index_file lets AI agents update the index during work

We likely want a combination: daemon file watching plus on-demand refresh.
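Whichever trigger fires, the indexer needs to decide which files actually need work. How `--changed` would decide is not specified here; the sketch below assumes SHA-256 content hashes and a hypothetical `plan_reindex` helper that compares the stored hashes with the current file contents:

```python
import hashlib


def plan_reindex(indexed: dict[str, str], current: dict[str, bytes]):
    """Compare stored hashes with current contents.

    indexed: path -> hash recorded when the file was last indexed
    current: path -> current file bytes
    Returns (paths to index or re-index, paths to drop from the index), both sorted.
    """
    to_index = sorted(
        path
        for path, content in current.items()
        if indexed.get(path) != hashlib.sha256(content).hexdigest()
    )
    to_drop = sorted(path for path in indexed if path not in current)
    return to_index, to_drop
```

New files and modified files both land in the first list, since a path with no stored hash never matches. Deleted files land in the second so their rows can be removed.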

Semantic Search Approaches

Phase 1: Keyword + Structure

  • FTS5 for text search across summaries and descriptions
  • Filter by file path, symbol kind, relationship type
  • Good enough for "find files related to authentication"
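A Phase 1 query could look roughly like this. The sketch uses Python's stdlib sqlite3 against an in-memory database and assumes the SQLite build includes FTS5; the two-column table, the sample rows (including the src/auth/login.rs path), and the `search` helper are illustrative only:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE file_search USING fts5(file_path, summary)")
conn.executemany(
    "INSERT INTO file_search (file_path, summary) VALUES (?, ?)",
    [
        ("src/auth/login.rs", "Handles user authentication and session tokens"),
        ("src/realm/domain.rs", "Domain definitions for cross-repo coordination"),
    ],
)


def search(conn, query, path_prefix=""):
    """Return matching file paths, best match first, optionally filtered by path prefix."""
    rows = conn.execute(
        "SELECT file_path FROM file_search "
        "WHERE file_search MATCH ? AND file_path LIKE ? || '%' "
        "ORDER BY rank",
        (query, path_prefix),
    )
    return [row[0] for row in rows]
```

The default FTS5 tokenizer does not stem, so "auth" will not match "authentication" unless the query is written as a prefix query (`auth*`). That gap is part of what Phase 2 is meant to close.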

Phase 2: Embeddings

  • Generate embeddings for each symbol description
  • Store in sqlite-vec or similar
  • Query: "what handles S3 bucket permissions" → vector similarity
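Stripped of storage concerns, the Phase 2 query is a nearest-neighbour ranking over description embeddings. A brute-force sketch, assuming embeddings are plain lists of floats produced by some model not chosen yet; `cosine` and `top_k` are hypothetical helpers, and a real implementation would likely delegate this to sqlite-vec or similar:

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, symbol_vecs, k=3):
    """Rank symbol names by cosine similarity to the query vector, best first."""
    scored = sorted(
        symbol_vecs.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in scored[:k]]
```

A linear scan like this may be acceptable for a single realm's worth of symbols, which would make it a cheap way to evaluate embedding quality before committing to a vector store.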

Relationship Detection

AI needs to identify relationships. Approaches:

  1. Static analysis - Parse imports/uses (language-specific, complex)
  2. AI inference - "Given file A and file B, describe their relationship"
  3. Explicit declarations - Like current ExportBinding/ImportBinding
  4. Hybrid - AI suggests, human confirms

Recommendation: AI inference with caching. When indexing file A, ask the AI to describe A's relationships to the files it references, and cache the result so unchanged file pairs are not re-analyzed.
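One way to cache the inference is to key each result on the content hashes of both files, so any edit to either side invalidates it. A sketch, where `describe` stands in for whatever model call is eventually chosen and `RelationshipCache` is a hypothetical name:

```python
from typing import Callable


class RelationshipCache:
    """Caches AI-inferred relationship descriptions per (file, hash) pair."""

    def __init__(self, describe: Callable[[str, str], str]):
        self._describe = describe  # stand-in for the model call
        self._cache: dict[tuple[str, str, str, str], str] = {}

    def get(self, source: str, source_hash: str, target: str, target_hash: str) -> str:
        """Return the cached description, calling the model only on a miss."""
        key = (source, source_hash, target, target_hash)
        if key not in self._cache:
            self._cache[key] = self._describe(source, target)
        return self._cache[key]
```

Because the hashes are part of the key, stale entries are never returned; they are simply never looked up again and could be pruned on a schedule.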

Proposed Schema

-- File-level index
CREATE TABLE file_index (
    id INTEGER PRIMARY KEY,
    realm TEXT NOT NULL,
    repo TEXT NOT NULL,
    file_path TEXT NOT NULL,
    file_hash TEXT NOT NULL,
    summary TEXT,
    indexed_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(realm, repo, file_path)
);

-- Symbol-level index
CREATE TABLE symbol_index (
    id INTEGER PRIMARY KEY,
    file_id INTEGER REFERENCES file_index(id),
    name TEXT NOT NULL,
    kind TEXT NOT NULL,  -- struct, fn, enum, class, etc.
    start_line INTEGER,
    end_line INTEGER,
    description TEXT
);

-- Relationships between files
CREATE TABLE file_relationships (
    id INTEGER PRIMARY KEY,
    source_file_id INTEGER REFERENCES file_index(id),
    target_file_id INTEGER REFERENCES file_index(id),
    kind TEXT NOT NULL,  -- uses, used_by, imports, exports, tests
    description TEXT
);

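-- FTS5 virtual table for search.
-- Not declared with content=file_index: symbol_names and symbol_descriptions
-- are aggregated from symbol_index and do not exist as columns in file_index,
-- so the indexer writes rows here directly, using rowid = file_index.id.
CREATE VIRTUAL TABLE file_search USING fts5(
    file_path,
    summary,
    symbol_names,
    symbol_descriptions
);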
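The search table's symbol columns have to be assembled from symbol_index. A sketch of one way to do that, using trimmed-down versions of the tables above and treating file_search as an ordinary FTS5 table keyed by file id (an assumption); `refresh_search_row` is a hypothetical helper, and the SQLite build is assumed to include FTS5:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE file_index (id INTEGER PRIMARY KEY, file_path TEXT NOT NULL, summary TEXT);
CREATE TABLE symbol_index (
    id INTEGER PRIMARY KEY,
    file_id INTEGER REFERENCES file_index(id),
    name TEXT NOT NULL,
    description TEXT
);
CREATE VIRTUAL TABLE file_search USING fts5(
    file_path, summary, symbol_names, symbol_descriptions
);
""")


def refresh_search_row(conn, file_id):
    """Rebuild the file_search row for one file from file_index and symbol_index."""
    conn.execute("DELETE FROM file_search WHERE rowid = ?", (file_id,))
    conn.execute(
        """
        INSERT INTO file_search (rowid, file_path, summary, symbol_names, symbol_descriptions)
        SELECT f.id, f.file_path, f.summary,
               group_concat(s.name, ' '), group_concat(s.description, ' ')
        FROM file_index f LEFT JOIN symbol_index s ON s.file_id = f.id
        WHERE f.id = ?
        GROUP BY f.id
        """,
        (file_id,),
    )
```

Deleting before inserting makes the refresh idempotent, so blue_index_file can call it after every re-index without accumulating duplicate rows.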

Proposed MCP Tools

| Tool | Purpose |
|---|---|
| blue_index_realm | Index all files in a realm |
| blue_index_file | Index a single file (for incremental updates) |
| blue_index_search | Semantic search across the index |
| blue_index_impact | Given a file, show what depends on it |
| blue_index_status | Show indexing status and staleness |
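blue_index_impact could be a transitive query over file_relationships. A sketch with trimmed-down tables, assuming edges are stored as "source uses target" (the YAML example earlier records the inverse, used_by, so the stored direction is still an open choice); the `impact` function name is hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE file_index (id INTEGER PRIMARY KEY, file_path TEXT NOT NULL);
CREATE TABLE file_relationships (
    id INTEGER PRIMARY KEY,
    source_file_id INTEGER REFERENCES file_index(id),
    target_file_id INTEGER REFERENCES file_index(id),
    kind TEXT NOT NULL
);
""")


def impact(conn, file_path):
    """Paths of files that directly or transitively use the given file."""
    rows = conn.execute(
        """
        WITH RECURSIVE dependents(id) AS (
            -- direct users of the changed file
            SELECT r.source_file_id
            FROM file_relationships r
            JOIN file_index f ON f.id = r.target_file_id
            WHERE f.file_path = ? AND r.kind = 'uses'
            UNION
            -- users of anything already found
            SELECT r.source_file_id
            FROM file_relationships r
            JOIN dependents d ON r.target_file_id = d.id
            WHERE r.kind = 'uses'
        )
        SELECT f.file_path
        FROM dependents d JOIN file_index f ON f.id = d.id
        ORDER BY f.file_path
        """,
        (file_path,),
    )
    return [row[0] for row in rows]
```

UNION rather than UNION ALL deduplicates rows as they are found, which is what lets the recursion terminate when files depend on each other in a cycle.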

Open Questions

  1. Which AI model for indexing? Local (Ollama) for cost, or API for quality?
  2. How to handle large files? Chunk by function/class? Summary only?
  3. Cross-realm relationships? Index within realm first, cross-realm later?
  4. Embedding model? If we go vector route, which embedding model?

Next Steps

If this spike looks good:

  1. Create RFC for the full design
  2. Start with SQLite schema + FTS5
  3. Add blue_index_file tool that takes AI-generated index data
  4. Add daemon file watcher for auto-indexing

Investigation notes by Blue