blue/.blue/docs/rfcs/0005-local-llm-integration.md

# RFC 0005: Local Llm Integration

| | |
|---|---|
| **Status** | Implemented |
| **Date** | 2026-01-24 |
| **Source Spike** | local-llm-integration, agentic-cli-integration |

---

## Summary

Blue needs local LLM capabilities for:
1. **Semantic tasks** - ADR relevance, runbook matching, dialogue summarization (lightweight, fast)
2. **Agentic coding** - Full code generation via Goose integration (heavyweight, powerful)

Unified approach: **Ollama as shared backend** + **Goose for agentic tasks** + **Blue's LlmProvider for semantic tasks**.

Must support CUDA > MPS > CPU backend priority.

## Background

### Two Use Cases

| Use Case | Latency | Complexity | Tool |
|----------|---------|------------|------|
| **Semantic tasks** | <500ms | Short prompts, structured output | Blue internal |
| **Agentic coding** | Minutes | Multi-turn, code generation | Goose |

### Blue's Semantic Tasks

| Feature | RFC | Need |
|---------|-----|------|
| ADR Relevance | 0004 | Match context to philosophical ADRs |
| Runbook Lookup | 0002 | Semantic action matching |
| Dialogue Summary | 0001 | Extract key decisions |

### Why Local LLM?

- **Privacy**: No data leaves the machine
- **Cost**: Zero per-query cost after model download
- **Speed**: Sub-second latency for short tasks
- **Offline**: Works without internet

### Why Embed Ollama?

| Approach | Pros | Cons |
|----------|------|------|
| llama-cpp-rs | Rust-native | Build complexity, no model management |
| Ollama (external) | Easy setup | User must install separately |
| **Ollama (embedded)** | Single install, full features | Larger binary, Go dependency |

**Embedded Ollama wins because:**
1. **Single install** - `cargo install blue` gives you everything
2. **Model management built-in** - pull, list, remove models
3. **Goose compatibility** - Goose connects to Blue's embedded Ollama
4. **Battle-tested** - Ollama handles CUDA/MPS/CPU, quantization, context
5. **One model, all uses** - Semantic tasks + agentic coding share model

### Ollama Version

Blue embeds a specific, tested Ollama version:

| Blue Version | Ollama Version | Release Date |
|--------------|----------------|--------------|
| 0.1.x | 0.5.4 | 2026-01 |

Version pinned in `build.rs`. Updated via Blue releases, not automatically.

## Proposal

### 1. LlmProvider Trait

```rust
#[async_trait]
pub trait LlmProvider: Send + Sync {
    async fn complete(&self, prompt: &str, options: &CompletionOptions) -> Result<String>;
    fn name(&self) -> &str;
}

pub struct CompletionOptions {
    pub max_tokens: usize,
    pub temperature: f32,
    pub stop_sequences: Vec<String>,
}
```

### 2. Implementations

```rust
pub enum LlmBackend {
    Ollama(OllamaLlm),   // Embedded Ollama server
    Api(ApiLlm),          // External API fallback
    Mock(MockLlm),        // Testing
}
```

**OllamaLlm**: Embedded Ollama server managed by Blue
**ApiLlm**: Uses Anthropic/OpenAI APIs (fallback)
**MockLlm**: Returns predefined responses (testing)

### 2.1 Embedded Ollama Architecture

```
┌─────────────────────────────────────────────────────────┐
│  Blue CLI                                                │
├─────────────────────────────────────────────────────────┤
│  blue-ollama (embedded)                                  │
│  ├── Ollama server (Go, compiled to lib)                │
│  ├── Model management (pull, list, remove)              │
│  └── HTTP API on localhost:11434                        │
├─────────────────────────────────────────────────────────┤
│  Consumers:                                              │
│  ├── Blue semantic tasks (ADR relevance, etc.)          │
│  ├── Goose (connects to localhost:11434)                │
│  └── Any Ollama-compatible client                       │
└─────────────────────────────────────────────────────────┘
```

**Embedding Strategy:**

```rust
// blue-ollama crate
pub struct EmbeddedOllama {
    process: Option<Child>,
    port: u16,
    models_dir: PathBuf,
}

impl EmbeddedOllama {
    /// Start embedded Ollama server
    pub async fn start(&mut self) -> Result<()> {
        // Ollama binary bundled in Blue release
        let ollama_bin = Self::bundled_binary_path();

        self.process = Some(
            Command::new(ollama_bin)
                .env("OLLAMA_MODELS", &self.models_dir)
                .env("OLLAMA_HOST", format!("127.0.0.1:{}", self.port))
                .spawn()?
        );

        self.wait_for_ready().await
    }

    /// Stop embedded server
    pub async fn stop(&mut self) -> Result<()> {
        if let Some(mut proc) = self.process.take() {
            proc.kill()?;
        }
        Ok(())
    }
}
```

### 3. Backend Priority (CUDA > MPS > CPU)

**Ollama handles this automatically.** Ollama detects GPU at runtime:

| Platform | Backend | Detection |
|----------|---------|-----------|
| NVIDIA GPU | CUDA | Auto-detected via driver |
| Apple Silicon | **Metal (MPS)** | Auto-detected on M1/M2/M3/M4 |
| AMD GPU | ROCm | Auto-detected on Linux |
| No GPU | CPU | Fallback |

```bash
# Ollama auto-detects best backend
ollama run qwen2.5:7b  # Uses CUDA → Metal → ROCm → CPU
```

**Apple Silicon (M1/M2/M3/M4):**
- Ollama uses Metal Performance Shaders (MPS) automatically
- No configuration needed - just works
- Full GPU acceleration on unified memory

**Blue just starts Ollama and lets it choose:**

```rust
impl EmbeddedOllama {
    pub async fn start(&mut self) -> Result<()> {
        let mut cmd = Command::new(Self::bundled_binary_path());

        // Force specific backend if configured
        match self.config.backend {
            BackendChoice::Cuda => {
                cmd.env("CUDA_VISIBLE_DEVICES", "0");
                cmd.env("OLLAMA_NO_METAL", "1");  // Prefer CUDA over Metal
            }
            BackendChoice::Mps => {
                // Metal/MPS on Apple Silicon (default on macOS)
                cmd.env("CUDA_VISIBLE_DEVICES", "");  // Disable CUDA
            }
            BackendChoice::Cpu => {
                cmd.env("CUDA_VISIBLE_DEVICES", "");  // Disable CUDA
                cmd.env("OLLAMA_NO_METAL", "1");      // Disable Metal/MPS
            }
            BackendChoice::Auto => {
                // Let Ollama decide: CUDA → MPS → ROCm → CPU
            }
        }

        self.process = Some(cmd.spawn()?);
        self.wait_for_ready().await
    }
}
```

**Backend verification:**

```rust
impl EmbeddedOllama {
    pub async fn detected_backend(&self) -> Result<String> {
        // Query Ollama for what it's using
        let resp = self.client.get("/api/version").await?;
        // Returns: {"version": "0.5.1", "gpu": "cuda"} or "metal" or "cpu"
        Ok(resp.gpu)
    }
}
```

### 4. Configuration

**Default: API (easier setup)**

New users get API by default - just set an env var:

```bash
export ANTHROPIC_API_KEY=sk-...
# That's it. Blue works.
```

**Opt-in: Local (better privacy/cost)**

```bash
blue_model_download name="qwen2.5-7b"
# Edit .blue/config.yaml to prefer local
```

**Full Configuration:**

```yaml
# .blue/config.yaml
llm:
  provider: auto  # auto | local | api | none

  # auto (default): Use local if model exists, else API, else keywords
  # local: Only use local, fail if unavailable
  # api: Only use API, fail if unavailable
  # none: Disable AI features entirely

  local:
    model: qwen2.5-7b  # Shorthand, resolves to full path
    # Or explicit: model_path: ~/.blue/models/qwen2.5-7b-instruct-q4_k_m.gguf
    backend: auto  # cuda | mps | cpu | auto
    context_size: 8192
    threads: 8  # for CPU backend

  api:
    provider: anthropic  # anthropic | openai
    model: claude-3-haiku-20240307
    api_key_env: ANTHROPIC_API_KEY  # Read from env var
```

**Zero-Config Experience:**

| User State | Behavior |
|------------|----------|
| No config, no env var | Keywords only (works offline) |
| `ANTHROPIC_API_KEY` set | API (easiest) |
| Model downloaded | Local (best) |
| Both available | Local preferred |

### 5. Model Management (via Embedded Ollama)

Blue wraps Ollama's model commands:

```
blue_model_list          # ollama list
blue_model_pull          # ollama pull
blue_model_remove        # ollama rm
blue_model_info          # ollama show
```

Model storage: `~/.ollama/models/` (Ollama default, shared with external Ollama)

**Recommended Models:**

| Model | Size | Use Case |
|-------|------|----------|
| `qwen2.5:7b` | ~4.4GB | Fast, good quality |
| `qwen2.5:32b` | ~19GB | Best quality |
| `qwen2.5-coder:7b` | ~4.4GB | Code-focused |
| `qwen2.5-coder:32b` | ~19GB | Best for agentic coding |

**Pull Example:**

```
blue_model_pull name="qwen2.5:7b"

→ Pulling qwen2.5:7b...
→ [████████████████████] 100% (4.4 GB)
→ Model ready. Run: blue_model_info name="qwen2.5:7b"
```

**Licensing:** Qwen2.5 models are Apache 2.0 - commercial use permitted.

### 5.1 Goose Integration

Blue bundles Goose binary and auto-configures it for local Ollama:

```
┌─────────────────────────────────────────────────────────┐
│  User runs: blue agent                                   │
│       ↓                                                  │
│  Blue detects Ollama on localhost:11434                 │
│       ↓                                                  │
│  Picks largest available model (e.g., qwen2.5:72b)      │
│       ↓                                                  │
│  Launches bundled Goose with Blue MCP extension         │
└─────────────────────────────────────────────────────────┘
```

**Zero Setup:**

```bash
# Just run it - Blue handles everything
blue agent

# What happens:
# 1. Uses bundled Goose binary (downloaded at build time)
# 2. Detects Ollama running on localhost:11434
# 3. Selects largest model (best for agentic work)
# 4. Sets GOOSE_PROVIDER=ollama, OLLAMA_HOST=...
# 5. Connects Blue MCP extension for workflow tools
```

**Manual Model Override:**

```bash
# Use a specific provider/model
blue agent --model ollama/qwen2.5:7b
blue agent --model anthropic/claude-sonnet-4-20250514

# Pass additional Goose arguments
blue agent -- --resume --name my-session
```

**Goose Binary Bundling:**

Blue's `build.rs` downloads the Goose binary for the target platform:

| Platform | Binary |
|----------|--------|
| macOS ARM64 | goose-aarch64-apple-darwin |
| macOS x86_64 | goose-x86_64-apple-darwin |
| Linux x86_64 | goose-x86_64-unknown-linux-gnu |
| Linux ARM64 | goose-aarch64-unknown-linux-gnu |
| Windows | goose-x86_64-pc-windows-gnu |

**Build-time Download:**

```rust
// apps/blue-cli/build.rs
const GOOSE_VERSION: &str = "1.21.1";

// Downloads goose-{arch}-{os}.tar.bz2 from GitHub releases
// Extracts to OUT_DIR, sets GOOSE_BINARY_PATH env var
```

**Runtime Discovery:**

1. Check for bundled binary next to `blue` executable
2. Check compile-time `GOOSE_BINARY_PATH`
3. Fall back to system PATH (validates it's Block's Goose, not the DB migration tool)

**Shared Model Benefits:**

| Without Blue | With Blue |
|--------------|-----------|
| Install Goose separately | Blue bundles Goose |
| Install Ollama separately | Blue bundles Ollama |
| Configure Goose manually | `blue agent` auto-configures |
| Model loaded twice | One model instance |
| 40GB RAM for two 32B models | 20GB for shared model |

### 6. Graceful Degradation

```rust
impl BlueState {
    pub async fn get_llm(&self) -> Option<&dyn LlmProvider> {
        // Try local first
        if let Some(local) = &self.local_llm {
            if local.is_ready() {
                return Some(local);
            }
        }

        // Fall back to API
        if let Some(api) = &self.api_llm {
            return Some(api);
        }

        // No LLM available
        None
    }
}
```

| Condition | Behavior |
|-----------|----------|
| Local model loaded | Use local (default) |
| Local unavailable, API configured | Fall back to API + warning |
| Neither available | Keyword matching only |
| `--no-ai` flag | Skip AI entirely |

### 7. Model Loading Strategy

**Problem:** Model load takes 5-10 seconds. Can't block MCP calls.

**Solution:** Daemon preloads model on startup.

```rust
impl EmbeddedOllama {
    pub async fn warmup(&self, model: &str) -> Result<()> {
        // Send a dummy request to load model into memory
        let resp = self.client
            .post("/api/generate")
            .json(&json!({
                "model": model,
                "prompt": "Hi",
                "options": { "num_predict": 1 }
            }))
            .send()
            .await?;

        // Model now loaded and warm
        Ok(())
    }
}
```

**Daemon Startup:**

```bash
blue daemon start

→ Starting embedded Ollama...
→ Ollama ready on localhost:11434
→ Warming up qwen2.5:7b... (5-10 seconds)
→ Model ready.
```

**MCP Tool Response During Load:**

```json
{
  "status": "loading",
  "message": "Model loading... Try again in a few seconds.",
  "retry_after_ms": 2000
}
```

**Auto-Warmup:** Daemon warms up configured model on start. First MCP request is fast.

**Manual Warmup:**

```
blue_model_warmup model="qwen2.5:32b"  # Load specific model
```

### 8. Multi-Session Model Handling

**Question:** What if user has multiple Blue MCP sessions (multiple IDE windows)?

**Answer:** All sessions share one Ollama instance via `blue daemon`.

```
┌─────────────────────────────────────────────────────────┐
│  blue daemon (singleton)                                 │
│  └── Embedded Ollama (localhost:11434)                  │
│      └── Model loaded once (~20GB for 32B)              │
├─────────────────────────────────────────────────────────┤
│  Blue MCP Session 1  ──┐                                │
│  Blue MCP Session 2  ──┼──→ HTTP to localhost:11434    │
│  Goose              ──┘                                │
└─────────────────────────────────────────────────────────┘
```

**Benefits:**
- One model in memory, not per-session
- Goose shares same model instance
- Daemon manages Ollama lifecycle
- Sessions can come and go

**Daemon Lifecycle:**

```bash
blue daemon start   # Start Ollama, keep running
blue daemon stop    # Stop Ollama
blue daemon status  # Check health and GPU info

# Auto-start: first MCP connection starts daemon if not running
```

**Status Output:**

```
$ blue daemon status

Blue Daemon: running
├── Ollama: healthy (v0.5.4)
├── Backend: Metal (MPS) - Apple M4 Max
├── Port: 11434
├── Models loaded: qwen2.5:32b (19GB)
├── Uptime: 2h 34m
└── Requests served: 1,247
```

### Daemon Health & Recovery

**Health checks:**

```rust
impl EmbeddedOllama {
    pub async fn health_check(&self) -> Result<HealthStatus> {
        match self.client.get("/api/version").await {
            Ok(resp) => Ok(HealthStatus::Healthy {
                version: resp.version,
                gpu: resp.gpu,
            }),
            Err(e) => Ok(HealthStatus::Unhealthy { error: e.to_string() }),
        }
    }

    pub fn start_health_monitor(&self) {
        tokio::spawn(async move {
            loop {
                tokio::time::sleep(Duration::from_secs(30)).await;

                if let Ok(HealthStatus::Unhealthy { .. }) = self.health_check().await {
                    log::warn!("Ollama unhealthy, attempting restart...");
                    self.restart().await;
                }
            }
        });
    }
}
```

**Crash recovery:**

| Scenario | Behavior |
|----------|----------|
| Ollama crashes | Auto-restart within 5 seconds |
| Restart fails 3x | Mark as failed, fall back to API |
| User calls `daemon restart` | Force restart, reset failure count |

**Graceful shutdown:**

```rust
impl EmbeddedOllama {
    pub async fn stop(&mut self) -> Result<()> {
        // Signal Ollama to finish current requests
        self.client.post("/api/shutdown").await.ok();

        // Wait up to 10 seconds for graceful shutdown
        tokio::time::timeout(
            Duration::from_secs(10),
            self.wait_for_exit()
        ).await.ok();

        // Force kill if still running
        if let Some(proc) = self.process.take() {
            proc.kill().ok();
        }

        Ok(())
    }
}
```

### 8. Integration Points

**ADR Relevance (RFC 0004):**
```rust
pub async fn find_relevant_adrs(
    llm: &dyn LlmProvider,
    context: &str,
    adrs: &[AdrSummary],
) -> Result<Vec<RelevanceResult>> {
    let prompt = format_relevance_prompt(context, adrs);
    let response = llm.complete(&prompt, &RELEVANCE_OPTIONS).await?;
    parse_relevance_response(&response)
}
```

**Runbook Matching (RFC 0002):**
```rust
pub async fn match_action_semantic(
    llm: &dyn LlmProvider,
    query: &str,
    actions: &[String],
) -> Result<Option<String>> {
    // Use LLM to find best semantic match
}
```

### 9. Cargo Features & Build

```toml
[features]
default = ["ollama"]
ollama = []  # Embeds Ollama binary

[dependencies]
reqwest = { version = "0.12", features = ["json"] }  # Ollama HTTP client
tokio = { version = "1", features = ["process"] }     # Process management

[build-dependencies]
# Download Ollama binary at build time
```

**Build Process:**

```rust
// build.rs
const OLLAMA_VERSION: &str = "0.5.4";

fn main() {
    let target = std::env::var("TARGET").unwrap();

    let (ollama_url, sha256) = match target.as_str() {
        // macOS (Universal - works on Intel and Apple Silicon)
        t if t.contains("darwin") =>
            (format!("https://github.com/ollama/ollama/releases/download/v{}/ollama-darwin", OLLAMA_VERSION),
             "abc123..."),

        // Linux x86_64
        t if t.contains("x86_64") && t.contains("linux") =>
            (format!("https://github.com/ollama/ollama/releases/download/v{}/ollama-linux-amd64", OLLAMA_VERSION),
             "def456..."),

        // Linux ARM64 (Raspberry Pi 4/5, AWS Graviton, etc.)
        t if t.contains("aarch64") && t.contains("linux") =>
            (format!("https://github.com/ollama/ollama/releases/download/v{}/ollama-linux-arm64", OLLAMA_VERSION),
             "ghi789..."),

        // Windows x86_64
        t if t.contains("windows") =>
            (format!("https://github.com/ollama/ollama/releases/download/v{}/ollama-windows-amd64.exe", OLLAMA_VERSION),
             "jkl012..."),

        _ => panic!("Unsupported target: {}", target),
    };

    download_and_verify(&ollama_url, sha256);
    println!("cargo:rerun-if-changed=build.rs");
}
```

**Supported Platforms:**

| Platform | Architecture | Ollama Binary |
|----------|--------------|---------------|
| macOS | x86_64 + ARM64 | ollama-darwin (universal) |
| Linux | x86_64 | ollama-linux-amd64 |
| Linux | ARM64 | ollama-linux-arm64 |
| Windows | x86_64 | ollama-windows-amd64.exe |

**ARM64 Linux Use Cases:**
- Raspberry Pi 4/5 (8GB+ recommended)
- AWS Graviton instances
- NVIDIA Jetson
- Apple Silicon Linux VMs

**Binary Size:**

| Component | Size |
|-----------|------|
| Blue CLI | ~5 MB |
| Ollama binary | ~50 MB |
| **Total** | ~55 MB |

Models downloaded separately on first use.

### 10. Performance Expectations

**Apple Silicon (M4 Max, 128GB, Metal/MPS):**

| Metric | Qwen2.5-7B | Qwen2.5-32B |
|--------|------------|-------------|
| Model load | 2-3 sec | 5-10 sec |
| Prompt processing | ~150 tok/s | ~100 tok/s |
| Generation | ~80 tok/s | ~50 tok/s |
| ADR relevance | 100-200ms | 200-400ms |

**NVIDIA GPU (RTX 4090, CUDA):**

| Metric | Qwen2.5-7B | Qwen2.5-32B |
|--------|------------|-------------|
| Model load | 1-2 sec | 3-5 sec |
| Prompt processing | ~200 tok/s | ~120 tok/s |
| Generation | ~100 tok/s | ~60 tok/s |
| ADR relevance | 80-150ms | 150-300ms |

**CPU Only (fallback):**

| Metric | Qwen2.5-7B | Qwen2.5-32B |
|--------|------------|-------------|
| Generation | ~10 tok/s | ~3 tok/s |
| ADR relevance | 1-2 sec | 5-10 sec |

Metal/MPS on Apple Silicon is first-class - not a fallback.

### 11. Memory Validation

Ollama handles memory management, but Blue validates before pull:

```rust
impl EmbeddedOllama {
    pub async fn validate_can_pull(&self, model: &str) -> Result<()> {
        let model_size = self.get_model_size(model).await?;
        let available = sys_info::mem_info()?.avail * 1024;
        let buffer = model_size / 5;  // 20% buffer

        if available < model_size + buffer {
            return Err(LlmError::InsufficientMemory {
                model: model.to_string(),
                required: model_size + buffer,
                available,
                suggestion: format!(
                    "Close some applications or use a smaller model. \
                     Try: blue_model_pull name=\"qwen2.5:7b\""
                ),
            });
        }
        Ok(())
    }
}
```

**Ollama's Own Handling:**

Ollama gracefully handles memory pressure by unloading models. Blue's validation is advisory.

### 12. Build Requirements

**Blue Build (all platforms):**
```bash
# Just Rust toolchain
cargo build --release
```

Blue's build.rs downloads the pre-built Ollama binary for the target platform. No C++ compiler needed.

**Runtime GPU Support:**

Ollama bundles GPU support. User just needs drivers:

**macOS (Metal):**
- Works out of box on Apple Silicon (M1/M2/M3/M4)
- No additional setup needed

**Linux (CUDA):**
```bash
# NVIDIA drivers (CUDA Toolkit not needed for inference)
nvidia-smi  # Verify driver installed
```

**Linux (ROCm):**
```bash
# AMD GPU support
rocminfo  # Verify ROCm installed
```

**Windows:**
- NVIDIA: Just need GPU drivers
- Works on CPU if no GPU

**Ollama handles everything else** - users don't need to install CUDA Toolkit, cuDNN, etc.

## Security Considerations

1. **Ollama binary integrity**: Verify SHA256 of bundled Ollama binary at build time
2. **Model provenance**: Ollama registry handles model verification
3. **Local only by default**: Ollama binds to localhost:11434, not exposed
4. **Prompt injection**: Sanitize user input before prompts
5. **Memory**: Ollama handles memory management
6. **No secrets in prompts**: ADR relevance only sends context strings
7. **Process isolation**: Ollama runs as subprocess, not linked

**Network Binding:**

```rust
impl EmbeddedOllama {
    pub async fn start(&mut self) -> Result<()> {
        let mut cmd = Command::new(Self::bundled_binary_path());

        // Bind to localhost only - not accessible from network
        cmd.env("OLLAMA_HOST", "127.0.0.1:11434");

        // ...
    }
}
```

**Goose Access:**

Goose connects to `localhost:11434` - works because it's on the same machine. Remote access requires explicit `OLLAMA_HOST=0.0.0.0:11434` override.

### Port Conflict Handling

**Scenario:** User already has Ollama running on port 11434.

```rust
impl EmbeddedOllama {
    pub async fn start(&mut self) -> Result<()> {
        // Check if port 11434 is in use
        if Self::port_in_use(11434) {
            // Check if it's Ollama
            if Self::is_ollama_running().await? {
                // Use existing Ollama instance
                self.mode = OllamaMode::External;
                return Ok(());
            } else {
                // Something else on port - use alternate
                self.port = Self::find_free_port(11435..11500)?;
            }
        }

        // Start embedded Ollama on chosen port
        self.start_embedded().await
    }
}
```

| Situation | Behavior |
|-----------|----------|
| Port 11434 free | Start embedded Ollama |
| Ollama already running | Use existing (no duplicate) |
| Other service on port | Use alternate port (11435+) |

**Config override:**

```yaml
# .blue/config.yaml
llm:
  local:
    ollama_port: 11500  # Force specific port
    use_external: true   # Never start embedded, use existing
```

### Binary Verification

**Build-time verification:**

```rust
// build.rs
const OLLAMA_SHA256: &str = "abc123...";  // Per-platform hashes

fn download_ollama() {
    let bytes = download(OLLAMA_URL)?;
    let hash = sha256(&bytes);

    if hash != OLLAMA_SHA256 {
        panic!("Ollama binary hash mismatch! Expected {}, got {}", OLLAMA_SHA256, hash);
    }

    write_binary(bytes)?;
}
```

**Runtime verification:**

```rust
impl EmbeddedOllama {
    fn verify_binary(&self) -> Result<()> {
        let expected = include_str!("ollama.sha256");
        let actual = sha256_file(Self::bundled_binary_path())?;

        if actual != expected {
            return Err(LlmError::BinaryTampered {
                expected: expected.to_string(),
                actual,
            });
        }
        Ok(())
    }

    pub async fn start(&mut self) -> Result<()> {
        self.verify_binary()?;  // Check before every start
        // ...
    }
}
```

### Air-Gapped Builds

For environments without internet during build:

```bash
# 1. Download Ollama binary manually
curl -L https://github.com/ollama/ollama/releases/download/v0.5.4/ollama-darwin \
  -o vendor/ollama-darwin

# 2. Build with BLUE_OLLAMA_PATH
BLUE_OLLAMA_PATH=vendor/ollama-darwin cargo build --release
```

```rust
// build.rs
fn get_ollama_binary() -> Vec<u8> {
    if let Ok(path) = std::env::var("BLUE_OLLAMA_PATH") {
        // Use pre-downloaded binary
        std::fs::read(path).expect("Failed to read BLUE_OLLAMA_PATH")
    } else {
        // Download from GitHub
        download_ollama()
    }
}
```

## Implementation Phases

**Phase 1: Embedded Ollama**
1. Add build.rs to download Ollama binary per platform
2. Create `blue-ollama` crate for embedded server management
3. Implement `EmbeddedOllama::start()` and `stop()`
4. Add `blue daemon start/stop` commands

**Phase 2: LLM Provider**
5. Add `LlmProvider` trait to blue-core
6. Implement `OllamaLlm` using HTTP client
7. Add `blue_model_pull`, `blue_model_list` tools
8. Implement auto-pull on first use

**Phase 3: Semantic Integration**
9. Integrate with ADR relevance (RFC 0004)
10. Add semantic runbook matching (RFC 0002)
11. Add fallback chain: Ollama → API → keywords

**Phase 4: Goose Integration**
12. Add `blue agent` command to launch Goose
13. Document Goose + Blue setup
14. Ship example configs

## CI/CD Matrix

Test embedded Ollama on all platforms:

```yaml
# .github/workflows/ci.yml
jobs:
  test-ollama:
    strategy:
      matrix:
        include:
          - os: macos-latest
            ollama_binary: ollama-darwin
          - os: ubuntu-latest
            ollama_binary: ollama-linux-amd64
          - os: windows-latest
            ollama_binary: ollama-windows-amd64.exe

    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4

      - name: Build Blue (downloads Ollama binary)
        run: cargo build --release

      - name: Verify Ollama binary embedded
        run: |
          # Check binary exists in expected location
          ls -la target/release/ollama*

      - name: Test daemon start/stop
        run: |
          cargo run -- daemon start
          sleep 5
          curl -s http://localhost:11434/api/version
          cargo run -- daemon stop

      - name: Test with mock model (no download)
        run: cargo test ollama::mock

  # GPU tests run on self-hosted runners
  test-gpu:
    runs-on: [self-hosted, gpu]
    steps:
      - uses: actions/checkout@v4
      - name: Test CUDA detection
        run: |
          cargo build --release
          cargo run -- daemon start
          # Verify GPU detected
          curl -s http://localhost:11434/api/version | jq .gpu
          cargo run -- daemon stop
```

**Note:** Full model integration tests run nightly (large downloads).

## Test Plan

**Embedded Ollama:**
- [ ] `blue daemon start` launches embedded Ollama
- [ ] `blue daemon stop` cleanly shuts down
- [ ] Ollama detects CUDA when available
- [ ] Ollama detects Metal on macOS
- [ ] Falls back to CPU when no GPU
- [ ] Health check returns backend type

**Model Management:**
- [ ] `blue_model_pull` downloads from Ollama registry
- [ ] `blue_model_list` shows pulled models
- [ ] `blue_model_remove` deletes model
- [ ] Auto-pull on first completion if model missing
- [ ] Progress indicator during pull

**LLM Provider:**
- [ ] `OllamaLlm::complete()` returns valid response
- [ ] Fallback chain: Ollama → API → keywords
- [ ] `--no-ai` flag skips LLM entirely
- [ ] Configuration parsing from .blue/config.yaml

**Semantic Integration:**
- [ ] ADR relevance uses embedded Ollama
- [ ] Runbook matching uses semantic search
- [ ] Response includes method used (ollama/api/keywords)

**Goose Integration:**
- [ ] `blue agent` starts Goose with Blue extension
- [ ] Goose connects to Blue's embedded Ollama
- [ ] Goose can use Blue MCP tools
- [ ] Model shared between Blue tasks and Goose

**Multi-Session:**
- [ ] Multiple Blue MCP sessions share one Ollama
- [ ] Concurrent completions handled correctly
- [ ] Daemon persists across shell sessions

**Port Conflict:**
- [ ] Detects existing Ollama on port 11434
- [ ] Uses existing Ollama instead of starting new
- [ ] Uses alternate port if non-Ollama on 11434
- [ ] `use_external: true` config works

**Health & Recovery:**
- [ ] Health check detects unhealthy Ollama
- [ ] Auto-restart on crash
- [ ] Falls back to API after 3 restart failures
- [ ] Graceful shutdown waits for requests

**Binary Verification:**
- [ ] Build fails if Ollama hash mismatch
- [ ] Runtime verification before start
- [ ] Tampered binary: clear error message
- [ ] Air-gapped build with BLUE_OLLAMA_PATH works

**CI Matrix:**
- [ ] macOS build includes darwin Ollama binary
- [ ] Linux x86_64 build includes amd64 binary
- [ ] Linux ARM64 build includes arm64 binary
- [ ] Windows build includes windows binary
- [ ] Integration tests with mock Ollama server

---

*"Right then. Let's get to it."*

— Blue