Eric Garcia 0fea499957 feat: lifecycle suffixes for all document states + resolve all clippy warnings

Every document filename now mirrors its lifecycle state with a status
suffix (e.g., .draft.md, .wip.md, .accepted.md). No more bare .md for
tracked document types. Also renamed all from_str methods to parse to
avoid FromStr trait confusion, introduced StagingDeploymentParams struct,
and fixed all 19 clippy warnings across the codebase.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2026-01-26 12:19:46 -05:00

30 KiB

Raw Blame History

RFC 0005: Local Llm Integration


Status	Implemented
Date	2026-01-24
Source Spike	local-llm-integration, agentic-cli-integration

Summary

Blue needs local LLM capabilities for:

Semantic tasks - ADR relevance, runbook matching, dialogue summarization (lightweight, fast)
Agentic coding - Full code generation via Goose integration (heavyweight, powerful)

Unified approach: Ollama as shared backend + Goose for agentic tasks + Blue's LlmProvider for semantic tasks.

Must support CUDA > MPS > CPU backend priority.

Background

Two Use Cases

Use Case	Latency	Complexity	Tool
Semantic tasks	<500ms	Short prompts, structured output	Blue internal
Agentic coding	Minutes	Multi-turn, code generation	Goose

Blue's Semantic Tasks

Feature	RFC	Need
ADR Relevance	0004	Match context to philosophical ADRs
Runbook Lookup	0002	Semantic action matching
Dialogue Summary	0001	Extract key decisions

Why Local LLM?

Privacy: No data leaves the machine
Cost: Zero per-query cost after model download
Speed: Sub-second latency for short tasks
Offline: Works without internet

Why Embed Ollama?

Approach	Pros	Cons
llama-cpp-rs	Rust-native	Build complexity, no model management
Ollama (external)	Easy setup	User must install separately
Ollama (embedded)	Single install, full features	Larger binary, Go dependency

Embedded Ollama wins because:

Single install - cargo install blue gives you everything
Model management built-in - pull, list, remove models
Goose compatibility - Goose connects to Blue's embedded Ollama
Battle-tested - Ollama handles CUDA/MPS/CPU, quantization, context
One model, all uses - Semantic tasks + agentic coding share model

Ollama Version

Blue embeds a specific, tested Ollama version:

Blue Version	Ollama Version	Release Date
0.1.x	0.5.4	2026-01

Version pinned in build.rs. Updated via Blue releases, not automatically.

Proposal

1. LlmProvider Trait

#[async_trait]
pub trait LlmProvider: Send + Sync {
    async fn complete(&self, prompt: &str, options: &CompletionOptions) -> Result<String>;
    fn name(&self) -> &str;
}

pub struct CompletionOptions {
    pub max_tokens: usize,
    pub temperature: f32,
    pub stop_sequences: Vec<String>,
}

2. Implementations

pub enum LlmBackend {
    Ollama(OllamaLlm),   // Embedded Ollama server
    Api(ApiLlm),          // External API fallback
    Mock(MockLlm),        // Testing
}

OllamaLlm: Embedded Ollama server managed by Blue ApiLlm: Uses Anthropic/OpenAI APIs (fallback) MockLlm: Returns predefined responses (testing)

2.1 Embedded Ollama Architecture

┌─────────────────────────────────────────────────────────┐
│  Blue CLI                                                │
├─────────────────────────────────────────────────────────┤
│  blue-ollama (embedded)                                  │
│  ├── Ollama server (Go, compiled to lib)                │
│  ├── Model management (pull, list, remove)              │
│  └── HTTP API on localhost:11434                        │
├─────────────────────────────────────────────────────────┤
│  Consumers:                                              │
│  ├── Blue semantic tasks (ADR relevance, etc.)          │
│  ├── Goose (connects to localhost:11434)                │
│  └── Any Ollama-compatible client                       │
└─────────────────────────────────────────────────────────┘

Embedding Strategy:

// blue-ollama crate
pub struct EmbeddedOllama {
    process: Option<Child>,
    port: u16,
    models_dir: PathBuf,
}

impl EmbeddedOllama {
    /// Start embedded Ollama server
    pub async fn start(&mut self) -> Result<()> {
        // Ollama binary bundled in Blue release
        let ollama_bin = Self::bundled_binary_path();

        self.process = Some(
            Command::new(ollama_bin)
                .env("OLLAMA_MODELS", &self.models_dir)
                .env("OLLAMA_HOST", format!("127.0.0.1:{}", self.port))
                .spawn()?
        );

        self.wait_for_ready().await
    }

    /// Stop embedded server
    pub async fn stop(&mut self) -> Result<()> {
        if let Some(mut proc) = self.process.take() {
            proc.kill()?;
        }
        Ok(())
    }
}

3. Backend Priority (CUDA > MPS > CPU)

Ollama handles this automatically. Ollama detects GPU at runtime:

Platform	Backend	Detection
NVIDIA GPU	CUDA	Auto-detected via driver
Apple Silicon	Metal (MPS)	Auto-detected on M1/M2/M3/M4
AMD GPU	ROCm	Auto-detected on Linux
No GPU	CPU	Fallback

# Ollama auto-detects best backend
ollama run qwen2.5:7b  # Uses CUDA → Metal → ROCm → CPU

Apple Silicon (M1/M2/M3/M4):

Ollama uses Metal Performance Shaders (MPS) automatically
No configuration needed - just works
Full GPU acceleration on unified memory

Blue just starts Ollama and lets it choose:

impl EmbeddedOllama {
    pub async fn start(&mut self) -> Result<()> {
        let mut cmd = Command::new(Self::bundled_binary_path());

        // Force specific backend if configured
        match self.config.backend {
            BackendChoice::Cuda => {
                cmd.env("CUDA_VISIBLE_DEVICES", "0");
                cmd.env("OLLAMA_NO_METAL", "1");  // Prefer CUDA over Metal
            }
            BackendChoice::Mps => {
                // Metal/MPS on Apple Silicon (default on macOS)
                cmd.env("CUDA_VISIBLE_DEVICES", "");  // Disable CUDA
            }
            BackendChoice::Cpu => {
                cmd.env("CUDA_VISIBLE_DEVICES", "");  // Disable CUDA
                cmd.env("OLLAMA_NO_METAL", "1");      // Disable Metal/MPS
            }
            BackendChoice::Auto => {
                // Let Ollama decide: CUDA → MPS → ROCm → CPU
            }
        }

        self.process = Some(cmd.spawn()?);
        self.wait_for_ready().await
    }
}

Backend verification:

impl EmbeddedOllama {
    pub async fn detected_backend(&self) -> Result<String> {
        // Query Ollama for what it's using
        let resp = self.client.get("/api/version").await?;
        // Returns: {"version": "0.5.1", "gpu": "cuda"} or "metal" or "cpu"
        Ok(resp.gpu)
    }
}

4. Configuration

Default: API (easier setup)

New users get API by default - just set an env var:

export ANTHROPIC_API_KEY=sk-...
# That's it. Blue works.

Opt-in: Local (better privacy/cost)

blue_model_download name="qwen2.5-7b"
# Edit .blue/config.yaml to prefer local

Full Configuration:

# .blue/config.yaml
llm:
  provider: auto  # auto | local | api | none

  # auto (default): Use local if model exists, else API, else keywords
  # local: Only use local, fail if unavailable
  # api: Only use API, fail if unavailable
  # none: Disable AI features entirely

  local:
    model: qwen2.5-7b  # Shorthand, resolves to full path
    # Or explicit: model_path: ~/.blue/models/qwen2.5-7b-instruct-q4_k_m.gguf
    backend: auto  # cuda | mps | cpu | auto
    context_size: 8192
    threads: 8  # for CPU backend

  api:
    provider: anthropic  # anthropic | openai
    model: claude-3-haiku-20240307
    api_key_env: ANTHROPIC_API_KEY  # Read from env var

Zero-Config Experience:

User State	Behavior
No config, no env var	Keywords only (works offline)
`ANTHROPIC_API_KEY` set	API (easiest)
Model downloaded	Local (best)
Both available	Local preferred

5. Model Management (via Embedded Ollama)

Blue wraps Ollama's model commands:

blue_model_list          # ollama list
blue_model_pull          # ollama pull
blue_model_remove        # ollama rm
blue_model_info          # ollama show

Model storage: ~/.ollama/models/ (Ollama default, shared with external Ollama)

Recommended Models:

Model	Size	Use Case
`qwen2.5:7b`	~4.4GB	Fast, good quality
`qwen2.5:32b`	~19GB	Best quality
`qwen2.5-coder:7b`	~4.4GB	Code-focused
`qwen2.5-coder:32b`	~19GB	Best for agentic coding

Pull Example:

blue_model_pull name="qwen2.5:7b"

→ Pulling qwen2.5:7b...
→ [████████████████████] 100% (4.4 GB)
→ Model ready. Run: blue_model_info name="qwen2.5:7b"

Licensing: Qwen2.5 models are Apache 2.0 - commercial use permitted.

5.1 Goose Integration

Blue bundles Goose binary and auto-configures it for local Ollama:

┌─────────────────────────────────────────────────────────┐
│  User runs: blue agent                                   │
│       ↓                                                  │
│  Blue detects Ollama on localhost:11434                 │
│       ↓                                                  │
│  Picks largest available model (e.g., qwen2.5:72b)      │
│       ↓                                                  │
│  Launches bundled Goose with Blue MCP extension         │
└─────────────────────────────────────────────────────────┘

Zero Setup:

# Just run it - Blue handles everything
blue agent

# What happens:
# 1. Uses bundled Goose binary (downloaded at build time)
# 2. Detects Ollama running on localhost:11434
# 3. Selects largest model (best for agentic work)
# 4. Sets GOOSE_PROVIDER=ollama, OLLAMA_HOST=...
# 5. Connects Blue MCP extension for workflow tools

Manual Model Override:

# Use a specific provider/model
blue agent --model ollama/qwen2.5:7b
blue agent --model anthropic/claude-sonnet-4-20250514

# Pass additional Goose arguments
blue agent -- --resume --name my-session

Goose Binary Bundling:

Blue's build.rs downloads the Goose binary for the target platform:

Platform	Binary
macOS ARM64	goose-aarch64-apple-darwin
macOS x86_64	goose-x86_64-apple-darwin
Linux x86_64	goose-x86_64-unknown-linux-gnu
Linux ARM64	goose-aarch64-unknown-linux-gnu
Windows	goose-x86_64-pc-windows-gnu

Build-time Download:

// apps/blue-cli/build.rs
const GOOSE_VERSION: &str = "1.21.1";

// Downloads goose-{arch}-{os}.tar.bz2 from GitHub releases
// Extracts to OUT_DIR, sets GOOSE_BINARY_PATH env var

Runtime Discovery:

Check for bundled binary next to blue executable
Check compile-time GOOSE_BINARY_PATH
Fall back to system PATH (validates it's Block's Goose, not the DB migration tool)

Shared Model Benefits:

Without Blue	With Blue
Install Goose separately	Blue bundles Goose
Install Ollama separately	Blue bundles Ollama
Configure Goose manually	`blue agent` auto-configures
Model loaded twice	One model instance
40GB RAM for two 32B models	20GB for shared model

6. Graceful Degradation

impl BlueState {
    pub async fn get_llm(&self) -> Option<&dyn LlmProvider> {
        // Try local first
        if let Some(local) = &self.local_llm {
            if local.is_ready() {
                return Some(local);
            }
        }

        // Fall back to API
        if let Some(api) = &self.api_llm {
            return Some(api);
        }

        // No LLM available
        None
    }
}

Condition	Behavior
Local model loaded	Use local (default)
Local unavailable, API configured	Fall back to API + warning
Neither available	Keyword matching only
`--no-ai` flag	Skip AI entirely

7. Model Loading Strategy

Problem: Model load takes 5-10 seconds. Can't block MCP calls.

Solution: Daemon preloads model on startup.

impl EmbeddedOllama {
    pub async fn warmup(&self, model: &str) -> Result<()> {
        // Send a dummy request to load model into memory
        let resp = self.client
            .post("/api/generate")
            .json(&json!({
                "model": model,
                "prompt": "Hi",
                "options": { "num_predict": 1 }
            }))
            .send()
            .await?;

        // Model now loaded and warm
        Ok(())
    }
}

Daemon Startup:

blue daemon start

→ Starting embedded Ollama...
→ Ollama ready on localhost:11434
→ Warming up qwen2.5:7b... (5-10 seconds)
→ Model ready.

MCP Tool Response During Load:

{
  "status": "loading",
  "message": "Model loading... Try again in a few seconds.",
  "retry_after_ms": 2000
}

Auto-Warmup: Daemon warms up configured model on start. First MCP request is fast.

Manual Warmup:

blue_model_warmup model="qwen2.5:32b"  # Load specific model

8. Multi-Session Model Handling

Question: What if user has multiple Blue MCP sessions (multiple IDE windows)?

Answer: All sessions share one Ollama instance via blue daemon.

┌─────────────────────────────────────────────────────────┐
│  blue daemon (singleton)                                 │
│  └── Embedded Ollama (localhost:11434)                  │
│      └── Model loaded once (~20GB for 32B)              │
├─────────────────────────────────────────────────────────┤
│  Blue MCP Session 1  ──┐                                │
│  Blue MCP Session 2  ──┼──→ HTTP to localhost:11434    │
│  Goose              ──┘                                │
└─────────────────────────────────────────────────────────┘

Benefits:

One model in memory, not per-session
Goose shares same model instance
Daemon manages Ollama lifecycle
Sessions can come and go

Daemon Lifecycle:

blue daemon start   # Start Ollama, keep running
blue daemon stop    # Stop Ollama
blue daemon status  # Check health and GPU info

# Auto-start: first MCP connection starts daemon if not running

Status Output:

$ blue daemon status

Blue Daemon: running
├── Ollama: healthy (v0.5.4)
├── Backend: Metal (MPS) - Apple M4 Max
├── Port: 11434
├── Models loaded: qwen2.5:32b (19GB)
├── Uptime: 2h 34m
└── Requests served: 1,247

Daemon Health & Recovery

Health checks:

impl EmbeddedOllama {
    pub async fn health_check(&self) -> Result<HealthStatus> {
        match self.client.get("/api/version").await {
            Ok(resp) => Ok(HealthStatus::Healthy {
                version: resp.version,
                gpu: resp.gpu,
            }),
            Err(e) => Ok(HealthStatus::Unhealthy { error: e.to_string() }),
        }
    }

    pub fn start_health_monitor(&self) {
        tokio::spawn(async move {
            loop {
                tokio::time::sleep(Duration::from_secs(30)).await;

                if let Ok(HealthStatus::Unhealthy { .. }) = self.health_check().await {
                    log::warn!("Ollama unhealthy, attempting restart...");
                    self.restart().await;
                }
            }
        });
    }
}

Crash recovery:

Scenario	Behavior
Ollama crashes	Auto-restart within 5 seconds
Restart fails 3x	Mark as failed, fall back to API
User calls `daemon restart`	Force restart, reset failure count

Graceful shutdown:

impl EmbeddedOllama {
    pub async fn stop(&mut self) -> Result<()> {
        // Signal Ollama to finish current requests
        self.client.post("/api/shutdown").await.ok();

        // Wait up to 10 seconds for graceful shutdown
        tokio::time::timeout(
            Duration::from_secs(10),
            self.wait_for_exit()
        ).await.ok();

        // Force kill if still running
        if let Some(proc) = self.process.take() {
            proc.kill().ok();
        }

        Ok(())
    }
}

8. Integration Points

ADR Relevance (RFC 0004):

pub async fn find_relevant_adrs(
    llm: &dyn LlmProvider,
    context: &str,
    adrs: &[AdrSummary],
) -> Result<Vec<RelevanceResult>> {
    let prompt = format_relevance_prompt(context, adrs);
    let response = llm.complete(&prompt, &RELEVANCE_OPTIONS).await?;
    parse_relevance_response(&response)
}

Runbook Matching (RFC 0002):

pub async fn match_action_semantic(
    llm: &dyn LlmProvider,
    query: &str,
    actions: &[String],
) -> Result<Option<String>> {
    // Use LLM to find best semantic match
}

9. Cargo Features & Build

[features]
default = ["ollama"]
ollama = []  # Embeds Ollama binary

[dependencies]
reqwest = { version = "0.12", features = ["json"] }  # Ollama HTTP client
tokio = { version = "1", features = ["process"] }     # Process management

[build-dependencies]
# Download Ollama binary at build time

Build Process:

// build.rs
const OLLAMA_VERSION: &str = "0.5.4";

fn main() {
    let target = std::env::var("TARGET").unwrap();

    let (ollama_url, sha256) = match target.as_str() {
        // macOS (Universal - works on Intel and Apple Silicon)
        t if t.contains("darwin") =>
            (format!("https://github.com/ollama/ollama/releases/download/v{}/ollama-darwin", OLLAMA_VERSION),
             "abc123..."),

        // Linux x86_64
        t if t.contains("x86_64") && t.contains("linux") =>
            (format!("https://github.com/ollama/ollama/releases/download/v{}/ollama-linux-amd64", OLLAMA_VERSION),
             "def456..."),

        // Linux ARM64 (Raspberry Pi 4/5, AWS Graviton, etc.)
        t if t.contains("aarch64") && t.contains("linux") =>
            (format!("https://github.com/ollama/ollama/releases/download/v{}/ollama-linux-arm64", OLLAMA_VERSION),
             "ghi789..."),

        // Windows x86_64
        t if t.contains("windows") =>
            (format!("https://github.com/ollama/ollama/releases/download/v{}/ollama-windows-amd64.exe", OLLAMA_VERSION),
             "jkl012..."),

        _ => panic!("Unsupported target: {}", target),
    };

    download_and_verify(&ollama_url, sha256);
    println!("cargo:rerun-if-changed=build.rs");
}

Supported Platforms:

Platform	Architecture	Ollama Binary
macOS	x86_64 + ARM64	ollama-darwin (universal)
Linux	x86_64	ollama-linux-amd64
Linux	ARM64	ollama-linux-arm64
Windows	x86_64	ollama-windows-amd64.exe

ARM64 Linux Use Cases:

Raspberry Pi 4/5 (8GB+ recommended)
AWS Graviton instances
NVIDIA Jetson
Apple Silicon Linux VMs

Binary Size:

Component	Size
Blue CLI	~5 MB
Ollama binary	~50 MB
Total	~55 MB

Models downloaded separately on first use.

10. Performance Expectations

Apple Silicon (M4 Max, 128GB, Metal/MPS):

Metric	Qwen2.5-7B	Qwen2.5-32B
Model load	2-3 sec	5-10 sec
Prompt processing	~150 tok/s	~100 tok/s
Generation	~80 tok/s	~50 tok/s
ADR relevance	100-200ms	200-400ms

NVIDIA GPU (RTX 4090, CUDA):

Metric	Qwen2.5-7B	Qwen2.5-32B
Model load	1-2 sec	3-5 sec
Prompt processing	~200 tok/s	~120 tok/s
Generation	~100 tok/s	~60 tok/s
ADR relevance	80-150ms	150-300ms

CPU Only (fallback):

Metric	Qwen2.5-7B	Qwen2.5-32B
Generation	~10 tok/s	~3 tok/s
ADR relevance	1-2 sec	5-10 sec

Metal/MPS on Apple Silicon is first-class - not a fallback.

11. Memory Validation

Ollama handles memory management, but Blue validates before pull:

impl EmbeddedOllama {
    pub async fn validate_can_pull(&self, model: &str) -> Result<()> {
        let model_size = self.get_model_size(model).await?;
        let available = sys_info::mem_info()?.avail * 1024;
        let buffer = model_size / 5;  // 20% buffer

        if available < model_size + buffer {
            return Err(LlmError::InsufficientMemory {
                model: model.to_string(),
                required: model_size + buffer,
                available,
                suggestion: format!(
                    "Close some applications or use a smaller model. \
                     Try: blue_model_pull name=\"qwen2.5:7b\""
                ),
            });
        }
        Ok(())
    }
}

Ollama's Own Handling:

Ollama gracefully handles memory pressure by unloading models. Blue's validation is advisory.

12. Build Requirements

Blue Build (all platforms):

# Just Rust toolchain
cargo build --release

Blue's build.rs downloads the pre-built Ollama binary for the target platform. No C++ compiler needed.

Runtime GPU Support:

Ollama bundles GPU support. User just needs drivers:

macOS (Metal):

Works out of box on Apple Silicon (M1/M2/M3/M4)
No additional setup needed

Linux (CUDA):

# NVIDIA drivers (CUDA Toolkit not needed for inference)
nvidia-smi  # Verify driver installed

Linux (ROCm):

# AMD GPU support
rocminfo  # Verify ROCm installed

Windows:

NVIDIA: Just need GPU drivers
Works on CPU if no GPU

Ollama handles everything else - users don't need to install CUDA Toolkit, cuDNN, etc.

Security Considerations

Ollama binary integrity: Verify SHA256 of bundled Ollama binary at build time
Model provenance: Ollama registry handles model verification
Local only by default: Ollama binds to localhost:11434, not exposed
Prompt injection: Sanitize user input before prompts
Memory: Ollama handles memory management
No secrets in prompts: ADR relevance only sends context strings
Process isolation: Ollama runs as subprocess, not linked

Network Binding:

impl EmbeddedOllama {
    pub async fn start(&mut self) -> Result<()> {
        let mut cmd = Command::new(Self::bundled_binary_path());

        // Bind to localhost only - not accessible from network
        cmd.env("OLLAMA_HOST", "127.0.0.1:11434");

        // ...
    }
}

Goose Access:

Goose connects to localhost:11434 - works because it's on the same machine. Remote access requires explicit OLLAMA_HOST=0.0.0.0:11434 override.

Port Conflict Handling

Scenario: User already has Ollama running on port 11434.

impl EmbeddedOllama {
    pub async fn start(&mut self) -> Result<()> {
        // Check if port 11434 is in use
        if Self::port_in_use(11434) {
            // Check if it's Ollama
            if Self::is_ollama_running().await? {
                // Use existing Ollama instance
                self.mode = OllamaMode::External;
                return Ok(());
            } else {
                // Something else on port - use alternate
                self.port = Self::find_free_port(11435..11500)?;
            }
        }

        // Start embedded Ollama on chosen port
        self.start_embedded().await
    }
}

Situation	Behavior
Port 11434 free	Start embedded Ollama
Ollama already running	Use existing (no duplicate)
Other service on port	Use alternate port (11435+)

Config override:

# .blue/config.yaml
llm:
  local:
    ollama_port: 11500  # Force specific port
    use_external: true   # Never start embedded, use existing

Binary Verification

Build-time verification:

// build.rs
const OLLAMA_SHA256: &str = "abc123...";  // Per-platform hashes

fn download_ollama() {
    let bytes = download(OLLAMA_URL)?;
    let hash = sha256(&bytes);

    if hash != OLLAMA_SHA256 {
        panic!("Ollama binary hash mismatch! Expected {}, got {}", OLLAMA_SHA256, hash);
    }

    write_binary(bytes)?;
}

Runtime verification:

impl EmbeddedOllama {
    fn verify_binary(&self) -> Result<()> {
        let expected = include_str!("ollama.sha256");
        let actual = sha256_file(Self::bundled_binary_path())?;

        if actual != expected {
            return Err(LlmError::BinaryTampered {
                expected: expected.to_string(),
                actual,
            });
        }
        Ok(())
    }

    pub async fn start(&mut self) -> Result<()> {
        self.verify_binary()?;  // Check before every start
        // ...
    }
}

Air-Gapped Builds

For environments without internet during build:

# 1. Download Ollama binary manually
curl -L https://github.com/ollama/ollama/releases/download/v0.5.4/ollama-darwin \
  -o vendor/ollama-darwin

# 2. Build with BLUE_OLLAMA_PATH
BLUE_OLLAMA_PATH=vendor/ollama-darwin cargo build --release

// build.rs
fn get_ollama_binary() -> Vec<u8> {
    if let Ok(path) = std::env::var("BLUE_OLLAMA_PATH") {
        // Use pre-downloaded binary
        std::fs::read(path).expect("Failed to read BLUE_OLLAMA_PATH")
    } else {
        // Download from GitHub
        download_ollama()
    }
}

Implementation Phases

Phase 1: Embedded Ollama

Add build.rs to download Ollama binary per platform
Create blue-ollama crate for embedded server management
Implement EmbeddedOllama::start() and stop()
Add blue daemon start/stop commands

Phase 2: LLM Provider 5. Add LlmProvider trait to blue-core 6. Implement OllamaLlm using HTTP client 7. Add blue_model_pull, blue_model_list tools 8. Implement auto-pull on first use

Phase 3: Semantic Integration 9. Integrate with ADR relevance (RFC 0004) 10. Add semantic runbook matching (RFC 0002) 11. Add fallback chain: Ollama → API → keywords

Phase 4: Goose Integration 12. Add blue agent command to launch Goose 13. Document Goose + Blue setup 14. Ship example configs

CI/CD Matrix

Test embedded Ollama on all platforms:

# .github/workflows/ci.yml
jobs:
  test-ollama:
    strategy:
      matrix:
        include:
          - os: macos-latest
            ollama_binary: ollama-darwin
          - os: ubuntu-latest
            ollama_binary: ollama-linux-amd64
          - os: windows-latest
            ollama_binary: ollama-windows-amd64.exe

    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4

      - name: Build Blue (downloads Ollama binary)
        run: cargo build --release

      - name: Verify Ollama binary embedded
        run: |
          # Check binary exists in expected location
          ls -la target/release/ollama*          

      - name: Test daemon start/stop
        run: |
          cargo run -- daemon start
          sleep 5
          curl -s http://localhost:11434/api/version
          cargo run -- daemon stop          

      - name: Test with mock model (no download)
        run: cargo test ollama::mock

  # GPU tests run on self-hosted runners
  test-gpu:
    runs-on: [self-hosted, gpu]
    steps:
      - uses: actions/checkout@v4
      - name: Test CUDA detection
        run: |
          cargo build --release
          cargo run -- daemon start
          # Verify GPU detected
          curl -s http://localhost:11434/api/version | jq .gpu
          cargo run -- daemon stop

Note: Full model integration tests run nightly (large downloads).

Test Plan

Embedded Ollama:

blue daemon start launches embedded Ollama
blue daemon stop cleanly shuts down
Ollama detects CUDA when available
Ollama detects Metal on macOS
Falls back to CPU when no GPU
Health check returns backend type

Model Management:

blue_model_pull downloads from Ollama registry
blue_model_list shows pulled models
blue_model_remove deletes model
Auto-pull on first completion if model missing
Progress indicator during pull

LLM Provider:

OllamaLlm::complete() returns valid response
Fallback chain: Ollama → API → keywords
--no-ai flag skips LLM entirely
Configuration parsing from .blue/config.yaml

Semantic Integration:

ADR relevance uses embedded Ollama
Runbook matching uses semantic search
Response includes method used (ollama/api/keywords)

Goose Integration:

blue agent starts Goose with Blue extension
Goose connects to Blue's embedded Ollama
Goose can use Blue MCP tools
Model shared between Blue tasks and Goose

Multi-Session:

Multiple Blue MCP sessions share one Ollama
Concurrent completions handled correctly
Daemon persists across shell sessions

Port Conflict:

Detects existing Ollama on port 11434
Uses existing Ollama instead of starting new
Uses alternate port if non-Ollama on 11434
use_external: true config works

Health & Recovery:

Health check detects unhealthy Ollama
Auto-restart on crash
Falls back to API after 3 restart failures
Graceful shutdown waits for requests

Binary Verification:

Build fails if Ollama hash mismatch
Runtime verification before start
Tampered binary: clear error message
Air-gapped build with BLUE_OLLAMA_PATH works

CI Matrix:

macOS build includes darwin Ollama binary
Linux x86_64 build includes amd64 binary
Linux ARM64 build includes arm64 binary
Windows build includes windows binary
Integration tests with mock Ollama server

"Right then. Let's get to it."

— Blue

30 KiB Raw Blame History