# RFC 0005: Local LLM Integration

| | |
|---|---|
| **Status** | Implemented |
| **Date** | 2026-01-24 |
| **Source Spike** | local-llm-integration, agentic-cli-integration |

---

## Summary

Blue needs local LLM capabilities for:

1. **Semantic tasks** - ADR relevance, runbook matching, dialogue summarization (lightweight, fast)
2. **Agentic coding** - Full code generation via Goose integration (heavyweight, powerful)

Unified approach: **Ollama as shared backend** + **Goose for agentic tasks** + **Blue's LlmProvider for semantic tasks**. Must support CUDA > MPS > CPU backend priority.

## Background

### Two Use Cases

| Use Case | Latency | Complexity | Tool |
|----------|---------|------------|------|
| **Semantic tasks** | <500ms | Short prompts, structured output | Blue internal |
| **Agentic coding** | Minutes | Multi-turn, code generation | Goose |

### Blue's Semantic Tasks

| Feature | RFC | Need |
|---------|-----|------|
| ADR Relevance | 0004 | Match context to philosophical ADRs |
| Runbook Lookup | 0002 | Semantic action matching |
| Dialogue Summary | 0001 | Extract key decisions |

### Why Local LLM?

- **Privacy**: No data leaves the machine
- **Cost**: Zero per-query cost after model download
- **Speed**: Sub-second latency for short tasks
- **Offline**: Works without internet

### Why Embed Ollama?

| Approach | Pros | Cons |
|----------|------|------|
| llama-cpp-rs | Rust-native | Build complexity, no model management |
| Ollama (external) | Easy setup | User must install separately |
| **Ollama (embedded)** | Single install, full features | Larger binary, Go dependency |

**Embedded Ollama wins because:**

1. **Single install** - `cargo install blue` gives you everything
2. **Model management built-in** - pull, list, remove models
3. **Goose compatibility** - Goose connects to Blue's embedded Ollama
4. **Battle-tested** - Ollama handles CUDA/MPS/CPU, quantization, context
5. **One model, all uses** - Semantic tasks and agentic coding share a single model

### Ollama Version

Blue embeds a specific, tested Ollama version:

| Blue Version | Ollama Version | Release Date |
|--------------|----------------|--------------|
| 0.1.x | 0.5.4 | 2026-01 |

Version pinned in `build.rs`. Updated via Blue releases, not automatically.

## Proposal

### 1. LlmProvider Trait

```rust
#[async_trait]
pub trait LlmProvider: Send + Sync {
    async fn complete(&self, prompt: &str, options: &CompletionOptions) -> Result<String>;
    fn name(&self) -> &str;
}

pub struct CompletionOptions {
    pub max_tokens: usize,
    pub temperature: f32,
    pub stop_sequences: Vec<String>,
}
```

### 2. Implementations

```rust
pub enum LlmBackend {
    Ollama(OllamaLlm), // Embedded Ollama server
    Api(ApiLlm),       // External API fallback
    Mock(MockLlm),     // Testing
}
```

**OllamaLlm**: Embedded Ollama server managed by Blue
**ApiLlm**: Uses Anthropic/OpenAI APIs (fallback)
**MockLlm**: Returns predefined responses (testing)

### 2.1 Embedded Ollama Architecture

```
┌─────────────────────────────────────────────────┐
│ Blue CLI                                        │
├─────────────────────────────────────────────────┤
│ blue-ollama (embedded)                          │
│   ├── Ollama server (Go, compiled to lib)       │
│   ├── Model management (pull, list, remove)     │
│   └── HTTP API on localhost:11434               │
├─────────────────────────────────────────────────┤
│ Consumers:                                      │
│   ├── Blue semantic tasks (ADR relevance, etc.) │
│   ├── Goose (connects to localhost:11434)       │
│   └── Any Ollama-compatible client              │
└─────────────────────────────────────────────────┘
```

**Embedding Strategy:**

```rust
// blue-ollama crate
pub struct EmbeddedOllama {
    process: Option<Child>,
    port: u16,
    models_dir: PathBuf,
}

impl EmbeddedOllama {
    /// Start embedded Ollama server
    pub async fn start(&mut self) -> Result<()> {
        // Ollama binary bundled in Blue release
        let ollama_bin = Self::bundled_binary_path();
        self.process = Some(
            Command::new(ollama_bin)
                .env("OLLAMA_MODELS", &self.models_dir)
                .env("OLLAMA_HOST", format!("127.0.0.1:{}", self.port))
                .spawn()?
        );
        self.wait_for_ready().await
    }

    /// Stop embedded server
    pub async fn stop(&mut self) -> Result<()> {
        if let Some(mut proc) = self.process.take() {
            proc.kill()?;
        }
        Ok(())
    }
}
```

### 3. Backend Priority (CUDA > MPS > CPU)

**Ollama handles this automatically.** Ollama detects the GPU at runtime:

| Platform | Backend | Detection |
|----------|---------|-----------|
| NVIDIA GPU | CUDA | Auto-detected via driver |
| Apple Silicon | **Metal (MPS)** | Auto-detected on M1/M2/M3/M4 |
| AMD GPU | ROCm | Auto-detected on Linux |
| No GPU | CPU | Fallback |

```bash
# Ollama auto-detects the best backend
ollama run qwen2.5:7b  # Uses CUDA → Metal → ROCm → CPU
```

**Apple Silicon (M1/M2/M3/M4):**

- Ollama uses Metal Performance Shaders (MPS) automatically
- No configuration needed - just works
- Full GPU acceleration on unified memory

**Blue just starts Ollama and lets it choose:**

```rust
impl EmbeddedOllama {
    pub async fn start(&mut self) -> Result<()> {
        let mut cmd = Command::new(Self::bundled_binary_path());

        // Force a specific backend if configured
        match self.config.backend {
            BackendChoice::Cuda => {
                cmd.env("CUDA_VISIBLE_DEVICES", "0");
                cmd.env("OLLAMA_NO_METAL", "1"); // Prefer CUDA over Metal
            }
            BackendChoice::Mps => {
                // Metal/MPS on Apple Silicon (default on macOS)
                cmd.env("CUDA_VISIBLE_DEVICES", ""); // Disable CUDA
            }
            BackendChoice::Cpu => {
                cmd.env("CUDA_VISIBLE_DEVICES", ""); // Disable CUDA
                cmd.env("OLLAMA_NO_METAL", "1");     // Disable Metal/MPS
            }
            BackendChoice::Auto => {
                // Let Ollama decide: CUDA → MPS → ROCm → CPU
            }
        }

        self.process = Some(cmd.spawn()?);
        self.wait_for_ready().await
    }
}
```

**Backend verification:**

```rust
impl EmbeddedOllama {
    pub async fn detected_backend(&self) -> Result<String> {
        // Query Ollama for what it's using
        let resp = self.client.get("/api/version").await?;
        // Returns: {"version": "0.5.4", "gpu": "cuda"} or "metal" or "cpu"
        Ok(resp.gpu)
    }
}
```

### 4. Configuration

**Default: API (easier setup)**

New users get API by default - just set an env var:

```bash
export ANTHROPIC_API_KEY=sk-...
# That's it. Blue works.
```

**Opt-in: Local (better privacy/cost)**

```bash
blue_model_pull name="qwen2.5:7b"
# Edit .blue/config.yaml to prefer local
```

**Full Configuration:**

```yaml
# .blue/config.yaml
llm:
  provider: auto  # auto | local | api | none
  # auto (default): Use local if model exists, else API, else keywords
  # local: Only use local, fail if unavailable
  # api: Only use API, fail if unavailable
  # none: Disable AI features entirely

  local:
    model: qwen2.5-7b  # Shorthand, resolves to full path
    # Or explicit:
    model_path: ~/.blue/models/qwen2.5-7b-instruct-q4_k_m.gguf
    backend: auto      # cuda | mps | cpu | auto
    context_size: 8192
    threads: 8         # for CPU backend

  api:
    provider: anthropic            # anthropic | openai
    model: claude-3-haiku-20240307
    api_key_env: ANTHROPIC_API_KEY # Read from env var
```

**Zero-Config Experience:**

| User State | Behavior |
|------------|----------|
| No config, no env var | Keywords only (works offline) |
| `ANTHROPIC_API_KEY` set | API (easiest) |
| Model downloaded | Local (best) |
| Both available | Local preferred |
### 5. Model Management (via Embedded Ollama)

Blue wraps Ollama's model commands:

```
blue_model_list    # ollama list
blue_model_pull    # ollama pull
blue_model_remove  # ollama rm
blue_model_info    # ollama show
```

Model storage: `~/.ollama/models/` (Ollama default, shared with external Ollama)

**Recommended Models:**

| Model | Size | Use Case |
|-------|------|----------|
| `qwen2.5:7b` | ~4.4GB | Fast, good quality |
| `qwen2.5:32b` | ~19GB | Best quality |
| `qwen2.5-coder:7b` | ~4.4GB | Code-focused |
| `qwen2.5-coder:32b` | ~19GB | Best for agentic coding |

**Pull Example:**

```
blue_model_pull name="qwen2.5:7b"
→ Pulling qwen2.5:7b...
→ [████████████████████] 100% (4.4 GB)
→ Model ready. Run: blue_model_info name="qwen2.5:7b"
```

**Licensing:** Qwen2.5 models are Apache 2.0 - commercial use permitted.

### 5.1 Goose Integration

Blue bundles the Goose binary and auto-configures it for local Ollama:

```
┌────────────────────────────────────────────────────┐
│ User runs: blue agent                              │
│   ↓                                                │
│ Blue detects Ollama on localhost:11434             │
│   ↓                                                │
│ Picks largest available model (e.g., qwen2.5:72b)  │
│   ↓                                                │
│ Launches bundled Goose with Blue MCP extension     │
└────────────────────────────────────────────────────┘
```

**Zero Setup:**

```bash
# Just run it - Blue handles everything
blue agent

# What happens:
# 1. Uses bundled Goose binary (downloaded at build time)
# 2. Detects Ollama running on localhost:11434
# 3. Selects largest model (best for agentic work)
# 4. Sets GOOSE_PROVIDER=ollama, OLLAMA_HOST=...
# 5. Connects Blue MCP extension for workflow tools
```

**Manual Model Override:**

```bash
# Use a specific provider/model
blue agent --model ollama/qwen2.5:7b
blue agent --model anthropic/claude-sonnet-4-20250514

# Pass additional Goose arguments
blue agent -- --resume --name my-session
```

**Goose Binary Bundling:**

Blue's `build.rs` downloads the Goose binary for the target platform:

| Platform | Binary |
|----------|--------|
| macOS ARM64 | goose-aarch64-apple-darwin |
| macOS x86_64 | goose-x86_64-apple-darwin |
| Linux x86_64 | goose-x86_64-unknown-linux-gnu |
| Linux ARM64 | goose-aarch64-unknown-linux-gnu |
| Windows | goose-x86_64-pc-windows-gnu |

**Build-time Download:**

```rust
// apps/blue-cli/build.rs
const GOOSE_VERSION: &str = "1.21.1";
// Downloads goose-{arch}-{os}.tar.bz2 from GitHub releases
// Extracts to OUT_DIR, sets GOOSE_BINARY_PATH env var
```

**Runtime Discovery:**

1. Check for a bundled binary next to the `blue` executable
2. Check compile-time `GOOSE_BINARY_PATH`
3. Fall back to system PATH (validates it's Block's Goose, not the DB migration tool)

**Shared Model Benefits:**

| Without Blue | With Blue |
|--------------|-----------|
| Install Goose separately | Blue bundles Goose |
| Install Ollama separately | Blue bundles Ollama |
| Configure Goose manually | `blue agent` auto-configures |
| Model loaded twice | One model instance |
| 40GB RAM for two 32B models | 20GB for shared model |
### 6. Graceful Degradation

```rust
impl BlueState {
    pub async fn get_llm(&self) -> Option<&dyn LlmProvider> {
        // Try local first
        if let Some(local) = &self.local_llm {
            if local.is_ready() {
                return Some(local);
            }
        }
        // Fall back to API
        if let Some(api) = &self.api_llm {
            return Some(api);
        }
        // No LLM available
        None
    }
}
```

| Condition | Behavior |
|-----------|----------|
| Local model loaded | Use local (default) |
| Local unavailable, API configured | Fall back to API + warning |
| Neither available | Keyword matching only |
| `--no-ai` flag | Skip AI entirely |

### 7. Model Loading Strategy

**Problem:** Model load takes 5-10 seconds. Can't block MCP calls.

**Solution:** Daemon preloads the model on startup.

```rust
impl EmbeddedOllama {
    pub async fn warmup(&self, model: &str) -> Result<()> {
        // Send a dummy request to load the model into memory
        self.client
            .post("/api/generate")
            .json(&json!({
                "model": model,
                "prompt": "Hi",
                "options": { "num_predict": 1 }
            }))
            .send()
            .await?;
        // Model now loaded and warm
        Ok(())
    }
}
```

**Daemon Startup:**

```bash
blue daemon start
→ Starting embedded Ollama...
→ Ollama ready on localhost:11434
→ Warming up qwen2.5:7b... (5-10 seconds)
→ Model ready.
```

**MCP Tool Response During Load:**

```json
{
  "status": "loading",
  "message": "Model loading... Try again in a few seconds.",
  "retry_after_ms": 2000
}
```

**Auto-Warmup:** Daemon warms up the configured model on start. First MCP request is fast.

**Manual Warmup:**

```
blue_model_warmup model="qwen2.5:32b"  # Load a specific model
```

### 8. Multi-Session Model Handling

**Question:** What if the user has multiple Blue MCP sessions (multiple IDE windows)?

**Answer:** All sessions share one Ollama instance via `blue daemon`.

```
┌────────────────────────────────────────────────────┐
│ blue daemon (singleton)                            │
│   └── Embedded Ollama (localhost:11434)            │
│         └── Model loaded once (~20GB for 32B)      │
├────────────────────────────────────────────────────┤
│ Blue MCP Session 1 ──┐                             │
│ Blue MCP Session 2 ──┼──→ HTTP to localhost:11434  │
│ Goose ───────────────┘                             │
└────────────────────────────────────────────────────┘
```

**Benefits:**

- One model in memory, not one per session
- Goose shares the same model instance
- Daemon manages the Ollama lifecycle
- Sessions can come and go

**Daemon Lifecycle:**

```bash
blue daemon start   # Start Ollama, keep running
blue daemon stop    # Stop Ollama
blue daemon status  # Check health and GPU info

# Auto-start: first MCP connection starts the daemon if not running
```

**Status Output:**

```
$ blue daemon status
Blue Daemon: running
├── Ollama: healthy (v0.5.4)
├── Backend: Metal (MPS) - Apple M4 Max
├── Port: 11434
├── Models loaded: qwen2.5:32b (19GB)
├── Uptime: 2h 34m
└── Requests served: 1,247
```

### Daemon Health & Recovery

**Health checks:**

```rust
impl EmbeddedOllama {
    pub async fn health_check(&self) -> Result<HealthStatus> {
        match self.client.get("/api/version").await {
            Ok(resp) => Ok(HealthStatus::Healthy {
                version: resp.version,
                gpu: resp.gpu,
            }),
            Err(e) => Ok(HealthStatus::Unhealthy { error: e.to_string() }),
        }
    }

    pub fn start_health_monitor(self: &Arc<Self>) {
        let this = Arc::clone(self);
        tokio::spawn(async move {
            loop {
                tokio::time::sleep(Duration::from_secs(30)).await;
                if let Ok(HealthStatus::Unhealthy { .. }) = this.health_check().await {
                    log::warn!("Ollama unhealthy, attempting restart...");
                    this.restart().await.ok();
                }
            }
        });
    }
}
```

**Crash recovery:**

| Scenario | Behavior |
|----------|----------|
| Ollama crashes | Auto-restart within 5 seconds |
| Restart fails 3x | Mark as failed, fall back to API |
| User calls `daemon restart` | Force restart, reset failure count |

**Graceful shutdown:**

```rust
impl EmbeddedOllama {
    pub async fn stop(&mut self) -> Result<()> {
        // Signal Ollama to finish current requests
        self.client.post("/api/shutdown").await.ok();

        // Wait up to 10 seconds for graceful shutdown
        tokio::time::timeout(
            Duration::from_secs(10),
            self.wait_for_exit()
        ).await.ok();

        // Force kill if still running
        if let Some(mut proc) = self.process.take() {
            proc.kill().ok();
        }
        Ok(())
    }
}
```

### 9. Integration Points

**ADR Relevance (RFC 0004):**

```rust
pub async fn find_relevant_adrs(
    llm: &dyn LlmProvider,
    context: &str,
    adrs: &[AdrSummary],
) -> Result<Vec<AdrId>> {
    let prompt = format_relevance_prompt(context, adrs);
    let response = llm.complete(&prompt, &RELEVANCE_OPTIONS).await?;
    parse_relevance_response(&response)
}
```

**Runbook Matching (RFC 0002):**

```rust
pub async fn match_action_semantic(
    llm: &dyn LlmProvider,
    query: &str,
    actions: &[String],
) -> Result<Option<String>> {
    // Use the LLM to find the best semantic match
    todo!()
}
```
### 10. Cargo Features & Build

```toml
[features]
default = ["ollama"]
ollama = []  # Embeds Ollama binary

[dependencies]
reqwest = { version = "0.12", features = ["json"] }  # Ollama HTTP client
tokio = { version = "1", features = ["process"] }    # Process management

[build-dependencies]
# Download Ollama binary at build time
```

**Build Process:**

```rust
// build.rs
const OLLAMA_VERSION: &str = "0.5.4";

fn main() {
    let target = std::env::var("TARGET").unwrap();
    let (ollama_url, sha256) = match target.as_str() {
        // macOS (universal - works on Intel and Apple Silicon)
        t if t.contains("darwin") => (
            format!("https://github.com/ollama/ollama/releases/download/v{}/ollama-darwin", OLLAMA_VERSION),
            "abc123...",
        ),
        // Linux x86_64
        t if t.contains("x86_64") && t.contains("linux") => (
            format!("https://github.com/ollama/ollama/releases/download/v{}/ollama-linux-amd64", OLLAMA_VERSION),
            "def456...",
        ),
        // Linux ARM64 (Raspberry Pi 4/5, AWS Graviton, etc.)
        t if t.contains("aarch64") && t.contains("linux") => (
            format!("https://github.com/ollama/ollama/releases/download/v{}/ollama-linux-arm64", OLLAMA_VERSION),
            "ghi789...",
        ),
        // Windows x86_64
        t if t.contains("windows") => (
            format!("https://github.com/ollama/ollama/releases/download/v{}/ollama-windows-amd64.exe", OLLAMA_VERSION),
            "jkl012...",
        ),
        _ => panic!("Unsupported target: {}", target),
    };
    download_and_verify(&ollama_url, sha256);
    println!("cargo:rerun-if-changed=build.rs");
}
```

**Supported Platforms:**

| Platform | Architecture | Ollama Binary |
|----------|--------------|---------------|
| macOS | x86_64 + ARM64 | ollama-darwin (universal) |
| Linux | x86_64 | ollama-linux-amd64 |
| Linux | ARM64 | ollama-linux-arm64 |
| Windows | x86_64 | ollama-windows-amd64.exe |

**ARM64 Linux Use Cases:**

- Raspberry Pi 4/5 (8GB+ recommended)
- AWS Graviton instances
- NVIDIA Jetson
- Apple Silicon Linux VMs

**Binary Size:**

| Component | Size |
|-----------|------|
| Blue CLI | ~5 MB |
| Ollama binary | ~50 MB |
| **Total** | ~55 MB |

Models are downloaded separately on first use.

### 11. Performance Expectations

**Apple Silicon (M4 Max, 128GB, Metal/MPS):**

| Metric | Qwen2.5-7B | Qwen2.5-32B |
|--------|------------|-------------|
| Model load | 2-3 sec | 5-10 sec |
| Prompt processing | ~150 tok/s | ~100 tok/s |
| Generation | ~80 tok/s | ~50 tok/s |
| ADR relevance | 100-200ms | 200-400ms |

**NVIDIA GPU (RTX 4090, CUDA):**

| Metric | Qwen2.5-7B | Qwen2.5-32B |
|--------|------------|-------------|
| Model load | 1-2 sec | 3-5 sec |
| Prompt processing | ~200 tok/s | ~120 tok/s |
| Generation | ~100 tok/s | ~60 tok/s |
| ADR relevance | 80-150ms | 150-300ms |

**CPU Only (fallback):**

| Metric | Qwen2.5-7B | Qwen2.5-32B |
|--------|------------|-------------|
| Generation | ~10 tok/s | ~3 tok/s |
| ADR relevance | 1-2 sec | 5-10 sec |

Metal/MPS on Apple Silicon is first-class - not a fallback.

### 12. Memory Validation

Ollama handles memory management, but Blue validates before pulling:

```rust
impl EmbeddedOllama {
    pub async fn validate_can_pull(&self, model: &str) -> Result<()> {
        let model_size = self.get_model_size(model).await?;
        let available = sys_info::mem_info()?.avail * 1024;
        let buffer = model_size / 5; // 20% buffer

        if available < model_size + buffer {
            return Err(LlmError::InsufficientMemory {
                model: model.to_string(),
                required: model_size + buffer,
                available,
                suggestion: format!(
                    "Close some applications or use a smaller model. \
                     Try: blue_model_pull name=\"qwen2.5:7b\""
                ),
            });
        }
        Ok(())
    }
}
```

**Ollama's Own Handling:** Ollama gracefully handles memory pressure by unloading models. Blue's validation is advisory.

### 13. Build Requirements

**Blue Build (all platforms):**

```bash
# Just the Rust toolchain
cargo build --release
```

Blue's build.rs downloads the pre-built Ollama binary for the target platform. No C++ compiler needed.

**Runtime GPU Support:** Ollama bundles GPU support.
Users just need drivers:

**macOS (Metal):**

- Works out of the box on Apple Silicon (M1/M2/M3/M4)
- No additional setup needed

**Linux (CUDA):**

```bash
# NVIDIA drivers (CUDA Toolkit not needed for inference)
nvidia-smi  # Verify driver installed
```

**Linux (ROCm):**

```bash
# AMD GPU support
rocminfo  # Verify ROCm installed
```

**Windows:**

- NVIDIA: just needs GPU drivers
- Works on CPU if no GPU

**Ollama handles everything else** - users don't need to install the CUDA Toolkit, cuDNN, etc.

## Security Considerations

1. **Ollama binary integrity**: Verify SHA256 of the bundled Ollama binary at build time
2. **Model provenance**: Ollama registry handles model verification
3. **Local only by default**: Ollama binds to localhost:11434, not exposed
4. **Prompt injection**: Sanitize user input before prompts
5. **Memory**: Ollama handles memory management
6. **No secrets in prompts**: ADR relevance only sends context strings
7. **Process isolation**: Ollama runs as a subprocess, not linked

**Network Binding:**

```rust
impl EmbeddedOllama {
    pub async fn start(&mut self) -> Result<()> {
        let mut cmd = Command::new(Self::bundled_binary_path());
        // Bind to localhost only - not accessible from the network
        cmd.env("OLLAMA_HOST", "127.0.0.1:11434");
        // ...
    }
}
```

**Goose Access:** Goose connects to `localhost:11434` - this works because it's on the same machine. Remote access requires an explicit `OLLAMA_HOST=0.0.0.0:11434` override.

### Port Conflict Handling

**Scenario:** The user already has Ollama running on port 11434.

```rust
impl EmbeddedOllama {
    pub async fn start(&mut self) -> Result<()> {
        // Check if port 11434 is in use
        if Self::port_in_use(11434) {
            // Check if it's Ollama
            if Self::is_ollama_running().await? {
                // Use the existing Ollama instance
                self.mode = OllamaMode::External;
                return Ok(());
            } else {
                // Something else on the port - use an alternate
                self.port = Self::find_free_port(11435..11500)?;
            }
        }
        // Start embedded Ollama on the chosen port
        self.start_embedded().await
    }
}
```

| Situation | Behavior |
|-----------|----------|
| Port 11434 free | Start embedded Ollama |
| Ollama already running | Use existing (no duplicate) |
| Other service on port | Use alternate port (11435+) |

**Config override:**

```yaml
# .blue/config.yaml
llm:
  local:
    ollama_port: 11500   # Force specific port
    use_external: true   # Never start embedded, use existing
```

### Binary Verification

**Build-time verification:**

```rust
// build.rs
const OLLAMA_SHA256: &str = "abc123..."; // Per-platform hashes

fn download_ollama() -> Vec<u8> {
    let bytes = download(OLLAMA_URL).expect("download failed");
    let hash = sha256(&bytes);
    if hash != OLLAMA_SHA256 {
        panic!("Ollama binary hash mismatch! Expected {}, got {}", OLLAMA_SHA256, hash);
    }
    write_binary(&bytes).expect("failed to write binary");
    bytes
}
```

**Runtime verification:**

```rust
impl EmbeddedOllama {
    fn verify_binary(&self) -> Result<()> {
        let expected = include_str!("ollama.sha256").trim();
        let actual = sha256_file(Self::bundled_binary_path())?;
        if actual != expected {
            return Err(LlmError::BinaryTampered {
                expected: expected.to_string(),
                actual,
            });
        }
        Ok(())
    }

    pub async fn start(&mut self) -> Result<()> {
        self.verify_binary()?; // Check before every start
        // ...
    }
}
```

### Air-Gapped Builds

For environments without internet access during the build:

```bash
# 1. Download Ollama binary manually
curl -L https://github.com/ollama/ollama/releases/download/v0.5.4/ollama-darwin \
  -o vendor/ollama-darwin

# 2. Build with BLUE_OLLAMA_PATH
BLUE_OLLAMA_PATH=vendor/ollama-darwin cargo build --release
```

```rust
// build.rs
fn get_ollama_binary() -> Vec<u8> {
    if let Ok(path) = std::env::var("BLUE_OLLAMA_PATH") {
        // Use the pre-downloaded binary
        std::fs::read(path).expect("Failed to read BLUE_OLLAMA_PATH")
    } else {
        // Download from GitHub
        download_ollama()
    }
}
```

## Implementation Phases

**Phase 1: Embedded Ollama**

1. Add build.rs to download the Ollama binary per platform
2. Create `blue-ollama` crate for embedded server management
3. Implement `EmbeddedOllama::start()` and `stop()`
4. Add `blue daemon start/stop` commands

**Phase 2: LLM Provider**

5. Add `LlmProvider` trait to blue-core
6. Implement `OllamaLlm` using the HTTP client
7. Add `blue_model_pull`, `blue_model_list` tools
8. Implement auto-pull on first use

**Phase 3: Semantic Integration**

9. Integrate with ADR relevance (RFC 0004)
10. Add semantic runbook matching (RFC 0002)
11. Add fallback chain: Ollama → API → keywords

**Phase 4: Goose Integration**

12. Add `blue agent` command to launch Goose
13. Document Goose + Blue setup
14. Ship example configs

## CI/CD Matrix

Test embedded Ollama on all platforms:

```yaml
# .github/workflows/ci.yml
jobs:
  test-ollama:
    strategy:
      matrix:
        include:
          - os: macos-latest
            ollama_binary: ollama-darwin
          - os: ubuntu-latest
            ollama_binary: ollama-linux-amd64
          - os: windows-latest
            ollama_binary: ollama-windows-amd64.exe
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4
      - name: Build Blue (downloads Ollama binary)
        run: cargo build --release
      - name: Verify Ollama binary embedded
        run: |
          # Check binary exists in expected location
          ls -la target/release/ollama*
      - name: Test daemon start/stop
        run: |
          cargo run -- daemon start
          sleep 5
          curl -s http://localhost:11434/api/version
          cargo run -- daemon stop
      - name: Test with mock model (no download)
        run: cargo test ollama::mock

  # GPU tests run on self-hosted runners
  test-gpu:
    runs-on: [self-hosted, gpu]
    steps:
      - uses: actions/checkout@v4
      - name: Test CUDA detection
        run: |
          cargo build --release
          cargo run -- daemon start
          # Verify GPU detected
          curl -s http://localhost:11434/api/version | jq .gpu
          cargo run -- daemon stop
```

**Note:** Full model integration tests run nightly (large downloads).
## Test Plan

**Embedded Ollama:**

- [ ] `blue daemon start` launches embedded Ollama
- [ ] `blue daemon stop` cleanly shuts down
- [ ] Ollama detects CUDA when available
- [ ] Ollama detects Metal on macOS
- [ ] Falls back to CPU when no GPU
- [ ] Health check returns backend type

**Model Management:**

- [ ] `blue_model_pull` downloads from the Ollama registry
- [ ] `blue_model_list` shows pulled models
- [ ] `blue_model_remove` deletes a model
- [ ] Auto-pull on first completion if the model is missing
- [ ] Progress indicator during pull

**LLM Provider:**

- [ ] `OllamaLlm::complete()` returns a valid response
- [ ] Fallback chain: Ollama → API → keywords
- [ ] `--no-ai` flag skips the LLM entirely
- [ ] Configuration parsing from .blue/config.yaml

**Semantic Integration:**

- [ ] ADR relevance uses embedded Ollama
- [ ] Runbook matching uses semantic search
- [ ] Response includes method used (ollama/api/keywords)

**Goose Integration:**

- [ ] `blue agent` starts Goose with the Blue extension
- [ ] Goose connects to Blue's embedded Ollama
- [ ] Goose can use Blue MCP tools
- [ ] Model shared between Blue tasks and Goose

**Multi-Session:**

- [ ] Multiple Blue MCP sessions share one Ollama
- [ ] Concurrent completions handled correctly
- [ ] Daemon persists across shell sessions

**Port Conflict:**

- [ ] Detects existing Ollama on port 11434
- [ ] Uses existing Ollama instead of starting a new one
- [ ] Uses alternate port if a non-Ollama service is on 11434
- [ ] `use_external: true` config works

**Health & Recovery:**

- [ ] Health check detects unhealthy Ollama
- [ ] Auto-restart on crash
- [ ] Falls back to API after 3 restart failures
- [ ] Graceful shutdown waits for requests

**Binary Verification:**

- [ ] Build fails on Ollama hash mismatch
- [ ] Runtime verification before start
- [ ] Tampered binary: clear error message
- [ ] Air-gapped build with BLUE_OLLAMA_PATH works

**CI Matrix:**

- [ ] macOS build includes the darwin Ollama binary
- [ ] Linux x86_64 build includes the amd64 binary
- [ ] Linux ARM64 build includes the arm64 binary
- [ ] Windows build includes the windows binary
- [ ] Integration tests with mock Ollama server

---

*"Right then. Let's get to it."* — Blue