RFC 0005: Local LLM Integration
| Status | Implemented |
| Date | 2026-01-24 |
| Source Spike | local-llm-integration, agentic-cli-integration |
Summary
Blue needs local LLM capabilities for:
- Semantic tasks - ADR relevance, runbook matching, dialogue summarization (lightweight, fast)
- Agentic coding - Full code generation via Goose integration (heavyweight, powerful)
Unified approach: Ollama as shared backend + Goose for agentic tasks + Blue's LlmProvider for semantic tasks.
Must support CUDA > MPS > CPU backend priority.
Background
Two Use Cases
| Use Case | Latency | Complexity | Tool |
|---|---|---|---|
| Semantic tasks | <500ms | Short prompts, structured output | Blue internal |
| Agentic coding | Minutes | Multi-turn, code generation | Goose |
Blue's Semantic Tasks
| Feature | RFC | Need |
|---|---|---|
| ADR Relevance | 0004 | Match context to philosophical ADRs |
| Runbook Lookup | 0002 | Semantic action matching |
| Dialogue Summary | 0001 | Extract key decisions |
Why Local LLM?
- Privacy: No data leaves the machine
- Cost: Zero per-query cost after model download
- Speed: Sub-second latency for short tasks
- Offline: Works without internet
Why Embed Ollama?
| Approach | Pros | Cons |
|---|---|---|
| llama-cpp-rs | Rust-native | Build complexity, no model management |
| Ollama (external) | Easy setup | User must install separately |
| Ollama (embedded) | Single install, full features | Larger binary, Go dependency |
Embedded Ollama wins because:
- Single install - `cargo install blue` gives you everything
- Model management built-in - pull, list, remove models
- Goose compatibility - Goose connects to Blue's embedded Ollama
- Battle-tested - Ollama handles CUDA/MPS/CPU, quantization, context
- One model, all uses - Semantic tasks + agentic coding share model
Ollama Version
Blue embeds a specific, tested Ollama version:
| Blue Version | Ollama Version | Release Date |
|---|---|---|
| 0.1.x | 0.5.4 | 2026-01 |
Version pinned in build.rs. Updated via Blue releases, not automatically.
Proposal
1. LlmProvider Trait
#[async_trait]
pub trait LlmProvider: Send + Sync {
async fn complete(&self, prompt: &str, options: &CompletionOptions) -> Result<String>;
fn name(&self) -> &str;
}
pub struct CompletionOptions {
pub max_tokens: usize,
pub temperature: f32,
pub stop_sequences: Vec<String>,
}
2. Implementations
pub enum LlmBackend {
Ollama(OllamaLlm), // Embedded Ollama server
Api(ApiLlm), // External API fallback
Mock(MockLlm), // Testing
}
- OllamaLlm: Embedded Ollama server managed by Blue
- ApiLlm: Uses Anthropic/OpenAI APIs (fallback)
- MockLlm: Returns predefined responses (testing)
2.1 Embedded Ollama Architecture
┌─────────────────────────────────────────────────────────┐
│ Blue CLI │
├─────────────────────────────────────────────────────────┤
│ blue-ollama (embedded) │
│ ├── Ollama server (Go, compiled to lib) │
│ ├── Model management (pull, list, remove) │
│ └── HTTP API on localhost:11434 │
├─────────────────────────────────────────────────────────┤
│ Consumers: │
│ ├── Blue semantic tasks (ADR relevance, etc.) │
│ ├── Goose (connects to localhost:11434) │
│ └── Any Ollama-compatible client │
└─────────────────────────────────────────────────────────┘
Embedding Strategy:
// blue-ollama crate
pub struct EmbeddedOllama {
process: Option<Child>,
port: u16,
models_dir: PathBuf,
}
impl EmbeddedOllama {
/// Start embedded Ollama server
pub async fn start(&mut self) -> Result<()> {
// Ollama binary bundled in Blue release
let ollama_bin = Self::bundled_binary_path();
self.process = Some(
Command::new(ollama_bin)
.env("OLLAMA_MODELS", &self.models_dir)
.env("OLLAMA_HOST", format!("127.0.0.1:{}", self.port))
.spawn()?
);
self.wait_for_ready().await
}
/// Stop embedded server
pub async fn stop(&mut self) -> Result<()> {
if let Some(mut proc) = self.process.take() {
proc.kill()?;
}
Ok(())
}
}
3. Backend Priority (CUDA > MPS > CPU)
Ollama handles this automatically, detecting the available GPU at runtime:
| Platform | Backend | Detection |
|---|---|---|
| NVIDIA GPU | CUDA | Auto-detected via driver |
| Apple Silicon | Metal (MPS) | Auto-detected on M1/M2/M3/M4 |
| AMD GPU | ROCm | Auto-detected on Linux |
| No GPU | CPU | Fallback |
# Ollama auto-detects best backend
ollama run qwen2.5:7b # Uses CUDA → Metal → ROCm → CPU
Apple Silicon (M1/M2/M3/M4):
- Ollama uses Metal Performance Shaders (MPS) automatically
- No configuration needed - just works
- Full GPU acceleration on unified memory
Blue just starts Ollama and lets it choose:
impl EmbeddedOllama {
pub async fn start(&mut self) -> Result<()> {
let mut cmd = Command::new(Self::bundled_binary_path());
// Force specific backend if configured
match self.config.backend {
BackendChoice::Cuda => {
cmd.env("CUDA_VISIBLE_DEVICES", "0");
cmd.env("OLLAMA_NO_METAL", "1"); // Prefer CUDA over Metal
}
BackendChoice::Mps => {
// Metal/MPS on Apple Silicon (default on macOS)
cmd.env("CUDA_VISIBLE_DEVICES", ""); // Disable CUDA
}
BackendChoice::Cpu => {
cmd.env("CUDA_VISIBLE_DEVICES", ""); // Disable CUDA
cmd.env("OLLAMA_NO_METAL", "1"); // Disable Metal/MPS
}
BackendChoice::Auto => {
// Let Ollama decide: CUDA → MPS → ROCm → CPU
}
}
self.process = Some(cmd.spawn()?);
self.wait_for_ready().await
}
}
Backend verification:
impl EmbeddedOllama {
pub async fn detected_backend(&self) -> Result<String> {
// Query Ollama for what it's using
let resp = self.client.get("/api/version").await?;
// Returns: {"version": "0.5.4", "gpu": "cuda"} or "metal" or "cpu"
Ok(resp.gpu)
}
}
4. Configuration
Default: API (easier setup)
New users get API by default - just set an env var:
export ANTHROPIC_API_KEY=sk-...
# That's it. Blue works.
Opt-in: Local (better privacy/cost)
blue_model_pull name="qwen2.5:7b"
# Edit .blue/config.yaml to prefer local
Full Configuration:
# .blue/config.yaml
llm:
provider: auto # auto | local | api | none
# auto (default): Use local if model exists, else API, else keywords
# local: Only use local, fail if unavailable
# api: Only use API, fail if unavailable
# none: Disable AI features entirely
local:
model: qwen2.5-7b # Shorthand, resolves to full path
# Or explicit: model_path: ~/.blue/models/qwen2.5-7b-instruct-q4_k_m.gguf
backend: auto # cuda | mps | cpu | auto
context_size: 8192
threads: 8 # for CPU backend
api:
provider: anthropic # anthropic | openai
model: claude-3-haiku-20240307
api_key_env: ANTHROPIC_API_KEY # Read from env var
Zero-Config Experience:
| User State | Behavior |
|---|---|
| No config, no env var | Keywords only (works offline) |
| `ANTHROPIC_API_KEY` set | API (easiest) |
| Model downloaded | Local (best) |
| Both available | Local preferred |
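The resolution order in the table above reduces to a small pure function. A minimal sketch, with hypothetical names (`Resolved`, `resolve_provider` are illustrative, not Blue's actual API):

```rust
// Sketch of the `provider` resolution described above.
// `Resolved` and `resolve_provider` are hypothetical names.
#[derive(Debug, PartialEq)]
pub enum Resolved {
    Local,
    Api,
    Keywords,
}

pub fn resolve_provider(mode: &str, local_model_ready: bool, api_key_set: bool) -> Resolved {
    match mode {
        "local" => Resolved::Local,   // fail later if unavailable
        "api" => Resolved::Api,       // fail later if unavailable
        "none" => Resolved::Keywords, // AI features disabled
        _ => {
            // "auto": local if a model exists, else API, else keywords
            if local_model_ready {
                Resolved::Local
            } else if api_key_set {
                Resolved::Api
            } else {
                Resolved::Keywords
            }
        }
    }
}
```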
5. Model Management (via Embedded Ollama)
Blue wraps Ollama's model commands:
blue_model_list # ollama list
blue_model_pull # ollama pull
blue_model_remove # ollama rm
blue_model_info # ollama show
Model storage: ~/.ollama/models/ (Ollama default, shared with external Ollama)
Recommended Models:
| Model | Size | Use Case |
|---|---|---|
| `qwen2.5:7b` | ~4.4GB | Fast, good quality |
| `qwen2.5:32b` | ~19GB | Best quality |
| `qwen2.5-coder:7b` | ~4.4GB | Code-focused |
| `qwen2.5-coder:32b` | ~19GB | Best for agentic coding |
Pull Example:
blue_model_pull name="qwen2.5:7b"
→ Pulling qwen2.5:7b...
→ [████████████████████] 100% (4.4 GB)
→ Model ready. Run: blue_model_info name="qwen2.5:7b"
Licensing: Qwen2.5 models are Apache 2.0 - commercial use permitted.
5.1 Goose Integration
Blue bundles Goose binary and auto-configures it for local Ollama:
┌─────────────────────────────────────────────────────────┐
│ User runs: blue agent │
│ ↓ │
│ Blue detects Ollama on localhost:11434 │
│ ↓ │
│ Picks largest available model (e.g., qwen2.5:72b) │
│ ↓ │
│ Launches bundled Goose with Blue MCP extension │
└─────────────────────────────────────────────────────────┘
Zero Setup:
# Just run it - Blue handles everything
blue agent
# What happens:
# 1. Uses bundled Goose binary (downloaded at build time)
# 2. Detects Ollama running on localhost:11434
# 3. Selects largest model (best for agentic work)
# 4. Sets GOOSE_PROVIDER=ollama, OLLAMA_HOST=...
# 5. Connects Blue MCP extension for workflow tools
Manual Model Override:
# Use a specific provider/model
blue agent --model ollama/qwen2.5:7b
blue agent --model anthropic/claude-sonnet-4-20250514
# Pass additional Goose arguments
blue agent -- --resume --name my-session
Goose Binary Bundling:
Blue's build.rs downloads the Goose binary for the target platform:
| Platform | Binary |
|---|---|
| macOS ARM64 | goose-aarch64-apple-darwin |
| macOS x86_64 | goose-x86_64-apple-darwin |
| Linux x86_64 | goose-x86_64-unknown-linux-gnu |
| Linux ARM64 | goose-aarch64-unknown-linux-gnu |
| Windows | goose-x86_64-pc-windows-gnu |
Build-time Download:
// apps/blue-cli/build.rs
const GOOSE_VERSION: &str = "1.21.1";
// Downloads goose-{arch}-{os}.tar.bz2 from GitHub releases
// Extracts to OUT_DIR, sets GOOSE_BINARY_PATH env var
Runtime Discovery:
- Check for bundled binary next to the `blue` executable
- Check compile-time `GOOSE_BINARY_PATH`
- Fall back to system PATH (validates it's Block's Goose, not the DB migration tool)
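That discovery order is a simple first-match chain. A minimal sketch, assuming each probe has already been run (function and parameter names are ours, not Blue's):

```rust
// Hypothetical sketch of the bundled → compile-time → PATH fallback.
// Each argument holds the result of one probe (None if it failed).
pub fn discover_goose(
    bundled: Option<String>,      // binary next to the `blue` executable
    compile_time: Option<String>, // GOOSE_BINARY_PATH baked in by build.rs
    system_path: Option<String>,  // validated `goose` found on PATH
) -> Option<String> {
    // First probe that succeeded wins.
    bundled.or(compile_time).or(system_path)
}
```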
Shared Model Benefits:
| Without Blue | With Blue |
|---|---|
| Install Goose separately | Blue bundles Goose |
| Install Ollama separately | Blue bundles Ollama |
| Configure Goose manually | blue agent auto-configures |
| Model loaded twice | One model instance |
| 40GB RAM for two 32B models | 20GB for shared model |
6. Graceful Degradation
impl BlueState {
pub async fn get_llm(&self) -> Option<&dyn LlmProvider> {
// Try local first
if let Some(local) = &self.local_llm {
if local.is_ready() {
return Some(local);
}
}
// Fall back to API
if let Some(api) = &self.api_llm {
return Some(api);
}
// No LLM available
None
}
}
| Condition | Behavior |
|---|---|
| Local model loaded | Use local (default) |
| Local unavailable, API configured | Fall back to API + warning |
| Neither available | Keyword matching only |
| `--no-ai` flag | Skip AI entirely |
7. Model Loading Strategy
Problem: Model load takes 5-10 seconds. Can't block MCP calls.
Solution: Daemon preloads model on startup.
impl EmbeddedOllama {
pub async fn warmup(&self, model: &str) -> Result<()> {
// Send a dummy request to load model into memory
self.client
.post("/api/generate")
.json(&json!({
"model": model,
"prompt": "Hi",
"options": { "num_predict": 1 }
}))
.send()
.await?;
// Model now loaded and warm
Ok(())
}
}
Daemon Startup:
blue daemon start
→ Starting embedded Ollama...
→ Ollama ready on localhost:11434
→ Warming up qwen2.5:7b... (5-10 seconds)
→ Model ready.
MCP Tool Response During Load:
{
"status": "loading",
"message": "Model loading... Try again in a few seconds.",
"retry_after_ms": 2000
}
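A client can honor the `retry_after_ms` hint with a bounded retry loop. A sketch under assumed types (`ToolStatus` and `call_with_retry` are illustrative, not the actual MCP client):

```rust
use std::thread::sleep;
use std::time::Duration;

// Hypothetical shape of a tool-call result while the model loads.
pub enum ToolStatus {
    Loading { retry_after_ms: u64 },
    Ready(String),
}

// Retry until the model is warm, giving up after `max_attempts`.
pub fn call_with_retry(
    mut call: impl FnMut() -> ToolStatus,
    max_attempts: u32,
) -> Option<String> {
    for _ in 0..max_attempts {
        match call() {
            ToolStatus::Ready(out) => return Some(out),
            ToolStatus::Loading { retry_after_ms } => {
                // Honor the server's backoff hint before retrying
                sleep(Duration::from_millis(retry_after_ms));
            }
        }
    }
    None
}
```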
Auto-Warmup: Daemon warms up configured model on start. First MCP request is fast.
Manual Warmup:
blue_model_warmup model="qwen2.5:32b" # Load specific model
8. Multi-Session Model Handling
Question: What if user has multiple Blue MCP sessions (multiple IDE windows)?
Answer: All sessions share one Ollama instance via blue daemon.
┌─────────────────────────────────────────────────────────┐
│ blue daemon (singleton) │
│ └── Embedded Ollama (localhost:11434) │
│ └── Model loaded once (~20GB for 32B) │
├─────────────────────────────────────────────────────────┤
│ Blue MCP Session 1 ──┐ │
│ Blue MCP Session 2 ──┼──→ HTTP to localhost:11434 │
│ Goose ──┘ │
└─────────────────────────────────────────────────────────┘
Benefits:
- One model in memory, not per-session
- Goose shares same model instance
- Daemon manages Ollama lifecycle
- Sessions can come and go
Daemon Lifecycle:
blue daemon start # Start Ollama, keep running
blue daemon stop # Stop Ollama
blue daemon status # Check health and GPU info
# Auto-start: first MCP connection starts daemon if not running
Status Output:
$ blue daemon status
Blue Daemon: running
├── Ollama: healthy (v0.5.4)
├── Backend: Metal (MPS) - Apple M4 Max
├── Port: 11434
├── Models loaded: qwen2.5:32b (19GB)
├── Uptime: 2h 34m
└── Requests served: 1,247
Daemon Health & Recovery
Health checks:
impl EmbeddedOllama {
pub async fn health_check(&self) -> Result<HealthStatus> {
match self.client.get("/api/version").await {
Ok(resp) => Ok(HealthStatus::Healthy {
version: resp.version,
gpu: resp.gpu,
}),
Err(e) => Ok(HealthStatus::Unhealthy { error: e.to_string() }),
}
}
pub fn start_health_monitor(self: &Arc<Self>) {
// Clone an owned handle so the spawned task can outlive this borrow
let ollama = Arc::clone(self);
tokio::spawn(async move {
loop {
tokio::time::sleep(Duration::from_secs(30)).await;
if let Ok(HealthStatus::Unhealthy { .. }) = ollama.health_check().await {
log::warn!("Ollama unhealthy, attempting restart...");
let _ = ollama.restart().await;
}
}
});
}
}
Crash recovery:
| Scenario | Behavior |
|---|---|
| Ollama crashes | Auto-restart within 5 seconds |
| Restart fails 3x | Mark as failed, fall back to API |
| User calls `daemon restart` | Force restart, reset failure count |
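The restart policy in the table is a small state machine. A sketch with illustrative type names (not Blue's actual implementation):

```rust
// Hypothetical restart-policy tracker for the "3 failures → API" rule.
#[derive(Debug, PartialEq)]
pub enum Action {
    Restarted,     // Ollama came back, counter reset
    Retry,         // restart failed, try again
    FallBackToApi, // 3 consecutive failures, stop trying
}

#[derive(Default)]
pub struct RestartPolicy {
    consecutive_failures: u32,
}

impl RestartPolicy {
    pub fn on_restart_attempt(&mut self, succeeded: bool) -> Action {
        if succeeded {
            self.consecutive_failures = 0;
            Action::Restarted
        } else {
            self.consecutive_failures += 1;
            if self.consecutive_failures >= 3 {
                Action::FallBackToApi
            } else {
                Action::Retry
            }
        }
    }

    // `blue daemon restart` resets the failure count
    pub fn reset(&mut self) {
        self.consecutive_failures = 0;
    }
}
```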
Graceful shutdown:
impl EmbeddedOllama {
pub async fn stop(&mut self) -> Result<()> {
// Signal Ollama to finish current requests
self.client.post("/api/shutdown").await.ok();
// Wait up to 10 seconds for graceful shutdown
tokio::time::timeout(
Duration::from_secs(10),
self.wait_for_exit()
).await.ok();
// Force kill if still running
if let Some(mut proc) = self.process.take() {
proc.kill().ok();
}
Ok(())
}
}
9. Integration Points
ADR Relevance (RFC 0004):
pub async fn find_relevant_adrs(
llm: &dyn LlmProvider,
context: &str,
adrs: &[AdrSummary],
) -> Result<Vec<RelevanceResult>> {
let prompt = format_relevance_prompt(context, adrs);
let response = llm.complete(&prompt, &RELEVANCE_OPTIONS).await?;
parse_relevance_response(&response)
}
Runbook Matching (RFC 0002):
pub async fn match_action_semantic(
llm: &dyn LlmProvider,
query: &str,
actions: &[String],
) -> Result<Option<String>> {
// Sketch: ask the LLM to pick the closest action. Prompt format and
// MATCH_OPTIONS are illustrative, not final.
let prompt = format!(
"Which action best matches the request?\nRequest: {query}\nActions:\n{}\nAnswer with the exact action name, or NONE.",
actions.join("\n")
);
let answer = llm.complete(&prompt, &MATCH_OPTIONS).await?;
let answer = answer.trim();
Ok(actions.iter().find(|a| a.as_str() == answer).cloned())
}
10. Cargo Features & Build
[features]
default = ["ollama"]
ollama = [] # Embeds Ollama binary
[dependencies]
reqwest = { version = "0.12", features = ["json"] } # Ollama HTTP client
tokio = { version = "1", features = ["process"] } # Process management
[build-dependencies]
# Download Ollama binary at build time
Build Process:
// build.rs
const OLLAMA_VERSION: &str = "0.5.4";
fn main() {
let target = std::env::var("TARGET").unwrap();
let (ollama_url, sha256) = match target.as_str() {
// macOS (Universal - works on Intel and Apple Silicon)
t if t.contains("darwin") =>
(format!("https://github.com/ollama/ollama/releases/download/v{}/ollama-darwin", OLLAMA_VERSION),
"abc123..."),
// Linux x86_64
t if t.contains("x86_64") && t.contains("linux") =>
(format!("https://github.com/ollama/ollama/releases/download/v{}/ollama-linux-amd64", OLLAMA_VERSION),
"def456..."),
// Linux ARM64 (Raspberry Pi 4/5, AWS Graviton, etc.)
t if t.contains("aarch64") && t.contains("linux") =>
(format!("https://github.com/ollama/ollama/releases/download/v{}/ollama-linux-arm64", OLLAMA_VERSION),
"ghi789..."),
// Windows x86_64
t if t.contains("windows") =>
(format!("https://github.com/ollama/ollama/releases/download/v{}/ollama-windows-amd64.exe", OLLAMA_VERSION),
"jkl012..."),
_ => panic!("Unsupported target: {}", target),
};
download_and_verify(&ollama_url, sha256);
println!("cargo:rerun-if-changed=build.rs");
}
Supported Platforms:
| Platform | Architecture | Ollama Binary |
|---|---|---|
| macOS | x86_64 + ARM64 | ollama-darwin (universal) |
| Linux | x86_64 | ollama-linux-amd64 |
| Linux | ARM64 | ollama-linux-arm64 |
| Windows | x86_64 | ollama-windows-amd64.exe |
ARM64 Linux Use Cases:
- Raspberry Pi 4/5 (8GB+ recommended)
- AWS Graviton instances
- NVIDIA Jetson
- Apple Silicon Linux VMs
Binary Size:
| Component | Size |
|---|---|
| Blue CLI | ~5 MB |
| Ollama binary | ~50 MB |
| Total | ~55 MB |
Models downloaded separately on first use.
11. Performance Expectations
Apple Silicon (M4 Max, 128GB, Metal/MPS):
| Metric | Qwen2.5-7B | Qwen2.5-32B |
|---|---|---|
| Model load | 2-3 sec | 5-10 sec |
| Prompt processing | ~150 tok/s | ~100 tok/s |
| Generation | ~80 tok/s | ~50 tok/s |
| ADR relevance | 100-200ms | 200-400ms |
NVIDIA GPU (RTX 4090, CUDA):
| Metric | Qwen2.5-7B | Qwen2.5-32B |
|---|---|---|
| Model load | 1-2 sec | 3-5 sec |
| Prompt processing | ~200 tok/s | ~120 tok/s |
| Generation | ~100 tok/s | ~60 tok/s |
| ADR relevance | 80-150ms | 150-300ms |
CPU Only (fallback):
| Metric | Qwen2.5-7B | Qwen2.5-32B |
|---|---|---|
| Generation | ~10 tok/s | ~3 tok/s |
| ADR relevance | 1-2 sec | 5-10 sec |
Metal/MPS on Apple Silicon is first-class - not a fallback.
12. Memory Validation
Ollama handles memory management, but Blue validates before pull:
impl EmbeddedOllama {
pub async fn validate_can_pull(&self, model: &str) -> Result<()> {
let model_size = self.get_model_size(model).await?;
let available = sys_info::mem_info()?.avail * 1024;
let buffer = model_size / 5; // 20% buffer
if available < model_size + buffer {
return Err(LlmError::InsufficientMemory {
model: model.to_string(),
required: model_size + buffer,
available,
suggestion: format!(
"Close some applications or use a smaller model. \
Try: blue_model_pull name=\"qwen2.5:7b\""
),
});
}
Ok(())
}
}
Ollama's Own Handling:
Ollama gracefully handles memory pressure by unloading models. Blue's validation is advisory.
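The 20% headroom rule reduces to a one-line check. A sketch of the arithmetic (sizes in bytes; `can_pull` is a hypothetical helper name):

```rust
// Advisory memory check: require model size plus a 20% buffer.
pub fn can_pull(model_size: u64, available: u64) -> bool {
    let buffer = model_size / 5; // 20% of model size
    available >= model_size + buffer
}
```

For example, a ~4.4 GB model needs roughly 5.3 GB free, while a ~19 GB model needs roughly 22.8 GB.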
13. Build Requirements
Blue Build (all platforms):
# Just Rust toolchain
cargo build --release
Blue's build.rs downloads the pre-built Ollama binary for the target platform. No C++ compiler needed.
Runtime GPU Support:
Ollama bundles GPU support. User just needs drivers:
macOS (Metal):
- Works out of box on Apple Silicon (M1/M2/M3/M4)
- No additional setup needed
Linux (CUDA):
# NVIDIA drivers (CUDA Toolkit not needed for inference)
nvidia-smi # Verify driver installed
Linux (ROCm):
# AMD GPU support
rocminfo # Verify ROCm installed
Windows:
- NVIDIA: Just need GPU drivers
- Works on CPU if no GPU
Ollama handles everything else - users don't need to install CUDA Toolkit, cuDNN, etc.
Security Considerations
- Ollama binary integrity: Verify SHA256 of bundled Ollama binary at build time
- Model provenance: Ollama registry handles model verification
- Local only by default: Ollama binds to localhost:11434, not exposed
- Prompt injection: Sanitize user input before prompts
- Memory: Ollama handles memory management
- No secrets in prompts: ADR relevance only sends context strings
- Process isolation: Ollama runs as subprocess, not linked
Network Binding:
impl EmbeddedOllama {
pub async fn start(&mut self) -> Result<()> {
let mut cmd = Command::new(Self::bundled_binary_path());
// Bind to localhost only - not accessible from network
cmd.env("OLLAMA_HOST", "127.0.0.1:11434");
// ...
}
}
Goose Access:
Goose connects to localhost:11434 - works because it's on the same machine. Remote access requires explicit OLLAMA_HOST=0.0.0.0:11434 override.
Port Conflict Handling
Scenario: User already has Ollama running on port 11434.
impl EmbeddedOllama {
pub async fn start(&mut self) -> Result<()> {
// Check if port 11434 is in use
if Self::port_in_use(11434) {
// Check if it's Ollama
if Self::is_ollama_running().await? {
// Use existing Ollama instance
self.mode = OllamaMode::External;
return Ok(());
} else {
// Something else on port - use alternate
self.port = Self::find_free_port(11435..11500)?;
}
}
// Start embedded Ollama on chosen port
self.start_embedded().await
}
}
| Situation | Behavior |
|---|---|
| Port 11434 free | Start embedded Ollama |
| Ollama already running | Use existing (no duplicate) |
| Other service on port | Use alternate port (11435+) |
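The snippet above leaves `port_in_use` and `find_free_port` undefined; they can be sketched with the standard library's `TcpListener` (a minimal sketch, not the actual implementation):

```rust
use std::net::TcpListener;
use std::ops::Range;

// A port counts as "in use" if we cannot bind it on localhost.
pub fn port_in_use(port: u16) -> bool {
    TcpListener::bind(("127.0.0.1", port)).is_err()
}

// Scan the alternate range (11435..11500 in the snippet) for a free port.
pub fn find_free_port(range: Range<u16>) -> Option<u16> {
    range.into_iter().find(|p| !port_in_use(*p))
}
```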
Config override:
# .blue/config.yaml
llm:
local:
ollama_port: 11500 # Force specific port
use_external: true # Never start embedded, use existing
Binary Verification
Build-time verification:
// build.rs
const OLLAMA_SHA256: &str = "abc123..."; // Per-platform hashes
fn download_ollama() -> Vec<u8> {
let bytes = download(OLLAMA_URL).expect("Failed to download Ollama binary");
let hash = sha256(&bytes);
if hash != OLLAMA_SHA256 {
panic!("Ollama binary hash mismatch! Expected {}, got {}", OLLAMA_SHA256, hash);
}
write_binary(&bytes).expect("Failed to write Ollama binary");
bytes
}
Runtime verification:
impl EmbeddedOllama {
fn verify_binary(&self) -> Result<()> {
let expected = include_str!("ollama.sha256");
let actual = sha256_file(Self::bundled_binary_path())?;
if actual != expected {
return Err(LlmError::BinaryTampered {
expected: expected.to_string(),
actual,
});
}
Ok(())
}
pub async fn start(&mut self) -> Result<()> {
self.verify_binary()?; // Check before every start
// ...
}
}
Air-Gapped Builds
For environments without internet during build:
# 1. Download Ollama binary manually
curl -L https://github.com/ollama/ollama/releases/download/v0.5.4/ollama-darwin \
-o vendor/ollama-darwin
# 2. Build with BLUE_OLLAMA_PATH
BLUE_OLLAMA_PATH=vendor/ollama-darwin cargo build --release
// build.rs
fn get_ollama_binary() -> Vec<u8> {
if let Ok(path) = std::env::var("BLUE_OLLAMA_PATH") {
// Use pre-downloaded binary
std::fs::read(path).expect("Failed to read BLUE_OLLAMA_PATH")
} else {
// Download from GitHub
download_ollama()
}
}
Implementation Phases
Phase 1: Embedded Ollama
1. Add build.rs to download Ollama binary per platform
2. Create `blue-ollama` crate for embedded server management
3. Implement `EmbeddedOllama::start()` and `stop()`
4. Add `blue daemon start/stop` commands
Phase 2: LLM Provider
5. Add LlmProvider trait to blue-core
6. Implement OllamaLlm using HTTP client
7. Add blue_model_pull, blue_model_list tools
8. Implement auto-pull on first use
Phase 3: Semantic Integration
9. Integrate with ADR relevance (RFC 0004)
10. Add semantic runbook matching (RFC 0002)
11. Add fallback chain: Ollama → API → keywords
Phase 4: Goose Integration
12. Add blue agent command to launch Goose
13. Document Goose + Blue setup
14. Ship example configs
CI/CD Matrix
Test embedded Ollama on all platforms:
# .github/workflows/ci.yml
jobs:
test-ollama:
strategy:
matrix:
include:
- os: macos-latest
ollama_binary: ollama-darwin
- os: ubuntu-latest
ollama_binary: ollama-linux-amd64
- os: windows-latest
ollama_binary: ollama-windows-amd64.exe
runs-on: ${{ matrix.os }}
steps:
- uses: actions/checkout@v4
- name: Build Blue (downloads Ollama binary)
run: cargo build --release
- name: Verify Ollama binary embedded
run: |
# Check binary exists in expected location
ls -la target/release/ollama*
- name: Test daemon start/stop
run: |
cargo run -- daemon start
sleep 5
curl -s http://localhost:11434/api/version
cargo run -- daemon stop
- name: Test with mock model (no download)
run: cargo test ollama::mock
# GPU tests run on self-hosted runners
test-gpu:
runs-on: [self-hosted, gpu]
steps:
- uses: actions/checkout@v4
- name: Test CUDA detection
run: |
cargo build --release
cargo run -- daemon start
# Verify GPU detected
curl -s http://localhost:11434/api/version | jq .gpu
cargo run -- daemon stop
Note: Full model integration tests run nightly (large downloads).
Test Plan
Embedded Ollama:
- `blue daemon start` launches embedded Ollama
- `blue daemon stop` cleanly shuts down
- Ollama detects CUDA when available
- Ollama detects Metal on macOS
- Falls back to CPU when no GPU
- Health check returns backend type
Model Management:
- `blue_model_pull` downloads from Ollama registry
- `blue_model_list` shows pulled models
- `blue_model_remove` deletes model
- Auto-pull on first completion if model missing
- Progress indicator during pull
LLM Provider:
- `OllamaLlm::complete()` returns valid response
- Fallback chain: Ollama → API → keywords
- `--no-ai` flag skips LLM entirely
- Configuration parsing from .blue/config.yaml
Semantic Integration:
- ADR relevance uses embedded Ollama
- Runbook matching uses semantic search
- Response includes method used (ollama/api/keywords)
Goose Integration:
- `blue agent` starts Goose with Blue extension
- Goose connects to Blue's embedded Ollama
- Goose can use Blue MCP tools
- Model shared between Blue tasks and Goose
Multi-Session:
- Multiple Blue MCP sessions share one Ollama
- Concurrent completions handled correctly
- Daemon persists across shell sessions
Port Conflict:
- Detects existing Ollama on port 11434
- Uses existing Ollama instead of starting new
- Uses alternate port if non-Ollama on 11434
- `use_external: true` config works
Health & Recovery:
- Health check detects unhealthy Ollama
- Auto-restart on crash
- Falls back to API after 3 restart failures
- Graceful shutdown waits for requests
Binary Verification:
- Build fails if Ollama hash mismatch
- Runtime verification before start
- Tampered binary: clear error message
- Air-gapped build with BLUE_OLLAMA_PATH works
CI Matrix:
- macOS build includes darwin Ollama binary
- Linux x86_64 build includes amd64 binary
- Linux ARM64 build includes arm64 binary
- Windows build includes windows binary
- Integration tests with mock Ollama server
"Right then. Let's get to it."
— Blue