hearth/docs/architecture.md
Eric Garcia e78000831e Initial commit: Port infrastructure from coherence-mcp
Hearth is the infrastructure home for the letemcook ecosystem.

Ported from coherence-mcp/infra:
- Terraform modules (VPC, EKS, IAM, NLB, S3, storage)
- Kubernetes manifests (Forgejo, ingress, cert-manager, karpenter)
- Deployment scripts (phased rollout)

Status: Not deployed. EKS cluster needs to be provisioned.

Next steps:
1. Bootstrap terraform backend
2. Deploy phase 1 (foundation)
3. Deploy phase 2 (core services including Forgejo)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 06:06:13 -05:00


# Foundation Infrastructure
RFC 0039: ADR-Compliant Foundation Infrastructure
## Overview
This directory contains Terraform modules and Kubernetes manifests for deploying
the Alignment foundation infrastructure on AWS EKS.
## Architecture
```
                        Internet
                            |
                  +---------+----------+
                  |     Shared NLB     |
                  |     (~$16/mo)      |
                  +--------------------+
                  | :53  DNS (PowerDNS)|
                  | :25  SMTP          |
                  | :587 Submission    |
                  | :993 IMAPS         |
                  | :443 HTTPS         |
                  +--------+-----------+
                           |
      +--------------------+--------------------+
      |                    |                    |
+-----+------+       +-----+------+      +------+-----+
|    AZ-a    |       |    AZ-b    |      |    AZ-c    |
+------------+       +------------+      +------------+
|            |       |            |      |            |
| Karpenter  |       | Karpenter  |      | Karpenter  |
| Spot Nodes |       | Spot Nodes |      | Spot Nodes |
|            |       |            |      |            |
+------------+       +------------+      +------------+
|            |       |            |      |            |
| CockroachDB|       | CockroachDB|      | CockroachDB|
| (m6i.large)|       | (m6i.large)|      | (m6i.large)|
|            |       |            |      |            |
+------------+       +------------+      +------------+
```
## Cost Breakdown
| Component | Monthly Cost |
|-----------|--------------|
| EKS Control Plane | $73 |
| CockroachDB (3x m6i.large, 3yr) | $105 |
| NLB | $16 |
| EFS | $5 |
| S3 | $5 |
| Spot nodes (variable) | $0-50 |
| **Total** | **$204-254** |
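The total can be re-derived from the rows above. The following is a small sketch that only restates the table's figures in shell arithmetic; the variable names are illustrative and not part of the repository.

```bash
# Fixed monthly costs in USD, copied from the table above.
eks=73; cockroachdb=105; nlb=16; efs=5; s3=5
# Spot node spend is variable; the table gives a $0-50 range.
spot_min=0; spot_max=50

fixed=$((eks + cockroachdb + nlb + efs + s3))
total_min=$((fixed + spot_min))
total_max=$((fixed + spot_max))

echo "Fixed: \$${fixed}/mo, total: \$${total_min}-${total_max}/mo"
```

If a row changes (for example, a different CockroachDB instance type), updating the matching variable keeps the total consistent with the table.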
## ADR Compliance
- **ADR 0003**: Self-hosted CockroachDB with FIPS 140-2
- **ADR 0004**: "Set It and Forget It" auto-scaling with Karpenter
- **ADR 0005**: Full-stack self-hosting (no SaaS dependencies)
## Prerequisites
1. AWS CLI configured with appropriate credentials
2. Terraform >= 1.6.0
3. kubectl
4. Helm 3.x
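A quick pre-flight check for the tools listed above might look like the sketch below. The helper names `check_tools` and `version_ge` are hypothetical, and `version_ge` assumes a `sort` that supports `-V` (GNU coreutils).

```bash
# Report every required command that is not on PATH; return non-zero if any is missing.
check_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# True when version $1 is greater than or equal to version $2 (uses `sort -V`).
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

check_tools aws terraform kubectl helm \
  || echo "Install the missing tools before continuing." >&2
```

To check the Terraform requirement, pass the installed version as the first argument, for example `version_ge "$installed" 1.6.0`, where `$installed` is whatever version string you extract from `terraform version`.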
## Quick Start
### 1. Bootstrap Terraform Backend
First, create the S3 bucket and DynamoDB table for Terraform state:
```bash
cd terraform/environments/production
# Uncomment the backend.tf bootstrap code and run:
# terraform init && terraform apply
```
### 2. Deploy Foundation Infrastructure
```bash
cd terraform/environments/production
terraform init
terraform plan
terraform apply
```
### 3. Configure kubectl
```bash
aws eks update-kubeconfig --region us-east-1 --name alignment-production
```
### 4. Deploy Karpenter
```bash
# Run from the infra/ root so the kubernetes/ paths below resolve.
# Read cluster details from the Terraform outputs of the production environment.
export CLUSTER_NAME=$(terraform -chdir=terraform/environments/production output -raw cluster_name)
export CLUSTER_ENDPOINT=$(terraform -chdir=terraform/environments/production output -raw cluster_endpoint)
export KARPENTER_ROLE_ARN=$(terraform -chdir=terraform/environments/production output -raw karpenter_role_arn)
export INTERRUPTION_QUEUE_NAME=$(terraform -chdir=terraform/environments/production output -raw karpenter_interruption_queue_name)
# Install Karpenter
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --namespace karpenter --create-namespace \
  -f kubernetes/karpenter/helm-values.yaml \
  --set settings.clusterName="$CLUSTER_NAME" \
  --set settings.clusterEndpoint="$CLUSTER_ENDPOINT" \
  --set settings.interruptionQueue="$INTERRUPTION_QUEUE_NAME" \
  --set serviceAccount.annotations."eks\.amazonaws\.com/role-arn"="$KARPENTER_ROLE_ARN"
# Apply the EC2NodeClass first, since the NodePool references it
kubectl apply -f kubernetes/karpenter/ec2nodeclass.yaml
kubectl apply -f kubernetes/karpenter/nodepool.yaml
```
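Because `export VAR=$(...)` returns the exit status of `export`, a failed `terraform output` leaves the variable empty without stopping the shell. A guard like the following sketch can catch that before `helm` runs; `require_vars` is a hypothetical helper, not something shipped in this repository.

```bash
# Fail if any named variable is unset or empty (POSIX sh compatible).
require_vars() {
  rc=0
  for name in "$@"; do
    # Indirect lookup: copy the value of the variable called $name into $value.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "error: $name is empty; check 'terraform output'" >&2
      rc=1
    fi
  done
  return "$rc"
}

require_vars CLUSTER_NAME CLUSTER_ENDPOINT KARPENTER_ROLE_ARN INTERRUPTION_QUEUE_NAME \
  || echo "Fix the variables above before running helm." >&2
```

In a deployment script, replacing the trailing `echo` with `exit 1` would stop the rollout instead of only warning.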
### 5. Deploy Storage Classes
```bash
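# Run from the infra/ root; the EFS ID comes from the production Terraform outputs
export EFS_ID=$(terraform -chdir=terraform/environments/production output -raw efs_id)
# Substitute only ${EFS_ID}; any other $-expressions in the manifest stay untouched
envsubst '${EFS_ID}' < kubernetes/storage/classes.yaml | kubectl apply -f -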
```
## Directory Structure
```
infra/
├── terraform/
│   ├── main.tf              # Root module
│   ├── variables.tf         # Input variables
│   ├── outputs.tf           # Output values
│   ├── versions.tf          # Provider versions
│   ├── modules/
│   │   ├── vpc/             # VPC with multi-AZ subnets
│   │   ├── eks/             # EKS cluster with Fargate
│   │   ├── iam/             # IAM roles and IRSA
│   │   ├── storage/         # EFS and S3
│   │   ├── nlb/             # Shared NLB
│   │   └── cockroachdb/     # CockroachDB (future)
│   └── environments/
│       └── production/      # Production config
├── kubernetes/
│   ├── karpenter/           # Karpenter manifests
│   ├── cockroachdb/         # CockroachDB StatefulSet
│   ├── storage/             # Storage classes
│   ├── ingress/             # Ingress configuration
│   └── cert-manager/        # TLS certificates
└── README.md
```
## Modules
### VPC Module
Creates a VPC with:
- 3 availability zones
- Public subnets (for NLB, NAT Gateways)
- Private subnets (for EKS nodes, workloads)
- Database subnets (isolated, for CockroachDB)
- NAT Gateway per AZ for HA
- VPC endpoints for S3, ECR, STS, EC2
### EKS Module
Creates an EKS cluster with:
- Kubernetes 1.29
- Fargate profiles for Karpenter and kube-system
- OIDC provider for IRSA
- KMS encryption for secrets
- Cluster logging enabled
### IAM Module
Creates IAM roles for:
- Karpenter controller
- EBS CSI driver
- EFS CSI driver
- AWS Load Balancer Controller
- cert-manager
- External DNS
### Storage Module
Creates storage resources:
- EFS filesystem with encryption
- S3 bucket for backups (versioned, encrypted)
- S3 bucket for blob storage
- KMS key for encryption
### NLB Module
Creates a shared NLB with:
- HTTPS (443) for web traffic
- DNS (53 UDP/TCP) for PowerDNS
- SMTP (25), Submission (587), IMAPS (993) for email
- Cross-zone load balancing
- Target groups for each service
## Operations
### Scaling
Karpenter automatically scales nodes based on pending pods. No manual intervention required.
To adjust limits:
```bash
kubectl edit nodepool default
```
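For scripted changes, a non-interactive patch may be easier to review than `kubectl edit`. The sketch below assumes the NodePool is named `default` (as in the command above) and that the limit being changed is `spec.limits.cpu`; the value `100` is only an example, and `set_nodepool_cpu_limit` is a hypothetical helper.

```bash
# Set the CPU limit on a Karpenter NodePool with a JSON merge patch.
# KUBECTL can be overridden (e.g. KUBECTL=echo) to preview the command without a cluster.
set_nodepool_cpu_limit() {
  pool="$1"
  cpu="$2"
  "${KUBECTL:-kubectl}" patch nodepool "$pool" --type merge \
    -p "{\"spec\":{\"limits\":{\"cpu\":\"$cpu\"}}}"
}

# Preview only: the subshell keeps KUBECTL=echo from leaking into the session.
(KUBECTL=echo; set_nodepool_cpu_limit default 100)
```

Running `set_nodepool_cpu_limit default 100` without the override sends the patch to the cluster in the current kubectl context.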
### Monitoring
Check Karpenter status:
```bash
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter -f
```
Check node status:
```bash
kubectl get nodes -L karpenter.sh/capacity-type,node.kubernetes.io/instance-type
```
### Troubleshooting
View Karpenter events:
```bash
kubectl get events -n karpenter --sort-by=.lastTimestamp
```
Check pending pods:
```bash
kubectl get pods --all-namespaces --field-selector=status.phase=Pending
```
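When many pods are pending, grouping them by namespace can show which workload is blocked. The following sketch counts rows per namespace from `kubectl get pods --all-namespaces --no-headers` style output; `count_by_namespace` is a hypothetical helper, and the sample rows (including pod names) are made up for illustration.

```bash
# Count input rows per namespace; expects the namespace in the first column.
count_by_namespace() {
  awk '{ count[$1]++ } END { for (ns in count) print ns, count[ns] }' | sort
}

# Sample input standing in for real kubectl output (hypothetical pod names).
printf '%s\n' \
  'forgejo   forgejo-0       0/1  Pending  0  5m' \
  'forgejo   forgejo-runner  0/1  Pending  0  5m' \
  'ingress   nginx-abc       0/1  Pending  0  2m' \
  | count_by_namespace
# prints:
#   forgejo 2
#   ingress 1
```

Against a live cluster, pipe the pending-pods command above (with `--no-headers` added) into `count_by_namespace` instead of the sample rows.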
## Security
- All storage encrypted at rest (KMS)
- TLS required for all connections
- IMDSv2 required for all nodes
- VPC Flow Logs enabled
- Cluster audit logging enabled
- FIPS 140-2 mode for CockroachDB
## Disaster Recovery
### Backups
CockroachDB backups are stored in S3 with:
- Daily full backups
- 30-day retention in Standard
- 90-day transition to Glacier
- 365-day noncurrent version retention
### Recovery
To restore from backup:
```bash
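# Restore CockroachDB from S3 backup.
# RESTORE is a SQL statement, so run it through the SQL client;
# connection flags (host, certificates) depend on your deployment.
cockroach sql --execute="RESTORE ... FROM 's3://alignment-production-backups/...';"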
```
## References
- [RFC 0039: Foundation Infrastructure](../../../.repos/alignment-mcp/docs/rfcs/0039-foundation-infrastructure.md)
- [ADR 0003: CockroachDB Self-Hosted FIPS](../../../.repos/alignment-mcp/docs/adrs/0003-cockroachdb-self-hosted-fips.md)
- [ADR 0004: Set It and Forget It](../../../.repos/alignment-mcp/docs/adrs/0004-set-it-and-forget-it-architecture.md)
- [ADR 0005: Full-Stack Self-Hosting](../../../.repos/alignment-mcp/docs/adrs/0005-full-stack-self-hosting.md)
- [Karpenter Documentation](https://karpenter.sh/)
- [EKS Best Practices](https://aws.github.io/aws-eks-best-practices/)