nanochat/docs/chaos-runbook.md
Manmohan Sharma 0b8f9f0a5f
feat(ops): Day 2 operations automation and chaos runbook (#10)
Adds tooling and documentation for Day 2 cluster operations:

- scripts/rotate-nodes.sh: interactive node-rotation driver that applies
  terraform to pick up the latest SSM-resolved EKS AMI and watches the
  rolling replacement.
- scripts/demo-schema-change.sh: end-to-end demo of the zero-downtime
  is_favorited column migration via helm upgrade + migration hook.
- scripts/verify-deployment.sh: post-deploy health check across pods,
  per-service HTTP health endpoints, rollout status, and PDBs.
- docs/chaos-runbook.md: failure-mode playbook with simulate / Grafana /
  Loki / recovery steps for six scenarios (pod kill, node failure, DB
  pool exhaustion, inference OOM, high latency, SSL issues) plus a
  Loki quick-reference.
- terraform/modules/eks: expose current_node_ami_id output, add
  update_config.max_unavailable_percentage (configurable, default 33)
  so node-group rolls are controlled.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-16 12:25:47 -07:00


samosaChaat — Chaos Testing Runbook

This runbook covers failure scenarios for live defense. Each scenario includes how to simulate it, how to detect it via Grafana/Loki, and recovery steps.

Prerequisites

  • kubectl configured against the production cluster (namespace samosachaat-prod)
  • AWS CLI credentials with EC2 and RDS read access (used in Scenarios 2 and 3)
  • Access to the Grafana dashboards and the Loki Explore view
  • curl and jq on the operator workstation

Scenario 1: Pod Crash / Kill

Simulate:

kubectl delete pod -l app.kubernetes.io/name=chat-api -n samosachaat-prod

Detect (Grafana):

  • Dashboard: Application Performance → look for gap in request rate
  • Alert: container restart count spike
  • Panel query: kube_pod_container_status_restarts_total{namespace="samosachaat-prod"}

Detect (Loki):

{namespace="samosachaat-prod"} | json | level="error"
{namespace="samosachaat-prod",app="chat-api"} | json | message=~".*startup.*|.*shutdown.*"

Recovery:

  • Kubernetes restarts the pod automatically (restartPolicy: Always)
  • HPA scales up if the CPU threshold is exceeded during recovery
  • The PDB keeps the remaining pods running during the kill
  • No manual action needed unless the pod is crash-looping (check logs for the root cause)
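
If the pod is crash-looping, the fastest triage is the previous container's logs plus the restart counter. A minimal sketch (the label selector and namespace match the commands above; the restart-count threshold of 3 is an arbitrary assumption):

```shell
NAMESPACE=samosachaat-prod
SELECTOR=app.kubernetes.io/name=chat-api
THRESHOLD=3

# Logs from the container instance that crashed, not the current one
kubectl logs -n "$NAMESPACE" -l "$SELECTOR" --previous --tail=50

# Flag pods whose restart count (column 4 of `kubectl get pods`) exceeds the threshold
kubectl get pods -n "$NAMESPACE" -l "$SELECTOR" --no-headers \
  | awk -v t="$THRESHOLD" '$4 > t {print $1 " is crash-looping (" $4 " restarts)"}'
```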

Verify recovered:

kubectl get pods -n samosachaat-prod -l app.kubernetes.io/name=chat-api
curl -s https://samosachaat.art/api/health | jq .

Scenario 2: Node Failure

Simulate:

# Get a node instance ID
INSTANCE_ID=$(kubectl get nodes -o jsonpath='{.items[0].spec.providerID}' | cut -d/ -f5)
aws ec2 terminate-instances --instance-ids $INSTANCE_ID

Detect (Grafana):

  • Dashboard: Node Health → node disappears from CPU/Memory panels
  • Alert: HighCPU or HighMemory may fire on remaining nodes as pods redistribute
  • Panel query: kube_node_status_condition{condition="Ready",status="true"}

Detect (Loki):

{namespace="kube-system"} | json | message=~".*NotReady.*|.*node.*removed.*"

Recovery:

  • The node group's Auto Scaling group launches a replacement node (2-5 minutes)
  • Pods from the failed node are rescheduled onto healthy nodes
  • PDBs prevent more than one pod per service from being unavailable
  • No manual action needed
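
The replacement can be watched with a small loop instead of polling by hand. A sketch, assuming you know the node group's desired size (EXPECTED below is a placeholder, not read from the ASG):

```shell
EXPECTED=3   # desired node count for the node group (assumption)

# Count Ready nodes (column 2 of `kubectl get nodes`) until the replacement has joined
until [ "$(kubectl get nodes --no-headers | awk '$2 == "Ready"' | wc -l)" -ge "$EXPECTED" ]; do
  echo "waiting for replacement node..."
  sleep 15
done
echo "node count restored"
```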

Verify recovered:

kubectl get nodes    # New node should appear with STATUS Ready
kubectl get pods -n samosachaat-prod -o wide   # All pods Running

Scenario 3: Database Connection Pool Exhaustion

Simulate:

# Run a load test that exceeds the connection pool limit
kubectl run loadtest --image=busybox --restart=Never -n samosachaat-prod -- \
  sh -c 'for i in $(seq 1 200); do wget -q -O- http://chat-api:8002/api/health & done; wait'

Detect (Grafana):

  • Dashboard: Application Performance → spike in p99 latency, increase in 5xx errors
  • Alert: High5xxRate fires

Detect (Loki):

{app=~"auth|chat-api"} | json | message=~".*connection.*pool.*|.*timeout.*|.*asyncpg.*|.*QueuePool.*overflow.*"

Recovery:

  1. Identify the affected service from the Loki logs
  2. Check the current connection count: kubectl exec deploy/chat-api -n samosachaat-prod -- python -c "..."
  3. Restart the affected pods: kubectl rollout restart deploy/chat-api -n samosachaat-prod
  4. If the problem persists: increase the pool size in the service config (SQLALCHEMY_POOL_SIZE env var) and redeploy
  5. Check RDS max_connections: aws rds describe-db-parameters --db-parameter-group-name default.postgres15
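
Steps 4 and 5 interact: the pool settings across all replicas must fit under the RDS max_connections budget, or raising the pool size just moves the failure to the database. A back-of-envelope check (all numbers are illustrative assumptions; substitute your actual replica count, pool size, overflow, and RDS limit):

```shell
REPLICAS=3        # pods per service (assumption)
POOL_SIZE=20      # SQLALCHEMY_POOL_SIZE (assumption)
MAX_OVERFLOW=10   # SQLAlchemy max_overflow (assumption)
RDS_MAX=100       # RDS max_connections (assumption)

# Worst case: every replica fills its pool plus overflow simultaneously
PEAK=$(( REPLICAS * (POOL_SIZE + MAX_OVERFLOW) ))
if [ "$PEAK" -ge "$RDS_MAX" ]; then
  echo "over budget: peak $PEAK >= max_connections $RDS_MAX"
else
  echo "ok: peak $PEAK < max_connections $RDS_MAX"
fi
```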

Scenario 4: Inference Service OOM

Simulate:

# Set a low memory limit and load a large model
kubectl set resources deploy/inference -n samosachaat-prod --limits=memory=512Mi
# Or trigger by sending many concurrent requests

Detect (Grafana):

  • Dashboard: Inference Service → memory spike, then sudden drop (OOM kill)
  • Dashboard: Node Health → memory spike on the node hosting inference
  • Alert: HighMemory fires

Detect (Loki):

{app="inference"} | json | message=~".*OOMKilled.*|.*memory.*|.*killed.*"
# Also check events:
# kubectl get events -n samosachaat-prod --sort-by='.lastTimestamp' | grep -i oom

Recovery:

  1. The pod auto-restarts (but may crash-loop if the loaded model does not fit the limit)
  2. Check which model is loaded: curl http://inference:8003/stats
  3. If the model is too large: swap to a smaller model via POST /models/swap
  4. If the limit is too low: increase the memory limit in values-prod.yaml and helm upgrade
  5. Restore the original limits: kubectl set resources deploy/inference -n samosachaat-prod --limits=memory=8Gi

Scenario 5: High Latency / Degraded Performance

Simulate:

# Flood inference with concurrent requests
kubectl run loadtest --image=curlimages/curl --restart=Never -n samosachaat-prod -- \
  sh -c 'for i in $(seq 1 50); do curl -s -X POST http://chat-api:8002/api/conversations/test/messages -H "Content-Type: application/json" -d "{\"content\":\"tell me a story\"}" & done; wait'

Detect (Grafana):

  • Dashboard: Application Performance → p99 latency > 5s
  • Dashboard: Inference Service → worker pool utilization at 100%, queue depth growing
  • Alert: HighP99Latency fires

Detect (Loki):

{app="chat-api"} | json | inference_time_ms > 5000
{app="inference"} | json | message=~".*queue.*full.*|.*timeout.*|.*worker.*busy.*"

Recovery:

  1. Check inference worker pool: curl http://inference:8003/stats
  2. If all workers busy: HPA should scale inference pods (check HPA status)
  3. Manual scale: kubectl scale deploy/inference -n samosachaat-prod --replicas=5
  4. If single-pod bottleneck: check if model is too large for CPU inference, consider GPU nodes
  5. Verify recovery: watch latency dashboard return to normal
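
A note on what the HighP99Latency alert actually measures: p99 is the smallest sample that at least 99% of requests were faster than or equal to, so a handful of slow outliers dominates p99 while barely moving the average. A nearest-rank illustration over ten sample latencies (values in seconds, chosen arbitrarily):

```shell
latencies="0.2 0.3 0.25 6.1 0.4 0.35 0.3 0.28 0.22 0.31"
echo $latencies | tr ' ' '\n' | sort -n | awk '
  {a[NR] = $1}
  END {
    idx = NR * 0.99
    idx = (idx == int(idx)) ? idx : int(idx) + 1   # ceil -> nearest rank
    print "p99 = " a[idx] "s over " NR " samples"
  }'
```

Here the single 6.1s outlier sets p99 even though the mean stays under one second, which is why the dashboard reacts sharply while throughput looks normal.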

Scenario 6: SSL Certificate Issues

Detect:

  • Users report "connection not secure" errors
  • curl -vI https://samosachaat.art 2>&1 | grep -i "expire\|ssl\|certificate"

Recovery:

  • ACM certificates auto-renew; check the ACM console for renewal status
  • If DNS validation failed: confirm the Route53 CNAME records match the ACM validation records
  • Run terraform apply to reconcile if the records have drifted
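
The curl check above greps for keywords; to read the actual validity window, inspect the certificate the site is serving. A sketch (requires network reach to the domain; the 30-day window is an arbitrary assumption):

```shell
# Print the notBefore/notAfter dates of the certificate currently served
echo | openssl s_client -servername samosachaat.art -connect samosachaat.art:443 2>/dev/null \
  | openssl x509 -noout -dates

# Exit non-zero if the cert expires within 30 days (2592000 seconds)
echo | openssl s_client -servername samosachaat.art -connect samosachaat.art:443 2>/dev/null \
  | openssl x509 -noout -checkend 2592000 \
  && echo "cert valid for at least 30 more days" \
  || echo "cert expires within 30 days - investigate ACM renewal"
```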

Quick Reference: Diagnostic Loki Queries

# All errors across all services
{namespace="samosachaat-prod"} | json | level="error" | line_format "{{.service}}: {{.message}}"

# Trace a request across services
{namespace="samosachaat-prod"} | json | trace_id="<TRACE_ID>"

# Auth failures
{app="auth"} | json | level="error" | message=~".*oauth.*|.*jwt.*|.*unauthorized.*"

# Inference issues
{app="inference"} | json | message=~".*error.*|.*timeout.*|.*OOM.*|.*worker.*"

# Slow database queries
{app=~"auth|chat-api"} | json | message=~".*slow.*query.*|.*timeout.*"

# Recent pod restarts
{namespace="samosachaat-prod"} | json | message=~".*started.*|.*shutdown.*|.*ready.*"