diff --git a/docs/chaos-runbook.md b/docs/chaos-runbook.md
new file mode 100644
index 00000000..6f065192
--- /dev/null
+++ b/docs/chaos-runbook.md
@@ -0,0 +1,198 @@
+# samosaChaat — Chaos Testing Runbook
+
+This runbook covers failure scenarios for the live defense. Each scenario includes
+how to simulate it, how to detect it via Grafana/Loki, and recovery steps.
+
+## Prerequisites
+
+- kubectl configured for the target cluster
+- Grafana accessible at https://grafana.samosachaat.art
+- Loki datasource configured in Grafana
+
+---
+
+## Scenario 1: Pod Crash / Kill
+
+**Simulate:**
+```bash
+kubectl delete pod -l app.kubernetes.io/name=chat-api -n samosachaat-prod
+```
+
+**Detect (Grafana):**
+- Dashboard: Application Performance → look for a gap in request rate
+- Alert: container restart count spike
+- Panel query: `kube_pod_container_status_restarts_total{namespace="samosachaat-prod"}`
+
+**Detect (Loki):**
+```logql
+{namespace="samosachaat-prod"} | json | level="error"
+{namespace="samosachaat-prod",app="chat-api"} | json | message=~".*startup.*|.*shutdown.*"
+```
+
+**Recovery:**
+- Kubernetes auto-restarts the pod (restartPolicy: Always)
+- HPA scales up if the CPU threshold is exceeded during recovery
+- The PDB keeps the remaining pods running during the kill
+- No manual action needed unless crash-looping (check logs for the root cause)
+
+**Verify recovered:**
+```bash
+kubectl get pods -n samosachaat-prod -l app.kubernetes.io/name=chat-api
+curl -s https://samosachaat.art/api/health | jq .
+```
+
+---
+
+## Scenario 2: Node Failure
+
+**Simulate:**
+```bash
+# Get a node instance ID
+INSTANCE_ID=$(kubectl get nodes -o jsonpath='{.items[0].spec.providerID}' | cut -d/ -f5)
+aws ec2 terminate-instances --instance-ids $INSTANCE_ID
+```
+
+**Detect (Grafana):**
+- Dashboard: Node Health → node disappears from CPU/Memory panels
+- Alert: HighCPU or HighMemory may fire on remaining nodes as pods redistribute
+- Panel query: `kube_node_status_condition{condition="Ready",status="true"}`
+
+**Detect (Loki):**
+```logql
+{namespace="kube-system"} | json | message=~".*NotReady.*|.*node.*removed.*"
+```
+
+**Recovery:**
+- The EKS auto-scaling group launches a replacement node (2-5 minutes)
+- Pods on the failed node are rescheduled to healthy nodes
+- PDBs prevent more than 1 pod per service from being unavailable
+- No manual action needed
+
+**Verify recovered:**
+```bash
+kubectl get nodes  # New node should appear with STATUS Ready
+kubectl get pods -n samosachaat-prod -o wide  # All pods Running
+```
+
+---
+
+## Scenario 3: Database Connection Pool Exhaustion
+
+**Simulate:**
+```bash
+# Run a load test that exceeds the connection pool limit
+kubectl run loadtest --image=busybox --restart=Never -n samosachaat-prod -- \
+  sh -c 'for i in $(seq 1 200); do wget -q -O- http://chat-api:8002/api/health & done; wait'
+```
+
+**Detect (Grafana):**
+- Dashboard: Application Performance → spike in p99 latency, increase in 5xx errors
+- Alert: High5xxRate fires
+
+**Detect (Loki):**
+```logql
+{app=~"auth|chat-api"} | json | message=~".*connection.*pool.*|.*timeout.*|.*asyncpg.*|.*QueuePool.*overflow.*"
+```
+
+**Recovery:**
+1. Identify which service is affected from the Loki logs
+2. Check the current connection count: `kubectl exec deploy/chat-api -n samosachaat-prod -- python -c "..."` (see the sketch after this list)
+3. Restart the affected pods: `kubectl rollout restart deploy/chat-api -n samosachaat-prod`
+4. If persistent: increase the pool size in the service config (`SQLALCHEMY_POOL_SIZE` env var) and redeploy
+5. Check RDS max_connections: `aws rds describe-db-parameters --db-parameter-group-name default.postgres15`
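+
+The step-2 connection check is elided above; a minimal sketch of what it could look like,
+assuming the pod has `DATABASE_URL` set and SQLAlchemy installed (the same assumptions
+`scripts/demo-schema-change.sh` makes):
+
+```bash
+kubectl exec deploy/chat-api -n samosachaat-prod -- python -c "
+import os
+from sqlalchemy import create_engine, text
+# Strip the async driver suffix so a plain sync engine works for a one-off query
+url = os.environ['DATABASE_URL'].replace('+asyncpg', '')
+with create_engine(url).connect() as conn:
+    n = conn.execute(text('SELECT count(*) FROM pg_stat_activity WHERE datname = current_database()')).scalar()
+    print(f'active connections: {n}')
+"
+```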
+
+---
+
+## Scenario 4: Inference Service OOM
+
+**Simulate:**
+```bash
+# Set a low memory limit and load a large model
+kubectl set resources deploy/inference -n samosachaat-prod --limits=memory=512Mi
+# Or trigger it by sending many concurrent requests
+```
+
+**Detect (Grafana):**
+- Dashboard: Inference Service → memory spike, then a sudden drop (OOM kill)
+- Dashboard: Node Health → memory spike on the node hosting inference
+- Alert: HighMemory fires
+
+**Detect (Loki):**
+```logql
+{app="inference"} | json | message=~".*OOMKilled.*|.*memory.*|.*killed.*"
+# Also check events:
+# kubectl get events -n samosachaat-prod --sort-by='.lastTimestamp' | grep -i oom
+```
+
+**Recovery:**
+1. The pod auto-restarts (but may crash-loop if the model is too large for the limit)
+2. Check what model is loaded: `curl http://inference:8003/stats`
+3. If the model is too large: swap to a smaller model via `POST /models/swap`
+4. If the limit is too low: increase the memory limit in values-prod.yaml and `helm upgrade`
+5. Restore the original limits: `kubectl set resources deploy/inference -n samosachaat-prod --limits=memory=8Gi`
+
+---
+
+## Scenario 5: High Latency / Degraded Performance
+
+**Simulate:**
+```bash
+# Flood inference with concurrent requests
+kubectl run loadtest --image=curlimages/curl --restart=Never -n samosachaat-prod -- \
+  sh -c 'for i in $(seq 1 50); do curl -s -X POST http://chat-api:8002/api/conversations/test/messages -H "Content-Type: application/json" -d "{\"content\":\"tell me a story\"}" & done; wait'
+```
+
+**Detect (Grafana):**
+- Dashboard: Application Performance → p99 latency > 5s
+- Dashboard: Inference Service → worker pool utilization at 100%, queue depth growing
+- Alert: HighP99Latency fires
+
+**Detect (Loki):**
+```logql
+{app="chat-api"} | json | inference_time_ms > 5000
+{app="inference"} | json | message=~".*queue.*full.*|.*timeout.*|.*worker.*busy.*"
+```
+
+**Recovery:**
+1. Check the inference worker pool: `curl http://inference:8003/stats`
+2. If all workers are busy: the HPA should scale inference pods (check HPA status)
+3. Manual scale: `kubectl scale deploy/inference -n samosachaat-prod --replicas=5`
+4. If it is a single-pod bottleneck: check whether the model is too large for CPU inference and consider GPU nodes
+5. Verify recovery: watch the latency dashboard return to normal
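+
+A quick command-line check that latency has recovered, without waiting on the dashboard.
+This probes the public health endpoint rather than the inference path, so treat it as a
+smoke check only:
+
+```bash
+# Worst observed response time over 20 sequential requests, in seconds
+for i in $(seq 1 20); do
+  curl -s -o /dev/null -w '%{time_total}\n' https://samosachaat.art/api/health
+done | sort -n | tail -n 1
+```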
+
+---
+
+## Scenario 6: SSL Certificate Issues
+
+**Detect:**
+- Users report "connection not secure" errors
+- `curl -vI https://samosachaat.art 2>&1 | grep -i "expire\|ssl\|certificate"`
+
+**Recovery:**
+- ACM certificates auto-renew — check the ACM console for renewal status
+- If DNS validation failed: check Route53 CNAME records match ACM requirements
+- `terraform apply` to reconcile if records drifted
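+
+A manual certificate check from the CLI. The openssl probe inspects what is actually being
+served; the ACM lookup is a sketch and assumes the certificate was issued for
+`samosachaat.art` in the CLI's default region:
+
+```bash
+# Expiry dates and issuer of the certificate currently being served
+echo | openssl s_client -connect samosachaat.art:443 -servername samosachaat.art 2>/dev/null \
+  | openssl x509 -noout -dates -issuer
+
+# ACM status and renewal state for the matching certificate
+CERT_ARN=$(aws acm list-certificates \
+  --query "CertificateSummaryList[?DomainName=='samosachaat.art'].CertificateArn" --output text)
+aws acm describe-certificate --certificate-arn "$CERT_ARN" \
+  --query 'Certificate.{Status:Status,Renewal:RenewalSummary.RenewalStatus}'
+```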
+
+---
+
+## Quick Reference: Diagnostic Loki Queries
+
+```logql
+# All errors across all services
+{namespace="samosachaat-prod"} | json | level="error" | line_format "{{.service}}: {{.message}}"
+
+# Trace a request across services (paste the trace ID between the quotes)
+{namespace="samosachaat-prod"} | json | trace_id=""
+
+# Auth failures
+{app="auth"} | json | level="error" | message=~".*oauth.*|.*jwt.*|.*unauthorized.*"
+
+# Inference issues
+{app="inference"} | json | message=~".*error.*|.*timeout.*|.*OOM.*|.*worker.*"
+
+# Slow database queries
+{app=~"auth|chat-api"} | json | message=~".*slow.*query.*|.*timeout.*"
+
+# Recent pod restarts
+{namespace="samosachaat-prod"} | json | message=~".*started.*|.*shutdown.*|.*ready.*"
+```
diff --git a/scripts/demo-schema-change.sh b/scripts/demo-schema-change.sh
new file mode 100755
index 00000000..95732d92
--- /dev/null
+++ b/scripts/demo-schema-change.sh
@@ -0,0 +1,52 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+# Day 2 Demo: Apply schema change (migration 004 — add is_favorited) with zero downtime.
+# Usage: ./scripts/demo-schema-change.sh <namespace>
+# Example: ./scripts/demo-schema-change.sh samosachaat-prod
+
+NAMESPACE="${1:?Usage: demo-schema-change.sh <namespace>}"
+
+echo "=== samosaChaat Day 2: Schema Change Demo ==="
+
+echo ""
+echo "Step 1: Current migration state"
+kubectl exec -n "$NAMESPACE" deploy/chat-api -- alembic current 2>/dev/null || \
+  echo "(Could not connect — ensure chat-api pod is running)"
+
+echo ""
+echo "Step 2: Show the migration file"
+echo "File: db/migrations/versions/004_add_favorited.py"
+echo "Operation: ALTER TABLE conversations ADD COLUMN is_favorited BOOLEAN DEFAULT false NOT NULL"
+echo ""
+echo "Key points:"
+echo "  - ADD COLUMN with DEFAULT is non-blocking in PostgreSQL 11+"
+echo "  - No table lock, no downtime, existing rows get the default value instantly"
+echo "  - Old pods (without the code change) simply ignore the new column"
+echo "  - New pods (with the updated SQLAlchemy model) can use it immediately"
+
+echo ""
+echo "Step 3: Apply migration via Helm upgrade"
+echo "The db-migrate-job.yaml Helm hook runs 'alembic upgrade head' before new pods start."
+echo ""
+echo "Running: helm upgrade samosachaat helm/samosachaat -n $NAMESPACE --reuse-values"
+helm upgrade samosachaat helm/samosachaat -n "$NAMESPACE" --reuse-values
+
+echo ""
+echo "Step 4: Verify migration applied"
+kubectl exec -n "$NAMESPACE" deploy/chat-api -- alembic current
+
+echo ""
+echo "Step 5: Verify the column exists in the database"
+kubectl exec -n "$NAMESPACE" deploy/chat-api -- python -c "
+from sqlalchemy import inspect, create_engine
+import os
+url = os.environ.get('DATABASE_URL', '').replace('+asyncpg', '')
+if not url:
+    print('DATABASE_URL not set')
+    raise SystemExit(1)
+engine = create_engine(url)
+cols = [c['name'] for c in inspect(engine).get_columns('conversations')]
+print(f'Columns: {cols}')
+assert 'is_favorited' in cols, 'FAIL: is_favorited not found!'
+print('SUCCESS: is_favorited column present and migration is complete.')
+"
diff --git a/scripts/rotate-nodes.sh b/scripts/rotate-nodes.sh
new file mode 100755
index 00000000..57792953
--- /dev/null
+++ b/scripts/rotate-nodes.sh
@@ -0,0 +1,43 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+# Rotate the EKS managed node group to the latest AMI with zero downtime.
+# Usage: ./scripts/rotate-nodes.sh <environment>
+# Example: ./scripts/rotate-nodes.sh dev
+
+ENVIRONMENT="${1:?Usage: rotate-nodes.sh <environment> (dev|uat|prod)}"
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+TF_DIR="$SCRIPT_DIR/../terraform/environments/$ENVIRONMENT"
+
+echo "=== samosaChaat Node Rotation — $ENVIRONMENT ==="
+
+echo ""
+echo "Step 1: Check current AMI vs latest available"
+cd "$TF_DIR"
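+
+# Hedged sketch of the step-1 comparison. Assumptions: kubectl already points at this
+# cluster, the node group runs the AL2 EKS-optimized AMI, and Kubernetes 1.29; adjust
+# K8S_VERSION and the SSM parameter path to match the cluster.
+K8S_VERSION="1.29"
+LATEST_AMI=$(aws ssm get-parameter \
+  --name "/aws/service/eks/optimized-ami/${K8S_VERSION}/amazon-linux-2/recommended/image_id" \
+  --query 'Parameter.Value' --output text)
+NODE_INSTANCE_ID=$(kubectl get nodes -o jsonpath='{.items[0].spec.providerID}' | cut -d/ -f5)
+CURRENT_AMI=$(aws ec2 describe-instances --instance-ids "$NODE_INSTANCE_ID" \
+  --query 'Reservations[0].Instances[0].ImageId' --output text)
+echo "Current node AMI: $CURRENT_AMI"
+echo "Latest EKS AMI:   $LATEST_AMI"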
+
+echo ""
+echo "Step 2: Apply Terraform to update the launch template with the latest AMI"
+echo "This triggers an EKS managed node group rolling update."
+echo "EKS will:"
+echo "  1. Launch new nodes with the patched AMI"
+echo "  2. Cordon old nodes (stop scheduling new pods)"
+echo "  3. Drain pods from old nodes (respecting PodDisruptionBudgets)"
+echo "  4. Terminate old nodes"
+echo ""
+echo "PDBs ensure minAvailable: 1 for each service = zero downtime."
+echo ""
+read -p "Proceed with terraform apply? [y/N] " -n 1 -r
+echo ""
+if [[ $REPLY =~ ^[Yy]$ ]]; then
+  terraform apply -auto-approve
+else
+  echo "Aborted."
+  exit 0
+fi
+
+echo ""
+echo "Step 3: Monitor node rotation"
+CLUSTER_NAME=$(terraform output -raw eks_cluster_name 2>/dev/null || echo "samosachaat-$ENVIRONMENT")
+aws eks update-kubeconfig --name "$CLUSTER_NAME" --region us-west-2 2>/dev/null || true
+echo "Watching nodes (Ctrl+C to stop):"
+kubectl get nodes -w
diff --git a/scripts/verify-deployment.sh b/scripts/verify-deployment.sh
new file mode 100755
index 00000000..17f97189
--- /dev/null
+++ b/scripts/verify-deployment.sh
@@ -0,0 +1,52 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+# Verify all samosaChaat services are healthy after deployment.
+# Usage: ./scripts/verify-deployment.sh <namespace>
+
+NAMESPACE="${1:?Usage: verify-deployment.sh <namespace>}"
+
+echo "=== samosaChaat Deployment Verification — $NAMESPACE ==="
+
+PASS=0
+FAIL=0
+
+check() {
+  local name="$1" cmd="$2"
+  if eval "$cmd" > /dev/null 2>&1; then
+    echo "  ✓ $name"
+    PASS=$((PASS + 1))  # not ((PASS++)): that returns non-zero when PASS is 0 and would trip set -e
+  else
+    echo "  ✗ $name"
+    FAIL=$((FAIL + 1))
+  fi
+}
+
+echo ""
+echo "Pods:"
+kubectl get pods -n "$NAMESPACE" --no-headers | while read -r line; do
+  echo "  $line"
+done
+
+echo ""
+echo "Health checks:"
+check "Frontend" "kubectl exec -n $NAMESPACE deploy/frontend -- wget -qO- http://localhost:3000/api/health"
+check "Auth" "kubectl exec -n $NAMESPACE deploy/auth -- python -c \"import urllib.request; urllib.request.urlopen('http://localhost:8001/auth/health')\""
+check "Chat API" "kubectl exec -n $NAMESPACE deploy/chat-api -- python -c \"import urllib.request; urllib.request.urlopen('http://localhost:8002/api/health')\""
+check "Inference" "kubectl exec -n $NAMESPACE deploy/inference -- python -c \"import urllib.request; urllib.request.urlopen('http://localhost:8003/health')\""
+
+echo ""
+echo "Deployments:"
+check "Frontend available" "kubectl rollout status deploy/frontend -n $NAMESPACE --timeout=10s"
+check "Auth available" "kubectl rollout status deploy/auth -n $NAMESPACE --timeout=10s"
+check "Chat API available" "kubectl rollout status deploy/chat-api -n $NAMESPACE --timeout=10s"
+check "Inference available" "kubectl rollout status deploy/inference -n $NAMESPACE --timeout=10s"
+
+echo ""
+echo "PDBs:"
+kubectl get pdb -n "$NAMESPACE" --no-headers 2>/dev/null | while read -r line; do
+  echo "  $line"
+done
+
+echo ""
+echo "Result: $PASS passed, $FAIL failed"
+[ "$FAIL" -eq 0 ] && echo "All checks passed!" || { echo "SOME CHECKS FAILED"; exit 1; }
diff --git a/terraform/modules/eks/main.tf b/terraform/modules/eks/main.tf
index 796f198b..997fc65b 100644
--- a/terraform/modules/eks/main.tf
+++ b/terraform/modules/eks/main.tf
@@ -50,6 +50,10 @@ module "eks" {
     instance_types = [var.node_instance_type]
     capacity_type  = "ON_DEMAND"
 
+    update_config = {
+      max_unavailable_percentage = var.node_max_unavailable_percentage
+    }
+
     labels = {
       role = "general"
     }
diff --git a/terraform/modules/eks/outputs.tf b/terraform/modules/eks/outputs.tf
index c6f044f2..a062d4c5 100644
--- a/terraform/modules/eks/outputs.tf
+++ b/terraform/modules/eks/outputs.tf
@@ -32,3 +32,8 @@ output "oidc_provider_url" {
   description = "IRSA OIDC issuer URL (without https://)."
   value       = module.eks.oidc_provider
 }
+
+output "current_node_ami_id" {
+  description = "The current EKS-optimized AMI ID used by the node group."
+  value       = data.aws_ssm_parameter.eks_ami_id.value
+}
diff --git a/terraform/modules/eks/variables.tf b/terraform/modules/eks/variables.tf
index 0649befc..99f614ff 100644
--- a/terraform/modules/eks/variables.tf
+++ b/terraform/modules/eks/variables.tf
@@ -43,6 +43,12 @@ variable "node_desired_size" {
   default     = 2
 }
 
+variable "node_max_unavailable_percentage" {
+  description = "Max percentage of nodes unavailable during a rolling update."
+  type        = number
+  default     = 33
+}
+
 variable "tags" {
   description = "Tags applied to every resource."
   type        = map(string)
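Once the Terraform change above is applied, the node group's rolling-update setting can be
checked directly from the CLI. A sketch, assuming the `samosachaat-<env>` cluster-name
convention used in `scripts/rotate-nodes.sh`; the node group name is an assumption, so list
the node groups first:

```bash
aws eks list-nodegroups --cluster-name samosachaat-dev
aws eks describe-nodegroup --cluster-name samosachaat-dev --nodegroup-name general \
  --query 'nodegroup.updateConfig'
# Expect maxUnavailablePercentage: 33 once node_max_unavailable_percentage is applied
```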