Merge pull request #19 from manmohan659/feat/day2-operations
feat(ops): Day 2 operations automation and chaos readiness (#10)
Commit: 8a113d4757

docs/chaos-runbook.md (new file, 198 lines)
@@ -0,0 +1,198 @@
# samosaChaat — Chaos Testing Runbook

This runbook covers failure scenarios for live defense. Each scenario includes how to simulate it, how to detect it via Grafana/Loki, and recovery steps.

## Prerequisites

- kubectl configured for the target cluster
- Grafana accessible at https://grafana.samosachaat.art
- Loki datasource configured in Grafana

---
## Scenario 1: Pod Crash / Kill

**Simulate:**
```bash
kubectl delete pod -l app.kubernetes.io/name=chat-api -n samosachaat-prod
```

**Detect (Grafana):**
- Dashboard: Application Performance → look for a gap in the request rate
- Alert: container restart count spike
- Panel query: `kube_pod_container_status_restarts_total{namespace="samosachaat-prod"}`

**Detect (Loki):**
```logql
{namespace="samosachaat-prod"} | json | level="error"
{namespace="samosachaat-prod",app="chat-api"} | json | message=~".*startup.*|.*shutdown.*"
```

**Recovery:**
- Kubernetes auto-restarts the pod (restartPolicy: Always)
- HPA scales up if the CPU threshold is exceeded during recovery
- The PDB keeps the other pods running during the kill
- No manual action is needed unless the pod is crash-looping (check the logs for the root cause)

**Verify recovered:**
```bash
kubectl get pods -n samosachaat-prod -l app.kubernetes.io/name=chat-api
curl -s https://samosachaat.art/api/health | jq .
```
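For a single-command view of the replacement, a minimal sketch reusing the label selector from the simulate step:

```bash
# Print each pod's name, phase, and restart count. A deleted pod is replaced
# by a fresh one (so RESTARTS may read 0); the restart-count alert above
# fires when a container crashes and restarts in place.
kubectl get pods -n samosachaat-prod -l app.kubernetes.io/name=chat-api \
  -o custom-columns='NAME:.metadata.name,PHASE:.status.phase,RESTARTS:.status.containerStatuses[0].restartCount'
```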
---

## Scenario 2: Node Failure

**Simulate:**
```bash
# Get a node instance ID
INSTANCE_ID=$(kubectl get nodes -o jsonpath='{.items[0].spec.providerID}' | cut -d/ -f5)
aws ec2 terminate-instances --instance-ids $INSTANCE_ID
```

**Detect (Grafana):**
- Dashboard: Node Health → node disappears from CPU/Memory panels
- Alert: HighCPU or HighMemory may fire on remaining nodes as pods redistribute
- Panel query: `kube_node_status_condition{condition="Ready",status="true"}`

**Detect (Loki):**
```logql
{namespace="kube-system"} | json | message=~".*NotReady.*|.*node.*removed.*"
```

**Recovery:**
- EKS auto-scaling group launches a replacement node (2-5 minutes)
- Pods on the failed node are rescheduled to healthy nodes
- PDBs prevent more than 1 pod per service from being unavailable
- No manual action needed

**Verify recovered:**
```bash
kubectl get nodes  # New node should appear with STATUS Ready
kubectl get pods -n samosachaat-prod -o wide  # All pods Running
```
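To wait for the replacement without eyeballing the output, a small polling loop (assumes the node group's desired size is 2, matching the Terraform default in this PR):

```bash
# Block until at least 2 nodes report Ready, then show the final state
until [ "$(kubectl get nodes --no-headers 2>/dev/null | grep -cw Ready)" -ge 2 ]; do
  echo "waiting for replacement node..."
  sleep 15
done
kubectl get nodes
```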
---

## Scenario 3: Database Connection Pool Exhaustion

**Simulate:**
```bash
# Run a load test that exceeds the connection pool limit
kubectl run loadtest --image=busybox --restart=Never -n samosachaat-prod -- \
  sh -c 'for i in $(seq 1 200); do wget -q -O- http://chat-api:8002/api/health & done; wait'
```

**Detect (Grafana):**
- Dashboard: Application Performance → spike in p99 latency, increase in 5xx errors
- Alert: High5xxRate fires

**Detect (Loki):**
```logql
{app=~"auth|chat-api"} | json | message=~".*connection.*pool.*|.*timeout.*|.*asyncpg.*|.*QueuePool.*overflow.*"
```

**Recovery:**
1. Identify which service is affected from the Loki logs
2. Check the current connection count: `kubectl exec deploy/chat-api -n samosachaat-prod -- python -c "..."` (one possible version of this check is sketched below)
3. Restart the affected pods: `kubectl rollout restart deploy/chat-api -n samosachaat-prod`
4. If persistent: increase the pool size in the service config (`SQLALCHEMY_POOL_SIZE` env var) and redeploy
5. Check RDS max_connections: `aws rds describe-db-parameters --db-parameter-group-name default.postgres15`
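A hedged sketch of the step-2 connection check, reusing the URL-normalization pattern from scripts/demo-schema-change.sh (assumes DATABASE_URL is set in the pod and uses the asyncpg-suffixed URL format that script expects):

```bash
kubectl exec -n samosachaat-prod deploy/chat-api -- python -c "
import os
from sqlalchemy import create_engine, text
# Same URL normalization the schema-change demo script uses
url = os.environ['DATABASE_URL'].replace('+asyncpg', '')
with create_engine(url).connect() as conn:
    n = conn.execute(text('SELECT count(*) FROM pg_stat_activity')).scalar()
print(f'active connections: {n}')
"
```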
---

## Scenario 4: Inference Service OOM

**Simulate:**
```bash
# Set a low memory limit and load a large model
kubectl set resources deploy/inference -n samosachaat-prod --limits=memory=512Mi
# Or trigger by sending many concurrent requests
```

**Detect (Grafana):**
- Dashboard: Inference Service → memory spike, then sudden drop (OOM kill)
- Dashboard: Node Health → memory spike on the node hosting inference
- Alert: HighMemory fires

**Detect (Loki):**
```logql
{app="inference"} | json | message=~".*OOMKilled.*|.*memory.*|.*killed.*"
# Also check events:
# kubectl get events -n samosachaat-prod --sort-by='.lastTimestamp' | grep -i oom
```

**Recovery:**
1. The pod auto-restarts (but may crash-loop if the model is too large for the limit); the status check below confirms the kill reason
2. Check what model is loaded: `curl http://inference:8003/stats`
3. If the model is too large: swap to a smaller model via `POST /models/swap`
4. If the limit is too low: increase the memory limit in values-prod.yaml and `helm upgrade`
5. Restore the original limits: `kubectl set resources deploy/inference -n samosachaat-prod --limits=memory=8Gi`
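To confirm the OOM kill straight from the pod status (assuming the deployment carries the same app=inference label used in the Loki queries above):

```bash
# Empty output means the container has not been terminated since its last start
kubectl get pods -n samosachaat-prod -l app=inference \
  -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}{end}'
```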
---

## Scenario 5: High Latency / Degraded Performance

**Simulate:**
```bash
# Flood inference with concurrent requests
kubectl run loadtest --image=curlimages/curl --restart=Never -n samosachaat-prod -- \
  sh -c 'for i in $(seq 1 50); do curl -s -X POST http://chat-api:8002/api/conversations/test/messages -H "Content-Type: application/json" -d "{\"content\":\"tell me a story\"}" & done; wait'
```

**Detect (Grafana):**
- Dashboard: Application Performance → p99 latency > 5s
- Dashboard: Inference Service → worker pool utilization at 100%, queue depth growing
- Alert: HighP99Latency fires

**Detect (Loki):**
```logql
{app="chat-api"} | json | inference_time_ms > 5000
{app="inference"} | json | message=~".*queue.*full.*|.*timeout.*|.*worker.*busy.*"
```

**Recovery:**
1. Check the inference worker pool: `curl http://inference:8003/stats`
2. If all workers are busy: HPA should scale the inference pods (check HPA status with the commands below)
3. Manual scale: `kubectl scale deploy/inference -n samosachaat-prod --replicas=5`
4. If it is a single-pod bottleneck: check whether the model is too large for CPU inference; consider GPU nodes
5. Verify recovery: watch the latency dashboard return to normal
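Commands for the step-2 HPA check; the HPA name "inference" is an assumption, so read the real name from the list first:

```bash
# Current vs. target metrics, replica counts, and recent scaling events
kubectl get hpa -n samosachaat-prod
kubectl describe hpa inference -n samosachaat-prod
```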
---

## Scenario 6: SSL Certificate Issues

**Detect:**
- Users report "connection not secure" errors
- `curl -vI https://samosachaat.art 2>&1 | grep -i "expire\|ssl\|certificate"`

**Recovery:**
- ACM certificates auto-renew — check ACM console for renewal status
- If DNS validation failed: check Route53 CNAME records match ACM requirements
- `terraform apply` to reconcile if records drifted
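A quick expiry check from any machine with openssl installed:

```bash
# Show who issued the served certificate and when it expires
echo | openssl s_client -connect samosachaat.art:443 -servername samosachaat.art 2>/dev/null \
  | openssl x509 -noout -subject -issuer -enddate
```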
---

## Quick Reference: Diagnostic Loki Queries

```logql
# All errors across all services
{namespace="samosachaat-prod"} | json | level="error" | line_format "{{.service}}: {{.message}}"

# Trace a request across services
{namespace="samosachaat-prod"} | json | trace_id="<TRACE_ID>"

# Auth failures
{app="auth"} | json | level="error" | message=~".*oauth.*|.*jwt.*|.*unauthorized.*"

# Inference issues
{app="inference"} | json | message=~".*error.*|.*timeout.*|.*OOM.*|.*worker.*"

# Slow database queries
{app=~"auth|chat-api"} | json | message=~".*slow.*query.*|.*timeout.*"

# Recent pod restarts
{namespace="samosachaat-prod"} | json | message=~".*started.*|.*shutdown.*|.*ready.*"
```
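These also run from a terminal via logcli; the Loki endpoint below is an assumption, so point LOKI_ADDR at the real gateway:

```bash
export LOKI_ADDR=https://loki.samosachaat.art   # assumed endpoint
logcli query --since=1h '{namespace="samosachaat-prod"} | json | level="error"'
```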
scripts/demo-schema-change.sh (new executable file, 52 lines)
@@ -0,0 +1,52 @@
#!/usr/bin/env bash
set -euo pipefail

# Day 2 Demo: Apply schema change (migration 004 — add is_favorited) with zero downtime.
# Usage: ./scripts/demo-schema-change.sh <namespace>
# Example: ./scripts/demo-schema-change.sh samosachaat-prod

NAMESPACE="${1:?Usage: demo-schema-change.sh <namespace>}"

echo "=== samosaChaat Day 2: Schema Change Demo ==="

echo ""
echo "Step 1: Current migration state"
kubectl exec -n "$NAMESPACE" deploy/chat-api -- alembic current 2>/dev/null || \
  echo "(Could not connect — ensure chat-api pod is running)"

echo ""
echo "Step 2: Show the migration file"
echo "File: db/migrations/versions/004_add_favorited.py"
echo "Operation: ALTER TABLE conversations ADD COLUMN is_favorited BOOLEAN DEFAULT false NOT NULL"
echo ""
echo "Key points:"
echo "  - ADD COLUMN with DEFAULT is non-blocking in PostgreSQL 11+"
echo "  - No table lock, no downtime, existing rows get the default value instantly"
echo "  - Old pods (without the code change) simply ignore the new column"
echo "  - New pods (with the updated SQLAlchemy model) can use it immediately"

echo ""
echo "Step 3: Apply migration via Helm upgrade"
echo "The db-migrate-job.yaml Helm hook runs 'alembic upgrade head' before new pods start."
echo ""
echo "Running: helm upgrade samosachaat helm/samosachaat -n $NAMESPACE --reuse-values"
helm upgrade samosachaat helm/samosachaat -n "$NAMESPACE" --reuse-values

echo ""
echo "Step 4: Verify migration applied"
kubectl exec -n "$NAMESPACE" deploy/chat-api -- alembic current

echo ""
echo "Step 5: Verify column exists in database"
kubectl exec -n "$NAMESPACE" deploy/chat-api -- python -c "
from sqlalchemy import inspect, create_engine
import os
url = os.environ.get('DATABASE_URL', '').replace('+asyncpg', '')
if not url:
    print('DATABASE_URL not set')
    exit(1)
engine = create_engine(url)
cols = [c['name'] for c in inspect(engine).get_columns('conversations')]
print(f'Columns: {cols}')
assert 'is_favorited' in cols, 'FAIL: is_favorited not found!'
print('SUCCESS: is_favorited column present and migration is complete.')
"
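A companion check for the demo: running this loop in a second terminal while Step 3 applies shows the API keeps serving through the migration:

```bash
# Prints one status code per second; expect an unbroken run of 200s
while true; do
  curl -s -o /dev/null -w '%{http_code}\n' https://samosachaat.art/api/health
  sleep 1
done
```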
scripts/rotate-nodes.sh (new executable file, 43 lines)
@@ -0,0 +1,43 @@
#!/usr/bin/env bash
set -euo pipefail

# Rotate EKS managed node group to latest AMI with zero downtime.
# Usage: ./scripts/rotate-nodes.sh <environment>
# Example: ./scripts/rotate-nodes.sh dev

ENVIRONMENT="${1:?Usage: rotate-nodes.sh <environment> (dev|uat|prod)}"
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
TF_DIR="$SCRIPT_DIR/../terraform/environments/$ENVIRONMENT"

echo "=== samosaChaat Node Rotation — $ENVIRONMENT ==="

echo ""
echo "Step 1: Check current AMI vs latest available"
cd "$TF_DIR"
# Uses the current_node_ami_id output defined in this environment's Terraform
terraform output -raw current_node_ami_id 2>/dev/null || \
  echo "(no state output yet — run terraform init/apply first)"

echo ""
echo "Step 2: Apply Terraform to update launch template with latest AMI"
echo "This triggers EKS managed node group rolling update."
echo "EKS will:"
echo "  1. Launch new nodes with patched AMI"
echo "  2. Cordon old nodes (stop scheduling new pods)"
echo "  3. Drain pods from old nodes (respecting PodDisruptionBudgets)"
echo "  4. Terminate old nodes"
echo ""
echo "PDBs ensure minAvailable: 1 for each service = zero downtime."
echo ""
read -p "Proceed with terraform apply? [y/N] " -n 1 -r
echo ""
if [[ $REPLY =~ ^[Yy]$ ]]; then
  terraform apply -auto-approve
else
  echo "Aborted."
  exit 0
fi

echo ""
echo "Step 3: Monitor node rotation"
CLUSTER_NAME=$(terraform output -raw eks_cluster_name 2>/dev/null || echo "samosachaat-$ENVIRONMENT")
aws eks update-kubeconfig --name "$CLUSTER_NAME" --region us-west-2 2>/dev/null || true
echo "Watching nodes (Ctrl+C to stop):"
kubectl get nodes -w
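While the rotation runs, the PodDisruptionBudgets can be watched from a second terminal to show that drains never violate minAvailable:

```bash
# ALLOWED DISRUPTIONS should stay >= 0 and no service should lose its last pod
kubectl get pdb -n samosachaat-prod -w
```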
scripts/verify-deployment.sh (new executable file, 52 lines)
@@ -0,0 +1,52 @@
#!/usr/bin/env bash
set -euo pipefail

# Verify all samosaChaat services are healthy after deployment.
# Usage: ./scripts/verify-deployment.sh <namespace>

NAMESPACE="${1:?Usage: verify-deployment.sh <namespace>}"

echo "=== samosaChaat Deployment Verification — $NAMESPACE ==="

PASS=0
FAIL=0

check() {
  local name="$1" cmd="$2"
  if eval "$cmd" > /dev/null 2>&1; then
    echo "  ✓ $name"
    PASS=$((PASS + 1))  # avoid ((PASS++)), which returns 1 when PASS is 0 and trips set -e
  else
    echo "  ✗ $name"
    FAIL=$((FAIL + 1))
  fi
}

echo ""
echo "Pods:"
kubectl get pods -n "$NAMESPACE" --no-headers | while read -r line; do
  echo "  $line"
done

echo "Health checks:"
check "Frontend" "kubectl exec -n $NAMESPACE deploy/frontend -- wget -qO- http://localhost:3000/api/health"
check "Auth" "kubectl exec -n $NAMESPACE deploy/auth -- python -c \"import urllib.request; urllib.request.urlopen('http://localhost:8001/auth/health')\""
check "Chat API" "kubectl exec -n $NAMESPACE deploy/chat-api -- python -c \"import urllib.request; urllib.request.urlopen('http://localhost:8002/api/health')\""
check "Inference" "kubectl exec -n $NAMESPACE deploy/inference -- python -c \"import urllib.request; urllib.request.urlopen('http://localhost:8003/health')\""

echo ""
echo "Deployments:"
check "Frontend available" "kubectl rollout status deploy/frontend -n $NAMESPACE --timeout=10s"
check "Auth available" "kubectl rollout status deploy/auth -n $NAMESPACE --timeout=10s"
check "Chat API available" "kubectl rollout status deploy/chat-api -n $NAMESPACE --timeout=10s"
check "Inference available" "kubectl rollout status deploy/inference -n $NAMESPACE --timeout=10s"

echo ""
echo "PDBs:"
kubectl get pdb -n "$NAMESPACE" --no-headers 2>/dev/null | while read -r line; do
  echo "  $line"
done

echo ""
echo "Result: $PASS passed, $FAIL failed"
[ "$FAIL" -eq 0 ] && echo "All checks passed!" || { echo "SOME CHECKS FAILED"; exit 1; }
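Typical invocation, plus a one-shot alternative built on kubectl wait (which covers deployment availability only, not the in-pod health endpoints):

```bash
./scripts/verify-deployment.sh samosachaat-prod
kubectl wait --for=condition=Available deployment --all -n samosachaat-prod --timeout=120s
```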
@@ -50,6 +50,10 @@ module "eks" {
  instance_types = [var.node_instance_type]
  capacity_type  = "ON_DEMAND"

  update_config = {
    max_unavailable_percentage = var.node_max_unavailable_percentage
  }

  labels = {
    role = "general"
  }
@@ -32,3 +32,8 @@ output "oidc_provider_url" {
  description = "IRSA OIDC issuer URL (without https://)."
  value       = module.eks.oidc_provider
}

output "current_node_ami_id" {
  description = "The current EKS-optimized AMI ID used by the node group."
  value       = data.aws_ssm_parameter.eks_ami_id.value
}
@@ -43,6 +43,12 @@ variable "node_desired_size" {
  default = 2
}

variable "node_max_unavailable_percentage" {
  description = "Max percentage of nodes unavailable during rolling update."
  type        = number
  default     = 33
}

variable "tags" {
  description = "Tags applied to every resource."
  type        = map(string)
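To confirm the rendered update_config on the live node group, it can be read back with the AWS CLI; the cluster and node group names below are placeholders for the real ones:

```bash
# Should echo back maxUnavailablePercentage: 33 (or the overridden value)
aws eks describe-nodegroup --cluster-name samosachaat-dev \
  --nodegroup-name general --query 'nodegroup.updateConfig'
```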