Add tooling and documentation for Day 2 cluster operations:
- scripts/rotate-nodes.sh: interactive node-rotation driver that applies
terraform to pick up the latest SSM-resolved EKS AMI and watches the
rolling replacement.
- scripts/demo-schema-change.sh: end-to-end demo of the zero-downtime
is_favorited column migration via helm upgrade + migration hook.
- scripts/verify-deployment.sh: post-deploy health check across pods,
per-service HTTP health endpoints, rollout status, and PDBs.
- docs/chaos-runbook.md: failure-mode playbook with simulate / Grafana /
Loki / recovery steps for six scenarios (pod kill, node failure, DB
pool exhaustion, inference OOM, high latency, SSL issues) plus a
Loki quick-reference.
- terraform/modules/eks: expose current_node_ami_id output, add
update_config.max_unavailable_percentage (configurable, default 33)
to cap how many nodes are replaced at once during a node-group roll.
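For reviewers, the staleness check behind a node rotation can be sketched
roughly as below. This is an illustration, not the contents of
scripts/rotate-nodes.sh. The function names are hypothetical, the SSM
parameter path assumes the Amazon Linux 2 EKS-optimized AMI for 1.29, and
it assumes the environment root re-exports the module's
current_node_ami_id output.

```shell
# Hypothetical helpers; not taken from scripts/rotate-nodes.sh.

# Succeeds when a latest AMI is known and differs from the deployed one.
needs_rotation() {
  [ -n "$2" ] && [ "$1" != "$2" ]
}

# Assumption: node group uses the Amazon Linux 2 EKS-optimized AMI on 1.29.
latest_ami() {
  aws ssm get-parameter \
    --name /aws/service/eks/optimized-ami/1.29/amazon-linux-2/recommended/image_id \
    --query 'Parameter.Value' --output text
}

# Assumption: the environment root re-exports the module output.
current_ami() {
  terraform output -raw current_node_ami_id
}

# Prints whether the deployed AMI lags the SSM-resolved one.
report_rotation() {
  current="$(current_ami)" || return 1
  latest="$(latest_ami)" || return 1
  if needs_rotation "$current" "$latest"; then
    echo "stale: $current -> $latest (terraform apply will roll the node group)"
  else
    echo "current: $current"
  fi
}
```

Run report_rotation from an environment directory with AWS credentials
available; it only reads state and changes nothing.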
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add reusable Terraform modules and per-environment configs (dev/uat/prod)
in us-west-2 covering: VPC (3 AZs, public/private subnets), EKS 1.29 with
IRSA plus ALB controller and EBS/EFS CSI add-ons, RDS PostgreSQL 15, four
ECR repos, IAM roles (EKS node, ALB controller IRSA, GitHub Actions OIDC),
Route53 + ACM for samosachaat.art, and EFS for model weights. State
backend on S3 (samosachaat-terraform-state) with a DynamoDB lock table.
terraform validate passes for dev, uat, and prod.
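One way to reproduce the validation locally is sketched below. The
directory layout (terraform/environments/<env>) and the ENV_ROOT and
TF_BIN variables are assumptions for illustration; the commit does not
state where the per-environment configs live.

```shell
# Assumed layout: terraform/environments/<env>; override ENV_ROOT as needed.
ENV_ROOT="${ENV_ROOT:-terraform/environments}"
# TF_BIN lets a wrapper or pinned binary stand in for terraform.
TF_BIN="${TF_BIN:-terraform}"

# Validates each named environment; succeeds only if all of them pass.
validate_envs() {
  failed=""
  for env in "$@"; do
    # -backend=false skips the S3/DynamoDB backend, so no AWS credentials
    # are needed just to validate the configuration.
    if "$TF_BIN" -chdir="$ENV_ROOT/$env" init -backend=false -input=false >/dev/null \
      && "$TF_BIN" -chdir="$ENV_ROOT/$env" validate >/dev/null; then
      echo "ok: $env"
    else
      echo "FAIL: $env"
      failed="$failed $env"
    fi
  done
  [ -z "$failed" ]
}
```

Usage: validate_envs dev uat prod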
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>