nanochat/scripts
Manmohan Sharma 0b8f9f0a5f
feat(ops): Day 2 operations automation and chaos runbook (#10)

Adds tooling and documentation for Day 2 cluster operations:

- scripts/rotate-nodes.sh: interactive node-rotation driver that applies
  terraform to pick up the latest SSM-resolved EKS AMI and watches the
  rolling replacement.
- scripts/demo-schema-change.sh: end-to-end demo of the zero-downtime
  is_favorited column migration via helm upgrade + migration hook.
- scripts/verify-deployment.sh: post-deploy health check across pods,
  per-service HTTP health endpoints, rollout status, and PDBs.
- docs/chaos-runbook.md: failure-mode playbook with simulate / Grafana /
  Loki / recovery steps for six scenarios (pod kill, node failure, DB
  pool exhaustion, inference OOM, high latency, SSL issues) plus a
  Loki quick-reference.
- terraform/modules/eks: expose current_node_ami_id output, add
  update_config.max_unavailable_percentage (configurable, default 33)
  so node-group rolls are controlled.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-16 12:25:47 -07:00
| File | Last commit | Date |
| --- | --- | --- |
| base_eval.py | delete autocast, an unnecessary thorn in my side, manage dtypes directly | 2026-03-04 23:55:30 +00:00 |
| base_train.py | Add pre-GPU tool training and checkpoint plumbing | 2026-03-24 20:52:36 -04:00 |
| build_tool_datasets.py | Add pre-GPU tool training and checkpoint plumbing | 2026-03-24 20:52:36 -04:00 |
| chat_cli.py | rebrand to samosaChaat: UI, logo, and server messages | 2026-03-23 09:58:12 -04:00 |
| chat_eval.py | Add pre-GPU tool training and checkpoint plumbing | 2026-03-24 20:52:36 -04:00 |
| chat_rl.py | Add pre-GPU tool training and checkpoint plumbing | 2026-03-24 20:52:36 -04:00 |
| chat_sft.py | Add pre-GPU tool training and checkpoint plumbing | 2026-03-24 20:52:36 -04:00 |
| chat_tool_rl.py | Add pre-GPU tool training and checkpoint plumbing | 2026-03-24 20:52:36 -04:00 |
| chat_web.py | rebrand to samosaChaat: UI, logo, and server messages | 2026-03-23 09:58:12 -04:00 |
| demo-schema-change.sh | feat(ops): Day 2 operations automation and chaos runbook (#10) | 2026-04-16 12:25:47 -07:00 |
| export_onnx.py | redesign UI: artisan landing page + warm chat theme + ONNX export script | 2026-03-23 11:54:07 -04:00 |
| hf_sync_checkpoint.py | Add pre-GPU tool training and checkpoint plumbing | 2026-03-24 20:52:36 -04:00 |
| import_hf_checkpoint.py | Add pre-GPU tool training and checkpoint plumbing | 2026-03-24 20:52:36 -04:00 |
| local-dev.sh | scaffold monorepo platform layout | 2026-04-16 11:06:29 -07:00 |
| rotate-nodes.sh | feat(ops): Day 2 operations automation and chaos runbook (#10) | 2026-04-16 12:25:47 -07:00 |
| seed-db.sh | scaffold monorepo platform layout | 2026-04-16 11:06:29 -07:00 |
| tok_eval.py | initial commit | 2025-10-13 06:49:24 -07:00 |
| tok_train.py | quick fix to not OOM main speedrun script | 2026-01-26 22:31:42 +00:00 |
| verify_external_access.py | Add pre-GPU tool training and checkpoint plumbing | 2026-03-24 20:52:36 -04:00 |
| verify-deployment.sh | feat(ops): Day 2 operations automation and chaos runbook (#10) | 2026-04-16 12:25:47 -07:00 |