mirror of https://github.com/karpathy/nanochat.git
synced 2026-05-08 08:49:53 +00:00
Adds tooling and documentation for Day 2 cluster operations:

- scripts/rotate-nodes.sh: interactive node-rotation driver that applies terraform to pick up the latest SSM-resolved EKS AMI and watches the rolling replacement.
- scripts/demo-schema-change.sh: end-to-end demo of the zero-downtime is_favorited column migration via helm upgrade + migration hook.
- scripts/verify-deployment.sh: post-deploy health check across pods, per-service HTTP health endpoints, rollout status, and PDBs.
- docs/chaos-runbook.md: failure-mode playbook with simulate / Grafana / Loki / recovery steps for six scenarios (pod kill, node failure, DB pool exhaustion, inference OOM, high latency, SSL issues) plus a Loki quick-reference.
- terraform/modules/eks: expose current_node_ami_id output, add update_config.max_unavailable_percentage (configurable, default 33) so node-group rolls are controlled.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

| Name | | |
|---|---|---|
| base_eval.py | ||
| base_train.py | ||
| build_tool_datasets.py | ||
| chat_cli.py | ||
| chat_eval.py | ||
| chat_rl.py | ||
| chat_sft.py | ||
| chat_tool_rl.py | ||
| chat_web.py | ||
| demo-schema-change.sh | ||
| export_onnx.py | ||
| hf_sync_checkpoint.py | ||
| import_hf_checkpoint.py | ||
| local-dev.sh | ||
| rotate-nodes.sh | ||
| seed-db.sh | ||
| tok_eval.py | ||
| tok_train.py | ||
| verify_external_access.py | ||
| verify-deployment.sh | ||