nanochat/scripts
William Thurston 194c98a5b3 Merge upstream/master (266 commits) into fork
Accept upstream's architectural changes wholesale:
- argparse replaces configurator.py across all scripts
- Unified MuonAdamW optimizer replaces separate AdamW + Muon
- Sliding window attention (SSSL pattern) + Flash Attention 3
- Value embeddings (ResFormer-style) with per-layer gating
- Per-layer learnable scalars (resid_lambdas, x0_lambdas)
- FP8 training support with Float8Linear
- Scaling laws (Power Lines batch sizing, T_epoch weight decay)
- Checkpoint resumption with dataloader state
- BOS-aligned bestfit-pad packing for SFT
- ChatCORE evaluation metric
- Consolidated base_loss.py into base_eval.py
- Removed mid_train.py (pipeline simplified)

Drops our MoE and tie_embeddings implementations in favor of
upstream's cleaner architecture. These can be re-added later
on top of the new codebase if needed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 14:50:28 -08:00
File           Last commit date            Last commit message
base_eval.py   2026-02-03 21:03:42 +00:00  small touchups to the eval script, re-order items etc, cosmetic
base_train.py  2026-02-22 14:42:58 -08:00  Add tie_embeddings support and configurable logging interval
chat_cli.py    2026-02-02 08:33:46 -08:00  remove leftover mid references (#491)
chat_eval.py   2026-02-22 14:50:28 -08:00  Merge upstream/master (266 commits) into fork
chat_rl.py     2026-02-02 08:33:46 -08:00  remove leftover mid references (#491)
chat_sft.py    2026-02-18 15:49:18 +00:00  tune the data mixture a bit, load optimizer by default when SFT. These were confirmed to be best settings from sweeps of sft
chat_web.py    2026-02-02 08:33:46 -08:00  remove leftover mid references (#491)
tok_eval.py    2025-10-13 06:49:24 -07:00  initial commit
tok_train.py   2026-01-26 22:31:42 +00:00  quick fix to not OOM main speedrun script