nanochat/tests
William Thurston 194c98a5b3 Merge upstream/master (266 commits) into fork
Accept upstream's architectural changes wholesale:
- argparse replaces configurator.py across all scripts
- Unified MuonAdamW optimizer replaces separate AdamW + Muon
- Sliding window attention (SSSL pattern) + Flash Attention 3
- Value embeddings (ResFormer-style) with per-layer gating
- Per-layer learnable scalars (resid_lambdas, x0_lambdas)
- FP8 training support with Float8Linear
- Scaling laws (Power Lines batch sizing, T_epoch weight decay)
- Checkpoint resumption with dataloader state
- BOS-aligned bestfit-pad packing for SFT
- ChatCORE evaluation metric
- Consolidated base_loss.py into base_eval.py
- Removed mid_train.py (pipeline simplified)

Drops our MoE and tie_embeddings implementations in favor of
upstream's cleaner architecture. These can be re-added later
on top of the new codebase if needed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 14:50:28 -08:00
test_attention_fallback.py Fix SDPA KV-cache decode to respect sliding window (#456) 2026-01-30 17:32:12 +00:00
test_engine.py Fix MockModel's device definition (#535) 2026-02-17 16:03:46 -08:00
test_moe.py Implement reset_parameters method in MoEFeedForward and update GPT to utilize it 2025-11-13 17:09:11 -08:00