nanochat/tests
William Thurston 194c98a5b3 Merge upstream/master (266 commits) into fork
Accept upstream's architectural changes wholesale:
- argparse replaces configurator.py across all scripts
- Unified MuonAdamW optimizer replaces separate AdamW + Muon
- Sliding window attention (SSSL pattern) + Flash Attention 3
- Value embeddings (ResFormer-style) with per-layer gating
- Per-layer learnable scalars (resid_lambdas, x0_lambdas)
- FP8 training support with Float8Linear
- Scaling laws (Power Lines batch sizing, T_epoch weight decay)
- Checkpoint resumption with dataloader state
- BOS-aligned bestfit-pad packing for SFT
- ChatCORE evaluation metric
- Consolidated base_loss.py into base_eval.py
- Removed mid_train.py (pipeline simplified)

Drops our MoE and tie_embeddings implementations in favor of
upstream's cleaner architecture. These can be re-added later
on top of the new codebase if needed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 14:50:28 -08:00
test_attention_fallback.py Fix SDPA KV-cache decode to respect sliding window (#456) 2026-01-30 17:32:12 +00:00
test_engine.py Fix MockModel's device definition (#535) 2026-02-17 16:03:46 -08:00
test_moe.py Implement reset_parameters method in MoEFeedForward and update GPT to utilize it 2025-11-13 17:09:11 -08:00