nanochat

mirror of https://github.com/karpathy/nanochat.git synced 2026-03-13 08:23:12 +00:00

William Thurston 194c98a5b3 Merge upstream/master (266 commits) into fork	2026-02-22 14:50:28 -08:00

Accept upstream's architectural changes wholesale:
- argparse replaces configurator.py across all scripts
- Unified MuonAdamW optimizer replaces separate AdamW + Muon
- Sliding window attention (SSSL pattern) + Flash Attention 3
- Value embeddings (ResFormer-style) with per-layer gating
- Per-layer learnable scalars (resid_lambdas, x0_lambdas)
- FP8 training support with Float8Linear
- Scaling laws (Power Lines batch sizing, T_epoch weight decay)
- Checkpoint resumption with dataloader state
- BOS-aligned bestfit-pad packing for SFT
- ChatCORE evaluation metric
- Consolidated base_loss.py into base_eval.py
- Removed mid_train.py (pipeline simplified)

Drops our MoE and tie_embeddings implementations in favor of upstream's cleaner architecture. These can be re-added later on top of the new codebase if needed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

base_eval.py	small touchups to the eval script, re-order items etc, cosmetic	2026-02-03 21:03:42 +00:00
base_train.py	Add tie_embeddings support and configurable logging interval	2026-02-22 14:42:58 -08:00
chat_cli.py	remove leftover mid references (#491)	2026-02-02 08:33:46 -08:00
chat_eval.py	Merge upstream/master (266 commits) into fork	2026-02-22 14:50:28 -08:00
chat_rl.py	remove leftover mid references (#491)	2026-02-02 08:33:46 -08:00
chat_sft.py	tune the data mixture a bit, load optimizer by default when SFT. These were confirmed to be best settings from sweeps of sft	2026-02-18 15:49:18 +00:00
chat_web.py	remove leftover mid references (#491)	2026-02-02 08:33:46 -08:00
tok_eval.py	initial commit	2025-10-13 06:49:24 -07:00
tok_train.py	quick fix to not OOM main speedrun script	2026-01-26 22:31:42 +00:00