Mirror of https://github.com/karpathy/nanochat.git, synced 2026-03-20 20:03:19 +00:00
Latest commit: Accept upstream's architectural changes wholesale:

- argparse replaces configurator.py across all scripts
- Unified MuonAdamW optimizer replaces separate AdamW + Muon
- Sliding window attention (SSSL pattern) + Flash Attention 3
- Value embeddings (ResFormer-style) with per-layer gating
- Per-layer learnable scalars (resid_lambdas, x0_lambdas)
- FP8 training support with Float8Linear
- Scaling laws (Power Lines batch sizing, T_epoch weight decay)
- Checkpoint resumption with dataloader state
- BOS-aligned bestfit-pad packing for SFT
- ChatCORE evaluation metric
- Consolidated base_loss.py into base_eval.py
- Removed mid_train.py (pipeline simplified)

Drops our MoE and tie_embeddings implementations in favor of upstream's cleaner architecture. These can be re-added later on top of the new codebase if needed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
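The "SSSL pattern" in the commit message refers to interleaving sliding-window and full attention across layers. A minimal sketch of how such a pattern could map to per-layer window sizes is shown below; the function name, default pattern, and window size of 128 are illustrative assumptions, not upstream's actual code.

```python
# Hypothetical sketch of an SSSL-style layer schedule: the pattern string is
# tiled over the layer stack, where 'S' marks a sliding-window attention layer
# (limited to `window` tokens) and 'L' a full-attention layer (whole sequence).
# All names and sizes here are assumptions for illustration only.

def layer_window_sizes(n_layers, seq_len, pattern="SSSL", window=128):
    """Return one attention window size per layer by cycling the pattern."""
    sizes = []
    for i in range(n_layers):
        kind = pattern[i % len(pattern)]
        sizes.append(window if kind == "S" else seq_len)
    return sizes

print(layer_window_sizes(8, 2048))
# → [128, 128, 128, 2048, 128, 128, 128, 2048]
```

With this kind of schedule, most layers only attend within a local window while every fourth layer retains global context, which keeps long-sequence attention cost low without giving up long-range mixing entirely.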
| Name |
|---|
| estimate_gpt3_core.ipynb |
| gen_synthetic_data.py |
| generate_logo.html |
| LEADERBOARD.md |
| LOG.md |
| nanochat.png |
| repackage_data_reference.py |
| runmps_evals.sh |
| runmps.sh |
| scaling_analysis.ipynb |
| scaling_laws_jan26.png |