Mirror of https://github.com/karpathy/nanochat.git (synced 2026-05-08 00:39:50 +00:00)
Single-flag minimal change. When tau > 0, the c_q/c_k weights are pulled into a
dedicated Muon group and rescaled after each Muon step so that the Frobenius/sqrt(min_dim)
spectral-norm estimate stays <= sqrt(tau). The default tau=0 is a no-op, bit-identical to v73.
Reference: Kimi K2 paper (arXiv 2507.20534, §A). Caps the max attention logit at ~tau.
Files touched (3):
nanochat/optim.py: +_apply_qk_clip helper, called after MuonAdamW.step
and DistMuonAdamW.step
nanochat/gpt.py: +muon_qk_clip_tau arg in setup_optimizer; splits c_q/c_k
into a dedicated Muon group when tau > 0
scripts/base_train.py: +--muon-qk-clip-tau CLI arg, threaded to setup_optimizer
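The group split in setup_optimizer might look roughly like the sketch below. The function name, the substring match on c_q/c_k, and the group-dict key are all hypothetical; only the behavior (a dedicated Muon group when tau > 0) comes from the commit:

```python
def split_qk_group(named_params, tau):
    """Route c_q/c_k weights into their own param group when tau > 0."""
    qk, rest = [], []
    for name, p in named_params:
        is_qk = tau > 0 and ("c_q" in name or "c_k" in name)
        (qk if is_qk else rest).append(p)
    groups = [{"params": rest}]
    if qk:
        # hypothetical key; the clip helper would read tau from here
        groups.append({"params": qk, "qk_clip_tau": tau})
    return groups
```

With tau=0 every parameter stays in the original group, preserving the bit-identical default.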
Validated overnight (private fork) at d22, 6000 iters:
v73 baseline: val_bpb 0.7242, CORE 0.2714, crosses GPT-2 CORE @ ~81 min
v198 (tau=100): val_bpb 0.7242, CORE 0.2731, crosses GPT-2 CORE @ ~80 min
All other tweaks tried (warmdown, lr, warmup) regressed; the tau sweep (50/100/200)
showed a sharp peak at tau=100.
Generalizes across model depths because it's a Muon optimizer-level fix, not a
recipe tweak.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>