Mirror of https://github.com/karpathy/nanochat.git (synced 2026-05-08 00:39:50 +00:00)
Single-flag minimal change. When tau > 0, the c_q/c_k weights are pulled into a
dedicated Muon group and rescaled after each Muon step so that the Frobenius/sqrt(min_dim)
spectral-norm estimate stays <= sqrt(tau). The default tau=0 is a no-op, bit-identical to v73.
Reference: Kimi K2 paper (arXiv 2507.20534, §A). Caps the max attention logit at ~tau.
Files touched (3):
nanochat/optim.py: +_apply_qk_clip helper, called after MuonAdamW.step
and DistMuonAdamW.step
nanochat/gpt.py: +muon_qk_clip_tau arg in setup_optimizer; splits c_q/c_k
into a dedicated Muon group when tau > 0
scripts/base_train.py: +--muon-qk-clip-tau CLI arg, threaded to setup_optimizer
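The group split in setup_optimizer might look roughly like the sketch below. The function name, the substring match on c_q/c_k, and the group-dict key are all hypothetical; only the behavior (a dedicated Muon group when tau > 0) comes from the commit:

```python
def split_qk_group(named_params, tau):
    """Route c_q/c_k weights into their own param group when tau > 0."""
    qk, rest = [], []
    for name, p in named_params:
        is_qk = tau > 0 and ("c_q" in name or "c_k" in name)
        (qk if is_qk else rest).append(p)
    groups = [{"params": rest}]
    if qk:
        # hypothetical key; the clip helper would read tau from here
        groups.append({"params": qk, "qk_clip_tau": tau})
    return groups
```

With tau=0 every parameter stays in the original group, preserving the bit-identical default.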
Validated overnight (private fork) at d22, 6000 iters:
v73 baseline: val_bpb 0.7242, CORE 0.2714, crosses GPT-2 CORE @ ~81 min
v198 (tau=100): val_bpb 0.7242, CORE 0.2731, crosses GPT-2 CORE @ ~80 min
All other tweaks tried (warmdown, lr, warmup) regressed; the tau sweep (50/100/200)
showed a sharp peak at tau=100.
Generalizes across model depths because it's a Muon optimizer-level fix, not a
recipe tweak.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>