d22 + 6000 iter + bs=1M + warmdown=0.85 + muonclip τ=100
- CORE 0.2646 in 88.2 min (matches Run 6 quality, 10.9% faster wall-clock)
- val_bpb 0.7241
Both warmdown=0.85 and MuonClip individually regress at d22; together they
synergize. MuonClip is the only code addition (66 LOC across optim.py,
gpt.py, and base_train.py); with the default OFF, Run 6 behavior is
preserved bit-for-bit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Minimal single-flag change. When tau > 0, the c_q/c_k weights are pulled into
a dedicated Muon group and rescaled after each Muon step so that the spectral-norm
estimate ||W||_F / sqrt(min_dim) stays <= sqrt(tau). The default tau=0 is a
no-op, bit-identical to v73.
Reference: Kimi K2 paper (arXiv:2507.20534, §A). Caps the max attention logit
at roughly tau.
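For illustration, a minimal sketch of the clip, assuming it operates directly
on the group's weight tensors (the real _apply_qk_clip in nanochat/optim.py
may differ in detail):

    import math
    import torch

    @torch.no_grad()
    def _apply_qk_clip(qk_weights, tau: float) -> None:
        # tau <= 0 disables clipping entirely: bit-identical to the baseline.
        if tau <= 0:
            return
        cap = math.sqrt(tau)
        for w in qk_weights:
            # Cheap spectral-norm estimate: ||W||_F / sqrt(min_dim).
            est = w.norm() / math.sqrt(min(w.shape))
            if est > cap:
                # Rescale so the estimate lands exactly at sqrt(tau).
                w.mul_(cap / est)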
Files touched (3):
nanochat/optim.py: +_apply_qk_clip helper, called after MuonAdamW.step
and DistMuonAdamW.step
nanochat/gpt.py: +muon_qk_clip_tau arg in setup_optimizer; splits c_q/c_k
into a dedicated Muon group when tau > 0
scripts/base_train.py: +--muon-qk-clip-tau CLI arg, threaded to setup_optimizer
Validated overnight (private fork) at d22, 6000 iters:
v73 baseline: val_bpb 0.7242, CORE 0.2714, crosses GPT-2 CORE @ ~81 min
v198 (tau=100): val_bpb 0.7242, CORE 0.2731, crosses GPT-2 CORE @ ~80 min
All other stacked changes (warmdown, lr, warmup) regressed; the tau sweep
(50/100/200) showed a sharp peak at tau=100.
Generalizes across model depths because it's a Muon optimizer-level fix, not a
recipe tweak.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When swapping Float8Linear for Linear in the disable_fp8 context manager,
using device=fp8_module.weight.device directly allocates the new tensors
on the GPU, causing an unnecessary VRAM spike (~1GB for large models).
This fix allocates on device='meta' to avoid any physical memory
allocation, then swaps in a reference to the existing weight tensor,
eliminating the VRAM spike during the evaluation phase.
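A minimal sketch of the idea, assuming Float8Linear exposes the standard
in_features/out_features/weight/bias attributes (the helper name
_to_plain_linear is illustrative):

    import torch.nn as nn

    def _to_plain_linear(fp8_module) -> nn.Linear:
        # device='meta' creates only tensor metadata, so no VRAM is allocated.
        linear = nn.Linear(
            fp8_module.in_features,
            fp8_module.out_features,
            bias=fp8_module.bias is not None,
            device="meta",
        )
        # Swap in references to the existing tensors instead of copying them.
        linear.weight = fp8_module.weight
        if fp8_module.bias is not None:
            linear.bias = fp8_module.bias
        return linear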
Fixes issue #592
Co-authored-by: RoomWithOutRoof <roomwithoutroof@sparklab.ai>
The bf16 cast is intentional for speed on Hopper+ GPUs, but should be
skipped on other platforms rather than blindly applied. fp16 is unstable
here due to its limited exponent range, and fp32 platforms don't benefit
from the cast. Now: bf16 when COMPUTE_DTYPE is bf16, no cast otherwise.
Inspired by PR #667.
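A minimal sketch of the guard; COMPUTE_DTYPE's definition here is
illustrative, standing in for however the repo actually sets it:

    import torch

    # Illustrative stand-in for the repo's COMPUTE_DTYPE.
    COMPUTE_DTYPE = (torch.bfloat16
                     if torch.cuda.is_available() and torch.cuda.is_bf16_supported()
                     else torch.float32)

    def maybe_cast(x: torch.Tensor) -> torch.Tensor:
        # bf16 keeps fp32's exponent range (unlike fp16, which overflows here),
        # and non-bf16 platforms skip the cast entirely.
        if COMPUTE_DTYPE == torch.bfloat16:
            return x.to(torch.bfloat16)
        return x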
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New architectural features (sketched after this list):
- Smear: mix previous token embedding into current position via learned
gate, providing cheap bigram-like info (works in training + KV cache)
- Backout: subtract learned fraction of mid-layer residual before logit
projection to remove low-level features
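Minimal sketches of both ideas; the gate and fraction parameterizations are
assumptions for illustration, not the repo's exact modules:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Smear(nn.Module):
        # Mix the previous token's embedding into the current position via a
        # learned sigmoid gate; the one-token shift behaves identically in
        # training and under a KV cache, since the previous embedding is
        # always available.
        def __init__(self, dim: int):
            super().__init__()
            self.gate = nn.Linear(dim, 1)

        def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, C)
            prev = F.pad(x, (0, 0, 1, 0))[:, :-1]  # shift right by one token
            return x + torch.sigmoid(self.gate(x)) * prev

    class Backout(nn.Module):
        # Subtract a learned fraction of the mid-layer residual stream just
        # before the logit projection, removing low-level features.
        def __init__(self):
            super().__init__()
            self.frac = nn.Parameter(torch.tensor(0.1))

        def forward(self, x_final: torch.Tensor, x_mid: torch.Tensor):
            return x_final - self.frac * x_mid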
Hyperparameter tuning:
- Muon momentum warmdown 0.97→0.90 during the LR warmdown phase (schedule
  sketched below)
- Non-uniform per-layer init: resid_lambdas 1.15→1.05, x0_lambdas 0.20→0.05
- c_fc init scale 0.4x, QK norm scale 1.2, sliding window seq_len/4
- Speedrun data:params ratio reduced to 8
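A sketch of the momentum schedule, assuming a linear ramp over the LR
warmdown phase (the exact schedule shape is an assumption):

    def muon_momentum(step: int, warmdown_start: int, total_steps: int,
                      mom_hi: float = 0.97, mom_lo: float = 0.90) -> float:
        # Hold momentum at 0.97 until LR warmdown begins, then ramp linearly
        # down to 0.90 by the final step.
        if step < warmdown_start:
            return mom_hi
        frac = (step - warmdown_start) / max(1, total_steps - warmdown_start)
        return mom_hi + frac * (mom_lo - mom_hi)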
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* printing the step count
* adding a reply-only loss for chat (see the sketch below)
* using the mask produced by the tokenizer's render_conversation function
* undoing some changes
* putting back a comment that was removed accidentally; no functionality change
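A minimal sketch of the reply-only loss, assuming render_conversation yields
token ids plus a 0/1 mask over assistant-reply tokens; exact shapes and
signatures in the repo may differ:

    import torch
    import torch.nn.functional as F

    def reply_only_loss(logits: torch.Tensor, targets: torch.Tensor,
                        mask: torch.Tensor) -> torch.Tensor:
        # Positions where mask == 0 (user/system tokens) are excluded from
        # the loss via cross_entropy's ignore_index.
        targets = targets.masked_fill(mask == 0, -100)
        return F.cross_entropy(
            logits.view(-1, logits.size(-1)),
            targets.view(-1),
            ignore_index=-100,
        )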