nanochat

mirror of https://github.com/karpathy/nanochat.git synced 2026-05-27 18:18:07 +00:00


Andrej Karpathy a825e63f81 Autoresearch round 2: smear, backout, and hyperparameter tuning	2026-03-14 17:03:06 +00:00

New architectural features:
- Smear: mix previous token embedding into current position via learned gate, providing cheap bigram-like info (works in training + KV cache)
- Backout: subtract learned fraction of mid-layer residual before logit projection to remove low-level features

Hyperparameter tuning:
- Muon momentum warmdown 0.97→0.90 during LR warmdown phase
- Non-uniform per-layer init: resid_lambdas 1.15→1.05, x0_lambdas 0.20→0.05
- c_fc init scale 0.4x, QK norm scale 1.2, sliding window seq_len/4
- Speedrun data:params ratio reduced to 8

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
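
The smear idea in the commit message above can be sketched roughly as follows. This is a minimal pure-Python illustration, not the repository's code. The function names `smear` and `smear_step`, the single scalar `gate_logit`, the sigmoid gate, and the additive mixing form are all assumptions made for illustration; the commit message only states that the previous token embedding is mixed into the current position via a learned gate, and that this works both in training and with a KV cache.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def smear(embeddings, gate_logit):
    """Full-sequence (training-style) smear.

    embeddings: list of T vectors (lists of floats), one per position.
    gate_logit: hypothetical learned scalar; sigmoid(gate_logit) is the
                mixing weight (an assumption, not the repo's parameterization).
    """
    g = sigmoid(gate_logit)
    # Position 0 has no predecessor, so it passes through unchanged.
    out = [list(embeddings[0])] if embeddings else []
    for t in range(1, len(embeddings)):
        prev, cur = embeddings[t - 1], embeddings[t]
        # Add a gated copy of the previous token's raw embedding.
        out.append([c + g * p for c, p in zip(cur, prev)])
    return out


def smear_step(cur, prev, gate_logit):
    """Incremental (decoding-style) smear for one new token.

    Only the previous token's raw embedding needs to be kept around,
    which is why this kind of mixing is cheap during cached inference.
    """
    if prev is None:
        return list(cur)
    g = sigmoid(gate_logit)
    return [c + g * p for c, p in zip(cur, prev)]


# Usage: the step-by-step path reproduces the full-sequence path.
emb = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
full = smear(emb, gate_logit=0.0)  # gate weight is sigmoid(0) = 0.5
stepwise, prev = [], None
for e in emb:
    stepwise.append(smear_step(e, prev, gate_logit=0.0))
    prev = e
```

Because each output position depends only on the current and immediately preceding raw embeddings, the mixing carries bigram-like information without any extra attention cost. Whether the actual implementation gates per channel, per token, or interpolates rather than adds is not stated in the commit message.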
miniseries.sh	at 28 and above we start to need batch size 8	2026-02-08 18:26:34 +00:00
runcpu.sh
scaling_laws.sh
speedrun.sh	Autoresearch round 2: smear, backout, and hyperparameter tuning	2026-03-14 17:03:06 +00:00