nanochat/runs
Andrej Karpathy a825e63f81 Autoresearch round 2: smear, backout, and hyperparameter tuning
New architectural features:
- Smear: mix previous token embedding into current position via learned
  gate, providing cheap bigram-like info (works in training + KV cache)
- Backout: subtract learned fraction of mid-layer residual before logit
  projection to remove low-level features
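The two features above might be sketched in PyTorch roughly as follows. This is a minimal sketch, not the actual nanochat implementation: the module names, the per-channel gate parameterization, the zero-initializations, and the scalar backout fraction are all assumptions; the commit only states that a learned gate mixes in the previous token embedding and that a learned fraction of a mid-layer residual is subtracted before the logit projection.

```python
import torch
import torch.nn as nn


class Smear(nn.Module):
    """Mix the previous token's embedding into the current position
    through a learned per-channel gate (assumed parameterization).
    Zero-init makes it the identity at the start of training, and it
    only looks one position back, so it composes with a KV cache."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim). Shift right so each position sees its
        # predecessor; position 0 mixes in zeros.
        prev = torch.cat([torch.zeros_like(x[:, :1]), x[:, :-1]], dim=1)
        return x + self.gate * prev


class Backout(nn.Module):
    """Subtract a learned fraction of a mid-layer residual from the
    final residual before the logit projection, backing out low-level
    features the output head does not need (assumed scalar fraction)."""

    def __init__(self):
        super().__init__()
        self.frac = nn.Parameter(torch.tensor(0.0))

    def forward(self, x_final: torch.Tensor, x_mid: torch.Tensor) -> torch.Tensor:
        return x_final - self.frac * x_mid
```

With both parameters zero-initialized, the modules start as no-ops, so the modified model begins training at the unmodified model's behavior.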

Hyperparameter tuning:
- Muon momentum warmdown 0.97→0.90 during LR warmdown phase
- Non-uniform per-layer init: resid_lambdas 1.15→1.05, x0_lambdas 0.20→0.05
- c_fc init scale 0.4x, QK norm scale 1.2, sliding window seq_len/4
- Speedrun data:params ratio reduced to 8
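The momentum warmdown above could be sketched as a schedule like the following. Only the 0.97 → 0.90 endpoints come from the commit; the linear shape and the `warmdown_frac` parameter (what fraction of training the LR warmdown covers) are assumptions for illustration.

```python
def muon_momentum(step: int, num_steps: int, warmdown_frac: float = 0.2,
                  mu_start: float = 0.97, mu_end: float = 0.90) -> float:
    """Hold momentum at mu_start, then anneal linearly to mu_end over
    the final warmdown_frac of training (the assumed LR warmdown phase)."""
    warmdown_steps = int(num_steps * warmdown_frac)
    warmdown_begin = num_steps - warmdown_steps
    if step < warmdown_begin:
        return mu_start
    t = (step - warmdown_begin) / max(1, warmdown_steps)
    return mu_start + t * (mu_end - mu_start)
```

The value returned each step would replace the fixed momentum passed to the Muon optimizer's update.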

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-14 17:03:06 +00:00
miniseries.sh at 28 and above, we start to need batch size 8 2026-02-08 18:26:34 +00:00
runcpu.sh merge two files, base_loss and base_eval, into a single file (it's nicer this way) and unify the huggingface code associated with both 2026-02-01 02:36:43 +00:00
scaling_laws.sh add engram-lite, add log, tune scaling laws analysis scripts 2026-01-27 22:31:17 +00:00
speedrun.sh Autoresearch round 2: smear, backout, and hyperparameter tuning 2026-03-14 17:03:06 +00:00