nanochat/nanochat
Andrej Karpathy a825e63f81 Autoresearch round 2: smear, backout, and hyperparameter tuning
New architectural features:
- Smear: mix previous token embedding into current position via learned
  gate, providing cheap bigram-like info (works in training + KV cache)
- Backout: subtract learned fraction of mid-layer residual before logit
  projection to remove low-level features
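
A minimal NumPy sketch of how the two features above could work, not the code in gpt.py. The sigmoid gate form, the names `gate_w` and `backout_lambda`, and the choice of NumPy are all assumptions made for illustration; the commit message only states that smear mixes the previous token embedding through a learned gate and that backout subtracts a learned fraction of a mid-layer residual.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def smear(x, gate_w):
    """Mix each position's predecessor into it through a per-position gate.

    x:      (T, D) token embeddings
    gate_w: (D,) hypothetical learned gate weights (assumed form)
    """
    gate = sigmoid(x @ gate_w)      # (T,) one scalar gate per position
    prev = np.zeros_like(x)
    prev[1:] = x[:-1]               # position 0 has no predecessor
    return x + gate[:, None] * prev

def smear_step(x_t, x_prev, gate_w):
    """Single-token decoding variant: only the previous embedding is needed,
    which is why this kind of mixing is compatible with a KV cache."""
    gate = sigmoid(x_t @ gate_w)
    return x_t + gate * x_prev

def backout(x_final, x_mid, backout_lambda):
    """Subtract a fraction of a mid-layer residual before the logit projection.

    backout_lambda is a hypothetical learned scalar.
    """
    return x_final - backout_lambda * x_mid
```

Under these assumptions, the batched and single-step paths give the same result for any position, since each output depends only on the current and previous embeddings.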

Hyperparameter tuning:
- Muon momentum warmdown 0.97→0.90 during LR warmdown phase
- Non-uniform per-layer init: resid_lambdas 1.15→1.05, x0_lambdas 0.20→0.05
- c_fc init scale 0.4x, QK norm scale 1.2, sliding window seq_len/4
- Speedrun data:params ratio reduced to 8
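
The momentum warmdown and per-layer init items above can be sketched as simple interpolations. This is an illustrative reading, not the repository's implementation: the linear shape, the `warmdown_ratio` parameter, the function names, and the interpretation of "1.15→1.05" as first layer to last layer are assumptions.

```python
def muon_momentum(step, num_steps, warmdown_ratio=0.5, start=0.97, end=0.90):
    """Hold momentum at `start`, then decay linearly to `end` over the
    final `warmdown_ratio` fraction of training (assumed schedule shape)."""
    warmdown_steps = int(warmdown_ratio * num_steps)
    warmdown_begin = num_steps - warmdown_steps
    if step < warmdown_begin:
        return start
    progress = min(1.0, (step - warmdown_begin) / max(1, warmdown_steps))
    return start + progress * (end - start)

def per_layer_init(first, last, n_layer):
    """Linearly interpolate an init value from the first layer to the last
    (assumed reading of the arrows in the commit message)."""
    if n_layer == 1:
        return [first]
    return [first + (last - first) * i / (n_layer - 1) for i in range(n_layer)]

# Example values taken from the commit message; a depth of 12 is used
# only for illustration.
resid_lambdas = per_layer_init(1.15, 1.05, 12)
x0_lambdas = per_layer_init(0.20, 0.05, 12)
```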

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-14 17:03:06 +00:00
__init__.py initial commit 2025-10-13 06:49:24 -07:00
checkpoint_manager.py tune the data mixture a bit, load optimizer by default when SFT. These were confirmed to be best settings from sweeps of sft 2026-02-18 15:49:18 +00:00
common.py delete autocast, an unnecessary thorn in my side, manage dtypes directly 2026-03-04 23:55:30 +00:00
core_eval.py initial commit 2025-10-13 06:49:24 -07:00
dataloader.py big, breaking change but large upside: swap previous FineWeb-EDU dataset to NVIDIA ClimbMix dataset. Requires people to download the data shards. The upside is that training GPT-2 capability model now only takes ~2 hours, down from 2.76 hours, so this is a huge win data-wise 2026-03-04 19:47:12 +00:00
dataset.py big, breaking change but large upside: swap previous FineWeb-EDU dataset to NVIDIA ClimbMix dataset. Requires people to download the data shards. The upside is that training GPT-2 capability model now only takes ~2 hours, down from 2.76 hours, so this is a huge win data-wise 2026-03-04 19:47:12 +00:00
engine.py Autoresearch round 2: smear, backout, and hyperparameter tuning 2026-03-14 17:03:06 +00:00
execution.py nit delete redundant catch/raise in execute 2025-10-29 08:10:03 -07:00
flash_attention.py delete autocast, an unnecessary thorn in my side, manage dtypes directly 2026-03-04 23:55:30 +00:00
fp8.py delete autocast, an unnecessary thorn in my side, manage dtypes directly 2026-03-04 23:55:30 +00:00
gpt.py Autoresearch round 2: smear, backout, and hyperparameter tuning 2026-03-14 17:03:06 +00:00
logo.svg initial commit 2025-10-13 06:49:24 -07:00
loss_eval.py fix typos 2025-11-14 11:20:25 +01:00
optim.py All of these improvements were developed by Claude running autonomously over ~2 days using autoresearch. I didn't touch anything - incredible. All tuning was done on d12 but generalized easily to larger models (e.g. d24 in particular). This means we will also get a new "Time to GPT-2" Leaderboard entry, which I will push separately. 2026-03-09 20:45:17 +00:00
report.py remove leftover mid references (#491) 2026-02-02 08:33:46 -08:00
tokenizer.py adjust the comment on the regex pattern per recent experiment see dev/LOG.md 2026-01-13 17:50:39 +00:00
ui.html Fix conversation scroll to bottom on some browsers + remove duplicated padding (#348) 2025-12-31 13:03:22 -08:00