Commit Graph

352 Commits

Author SHA1 Message Date
Andrej Karpathy
5019accc5b fix scaling laws scripts after the bigram embeddings were removed 2026-03-17 16:55:56 +00:00
Andrej Karpathy
1b1cc3c599 submit new time to GPT-2 leaderboard entry: 99 minutes 2026-03-14 17:15:01 +00:00
Andrej Karpathy
a825e63f81 Autoresearch round 2: smear, backout, and hyperparameter tuning
New architectural features:
- Smear: mix previous token embedding into current position via learned
  gate, providing cheap bigram-like info (works in training + KV cache)
- Backout: subtract learned fraction of mid-layer residual before logit
  projection to remove low-level features

Hyperparameter tuning:
- Muon momentum warmdown 0.97→0.90 during LR warmdown phase
- Non-uniform per-layer init: resid_lambdas 1.15→1.05, x0_lambdas 0.20→0.05
- c_fc init scale 0.4x, QK norm scale 1.2, sliding window seq_len/4
- Speedrun data:params ratio reduced to 8

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-14 17:03:06 +00:00
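The Smear and Backout items in a825e63f81 can be read roughly as follows. This is a minimal sketch in plain Python, not the repository's code: the commit message does not say how the gate is computed, so the sigmoid gate, the `gate_w` weights, and the function names `smear` and `backout` are all assumptions made for illustration.

```python
import math

def smear(embs, gate_w):
    """Mix the previous token's embedding into each position.

    embs:   list of T embedding vectors (each a list of D floats)
    gate_w: hypothetical learned gate weights (D floats); the real
            gate parameterization is not given in the commit message.
    """
    out = [list(embs[0])]  # position 0 has no previous token
    for t in range(1, len(embs)):
        # Assumed gate: sigmoid of a dot product with the current embedding.
        score = sum(w * c for w, c in zip(gate_w, embs[t]))
        gate = 1.0 / (1.0 + math.exp(-score))
        out.append([c + gate * p for c, p in zip(embs[t], embs[t - 1])])
    return out

def backout(final_resid, mid_resid, frac):
    """Subtract a (learned) fraction of a mid-layer residual stream
    from the final residual stream, before the logit projection."""
    return [
        [f - frac * m for f, m in zip(f_row, m_row)]
        for f_row, m_row in zip(final_resid, mid_resid)
    ]
```

Because each position only looks one token back, a sketch like this works the same way during training and during incremental decoding, provided the previous token's embedding is kept around.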
Andrej Karpathy
f068604948 new leaderboard entry coming from improvements of autoresearch round 1, time to gpt-2 from 2.02 hours to 1.80 hours 2026-03-10 06:26:39 +00:00
Andrej Karpathy
6ed7d1d82c All of these improvements were developed by Claude running autonomously over ~2 days using autoresearch. I didn't touch anything - incredible. All tuning was done on d12 but generalized easily to larger models (e.g. d24 in particular). This means we will also get a new "Time to GPT-2" Leaderboard entry, which I will push separately.
Optimizer & schedule changes:
- Increase unembedding LR 0.004 -> 0.008, weight decay 0.2 -> 0.28
- Per-group Adam betas and weight decay (instead of shared global betas)
- Muon beta2 0.95 -> 0.9, momentum warmup target 0.95 -> 0.97 over 400 steps
- Warmup: ratio-based -> absolute steps (default 40)
- Warmdown ratio 0.5 -> 0.65, final LR fraction 0.0 -> 0.05
- Weight decay schedule: linear -> cosine decay
- Polar express norm factor 1.02 -> 1.01

Architecture & init changes:
- VE gate: channels 32 -> 12, scale range 2x -> 3x, init small positive
- Add post-QK-norm scaling (q,k *= 1.15) for sharper attention
- Embedding init std 1.0 -> 0.8, MLP c_fc init 0.5x smaller
- RoPE base theta 10K -> 100K
- Short attention window: seq_len/2 -> ~seq_len/3 (ceil to 128 tile)
- Logit softcap 20 -> 15
2026-03-09 20:45:17 +00:00
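Three of the changes listed in 6ed7d1d82c are easy to state as arithmetic. The sketch below is illustrative only: the function names are hypothetical, the linear shape of the warmup and warmdown is an assumption (the commit gives only the step count, ratio, and final fraction), and the tanh form of the softcap is the common formulation rather than something the commit message confirms.

```python
import math

def lr_multiplier(step, num_steps, warmup_steps=40,
                  warmdown_ratio=0.65, final_lr_frac=0.05):
    """LR multiplier with absolute-step warmup and ratio-based warmdown.
    Assumes linear ramps on both ends."""
    warmdown_steps = round(warmdown_ratio * num_steps)
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    if step <= num_steps - warmdown_steps:
        return 1.0
    progress = (num_steps - step) / warmdown_steps  # goes 1 -> 0
    return progress + (1.0 - progress) * final_lr_frac

def short_window(seq_len, tile=128):
    """Short attention window: about seq_len/3, rounded up to a tile multiple."""
    return math.ceil(seq_len / 3 / tile) * tile

def softcap(logit, cap=15.0):
    """Squash a logit smoothly into (-cap, cap); assumed tanh form."""
    return cap * math.tanh(logit / cap)
```

For example, with a sequence length of 2048 the window is 2048/3 ≈ 683 rounded up to the next multiple of 128, giving 768.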
Andrej Karpathy
1076f97059 delete autocast, an unnecessary thorn in my side, manage dtypes directly 2026-03-04 23:55:30 +00:00
Sofie Van Landeghem
752abc836e Ensure that inputs and targets are contiguous (#569)
* call reshape instead of view in case the tensors are not contiguous

* fix directly in data loader instead
2026-03-04 13:58:27 -08:00
Andrej Karpathy
4b4077425b Document new Leaderboard entry congrats @ddudek for pointing out ClimbMix, time to GPT-2 is now 2.01 hours, down from 2.76 previously 2026-03-04 20:02:07 +00:00
Andrej Karpathy
324e69c45d big, breaking change but large upside: swap previous FineWeb-EDU dataset to NVIDIA ClimbMix dataset. Requires people to download the data shards. The upside is that training a GPT-2 capability model now only takes ~2 hours, down from 2.76 hours, so this is a huge win data-wise 2026-03-04 19:47:12 +00:00
Andrej Karpathy
b07604ebaa document the legacy fineweb100b dataset and the new climbmix400b dataset 2026-03-03 17:24:31 +00:00
Andrej Karpathy
aba30cb037 tune logit softcap? 2026-03-03 00:38:53 +00:00
Anish
83dccc20ae Restore completion-only loss masking in SFT dataloader (#582)
* printing steps count

* adding reply only loss for chat

* using the mask by render_conversation function of tokeniser

* undoing some changes

* putting back the comment which got removed accidentally, no functionality change
2026-03-02 16:37:47 -08:00
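Completion-only loss masking, as restored in 83dccc20ae, means the loss is computed only on the assistant's reply tokens. A minimal sketch, assuming the tokenizer returns a per-token mask (1 for reply tokens, 0 otherwise) alongside the ids; the function name `build_targets` and the `ignore_index` value of -1 are illustrative assumptions, not taken from the commit.

```python
def build_targets(ids, mask, ignore_index=-1):
    """Build next-token (inputs, targets) from one rendered conversation.

    ids:  token ids of the whole conversation
    mask: same length as ids; 1 where the token belongs to a reply
          that should be trained on, 0 elsewhere
    Targets at positions whose mask is 0 are replaced with ignore_index,
    so a loss that skips ignore_index only scores reply tokens.
    """
    assert len(ids) == len(mask)
    inputs = ids[:-1]
    targets = [
        tok if keep == 1 else ignore_index
        for tok, keep in zip(ids[1:], mask[1:])
    ]
    return inputs, targets
```

The targets are shifted by one relative to the inputs, so the mask is shifted the same way: a target is kept only if the token being predicted is itself part of a reply.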
Dipesh Babu
c7ba252142 docs: fix typos in experiment log (#547) 2026-02-20 08:03:45 -08:00
Andrej Karpathy
2dffdc8cf6 document MoE exploration 2026-02-19 02:53:47 +00:00
Andrej Karpathy
48804bff3a report negative result on fineweb dataset 2026-02-18 23:45:31 +00:00
Andrej Karpathy
bb5137860e fix comment 2026-02-18 23:26:22 +00:00
Andrej Karpathy
458555117b Merge branch 'Chetter2-patch-1' 2026-02-18 23:17:39 +00:00
Andrej Karpathy
bac5a35dd7 fix minor bug in fp8 application to skip tiny matmuls 2026-02-18 23:17:29 +00:00
George Shakan
ad55575326 Fix bug in setting precision (#538) 2026-02-18 15:49:18 +00:00
Sofie Van Landeghem
cac43e8511 Fix MockModel's device definition (#535)
* fix MockModel's device definition

* cleanup
2026-02-18 15:49:18 +00:00
Andrej Karpathy
f5fe7925ed update dev log with recent 2026-02-18 15:49:18 +00:00
Andrej Karpathy
1415fb7617 tune the data mixture a bit, load optimizer by default when SFT. These were confirmed to be best settings from sweeps of sft 2026-02-18 15:49:18 +00:00
Andrej Karpathy
77f8fb8303 a number of upgrades to SFT script to bring it up to date w.r.t. pretraining and tuning some of its kwargs based on sweeps 2026-02-18 15:49:18 +00:00
George Shakan
0a23f87643 Fix bug in setting precision (#538) 2026-02-18 07:42:11 -08:00
Sofie Van Landeghem
4800c62f6e Fix MockModel's device definition (#535)
* fix MockModel's device definition

* cleanup
2026-02-17 16:03:46 -08:00
Andrej Karpathy
4a6e47b0c6 update dev log with recent 2026-02-17 15:44:54 +00:00
Andrej Karpathy
8180e1d8c1 tune the data mixture a bit, load optimizer by default when SFT. These were confirmed to be best settings from sweeps of sft 2026-02-16 20:23:04 +00:00
Andrej Karpathy
788dadeb88 a number of upgrades to SFT script to bring it up to date w.r.t. pretraining and tuning some of its kwargs based on sweeps 2026-02-16 14:41:53 +00:00
Alan
124f49be98 Removed redundant quantization of gradients 2026-02-15 15:41:33 +00:00
Alan
d9678ff0f9 Save FP8 tensors in autograd ctx instead of full-precision inputs
Store quantized input/weight and their inverse scales in _Float8Matmul ctx to avoid re-quantization in backward and reduce saved-activation memory without changing numerics.
2026-02-15 14:31:54 +00:00
Andrej Karpathy
2f09686724 clarify that this is bf16 mfu we're talking about 2026-02-10 23:35:00 +00:00
Andrej Karpathy
e569b59f92 delete torchao dependency, create our own exact API-matched version of Float8Linear, document it very well. for some poorly understood reason, the performance is not only ~identical but actually runs 3% faster, despite it being significantly simpler and much less code. i don't fully understand why/how atm 2026-02-10 18:46:39 +00:00
Andrej Karpathy
1ec0a34779 at 28 and above we start to need batch size 8 2026-02-08 18:26:34 +00:00
Andrej Karpathy
ff46300720 tune miniseries just a bit, fairly cosmetic, keep to even depths where the math works out nicely in model sizing 2026-02-08 17:54:12 +00:00
Andrej Karpathy
aeff095e97 better comments/flow on all the hyperparameter transfer stuff, and change the WD scaling from my empirical 1/d^2 to a bit more principled version based on Tepoch. All of that theory is based on AdamW and could be suboptimal for Muon 2026-02-06 19:22:28 +00:00
Andrej Karpathy
685271dc8d new optimal ratio for d26 training 2026-02-06 19:21:27 +00:00
Andrej Karpathy
e527521a3f briefly mention batch ramp experimentation too, too weak to merge in my few attempts 2026-02-05 22:21:03 +00:00
Andrej Karpathy
96522798f1 docs docs docs 2026-02-05 20:27:07 +00:00
Andrej Karpathy
5fdd5cdb24 new leaderboard record via new auto-calculated optimal batch size. for d26 it is 1M, up from 0.5M that was default earlier 2026-02-05 20:11:32 +00:00
Andrej Karpathy
2c062aaa94 nit: don't mutate args, create new var for total_batch_size 2026-02-05 19:59:46 +00:00
Andrej Karpathy
f41dd3cbd7 auto-calculate optimal batch size. the original setting of 0.5M was only optimal for d12, but d26 prefers 1M and so on 2026-02-05 19:40:37 +00:00
Andrej Karpathy
98eed6df18 bring back an assert guarding against bad param sizing 2026-02-05 18:14:30 +00:00
Sofie Van Landeghem
012da1a78b Typo fixes (#480)
* small typo

* few more small fixes

* small fixes in leaderboard.md
2026-02-05 19:12:50 +01:00
Andrej Karpathy
75b302f331 fix hash commit on leaderboard and a paragraph clarification 2026-02-05 16:14:28 +00:00
Andrej Karpathy
1144d186ed try and fail relu^2 -> swiglu 2026-02-05 02:42:46 +00:00
Andrej Karpathy
d63b7ab9ac try and fail relu^2 -> swiglu 2026-02-05 02:41:46 +00:00
Andrej Karpathy
718e5e9d67 correctly reference NorMuon and fix misleading terms that i may have hastily ported over from modded-nanogpt 2026-02-05 01:39:26 +00:00
Andrej Karpathy
542beb0c8c bump speedrun to be the up to date leaderboard run 2026-02-04 02:12:04 +00:00
Andrej Karpathy
d510b1385b quick experiments to log 2026-02-03 23:21:39 +00:00
Andrej Karpathy
16b8ac7da3 oops forgot to attach leaderboard file too 2026-02-03 21:06:12 +00:00