suraj-self
9f9ef95adc
Merge branch 'master' into fix-batch-size-assertion
2026-03-26 08:26:25 +05:30
Andrej
a4ed96687b
Merge pull request #634 from 2bitbit/fix-docs-and-comments
...
fix: correct minor typos in help text, README, and comments
2026-03-25 14:31:49 -07:00
Andrej
7b70f6b411
Merge pull request #639 from mathieu-lacage/master
...
Verified: both paths return identical data (99,842 rows), and all splits under subset='all' load cleanly.
2026-03-25 14:29:30 -07:00
RoomWithOutRoof
47e983eea7
fix: use meta device in disable_fp8 to avoid VRAM spike (#616)
...
When swapping Float8Linear to Linear in disable_fp8 context manager,
using device=fp8_module.weight.device directly allocates new tensors
on GPU, causing unnecessary VRAM spike (~1GB for large models).
This fix uses device='meta' to avoid physical memory allocation,
then swaps in the weight tensor reference. This eliminates the
unnecessary VRAM spike during evaluation phase.
Fixes issue #592
Co-authored-by: RoomWithOutRoof <roomwithoutroof@sparklab.ai>
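The swap described above can be sketched as follows. This is a hypothetical illustration of the technique, not the repo's exact code: `fp8_to_linear` and its signature are assumptions, and a plain `nn.Linear` stands in for the project's `Float8Linear`.

```python
import torch
import torch.nn as nn

def fp8_to_linear(fp8_module: nn.Module) -> nn.Linear:
    # Construct the replacement on the meta device: only shape/dtype
    # metadata is created, no physical storage, so there is no VRAM
    # spike proportional to the weight size.
    out_features, in_features = fp8_module.weight.shape
    linear = nn.Linear(in_features, out_features,
                       bias=fp8_module.bias is not None, device="meta")
    # Swap in references to the existing tensors instead of copying,
    # so total memory stays flat during the swap.
    linear.weight = fp8_module.weight
    if fp8_module.bias is not None:
        linear.bias = fp8_module.bias
    return linear
```

Because the meta-device parameters are immediately replaced by references to the fp8 module's tensors, the resulting module is fully usable for forward passes.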
2026-03-25 14:24:57 -07:00
Andrej Karpathy
1cd94d768f
bump D:N ratio to 12 per recent scaling laws re-run
2026-03-24 19:25:50 +00:00
suraj-self
daba23cbb5
Merge branch 'master' into fix-batch-size-assertion
2026-03-15 21:06:31 +05:30
Andrej Karpathy
a825e63f81
Autoresearch round 2: smear, backout, and hyperparameter tuning
...
New architectural features:
- Smear: mix previous token embedding into current position via learned
gate, providing cheap bigram-like info (works in training + KV cache)
- Backout: subtract learned fraction of mid-layer residual before logit
projection to remove low-level features
Hyperparameter tuning:
- Muon momentum warmdown 0.97→0.90 during LR warmdown phase
- Non-uniform per-layer init: resid_lambdas 1.15→1.05, x0_lambdas 0.20→0.05
- c_fc init scale 0.4x, QK norm scale 1.2, sliding window seq_len/4
- Speedrun data:params ratio reduced to 8
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-14 17:03:06 +00:00
Mathieu Lacage
a641b6ca96
MMLU main split is named auxiliary_train, not train
2026-03-13 13:19:10 +01:00
2bitbit
2bb93b2ae4
fix: correct minor typos in help text, README, and comments
2026-03-12 17:03:26 +08:00
suraj-self
0e5403e7f6
Merge branch 'master' into fix-batch-size-assertion
2026-03-10 07:41:07 +05:30
Andrej Karpathy
6ed7d1d82c
All of these improvements were developed by Claude running autonomously over ~2 days using autoresearch. I didn't touch anything - incredible. All tuning was done on d12 but generalized easily to larger models (e.g. d24 in particular). This means we will also get a new "Time to GPT-2" Leaderboard entry, which I will push separately.
...
Optimizer & schedule changes:
- Increase unembedding LR 0.004 -> 0.008, weight decay 0.2 -> 0.28
- Per-group Adam betas and weight decay (instead of shared global betas)
- Muon beta2 0.95 -> 0.9, momentum warmup target 0.95 -> 0.97 over 400 steps
- Warmup: ratio-based -> absolute steps (default 40)
- Warmdown ratio 0.5 -> 0.65, final LR fraction 0.0 -> 0.05
- Weight decay schedule: linear -> cosine decay
- Polar express norm factor 1.02 -> 1.01
Architecture & init changes:
- VE gate: channels 32 -> 12, scale range 2x -> 3x, init small positive
- Add post-QK-norm scaling (q,k *= 1.15) for sharper attention
- Embedding init std 1.0 -> 0.8, MLP c_fc init 0.5x smaller
- RoPE base theta 10K -> 100K
- Short attention window: seq_len/2 -> ~seq_len/3 (ceil to 128 tile)
- Logit softcap 20 -> 15
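The "per-group Adam betas and weight decay" item relies on the fact that PyTorch optimizers accept per-parameter-group overrides for any hyperparameter. A minimal sketch, using the unembedding values from this commit (lr 0.008, weight decay 0.28) but with an illustrative two-group model and otherwise assumed values:

```python
import torch
import torch.nn as nn

# Toy stand-in: an embedding and an unembedding layer.
model = nn.Sequential(nn.Embedding(100, 32), nn.Linear(32, 100))
emb, unemb = model[0], model[1]

# Each group carries its own lr, betas, and weight_decay instead of
# sharing global settings. Values other than the unembedding lr/wd
# are illustrative.
optimizer = torch.optim.AdamW([
    {"params": emb.parameters(),
     "lr": 0.004, "betas": (0.9, 0.95), "weight_decay": 0.0},
    {"params": unemb.parameters(),
     "lr": 0.008, "betas": (0.8, 0.95), "weight_decay": 0.28},
])
```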
2026-03-09 20:45:17 +00:00
suraj-self
f2899a1b4a
Extend informative assertion message to chat_sft.py for consistency
2026-03-08 16:30:53 +05:30
suraj-self
28894e1262
Merge branch 'master' into fix-batch-size-assertion
2026-03-05 08:41:31 +05:30
Andrej Karpathy
1076f97059
delete autocast, an unnecessary thorn in my side, manage dtypes directly
2026-03-04 23:55:30 +00:00
Sofie Van Landeghem
752abc836e
Ensure that inputs and targets are contiguous ( #569 )
...
* call reshape instead of view in case the tensors are not contiguous
* fix directly in data loader instead
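The distinction the two bullets turn on: `.view()` requires a contiguous memory layout, while `.reshape()` falls back to a copy when it must. A minimal sketch, with a transposed tensor standing in for the non-contiguous inputs/targets from the loader:

```python
import torch

t = torch.arange(6).reshape(2, 3).t()  # transpose -> non-contiguous
assert not t.is_contiguous()
# t.view(-1) would raise a RuntimeError here.
flat = t.reshape(-1)             # works: reshape copies when needed
fixed = t.contiguous().view(-1)  # the loader-side fix: make the tensor
assert torch.equal(flat, fixed)  # contiguous once, then view freely
```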
2026-03-04 13:58:27 -08:00
Andrej Karpathy
324e69c45d
big, breaking change but large upside: swap previous FineWeb-EDU dataset to NVIDIA ClimbMix dataset. Requires people to download the data shards. The upside is that training a GPT-2 capability model now only takes ~2 hours, down from 2.76 hours, so this is a huge win data-wise
2026-03-04 19:47:12 +00:00
suraj-self
6e9ef8f565
Merge branch 'master' into fix-batch-size-assertion
2026-03-03 11:25:58 +05:30
Anish
83dccc20ae
Restore completion-only loss masking in SFT dataloader (#582)
...
* printing steps count
* adding reply only loss for chat
* using the mask from the tokeniser's render_conversation function
* undoing some changes
* putting back the comment which got accidentally removed, no functionality change
2026-03-02 16:37:47 -08:00
suraj-self
998b8f846b
Simplify batch size assertion message
2026-02-21 08:43:25 +05:30
suraj-self
d489a1fa22
Merge remote-tracking branch 'upstream/master' into fix-batch-size-assertion
2026-02-21 08:30:41 +05:30
Andrej Karpathy
bac5a35dd7
fix minor bug in fp8 application to skip tiny matmuls
2026-02-18 23:17:29 +00:00
Andrej Karpathy
1415fb7617
tune the data mixture a bit, load optimizer by default when SFT. These were confirmed to be best settings from sweeps of sft
2026-02-18 15:49:18 +00:00
Andrej Karpathy
77f8fb8303
a number of upgrades to SFT script to bring it up to date w.r.t. pretraining and tuning some of its kwargs based on sweeps
2026-02-18 15:49:18 +00:00
suraj-self
240a60fec2
Add informative error message to batch size assertion
2026-02-16 22:02:35 +05:30
suraj-self
0f3b6a4654
Replace cryptic assertion with descriptive ValueError for batch size alignment
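The change in this branch can be sketched as below. The function and variable names are illustrative, not the repo's exact ones; only the pattern (bare `assert` replaced by a `ValueError` that says what to fix) reflects the commit.

```python
def check_batch_size(total_batch_size: int, device_batch_size: int,
                     world_size: int) -> None:
    # Before: `assert total_batch_size % (device_batch_size * world_size) == 0`,
    # which fails with no hint. After: an actionable error message.
    tokens_per_step = device_batch_size * world_size
    if total_batch_size % tokens_per_step != 0:
        raise ValueError(
            f"total_batch_size ({total_batch_size}) must be divisible by "
            f"device_batch_size * world_size ({tokens_per_step}); "
            f"adjust device_batch_size or the number of GPUs."
        )
```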
2026-02-16 21:20:53 +05:30
Andrej Karpathy
788dadeb88
a number of upgrades to SFT script to bring it up to date w.r.t. pretraining and tuning some of its kwargs based on sweeps
2026-02-16 14:41:53 +00:00
Andrej Karpathy
2f09686724
clarify that this is bf16 mfu we're talking about
2026-02-10 23:35:00 +00:00
Andrej Karpathy
e569b59f92
delete torchao dependency, create our own exact API-matched version of Float8Linear, document it very well. for some poorly understood reason, the performance is not only ~identical but actually runs 3% faster, despite it being significantly simpler and much less code. i don't fully understand why/how atm
2026-02-10 18:46:39 +00:00
Andrej Karpathy
aeff095e97
better comments/flow on all the hyperparameter transfer stuff, and change the WD scaling from my empirical 1/d^2 to a bit more principled version based on Tepoch. All of that theory is based on AdamW and could be suboptimal for Muon
2026-02-06 19:22:28 +00:00
Andrej Karpathy
2c062aaa94
nit: don't mutate args, create new var for total_batch_size
2026-02-05 19:59:46 +00:00
Andrej Karpathy
f41dd3cbd7
auto-calculate optimal batch size. the original setting of 0.5M was only optimal for d12, but d26 prefers 1M and so on
2026-02-05 19:40:37 +00:00
Andrej Karpathy
6079f78fc3
add fp8 training with torchao
2026-02-03 21:03:42 +00:00
Andrej Karpathy
8ebc14b348
small touchups to the eval script, re-order items etc, cosmetic
2026-02-03 21:03:42 +00:00
Sofie Van Landeghem
72b9064f9d
remove leftover mid references (#491)
2026-02-02 08:33:46 -08:00
Andrej Karpathy
07c4dd4cd9
manually control the over-active garbage collector, save a small few minutes from a typical run
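The usual pattern for manually managing the collector looks like the sketch below; the 1000-step interval and loop structure are illustrative, not the repo's actual values.

```python
import gc

# Disable automatic collection, whose unpredictable pauses add up over a
# long training run, and collect explicitly at a coarse, known interval.
gc.disable()
for step in range(5000):
    # ... training step would go here ...
    if step % 1000 == 0:
        gc.collect()
gc.enable()
```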
2026-02-02 01:44:30 +00:00
Andrej Karpathy
8b4849d548
fix bug in chat_sft, the attention window must be preserved sigh
2026-02-01 20:58:44 +00:00
Andrej Karpathy
eaf49a33c8
fix path which i think was modified during the refactor and this is a bug introduced by claude i believe
2026-02-01 20:15:19 +00:00
Andrej Karpathy
31b61d2d17
fix broken import sigh
2026-02-01 05:03:44 +00:00
Andrej Karpathy
0307997f9b
merge two files base_loss and base_eval into a single file, it's nicer this way, and unify the huggingface code associated with both
2026-02-01 02:36:43 +00:00
Andrej Karpathy
1ddaad1c1c
nuke midtraining from orbit, it's not as needed now that we have a BOS-aligned dataloader. Also change the README a lot. midtraining is not yet fully properly erased across the board, but good enough for step 1
2026-01-31 19:12:25 +00:00
Andrej Karpathy
348fbb301b
fix dataloader for midtrain to never crop data. we can't just throw it away like we do in pretraining
2026-01-31 18:21:36 +00:00
Andrej Karpathy
3c3a3d7042
warmdown of 0.5 is slightly better:
2026-01-31 01:08:44 +00:00
Aarushi Singh
ace6740bdd
feat: allow top_k=0 in web api to disable filtering (#458)
...
* allow top_k=0 in web api to disable filtering
* adding a comment for clear reasoning
* adding change to docstring
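The behavior in this PR can be sketched as follows. The function name and signature are illustrative; only the convention (`top_k=0` means "no filtering", `top_k>0` keeps the k largest logits) reflects the change.

```python
import torch

def sample_filter(logits: torch.Tensor, top_k: int) -> torch.Tensor:
    if top_k == 0:
        return logits  # filtering disabled: keep the full distribution
    k = min(top_k, logits.size(-1))
    # Threshold at the k-th largest logit; everything below it is
    # masked to -inf so it gets zero probability after softmax.
    kth = torch.topk(logits, k, dim=-1).values[..., -1, None]
    return logits.masked_fill(logits < kth, float("-inf"))
```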
2026-01-30 09:21:41 -08:00
Andrej Karpathy
41bb2eac32
Combine AdamW and Muon into single MuonAdamW optimizer, cleaner, ty @chrisjmccormick for idea/help
2026-01-29 00:52:08 +00:00
Andrej Karpathy
c88bbf8133
Merge branch 'engram'
2026-01-27 22:33:16 +00:00
Andrej Karpathy
c8d93beed2
add engram-lite, add log, tune scaling laws analysis scripts
2026-01-27 22:31:17 +00:00
Andrej Karpathy
8630d32be4
quick fix to not OOM main speedrun script
2026-01-26 22:31:42 +00:00
Andrej Karpathy
59e36cc727
first version of engram following modded nanogpt style
2026-01-25 18:59:51 +00:00
Andrej Karpathy
a91743c168
Merge branch 've'
2026-01-18 15:14:39 +00:00
Andrej Karpathy
cf5c9e5b8e
resolve a crash for odd depths because FA3 needs head_dim % 8 == 0
2026-01-18 00:07:08 +00:00