suraj-self
9f9ef95adc
Merge branch 'master' into fix-batch-size-assertion
2026-03-26 08:26:25 +05:30
RoomWithOutRoof
47e983eea7
fix: use meta device in disable_fp8 to avoid VRAM spike (#616)
...
When swapping Float8Linear to Linear in the disable_fp8 context manager,
using device=fp8_module.weight.device directly allocates new tensors
on the GPU, causing an unnecessary VRAM spike (~1GB for large models).
This fix uses device='meta' to avoid physical memory allocation,
then swaps in the weight tensor reference. This eliminates the
unnecessary VRAM spike during the evaluation phase.
Fixes issue #592
Co-authored-by: RoomWithOutRoof <roomwithoutroof@sparklab.ai>
2026-03-25 14:24:57 -07:00
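A minimal sketch of the swap this commit describes, assuming Float8Linear exposes in_features / out_features / weight / bias like nn.Linear; the helper name swap_to_linear is hypothetical and the repo's actual disable_fp8 code may differ:
```python
import torch.nn as nn

def swap_to_linear(fp8_module: nn.Module) -> nn.Linear:
    # Build the plain Linear on the meta device: no storage is allocated,
    # so there is no VRAM spike from a second copy of the weights.
    linear = nn.Linear(
        fp8_module.in_features,
        fp8_module.out_features,
        bias=fp8_module.bias is not None,
        device="meta",
        dtype=fp8_module.weight.dtype,
    )
    # Swap in references to the existing tensors instead of copying them.
    linear.weight = fp8_module.weight
    if fp8_module.bias is not None:
        linear.bias = fp8_module.bias
    return linear
```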
Andrej Karpathy
1cd94d768f
bump D:N ratio to 12 per recent scaling laws re-run
2026-03-24 19:25:50 +00:00
suraj-self
daba23cbb5
Merge branch 'master' into fix-batch-size-assertion
2026-03-15 21:06:31 +05:30
Andrej Karpathy
a825e63f81
Autoresearch round 2: smear, backout, and hyperparameter tuning
...
New architectural features:
- Smear: mix previous token embedding into current position via learned
gate, providing cheap bigram-like info (works in training + KV cache)
- Backout: subtract learned fraction of mid-layer residual before logit
projection to remove low-level features
Hyperparameter tuning:
- Muon momentum warmdown 0.97→0.90 during LR warmdown phase
- Non-uniform per-layer init: resid_lambdas 1.15→1.05, x0_lambdas 0.20→0.05
- c_fc init scale 0.4x, QK norm scale 1.2, sliding window seq_len/4
- Speedrun data:params ratio reduced to 8
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-14 17:03:06 +00:00
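Minimal sketches of the two architectural features above; the parameter names (smear_gate, backout_lambda) and exact placement are illustrative assumptions, not the repo's code:
```python
import torch
import torch.nn.functional as F

def smear(x: torch.Tensor, smear_gate: torch.Tensor) -> torch.Tensor:
    # x: (B, T, C) token embeddings; mix in the previous position's embedding
    prev = F.pad(x, (0, 0, 1, 0))[:, :-1, :]  # shift right by one step, zeros at t=0
    return x + torch.sigmoid(smear_gate) * prev

def backout(x_final: torch.Tensor, x_mid: torch.Tensor, backout_lambda: torch.Tensor) -> torch.Tensor:
    # subtract a learned fraction of a mid-layer residual before the logit projection
    return x_final - backout_lambda * x_mid
```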
suraj-self
0e5403e7f6
Merge branch 'master' into fix-batch-size-assertion
2026-03-10 07:41:07 +05:30
Andrej Karpathy
6ed7d1d82c
All of these improvements were developed by Claude running autonomously over ~2 days using autoresearch. I didn't touch anything - incredible. All tuning was done on d12 but generalized easily to larger models (e.g. d24 in particular). This means we will also get a new "Time to GPT-2" Leaderboard entry, which I will push separately.
...
Optimizer & schedule changes:
- Increase unembedding LR 0.004 -> 0.008, weight decay 0.2 -> 0.28
- Per-group Adam betas and weight decay (instead of shared global betas)
- Muon beta2 0.95 -> 0.9, momentum warmup target 0.95 -> 0.97 over 400 steps
- Warmup: ratio-based -> absolute steps (default 40)
- Warmdown ratio 0.5 -> 0.65, final LR fraction 0.0 -> 0.05
- Weight decay schedule: linear -> cosine decay
- Polar express norm factor 1.02 -> 1.01
Architecture & init changes:
- VE gate: channels 32 -> 12, scale range 2x -> 3x, init small positive
- Add post-QK-norm scaling (q,k *= 1.15) for sharper attention
- Embedding init std 1.0 -> 0.8, MLP c_fc init 0.5x smaller
- RoPE base theta 10K -> 100K
- Short attention window: seq_len/2 -> ~seq_len/3 (ceil to 128 tile)
- Logit softcap 20 -> 15
2026-03-09 20:45:17 +00:00
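Two of the smaller items above, written out as standalone ops for illustration (post-QK-norm scaling and the logit softcap); whether the repo applies them exactly like this, and via F.rms_norm, is an assumption:
```python
import torch
import torch.nn.functional as F

def scale_after_qk_norm(q: torch.Tensor, k: torch.Tensor, scale: float = 1.15):
    # normalize q and k, then multiply by a constant > 1 to sharpen attention
    q = F.rms_norm(q, (q.size(-1),)) * scale
    k = F.rms_norm(k, (k.size(-1),)) * scale
    return q, k

def softcap(logits: torch.Tensor, cap: float = 15.0) -> torch.Tensor:
    # smoothly bound logits to [-cap, cap]
    return cap * torch.tanh(logits / cap)
```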
suraj-self
28894e1262
Merge branch 'master' into fix-batch-size-assertion
2026-03-05 08:41:31 +05:30
Andrej Karpathy
1076f97059
delete autocast, an unnecessary thorn in my side, manage dtypes directly
2026-03-04 23:55:30 +00:00
Andrej Karpathy
324e69c45d
big, breaking change but large upside: swap the previous FineWeb-EDU dataset for the NVIDIA ClimbMix dataset. Requires people to download the data shards. The upside is that training a GPT-2-capability model now takes only ~2 hours, down from 2.76 hours, so this is a huge win data-wise
2026-03-04 19:47:12 +00:00
suraj-self
998b8f846b
Simplify batch size assertion message
2026-02-21 08:43:25 +05:30
suraj-self
d489a1fa22
Merge remote-tracking branch 'upstream/master' into fix-batch-size-assertion
2026-02-21 08:30:41 +05:30
Andrej Karpathy
bac5a35dd7
fix minor bug in fp8 application to skip tiny matmuls
2026-02-18 23:17:29 +00:00
Andrej Karpathy
77f8fb8303
a number of upgrades to the SFT script to bring it up to date w.r.t. pretraining, and tuning of some of its kwargs based on sweeps
2026-02-18 15:49:18 +00:00
suraj-self
240a60fec2
Add informative error message to batch size assertion
2026-02-16 22:02:35 +05:30
suraj-self
0f3b6a4654
Replace cryptic assertion with descriptive ValueError for batch size alignment
2026-02-16 21:20:53 +05:30
Andrej Karpathy
788dadeb88
a number of upgrades to the SFT script to bring it up to date w.r.t. pretraining, and tuning of some of its kwargs based on sweeps
2026-02-16 14:41:53 +00:00
Andrej Karpathy
2f09686724
clarify that this is bf16 mfu we're talking about
2026-02-10 23:35:00 +00:00
Andrej Karpathy
e569b59f92
delete torchao dependency, create our own exact API-matched version of Float8Linear, and document it very well. For some poorly understood reason, performance is not only ~identical but actually 3% faster, despite it being significantly simpler and much less code. I don't fully understand why/how atm
2026-02-10 18:46:39 +00:00
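The core of a Float8Linear-style module is per-tensor dynamic scaling into the fp8 range before the matmul. A sketch of just that step, assuming PyTorch's float8_e4m3fn dtype; the scaled matmul kernel itself (e.g. torch._scaled_mm) and the backward pass are omitted:
```python
import torch

F8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3

def to_float8_e4m3(t: torch.Tensor):
    # map the tensor's absolute max onto the fp8 representable range
    scale = F8_MAX / t.abs().amax().clamp(min=1e-12)
    t_f8 = (t * scale).clamp(-F8_MAX, F8_MAX).to(torch.float8_e4m3fn)
    return t_f8, scale  # keep the scale so the matmul result can be rescaled back
```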
Andrej Karpathy
aeff095e97
better comments/flow on all the hyperparameter transfer stuff, and change the WD scaling from my empirical 1/d^2 to a bit more principled version based on Tepoch. All of that theory is based on AdamW and could be suboptimal for Muon
2026-02-06 19:22:28 +00:00
Andrej Karpathy
2c062aaa94
nit: don't mutate args, create new var for total_batch_size
2026-02-05 19:59:46 +00:00
Andrej Karpathy
f41dd3cbd7
auto-calculate the optimal batch size. The original setting of 0.5M was only optimal for d12, but d26 prefers 1M, and so on
2026-02-05 19:40:37 +00:00
Andrej Karpathy
6079f78fc3
add fp8 training with torchao
2026-02-03 21:03:42 +00:00
Andrej Karpathy
07c4dd4cd9
manually control the over-active garbage collector, saving a few minutes on a typical run
2026-02-02 01:44:30 +00:00
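The pattern itself is small; a sketch with a placeholder train step and a hypothetical collection cadence (the repo's actual interval and placement are not shown here):
```python
import gc

def train_step() -> None:
    pass  # placeholder for one optimization step

num_steps = 10_000  # hypothetical run length

gc.disable()  # stop the automatic generational collector from firing mid-step
for step in range(num_steps):
    train_step()
    if step % 1000 == 0:  # collect on our own schedule instead
        gc.collect()
```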
Andrej Karpathy
31b61d2d17
fix broken import sigh
2026-02-01 05:03:44 +00:00
Andrej Karpathy
3c3a3d7042
warmdown of 0.5 is slightly better:
2026-01-31 01:08:44 +00:00
Andrej Karpathy
41bb2eac32
Combine AdamW and Muon into single MuonAdamW optimizer, cleaner, ty @chrisjmccormick for idea/help
2026-01-29 00:52:08 +00:00
Andrej Karpathy
c8d93beed2
add engram-lite, add log, tune scaling laws analysis scripts
2026-01-27 22:31:17 +00:00
Andrej Karpathy
59e36cc727
first version of engram following modded nanogpt style
2026-01-25 18:59:51 +00:00
Andrej Karpathy
a91743c168
Merge branch 've'
2026-01-18 15:14:39 +00:00
Andrej Karpathy
cf5c9e5b8e
resolve a crash for odd depths because FA3 needs head_dim % 8 == 0
2026-01-18 00:07:08 +00:00
Andrej Karpathy
413e91aa0f
optimal ratio is now around 4
2026-01-17 23:51:09 +00:00
Andrej Karpathy
2955650327
add device detection to report a more accurate bf16 MFU
2026-01-17 03:16:14 +00:00
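A sketch of the idea: map the CUDA device name to its peak dense bf16 throughput and use that as the MFU denominator. The two entries are published NVIDIA peak numbers; which devices the repo actually special-cases is an assumption:
```python
import torch

# peak dense bf16 tensor-core TFLOPS (approximate published specs)
PEAK_BF16_TFLOPS = {"H100": 989.0, "A100": 312.0}

def peak_bf16_flops(default_tflops: float = 312.0) -> float:
    name = torch.cuda.get_device_name(0)
    for key, tflops in PEAK_BF16_TFLOPS.items():
        if key in name:
            return tflops * 1e12
    return default_tflops * 1e12  # fall back if the device is not recognized
```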
Andrej Karpathy
8203efa919
implement Flash Attention 3 fallback to PyTorch SDPA by touching as few lines of code as possible in the main files and keeping the implementation in a single file. Add tests. Add helpful warning messages for the user.
2026-01-16 17:37:51 +00:00
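A sketch of the fallback pattern, assuming the FA3 wheel exposes flash_attn_interface.flash_attn_func taking (q, k, v, causal=...) in (batch, seq, heads, head_dim) layout; the import name and signature are assumptions about that package, not this repo's exact code:
```python
import torch
import torch.nn.functional as F

try:
    from flash_attn_interface import flash_attn_func  # FA3 kernel, if installed
    HAS_FA3 = True
except ImportError:
    HAS_FA3 = False
    print("warning: flash attention 3 not found, falling back to PyTorch SDPA")

def attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, causal: bool = True) -> torch.Tensor:
    # q, k, v: (batch, seq, n_heads, head_dim)
    if HAS_FA3:
        out = flash_attn_func(q, k, v, causal=causal)
        return out[0] if isinstance(out, tuple) else out  # some versions also return the LSE
    # SDPA expects (batch, n_heads, seq, head_dim)
    q, k, v = (t.transpose(1, 2) for t in (q, k, v))
    out = F.scaled_dot_product_attention(q, k, v, is_causal=causal)
    return out.transpose(1, 2)
```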
Andrej Karpathy
bdcc030ffa
oops legacy spurious line now
2026-01-15 23:32:20 +00:00
Andrej Karpathy
255f8b9af6
cleanly separate cpu and gpu sections
2026-01-15 23:30:11 +00:00
Andrej Karpathy
7312ec9898
fix buggy midtrain and update all kwargs to be idiomatic: argparse uses dashes, variables use underscores. The underscores were just a remnant of the previous Configurator object. This is the right way
2026-01-13 22:45:27 +00:00
Andrej Karpathy
43c29dd9d5
Big DataLoader refactor: BOS-aligned dataloaders with epoch tracking for pre/mid-training
...
The new DataLoader ensures that every token sequence in train/val batches has a BOS token
at the beginning. Therefore, no token streams start abruptly in the middle of a document,
which could be confusing for the model. Note that this changes the loss scale because there
are fewer confusing tokens in the train/val batches. The main downside is that we now waste
about 35% of tokens due to cropping. This is ok because we have a lot of data. See dev/LOG.md
entry for this change for a lot more information.
2026-01-13 20:05:47 +00:00
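A minimal sketch of the cropping rule, assuming a flat token tensor and a known BOS id; batching, epoch tracking, and the pre/mid-training split are omitted:
```python
import torch

def bos_aligned_windows(tokens: torch.Tensor, bos_id: int, seq_len: int):
    # every yielded window starts exactly at a document's BOS token; tokens
    # between the end of one window and the next BOS are dropped (the waste)
    bos_positions = (tokens == bos_id).nonzero(as_tuple=True)[0].tolist()
    cursor = 0
    for pos in bos_positions:
        if pos < cursor:
            continue  # this BOS already sits inside the previous window
        if pos + seq_len > tokens.numel():
            break
        yield tokens[pos : pos + seq_len]
        cursor = pos + seq_len
```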
Andrej Karpathy
21608ec51e
allow base_loss to report the loss of any arbitrary Hugging Face model, similar to base_eval. Had to change the dataloader to be a lot better and just take a tokenizer, not load the nanochat one. Much better this way anyway
2026-01-12 03:10:13 +00:00
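A sketch of what scoring an arbitrary Hugging Face model can look like with the transformers API; the model name and text are placeholders, and the repo's base_loss runs over real data shards rather than a single string:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def hf_loss(model_name: str, text: str) -> float:
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)  # HF shifts the labels internally
    return out.loss.item()

# example (placeholder model name):
# print(hf_loss("gpt2", "The quick brown fox jumps over the lazy dog."))
```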
Andrej Karpathy
b33e394528
oops actually make SSSL the default window pattern
2026-01-11 21:50:35 +00:00
Andrej Karpathy
fbc1484e8c
add alternating window size patterns for the GPT layers, following GPT-3. Experimented a bit and found the pattern SSSL to work well - 3 short, 1 long alternating. This is now the new default and the plots look quite a bit better on flops vs. bpb
2026-01-11 21:49:54 +00:00
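A sketch of mapping the SSSL pattern onto the layer stack; the short-window fraction used here is an illustrative assumption (the exact value changes across later commits in this log):
```python
def layer_window_sizes(n_layers: int, seq_len: int, pattern: str = "SSSL") -> list[int]:
    # 'S' layers use a short sliding window, 'L' layers attend over the full sequence
    short = seq_len // 4  # assumed fraction, for illustration only
    sizes = {"S": short, "L": seq_len}
    return [sizes[pattern[i % len(pattern)]] for i in range(n_layers)]

# e.g. layer_window_sizes(8, 2048) -> [512, 512, 512, 2048, 512, 512, 512, 2048]
```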
Andrej Karpathy
aa530cdad5
Add learnable lambdas that gate the residual connection and a skip connection to the input embeddings, solid bump to val_bpb
2026-01-11 18:47:35 +00:00
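A minimal sketch of the gating, borrowing the resid_lambdas / x0_lambdas names from a later commit in this log; the exact placement relative to each block is an assumption:
```python
import torch
import torch.nn as nn

class GatedBlockStack(nn.Module):
    def __init__(self, blocks: list[nn.Module]):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)
        n = len(blocks)
        self.resid_lambdas = nn.Parameter(torch.ones(n))  # gates on the residual stream
        self.x0_lambdas = nn.Parameter(torch.zeros(n))    # skip connection to input embeddings

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x0 = x  # token embeddings entering the stack
        for i, block in enumerate(self.blocks):
            x = self.resid_lambdas[i] * x + self.x0_lambdas[i] * x0
            x = block(x)  # block applies attention/MLP with its own residual adds
        return x
```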
Andrej Karpathy
2c4473dd1b
Big Muon optimizer changes inspired by the latest modded-nanogpt. Added Polar Express, Adafactor-style variance reduction, cautious weight decay, and a linear schedule that ramps weight decay down to zero. Tuned the optimum weight decay for multiple model sizes (d8, d12, d16, d20) and found a scaling law with optimum wd ∝ 1/channels^2, now included as the default in the code. --weight_decay of base_train is now on by default and configured optimally according to all of these experiments. Solid bump to val_bpb observed as a result of these changes.
2026-01-11 16:56:59 +00:00
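The reported scaling law as a one-liner; only the 1/channels^2 shape comes from the commit, while the calibration point (0.1 at 768 channels) is a hypothetical example:
```python
def scaled_weight_decay(channels: int, base_wd: float = 0.1, base_channels: int = 768) -> float:
    # optimal weight decay observed to scale roughly as 1/channels^2
    return base_wd * (base_channels / channels) ** 2
```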
Andrej Karpathy
061f83c152
delete grad_clip. It appears not to be necessary at all: not only was it buggy (the clipping happened per GPU, before grad synchronization), but it costs ~2% MFU and doesn't even help. I tried deleting it a while ago and back then it did help, so I'm guessing that some hyperparameter tuning since then has obviated the reason for it.
2026-01-08 02:16:50 +00:00
Andrej Karpathy
ccf4b7f9bf
nudge hyperparameters of the base script with the results of the sweeps and miniseries. vocab size down to 32K. D:N ratio from 20 to 8. add miniseries script
2026-01-07 22:11:59 +00:00
Andrej Karpathy
ae0bf52529
tune hyperparameters based on overnight sweeps. warmdown_ratio is the biggest free win, increasing 0.2 -> 0.4, and embedding lr can be larger bumping 0.2 -> 0.3
2026-01-05 18:57:46 +00:00
Andrej Karpathy
9d4c9b786d
many small fixes to base_train: reporting ETA, allowing some additional kwarg flexibility, making sure we don't crash when e.g. depth = 11 - we now calculate the closest num_heads that works
2026-01-05 00:38:09 +00:00
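A sketch of one way to "calculate the closest num_heads that works": search outward from an ideal head count for one that divides the model width evenly. The target head_dim of 128 and the search rule are assumptions, not the repo's exact logic:
```python
def closest_num_heads(model_dim: int, target_head_dim: int = 128) -> int:
    ideal = max(1, round(model_dim / target_head_dim))
    for delta in range(model_dim):
        for cand in (ideal - delta, ideal + delta):
            if cand >= 1 and model_dim % cand == 0:
                return cand  # nearest head count that divides model_dim

# e.g. closest_num_heads(704) -> 4 (head_dim 176), since 704 has no clean 128-wide split
```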
Andrej Karpathy
eb7bbc1b66
delete the configurator in favor of argparse and clean up a lot of kwarg details to make them more consistent across all scripts
2026-01-04 19:14:23 +00:00
Andrej Karpathy
48abd7d85f
simplify, clarify and slightly tune model initialization. Possibly very slightly better, but certainly a lot clearer
2026-01-01 21:15:09 +00:00
Andrej Karpathy
2874eda59a
update to new os env var to get rid of deprecation warning
2025-12-28 03:32:46 +00:00