Sermet Pekin
4cd1d3aeb5
Merge 3735eb9723 into 8180e1d8c1
2026-02-16 23:12:55 +01:00
Andrej Karpathy
8180e1d8c1
tune the data mixture a bit, load the optimizer by default when doing SFT. These were confirmed to be the best settings from SFT sweeps
2026-02-16 20:23:04 +00:00
Andrej Karpathy
788dadeb88
a number of upgrades to the SFT script to bring it up to date w.r.t. pretraining, and tune some of its kwargs based on sweeps
2026-02-16 14:41:53 +00:00
Sermet Pekin
3735eb9723
simplify test.yml
- one command for both platforms
2026-02-16 16:41:56 +03:00
Sermet Pekin
7686d3c7e2
Update test.yml
- adds manual trigger for testing on GH actions
- adds --extra cpu parameter to uv run
2026-02-16 15:33:48 +03:00
Sermet Pekin
3185f928d7
Update .github/workflows/test.yml
- trigger tests on push and pull request only on branch master
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2026-02-16 13:47:08 +03:00
Sermet Pekin
4c5cf36035
Update test.yml
2026-02-16 11:01:16 +03:00
Sermet Pekin
90cdd1b5b8
Update test.yml
2026-02-16 10:58:39 +03:00
Sermet Pekin
dd147b018d
Create test.yml
2026-02-16 10:45:06 +03:00
Andrej Karpathy
2f09686724
clarify that this is bf16 mfu we're talking about
2026-02-10 23:35:00 +00:00
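(For context on the commit above: bf16 MFU is the achieved FLOP/s as a fraction of the GPU's peak dense bf16 FLOP/s. A minimal hedged sketch follows; the 6*N*T FLOPs-per-step estimate and the H100 peak figure are illustrative assumptions, not the repo's exact accounting.)

    # Hedged sketch: MFU as a fraction of peak bf16 throughput.
    # The 6*N*T per-step FLOPs estimate and the 989 TFLOPS H100 bf16 peak are assumptions.
    def bf16_mfu(num_params, tokens_per_step, step_time_s, peak_bf16_flops=989e12):
        flops_per_step = 6 * num_params * tokens_per_step  # rough forward + backward estimate
        achieved_flops = flops_per_step / step_time_s
        return achieved_flops / peak_bf16_flops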
Andrej Karpathy
e569b59f92
delete the torchao dependency, create our own exact API-matched version of Float8Linear, and document it very well. For some poorly understood reason, the performance is not only ~identical but actually 3% faster, despite it being significantly simpler and much less code. I don't fully understand why/how atm
2026-02-10 18:46:39 +00:00
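(For readers curious what an API-matched Float8Linear might look like conceptually: below is a minimal, hedged sketch of per-tensor fp8 scaling around a linear layer. It simulates the quantize/dequantize step in the original precision rather than calling fused fp8 matmul kernels, and is not the repo's actual implementation.)

    import torch
    import torch.nn as nn

    F8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0

    def to_float8(x):
        # Per-tensor dynamic scaling into e4m3; returns the fp8 tensor and its scale.
        scale = F8_MAX / x.abs().max().clamp(min=1e-12)
        return (x * scale).clamp(-F8_MAX, F8_MAX).to(torch.float8_e4m3fn), scale

    class SketchFloat8Linear(nn.Linear):
        # Conceptual stand-in, NOT nanochat's Float8Linear: quantize weight and
        # input to fp8, dequantize, then run the matmul in the original dtype.
        def forward(self, input):
            w8, w_scale = to_float8(self.weight)
            x8, x_scale = to_float8(input)
            w = w8.to(input.dtype) / w_scale
            x = x8.to(input.dtype) / x_scale
            return nn.functional.linear(x, w, self.bias)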
Andrej Karpathy
1ec0a34779
at depth 28 and above we start to need batch size 8
2026-02-08 18:26:34 +00:00
Andrej Karpathy
ff46300720
tune miniseries just a bit, fairly cosmetic, keep to even depths where the math works out nicely in model sizing
2026-02-08 17:54:12 +00:00
Andrej Karpathy
aeff095e97
better comments/flow on all the hyperparameter transfer stuff, and change the WD scaling from my empirical 1/d^2 to a bit more principled version based on Tepoch. All of that theory is based on AdamW and could be suboptimal for Muon
2026-02-06 19:22:28 +00:00
Andrej Karpathy
685271dc8d
new optimal ratio for d26 training
2026-02-06 19:21:27 +00:00
Andrej Karpathy
e527521a3f
briefly mention batch ramp experimentation too; too weak to merge in my few attempts
2026-02-05 22:21:03 +00:00
Andrej Karpathy
96522798f1
docs docs docs
2026-02-05 20:27:07 +00:00
Andrej Karpathy
5fdd5cdb24
new leaderboard record via new auto-calculated optimal batch size. for d26 it is 1M, up from 0.5M that was default earlier
2026-02-05 20:11:32 +00:00
Andrej Karpathy
2c062aaa94
nit: don't mutate args, create new var for total_batch_size
2026-02-05 19:59:46 +00:00
Andrej Karpathy
f41dd3cbd7
auto-calculate optimal batch size. the original setting of 0.5M was only optimal for d12, but d26 prefers 1M and so on
2026-02-05 19:40:37 +00:00
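(A hedged illustration of the idea in the commit above, not the repo's actual rule: derive the total batch size from model depth instead of hardcoding 0.5M. The linear-in-depth scaling and power-of-two rounding below are assumptions chosen to reproduce the two data points mentioned, 0.5M at d12 and 1M at d26.)

    import math

    # Hypothetical sketch: grow total batch size with depth, snapped to a power of two.
    def auto_total_batch_size(depth, base_depth=12, base_tokens=524288):  # 0.5M at d12
        raw = base_tokens * (depth / base_depth)   # assumed linear-in-depth scaling
        return 2 ** round(math.log2(raw))          # d26 -> 2**20 = 1M tokens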
Andrej Karpathy
98eed6df18
bring back an assert guarding against bad param sizing
2026-02-05 18:14:30 +00:00
Sofie Van Landeghem
012da1a78b
Typo fixes (#480)
* small typo
* few more small fixes
* small fixes in leaderboard.md
2026-02-05 19:12:50 +01:00
Andrej Karpathy
75b302f331
fix hash commit on leaderboard and a paragraph clarification
2026-02-05 16:14:28 +00:00
Andrej Karpathy
1144d186ed
try and fail relu^2 -> swiglu
2026-02-05 02:42:46 +00:00
Andrej Karpathy
d63b7ab9ac
try and fail relu^2 -> swiglu
2026-02-05 02:41:46 +00:00
Andrej Karpathy
718e5e9d67
correctly reference NorMuon and fix misleading terms that i may have hastily ported over from modded-nanogpt
2026-02-05 01:39:26 +00:00
Andrej Karpathy
542beb0c8c
bump speedrun to be the up-to-date leaderboard run
2026-02-04 02:12:04 +00:00
Andrej Karpathy
d510b1385b
quick experiments to log
2026-02-03 23:21:39 +00:00
Andrej Karpathy
16b8ac7da3
oops forgot to attach leaderboard file too
2026-02-03 21:06:12 +00:00
Andrej Karpathy
fe55b092b8
minor cosmetics for the table
2026-02-03 21:05:28 +00:00
Andrej Karpathy
a67eba35dc
add Feb 2 new leaderboard record from upgrading to fp8 training, +4.3% speedup in time to GPT-2
2026-02-03 21:03:42 +00:00
Andrej Karpathy
6079f78fc3
add fp8 training with torchao
2026-02-03 21:03:42 +00:00
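(This is the torchao-based route that the Feb 10 commit further up later replaced with the hand-rolled Float8Linear. A hedged sketch of how torchao's fp8 conversion is typically applied; the module filter shown is an illustrative assumption, and the exact API may differ across torchao versions.)

    import torch.nn as nn
    # Assumes torchao's float8 training API is available:
    from torchao.float8 import convert_to_float8_training

    def enable_fp8(model):
        # Swap eligible nn.Linear modules for fp8 training variants; skipping the
        # output head here is an illustrative choice, not necessarily the repo's.
        def module_filter_fn(module, fqn):
            return isinstance(module, nn.Linear) and "lm_head" not in fqn
        convert_to_float8_training(model, module_filter_fn=module_filter_fn)
        return model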
Andrej Karpathy
8ebc14b348
small touchups to the eval script, re-order items etc, cosmetic
2026-02-03 21:03:42 +00:00
Sofie Van Landeghem
72b9064f9d
remove leftover mid references (#491)
2026-02-02 08:33:46 -08:00
Andrej Karpathy
b19b4f3e49
fix bug in speedrun script, batch size that doesn't OOM on 8XH100 for d24 is 16
2026-02-02 15:50:14 +00:00
Andrej Karpathy
230d6cf6c6
tune the synthetic data generation script. delete the king andrej stuff lol. also, upgrade to gemini 3
2026-02-02 01:45:59 +00:00
Andrej Karpathy
07c4dd4cd9
manually control the over-active garbage collector, saving a few minutes from a typical run
2026-02-02 01:44:30 +00:00
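(A hedged sketch of the garbage-collector idea in the commit above: turn off automatic collection and collect on a schedule we control. The training-loop names and the step interval are placeholders, not the repo's code.)

    import gc

    gc.disable()   # stop automatic, unpredictably timed GC pauses
    gc.collect()
    gc.freeze()    # move surviving startup objects out of future scans

    for step in range(num_steps):        # num_steps / train_step are placeholders
        train_step()
        if step % 1000 == 0:             # illustrative interval: collect on our own terms
            gc.collect()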
Andrej Karpathy
e8fec97d4c
slightly more efficient dataloader that reduces the number of python objects flying around and causing strain on runtime and garbage collector
2026-02-02 01:17:30 +00:00
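(A hedged sketch of the general technique behind the commit above: read token shards through a memory map and hand out a few tensors per batch instead of building per-token Python objects. Paths, dtypes, and shapes are placeholders, not the repo's dataloader.)

    import numpy as np
    import torch

    def iter_batches(shard_path, batch_size, seq_len):
        # One mmap per shard; each batch creates only a handful of objects
        # (rather than per-token ones), easing pressure on the runtime and GC.
        tokens = np.memmap(shard_path, dtype=np.uint16, mode="r")
        tokens_per_batch = batch_size * (seq_len + 1)
        for start in range(0, len(tokens) - tokens_per_batch, tokens_per_batch):
            buf = torch.from_numpy(tokens[start:start + tokens_per_batch].astype(np.int64))
            buf = buf.view(batch_size, seq_len + 1)
            yield buf[:, :-1], buf[:, 1:]   # inputs, targets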
Andrej Karpathy
8b4849d548
fix bug in chat_sft, the attention window must be preserved sigh
2026-02-01 20:58:44 +00:00
Andrej Karpathy
eaf49a33c8
fix path which I think was modified during the refactor; this is a bug introduced by Claude, I believe
2026-02-01 20:15:19 +00:00
Andrej Karpathy
31b61d2d17
fix broken import sigh
2026-02-01 05:03:44 +00:00
Sofie Van Landeghem
4d6415b8ef
use _PEAK_FLOPS_TABLE instead of if-else structure (#479)
2026-01-31 19:45:06 -08:00
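(The pattern in that PR, sketched in a hedged way: a lookup table keyed by device name replaces an if/else chain. Device names and peak numbers below are illustrative, not the repo's exact table.)

    # Illustrative dense bf16 peak FLOP/s values; the repo's table may differ.
    _PEAK_FLOPS_TABLE = {
        "H100": 989e12,
        "A100": 312e12,
    }

    def peak_flops(device_name):
        for key, flops in _PEAK_FLOPS_TABLE.items():
            if key in device_name:          # e.g. "NVIDIA H100 80GB HBM3"
                return flops
        raise ValueError(f"unknown device: {device_name}")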
Sofie Van Landeghem
43078c347e
clean up original tokenizing_distributed_data_loader (#478)
2026-01-31 19:44:12 -08:00
Franci Penov
dc291c627f
Add Blackwell (SM100) GPU support via SDPA fallback (#475)
2026-01-31 19:42:58 -08:00
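(A hedged sketch of the fallback idea in that PR: detect the compute capability and route attention through PyTorch SDPA when the preferred kernel isn't available on that architecture. The threshold check and the fused_attention placeholder are assumptions about the shape of the code, not the PR's exact logic.)

    import torch
    import torch.nn.functional as F

    def needs_sdpa_fallback():
        # Blackwell (SM100) reports compute capability (10, 0).
        major, _ = torch.cuda.get_device_capability()
        return major >= 10

    def attention(q, k, v):
        if needs_sdpa_fallback():
            # Generic SDPA path works on SM100 even where fused kernels don't.
            return F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return fused_attention(q, k, v)   # placeholder for the preferred fast path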
Andrej Karpathy
0307997f9b
merge two files base_loss and base_eval into a single file, it's nicer this way, and unify the huggingface code associated with both
2026-02-01 02:36:43 +00:00
Andrej Karpathy
1ddaad1c1c
nuke midtraining from orbit, it's not as needed now that we have a BOS-aligned dataloader. Also change the README a lot. Midtraining is not yet fully erased across the board, but good enough for step 1
2026-01-31 19:12:25 +00:00
Andrej Karpathy
348fbb301b
fix dataloader for midtrain to never crop data. we can't just throw it away like we do in pretraining
2026-01-31 18:21:36 +00:00
Andrej Karpathy
3c3a3d7042
warmdown of 0.5 is slightly better
2026-01-31 01:08:44 +00:00
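(For context on the commit above: "warmdown" is presumably the final linear decay portion of the LR schedule, and 0.5 the fraction of training it spans. A hedged sketch of such a schedule; the warmup length and decay-to-zero choice are assumptions, not the repo's settings.)

    def lr_multiplier(step, num_steps, warmup_steps=256, warmdown_frac=0.5):
        # Hypothetical schedule: short linear warmup, flat middle, then a linear
        # warmdown over the final `warmdown_frac` of training.
        if step < warmup_steps:
            return (step + 1) / warmup_steps
        warmdown_start = int(num_steps * (1 - warmdown_frac))
        if step < warmdown_start:
            return 1.0
        return max(0.0, (num_steps - step) / (num_steps - warmdown_start))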
Andrei Panferov
4d8dbaf6e0
Fix escape character in README bibtex entry (#454)
2026-01-30 09:34:02 -08:00
Andrej Karpathy
3ba42e8135
Fix SDPA KV-cache decode to respect sliding window (#456)
SDPA fallback now respects sliding window during single-token KV-cache
decode by slicing K/V to the last (window + 1) tokens.
Also simplifies the mask building for chunk inference to properly apply
sliding window in that path as well.
Fixes #452
Co-Authored-By: Kartik Vashishta <kartikv776@gmail.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-30 17:32:12 +00:00
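(A hedged sketch of the slicing trick described in that PR: during single-token decode, restrict the cached K/V to the last window + 1 positions before calling SDPA, so no explicit mask is needed. The (batch, heads, time, head_dim) tensor layout is an assumption.)

    import torch.nn.functional as F

    def decode_step(q, k_cache, v_cache, window):
        # q is the single new token's query: (B, H, 1, D).
        # Only the last (window + 1) cached positions may attend to the new token.
        k = k_cache[:, :, -(window + 1):, :]
        v = v_cache[:, :, -(window + 1):, :]
        # No mask needed: every remaining key lies within the sliding window.
        return F.scaled_dot_product_attention(q, k, v)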