fef6543e38 | 2026-02-18 13:20:40 -08:00 | Dipesh Babu | Merge 1bf1fdaa0d into 0a23f87643
0a23f87643 | 2026-02-18 07:42:11 -08:00 | George Shakan | Fix bug in setting precision (#538)
4800c62f6e | 2026-02-17 16:03:46 -08:00 | Sofie Van Landeghem | Fix MockModel's device definition (#535)
    * fix MockModel's device definition
    * cleanup
4a6e47b0c6 | 2026-02-17 15:44:54 +00:00 | Andrej Karpathy | update dev log with recent
8180e1d8c1 | 2026-02-16 20:23:04 +00:00 | Andrej Karpathy | tune the data mixture a bit, load optimizer by default when SFT. These were confirmed to be best settings from sweeps of sft
788dadeb88 | 2026-02-16 14:41:53 +00:00 | Andrej Karpathy | a number of upgrades to SFT script to bring it up to date w.r.t. pretraining and tuning some of its kwargs based on sweeps
2f09686724 | 2026-02-10 23:35:00 +00:00 | Andrej Karpathy | clarify that this is bf16 mfu we're talking about
e569b59f92 | 2026-02-10 18:46:39 +00:00 | Andrej Karpathy | delete torchao dependency, create our own exact API-matched version of Float8Linear, document it very well. For some poorly understood reason, the performance is not only ~identical but actually 3% faster, despite the new version being significantly simpler and much less code. I don't fully understand why/how at the moment.
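The commit above swaps torchao's Float8Linear for an in-repo version. The repo's actual implementation is not reproduced in this log; purely as a toy illustration of the core idea behind fp8-style training, per-tensor dynamic scaling, here is a minimal pure-Python sketch (all names hypothetical, `round()` standing in for the real fp8 cast):

```python
def quantize_dynamic(xs, max_repr=448.0):
    # Hypothetical sketch of per-tensor dynamic scaling: choose a scale so
    # the largest magnitude maps near the format's max representable value
    # (448 for fp8 e4m3), cast, and keep the scale so matmul outputs can be
    # rescaled back to the original range.
    amax = max(abs(x) for x in xs)
    scale = max_repr / amax if amax > 0 else 1.0
    quantized = [round(x * scale) for x in xs]  # crude stand-in for the fp8 cast
    return quantized, scale

def dequantize(quantized, scale):
    # undo the scaling after the low-precision compute
    return [q / scale for q in quantized]

q, s = quantize_dynamic([0.1, -2.0, 0.5])
print(dequantize(q, s))  # roughly recovers [0.1, -2.0, 0.5]
```

The scale is recomputed per tensor per step, which is why this is called "dynamic" scaling; values at the max magnitude round-trip exactly, while smaller ones pick up quantization error.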
1bf1fdaa0d | 2026-02-09 20:51:15 -05:00 | Dipesh Babu | remove unused import
3675a44cd6 | 2026-02-09 20:46:31 -05:00 | Dipesh Babu | fix RoPE cache overflow with kv-cache by growing rope buffers
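The RoPE fix above addresses generation with a kv-cache running past the precomputed rotation table. In spirit, the buffer is regrown on demand instead of indexing out of bounds; a standalone sketch of that idea (not the repo's code, names made up):

```python
import math

def rope_angles(seq_len, head_dim, base=10000.0):
    # precompute rotation angles for positions [0, seq_len)
    inv_freq = [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]
    return [[pos * f for f in inv_freq] for pos in range(seq_len)]

class RopeCache:
    def __init__(self, seq_len, head_dim):
        self.head_dim = head_dim
        self.angles = rope_angles(seq_len, head_dim)

    def get(self, pos):
        # grow the cache when decoding with a kv-cache runs past the
        # precomputed length, rather than overflowing the buffer
        if pos >= len(self.angles):
            new_len = max(pos + 1, 2 * len(self.angles))
            self.angles = rope_angles(new_len, self.head_dim)
        return self.angles[pos]
```

Doubling on growth amortizes the recompute cost across many decode steps.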
56660c690b | 2026-02-09 20:44:44 -05:00 | Dipesh Babu | Merge branch 'karpathy:master' into master
1ec0a34779 | 2026-02-08 18:26:34 +00:00 | Andrej Karpathy | at 28 and above we start to need batch size 8
ff46300720 | 2026-02-08 17:54:12 +00:00 | Andrej Karpathy | tune miniseries just a bit, fairly cosmetic, keep to even depths where the math works out nicely in model sizing
aeff095e97 | 2026-02-06 19:22:28 +00:00 | Andrej Karpathy | better comments/flow on all the hyperparameter transfer stuff, and change the WD scaling from my empirical 1/d^2 to a bit more principled version based on Tepoch. All of that theory is based on AdamW and could be suboptimal for Muon
685271dc8d | 2026-02-06 19:21:27 +00:00 | Andrej Karpathy | new optimal ratio for d26 training
e527521a3f | 2026-02-05 22:21:03 +00:00 | Andrej Karpathy | briefly mention batch ramp experimentation too, too weak to merge in my few attempts
96522798f1 | 2026-02-05 20:27:07 +00:00 | Andrej Karpathy | docs docs docs
5fdd5cdb24 | 2026-02-05 20:11:32 +00:00 | Andrej Karpathy | new leaderboard record via new auto-calculated optimal batch size. for d26 it is 1M, up from the earlier default of 0.5M
2c062aaa94 | 2026-02-05 19:59:46 +00:00 | Andrej Karpathy | nit: don't mutate args, create new var for total_batch_size
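The "don't mutate args" nit is a general Python pattern worth spelling out: an argparse namespace is shared by every later reader of `args`, so writing a derived value back onto it silently changes behavior elsewhere. A generic before/after sketch (field names hypothetical, not the repo's):

```python
import argparse

def plan_run(args):
    # Bad: `args.total_batch_size = args.device_batch_size * args.num_gpus`
    # would mutate the shared namespace for every later reader of `args`.
    # Good: derive a new local variable and leave `args` untouched.
    total_batch_size = args.device_batch_size * args.num_gpus
    return total_batch_size

args = argparse.Namespace(device_batch_size=16, num_gpus=8)
print(plan_run(args))  # 128, and args still has no total_batch_size attribute
```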
f41dd3cbd7 | 2026-02-05 19:40:37 +00:00 | Andrej Karpathy | auto-calculate optimal batch size. the original setting of 0.5M was only optimal for d12, but d26 prefers 1M and so on
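The exact rule this commit implements is not shown in the log. One plausible shape consistent with the numbers quoted (0.5M tokens optimal at d12, 1M at d26) scales the token budget with depth from a known-good point and rounds to a power of two; this is purely an illustrative reconstruction, not the repo's formula:

```python
import math

def auto_batch_tokens(depth, base_depth=12, base_tokens=2**19):
    # Hypothetical: scale the total batch size (in tokens) linearly with
    # model depth from a known-good (depth, tokens) point, then round to
    # the nearest power of two for clean sharding.
    raw = base_tokens * depth / base_depth
    return 2 ** round(math.log2(raw))

print(auto_batch_tokens(12))  # 524288  (0.5M)
print(auto_batch_tokens(26))  # 1048576 (1M)
```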
98eed6df18 | 2026-02-05 18:14:30 +00:00 | Andrej Karpathy | bring back an assert guarding against bad param sizing
012da1a78b | 2026-02-05 19:12:50 +01:00 | Sofie Van Landeghem | Typo fixes (#480)
    * small typo
    * few more small fixes
    * small fixes in leaderboard.md
75b302f331 | 2026-02-05 16:14:28 +00:00 | Andrej Karpathy | fix hash commit on leaderboard and a paragraph clarification
1144d186ed | 2026-02-05 02:42:46 +00:00 | Andrej Karpathy | try and fail relu^2 -> swiglu
d63b7ab9ac | 2026-02-05 02:41:46 +00:00 | Andrej Karpathy | try and fail relu^2 -> swiglu
718e5e9d67 | 2026-02-05 01:39:26 +00:00 | Andrej Karpathy | correctly reference NorMuon and fix misleading terms that i may have hastily ported over from modded-nanogpt
5e5c609b05 | 2026-02-03 22:03:38 -05:00 | Dipesh Babu | Merge branch 'karpathy:master' into master
542beb0c8c | 2026-02-04 02:12:04 +00:00 | Andrej Karpathy | bump speedrun to be the up to date leaderboard run
d510b1385b | 2026-02-03 23:21:39 +00:00 | Andrej Karpathy | quick experiments to log
16b8ac7da3 | 2026-02-03 21:06:12 +00:00 | Andrej Karpathy | oops forgot to attach leaderboard file too
fe55b092b8 | 2026-02-03 21:05:28 +00:00 | Andrej Karpathy | minor cosmetics for the table
a67eba35dc | 2026-02-03 21:03:42 +00:00 | Andrej Karpathy | add feb2 new leaderboard record from upgrading to fp8 training, +4.3% speedup to time to GPT-2
6079f78fc3 | 2026-02-03 21:03:42 +00:00 | Andrej Karpathy | add fp8 training with torchao
8ebc14b348 | 2026-02-03 21:03:42 +00:00 | Andrej Karpathy | small touchups to the eval script, re-order items etc, cosmetic
72b9064f9d | 2026-02-02 08:33:46 -08:00 | Sofie Van Landeghem | remove leftover mid references (#491)
b19b4f3e49 | 2026-02-02 15:50:14 +00:00 | Andrej Karpathy | fix bug in speedrun script, batch size that doesn't OOM on 8XH100 for d24 is 16
230d6cf6c6 | 2026-02-02 01:45:59 +00:00 | Andrej Karpathy | tune the synthetic data generation script. delete the king andrej stuff lol. also, upgrade to gemini 3
07c4dd4cd9 | 2026-02-02 01:44:30 +00:00 | Andrej Karpathy | manually control the over-active garbage collector, saving a few minutes from a typical run
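Taming CPython's garbage collector in a hot training loop is a standard trick: disable automatic generational collection, then collect manually at a step boundary where a pause is harmless. A minimal sketch of the pattern (the repo's exact cadence and placement are assumptions here):

```python
import gc

gc.disable()  # stop automatic generational collections during the hot loop

def train_loop(num_steps, collect_every=1000):
    for step in range(num_steps):
        # ... forward / backward / optimizer step would go here ...
        if step % collect_every == 0:
            gc.collect()  # pay the collection cost at a predictable point

train_loop(10, collect_every=5)
gc.enable()  # restore normal behavior once the hot loop is done
```

This avoids the collector firing mid-step based on allocation counts, which is exactly when a multi-millisecond pause hurts most.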
e8fec97d4c | 2026-02-02 01:17:30 +00:00 | Andrej Karpathy | slightly more efficient dataloader that reduces the number of python objects flying around and causing strain on the runtime and garbage collector
8b4849d548 | 2026-02-01 20:58:44 +00:00 | Andrej Karpathy | fix bug in chat_sft, the attention window must be preserved sigh
eaf49a33c8 | 2026-02-01 20:15:19 +00:00 | Andrej Karpathy | fix path which i think was modified during the refactor; i believe this is a bug introduced by claude
31b61d2d17 | 2026-02-01 05:03:44 +00:00 | Andrej Karpathy | fix broken import sigh
4d6415b8ef | 2026-01-31 19:45:06 -08:00 | Sofie Van Landeghem | use _PEAK_FLOPS_TABLE instead of if-else structure (#479)
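Replacing an if/elif chain with a lookup table is a small but real readability win: MFU accounting needs a peak-FLOPS figure per GPU, and a dict keeps adding a device to a one-line change. A sketch of the pattern; the table values below are illustrative placeholders, not necessarily the repo's:

```python
# Hypothetical per-device peak bf16 FLOPS table (dense, no sparsity).
# A dict lookup replaces a chain of `if "H100" in name: ... elif ...`.
_PEAK_FLOPS_TABLE = {
    "H100": 989e12,
    "A100": 312e12,
}

def peak_flops(device_name, default=312e12):
    # substring match against the table keys, falling back to a default
    for key, flops in _PEAK_FLOPS_TABLE.items():
        if key in device_name:
            return flops
    return default

print(peak_flops("NVIDIA H100 80GB HBM3"))  # 9.89e+14
```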
43078c347e | 2026-01-31 19:44:12 -08:00 | Sofie Van Landeghem | clean up original tokenizing_distributed_data_loader (#478)
dc291c627f | 2026-01-31 19:42:58 -08:00 | Franci Penov | Add Blackwell (SM100) GPU support via SDPA fallback (#475)
0307997f9b | 2026-02-01 02:36:43 +00:00 | Andrej Karpathy | merge two files base_loss and base_eval into a single file, it's nicer this way, and unify the huggingface code associated with both
beb34ac43c | 2026-01-31 19:18:48 -05:00 | Dipesh Babu | fix: correct LR warmdown step range
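A warmdown schedule holds the learning rate flat and then decays it linearly to zero over the final fraction of training; an off-by-one in the step range is exactly the kind of bug a fix like the one above targets. A generic sketch (not the repo's function, parameter names assumed):

```python
def lr_multiplier(step, num_steps, warmdown_frac=0.2):
    # Hypothetical schedule: constant LR, then linear decay to zero over
    # the final `warmdown_frac` of training. Getting the step range wrong
    # (e.g. decay starting one step late or never reaching zero) is the
    # classic failure mode of this kind of function.
    warmdown_start = int(num_steps * (1 - warmdown_frac))
    if step < warmdown_start:
        return 1.0
    return (num_steps - step) / (num_steps - warmdown_start)

print(lr_multiplier(0, 100))    # 1.0  (flat phase)
print(lr_multiplier(90, 100))   # 0.5  (halfway through warmdown)
print(lr_multiplier(100, 100))  # 0.0  (fully decayed)
```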
e336c12881 | 2026-01-31 18:23:20 -05:00 | Dipesh Babu | Merge branch 'karpathy:master' into master
1ddaad1c1c | 2026-01-31 19:12:25 +00:00 | Andrej Karpathy | nuke midtraining from orbit, it's not as needed now that we have a BOS-aligned dataloader. Also change the README a lot. midtraining is not yet fully erased across the board, but good enough for step 1
348fbb301b | 2026-01-31 18:21:36 +00:00 | Andrej Karpathy | fix dataloader for midtrain to never crop data. we can't just throw it away like we do in pretraining