Junyang Chen
767df6ef61
dataloader: reuse cropped remainders to reduce token waste ~35% -> ~23%
When the BestFit-Crop algorithm crops a document to fill remaining row space,
the leftover tokens are currently discarded. This change puts the remainder
(with BOS prepended) back into the document buffer for future rows.
Simulation results at T=2048 with a realistic document length distribution:
- Source token consumption reduced by ~15%
- Data efficiency improved by ~1.18x
- Estimated ~28 minutes saved on d24 speedrun (3.04h -> ~2.57h)
The change is minimal (6 lines in the crop branch) and preserves all existing
properties: BOS-aligned rows, 100% utilization, deterministic packing order.
2026-02-18 23:04:43 -08:00
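The remainder reuse described in 767df6ef61 can be illustrated with a small, self-contained sketch. This is not the repository's dataloader: `pack_rows`, the `BOS` id, the list-based buffer, and the choice to crop the shortest buffered document are all assumptions made for illustration. Only the core idea (prepend BOS to the cropped remainder and return it to the buffer) comes from the commit message.

```python
BOS = 0  # hypothetical BOS token id, assumed for this sketch


def pack_rows(docs, row_len, reuse_remainder=True):
    """Pack BOS-prefixed token lists into rows of exactly row_len tokens.

    Returns (rows, discarded), where discarded counts tokens thrown away
    by cropping. Illustrative only; a real loader streams documents and
    refills the buffer instead of stopping when it runs dry.
    """
    buffer = [list(d) for d in docs]  # copy so the caller's lists are untouched
    rows, discarded = [], 0
    while buffer:
        row = []
        while len(row) < row_len and buffer:
            space = row_len - len(row)
            # Best fit: the longest buffered document that fits entirely.
            fitting = [i for i, d in enumerate(buffer) if len(d) <= space]
            if fitting:
                best = max(fitting, key=lambda i: len(buffer[i]))
                row.extend(buffer.pop(best))
                continue
            # Crop branch: nothing fits, so cut a document to fill the row.
            # (Cropping the shortest document is an assumption.)
            shortest = min(range(len(buffer)), key=lambda i: len(buffer[i]))
            doc = buffer.pop(shortest)
            row.extend(doc[:space])
            remainder = doc[space:]
            if reuse_remainder:
                # Put the leftover back with BOS so future rows stay BOS-aligned.
                buffer.append([BOS] + remainder)
            else:
                discarded += len(remainder)
        if len(row) == row_len:
            rows.append(row)  # an incomplete final row is dropped in this sketch
    return rows, discarded


docs = [[0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [0, 11, 12], [0, 21]]
rows_old, lost_old = pack_rows(docs, row_len=8, reuse_remainder=False)
rows_new, lost_new = pack_rows(docs, row_len=8, reuse_remainder=True)
print(len(rows_old), lost_old)  # rows and discarded tokens without reuse
print(len(rows_new), lost_new)  # rows and discarded tokens with reuse
```

In this toy input the long document is cropped after three tokens. Without reuse its remaining seven tokens are lost; with reuse they come back as a second, fully packed row that still begins with BOS.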
Andrej Karpathy
2dffdc8cf6
document MoE exploration
2026-02-19 02:53:47 +00:00
Andrej Karpathy
48804bff3a
report negative result on fineweb dataset
2026-02-18 23:45:31 +00:00
Andrej Karpathy
bb5137860e
fix comment
2026-02-18 23:26:22 +00:00
Andrej Karpathy
458555117b
Merge branch 'Chetter2-patch-1'
2026-02-18 23:17:39 +00:00
Andrej Karpathy
bac5a35dd7
fix minor bug in fp8 application to skip tiny matmuls
2026-02-18 23:17:29 +00:00
George Shakan
ad55575326
Fix bug in setting precision ( #538 )
2026-02-18 15:49:18 +00:00
Sofie Van Landeghem
cac43e8511
Fix MockModel's device definition ( #535 )
* fix MockModel's device definition
* cleanup
2026-02-18 15:49:18 +00:00
Andrej Karpathy
f5fe7925ed
update dev log with recent
2026-02-18 15:49:18 +00:00
Andrej Karpathy
1415fb7617
tune the data mixture a bit, load optimizer by default when SFT. These were confirmed to be best settings from sweeps of sft
2026-02-18 15:49:18 +00:00
Andrej Karpathy
77f8fb8303
a number of upgrades to SFT script to bring it up to date w.r.t. pretraining and tuning some of its kwargs based on sweeps
2026-02-18 15:49:18 +00:00
George Shakan
0a23f87643
Fix bug in setting precision ( #538 )
2026-02-18 07:42:11 -08:00
Sofie Van Landeghem
4800c62f6e
Fix MockModel's device definition ( #535 )
* fix MockModel's device definition
* cleanup
2026-02-17 16:03:46 -08:00
Andrej Karpathy
4a6e47b0c6
update dev log with recent
2026-02-17 15:44:54 +00:00
Andrej Karpathy
8180e1d8c1
tune the data mixture a bit, load optimizer by default when SFT. These were confirmed to be best settings from sweeps of sft
2026-02-16 20:23:04 +00:00
Andrej Karpathy
788dadeb88
a number of upgrades to SFT script to bring it up to date w.r.t. pretraining and tuning some of its kwargs based on sweeps
2026-02-16 14:41:53 +00:00
Alan
124f49be98
Removed redundant quantization of gradients
2026-02-15 15:41:33 +00:00
Alan
d9678ff0f9
Save FP8 tensors in autograd ctx instead of full-precision inputs
Store quantized input/weight and their inverse scales in _Float8Matmul ctx to avoid re-quantization in backward and reduce saved-activation memory without changing numerics.
2026-02-15 14:31:54 +00:00
Andrej Karpathy
2f09686724
clarify that this is bf16 mfu we're talking about
2026-02-10 23:35:00 +00:00
Andrej Karpathy
e569b59f92
delete torchao dependency, create our own exact API-matched version of Float8Linear, document it very well. for some poorly understood reason, the performance is not merely ~identical, it actually runs 3% faster, despite being significantly simpler and much less code. i don't fully understand why/how atm
2026-02-10 18:46:39 +00:00
Andrej Karpathy
1ec0a34779
at 28 and above we start to need batch size 8
2026-02-08 18:26:34 +00:00
Andrej Karpathy
ff46300720
tune miniseries just a bit, fairly cosmetic, keep to even depths where the math works out nicely in model sizing
2026-02-08 17:54:12 +00:00
Andrej Karpathy
aeff095e97
better comments/flow on all the hyperparameter transfer stuff, and change the WD scaling from my empirical 1/d^2 to a bit more principled version based on Tepoch. All of that theory is based on AdamW and could be suboptimal for Muon
2026-02-06 19:22:28 +00:00
Andrej Karpathy
685271dc8d
new optimal ratio for d26 training
2026-02-06 19:21:27 +00:00
Andrej Karpathy
e527521a3f
briefly mention batch ramp experimentation too, too weak to merge in my few attempts
2026-02-05 22:21:03 +00:00
Andrej Karpathy
96522798f1
docs docs docs
2026-02-05 20:27:07 +00:00
Andrej Karpathy
5fdd5cdb24
new leaderboard record via new auto-calculated optimal batch size. for d26 it is 1M, up from 0.5M that was default earlier
2026-02-05 20:11:32 +00:00
Andrej Karpathy
2c062aaa94
nit: don't mutate args, create new var for total_batch_size
2026-02-05 19:59:46 +00:00
Andrej Karpathy
f41dd3cbd7
auto-calculate optimal batch size. the original setting of 0.5M was only optimal for d12, but d26 prefers 1M and so on
2026-02-05 19:40:37 +00:00
Andrej Karpathy
98eed6df18
bring back an assert guarding against bad param sizing
2026-02-05 18:14:30 +00:00
Sofie Van Landeghem
012da1a78b
Typo fixes ( #480 )
* small typo
* few more small fixes
* small fixes in leaderboard.md
2026-02-05 19:12:50 +01:00
Andrej Karpathy
75b302f331
fix hash commit on leaderboard and a paragraph clarification
2026-02-05 16:14:28 +00:00
Andrej Karpathy
1144d186ed
try and fail relu^2 -> swiglu
2026-02-05 02:42:46 +00:00
Andrej Karpathy
d63b7ab9ac
try and fail relu^2 -> swiglu
2026-02-05 02:41:46 +00:00
Andrej Karpathy
718e5e9d67
correctly reference NorMuon and fix misleading terms that i may have hastily ported over from modded-nanogpt
2026-02-05 01:39:26 +00:00
Andrej Karpathy
542beb0c8c
bump speedrun to be the up to date leaderboard run
2026-02-04 02:12:04 +00:00
Andrej Karpathy
d510b1385b
quick experiments to log
2026-02-03 23:21:39 +00:00
Andrej Karpathy
16b8ac7da3
oops forgot to attach leaderboard file too
2026-02-03 21:06:12 +00:00
Andrej Karpathy
fe55b092b8
minor cosmetics for the table
2026-02-03 21:05:28 +00:00
Andrej Karpathy
a67eba35dc
add feb2 new leaderboard record from upgrading to fp8 training, +4.3% speedup to time to GPT-2
2026-02-03 21:03:42 +00:00
Andrej Karpathy
6079f78fc3
add fp8 training with torchao
2026-02-03 21:03:42 +00:00
Andrej Karpathy
8ebc14b348
small touchups to the eval script, re-order items etc, cosmetic
2026-02-03 21:03:42 +00:00
Sofie Van Landeghem
72b9064f9d
remove leftover mid references ( #491 )
2026-02-02 08:33:46 -08:00
Andrej Karpathy
b19b4f3e49
fix bug in speedrun script, batch size that doesn't OOM on 8XH100 for d24 is 16
2026-02-02 15:50:14 +00:00
Andrej Karpathy
230d6cf6c6
tune the synthetic data generation script. delete the king andrej stuff lol. also, upgrade to gemini 3
2026-02-02 01:45:59 +00:00
Andrej Karpathy
07c4dd4cd9
manually control the over-active garbage collector, save a few minutes from a typical run
2026-02-02 01:44:30 +00:00
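One common way to take manual control of Python's garbage collector in a long training loop is sketched below. The commit message for 07c4dd4cd9 does not say how it does this, so the approach here (freeze long-lived objects, disable automatic collection, collect on a fixed schedule) and the names `train_loop`, `step_fn`, and `collect_every` are assumptions rather than the repository's code. Only standard-library `gc` calls are used.

```python
import gc


def train_loop(num_steps, step_fn, collect_every=1000):
    """Run step_fn for num_steps with automatic GC switched off.

    Returns the number of manual collections performed.
    """
    gc.collect()  # clear garbage left over from setup
    gc.freeze()   # move surviving objects out of the generations GC scans
    gc.disable()  # stop automatic generational collection
    collections = 0
    try:
        for step in range(1, num_steps + 1):
            step_fn(step)
            if step % collect_every == 0:
                gc.collect()  # collect at a moment we choose
                collections += 1
    finally:
        # Restore default behaviour even if a step raises.
        gc.enable()
        gc.unfreeze()
    return collections


if __name__ == "__main__":
    n = train_loop(2500, lambda step: None, collect_every=1000)
    print("manual collections:", n)
```

Reference counting still frees most objects immediately while the collector is disabled. The periodic `gc.collect()` is only there to reclaim reference cycles, so the interval can be fairly long.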
Andrej Karpathy
e8fec97d4c
slightly more efficient dataloader that reduces the number of python objects flying around and causing strain on runtime and garbage collector
2026-02-02 01:17:30 +00:00
Andrej Karpathy
8b4849d548
fix bug in chat_sft, the attention window must be preserved sigh
2026-02-01 20:58:44 +00:00
Andrej Karpathy
eaf49a33c8
fix path which i think was modified during the refactor and this is a bug introduced by claude i believe
2026-02-01 20:15:19 +00:00
Andrej Karpathy
31b61d2d17
fix broken import sigh
2026-02-01 05:03:44 +00:00