Commit Graph

349 Commits

Author SHA1 Message Date
Junyang Chen
eecf352d95
Merge 767df6ef61 into 1076f97059 2026-03-05 10:51:50 +01:00
Andrej Karpathy
1076f97059 delete autocast, an unnecessary thorn in my side, manage dtypes directly 2026-03-04 23:55:30 +00:00
Sofie Van Landeghem
752abc836e
Ensure that inputs and targets are contiguous (#569)
* call reshape instead of view in case the tensors are not contiguous

* fix directly in data loader instead
2026-03-04 13:58:27 -08:00
Andrej Karpathy
4b4077425b Document new Leaderboard entry congrats @ddudek for pointing out ClimbMix, time to GPT-2 is now 2.01 hours, down from 2.76 previously 2026-03-04 20:02:07 +00:00
Andrej Karpathy
324e69c45d big, breaking change but large upside: swap previous FineWeb-EDU dataset to NVIDIA ClimbMix dataset. Requires people to download the data shards. The upside is that training GPT-2 capablity model now only takes ~2 hours, down from 2.76 hours, so this is a huge win data-wise 2026-03-04 19:47:12 +00:00
Andrej Karpathy
b07604ebaa document the legacy fineweb100b dataset and the new climbmix400b dataset 2026-03-03 17:24:31 +00:00
Andrej Karpathy
aba30cb037 tune logit softcap? 2026-03-03 00:38:53 +00:00
Anish
83dccc20ae
Restore completion-only loss masking in SFT dataloader (#582)
* printing steps count

* adding reply only loss for chat

* using the mask by render_conversation function of tokeniser

* undoing some changes

* putting back the comment which got removed accidently, no functionality change
2026-03-02 16:37:47 -08:00
Dipesh Babu
c7ba252142
docs: fix typos in experiment log (#547) 2026-02-20 08:03:45 -08:00
Junyang Chen
767df6ef61 dataloader: reuse cropped remainders to reduce token waste ~35% -> ~23%
When the BestFit-Crop algorithm crops a document to fill remaining row space,
the leftover tokens are currently discarded. This change puts the remainder
(with BOS prepended) back into the document buffer for future rows.

Simulation results at T=2048 with realistic document length distribution:
- Source token consumption reduced by ~15%
- Data efficiency improved by ~1.18x
- Estimated ~28 minutes saved on d24 speedrun (3.04h -> ~2.57h)

The change is minimal (6 lines in the crop branch) and preserves all existing
properties: BOS-aligned rows, 100% utilization, deterministic packing order.
2026-02-18 23:04:43 -08:00
Andrej Karpathy
2dffdc8cf6 document MoE exploration 2026-02-19 02:53:47 +00:00
Andrej Karpathy
48804bff3a report negative result on fineweb dataset 2026-02-18 23:45:31 +00:00
Andrej Karpathy
bb5137860e fix comment 2026-02-18 23:26:22 +00:00
Andrej Karpathy
458555117b Merge branch 'Chetter2-patch-1' 2026-02-18 23:17:39 +00:00
Andrej Karpathy
bac5a35dd7 fix minor bug in fp8 application to skip tiny matmuls 2026-02-18 23:17:29 +00:00
George Shakan
ad55575326 Fix bug in setting precision (#538) 2026-02-18 15:49:18 +00:00
Sofie Van Landeghem
cac43e8511 Fix MockModel's device definition (#535)
* fix MockModel's device definition

* cleanup
2026-02-18 15:49:18 +00:00
Andrej Karpathy
f5fe7925ed update dev log with recent 2026-02-18 15:49:18 +00:00
Andrej Karpathy
1415fb7617 tune the data mixture a bit, load optimizer by default when SFT. These were confirmed to be best settings from sweeps of sft 2026-02-18 15:49:18 +00:00
Andrej Karpathy
77f8fb8303 a number of upgrades to SFT script to bring it up to date w.r.t. pretraining and tuning some of its kwargs based on sweeps 2026-02-18 15:49:18 +00:00
George Shakan
0a23f87643
Fix bug in setting precision (#538) 2026-02-18 07:42:11 -08:00
Sofie Van Landeghem
4800c62f6e
Fix MockModel's device definition (#535)
* fix MockModel's device definition

* cleanup
2026-02-17 16:03:46 -08:00
Andrej Karpathy
4a6e47b0c6 update dev log with recent 2026-02-17 15:44:54 +00:00
Andrej Karpathy
8180e1d8c1 tune the data mixture a bit, load optimizer by default when SFT. These were confirmed to be best settings from sweeps of sft 2026-02-16 20:23:04 +00:00
Andrej Karpathy
788dadeb88 a number of upgrades to SFT script to bring it up to date w.r.t. pretraining and tuning some of its kwargs based on sweeps 2026-02-16 14:41:53 +00:00
Alan
124f49be98
Removed redundant qunatization of gradients 2026-02-15 15:41:33 +00:00
Alan
d9678ff0f9
Save FP8 tensors in autograd ctx instead of full-precision inputs
Store quantized input/weight and their inverse scales in _Float8Matmul ctx to avoid re-quantization in backward and reduce saved-activation memory without changing numerics.
2026-02-15 14:31:54 +00:00
Andrej Karpathy
2f09686724 clarify that this is bf16 mfu we're talking about 2026-02-10 23:35:00 +00:00
Andrej Karpathy
e569b59f92 delete torchao dependency, create our own exact API-matched version of Float8Linear, document it very well. for some poorly understood reason, the performance is not only ~identical but actually runs 3% faster. despite of it being significantly simpler and much less code. i don't fully understand why/how atm 2026-02-10 18:46:39 +00:00
Andrej Karpathy
1ec0a34779 at 28 and above we start to need batch size 8 2026-02-08 18:26:34 +00:00
Andrej Karpathy
ff46300720 tune miniseries just a bit, fairly cosmetic, keep to even depths where the math works out nicely in model sizing 2026-02-08 17:54:12 +00:00
Andrej Karpathy
aeff095e97 better comments/flow on all the hyperparameter transfer stuff, and change the WD scaling from my empirical 1/d^2 to a bit more principled version based on Tepoch. All of that theory is based on AdamW and could be suboptimal for Muon 2026-02-06 19:22:28 +00:00
Andrej Karpathy
685271dc8d new optimal ratio for d26 training 2026-02-06 19:21:27 +00:00
Andrej Karpathy
e527521a3f briefly mention batch ramp experimentation too, too weak to merge in my few attempts 2026-02-05 22:21:03 +00:00
Andrej Karpathy
96522798f1 docs docs docs 2026-02-05 20:27:07 +00:00
Andrej Karpathy
5fdd5cdb24 new leaderboard record via new auto-calculated optimal batch size. for d26 it is 1M, up from 0.5M that was default earlier 2026-02-05 20:11:32 +00:00
Andrej Karpathy
2c062aaa94 nit: don't mutate args, create new var for total_batch_size 2026-02-05 19:59:46 +00:00
Andrej Karpathy
f41dd3cbd7 auto-calculate optimal batch size. the original setting of 0.5M was only optimal for d12, but d26 prefers 1M and so on 2026-02-05 19:40:37 +00:00
Andrej Karpathy
98eed6df18 bring back an assert guarding against bad param sizing 2026-02-05 18:14:30 +00:00
Sofie Van Landeghem
012da1a78b
Typo fixes (#480)
* small typo

* few more small fixes

* small fixes in leaderboard.md
2026-02-05 19:12:50 +01:00
Andrej Karpathy
75b302f331 fix hash commit on leaderboard and a paragraph clarification 2026-02-05 16:14:28 +00:00
Andrej Karpathy
1144d186ed try and fail relu^2 -> swiglu 2026-02-05 02:42:46 +00:00
Andrej Karpathy
d63b7ab9ac try and fail relu^2 -> swiglu 2026-02-05 02:41:46 +00:00
Andrej Karpathy
718e5e9d67 correctly reference NorMuon and fix misleading terms that i may have hastily ported over from modded-nanogpt 2026-02-05 01:39:26 +00:00
Andrej Karpathy
542beb0c8c bump speedrun to be the up to date leaderboard run 2026-02-04 02:12:04 +00:00
Andrej Karpathy
d510b1385b quick experiments to log 2026-02-03 23:21:39 +00:00
Andrej Karpathy
16b8ac7da3 oops forgot to attach leaderboard file too 2026-02-03 21:06:12 +00:00
Andrej Karpathy
fe55b092b8 minor cosmetics for the table 2026-02-03 21:05:28 +00:00
Andrej Karpathy
a67eba35dc add feb2 new leaderboard record from upgrading to fp8 training, +4.3% speedup to time to GPT-2 2026-02-03 21:03:42 +00:00
Andrej Karpathy
6079f78fc3 add fp8 training with torchao 2026-02-03 21:03:42 +00:00