Andrej Karpathy
1076f97059
delete autocast, an unnecessary thorn in my side, manage dtypes directly
2026-03-04 23:55:30 +00:00
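The commit above replaces implicit autocast with explicit dtype management. A minimal sketch of the idea (assuming PyTorch; the module here is illustrative, not the repo's code): keep the layer's weights in bf16 so matmuls run in bf16 by construction, and cast explicitly where fp32 is wanted, instead of relying on `torch.autocast`'s per-op dispatch.

```python
import torch
import torch.nn as nn

# Illustrative sketch: instead of wrapping the forward pass in
# torch.autocast, hold the parameters in bfloat16 directly and make
# every dtype transition an explicit cast in the code.
linear = nn.Linear(8, 8).to(dtype=torch.bfloat16)

x = torch.randn(2, 8, dtype=torch.bfloat16)
y = linear(x)                    # matmul runs in bf16 by construction
loss = y.float().pow(2).mean()   # reduction done explicitly in fp32
```

The upside is that every cast is visible at the call site, so there is no ambient context manager silently deciding dtypes.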
Andrej Karpathy
324e69c45d
big, breaking change but large upside: swap the previous FineWeb-EDU dataset for the NVIDIA ClimbMix dataset. Requires people to download the data shards. The upside is that training a GPT-2 capability model now takes only ~2 hours, down from 2.76 hours, so this is a huge win data-wise
2026-03-04 19:47:12 +00:00
Andrej Karpathy
aba30cb037
tune logit softcap?
2026-03-03 00:38:53 +00:00
Dipesh Babu
c7ba252142
docs: fix typos in experiment log ( #547 )
2026-02-20 08:03:45 -08:00
Andrej Karpathy
2dffdc8cf6
document MoE exploration
2026-02-19 02:53:47 +00:00
Andrej Karpathy
48804bff3a
report negative result on fineweb dataset
2026-02-18 23:45:31 +00:00
Andrej Karpathy
4a6e47b0c6
update dev log with recent changes
2026-02-17 15:44:54 +00:00
Andrej Karpathy
e527521a3f
briefly mention the batch size ramp experimentation as well; it was too weak to merge in my few attempts
2026-02-05 22:21:03 +00:00
Andrej Karpathy
f41dd3cbd7
auto-calculate optimal batch size. the original setting of 0.5M was only optimal for d12, but d26 prefers 1M and so on
2026-02-05 19:40:37 +00:00
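The commit above notes that the optimal batch size grows with model depth (0.5M tokens for d12, 1M for d26). A toy helper that just linearly interpolates those two operating points from the message; the actual rule in the repo is not shown here and may well differ:

```python
def optimal_batch_tokens(depth: int) -> int:
    """Illustrative only: linearly interpolate the two data points
    mentioned in the commit message (d12 -> 0.5M tokens, d26 -> 1M
    tokens). The repo's real auto-calculation may use another formula."""
    d0, t0 = 12, 524_288      # ~0.5M tokens, stated optimal for d12
    d1, t1 = 26, 1_048_576    # ~1M tokens, stated optimal for d26
    t = t0 + (t1 - t0) * (depth - d0) / (d1 - d0)
    return int(t)
```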
Andrej Karpathy
1144d186ed
try and fail relu^2 -> swiglu
2026-02-05 02:42:46 +00:00
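For reference, the two activations being compared in the commit above, written scalar-wise in plain Python (the repo's versions operate on tensors; SwiGLU takes a separate gate projection, here just a second argument):

```python
import math

def relu2(x: float) -> float:
    """ReLU^2, the activation the repo keeps: max(x, 0) squared."""
    return max(x, 0.0) ** 2

def swiglu(x: float, g: float) -> float:
    """SwiGLU-style gated unit, the attempted replacement:
    value branch x scaled by SiLU of the gate, SiLU(g) = g * sigmoid(g)."""
    return x * (g / (1.0 + math.exp(-g)))
```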
Andrej Karpathy
d63b7ab9ac
try and fail relu^2 -> swiglu
2026-02-05 02:41:46 +00:00
Andrej Karpathy
718e5e9d67
correctly reference NorMuon and fix misleading terms that i may have hastily ported over from modded-nanogpt
2026-02-05 01:39:26 +00:00
Andrej Karpathy
d510b1385b
quick experiments to log
2026-02-03 23:21:39 +00:00
Andrej Karpathy
a67eba35dc
add feb2 new leaderboard record from upgrading to fp8 training, +4.3% speedup to time to GPT-2
2026-02-03 21:03:42 +00:00
Andrej Karpathy
ebd4d9bbf5
tried muonh, appealing but didn't work out of the box
2026-01-29 19:01:36 +00:00
Andrej Karpathy
74554be3b5
revert engram, not seeing an improvement at larger scale
2026-01-28 20:07:39 +00:00
Sofie Van Landeghem
d5418ea5a1
Fix link to DeepSeek Engram paper ( #470 )
...
* Fix link to DeepSeek Engram paper in LOG.md
Updated link to the DeepSeek Engram paper in the log.
* remove www
2026-01-28 08:31:44 -08:00
Andrej Karpathy
c8d93beed2
add engram-lite, add log, tune scaling laws analysis scripts
2026-01-27 22:31:17 +00:00
Andrej Karpathy
85b3e95e09
320 experiments just to tune the adam beta1 of x0 a little bit up from 0.8 to 0.96
2026-01-25 00:04:02 +00:00
Andrej Karpathy
d58fcd9d73
log for jan 17
2026-01-18 03:01:17 +00:00
Andrej Karpathy
1933e85046
brief update to log
2026-01-17 00:25:50 +00:00
Andrej Karpathy
184d4c12b1
also add to log about the FA3 changes
2026-01-16 18:25:04 +00:00
Andrej Karpathy
fbf2bbea25
update log with a bunch of attempts
2026-01-16 02:21:17 +00:00
Andrej Karpathy
747ed4491f
add negative result on olmo3 pretraining mix
2026-01-16 00:44:01 +00:00
Andrej Karpathy
f92efce169
add negative result about not allowing attention across BOS tokens. A lot more code complexity for basically no gain in performance
2026-01-13 21:33:54 +00:00
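The experiment above masks attention so it cannot cross BOS boundaries, i.e. a causal mask that is additionally block-diagonal per document. A small illustrative construction (bos=0 is a made-up token id; the repo builds this on tensors, not lists):

```python
def doc_causal_mask(tokens, bos=0):
    """Sketch of the attempted masking: position i may attend to j only
    if j <= i (causal) AND both positions belong to the same document,
    documents being delimited by BOS tokens."""
    n = len(tokens)
    doc_id, d = [], -1
    for t in tokens:
        if t == bos:
            d += 1          # a BOS starts a new document
        doc_id.append(d)
    return [[j <= i and doc_id[i] == doc_id[j] for j in range(n)]
            for i in range(n)]
```

As the commit reports, the extra bookkeeping bought essentially no performance.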
Andrej Karpathy
43c29dd9d5
Big DataLoader refactor: BOS-aligned dataloaders with epoch tracking for pre/mid-training
...
The new DataLoader ensures that every token sequence in train/val batches has a BOS token
at the beginning. Therefore, no token streams start abruptly in the middle of a document,
which could be confusing for the model. Note that this changes the loss scale because there
are fewer confusing tokens in the train/val batches. The main downside is that we now waste
about 35% of tokens due to cropping. This is ok because we have a lot of data. See dev/LOG.md
entry for this change for a lot more information.
2026-01-13 20:05:47 +00:00
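The cropping described in the commit body above can be sketched as follows (illustrative only, assuming BOS token id 0; the real dataloader also handles shards, epochs, and batching): emit fixed-length rows that each start at a BOS, and count the tokens skipped while seeking the next BOS as waste.

```python
BOS = 0  # illustrative BOS token id

def bos_aligned_rows(tokens, T):
    """Sketch of BOS-aligned packing: every emitted row of length T
    starts at a BOS token; tokens skipped while seeking the next BOS
    (the cropped tails of documents) are counted as waste."""
    rows, wasted, i = [], 0, 0
    n = len(tokens)
    while i < n:
        if tokens[i] != BOS:      # mid-document: crop until next BOS
            wasted += 1
            i += 1
            continue
        if i + T > n:             # not enough tokens left for a full row
            wasted += n - i
            break
        rows.append(tokens[i:i + T])
        i += T
    return rows, wasted
```

This makes the ~35% waste figure concrete: it is the fraction of tokens dropped between the end of one row and the next document start.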
Andrej Karpathy
64b48d0e5c
validated that \p{N}{1,2} is the correct number of digits to group up to in the regex pattern of the GPT-4 tokenizer (2 down from 3), leading to the best val_bpb for 32K vocabs
2026-01-13 17:45:06 +00:00
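The `\p{N}{1,2}` piece above controls how the pretokenizer splits digit runs. A stand-in demo using the stdlib `re` module with ASCII `\d` (the stdlib does not support `\p{N}`; the third-party `regex` package, which tokenizers typically use, does):

```python
import re

# \p{N}{1,2} from the commit, approximated with ASCII \d for the demo.
digit_groups = re.compile(r"\d{1,2}")

def split_digits(s: str):
    """Split a digit run into groups of at most two, left to right."""
    return digit_groups.findall(s)
```

With `{1,3}` (the old setting) the string "12345" would split as 123|45; with `{1,2}` it splits as 12|34|5.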
Andrej Karpathy
238353c998
document my struggle with fp8 integration yesterday, it's not working like i thought it would and i suffered. one day i will return to continue the fight.
2026-01-13 17:14:29 +00:00
Andrej Karpathy
4610a838a1
record negative result on MTP
2026-01-12 05:23:47 +00:00
Andrej Karpathy
fbc1484e8c
add alternating window size patterns for the GPT layers, following GPT-3. Experimented a bit and found the pattern SSSL to work well - 3 short, 1 long alternating. This is now the new default and the plots look quite a bit better on flops vs. bpb
2026-01-11 21:49:54 +00:00
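The SSSL assignment above is simple to state in code. The specific window sizes here (1024/4096) are assumptions for illustration, not the repo's values:

```python
def window_pattern(num_layers: int, short: int = 1024, long: int = 4096):
    """Illustrative SSSL assignment: repeating blocks of 3 short-window
    layers followed by 1 long-window layer, as in the commit message."""
    pattern = [short, short, short, long]
    return [pattern[i % 4] for i in range(num_layers)]
```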
Andrej Karpathy
2ff7d51252
integrate Flash Attention 3. +9% tok_per_sec for d12 out of the box, even with ctx as low as 2048, nice. Also, this sets up tuning the attention window sizes, which is huge
2026-01-11 20:33:19 +00:00
Andrej Karpathy
aa530cdad5
Add learnable lambdas that gate the residual connection and a skip connection to the input embeddings, solid bump to val_bpb
2026-01-11 18:47:35 +00:00
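A minimal sketch of the two learnable gates described above (not the repo's exact module; the names and initial values are assumptions): a scalar `resid_lambda` scales the residual branch, and a scalar `skip_lambda` scales a skip connection back to the input embeddings `x0`.

```python
import torch
import torch.nn as nn

class GatedBlock(nn.Module):
    """Sketch of a block with learnable lambdas: one gating the residual
    branch, one gating a skip connection to the input embeddings x0."""
    def __init__(self, dim: int):
        super().__init__()
        self.mlp = nn.Linear(dim, dim)
        self.resid_lambda = nn.Parameter(torch.ones(1))   # gates f(x)
        self.skip_lambda = nn.Parameter(torch.zeros(1))   # gates x0 skip

    def forward(self, x: torch.Tensor, x0: torch.Tensor) -> torch.Tensor:
        # x0 is the embedding activation saved at the bottom of the stack
        return x + self.resid_lambda * self.mlp(x) + self.skip_lambda * x0
```

Because both lambdas are `nn.Parameter`s, the optimizer learns how strongly each layer uses its residual branch and the raw embeddings.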
Andrej Karpathy
2c4473dd1b
Big Muon optimizer changes inspired by latest of modded-nanogpt. Added Polar Express, Adafactor-style variance reduction, cautious weight decay, schedule weight decay linearly to ramp down to zero. Tuned optimum weight decay for multiple model sizes d8, d12, d16, d20 and found a scaling law with optimum wd \propto 1/channels^2, including it as default into code. --weight_decay of base_train is now default on and configured optimally according to all of these experiments. Solid bump to val_bpb observed as a result of these changes.
2026-01-11 16:56:59 +00:00
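The scaling law above, wd ∝ 1/channels², turns into a one-line rule once anchored at a reference point. The anchor used here (768 channels → wd 0.1) is a made-up illustration, not the repo's fitted constant:

```python
def optimal_weight_decay(channels: int, ref_channels: int = 768,
                         ref_wd: float = 0.1) -> float:
    """Illustrative wd ~ 1/channels^2 rule from the commit message.
    The reference point (768 channels -> wd 0.1) is an assumption."""
    return ref_wd * (ref_channels / channels) ** 2
```

Doubling the channel count quarters the optimal weight decay under this rule.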
Andrej Karpathy
061f83c152
delete grad_clip; it appears not to be necessary at all. Not only was it buggy (the clipping happened per GPU, before grad synchronization), it also costs ~2% MFU and doesn't even help. I tried deleting it a while ago and back then it did help, so I'm guessing some hyperparameter tuning since then has obviated the reason for it
2026-01-08 02:16:50 +00:00
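The bug mentioned above is worth spelling out: clipping each GPU's shard before the gradient all-reduce clips by the wrong (local) norm. The correct order, sketched in plain Python with lists standing in for per-worker gradients, is to reduce first and clip by the global norm:

```python
import math

def clip_after_sync(grad_shards, max_norm):
    """Sketch of clipping done in the correct order: first combine
    (all-reduce) the gradient shards from all workers, then rescale by
    the *global* norm. Clipping each shard first uses the wrong norm."""
    # simulate the all-reduce: elementwise sum across workers
    summed = [sum(vals) for vals in zip(*grad_shards)]
    total_norm = math.sqrt(sum(g * g for g in summed))
    scale = min(1.0, max_norm / total_norm) if total_norm > 0 else 1.0
    return [g * scale for g in summed]
```

(Per the commit, the repo's resolution was to delete clipping entirely rather than fix the order.)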