aa530cdad5 | Andrej Karpathy | 2026-01-11 18:47:35 +00:00
Add learnable lambdas that gate the residual connection and a skip connection to the input embeddings; solid bump to val_bpb.
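The gated residual and embedding skip can be sketched with plain scalars. This is a toy illustration, not the actual nanochat code; the function and parameter names are assumptions, and real tensors/modules are reduced to floats and a lambda for brevity:

```python
def gated_block(x, x_emb, f, lam_resid, lam_skip):
    """One residual step with learnable scalar gates (toy sketch).

    x         -- current residual-stream value
    x_emb     -- the block's view of the original input embedding
    f         -- the block's transform (attention/MLP fused for brevity)
    lam_resid -- learnable gate on the residual connection
    lam_skip  -- learnable gate on the skip from the input embeddings
    """
    return lam_resid * x + lam_skip * x_emb + f(x)

# With lam_resid=1 and lam_skip=0 this reduces to a standard residual block.
out = gated_block(2.0, 3.0, lambda v: v * v, lam_resid=1.0, lam_skip=0.5)
# → 7.5  (1.0*2.0 + 0.5*3.0 + 2.0**2)
```

In the real model the lambdas would be registered as trainable parameters so the optimizer can learn how much of the residual stream and the raw embeddings to pass through each block.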
2c4473dd1b | Andrej Karpathy | 2026-01-11 16:56:59 +00:00
Big Muon optimizer changes inspired by the latest modded-nanogpt: added Polar Express, Adafactor-style variance reduction, cautious weight decay, and a linear weight-decay schedule that ramps down to zero. Tuned the optimum weight decay for multiple model sizes (d8, d12, d16, d20) and found a scaling law with optimum wd ∝ 1/channels^2, now included as the default in the code. --weight_decay of base_train is now on by default and configured optimally according to all of these experiments. A solid bump to val_bpb was observed as a result of these changes.
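The scaling law and the linear ramp-down are simple enough to state as one-liners. The reference point below (wd = 0.1 at 768 channels) is a made-up calibration for illustration, since the commit only states the proportionality:

```python
def optimal_weight_decay(channels, ref_channels=768, ref_wd=0.1):
    # optimum wd ∝ 1/channels^2, anchored at a hypothetical reference point
    return ref_wd * (ref_channels / channels) ** 2

def wd_schedule(step, total_steps, wd0):
    # linear ramp of weight decay down to zero over training
    return wd0 * (1.0 - step / total_steps)

wd = optimal_weight_decay(1536)  # → 0.025: doubling the width quarters the wd
```

The inverse-square form means the default transfers across the d8..d20 model sizes without retuning: pick the decay once at one width and scale it.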
061f83c152 | Andrej Karpathy | 2026-01-08 02:16:50 +00:00
Delete grad_clip; it appears not to be necessary at all. Not only was it buggy, because the clipping happened per GPU before gradient synchronization, but it costs ~2% MFU, and it doesn't even help. I tried deleting it a while ago and back then it did help, so I'm guessing that hyperparameter tuning since then has obviated the reason for it.
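A tiny numeric example shows why clipping per GPU before synchronization was a bug: clipping each rank's gradient and then averaging is not the same as clipping the averaged (synchronized) gradient. This is a stdlib sketch with 2-element gradients standing in for real tensors:

```python
import math

def clip(g, max_norm=1.0):
    # scale gradient g down so its L2 norm is at most max_norm
    scale = min(1.0, max_norm / math.hypot(*g))
    return [gi * scale for gi in g]

g1, g2 = [3.0, 0.0], [0.0, 4.0]  # per-rank gradients on 2 GPUs

# buggy order: clip on each GPU, then all-reduce (average)
buggy = [(a + b) / 2 for a, b in zip(clip(g1), clip(g2))]  # → [0.5, 0.5]

# correct order: all-reduce first, then clip the global gradient
avg = [(a + b) / 2 for a, b in zip(g1, g2)]  # [1.5, 2.0], norm 2.5
correct = clip(avg)                          # ≈ [0.6, 0.8]
```

The two results point in different directions, so per-rank clipping silently changes the effective update, not just its magnitude.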