Andrej Karpathy
|
21608ec51e
|
allow base_loss to report the loss of any arbitrary huggingface model, similar to base_eval. had to improve the dataloader so it just takes a tokenizer instead of loading the nanochat one. much better this way anyway
|
2026-01-12 03:10:13 +00:00 |
|
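A minimal sketch (not the actual base_loss script) of the idea: with a dataloader that only needs a tokenizer handed to it, the same loop can report the mean loss of any arbitrary HuggingFace causal LM. Function and variable names here are illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

@torch.no_grad()
def hf_mean_loss(model_name: str, texts: list[str], device: str = "cpu") -> float:
    # load any HF causal LM and its own tokenizer; nothing nanochat-specific is needed
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).to(device).eval()
    total_loss, total_tokens = 0.0, 0
    for text in texts:
        ids = tok(text, return_tensors="pt").input_ids.to(device)
        out = model(ids, labels=ids)          # HF shifts the labels internally
        n = ids.numel() - 1                   # number of predicted tokens
        total_loss += out.loss.item() * n
        total_tokens += n
    return total_loss / max(total_tokens, 1)  # mean cross-entropy in nats/token
```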
Andrej Karpathy
|
b33e394528
|
oops actually make SSSL the default window pattern
|
2026-01-11 21:50:35 +00:00 |
|
Andrej Karpathy
|
fbc1484e8c
|
add alternating window size patterns for the GPT layers, following GPT-3. Experimented a bit and found the pattern SSSL to work well: 3 short, 1 long, alternating. This is now the new default, and the flops vs. bpb plots look quite a bit better
|
2026-01-11 21:49:54 +00:00 |
|
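A hypothetical sketch of what an "SSSL" layer pattern means in practice: map each layer index onto a short or long attention window, cycling the pattern over the depth of the model. The concrete window sizes below are made up for illustration.

```python
def layer_window_sizes(n_layer: int, pattern: str = "SSSL",
                       short: int = 1024, long: int = 4096) -> list[int]:
    # cycle the pattern string over the layers: 'S' -> short window, 'L' -> long window
    sizes = {"S": short, "L": long}
    return [sizes[pattern[i % len(pattern)]] for i in range(n_layer)]

print(layer_window_sizes(8))  # [1024, 1024, 1024, 4096, 1024, 1024, 1024, 4096]
```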
Andrej Karpathy
|
aa530cdad5
|
Add learnable lambdas that gate the residual connection and a skip connection to the input embeddings; solid bump to val_bpb
|
2026-01-11 18:47:35 +00:00 |
|
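A toy illustration of the idea, assuming a standard pre-norm transformer block: two learnable scalars, one gating the residual stream and one gating a skip connection back to the input embeddings x0. This is a sketch of the mechanism, not nanochat's actual layer code.

```python
import torch
import torch.nn as nn

class GatedBlock(nn.Module):
    def __init__(self, block: nn.Module):
        super().__init__()
        self.block = block                                # attention/MLP sub-block (placeholder)
        self.resid_lambda = nn.Parameter(torch.ones(1))   # gates the residual stream
        self.x0_lambda = nn.Parameter(torch.zeros(1))     # gates the embedding skip connection

    def forward(self, x: torch.Tensor, x0: torch.Tensor) -> torch.Tensor:
        # x: current residual stream, x0: the original token embeddings
        x = self.resid_lambda * x + self.x0_lambda * x0
        return x + self.block(x)

blk = GatedBlock(nn.Sequential(nn.LayerNorm(16), nn.Linear(16, 16)))
x0 = torch.randn(2, 4, 16)
print(blk(x0.clone(), x0).shape)  # torch.Size([2, 4, 16])
```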
Andrej Karpathy
|
2c4473dd1b
|
Big Muon optimizer changes inspired by the latest modded-nanogpt: added Polar Express, Adafactor-style variance reduction, cautious weight decay, and a linear ramp of weight decay down to zero. Tuned the optimum weight decay for multiple model sizes (d8, d12, d16, d20) and found a scaling law with optimum wd ∝ 1/channels^2, which is now the default in the code. --weight_decay in base_train is now on by default and configured optimally according to these experiments. Solid bump to val_bpb observed as a result of these changes.
|
2026-01-11 16:56:59 +00:00 |
|
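A sketch of the two weight-decay pieces mentioned above: the optimum wd scaling as 1/channels^2 across model sizes, and the linear ramp of weight decay down to zero over training. The reference point (0.1 at 768 channels) is an arbitrary illustration, not a value taken from the repo.

```python
def default_weight_decay(n_channels: int, wd_ref: float = 0.1, ref_channels: int = 768) -> float:
    # scaling law from the sweeps: optimum wd is proportional to 1 / channels^2
    return wd_ref * (ref_channels / n_channels) ** 2

def wd_at_step(step: int, num_steps: int, wd0: float) -> float:
    # weight decay is scheduled to ramp linearly down to zero over training
    return wd0 * (1.0 - step / num_steps)

print(round(default_weight_decay(1024), 4))  # wider models get a smaller default wd
```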
Andrej Karpathy
|
061f83c152
|
delete grad_clip. appears not to be necessary at all. not only was it buggy because the clipping happened per GPU before grad synchronization, but it costs ~2% MFU, and it doesn't even help. I tried deleting it a while ago and back then it did help, so I'm guessing some hyperparameter tuning since then has obviated the reason for it
|
2026-01-08 02:16:50 +00:00 |
|
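For context on the bug described above, here is a minimal DDP-style training-step sketch (hypothetical model/optimizer objects): if clipping were kept, the global-norm clip would have to run after backward, when the gradients have already been all-reduced across ranks, rather than per GPU before synchronization. The commit removes clipping entirely instead.

```python
import torch

def train_step(model, optimizer, x, y, max_norm=None):
    loss = model(x, y)                  # hypothetical forward pass returning the loss
    loss.backward()                     # under DDP, gradients are all-reduced here
    if max_norm is not None:            # optional clip on the already-synchronized grads
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    return loss.item()
```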
Andrej Karpathy
|
ccf4b7f9bf
|
nudge hyperparameters of the base script based on the results of the sweeps and miniseries: vocab size down to 32K, D:N ratio from 20 to 8. add miniseries script
|
2026-01-07 22:11:59 +00:00 |
|
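For reference, the D:N ratio above is the tokens-to-parameters ratio of the training run; a tiny sketch of what changing it from 20 to 8 implies (the parameter count is just an example).

```python
def num_training_tokens(num_params: int, dn_ratio: float) -> int:
    # D:N ratio = training tokens per model parameter
    return int(dn_ratio * num_params)

n = 560_000_000  # example parameter count
print(num_training_tokens(n, 20))  # 11.2B tokens at the old ratio
print(num_training_tokens(n, 8))   # 4.48B tokens at the new ratio
```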
Andrej Karpathy
|
ae0bf52529
|
tune hyperparameters based on overnight sweeps. warmdown_ratio is the biggest free win, increasing from 0.2 to 0.4, and the embedding lr can be larger, bumped from 0.2 to 0.3
|
2026-01-05 18:57:46 +00:00 |
|
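A hypothetical sketch of the schedule that warmdown_ratio refers to: the learning-rate multiplier stays flat and then ramps linearly to zero over the final fraction of training (0.4 here). The warmup handling is illustrative.

```python
def lr_multiplier(step: int, num_steps: int,
                  warmup_ratio: float = 0.0, warmdown_ratio: float = 0.4) -> float:
    warmup = int(warmup_ratio * num_steps)
    warmdown = int(warmdown_ratio * num_steps)
    if step < warmup:                               # linear warmup (if any)
        return (step + 1) / max(warmup, 1)
    if step > num_steps - warmdown:                 # linear warmdown to zero
        return max(0.0, (num_steps - step) / max(warmdown, 1))
    return 1.0                                      # flat in the middle

print([round(lr_multiplier(s, 100), 2) for s in (0, 50, 80, 90)])  # [1.0, 1.0, 0.5, 0.25]
```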
Andrej Karpathy
|
9d4c9b786d
|
many small fixes to base_train: report an ETA, allow some additional kwarg flexibility, and make sure we don't crash when e.g. depth = 11: we now calculate the closest num_heads that works
|
2026-01-05 00:38:09 +00:00 |
|
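A sketch of the kind of fallback described for awkward depths like 11, where the derived model dimension doesn't divide evenly by the target head dim: pick the divisor of the model dim closest to the ideal head count. The relation model_dim = 64 * depth and the target head_dim of 128 are assumptions for illustration, not values from the repo.

```python
def closest_num_heads(model_dim: int, target_head_dim: int = 128) -> int:
    # ideal head count if model_dim divided evenly by the target head dim
    ideal = model_dim / target_head_dim
    # only divisors of model_dim give an integer head dim, so pick the closest one
    divisors = [h for h in range(1, model_dim + 1) if model_dim % h == 0]
    return min(divisors, key=lambda h: abs(h - ideal))

print(closest_num_heads(64 * 11))  # depth 11 -> model_dim 704, picks the nearest valid head count
```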
Andrej Karpathy
|
eb7bbc1b66
|
delete the configurator in favor of argparse and clean up a lot of kwarg details to make them more consistent across all scripts
|
2026-01-04 19:14:23 +00:00 |
|
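A minimal sketch of the argparse style that replaced the configurator; the flags shown are just examples of the kind of kwargs the scripts expose.

```python
import argparse

parser = argparse.ArgumentParser(description="base_train (illustrative sketch)")
parser.add_argument("--depth", type=int, default=12, help="number of transformer layers")
parser.add_argument("--weight_decay", type=float, default=0.0, help="optimizer weight decay")
parser.add_argument("--save_every", type=int, default=-1, help="checkpoint interval in steps (-1 = off)")
args = parser.parse_args()
print(vars(args))  # every script shares the same simple flag-style configuration
```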
Andrej Karpathy
|
48abd7d85f
|
simplify, clarify, and slightly tune model initialization. possibly very slightly better, but certainly a lot clearer
|
2026-01-01 21:15:09 +00:00 |
|
Andrej Karpathy
|
2874eda59a
|
update to new os env var to get rid of deprecation warning
|
2025-12-28 03:32:46 +00:00 |
|
Sanzo00
|
53b3a4fb81
|
fix: missing val_bpb on resume
|
2025-11-22 11:04:20 +08:00 |
|
Andrej Karpathy
|
c6abcdfe3a
|
big change: add pretraining resumption logic so that checkpoints can now be approximately resumed and training can continue. this is useful for very long runs when you don't want the anxiety of your run crashing for some reason; alternatively, it's a way to recover training in the event of loss spikes. i mean, this should have been there in v0 but it's ok. the resumption is approximate to control complexity and bloat, but it's possible we'll want to change that in the future. to use, set --save_every to a step interval at which to write checkpoints, and then use --resume_from_step to resume optimization from a given step. only base model training (pretraining) supports this atm, but that's ok because midtraining is comparatively quite a bit faster.
|
2025-11-13 15:34:40 +00:00 |
|
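A rough sketch of the save/resume mechanics described above: write a checkpoint every --save_every steps and reload one with --resume_from_step to continue optimization. The file layout and dict keys are hypothetical, and the resumption is approximate (e.g. the dataloader position is not exactly restored).

```python
import os
import torch

def save_checkpoint(out_dir: str, step: int, model, optimizer):
    os.makedirs(out_dir, exist_ok=True)
    torch.save({"step": step,
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict()},
               os.path.join(out_dir, f"ckpt_{step:06d}.pt"))

def load_checkpoint(out_dir: str, step: int, model, optimizer) -> int:
    ckpt = torch.load(os.path.join(out_dir, f"ckpt_{step:06d}.pt"), map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"]  # the training loop resumes from here
```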
Andrej Karpathy
|
c6b7ab7440
|
grad clip logging and printing and cosmetics
|
2025-11-05 21:08:30 +00:00 |
|
Andrej
|
dfc88334b6
|
fix tok/sec calculation bug when grad accum steps > 1
Fix tok/sec metrics for base_train and mid_train when gradient accumulation is not 1
|
2025-10-30 08:36:32 -07:00 |
|
svlandeg
|
8c9b004c99
|
typo fixes in scripts
|
2025-10-28 20:17:31 +01:00 |
|
water-vapor
|
a9de4b1038
|
Fix tok/sec metrics for base_train and mid_train when gradient accumulation is not 1
|
2025-10-26 01:43:49 -05:00 |
|
Andrej Karpathy
|
81597cd616
|
move the lr schedule args up in base_train so they are tunable in configurator
|
2025-10-24 13:27:31 +00:00 |
|
Andrej Karpathy
|
a088b7a6ec
|
use enable_gqa of pytorch sdpa, which allows us to delete some code; didn't realize it's available
|
2025-10-21 18:07:33 +00:00 |
|
Andrej Karpathy
|
5bdc99abfb
|
merge and resolve conflict
|
2025-10-21 17:19:10 +00:00 |
|
Andrej Karpathy
|
dfcb1c16f1
|
Merge branch 'master' into cpu-mps-dev
|
2025-10-21 17:15:53 +00:00 |
|
Andrej Karpathy
|
c1d2ed1c13
|
use orig_model in sampling, silly of me to miss this
|
2025-10-20 00:05:09 +00:00 |
|
Andrej Karpathy
|
2bc521a6de
|
use orig_model in sampling, silly of me to miss this
|
2025-10-20 00:04:15 +00:00 |
|
karpathy
|
df600b6ed5
|
many small tweaks. base, eval, core work now i think
|
2025-10-16 15:46:18 -07:00 |
|
karpathy
|
786119d593
|
add autodetect of device and related stuff. getting weird warnings/errors still, so wip
|
2025-10-16 10:26:19 -07:00 |
|
karpathy
|
279b74312c
|
adjust comment/guidance on device type
|
2025-10-16 10:06:39 -07:00 |
|
karpathy
|
306bc380ab
|
add support for CPU and for MPS. I had to change a few cosmetic things. I also discovered what I think is a bit of a bug, where I was casting wte to bfloat16 in the wrong place (the model init) instead of in init_weights
|
2025-10-16 10:04:43 -07:00 |
|
Andrej Karpathy
|
722da4f543
|
trying to add basic cpu support, will try mps too
|
2025-10-16 16:14:38 +00:00 |
|
karpathy
|
3a5e0bc50b
|
initial commit
|
2025-10-13 06:49:24 -07:00 |
|