Andrej Karpathy | aa530cdad5 | 2026-01-11 18:47:35 +00:00
Add learnable lambdas that gate the residual connection and a skip connection to the input embeddings; solid bump to val_bpb.

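A minimal sketch of the gated-residual idea, using plain floats in place of tensors: one learnable scalar lambda scales the residual stream and a second gates a skip connection back to the input embeddings x0. All names here are hypothetical; in the real model the lambdas would be learnable parameters trained alongside the weights.

```python
# Hypothetical sketch: gate the residual with resid_lambda and add a
# skip_lambda-gated connection back to the input embeddings x0.
def gated_block(x, x0, block, resid_lambda, skip_lambda):
    # residual mix of the block input plus a gated skip to the embeddings
    return resid_lambda * x + block(x) + skip_lambda * x0

x0 = 1.0  # stand-in for the "input embedding"
h = gated_block(x0, x0, lambda v: 2.0 * v, resid_lambda=0.5, skip_lambda=0.1)
# 0.5*1.0 + 2.0*1.0 + 0.1*1.0 = 2.6
```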
Andrej Karpathy | 2c4473dd1b | 2026-01-11 16:56:59 +00:00
Big Muon optimizer changes inspired by the latest modded-nanogpt: add Polar Express, Adafactor-style variance reduction, cautious weight decay, and a linear schedule that ramps weight decay down to zero. Tuned the optimal weight decay at multiple model sizes (d8, d12, d16, d20) and found a scaling law with optimal wd proportional to 1/channels^2, now included as the default in the code. --weight_decay of base_train is now on by default and configured optimally according to all of these experiments. Solid bump to val_bpb observed as a result of these changes.

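The commit only states the proportionality, not the constant, so a sketch of the scaling law with an assumed constant k looks like this:

```python
# Sketch of the reported scaling law: optimal weight decay ~ 1/channels^2.
# The constant k is an assumption; the commit gives only the proportionality.
def optimal_weight_decay(channels, k=1e4):
    return k / channels ** 2

# doubling the channel count quarters the optimal weight decay
ratio = optimal_weight_decay(512) / optimal_weight_decay(1024)
```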
Andrej Karpathy | 061f83c152 | 2026-01-08 02:16:50 +00:00
Delete grad_clip; it appears not to be necessary at all. Not only was it buggy (the clipping happened per GPU, before gradient synchronization), it also costs ~2% MFU and doesn't even help. When I tried deleting it a while ago it did still help back then, so I'm guessing some hyperparameter tuning since has obviated the reason for it.

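A toy illustration (not the training code) of why the per-GPU ordering was a bug: clipping each rank's gradient and then averaging is not the same as averaging first and clipping the synchronized result.

```python
# Clip a scalar "gradient" to a maximum absolute value (toy 1-D norm clip).
def clip(g, max_norm):
    norm = abs(g)
    return g * (max_norm / norm) if norm > max_norm else g

rank_grads = [3.0, -1.0]   # per-GPU gradients for one parameter
max_norm = 1.0

# buggy: clip per rank, then average (what happened before grad sync)
buggy = sum(clip(g, max_norm) for g in rank_grads) / len(rank_grads)
# correct: average (synchronize) first, then clip the global gradient
correct = clip(sum(rank_grads) / len(rank_grads), max_norm)
```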
Andrej Karpathy | ccf4b7f9bf | 2026-01-07 22:11:59 +00:00
Nudge the hyperparameters of the base script with the results of the sweeps and miniseries: vocab size down to 32K, D:N ratio from 20 to 8. Add the miniseries script.

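Assuming the usual convention that D is training tokens and N is model parameters, the D:N ratio fixes the token budget per parameter; a sketch:

```python
# D:N = 8 means train on 8 tokens per model parameter (down from 20).
# The function name and convention are assumptions, not the script's API.
def training_tokens(num_params, dn_ratio=8):
    return dn_ratio * num_params

tokens = training_tokens(100_000_000)  # token budget for a 100M-param model
```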
Adria Blancafort | 1b5de29e71 | 2026-01-07 09:08:57 -08:00
Fix undefined variable in chat_rl after recent refactor
* Fix undefined variable
* Remove unused import 're' from chat_rl.py

Andrej Karpathy | ae0bf52529 | 2026-01-05 18:57:46 +00:00
Tune hyperparameters based on overnight sweeps. warmdown_ratio is the biggest free win, increasing 0.2 -> 0.4, and the embedding lr can be larger, bumping 0.2 -> 0.3.

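A hypothetical sketch of what a warmdown schedule controls: the lr multiplier stays at 1.0 for most of training, then ramps linearly to zero over the final warmdown_ratio fraction of steps (names and shape are assumptions, not the script's exact code).

```python
# warmdown_ratio = 0.4: the last 40% of steps decay linearly to zero.
def lr_multiplier(step, num_steps, warmdown_ratio=0.4):
    warmdown_start = num_steps * (1 - warmdown_ratio)
    if step < warmdown_start:
        return 1.0  # constant lr for the first (1 - warmdown_ratio) of training
    return (num_steps - step) / (num_steps - warmdown_start)

mults = [lr_multiplier(s, 100) for s in (0, 59, 80, 100)]
```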
Andrej Karpathy | 9d4c9b786d | 2026-01-05 00:38:09 +00:00
Many small fixes to base_train: report an ETA, allow some additional kwarg flexibility, and make sure we don't crash when e.g. depth = 11 by calculating the closest num_heads that works.

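One way the "closest num_heads that works" logic could look, as a sketch: the head count must divide the model dim, so pick the divisor nearest a target head count. The function name and the model_dim = 64 * depth convention are assumptions.

```python
# Pick the num_heads that divides model_dim and is closest to the head count
# implied by a target head dimension (assumed 64 here).
def closest_num_heads(model_dim, target_head_dim=64):
    target = max(1, model_dim // target_head_dim)
    candidates = [h for h in range(1, model_dim + 1) if model_dim % h == 0]
    return min(candidates, key=lambda h: abs(h - target))

heads = closest_num_heads(11 * 64)  # e.g. depth = 11 -> model_dim = 704
```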
Andrej Karpathy | eb7bbc1b66 | 2026-01-04 19:14:23 +00:00
Delete the configurator in favor of argparse and clean up a lot of kwarg details to make them more consistent across all scripts.

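A minimal sketch of the configurator-to-argparse direction: each script declares its kwargs with argparse so flags are typed, validated, and consistent. The flag names here are illustrative, not the scripts' exact interface.

```python
import argparse

# Declare kwargs as typed argparse flags (names are illustrative).
parser = argparse.ArgumentParser(description="base_train-style kwargs (sketch)")
parser.add_argument("--depth", type=int, default=12)
parser.add_argument("--weight_decay", type=float, default=0.0)

# Parse an explicit argv list for demonstration purposes.
args = parser.parse_args(["--depth", "20"])
```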
Andrej Karpathy | 48abd7d85f | 2026-01-01 21:15:09 +00:00
Simplify, clarify and slightly tune model initialization. Possibly very slightly better, but certainly a lot clearer.

helloaidank | 389d019a0b | 2025-12-31 12:57:26 -08:00
Small change to the docstring at the top of tok_train.py (#402)

Andrej | 088726aa7d | 2025-12-27 20:01:09 -08:00
Clean up model_tag handling across scripts a bit more.

Andrej Karpathy | 2874eda59a | 2025-12-28 03:32:46 +00:00
Update to the new os env var to get rid of a deprecation warning.

DU Wenjie | ea4229851b | 2025-12-26 19:02:12 +08:00
Bugfix

DU Wenjie | 7840049189 | 2025-12-26 17:29:08 +08:00
Bugfix: keep the same args style in scripts/base_eval.py

duwenjie | 92c6654b95 | 2025-12-21 15:07:04 +08:00
Bugfix: save and load checkpoints from the model_tag dir

Andrej | 39cccc527f | 2025-12-08 18:27:32 -08:00
Small bugfix: make the mid_train script work even with a tiny number of iterations.

Andrej | 8b1cecaa95 | 2025-12-08 18:27:06 -08:00
Apply suggestion from @svlandeg for a nicer looking comparison
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

Andrej | 58f3e84e01 | 2025-12-08 18:23:57 -08:00
Clean up the train/val loader in sft for consistency with mid/base.

Sanzo00 | 53b3a4fb81 | 2025-11-22 11:04:20 +08:00
Fix: missing val_bpb on resume

svlandeg | 4bcc3bb698 | 2025-11-21 13:19:45 +01:00
Clarify comment

Eric Silberstein | f37d45c21f | 2025-11-20 15:14:56 -05:00
Remove unneeded iter()

Eric Silberstein | dddb95caac | 2025-11-19 15:52:20 -05:00
Make the mid_train script work even with a tiny number of iterations

Andrej | 4763ce612a | 2025-11-14 07:25:59 -08:00
Small fixes to typos

svlandeg | a2fb3c83a6 | 2025-11-14 11:20:25 +01:00
Fix typos

Andrej Karpathy | c6abcdfe3a | 2025-11-13 15:34:40 +00:00
Big change: add pretraining resumption logic so that checkpoints can now be approximately resumed and training can continue. This is useful for very long runs where you don't want the anxiety of your run crashing for some reason; alternatively, it's a way to recover training in the event of loss spikes. This should have been there in v0, but it's ok. The resumption is approximate to control complexity and bloat, though it's possible we'll want to change that in the future. To use it, set --save_every to a step interval at which to write checkpoints, then use --resume_from_step to resume optimization from a given step. Only base model training (pretraining) supports this at the moment, but that's ok because midtraining is comparatively quite a bit faster.

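The save_every / resume_from_step control flow can be sketched as below; the function and argument names are hypothetical stand-ins for the actual training loop.

```python
# Sketch of approximate resumption: write a checkpoint every save_every
# steps, and on resume start the loop at resume_from_step.
def run(num_steps, save_every, resume_from_step=0):
    saved = []
    for step in range(resume_from_step, num_steps):
        # ... forward/backward/optimizer step would go here ...
        if save_every > 0 and step > 0 and step % save_every == 0:
            saved.append(step)  # stand-in for writing a checkpoint to disk
    return saved

checkpoints = run(num_steps=10, save_every=4)                   # fresh run
resumed = run(num_steps=10, save_every=4, resume_from_step=8)   # resumed run
```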
Andrej Karpathy | c6b7ab7440 | 2025-11-05 21:08:30 +00:00
Grad clip logging, printing, and cosmetics

svlandeg | 2ce62ec076 | 2025-11-03 21:52:02 +01:00
Ensure consistency of quotes within each statement

svlandeg | c72b8b2309 | 2025-11-03 21:27:12 +01:00
Add explicit UTF-8 encoding

Dipesh Babu | 226953b841 | 2025-11-03 01:20:56 -05:00
Fix: open JSONL and results CSV with UTF-8 encoding for portability

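The portability fix amounts to always passing encoding="utf-8" explicitly, since Python's default text encoding is platform-dependent (e.g. cp1252 on Windows). A self-contained sketch with an illustrative file name:

```python
import csv
import os
import tempfile

# Write and read a results CSV with an explicit UTF-8 encoding.
path = os.path.join(tempfile.mkdtemp(), "results.csv")
with open(path, "w", encoding="utf-8", newline="") as f:
    csv.writer(f).writerow(["task", "score"])
with open(path, "r", encoding="utf-8", newline="") as f:
    rows = list(csv.reader(f))
```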
svlandeg | 52e85aaf80 | 2025-11-02 13:41:13 +01:00
Merge branch 'master' into fix/typo

Andrej Karpathy | cf587acb1a | 2025-11-01 16:04:38 +00:00
Move the eval bundle download to be lazy and inside the Python code so that we can substantially simplify the run bash scripts.

Andrej Karpathy | 7d2c4a3d95 | 2025-11-01 15:28:30 +00:00
Delete the pandas dep in base_eval; use csv instead.

Andrej | dfc88334b6 | 2025-10-30 08:36:32 -07:00
Fix tok/sec calculation bug when grad accum steps > 1: tok/sec metrics for base_train and mid_train were wrong when gradient accumulation is not 1.

|
svlandeg
|
70319851fc
|
fix typo
|
2025-10-29 19:48:34 +01:00 |
|
svlandeg
|
8c9b004c99
|
typo fixes in scripts
|
2025-10-28 20:17:31 +01:00 |
|
water-vapor
|
a9de4b1038
|
Fix tok/sec metrics for base_train and mid_train when gradient accumulation is not 1
|
2025-10-26 01:43:49 -05:00 |
|
Andrej Karpathy
|
8892470f29
|
add the SpellingBee task so that nanochat can count r in strawberry etc. along the way we had to add a bunch of new functionality, e.g. extend the calculator to support the count function of python. possibly the current TaskMixture uses way too many synthetic examples of SpellingBee because the eval gives us exactly 100% performance on spelling. We can tune this later to reclaim some wall clock time here I think
|
2025-10-24 14:02:48 +00:00 |
|
Andrej Karpathy
|
81597cd616
|
move the lr schedule args up in base_train so they are tunable in configurator
|
2025-10-24 13:27:31 +00:00 |
|
Luke Stanley
|
defd1246aa
|
Fix Torch crash caused by pinning on CPU
|
2025-10-21 20:28:10 +00:00 |
|
Andrej Karpathy
|
a088b7a6ec
|
use enable_gqa of pytorch sdpa, allows us to delete some code, didnt realize it's available
|
2025-10-21 18:07:33 +00:00 |
|
Andrej Karpathy
|
5bdc99abfb
|
merge and resolve conflict
|
2025-10-21 17:19:10 +00:00 |
|
Andrej Karpathy
|
dfcb1c16f1
|
Merge branch 'master' into cpu-mps-dev
|
2025-10-21 17:15:53 +00:00 |
|
Andrej Karpathy
|
fe5aed940b
|
add personality to nanochat. breaks previous code on git pull and requires download of a new file from s3, but there is a helpful error message so hopefully its ok
|
2025-10-21 15:04:58 +00:00 |
|
karpathy
|
2e9669e03a
|
upgrading all other files to be able to use cpu/mps as well as cuda. various minor other changes ,e.g. changing max_iterations to num_iterations in sft script for consistency in naming
|
2025-10-20 10:15:17 -07:00 |
|
Andrej Karpathy
|
c1d2ed1c13
|
use orig_model in sampling, silly of me to miss this
|
2025-10-20 00:05:09 +00:00 |
|
Andrej Karpathy
|
2bc521a6de
|
use orig_model in sampling, silly of me to miss this
|
2025-10-20 00:04:15 +00:00 |
|
karpathy
|
ae02650afe
|
update the midtraining script too
|
2025-10-16 16:33:17 -07:00 |
|
karpathy
|
df600b6ed5
|
many small tweaks. base, eval, core work now i think
|
2025-10-16 15:46:18 -07:00 |
|
karpathy
|
786119d593
|
add autodetect of device and related stuff. getting weird warnings/errors still, so wip
|
2025-10-16 10:26:19 -07:00 |
|
karpathy
|
279b74312c
|
adjust comment/guidance on device type
|
2025-10-16 10:06:39 -07:00 |
|