Commit Graph

66 Commits

Author SHA1 Message Date
Andrej Karpathy
7312ec9898 fix buggy midtrain and update all kwargs to be idiomatic; that is, argparse flags use dashes and variables use underscores. The underscores in the flags were just a remnant of the previous Configurator object. This is the right way 2026-01-13 22:45:27 +00:00
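For illustration, a minimal sketch of the flag convention described above (the flag names here are hypothetical, not nanochat's actual CLI): argparse accepts dash-separated flags and exposes them as underscore-separated attributes.

```python
# Hypothetical flags, just to show the dash/underscore convention argparse applies.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--weight-decay", type=float, default=0.0)   # dash in the flag...
parser.add_argument("--num-iterations", type=int, default=1000)

args = parser.parse_args(["--weight-decay", "0.1"])
print(args.weight_decay)    # ...underscore in the attribute: 0.1
print(args.num_iterations)  # 1000
```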
Andrej Karpathy
3b50b77ed3 fix base_loss to report correct loss by switching the dataloader to the new default 2026-01-13 22:09:36 +00:00
Andrej Karpathy
43c29dd9d5 Big DataLoader refactor: BOS-aligned dataloaders with epoch tracking for pre/mid-training
The new DataLoader ensures that every token sequence in train/val batches has a BOS token
at the beginning. Therefore, no token streams start abruptly in the middle of a document,
which could be confusing for the model. Note that this changes the loss scale because there
are fewer confusing tokens in the train/val batches. The main downside is that we now waste
about 35% of tokens due to cropping. This is ok because we have a lot of data. See dev/LOG.md
entry for this change for a lot more information.
2026-01-13 20:05:47 +00:00
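A minimal sketch of the packing idea described in this commit (assumed, not the actual nanochat DataLoader): every row starts at a document boundary with a BOS token, and a document that does not fit in the remaining space is cropped, which is where the roughly 35% token waste comes from.

```python
# Sketch only: pack whole documents into fixed-length rows that always begin on a BOS.
def bos_aligned_rows(docs, bos_id, row_len):
    """docs: iterable of token-id lists (without BOS). Yields rows of exactly row_len tokens."""
    row = []
    for doc in docs:
        tokens = [bos_id] + doc
        space = row_len - len(row)
        if len(tokens) <= space:
            row.extend(tokens)            # the whole document fits; keep packing
        else:
            row.extend(tokens[:space])    # crop the document to fill the row (wasted tokens)
        if len(row) == row_len:
            yield row                     # the next row starts fresh on a BOS
            row = []
```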
Andrej Karpathy
21608ec51e allow base_loss to report the loss of any arbitrary huggingface model, similar to base_eval. Had to change the dataloader to be a lot better and just take a tokenizer, not load the nanochat one. Much better this way anyway 2026-01-12 03:10:13 +00:00
Andrej Karpathy
b33e394528 oops actually make SSSL the default window pattern 2026-01-11 21:50:35 +00:00
Andrej Karpathy
fbc1484e8c add alternating window size patterns for the GPT layers, following GPT-3. Experimented a bit and found the pattern SSSL to work well - 3 short, 1 long alternating. This is now the new default and the plots look quite a bit better on flops vs. bpb 2026-01-11 21:49:54 +00:00
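A minimal sketch of how such a pattern can be applied per layer (the window sizes 512/4096 are made-up placeholders; only the SSSL pattern itself comes from the commit):

```python
# Tile the "SSSL" pattern over the layer stack: 3 short-window layers, then 1 long-window layer.
def layer_window_sizes(n_layer, pattern="SSSL", short_window=512, long_window=4096):
    size = {"S": short_window, "L": long_window}
    return [size[pattern[i % len(pattern)]] for i in range(n_layer)]

print(layer_window_sizes(8))  # [512, 512, 512, 4096, 512, 512, 512, 4096]
```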
Andrej Karpathy
aa530cdad5 Add learnable lambdas that gate the residual connection and a skip connection to the input embeddings; solid bump to val_bpb 2026-01-11 18:47:35 +00:00
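A minimal sketch of the gating idea (module and parameter names are assumptions, not nanochat's actual GPT code): each block carries learnable scalars, one scaling the residual stream and one mixing in a skip connection from the original token embeddings.

```python
import torch
import torch.nn as nn

class GatedBlock(nn.Module):
    """Wraps a transformer block with learnable gates (illustrative only)."""
    def __init__(self, block):
        super().__init__()
        self.block = block                                 # attention + MLP block
        self.resid_lambda = nn.Parameter(torch.ones(1))    # gates the residual stream
        self.embed_lambda = nn.Parameter(torch.zeros(1))   # gates the embedding skip

    def forward(self, x, x_embed):
        # x: current hidden states; x_embed: the input token embeddings
        x = self.resid_lambda * x + self.embed_lambda * x_embed
        return x + self.block(x)
```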
Andrej Karpathy
2c4473dd1b Big Muon optimizer changes inspired by the latest modded-nanogpt. Added Polar Express, Adafactor-style variance reduction, cautious weight decay, and a linear schedule that ramps weight decay down to zero. Tuned the optimum weight decay for multiple model sizes (d8, d12, d16, d20) and found a scaling law with optimal wd ∝ 1/channels^2, which is now included as the default in the code. --weight_decay of base_train is now on by default and configured optimally according to all of these experiments. Solid bump to val_bpb observed as a result of these changes. 2026-01-11 16:56:59 +00:00
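A minimal sketch of what such a width-based default could look like (the anchor constants below are invented for illustration; only the 1/channels^2 shape comes from the commit):

```python
# Pick a default weight decay from model width instead of hand-tuning it per size.
def default_weight_decay(n_channels, ref_channels=768, ref_wd=0.1):
    # assumed anchor point (ref_channels, ref_wd); optimal wd scales as 1 / channels^2
    return ref_wd * (ref_channels / n_channels) ** 2

for width in (512, 768, 1024, 1280):
    print(width, round(default_weight_decay(width), 4))
```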
Andrej Karpathy
061f83c152 delete grad_clip. It appears not to be necessary at all: not only was it buggy (the clipping happened per GPU, before gradient synchronization), it also costs ~2% MFU and doesn't even help. I tried deleting it a while ago and back then it did help, so I'm guessing that some hyperparameter tuning has since obviated the reason for it 2026-01-08 02:16:50 +00:00
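For context on the bug described above, a minimal sketch of the ordering that clipping would require if it were kept (not nanochat's training loop; it assumes gradients are all-reduced manually inside an initialized process group): the norm has to be computed after the gradients are synchronized, otherwise each rank clips against a different local norm.

```python
import torch
import torch.distributed as dist

def clipped_step(model, optimizer, loss, max_norm=1.0):
    loss.backward()
    # sync first: average gradients across ranks (assumes dist.init_process_group was called)
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.AVG)
    # only now does clipping see the same (global) gradient on every rank
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
```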
Andrej Karpathy
ccf4b7f9bf nudge hyperparameters of the base script with the results of the sweeps and miniseries. vocab size down to 32K. D:N ratio from 20 to 8. add miniseries script 2026-01-07 22:11:59 +00:00
Adria Blancafort
1b5de29e71
Fix undefined variable in chat_rl after recent refactor
* Fix undefined variable
* Remove unused import 're' from chat_rl.py
2026-01-07 09:08:57 -08:00
Andrej Karpathy
ae0bf52529 tune hyperparameters based on overnight sweeps. warmdown_ratio is the biggest free win, increasing 0.2 -> 0.4, and the embedding lr can be larger, bumping 0.2 -> 0.3 2026-01-05 18:57:46 +00:00
Andrej Karpathy
9d4c9b786d many small fixes to base_train: reporting ETA, allowing some additional kwarg flexibility, making sure we don't crash when e.g. depth = 11 - we now calculate the closest num_heads that works 2026-01-05 00:38:09 +00:00
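A minimal sketch of the head-count fallback mentioned above (the width formula and target head size are assumptions for illustration; only the "calculate the closest num_heads that works" behavior comes from the commit):

```python
# If the model width isn't divisible by the preferred head size, choose the nearest
# head count that divides the width evenly instead of crashing.
def closest_num_heads(model_dim, target_head_dim=128):
    target = max(1, round(model_dim / target_head_dim))
    divisors = [h for h in range(1, model_dim + 1) if model_dim % h == 0]
    return min(divisors, key=lambda h: abs(h - target))

print(closest_num_heads(704))  # a width of 704 -> 4 heads with these assumed constants
```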
Andrej Karpathy
eb7bbc1b66 delete the configurator in favor of argparse and clean up a lot of kwarg details to make them more consistent across all scripts 2026-01-04 19:14:23 +00:00
Andrej Karpathy
48abd7d85f simplify, clarify and slightly tune model initialization. should be very slightly better possibly, but certainly a lot clearer 2026-01-01 21:15:09 +00:00
helloaidank
389d019a0b
small change to doc string at top of tok_train.py (#402) 2025-12-31 12:57:26 -08:00
Andrej
088726aa7d
clean up model_tag handling across scripts a bit more. 2025-12-27 20:01:09 -08:00
Andrej Karpathy
2874eda59a update to new os env var to get rid of deprecation warning 2025-12-28 03:32:46 +00:00
DU Wenjie
ea4229851b bugfix 2025-12-26 19:02:12 +08:00
DU Wenjie
7840049189 bugfix: keep the same args style in scripts/base_eval.py 2025-12-26 17:29:08 +08:00
duwenjie
92c6654b95 bugfix: save and load ckpt from the model_tag dir 2025-12-21 15:07:04 +08:00
Andrej
39cccc527f
small bugfix make mid_train script work even with a tiny number of iterations 2025-12-08 18:27:32 -08:00
Andrej
8b1cecaa95
Apply suggestion from @svlandeg for nicer looking comparison
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2025-12-08 18:27:06 -08:00
Andrej
58f3e84e01
clean up train/val loader in sft for consistency with mid/base 2025-12-08 18:23:57 -08:00
Sanzo00
53b3a4fb81 fix: missing val_bpb on resume 2025-11-22 11:04:20 +08:00
svlandeg
4bcc3bb698 clarify comment 2025-11-21 13:19:45 +01:00
Eric Silberstein
f37d45c21f remove unneeded iter() 2025-11-20 15:14:56 -05:00
Eric Silberstein
dddb95caac make mid_train script work even with a tiny number of iterations 2025-11-19 15:52:20 -05:00
Andrej
4763ce612a
Small fixes to typos 2025-11-14 07:25:59 -08:00
svlandeg
a2fb3c83a6 fix typos 2025-11-14 11:20:25 +01:00
Andrej Karpathy
c6abcdfe3a big change: add pretraining resumption logic so that checkpoints can now be approximately resumed and training can continue. This is useful for very long runs when you don't want the anxiety of your run crashing for some reason; alternatively, it's a way to recover training in the event of loss spikes. I mean, this should have been there in v0, but it's ok. The resumption is approximate to control complexity and bloat, but it's possible we'll want to change that in the future. To use, set --save_every to a step interval to write checkpoints with, and then use --resume_from_step to resume optimization from a given step. Only base model training (pretraining) supports this atm, but that's ok because midtraining is comparatively quite a bit faster. 2025-11-13 15:34:40 +00:00
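A minimal sketch of how these two flags could drive a training loop (the checkpoint layout, file names, and helper are assumptions; only the --save_every / --resume_from_step flags come from the commit):

```python
import os
import torch

def train(model, optimizer, num_iterations, save_every=0, resume_from_step=0, ckpt_dir="checkpoints"):
    if resume_from_step > 0:
        # approximate resumption: restore model and optimizer state saved at that step
        state = torch.load(os.path.join(ckpt_dir, f"step_{resume_from_step:06d}.pt"))
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
    os.makedirs(ckpt_dir, exist_ok=True)
    for step in range(resume_from_step, num_iterations):
        ...  # forward / backward / optimizer.step() elided
        if save_every and step > 0 and step % save_every == 0:
            torch.save(
                {"step": step, "model": model.state_dict(), "optimizer": optimizer.state_dict()},
                os.path.join(ckpt_dir, f"step_{step:06d}.pt"),
            )
```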
Andrej Karpathy
c6b7ab7440 grad clip logging and printing and cosmetics 2025-11-05 21:08:30 +00:00
svlandeg
2ce62ec076 ensure consistency of quotes within each statement 2025-11-03 21:52:02 +01:00
svlandeg
c72b8b2309 add explicit UTF-8 encoding 2025-11-03 21:27:12 +01:00
Dipesh Babu
226953b841 fix: open JSONL and results CSV with UTF-8 encoding for portability 2025-11-03 01:20:56 -05:00
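A minimal sketch of the portability fix from the two UTF-8 commits above (the file name is hypothetical): pass encoding="utf-8" explicitly so reads and writes do not depend on the platform's default locale encoding.

```python
import json

# hypothetical results file, written and read back with an explicit encoding
with open("results.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps({"task": "spelling", "score": 0.5}) + "\n")

with open("results.jsonl", "r", encoding="utf-8") as f:
    rows = [json.loads(line) for line in f]
print(rows)
```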
svlandeg
52e85aaf80 Merge branch 'master' into fix/typo 2025-11-02 13:41:13 +01:00
Andrej Karpathy
cf587acb1a move eval bundle download to be lazy and inside the python code so that we can substantially simplify the run bash scripts 2025-11-01 16:04:38 +00:00
Andrej Karpathy
7d2c4a3d95 delete pandas dep in base_eval, use csv instead 2025-11-01 15:28:30 +00:00
Andrej
dfc88334b6
fix tok/sec calculation bug when grad accum steps > 1
Fix tok/sec metrics for base_train and mid_train when gradient accumulation is not 1
2025-10-30 08:36:32 -07:00
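A minimal sketch of the corrected accounting (variable names are assumptions): one optimizer step processes batch_size * seq_len tokens per micro-batch, times the number of gradient accumulation steps, times the number of ranks; counting only a single micro-batch under-reports throughput whenever grad accum > 1.

```python
def tokens_per_second(batch_size, seq_len, grad_accum_steps, world_size, step_time_s):
    # tokens processed by one full optimizer step across all ranks and micro-batches
    tokens_per_step = batch_size * seq_len * grad_accum_steps * world_size
    return tokens_per_step / step_time_s

print(tokens_per_second(batch_size=32, seq_len=2048, grad_accum_steps=4, world_size=8, step_time_s=1.5))
```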
svlandeg
70319851fc fix typo 2025-10-29 19:48:34 +01:00
svlandeg
8c9b004c99 typo fixes in scripts 2025-10-28 20:17:31 +01:00
water-vapor
a9de4b1038 Fix tok/sec metrics for base_train and mid_train when gradient accumulation is not 1 2025-10-26 01:43:49 -05:00
Andrej Karpathy
8892470f29 add the SpellingBee task so that nanochat can count the r's in strawberry etc. Along the way we had to add a bunch of new functionality, e.g. extending the calculator to support Python's count function. Possibly the current TaskMixture uses way too many synthetic examples of SpellingBee, because the eval gives us exactly 100% performance on spelling. We can tune this later to reclaim some wall clock time, I think 2025-10-24 14:02:48 +00:00
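For illustration, the kind of counting the calculator tool now needs to support boils down to Python's built-in str.count (the wrapper below is a hypothetical example, not nanochat's tool interface):

```python
def count(haystack: str, needle: str) -> int:
    # e.g. the canonical "how many r's in strawberry" question
    return haystack.count(needle)

print(count("strawberry", "r"))  # 3
```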
Andrej Karpathy
81597cd616 move the lr schedule args up in base_train so they are tunable in configurator 2025-10-24 13:27:31 +00:00
Luke Stanley
defd1246aa Fix Torch crash caused by pinning on CPU 2025-10-21 20:28:10 +00:00
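A minimal sketch of the usual guard for this class of crash (assumed; the commit itself does not show the fix): only pin host memory when a CUDA device is actually going to consume it.

```python
import torch

def move_to_device(batch: torch.Tensor, device: str) -> torch.Tensor:
    if device == "cuda" and torch.cuda.is_available():
        # pinning only makes sense (and is only safe) when transferring to a GPU
        return batch.pin_memory().to(device, non_blocking=True)
    return batch.to(device)

x = move_to_device(torch.randn(4, 8), "cuda" if torch.cuda.is_available() else "cpu")
```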
Andrej Karpathy
a088b7a6ec use enable_gqa of pytorch sdpa, which allows us to delete some code; didn't realize it's available 2025-10-21 18:07:33 +00:00
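A minimal sketch of the call (shapes and head counts are arbitrary; requires a PyTorch version that supports the enable_gqa argument, 2.5+): with enable_gqa=True, scaled_dot_product_attention broadcasts the smaller number of KV heads across the query heads, so manual repetition of K/V for grouped-query attention can be deleted.

```python
import torch
import torch.nn.functional as F

B, T, head_dim = 2, 16, 64
n_q_heads, n_kv_heads = 8, 2                       # query heads must be a multiple of KV heads
q = torch.randn(B, n_q_heads, T, head_dim)
k = torch.randn(B, n_kv_heads, T, head_dim)
v = torch.randn(B, n_kv_heads, T, head_dim)

y = F.scaled_dot_product_attention(q, k, v, is_causal=True, enable_gqa=True)
print(y.shape)  # torch.Size([2, 8, 16, 64])
```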
Andrej Karpathy
5bdc99abfb merge and resolve conflict 2025-10-21 17:19:10 +00:00
Andrej Karpathy
dfcb1c16f1 Merge branch 'master' into cpu-mps-dev 2025-10-21 17:15:53 +00:00
Andrej Karpathy
fe5aed940b add personality to nanochat. breaks previous code on git pull and requires downloading a new file from s3, but there is a helpful error message, so hopefully it's ok 2025-10-21 15:04:58 +00:00
karpathy
2e9669e03a upgrading all other files to be able to use cpu/mps as well as cuda. various other minor changes, e.g. changing max_iterations to num_iterations in the sft script for consistency in naming 2025-10-20 10:15:17 -07:00