Kian Kyars
2f7841cd50
remove all uv venv
2026-01-16 16:22:39 -08:00
Andrej Karpathy
184d4c12b1
also add to log about the FA3 changes
2026-01-16 18:25:04 +00:00
Andrej Karpathy
b62a5bc44a
naturally i failed to include the actual code in the previous commit facepalm
2026-01-16 17:39:41 +00:00
Andrej Karpathy
8203efa919
implement flash attention 3 fallback to pytorch sdpa by touching as few lines of code as possible in main files and keeping all implementation to a single file. add tests. add helpful warning messages for the user.
2026-01-16 17:37:51 +00:00
Haoyu Wang
50413d2d67
typo in comments: change "GAPO" to "DAPO"
2026-01-15 22:03:42 -08:00
Andrej Karpathy
fbf2bbea25
update log with a bunch of attempts
2026-01-16 02:21:17 +00:00
Andrej Karpathy
747ed4491f
add negative result on olmo3 pretraining mix
2026-01-16 00:44:01 +00:00
Andrej Karpathy
7d1700c521
add zstd lib
2026-01-16 00:44:01 +00:00
Sofie Van Landeghem
d4ea28d4e2
Fix args in readme ( #438 )
...
* fix commands in readme, using new arg format
* fix typo
* add required -i flag to chat_eval example runs
2026-01-15 16:26:38 -08:00
Andrej Karpathy
bdcc030ffa
oops legacy spurious line now
2026-01-15 23:32:20 +00:00
Andrej Karpathy
22a71aa3d3
fuse adamw into a single torch compiled kernel similar to muon. it's about 1.7X faster, but overall it's so tiny that it's not making a major dent
2026-01-15 23:30:44 +00:00
Andrej Karpathy
255f8b9af6
cleanly separate cpu and gpu sections
2026-01-15 23:30:11 +00:00
Andrej Karpathy
6bb92403d5
changes and optimizations to muon, making it more efficient and simpler/cleaner a bit
2026-01-15 03:20:48 +00:00
Andrej Karpathy
3142ca1a28
minor helpful message
2026-01-15 03:20:21 +00:00
Andrej Karpathy
7312ec9898
fix buggy midtrain and update all kwargs to be idiomatic. that is, argparse uses dashes variables use underscores. the underscores are just a remnant of the previous Configurator object. This is the right way
2026-01-13 22:45:27 +00:00
Andrej Karpathy
3b50b77ed3
fix base_loss to report correct loss by switching the dataloader to the new default
2026-01-13 22:09:36 +00:00
Andrej Karpathy
f92efce169
add negative result about not allowing attention across BOS tokens. A lot more code complexity for basically no gain in performance
2026-01-13 21:33:54 +00:00
Andrej Karpathy
43c29dd9d5
Big DataLoader refactor: BOS-aligned dataloaders with epoch tracking for pre/mid-training
...
The new DataLoader ensures that every token sequence in train/val batches has a BOS token
at the beginning. Therefore, no token streams start abruptly in the middle of a document,
which could be confusing for the model. Note that this changes the loss scale because there
are fewer confusing tokens in the train/val batches. The main downside is that we now waste
about 35% of tokens due to cropping. This is ok because we have a lot of data. See dev/LOG.md
entry for this change for a lot more information.
2026-01-13 20:05:47 +00:00
Andrej Karpathy
23985413aa
adjust the comment on the regex pattern per recent experimnet see dev/LOG.md
2026-01-13 17:50:39 +00:00
Andrej Karpathy
64b48d0e5c
validated that \p{N}{1,2} is the correct number of digits to group up to in the regex pattern of the GPT-4 tokenizer (2 down from 3), leading to the best val_bpb for 32K vocabs
2026-01-13 17:45:06 +00:00
Andrej Karpathy
238353c998
document my struggle with fp8 integration yesterday, it's not working like i thought it would and i suffered. one day i will return to continue the fight.
2026-01-13 17:14:29 +00:00
Andrej Karpathy
4610a838a1
record negative result on MTP
2026-01-12 05:23:47 +00:00
Andrej Karpathy
21608ec51e
allow base_loss to report the loss of any arbitrary huggingface model similar to base_eval. had to change dataloader to be a lot better and just take tokenizer, not load the nanochat one. much better this way anyway
2026-01-12 03:10:13 +00:00
Andrej Karpathy
aa95fb2e03
make miniseries more generic and easier to run and less hard coded
2026-01-12 02:54:35 +00:00
Andrej Karpathy
b33e394528
oops actually make SSSL the default window pattern
2026-01-11 21:50:35 +00:00
Andrej Karpathy
fbc1484e8c
add alternating window size patterns for the GPT layers, following GPT-3. Experimented a bit and found the pattern SSSL to work well - 3 short, 1 long alternating. This is now the new default and the plots look quite a bit better on flops vs. bpb
2026-01-11 21:49:54 +00:00
Andrej Karpathy
2ff7d51252
integrate Flash Attention 3. +9% tok_per_sec for d12 with ctx even as low as 2048 out of the box nice. also, ready to tune windows huge
2026-01-11 20:33:19 +00:00
Andrej Karpathy
201d705957
recover the ability to load old checkpoints by patching the lambdas if they don't exist in checkpoints
2026-01-11 20:13:12 +00:00
Andrej Karpathy
aa530cdad5
Add learnable lambdas that gate the residual connection and a skip connection to the input embeddings, solid bump to val_bpb
2026-01-11 18:47:35 +00:00
Andrej Karpathy
2c4473dd1b
Big Muon optimizer changes inspired by latest of modded-nanogpt. Added Polar Express, Adafactor-style variance reduction, cautious weight decay, schedule weight decay linearly to ramp down to zero. Tuned optimum weight decay for multiple model sizes d8, d12, d16, d20 and found a scaling law with optimum wd \propto 1/channels^2, including it as default into code. --weight_decay of base_train is now default on and configured optimally according to all of these experiments. Solid bump to val_bpb observed as a result of these changes.
2026-01-11 16:56:59 +00:00
Andrej Karpathy
f5a0ea4d3f
take out these gitignore dirs
2026-01-08 18:18:42 +00:00
Andrej Karpathy
4ddc803797
fix adamw slight bug. this chunk was copy pasted originally from modded-nanogpt, which still seems to have the bug
2026-01-08 18:18:42 +00:00
Sofie Van Landeghem
a1ccb3dc0b
remove rust compilation as rustbpe is now installed from separate package ( #416 )
2026-01-08 06:18:37 -08:00
Andrej Karpathy
061f83c152
delete grad_clip. appears to not be necessary at all. not only was it buggy because the clipping happened per gpu before grad synchronization, but it costs ~2% MFU, and it also doesn't even help. I tried deleting it a while ago and back then it did help. So I'm guessing that some hyperparameter tuning obviated the reason for it since then
2026-01-08 02:16:50 +00:00
Andrej Karpathy
e8c30c3b19
add notebook used for scaling laws analysis
2026-01-07 22:28:53 +00:00
Andrej Karpathy
3af4dcf6ee
also add scaling_laws.sh script if it's a useful reference
2026-01-07 22:25:13 +00:00
Andrej Karpathy
4cc605b940
quick pointer to miniseries post in readme for now
2026-01-07 22:14:21 +00:00
Andrej Karpathy
ccf4b7f9bf
nudge hyperparameters of the base script with the results of the sweeps and miniseries. vocab size down to 32K. D:N ratio from 20 to 8. add miniseries script
2026-01-07 22:11:59 +00:00
Adria Blancafort
1b5de29e71
Fix undefined variable in chat_rl after recent refactor
...
* Fix undefined variable
* Remove unused import
Remove unused import 're' from chat_rl.py
2026-01-07 09:08:57 -08:00
Andrej Karpathy
ae0bf52529
tune hyperparameters based on overnight sweeps. warmdown_ratio is the biggest free win, increasing 0.2 -> 0.4, and embedding lr can be larger bumping 0.2 -> 0.3
2026-01-05 18:57:46 +00:00
Andrej Karpathy
eec0c79563
also add matplotlib dep so that we can have jupyter notebooks
2026-01-05 18:41:09 +00:00
Andrej Karpathy
54e59c38ad
add notebook on deriving the CORE estimates for the GPT-3 miniseries.
2026-01-05 18:40:28 +00:00
Andrej Karpathy
9d4c9b786d
many small fixes to base_train: reporting ETA, allowing some additional kwarg flexibility, making sure we don't crash when e.g. depth = 11 - we now calculate the closest num_heads that works
2026-01-05 00:38:09 +00:00
Andrej Karpathy
962b6bfba3
alright add transformers as a dep of the repo because it should be easy to evaluate the CORE score of HF models. Not super happy about it but i tried it and the uv.lock doesn't get bloated as much as i expected
2026-01-04 20:37:28 +00:00
Andrej Karpathy
ed2082fbc4
sane secrets management
2026-01-04 19:29:22 +00:00
Andrej Karpathy
eb7bbc1b66
delete the configurator in favor of argparse and clean up a lot of kwarg details to make them more consistent across all scripts
2026-01-04 19:14:23 +00:00
Andrej Karpathy
507d54224a
fix small bug where this would break if git stage has deleted files
2026-01-04 19:11:43 +00:00
Andrej Karpathy
9c60dfb64c
bump nanochat to use the latest stable pytorch that is 2.9.1 . Run e.g. to re-update your local environment if you git pull
2026-01-04 18:36:36 +00:00
Andrej Karpathy
be56d29b87
simplify redundant if/elif in bloat metrics
...
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-04 01:40:42 +00:00
Andrej Karpathy
ee79f29fbd
replace files-to-prompt with git ls-files for bloat metrics
...
files-to-prompt was including untracked files (knowledge/, dev scripts, etc.) which inflated the bloat metrics. now we use git ls-files to only count tracked source files, which is more accurate and removes an external dependency.
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-04 01:38:15 +00:00