Commit Graph

106 Commits

Author SHA1 Message Date
Kaiyue Wen
25ec1e6c43 Merge branch 'master' into muonh-submit
Resolved conflicts in scripts/base_train.py by keeping muonh-submit features
(hyperball optimizer support, norm_lr parameter, matrix warmup ratio) while
incorporating latest master improvements.

Co-Authored-By: Claude Sonnet 4.5 (1M context) <noreply@anthropic.com>
2026-02-12 20:14:24 -08:00
Kaiyue Wen
fe2a80badd Replace torchao with minimal custom FP8 implementation
Added _Float8MatmulND to fp8.py:
- Handles N-D input tensors efficiently
- Does reshaping internally (opaque to torch.compile)
- Prevents external reshape overhead that was causing MFU regression
- ~75 lines of clean, documented code

Benefits:
- No torchao dependency (removed from pyproject.toml)
- Same performance as torchao for reparam_linear
- Consistent with fp8.py's minimal philosophy (~350 total lines)
- All FP8 logic in one self-contained module

Co-Authored-By: Claude Sonnet 4.5 (1M context) <noreply@anthropic.com>
2026-02-12 17:05:06 -08:00
Kaiyue Wen
931d59c515 Use hybrid FP8 approach: torchao for reparam_linear, custom fp8 for layers
- reparam_linear: uses torchao for efficient N-D tensor handling without reshaping
- Float8Linear layers: uses custom fp8 module (simpler, same performance)
- This gives us the best of both: high MFU and minimal dependencies

Co-Authored-By: Claude Sonnet 4.5 (1M context) <noreply@anthropic.com>
2026-02-12 16:59:52 -08:00
Kaiyue Wen
29487517ed Revert to torchao for FP8 training to fix MFU regression
The custom fp8 module had a performance issue in reparam_linear:
it was doing reshape→matmul→reshape on every linear layer, and
torch.compile couldn't fuse these operations because _Float8Matmul
was marked @allow_in_graph (opaque to compiler).

torchao's matmul_with_hp_or_float8_args handles N-D tensors directly
without external reshaping, allowing better fusion opportunities and
higher MFU.

Co-Authored-By: Claude Sonnet 4.5 (1M context) <noreply@anthropic.com>
2026-02-12 16:58:05 -08:00
Kaiyue Wen
31e5bec402 Replace torchao with custom fp8 module in gpt.py
- Update reparam_linear to use nanochat.fp8.Float8Linear instead of torchao
- Replace matmul_with_hp_or_float8_args with direct _Float8Matmul.apply call
- Remove torchao dependency mention from base_train.py help text
- Functionally equivalent: both use torch._scaled_mm, custom version ~3% faster

Co-Authored-By: Claude Sonnet 4.5 (1M context) <noreply@anthropic.com>
2026-02-12 16:25:52 -08:00
Kaiyue Wen
ee04406ebb Merge muonh-dev and master: FP8 training, optimizer tuning, and scaling improvements
Major changes:
- Add custom FP8 training module (replaces torchao dependency)
- Implement auto-calculated optimal batch sizes (1M for d26)
- Add hyperball data scaling
- Restore and tune momentum schedule (settled on 0.95)
- Add matrix warmup ratio and norm_lr parameters
- Improve weight decay scaling (Tepoch-based theory)
- Update d26 configuration and scaling laws
- Clarify MFU labeling as bf16_mfu
- Update leaderboard and documentation

Co-Authored-By: Claude Sonnet 4.5 (1M context) <noreply@anthropic.com>
2026-02-12 16:15:15 -08:00
Andrej Karpathy
aeff095e97 better comments/flow on all the hyperparameter transfer stuff, and change the WD scaling from my empirical 1/d^2 to a bit more principled version based on Tepoch. All of that theory is based on AdamW and could be suboptimal for Muon 2026-02-06 19:22:28 +00:00
Andrej Karpathy
2c062aaa94 nit: don't mutate args, create new var for total_batch_size 2026-02-05 19:59:46 +00:00
Andrej Karpathy
f41dd3cbd7 auto-calculate optimal batch size. the original setting of 0.5M was only optimal for d12, but d26 prefers 1M and so on 2026-02-05 19:40:37 +00:00
dangxingyu
595a0f460a Scale hyperball lr by depth 2026-02-03 21:29:51 -05:00
dangxingyu
77de3297ea Update warmdown and rename quickrun 2026-02-03 20:25:16 -05:00
dangxingyu
e28d4ead22 Add muonh model and quickrun 2026-02-03 20:14:51 -05:00
Andrej Karpathy
6079f78fc3 add fp8 training with torchao 2026-02-03 21:03:42 +00:00
Andrej Karpathy
8ebc14b348 small touchups to the eval script, re-order items etc, cosmetic 2026-02-03 21:03:42 +00:00
Sofie Van Landeghem
72b9064f9d remove leftover mid references (#491) 2026-02-02 08:33:46 -08:00
Andrej Karpathy
07c4dd4cd9 manually control the over-active garbage collector, save a small few minutes from a typical run 2026-02-02 01:44:30 +00:00
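
The commit above mentions taking manual control of the garbage collector. A minimal sketch of that general technique using Python's standard gc module follows. The loop, the function name train, and the gc_every interval are hypothetical and not taken from the repository.

```python
import gc

def train(num_steps, gc_every=1000):
    """Hypothetical loop: automatic GC is off, collection runs only at chosen steps."""
    gc.collect()   # clear garbage left over from setup
    gc.freeze()    # move surviving objects to a permanent generation that later collections ignore
    gc.disable()   # stop automatic generational collection
    collected_at = []
    try:
        for step in range(1, num_steps + 1):
            # ... one training step would run here ...
            if step % gc_every == 0:
                gc.collect()              # explicit collection at a predictable point
                collected_at.append(step)
    finally:
        gc.enable()    # restore default behaviour for the rest of the program
        gc.unfreeze()
    return collected_at
```

Collecting at fixed steps keeps the pauses at predictable points instead of letting them interrupt arbitrary iterations.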
Andrej Karpathy
8b4849d548 fix bug in chat_sft, the attention window must be preserved sigh 2026-02-01 20:58:44 +00:00
Andrej Karpathy
eaf49a33c8 fix path which i think was modified during the refactor and this is a bug introduced by claude i believe 2026-02-01 20:15:19 +00:00
Andrej Karpathy
31b61d2d17 fix broken import sigh 2026-02-01 05:03:44 +00:00
Andrej Karpathy
0307997f9b merge two files base_loss and base_eval into a single file, it's nicer this way, and unify the huggingface code associated with both 2026-02-01 02:36:43 +00:00
Andrej Karpathy
1ddaad1c1c nuke midtraining from orbit, it's not as needed now that we have a BOS-aligned dataloader. Also change the README a lot. midtraining is not yet fully properly erased across the board, but good enough for step 1 2026-01-31 19:12:25 +00:00
Andrej Karpathy
348fbb301b fix dataloader for midtrain to never crop data. we can't just throw it away like we do in pretraining 2026-01-31 18:21:36 +00:00
Andrej Karpathy
3c3a3d7042 warmdown of 0.5 is slightly better: 2026-01-31 01:08:44 +00:00
Aarushi Singh
ace6740bdd feat: allow top_k=0 in web api to disable filtering (#458)
* allow top_k=0 in web api to disable filtering

* adding a comment for clear reasoning

* adding change to docstring
2026-01-30 09:21:41 -08:00
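
As an illustration of the top_k=0 convention described in the commit above, here is a small sketch of top-k filtering over a list of logits. The function name filter_top_k and the pure-Python representation are assumptions for illustration, not the web API's actual code.

```python
import math

def filter_top_k(logits, top_k):
    """Mask everything below the top_k-th largest logit to -inf.

    top_k=0 (or None) is treated as "no filtering", and so is a top_k
    that covers the whole vocabulary. Ties at the threshold are all kept.
    """
    if not top_k or top_k >= len(logits):
        return list(logits)
    threshold = sorted(logits, reverse=True)[top_k - 1]
    return [x if x >= threshold else -math.inf for x in logits]
```

With this convention a client can pass top_k=0 to sample from the full distribution without needing a separate flag.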
Andrej Karpathy
41bb2eac32 Combine AdamW and Muon into single MuonAdamW optimizer, cleaner, ty @chrisjmccormick for idea/help 2026-01-29 00:52:08 +00:00
Andrej Karpathy
c88bbf8133 Merge branch 'engram' 2026-01-27 22:33:16 +00:00
Andrej Karpathy
c8d93beed2 add engram-lite, add log, tune scaling laws analysis scripts 2026-01-27 22:31:17 +00:00
Andrej Karpathy
8630d32be4 quick fix to not OOM main speedrun script 2026-01-26 22:31:42 +00:00
Andrej Karpathy
59e36cc727 first version of engram following modded nanogpt style 2026-01-25 18:59:51 +00:00
Andrej Karpathy
a91743c168 Merge branch 've' 2026-01-18 15:14:39 +00:00
Andrej Karpathy
cf5c9e5b8e resolve a crash for odd depths because FA3 needs head_dim % 8 == 0 2026-01-18 00:07:08 +00:00
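
The commit above states the constraint (head_dim % 8 == 0) but not how the fix works. One possible way to satisfy such a constraint is to round the head dimension up to the next multiple of 8, sketched below. The helper names and the choice to pad rather than reject the configuration are assumptions, not the repository's method.

```python
def round_up(value, multiple):
    """Smallest multiple of `multiple` that is >= value."""
    return ((value + multiple - 1) // multiple) * multiple

def pad_model_dim(model_dim, num_heads, multiple=8):
    """Return (model_dim, head_dim) adjusted so that head_dim % multiple == 0."""
    head_dim = -(-model_dim // num_heads)   # ceiling division
    head_dim = round_up(head_dim, multiple)
    return head_dim * num_heads, head_dim
```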
Andrej Karpathy
413e91aa0f optimal ratio is now around 4 2026-01-17 23:51:09 +00:00
karpathy
f9a7e0f111 update the CPU/MPS script to give reasonable results. The model can at least answer that Paris is the capital of France and knows that the sky is blue, for about 40 minutes of training on my macbook. Also fixed a bug that existed due to KVCache bfloat16 dtype assumption 2026-01-17 12:27:30 -08:00
Andrej Karpathy
2955650327 add detection of device to report more correct mfu for bf16 2026-01-17 03:16:14 +00:00
Nitish Pandey
f42ae9e901 fix condition to perform bpb evaluation (#324)
Co-authored-by: svlandeg <svlandeg@github.com>
2026-01-16 18:56:43 -08:00
Andrej Karpathy
8203efa919 implement flash attention 3 fallback to pytorch sdpa by touching as few lines of code as possible in main files and keeping all implementation to a single file. add tests. add helpful warning messages for the user. 2026-01-16 17:37:51 +00:00
Haoyu Wang
50413d2d67 typo in comments: change "GAPO" to "DAPO" 2026-01-15 22:03:42 -08:00
Sofie Van Landeghem
d4ea28d4e2 Fix args in readme (#438)
* fix commands in readme, using new arg format

* fix typo

* add required -i flag to chat_eval example runs
2026-01-15 16:26:38 -08:00
Andrej Karpathy
bdcc030ffa oops legacy spurious line now 2026-01-15 23:32:20 +00:00
Andrej Karpathy
255f8b9af6 cleanly separate cpu and gpu sections 2026-01-15 23:30:11 +00:00
Andrej Karpathy
7312ec9898 fix buggy midtrain and update all kwargs to be idiomatic. that is, argparse uses dashes variables use underscores. the underscores are just a remnant of the previous Configurator object. This is the right way 2026-01-13 22:45:27 +00:00
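
The dash/underscore convention mentioned in the commit above is standard argparse behaviour: a flag written with dashes is exposed as an attribute with underscores. A short sketch follows. The flag names and defaults are hypothetical examples, not the scripts' actual arguments.

```python
import argparse

parser = argparse.ArgumentParser()
# Hypothetical flags: dashes on the command line ...
parser.add_argument("--total-batch-size", type=int, default=524288)
parser.add_argument("--weight-decay", type=float, default=0.0)

# An explicit argument list is passed so the sketch does not read sys.argv.
args = parser.parse_args(["--total-batch-size", "1048576"])

# ... underscores on the parsed namespace.
print(args.total_batch_size, args.weight_decay)
```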
Andrej Karpathy
3b50b77ed3 fix base_loss to report correct loss by switching the dataloader to the new default 2026-01-13 22:09:36 +00:00
Andrej Karpathy
43c29dd9d5 Big DataLoader refactor: BOS-aligned dataloaders with epoch tracking for pre/mid-training
The new DataLoader ensures that every token sequence in train/val batches has a BOS token
at the beginning. Therefore, no token streams start abruptly in the middle of a document,
which could be confusing for the model. Note that this changes the loss scale because there
are fewer confusing tokens in the train/val batches. The main downside is that we now waste
about 35% of tokens due to cropping. This is ok because we have a lot of data. See dev/LOG.md
entry for this change for a lot more information.
2026-01-13 20:05:47 +00:00
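
To make the BOS-alignment and cropping trade-off described above concrete, here is a simplified sketch of packing documents into fixed-length rows so that every row begins with a BOS token. The greedy packing strategy, the BOS id of 0, and the function name are assumptions. The actual DataLoader may pack differently (see dev/LOG.md as referenced in the commit).

```python
BOS = 0  # hypothetical BOS token id; document tokens are assumed to be >= 1

def bos_aligned_rows(docs, seq_len):
    """Pack documents into rows of exactly seq_len tokens, each starting with BOS.

    A document that does not fit in the remaining space of the current row is
    cropped, and the cut-off tokens are discarded rather than carried into the
    next row (which would make that row start mid-document).
    Returns (rows, cropped), where cropped counts the discarded tokens.
    An incomplete final row is dropped and not counted.
    """
    rows, row, cropped = [], [], 0
    for doc in docs:
        tokens = [BOS] + list(doc)
        space = seq_len - len(row)
        row.extend(tokens[:space])
        cropped += max(0, len(tokens) - space)
        if len(row) == seq_len:
            rows.append(row)
            row = []   # the next document starts a fresh row with its BOS
    return rows, cropped
```

In this toy version the cropped count is what the commit calls wasted tokens. The figure of about 35% quoted above comes from the commit message, not from this sketch.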
Andrej Karpathy
21608ec51e allow base_loss to report the loss of any arbitrary huggingface model similar to base_eval. had to change dataloader to be a lot better and just take tokenizer, not load the nanochat one. much better this way anyway 2026-01-12 03:10:13 +00:00
Andrej Karpathy
b33e394528 oops actually make SSSL the default window pattern 2026-01-11 21:50:35 +00:00
Andrej Karpathy
fbc1484e8c add alternating window size patterns for the GPT layers, following GPT-3. Experimented a bit and found the pattern SSSL to work well - 3 short, 1 long alternating. This is now the new default and the plots look quite a bit better on flops vs. bpb 2026-01-11 21:49:54 +00:00
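
The SSSL pattern above (3 short, 1 long, repeating) can be sketched as a mapping from layer index to attention window size. The function name and the concrete window sizes of 1024 and 2048 are hypothetical. The commit does not state them, and the real code may treat specific layers specially.

```python
def window_sizes(num_layers, pattern="SSSL", short=1024, long=2048):
    """Tile the pattern string across the layers: 'S' -> short window, 'L' -> long window."""
    sizes = {"S": short, "L": long}
    return [sizes[pattern[i % len(pattern)]] for i in range(num_layers)]
```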
Andrej Karpathy
aa530cdad5 Add learnable lambdas that gate the residual connection and a skip connection to the input embeddings, solid bump to val_bpb 2026-01-11 18:47:35 +00:00
Andrej Karpathy
2c4473dd1b Big Muon optimizer changes inspired by latest of modded-nanogpt. Added Polar Express, Adafactor-style variance reduction, cautious weight decay, schedule weight decay linearly to ramp down to zero. Tuned optimum weight decay for multiple model sizes d8, d12, d16, d20 and found a scaling law with optimum wd \propto 1/channels^2, including it as default into code. --weight_decay of base_train is now default on and configured optimally according to all of these experiments. Solid bump to val_bpb observed as a result of these changes. 2026-01-11 16:56:59 +00:00
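
Two of the ideas in the commit above can be written down directly: the scaling law wd \propto 1/channels^2 and a weight decay that ramps linearly down to zero over training. The sketch below assumes a hypothetical reference point (ref_wd at ref_channels). Those constants are placeholders, not the tuned values from the experiments.

```python
def scaled_weight_decay(channels, ref_wd=0.2, ref_channels=768):
    """Weight decay proportional to 1/channels^2, anchored at a hypothetical reference model."""
    return ref_wd * (ref_channels / channels) ** 2

def weight_decay_at(step, total_steps, base_wd):
    """Linear schedule from base_wd at step 0 down to zero at total_steps."""
    return base_wd * (1.0 - step / total_steps)
```

Under this law, doubling the channel count divides the weight decay by four.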
Andrej Karpathy
061f83c152 delete grad_clip. appears to not be necessary at all. not only was it buggy because the clipping happened per gpu before grad synchronization, but it costs ~2% MFU, and it also doesn't even help. I tried deleting it a while ago and back then it did help. So I'm guessing that some hyperparameter tuning obviated the reason for it since then 2026-01-08 02:16:50 +00:00
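
The bug described above, clipping on each GPU before gradients are synchronized, can be illustrated with a toy numeric example: clipping then averaging does not generally give the same result as averaging then clipping. The two-element gradients and the helper names are made up for illustration.

```python
import math

def clip(vec, max_norm):
    """Scale vec so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [x * scale for x in vec]

def average(vecs):
    """Element-wise mean, standing in for a gradient all-reduce."""
    n = len(vecs)
    return [sum(col) / n for col in zip(*vecs)]

per_gpu = [[3.0, 4.0], [0.3, 0.4]]   # toy gradients on two GPUs (norms 5.0 and 0.5)

clip_then_sync = average([clip(g, 1.0) for g in per_gpu])   # per-GPU clipping, then sync
sync_then_clip = clip(average(per_gpu), 1.0)                # sync, then clip the true gradient

print(clip_then_sync, sync_then_clip)
```

Only the second ordering clips the norm of the actual averaged gradient. The first yields a different direction-weighted mix whose norm depends on how the gradient happened to be split across devices.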
Andrej Karpathy
ccf4b7f9bf nudge hyperparameters of the base script with the results of the sweeps and miniseries. vocab size down to 32K. D:N ratio from 20 to 8. add miniseries script 2026-01-07 22:11:59 +00:00