Kaiyue Wen
330fa1188c
Merge origin/master into muonh
Resolved conflicts:
- nanochat/fp8.py: Kept _Float8MatmulND class from muonh
- scripts/base_train.py: Kept dual lrm logging from muonh
2026-02-12 21:30:17 -08:00
Kaiyue Wen
25ec1e6c43
Merge branch 'master' into muonh-submit
Resolved conflicts in scripts/base_train.py by keeping muonh-submit features
(hyperball optimizer support, norm_lr parameter, matrix warmup ratio) while
incorporating latest master improvements.
Co-Authored-By: Claude Sonnet 4.5 (1M context) <noreply@anthropic.com>
2026-02-12 20:14:24 -08:00
Kaiyue Wen
116900ac16
muonh
2026-02-12 17:51:36 -08:00
Kaiyue Wen
5a965c1383
Remove runs/scaling_laws_muonh.sh
Co-Authored-By: Claude Sonnet 4.5 (1M context) <noreply@anthropic.com>
2026-02-12 17:09:19 -08:00
Kaiyue Wen
fe2a80badd
Replace torchao with minimal custom FP8 implementation
Added _Float8MatmulND to fp8.py:
- Handles N-D input tensors efficiently
- Does reshaping internally (opaque to torch.compile)
- Prevents external reshape overhead that was causing MFU regression
- ~75 lines of clean, documented code
Benefits:
- No torchao dependency (removed from pyproject.toml)
- Same performance as torchao for reparam_linear
- Consistent with fp8.py's minimal philosophy (~350 total lines)
- All FP8 logic in one self-contained module
Co-Authored-By: Claude Sonnet 4.5 (1M context) <noreply@anthropic.com>
2026-02-12 17:05:06 -08:00
Kaiyue Wen
931d59c515
Use hybrid FP8 approach: torchao for reparam_linear, custom fp8 for layers
- reparam_linear: uses torchao for efficient N-D tensor handling without reshaping
- Float8Linear layers: uses custom fp8 module (simpler, same performance)
- This gives us the best of both: high MFU and minimal dependencies
Co-Authored-By: Claude Sonnet 4.5 (1M context) <noreply@anthropic.com>
2026-02-12 16:59:52 -08:00
Kaiyue Wen
29487517ed
Revert to torchao for FP8 training to fix MFU regression
The custom fp8 module had a performance issue in reparam_linear:
it was doing reshape→matmul→reshape on every linear layer, and
torch.compile couldn't fuse these operations because _Float8Matmul
was marked @allow_in_graph (opaque to compiler).
torchao's matmul_with_hp_or_float8_args handles N-D tensors directly
without external reshaping, allowing better fusion opportunities and
higher MFU.
Co-Authored-By: Claude Sonnet 4.5 (1M context) <noreply@anthropic.com>
2026-02-12 16:58:05 -08:00
Kaiyue Wen
31e5bec402
Replace torchao with custom fp8 module in gpt.py
- Update reparam_linear to use nanochat.fp8.Float8Linear instead of torchao
- Replace matmul_with_hp_or_float8_args with direct _Float8Matmul.apply call
- Remove torchao dependency mention from base_train.py help text
- Functionally equivalent: both use torch._scaled_mm, custom version ~3% faster
Co-Authored-By: Claude Sonnet 4.5 (1M context) <noreply@anthropic.com>
2026-02-12 16:25:52 -08:00
Kaiyue Wen
ee04406ebb
Merge muonh-dev and master: FP8 training, optimizer tuning, and scaling improvements
Major changes:
- Add custom FP8 training module (replaces torchao dependency)
- Implement auto-calculated optimal batch sizes (1M for d26)
- Add hyperball data scaling
- Restore and tune momentum schedule (settled on 0.95)
- Add matrix warmup ratio and norm_lr parameters
- Improve weight decay scaling (Tepoch-based theory)
- Update d26 configuration and scaling laws
- Clarify MFU labeling as bf16_mfu
- Update leaderboard and documentation
Co-Authored-By: Claude Sonnet 4.5 (1M context) <noreply@anthropic.com>
2026-02-12 16:15:15 -08:00
Andrej Karpathy
2f09686724
clarify that this is bf16 mfu we're talking about
2026-02-10 23:35:00 +00:00
Andrej Karpathy
e569b59f92
delete torchao dependency, create our own exact API-matched version of Float8Linear, document it very well. for some poorly understood reason, the performance is not only ~identical but actually runs 3% faster, despite it being significantly simpler and much less code. i don't fully understand why/how atm
2026-02-10 18:46:39 +00:00
Andrej Karpathy
1ec0a34779
at 28 and above we start to need batch size 8
2026-02-08 18:26:34 +00:00
Andrej Karpathy
ff46300720
tune miniseries just a bit, fairly cosmetic, keep to even depths where the math works out nicely in model sizing
2026-02-08 17:54:12 +00:00
Andrej Karpathy
aeff095e97
better comments/flow on all the hyperparameter transfer stuff, and change the WD scaling from my empirical 1/d^2 to a bit more principled version based on Tepoch. All of that theory is based on AdamW and could be suboptimal for Muon
2026-02-06 19:22:28 +00:00
Andrej Karpathy
685271dc8d
new optimal ratio for d26 training
2026-02-06 19:21:27 +00:00
Andrej Karpathy
e527521a3f
briefly mention batch ramp experimentation too, too weak to merge in my few attempts
2026-02-05 22:21:03 +00:00
Andrej Karpathy
96522798f1
docs docs docs
2026-02-05 20:27:07 +00:00
Andrej Karpathy
5fdd5cdb24
new leaderboard record via new auto-calculated optimal batch size. for d26 it is 1M, up from 0.5M that was default earlier
2026-02-05 20:11:32 +00:00
Andrej Karpathy
2c062aaa94
nit: don't mutate args, create new var for total_batch_size
2026-02-05 19:59:46 +00:00
Andrej Karpathy
f41dd3cbd7
auto-calculate optimal batch size. the original setting of 0.5M was only optimal for d12, but d26 prefers 1M and so on
2026-02-05 19:40:37 +00:00
Andrej Karpathy
98eed6df18
bring back an assert guarding against bad param sizing
2026-02-05 18:14:30 +00:00
Sofie Van Landeghem
012da1a78b
Typo fixes (#480)
* small typo
* few more small fixes
* small fixes in leaderboard.md
2026-02-05 19:12:50 +01:00
Andrej Karpathy
75b302f331
fix hash commit on leaderboard and a paragraph clarification
2026-02-05 16:14:28 +00:00
Andrej Karpathy
1144d186ed
try and fail relu^2 -> swiglu
2026-02-05 02:42:46 +00:00
Andrej Karpathy
d63b7ab9ac
try and fail relu^2 -> swiglu
2026-02-05 02:41:46 +00:00
Andrej Karpathy
718e5e9d67
correctly reference NorMuon and fix misleading terms that i may have hastily ported over from modded-nanogpt
2026-02-05 01:39:26 +00:00
dangxingyu
595a0f460a
Scale hyperball lr by depth
2026-02-03 21:29:51 -05:00
Andrej Karpathy
542beb0c8c
bump speedrun to be the up to date leaderboard run
2026-02-04 02:12:04 +00:00
dangxingyu
924489f582
Update quickrun defaults
2026-02-03 20:46:20 -05:00
dangxingyu
e7ee891c3b
Update quickrun script
2026-02-03 20:43:43 -05:00
dangxingyu
a611a85e35
Rename quickrun script
2026-02-03 20:29:55 -05:00
dangxingyu
4686cb9509
Update quickrun wandb mode
2026-02-03 20:26:11 -05:00
dangxingyu
77de3297ea
Update warmdown and rename quickrun
2026-02-03 20:25:16 -05:00
dangxingyu
e28d4ead22
Add muonh model and quickrun
2026-02-03 20:14:51 -05:00
Andrej Karpathy
d510b1385b
quick experiments to log
2026-02-03 23:21:39 +00:00
Andrej Karpathy
16b8ac7da3
oops forgot to attach leaderboard file too
2026-02-03 21:06:12 +00:00
Andrej Karpathy
fe55b092b8
minor cosmetics for the table
2026-02-03 21:05:28 +00:00
Andrej Karpathy
a67eba35dc
add feb2 new leaderboard record from upgrading to fp8 training, +4.3% speedup to time to GPT-2
2026-02-03 21:03:42 +00:00
Andrej Karpathy
6079f78fc3
add fp8 training with torchao
2026-02-03 21:03:42 +00:00
Andrej Karpathy
8ebc14b348
small touchups to the eval script, re-order items etc, cosmetic
2026-02-03 21:03:42 +00:00
Sofie Van Landeghem
72b9064f9d
remove leftover mid references (#491)
2026-02-02 08:33:46 -08:00
Andrej Karpathy
b19b4f3e49
fix bug in speedrun script, batch size that doesn't OOM on 8XH100 for d24 is 16
2026-02-02 15:50:14 +00:00
Andrej Karpathy
230d6cf6c6
tune the synthetic data generation script. delete the king andrej stuff lol. also, upgrade to gemini 3
2026-02-02 01:45:59 +00:00
Andrej Karpathy
07c4dd4cd9
manually control the over-active garbage collector, save a small few minutes from a typical run
2026-02-02 01:44:30 +00:00
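
Note on 07c4dd4cd9: the commit message does not say how the garbage collector is controlled. A common pattern, shown here only as a minimal sketch and not as the code in this repository, is to turn off automatic collection for the duration of the loop and collect explicitly at a fixed interval. The function name `train`, the `collect_every` parameter, and the dummy loop body are all hypothetical.

```python
import gc


def train(num_steps, collect_every=1000):
    """Dummy loop with automatic GC off and explicit periodic collections.

    Hypothetical illustration only: the loop body stands in for a real
    training step, and the interval is an arbitrary choice.
    """
    gc.collect()   # clear garbage left over from setup
    gc.freeze()    # park surviving objects so later collections skip them
    gc.disable()   # no automatic collections inside the loop
    collections = 0
    try:
        for step in range(1, num_steps + 1):
            _ = [step] * 8  # stand-in for one training step
            if step % collect_every == 0:
                gc.collect()  # collect at a predictable point instead
                collections += 1
    finally:
        gc.enable()    # restore default behaviour even if the loop raises
        gc.unfreeze()
    return collections
```

With `num_steps=2500` and `collect_every=1000` the function performs two explicit collections. The intent of this kind of pattern is to make collection pauses rare and predictable rather than letting them fire whenever allocation thresholds are crossed.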
Andrej Karpathy
e8fec97d4c
slightly more efficient dataloader that reduces the number of python objects flying around and causing strain on runtime and garbage collector
2026-02-02 01:17:30 +00:00
Andrej Karpathy
8b4849d548
fix bug in chat_sft, the attention window must be preserved sigh
2026-02-01 20:58:44 +00:00
Andrej Karpathy
eaf49a33c8
fix path which i think was modified during the refactor and this is a bug introduced by claude i believe
2026-02-01 20:15:19 +00:00
Andrej Karpathy
31b61d2d17
fix broken import sigh
2026-02-01 05:03:44 +00:00
Sofie Van Landeghem
4d6415b8ef
use _PEAK_FLOPS_TABLE instead of if-else structure (#479)
2026-01-31 19:45:06 -08:00
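
Note on 4d6415b8ef: the general idea of swapping an if-else chain for a lookup table can be sketched as below. Only the name `_PEAK_FLOPS_TABLE` comes from the commit subject. The entries, the substring matching, and the `peak_flops` helper are assumptions made for illustration and may not match the repository's table or lookup logic. The two values are the commonly quoted dense BF16 peaks for those GPUs.

```python
# Illustrative entries only (dense BF16 peak, FLOP/s); the real table in the
# repository may list different devices and values.
_PEAK_FLOPS_TABLE = {
    "H100": 989e12,
    "A100": 312e12,
}


def peak_flops(device_name):
    """Return peak FLOP/s for a device name via table lookup.

    Hypothetical helper: matches on a substring of the device name, so
    adding a GPU means adding one table row rather than another branch.
    """
    name = device_name.upper()
    for key, flops in _PEAK_FLOPS_TABLE.items():
        if key in name:
            return flops
    raise ValueError(f"no peak FLOPS entry for device: {device_name!r}")
```

A caller would pass whatever name the runtime reports for the device, for example `peak_flops("NVIDIA H100 80GB HBM3")`. Raising on an unknown device is one possible design choice; falling back to a default value is another.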
Sofie Van Landeghem
43078c347e
clean up original tokenizing_distributed_data_loader (#478)
2026-01-31 19:44:12 -08:00