Commit Graph

357 Commits

Author SHA1 Message Date
Xingyu Dang
4478f390fb
Merge 330fa1188c into c7ba252142 2026-02-20 11:05:12 -05:00
Dipesh Babu
c7ba252142
docs: fix typos in experiment log (#547) 2026-02-20 08:03:45 -08:00
Andrej Karpathy
2dffdc8cf6 document MoE exploration 2026-02-19 02:53:47 +00:00
Andrej Karpathy
48804bff3a report negative result on fineweb dataset 2026-02-18 23:45:31 +00:00
Andrej Karpathy
bb5137860e fix comment 2026-02-18 23:26:22 +00:00
Andrej Karpathy
458555117b Merge branch 'Chetter2-patch-1' 2026-02-18 23:17:39 +00:00
Andrej Karpathy
bac5a35dd7 fix minor bug in fp8 application to skip tiny matmuls 2026-02-18 23:17:29 +00:00
George Shakan
ad55575326 Fix bug in setting precision (#538) 2026-02-18 15:49:18 +00:00
Sofie Van Landeghem
cac43e8511 Fix MockModel's device definition (#535)
* fix MockModel's device definition

* cleanup
2026-02-18 15:49:18 +00:00
Andrej Karpathy
f5fe7925ed update dev log with recent 2026-02-18 15:49:18 +00:00
Andrej Karpathy
1415fb7617 tune the data mixture a bit, load optimizer by default when SFT. These were confirmed to be best settings from sweeps of sft 2026-02-18 15:49:18 +00:00
Andrej Karpathy
77f8fb8303 a number of upgrades to SFT script to bring it up to date w.r.t. pretraining and tuning some of its kwargs based on sweeps 2026-02-18 15:49:18 +00:00
George Shakan
0a23f87643
Fix bug in setting precision (#538) 2026-02-18 07:42:11 -08:00
Sofie Van Landeghem
4800c62f6e
Fix MockModel's device definition (#535)
* fix MockModel's device definition

* cleanup
2026-02-17 16:03:46 -08:00
Andrej Karpathy
4a6e47b0c6 update dev log with recent 2026-02-17 15:44:54 +00:00
Andrej Karpathy
8180e1d8c1 tune the data mixture a bit, load optimizer by default when SFT. These were confirmed to be best settings from sweeps of sft 2026-02-16 20:23:04 +00:00
Andrej Karpathy
788dadeb88 a number of upgrades to SFT script to bring it up to date w.r.t. pretraining and tuning some of its kwargs based on sweeps 2026-02-16 14:41:53 +00:00
Alan
124f49be98
Removed redundant quantization of gradients 2026-02-15 15:41:33 +00:00
Alan
d9678ff0f9
Save FP8 tensors in autograd ctx instead of full-precision inputs
Store quantized input/weight and their inverse scales in _Float8Matmul ctx to avoid re-quantization in backward and reduce saved-activation memory without changing numerics.
2026-02-15 14:31:54 +00:00
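(A minimal sketch of what such an autograd Function can look like, assuming per-tensor e4m3 scaling and a recent PyTorch build with torch._scaled_mm, whose exact signature varies across versions; the real _Float8Matmul in nanochat/fp8.py may differ in details.)

```python
import torch

FP8_DTYPE = torch.float8_e4m3fn
FP8_MAX = torch.finfo(FP8_DTYPE).max  # 448.0 for e4m3

def _quantize_fp8(t: torch.Tensor):
    """Per-tensor scaling: map t into the representable fp8 range and return the
    fp8 tensor together with the float32 inverse scale needed to undo it."""
    amax = t.abs().amax().clamp(min=1e-12).float()
    scale = FP8_MAX / amax
    t_fp8 = (t.float() * scale).clamp(-FP8_MAX, FP8_MAX).to(FP8_DTYPE)
    return t_fp8, 1.0 / scale

class _Float8Matmul(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, w):                      # x: (M, K), w: (N, K), nn.Linear layout
        x_fp8, x_inv_s = _quantize_fp8(x)
        w_fp8, w_inv_s = _quantize_fp8(w)
        # Save the *quantized* tensors plus inverse scales instead of the bf16 inputs,
        # so backward never re-quantizes and saved-activation memory roughly halves.
        ctx.save_for_backward(x_fp8, x_inv_s, w_fp8, w_inv_s)
        # torch._scaled_mm needs fp8 operands, mat2 column-major (hence the .t()), and
        # fp8-capable hardware; keyword names vary slightly across PyTorch versions.
        return torch._scaled_mm(x_fp8, w_fp8.t(), scale_a=x_inv_s, scale_b=w_inv_s,
                                out_dtype=torch.bfloat16)

    @staticmethod
    def backward(ctx, grad_out):
        x_fp8, x_inv_s, w_fp8, w_inv_s = ctx.saved_tensors
        # Dequantize on the fly; same numerics as re-quantizing the original inputs.
        x = (x_fp8.float() * x_inv_s).to(grad_out.dtype)
        w = (w_fp8.float() * w_inv_s).to(grad_out.dtype)
        grad_x = grad_out @ w          # (M, N) @ (N, K) -> (M, K)
        grad_w = grad_out.t() @ x      # (N, M) @ (M, K) -> (N, K)
        return grad_x, grad_w
```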
Kaiyue Wen
330fa1188c Merge origin/master into muonh
Resolved conflicts:
- nanochat/fp8.py: Kept _Float8MatmulND class from muonh
- scripts/base_train.py: Kept dual lrm logging from muonh
2026-02-12 21:30:17 -08:00
Kaiyue Wen
25ec1e6c43 Merge branch 'master' into muonh-submit
Resolved conflicts in scripts/base_train.py by keeping muonh-submit features
(hyperball optimizer support, norm_lr parameter, matrix warmup ratio) while
incorporating latest master improvements.

Co-Authored-By: Claude Sonnet 4.5 (1M context) <noreply@anthropic.com>
2026-02-12 20:14:24 -08:00
Kaiyue Wen
116900ac16 muonh 2026-02-12 17:51:36 -08:00
Kaiyue Wen
5a965c1383 Remove runs/scaling_laws_muonh.sh
Co-Authored-By: Claude Sonnet 4.5 (1M context) <noreply@anthropic.com>
2026-02-12 17:09:19 -08:00
Kaiyue Wen
fe2a80badd Replace torchao with minimal custom FP8 implementation
Added _Float8MatmulND to fp8.py:
- Handles N-D input tensors efficiently
- Does reshaping internally (opaque to torch.compile)
- Prevents external reshape overhead that was causing MFU regression
- ~75 lines of clean, documented code

Benefits:
- No torchao dependency (removed from pyproject.toml)
- Same performance as torchao for reparam_linear
- Consistent with fp8.py's minimal philosophy (~350 total lines)
- All FP8 logic in one self-contained module

Co-Authored-By: Claude Sonnet 4.5 (1M context) <noreply@anthropic.com>
2026-02-12 17:05:06 -08:00
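(Building on the sketch above, the N-D idea can be illustrated with a thin wrapper; the name float8_matmul_nd and the allow_in_graph decoration are illustrative assumptions, and the actual _Float8MatmulND folds the flatten/unflatten into the autograd Function itself.)

```python
import torch

# Illustrative wrapper only; in nanochat the reshaping lives directly inside the
# opaque _Float8MatmulND autograd.Function, which achieves the same effect.
@torch._dynamo.allow_in_graph          # keep the reshapes invisible to torch.compile
def float8_matmul_nd(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    lead = x.shape[:-1]                # leading dims, e.g. (B, T) for activations
    x2d = x.reshape(-1, x.shape[-1])   # flatten to (B*T, K) *inside* the opaque call
    out2d = _Float8Matmul.apply(x2d, w)  # 2-D fp8 matmul from the sketch above
    return out2d.reshape(*lead, -1)    # restore (B, T, N) before returning
```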
Kaiyue Wen
931d59c515 Use hybrid FP8 approach: torchao for reparam_linear, custom fp8 for layers
- reparam_linear: uses torchao for efficient N-D tensor handling without reshaping
- Float8Linear layers: uses custom fp8 module (simpler, same performance)
- This gives us the best of both: high MFU and minimal dependencies

Co-Authored-By: Claude Sonnet 4.5 (1M context) <noreply@anthropic.com>
2026-02-12 16:59:52 -08:00
Kaiyue Wen
29487517ed Revert to torchao for FP8 training to fix MFU regression
The custom fp8 module had a performance issue in reparam_linear:
it was doing reshape→matmul→reshape on every linear layer, and
torch.compile couldn't fuse these operations because _Float8Matmul
was marked @allow_in_graph (opaque to compiler).

torchao's matmul_with_hp_or_float8_args handles N-D tensors directly
without external reshaping, allowing better fusion opportunities and
higher MFU.

Co-Authored-By: Claude Sonnet 4.5 (1M context) <noreply@anthropic.com>
2026-02-12 16:58:05 -08:00
Kaiyue Wen
31e5bec402 Replace torchao with custom fp8 module in gpt.py
- Update reparam_linear to use nanochat.fp8.Float8Linear instead of torchao
- Replace matmul_with_hp_or_float8_args with direct _Float8Matmul.apply call
- Remove torchao dependency mention from base_train.py help text
- Functionally equivalent: both use torch._scaled_mm, custom version ~3% faster

Co-Authored-By: Claude Sonnet 4.5 (1M context) <noreply@anthropic.com>
2026-02-12 16:25:52 -08:00
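(A hypothetical, simplified view of the call site after this change; the real reparam_linear in nanochat/gpt.py takes more arguments. It also shows the external reshape -> matmul -> reshape pattern that the _Float8MatmulND commit above later moves inside the op.)

```python
import torch

def reparam_linear(x: torch.Tensor, weight: torch.Tensor, use_fp8: bool = True) -> torch.Tensor:
    """Hypothetical simplified dispatch: fp8 matmul where supported, high precision otherwise."""
    if use_fp8 and x.is_cuda:
        # Direct .apply on the custom Function (replaces torchao's matmul_with_hp_or_float8_args).
        # The external flatten/unflatten around the opaque op is what later hurt MFU.
        x2d = x.reshape(-1, x.shape[-1])
        out = _Float8Matmul.apply(x2d, weight)   # from the sketch above
        return out.reshape(*x.shape[:-1], -1)
    return x @ weight.t()                        # high-precision fallback, same math
```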
Kaiyue Wen
ee04406ebb Merge muonh-dev and master: FP8 training, optimizer tuning, and scaling improvements
Major changes:
- Add custom FP8 training module (replaces torchao dependency)
- Implement auto-calculated optimal batch sizes (1M for d26)
- Add hyperball data scaling
- Restore and tune momentum schedule (settled on 0.95)
- Add matrix warmup ratio and norm_lr parameters
- Improve weight decay scaling (Tepoch-based theory)
- Update d26 configuration and scaling laws
- Clarify MFU labeling as bf16_mfu
- Update leaderboard and documentation

Co-Authored-By: Claude Sonnet 4.5 (1M context) <noreply@anthropic.com>
2026-02-12 16:15:15 -08:00
Andrej Karpathy
2f09686724 clarify that this is bf16 mfu we're talking about 2026-02-10 23:35:00 +00:00
Andrej Karpathy
e569b59f92 delete torchao dependency, create our own exact API-matched version of Float8Linear, document it very well. for some poorly understood reason, the performance is not only ~identical but actually runs 3% faster, despite it being significantly simpler and much less code. i don't fully understand why/how atm 2026-02-10 18:46:39 +00:00
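(A sketch of the shape such a drop-in can take, reusing the fp8 matmul sketches above; the from_float signature and bias handling here are assumptions, and the documented version lives in nanochat/fp8.py.)

```python
import torch
import torch.nn as nn

class Float8Linear(nn.Linear):
    """Sketch of an API-matched drop-in: same constructor and state_dict as nn.Linear,
    with forward routed through the custom fp8 matmul. Details are assumptions."""

    @classmethod
    def from_float(cls, mod: nn.Linear) -> "Float8Linear":
        # Wrap an existing nn.Linear without copying its parameters.
        new = cls(mod.in_features, mod.out_features, bias=mod.bias is not None,
                  device=mod.weight.device, dtype=mod.weight.dtype)
        new.weight = mod.weight
        new.bias = mod.bias
        return new

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = float8_matmul_nd(x, self.weight)   # fp8 path from the sketches above
        if self.bias is not None:
            out = out + self.bias
        return out
```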
Andrej Karpathy
1ec0a34779 at 28 and above we start to need batch size 8 2026-02-08 18:26:34 +00:00
Andrej Karpathy
ff46300720 tune miniseries just a bit, fairly cosmetic, keep to even depths where the math works out nicely in model sizing 2026-02-08 17:54:12 +00:00
Andrej Karpathy
aeff095e97 better comments/flow on all the hyperparameter transfer stuff, and change the WD scaling from my empirical 1/d^2 to a bit more principled version based on Tepoch. All of that theory is based on AdamW and could be suboptimal for Muon 2026-02-06 19:22:28 +00:00
Andrej Karpathy
685271dc8d new optimal ratio for d26 training 2026-02-06 19:21:27 +00:00
Andrej Karpathy
e527521a3f briefly mention batch ramp experimentation too, too weak to merge in my few attempts 2026-02-05 22:21:03 +00:00
Andrej Karpathy
96522798f1 docs docs docs 2026-02-05 20:27:07 +00:00
Andrej Karpathy
5fdd5cdb24 new leaderboard record via new auto-calculated optimal batch size. for d26 it is 1M, up from the earlier default of 0.5M 2026-02-05 20:11:32 +00:00
Andrej Karpathy
2c062aaa94 nit: don't mutate args, create new var for total_batch_size 2026-02-05 19:59:46 +00:00
Andrej Karpathy
f41dd3cbd7 auto-calculate optimal batch size. the original setting of 0.5M was only optimal for d12, but d26 prefers 1M and so on 2026-02-05 19:40:37 +00:00
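(The commits pin down only two data points: 0.5M tokens at d12 and 1M at d26. One purely illustrative reconstruction scales the d12 baseline linearly with depth and snaps to a power of two; the actual rule in scripts/base_train.py may differ.)

```python
import math

def optimal_total_batch_size(depth: int) -> int:
    """Illustrative only: scale the 0.5M-token d12 baseline linearly with depth,
    then snap to the nearest power of two."""
    base_depth, base_tokens = 12, 524_288        # 0.5M tokens was the d12-tuned default
    raw = base_tokens * depth / base_depth
    return 2 ** round(math.log2(raw))            # device-friendly power of two

print(optimal_total_batch_size(12))   # 524288  -> the old 0.5M default
print(optimal_total_batch_size(26))   # 1048576 -> the ~1M used for the d26 run
```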
Andrej Karpathy
98eed6df18 bring back an assert guarding against bad param sizing 2026-02-05 18:14:30 +00:00
Sofie Van Landeghem
012da1a78b
Typo fixes (#480)
* small typo

* few more small fixes

* small fixes in leaderboard.md
2026-02-05 19:12:50 +01:00
Andrej Karpathy
75b302f331 fix commit hash on leaderboard and clarify a paragraph 2026-02-05 16:14:28 +00:00
Andrej Karpathy
1144d186ed try and fail relu^2 -> swiglu 2026-02-05 02:42:46 +00:00
Andrej Karpathy
d63b7ab9ac try and fail relu^2 -> swiglu 2026-02-05 02:41:46 +00:00
Andrej Karpathy
718e5e9d67 correctly reference NorMuon and fix misleading terms that i may have hastily ported over from modded-nanogpt 2026-02-05 01:39:26 +00:00
dangxingyu
595a0f460a Scale hyperball lr by depth 2026-02-03 21:29:51 -05:00
Andrej Karpathy
542beb0c8c bump speedrun to be the up to date leaderboard run 2026-02-04 02:12:04 +00:00
dangxingyu
924489f582 Update quickrun defaults 2026-02-03 20:46:20 -05:00
dangxingyu
e7ee891c3b Update quickrun script 2026-02-03 20:43:43 -05:00
dangxingyu
a611a85e35 Rename quickrun script 2026-02-03 20:29:55 -05:00