L3 generalizes token embeddings by placing per-token lookup tables inside
the decoder stack. Unlike MoE, routing is static (determined by token ID),
eliminating router training and load-balancing losses.
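The contrast with MoE routing can be made concrete. In this sketch (hypothetical names, not repo code), the MoE path needs a trained router network, while the L3 path is a bare index by token ID:

```python
import torch
import torch.nn as nn

def moe_route(x: torch.Tensor, router: nn.Linear) -> torch.Tensor:
    # learned routing: the router must be trained, and an auxiliary
    # load-balancing loss is usually needed to keep experts utilized
    return router(x).argmax(dim=-1)

def static_route(token_ids: torch.Tensor) -> torch.Tensor:
    # static routing: the token ID *is* the route, so there is
    # nothing to train and no balancing loss to add
    return token_ids
```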
Implementation:
- nanochat/l3.py: LZW allocation algorithm and L3Layer module with
vectorized gather+pad+mask forward pass, tied/untied KV support
- GPT integration: L3 layers sit between decoder blocks, applied
residually (x = x + l3_layer(x, token_ids))
- CLI: --l3-after-layers, --l3-n-emb, --l3-d-up, --l3-k-max flags
with LZW precomputation from training data sample
- 17 tests covering allocation, layer, and GPT integration
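A hedged sketch of what a vectorized gather+pad+mask forward with static per-token slots could look like (the class name, shapes, and the attention-style mixing are illustrative assumptions, not the repo's L3Layer; precomputed slot tables stand in for the LZW allocation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class L3LayerSketch(nn.Module):
    """Each token ID owns up to k_max key/value slots in a shared table.
    Slots are gathered by token ID (static routing), padded slots are
    masked out, and the result is mixed back residually."""

    def __init__(self, n_emb, d_model, slot_ids, slot_mask):
        super().__init__()
        # slot_ids:  (vocab, k_max) long — indices into the shared table
        # slot_mask: (vocab, k_max) bool — True where a slot is real
        self.register_buffer("slot_ids", slot_ids)
        self.register_buffer("slot_mask", slot_mask)
        self.keys = nn.Embedding(n_emb, d_model)
        self.vals = nn.Embedding(n_emb, d_model)

    def forward(self, x, token_ids):
        ids = self.slot_ids[token_ids]            # (B, T, k_max)
        mask = self.slot_mask[token_ids]          # (B, T, k_max)
        k = self.keys(ids)                        # (B, T, k_max, D)
        v = self.vals(ids)                        # (B, T, k_max, D)
        att = (x.unsqueeze(-2) * k).sum(-1)       # (B, T, k_max) scores
        att = att.masked_fill(~mask, float("-inf"))
        w = F.softmax(att, dim=-1)                # padded slots get 0
        return x + (w.unsqueeze(-1) * v).sum(-2)  # residual update
```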
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Accept upstream's architectural changes wholesale:
- argparse replaces configurator.py across all scripts
- Unified MuonAdamW optimizer replaces separate AdamW + Muon
- Sliding window attention (SSSL pattern) + Flash Attention 3
- Value embeddings (ResFormer-style) with per-layer gating
- Per-layer learnable scalars (resid_lambdas, x0_lambdas)
- FP8 training support with Float8Linear
- Scaling laws (Power Lines batch sizing, T_epoch weight decay)
- Checkpoint resumption with dataloader state
- BOS-aligned bestfit-pad packing for SFT
- ChatCORE evaluation metric
- Consolidated base_loss.py into base_eval.py
- Removed mid_train.py (pipeline simplified)
Drop our MoE and tie_embeddings implementations in favor of
upstream's cleaner architecture. These can be re-added later
on top of the new codebase if needed.
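As one concrete illustration of the upstream deltas, the per-layer learnable scalars amount to small learned mixing weights on the residual and x0 streams. A minimal sketch (hypothetical class name; scalar gates per layer are an assumption about the exact form):

```python
import torch
import torch.nn as nn

class ScaledResidualStack(nn.Module):
    """Each layer mixes its block output into the residual stream with a
    learned weight (resid_lambdas) and re-injects the initial embedding
    x0 with another (x0_lambdas)."""

    def __init__(self, blocks: nn.ModuleList):
        super().__init__()
        self.blocks = blocks
        n = len(blocks)
        self.resid_lambdas = nn.Parameter(torch.ones(n))   # residual gate
        self.x0_lambdas = nn.Parameter(torch.zeros(n))     # x0 re-injection

    def forward(self, x0: torch.Tensor) -> torch.Tensor:
        x = x0
        for i, block in enumerate(self.blocks):
            x = self.resid_lambdas[i] * x + self.x0_lambdas[i] * x0 + block(x)
        return x
```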
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Implement weight tying between token embeddings and lm_head to reduce
parameter count. When enabled, logits are scaled by 1/√d_model, lm_head
zeroing is skipped, and optimizer groups are deduplicated. Param counting
uses unique parameters, while the Chinchilla ratio calculation adds back the
would-be lm_head size for comparability.
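A minimal sketch of the tying plus logit scaling described above (hypothetical class name; the 1/√d_model factor follows the commit text):

```python
import torch
import torch.nn as nn

class TiedLMHead(nn.Module):
    """Weight tying: the token embedding matrix doubles as the output
    projection, so no separate lm_head parameter exists. Logits are
    scaled by 1/sqrt(d_model) to compensate for the shared matrix's
    embedding-scale initialization."""

    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.wte = nn.Embedding(vocab_size, d_model)
        self.d_model = d_model

    def logits(self, h: torch.Tensor) -> torch.Tensor:
        # reuse the embedding matrix as the output projection, scaled
        return (h @ self.wte.weight.t()) / self.d_model ** 0.5
```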
Also add boolean flag parsing (--flag without =value) to the configurator,
an auto-derived log_every interval, and minor shell script fixes.
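The boolean-flag behavior can be sketched as a configurator-style override parser (hypothetical function, not the repo's configurator):

```python
import ast

def parse_overrides(argv):
    """'--key=value' sets key to the parsed literal (falling back to a
    plain string), while a bare '--flag' with no '=' sets flag=True."""
    overrides = {}
    for arg in argv:
        if not arg.startswith("--"):
            raise ValueError(f"expected --key[=value], got {arg!r}")
        body = arg[2:]
        if "=" in body:
            key, raw = body.split("=", 1)
            try:
                val = ast.literal_eval(raw)  # numbers, True/False, lists
            except (ValueError, SyntaxError):
                val = raw                    # fall back to plain string
            overrides[key] = val
        else:
            overrides[body] = True           # bare boolean flag
    return overrides
```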
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Store quantized input/weight and their inverse scales in _Float8Matmul ctx to avoid re-quantization in backward and reduce saved-activation memory without changing numerics.
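The caching idea can be sketched with a toy absmax int8-style quantization standing in for FP8 (illustrative, not the Float8Linear kernels): quantize once in forward, save the quantized operands and their scales in ctx, and reuse them in backward so nothing is re-quantized and the full-precision inputs need not be saved.

```python
import torch

class CachedQuantMatmul(torch.autograd.Function):
    """Forward quantizes x and w once and stashes the quantized tensors
    plus scales via ctx.save_for_backward; backward reuses them, so there
    is no second quantization pass and numerics match the forward."""

    @staticmethod
    def forward(ctx, x, w):
        # per-tensor absmax scaling (toy stand-in for FP8 scaling)
        x_scale = x.abs().max().clamp(min=1e-12) / 127.0
        w_scale = w.abs().max().clamp(min=1e-12) / 127.0
        x_q = (x / x_scale).round().clamp(-127, 127)
        w_q = (w / w_scale).round().clamp(-127, 127)
        # save quantized operands + scales; the fp inputs are not saved
        ctx.save_for_backward(x_q, w_q, x_scale, w_scale)
        return (x_q @ w_q.t()) * (x_scale * w_scale)

    @staticmethod
    def backward(ctx, grad_out):
        x_q, w_q, x_scale, w_scale = ctx.saved_tensors
        # reuse the cached quantized tensors (straight-through gradient):
        # dequantized w is w_q * w_scale, dequantized x is x_q * x_scale
        grad_x = (grad_out @ w_q) * w_scale
        grad_w = (grad_out.t() @ x_q) * x_scale
        return grad_x, grad_w
```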