d22 + 6000 iter + bs=1M + warmdown=0.85 + muonclip τ=100
- CORE 0.2646 in 88.2 min (matches Run 6 quality, 10.9% faster wall-clock)
- val_bpb 0.7241
Both warmdown=0.85 and MuonClip individually regress at d22; together they
synergize. MuonClip is the only code addition (66 LOC across optim.py,
gpt.py, and base_train.py); with the default OFF, Run 6 behavior is
preserved bit-for-bit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Minimal single-flag change. When tau > 0, the c_q/c_k weights are pulled into
a dedicated Muon group and rescaled after each Muon step so that the spectral-norm
estimate ||W||_F / sqrt(min_dim) stays <= sqrt(tau). The default tau=0 is a
no-op, bit-identical to v73.
Reference: Kimi K2 paper (arXiv:2507.20534, §A). Caps the max attention logit
at roughly tau.
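For illustration, a minimal sketch of the clip, assuming it operates directly
on the group's weight tensors (the real _apply_qk_clip in nanochat/optim.py
may differ in detail):

    import math
    import torch

    @torch.no_grad()
    def _apply_qk_clip(qk_weights, tau: float) -> None:
        # tau <= 0 disables clipping entirely: bit-identical to the baseline.
        if tau <= 0:
            return
        cap = math.sqrt(tau)
        for w in qk_weights:
            # Cheap spectral-norm estimate: ||W||_F / sqrt(min_dim).
            est = w.norm() / math.sqrt(min(w.shape))
            if est > cap:
                # Rescale so the estimate lands exactly at sqrt(tau).
                w.mul_(cap / est)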
Files touched (3):
nanochat/optim.py: +_apply_qk_clip helper, called after MuonAdamW.step
and DistMuonAdamW.step
nanochat/gpt.py: +muon_qk_clip_tau arg in setup_optimizer; splits c_q/c_k
into a dedicated Muon group when tau > 0
scripts/base_train.py: +--muon-qk-clip-tau CLI arg, threaded to setup_optimizer
Validated overnight (private fork) at d22, 6000 iters:
v73 baseline: val_bpb 0.7242, CORE 0.2714, crosses GPT-2 CORE @ ~81 min
v198 (tau=100): val_bpb 0.7242, CORE 0.2731, crosses GPT-2 CORE @ ~80 min
All other stacked changes (warmdown, lr, warmup) regressed; the tau sweep
(50/100/200) showed a sharp peak at tau=100.
Generalizes across model depths because it's a Muon optimizer-level fix, not a
recipe tweak.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When swapping Float8Linear for Linear in the disable_fp8 context manager,
using device=fp8_module.weight.device directly allocates the new tensors
on the GPU, causing an unnecessary VRAM spike (~1GB for large models).
This fix allocates on device='meta' to avoid any physical memory
allocation, then swaps in a reference to the existing weight tensor,
eliminating the VRAM spike during the evaluation phase.
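A minimal sketch of the idea, assuming Float8Linear exposes the standard
in_features/out_features/weight/bias attributes (the helper name
_to_plain_linear is illustrative):

    import torch.nn as nn

    def _to_plain_linear(fp8_module) -> nn.Linear:
        # device='meta' creates only tensor metadata, so no VRAM is allocated.
        linear = nn.Linear(
            fp8_module.in_features,
            fp8_module.out_features,
            bias=fp8_module.bias is not None,
            device="meta",
        )
        # Swap in references to the existing tensors instead of copying them.
        linear.weight = fp8_module.weight
        if fp8_module.bias is not None:
            linear.bias = fp8_module.bias
        return linear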
Fixes issue #592
Co-authored-by: RoomWithOutRoof <roomwithoutroof@sparklab.ai>
The bf16 cast is intentional for speed on Hopper+ GPUs, but should be
skipped on other platforms rather than blindly applied. fp16 is unstable
here due to its limited exponent range, and fp32 platforms don't benefit
from the cast. Now: bf16 when COMPUTE_DTYPE is bf16, no cast otherwise.
Inspired by PR #667.
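A minimal sketch of the guard; COMPUTE_DTYPE's definition here is
illustrative, standing in for however the repo actually sets it:

    import torch

    # Illustrative stand-in for the repo's COMPUTE_DTYPE.
    COMPUTE_DTYPE = (torch.bfloat16
                     if torch.cuda.is_available() and torch.cuda.is_bf16_supported()
                     else torch.float32)

    def maybe_cast(x: torch.Tensor) -> torch.Tensor:
        # bf16 keeps fp32's exponent range (unlike fp16, which overflows here),
        # and non-bf16 platforms skip the cast entirely.
        if COMPUTE_DTYPE == torch.bfloat16:
            return x.to(torch.bfloat16)
        return x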
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New architectural features (sketched after this list):
- Smear: mix previous token embedding into current position via learned
gate, providing cheap bigram-like info (works in training + KV cache)
- Backout: subtract learned fraction of mid-layer residual before logit
projection to remove low-level features
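Minimal sketches of both ideas; the gate and fraction parameterizations are
assumptions for illustration, not the repo's exact modules:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Smear(nn.Module):
        # Mix the previous token's embedding into the current position via a
        # learned sigmoid gate; the one-token shift behaves identically in
        # training and under a KV cache, since the previous embedding is
        # always available.
        def __init__(self, dim: int):
            super().__init__()
            self.gate = nn.Linear(dim, 1)

        def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, C)
            prev = F.pad(x, (0, 0, 1, 0))[:, :-1]  # shift right by one token
            return x + torch.sigmoid(self.gate(x)) * prev

    class Backout(nn.Module):
        # Subtract a learned fraction of the mid-layer residual stream just
        # before the logit projection, removing low-level features.
        def __init__(self):
            super().__init__()
            self.frac = nn.Parameter(torch.tensor(0.1))

        def forward(self, x_final: torch.Tensor, x_mid: torch.Tensor):
            return x_final - self.frac * x_mid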
Hyperparameter tuning:
- Muon momentum warmdown 0.97→0.90 during the LR warmdown phase (schedule
  sketched below)
- Non-uniform per-layer init: resid_lambdas 1.15→1.05, x0_lambdas 0.20→0.05
- c_fc init scale 0.4x, QK norm scale 1.2, sliding window seq_len/4
- Speedrun data:params ratio reduced to 8
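A sketch of the momentum schedule, assuming a linear ramp over the LR
warmdown phase (the exact schedule shape is an assumption):

    def muon_momentum(step: int, warmdown_start: int, total_steps: int,
                      mom_hi: float = 0.97, mom_lo: float = 0.90) -> float:
        # Hold momentum at 0.97 until LR warmdown begins, then ramp linearly
        # down to 0.90 by the final step.
        if step < warmdown_start:
            return mom_hi
        frac = (step - warmdown_start) / max(1, total_steps - warmdown_start)
        return mom_hi + frac * (mom_lo - mom_hi)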
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* printing the step count
* adding a reply-only loss for chat (see the sketch below)
* using the mask produced by the tokenizer's render_conversation function
* undoing some changes
* putting back a comment that was removed accidentally; no functionality change
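A minimal sketch of the reply-only loss, assuming render_conversation yields
token ids plus a 0/1 mask over assistant-reply tokens; exact shapes and
signatures in the repo may differ:

    import torch
    import torch.nn.functional as F

    def reply_only_loss(logits: torch.Tensor, targets: torch.Tensor,
                        mask: torch.Tensor) -> torch.Tensor:
        # Positions where mask == 0 (user/system tokens) are excluded from
        # the loss via cross_entropy's ignore_index.
        targets = targets.masked_fill(mask == 0, -100)
        return F.cross_entropy(
            logits.view(-1, logits.size(-1)),
            targets.view(-1),
            ignore_index=-100,
        )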