nanochat

mirror of https://github.com/karpathy/nanochat.git synced 2026-04-02 05:35:19 +00:00

Author	SHA1	Message	Date
vivek varikuti	f4162c9daf	Merge `55eb345515` into `7808dc7159`	2026-03-26 06:26:41 +08:00
Andrej	7808dc7159	Merge pull request #595 from svlandeg/fix/typo Small fixes	2026-03-25 14:40:25 -07:00
Andrej	a4ed96687b	Merge pull request #634 from 2bitbit/fix-docs-and-comments fix: correct minor typos in help text, README, and comments	2026-03-25 14:31:49 -07:00
Andrej	7b70f6b411	Merge pull request #639 from mathieu-lacage/master Verified: both paths return identical data (99,842 rows), and all splits under subset='all' load cleanly.	2026-03-25 14:29:30 -07:00
RoomWithOutRoof	47e983eea7	fix: use meta device in disable_fp8 to avoid VRAM spike (#616 ) When swapping Float8Linear to Linear in disable_fp8 context manager, using device=fp8_module.weight.device directly allocates new tensors on GPU, causing unnecessary VRAM spike (~1GB for large models). This fix uses device='meta' to avoid physical memory allocation, then swaps in the weight tensor reference. This eliminates the unnecessary VRAM spike during evaluation phase. Fixes issue #592 Co-authored-by: RoomWithOutRoof <roomwithoutroof@sparklab.ai>	2026-03-25 14:24:57 -07:00
Andrej Karpathy	c0dbf1f3ff	use COMPUTE_DTYPE-aware cast in Muon polar express step The bf16 cast is intentional for speed on Hopper+ GPUs, but should be skipped on other platforms rather than blindly applied. fp16 is unstable here due to its limited exponent range, and fp32 platforms don't benefit from the cast. Now: bf16 when COMPUTE_DTYPE is bf16, no cast otherwise. Inspired by PR #667. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-25 20:19:14 +00:00
Andrej Karpathy	4e1694cc95	bunch of ideas tried from openai/parameter-golf, all negative results for nanochat	2026-03-24 22:13:13 +00:00
Andrej Karpathy	1cd94d768f	bump D:N ratio to 12 per recent scaling laws re-run	2026-03-24 19:25:50 +00:00
Andrej Karpathy	c16db281ff	fix small bug with params logging and batch size	2026-03-24 19:25:34 +00:00
vivekvar-dl	55eb345515	fix: prevent NaN loss in SFT with fully-masked micro-batches (#590 ) ## Problem When running SFT with small device-batch-size (≤8), fully-masked micro-batches cause NaN loss from step 1, corrupting gradients permanently. This happens when a micro-batch contains only 'User' tokens (all targets=-1), especially common with small batch sizes on consumer GPUs. Root cause: torch.nn.functional.cross_entropy with reduction='mean' returns NaN when all labels are -1 (division by zero in mean computation). ## Solution Added validation in the training loop to detect and skip fully-masked batches: - Check (y != -1).any() before computing loss - Skip backward() for batches with no valid targets (zero gradient contribution) - Track skipped batches and warn user if >5% in first 100 steps - Log skipped batches as loss=0 for transparency ## Testing - Added comprehensive test suite (test_sft_masked_batches.py) - Tests cover: fully masked, partially masked, and unmasked batches - Documents cross_entropy behavior with ignore_index=-1 - Validates the fix logic ## Impact - Fixes #590: NaN loss with small batch sizes - No performance impact for normal batches - Helps users on consumer GPUs (RTX 3060, etc.) - Prevents silent gradient corruption Resolves #590	2026-03-22 18:15:40 +00:00
svlandeg	dfe7d39ce8	Merge branch 'master' into fix/typo	2026-03-18 17:01:45 +01:00
Andrej Karpathy	5019accc5b	fix scaling laws scripts after the bigram embeddings were removed	2026-03-17 16:55:56 +00:00
svlandeg	51f42a4406	~1.5h :-)	2026-03-15 22:29:27 +01:00
svlandeg	1f9e42a855	two more typos, from PR 645	2026-03-15 22:27:18 +01:00
svlandeg	bd6e9c8d5f	fix numbering	2026-03-15 22:18:18 +01:00
svlandeg	02e865c2ab	Merge branch 'master' into fix/typo	2026-03-15 22:18:01 +01:00
Andrej Karpathy	1b1cc3c599	submit new time to GPT-2 leaderboard entry: 99 minutes	2026-03-14 17:15:01 +00:00
Andrej Karpathy	a825e63f81	Autoresearch round 2: smear, backout, and hyperparameter tuning New architectural features: - Smear: mix previous token embedding into current position via learned gate, providing cheap bigram-like info (works in training + KV cache) - Backout: subtract learned fraction of mid-layer residual before logit projection to remove low-level features Hyperparameter tuning: - Muon momentum warmdown 0.97→0.90 during LR warmdown phase - Non-uniform per-layer init: resid_lambdas 1.15→1.05, x0_lambdas 0.20→0.05 - c_fc init scale 0.4x, QK norm scale 1.2, sliding window seq_len/4 - Speedrun data:params ratio reduced to 8 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-14 17:03:06 +00:00
svlandeg	6405b26d24	Merge branch 'master' into fix/typo	2026-03-13 13:56:50 +01:00
svlandeg	1052d25d45	we only need to wait 2h now!	2026-03-13 13:46:16 +01:00
Mathieu Lacage	a641b6ca96	MMLU main split is named auxiliary_train, not train	2026-03-13 13:19:10 +01:00
2bitbit	2bb93b2ae4	fix: correct minor typos in help text, README, and comments	2026-03-12 17:03:26 +08:00
svlandeg	d96558bcb0	fix heading, cf #622	2026-03-10 09:57:30 +01:00
Andrej Karpathy	f068604948	new leaderboard entry coming from improvements of autoresearch round 1, time to gpt-2 from 2.02 hours to 1.80 hours	2026-03-10 06:26:39 +00:00
Andrej Karpathy	6ed7d1d82c	All of these improvements were developed by Claude running autonomously over ~2 days using autoresearch. I didn't touch anything - incredible. All tuning was done on d12 but generalized easily to larger models (e.g. d24 in particular). This means we will also get a new "Time to GPT-2" Leaderboard entry, which I will push separately. Optimizer & schedule changes: - Increase unembedding LR 0.004 -> 0.008, weight decay 0.2 -> 0.28 - Per-group Adam betas and weight decay (instead of shared global betas) - Muon beta2 0.95 -> 0.9, momentum warmup target 0.95 -> 0.97 over 400 steps - Warmup: ratio-based -> absolute steps (default 40) - Warmdown ratio 0.5 -> 0.65, final LR fraction 0.0 -> 0.05 - Weight decay schedule: linear -> cosine decay - Polar express norm factor 1.02 -> 1.01 Architecture & init changes: - VE gate: channels 32 -> 12, scale range 2x -> 3x, init small positive - Add post-QK-norm scaling (q,k *= 1.15) for sharper attention - Embedding init std 1.0 -> 0.8, MLP c_fc init 0.5x smaller - RoPE base theta 10K -> 100K - Short attention window: seq_len/2 -> ~seq_len/3 (ceil to 128 tile) - Logit softcap 20 -> 15	2026-03-09 20:45:17 +00:00
svlandeg	f8ff0439b9	two more small typos	2026-03-06 11:03:00 +01:00
Andrej Karpathy	1076f97059	delete autocast, an unnecessary thorn in my side, manage dtypes directly	2026-03-04 23:55:30 +00:00
Sofie Van Landeghem	752abc836e	Ensure that inputs and targets are contiguous (#569 ) * call reshape instead of view in case the tensors are not contiguous * fix directly in data loader instead	2026-03-04 13:58:27 -08:00
Andrej Karpathy	4b4077425b	Document new Leaderboard entry congrats @ddudek for pointing out ClimbMix, time to GPT-2 is now 2.01 hours, down from 2.76 previously	2026-03-04 20:02:07 +00:00
Andrej Karpathy	324e69c45d	big, breaking change but large upside: swap previous FineWeb-EDU dataset to NVIDIA ClimbMix dataset. Requires people to download the data shards. The upside is that training GPT-2 capablity model now only takes ~2 hours, down from 2.76 hours, so this is a huge win data-wise	2026-03-04 19:47:12 +00:00
Andrej Karpathy	b07604ebaa	document the legacy fineweb100b dataset and the new climbmix400b dataset	2026-03-03 17:24:31 +00:00
Andrej Karpathy	aba30cb037	tune logit softcap?	2026-03-03 00:38:53 +00:00
Anish	83dccc20ae	Restore completion-only loss masking in SFT dataloader (#582 ) * printing steps count * adding reply only loss for chat * using the mask by render_conversation function of tokeniser * undoing some changes * putting back the comment which got removed accidently, no functionality change	2026-03-02 16:37:47 -08:00
Dipesh Babu	c7ba252142	docs: fix typos in experiment log (#547 )	2026-02-20 08:03:45 -08:00
Andrej Karpathy	2dffdc8cf6	document MoE exploration	2026-02-19 02:53:47 +00:00
Andrej Karpathy	48804bff3a	report negative result on fineweb dataset	2026-02-18 23:45:31 +00:00
Andrej Karpathy	bb5137860e	fix comment	2026-02-18 23:26:22 +00:00
Andrej Karpathy	458555117b	Merge branch 'Chetter2-patch-1'	2026-02-18 23:17:39 +00:00
Andrej Karpathy	bac5a35dd7	fix minor bug in fp8 application to skip tiny matmuls	2026-02-18 23:17:29 +00:00
George Shakan	ad55575326	Fix bug in setting precision (#538 )	2026-02-18 15:49:18 +00:00
Sofie Van Landeghem	cac43e8511	Fix MockModel's device definition (#535 ) * fix MockModel's device definition * cleanup	2026-02-18 15:49:18 +00:00
Andrej Karpathy	f5fe7925ed	update dev log with recent	2026-02-18 15:49:18 +00:00
Andrej Karpathy	1415fb7617	tune the data mixture a bit, load optimizer by default when SFT. These were confirmed to be best settings from sweeps of sft	2026-02-18 15:49:18 +00:00
Andrej Karpathy	77f8fb8303	a number of upgrades to SFT script to bring it up to date w.r.t. pretraining and tuning some of its kwargs based on sweeps	2026-02-18 15:49:18 +00:00
George Shakan	0a23f87643	Fix bug in setting precision (#538 )	2026-02-18 07:42:11 -08:00
Sofie Van Landeghem	4800c62f6e	Fix MockModel's device definition (#535 ) * fix MockModel's device definition * cleanup	2026-02-17 16:03:46 -08:00
Andrej Karpathy	4a6e47b0c6	update dev log with recent	2026-02-17 15:44:54 +00:00
Andrej Karpathy	8180e1d8c1	tune the data mixture a bit, load optimizer by default when SFT. These were confirmed to be best settings from sweeps of sft	2026-02-16 20:23:04 +00:00
Andrej Karpathy	788dadeb88	a number of upgrades to SFT script to bring it up to date w.r.t. pretraining and tuning some of its kwargs based on sweeps	2026-02-16 14:41:53 +00:00
Alan	124f49be98	Removed redundant qunatization of gradients	2026-02-15 15:41:33 +00:00

1 2 3 4 5 ...

373 Commits