nanochat

mirror of https://github.com/karpathy/nanochat.git synced 2026-06-18 12:09:09 +00:00

Author	SHA1	Message	Date
Junyang Chen	eecf352d95	Merge `767df6ef61` into `1076f97059`	2026-03-05 10:51:50 +01:00
Andrej Karpathy	1076f97059	delete autocast, an unnecessary thorn in my side, manage dtypes directly	2026-03-04 23:55:30 +00:00
Sofie Van Landeghem	752abc836e	Ensure that inputs and targets are contiguous (#569 ) * call reshape instead of view in case the tensors are not contiguous * fix directly in data loader instead	2026-03-04 13:58:27 -08:00
Andrej Karpathy	4b4077425b	Document new Leaderboard entry congrats @ddudek for pointing out ClimbMix, time to GPT-2 is now 2.01 hours, down from 2.76 previously	2026-03-04 20:02:07 +00:00
Andrej Karpathy	324e69c45d	big, breaking change but large upside: swap previous FineWeb-EDU dataset to NVIDIA ClimbMix dataset. Requires people to download the data shards. The upside is that training GPT-2 capablity model now only takes ~2 hours, down from 2.76 hours, so this is a huge win data-wise	2026-03-04 19:47:12 +00:00
Andrej Karpathy	b07604ebaa	document the legacy fineweb100b dataset and the new climbmix400b dataset	2026-03-03 17:24:31 +00:00
Andrej Karpathy	aba30cb037	tune logit softcap?	2026-03-03 00:38:53 +00:00
Anish	83dccc20ae	Restore completion-only loss masking in SFT dataloader (#582 ) * printing steps count * adding reply only loss for chat * using the mask by render_conversation function of tokeniser * undoing some changes * putting back the comment which got removed accidently, no functionality change	2026-03-02 16:37:47 -08:00
Dipesh Babu	c7ba252142	docs: fix typos in experiment log (#547 )	2026-02-20 08:03:45 -08:00
Junyang Chen	767df6ef61	dataloader: reuse cropped remainders to reduce token waste ~35% -> ~23% When the BestFit-Crop algorithm crops a document to fill remaining row space, the leftover tokens are currently discarded. This change puts the remainder (with BOS prepended) back into the document buffer for future rows. Simulation results at T=2048 with realistic document length distribution: - Source token consumption reduced by ~15% - Data efficiency improved by ~1.18x - Estimated ~28 minutes saved on d24 speedrun (3.04h -> ~2.57h) The change is minimal (6 lines in the crop branch) and preserves all existing properties: BOS-aligned rows, 100% utilization, deterministic packing order.	2026-02-18 23:04:43 -08:00
Andrej Karpathy	2dffdc8cf6	document MoE exploration	2026-02-19 02:53:47 +00:00
Andrej Karpathy	48804bff3a	report negative result on fineweb dataset	2026-02-18 23:45:31 +00:00
Andrej Karpathy	bb5137860e	fix comment	2026-02-18 23:26:22 +00:00
Andrej Karpathy	458555117b	Merge branch 'Chetter2-patch-1'	2026-02-18 23:17:39 +00:00
Andrej Karpathy	bac5a35dd7	fix minor bug in fp8 application to skip tiny matmuls	2026-02-18 23:17:29 +00:00
George Shakan	ad55575326	Fix bug in setting precision (#538 )	2026-02-18 15:49:18 +00:00
Sofie Van Landeghem	cac43e8511	Fix MockModel's device definition (#535 ) * fix MockModel's device definition * cleanup	2026-02-18 15:49:18 +00:00
Andrej Karpathy	f5fe7925ed	update dev log with recent	2026-02-18 15:49:18 +00:00
Andrej Karpathy	1415fb7617	tune the data mixture a bit, load optimizer by default when SFT. These were confirmed to be best settings from sweeps of sft	2026-02-18 15:49:18 +00:00
Andrej Karpathy	77f8fb8303	a number of upgrades to SFT script to bring it up to date w.r.t. pretraining and tuning some of its kwargs based on sweeps	2026-02-18 15:49:18 +00:00
George Shakan	0a23f87643	Fix bug in setting precision (#538 )	2026-02-18 07:42:11 -08:00
Sofie Van Landeghem	4800c62f6e	Fix MockModel's device definition (#535 ) * fix MockModel's device definition * cleanup	2026-02-17 16:03:46 -08:00
Andrej Karpathy	4a6e47b0c6	update dev log with recent	2026-02-17 15:44:54 +00:00
Andrej Karpathy	8180e1d8c1	tune the data mixture a bit, load optimizer by default when SFT. These were confirmed to be best settings from sweeps of sft	2026-02-16 20:23:04 +00:00
Andrej Karpathy	788dadeb88	a number of upgrades to SFT script to bring it up to date w.r.t. pretraining and tuning some of its kwargs based on sweeps	2026-02-16 14:41:53 +00:00
Alan	124f49be98	Removed redundant qunatization of gradients	2026-02-15 15:41:33 +00:00
Alan	d9678ff0f9	Save FP8 tensors in autograd ctx instead of full-precision inputs Store quantized input/weight and their inverse scales in _Float8Matmul ctx to avoid re-quantization in backward and reduce saved-activation memory without changing numerics.	2026-02-15 14:31:54 +00:00
Andrej Karpathy	2f09686724	clarify that this is bf16 mfu we're talking about	2026-02-10 23:35:00 +00:00
Andrej Karpathy	e569b59f92	delete torchao dependency, create our own exact API-matched version of Float8Linear, document it very well. for some poorly understood reason, the performance is not only ~identical but actually runs 3% faster. despite of it being significantly simpler and much less code. i don't fully understand why/how atm	2026-02-10 18:46:39 +00:00
Andrej Karpathy	1ec0a34779	at 28 and above we start to need batch size 8	2026-02-08 18:26:34 +00:00
Andrej Karpathy	ff46300720	tune miniseries just a bit, fairly cosmetic, keep to even depths where the math works out nicely in model sizing	2026-02-08 17:54:12 +00:00
Andrej Karpathy	aeff095e97	better comments/flow on all the hyperparameter transfer stuff, and change the WD scaling from my empirical 1/d^2 to a bit more principled version based on Tepoch. All of that theory is based on AdamW and could be suboptimal for Muon	2026-02-06 19:22:28 +00:00
Andrej Karpathy	685271dc8d	new optimal ratio for d26 training	2026-02-06 19:21:27 +00:00
Andrej Karpathy	e527521a3f	briefly mention batch ramp experimentation too, too weak to merge in my few attempts	2026-02-05 22:21:03 +00:00
Andrej Karpathy	96522798f1	docs docs docs	2026-02-05 20:27:07 +00:00
Andrej Karpathy	5fdd5cdb24	new leaderboard record via new auto-calculated optimal batch size. for d26 it is 1M, up from 0.5M that was default earlier	2026-02-05 20:11:32 +00:00
Andrej Karpathy	2c062aaa94	nit: don't mutate args, create new var for total_batch_size	2026-02-05 19:59:46 +00:00
Andrej Karpathy	f41dd3cbd7	auto-calculate optimal batch size. the original setting of 0.5M was only optimal for d12, but d26 prefers 1M and so on	2026-02-05 19:40:37 +00:00
Andrej Karpathy	98eed6df18	bring back an assert guarding against bad param sizing	2026-02-05 18:14:30 +00:00
Sofie Van Landeghem	012da1a78b	Typo fixes (#480 ) * small typo * few more small fixes * small fixes in leaderboard.md	2026-02-05 19:12:50 +01:00
Andrej Karpathy	75b302f331	fix hash commit on leaderboard and a paragraph clarification	2026-02-05 16:14:28 +00:00
Andrej Karpathy	1144d186ed	try and fail relu^2 -> swiglu	2026-02-05 02:42:46 +00:00
Andrej Karpathy	d63b7ab9ac	try and fail relu^2 -> swiglu	2026-02-05 02:41:46 +00:00
Andrej Karpathy	718e5e9d67	correctly reference NorMuon and fix misleading terms that i may have hastily ported over from modded-nanogpt	2026-02-05 01:39:26 +00:00
Andrej Karpathy	542beb0c8c	bump speedrun to be the up to date leaderboard run	2026-02-04 02:12:04 +00:00
Andrej Karpathy	d510b1385b	quick experiments to log	2026-02-03 23:21:39 +00:00
Andrej Karpathy	16b8ac7da3	oops forgot to attach leaderboard file too	2026-02-03 21:06:12 +00:00
Andrej Karpathy	fe55b092b8	minor cosmetics for the table	2026-02-03 21:05:28 +00:00
Andrej Karpathy	a67eba35dc	add feb2 new leaderboard record from upgrading to fp8 training, +4.3% speedup to time to GPT-2	2026-02-03 21:03:42 +00:00
Andrej Karpathy	6079f78fc3	add fp8 training with torchao	2026-02-03 21:03:42 +00:00

1 2 3 4 5 ...

349 Commits