Sermet Pekin
4cd1d3aeb5
Merge 3735eb9723 into 8180e1d8c1
2026-02-16 23:12:55 +01:00
Andrej Karpathy
8180e1d8c1
tune the data mixture a bit, load the optimizer by default when doing SFT. These were confirmed to be the best settings from SFT sweeps
2026-02-16 20:23:04 +00:00
Andrej Karpathy
788dadeb88
a number of upgrades to the SFT script to bring it up to date w.r.t. pretraining, and tune some of its kwargs based on sweeps
2026-02-16 14:41:53 +00:00
Sermet Pekin
3735eb9723
simplify test.yml
- one command for both platforms
2026-02-16 16:41:56 +03:00
Sermet Pekin
7686d3c7e2
Update test.yml
- adds manual trigger for testing on GH actions
- adds --extra cpu parameter to uv run
2026-02-16 15:33:48 +03:00
Sermet Pekin
3185f928d7
Update .github/workflows/test.yml
- trigger tests on push and pull request only on branch master
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2026-02-16 13:47:08 +03:00
Sermet Pekin
4c5cf36035
Update test.yml
2026-02-16 11:01:16 +03:00
Sermet Pekin
90cdd1b5b8
Update test.yml
2026-02-16 10:58:39 +03:00
Sermet Pekin
dd147b018d
Create test.yml
2026-02-16 10:45:06 +03:00
Andrej Karpathy
2f09686724
clarify that this is bf16 mfu we're talking about
2026-02-10 23:35:00 +00:00
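(For context on the commit above: bf16 MFU is the achieved FLOP/s as a fraction of the GPU's peak dense bf16 FLOP/s. A minimal hedged sketch follows; the 6*N*T FLOPs-per-step estimate and the H100 peak figure are illustrative assumptions, not the repo's exact accounting.)

    # Hedged sketch: MFU as a fraction of peak bf16 throughput.
    # The 6*N*T per-step FLOPs estimate and the 989 TFLOPS H100 bf16 peak are assumptions.
    def bf16_mfu(num_params, tokens_per_step, step_time_s, peak_bf16_flops=989e12):
        flops_per_step = 6 * num_params * tokens_per_step  # rough forward + backward estimate
        achieved_flops = flops_per_step / step_time_s
        return achieved_flops / peak_bf16_flops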
Andrej Karpathy
e569b59f92
delete the torchao dependency, create our own exact API-matched version of Float8Linear, and document it very well. For some poorly understood reason, the performance is not only ~identical but actually 3% faster, despite it being significantly simpler and much less code. I don't fully understand why/how atm
2026-02-10 18:46:39 +00:00
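(For readers curious what an API-matched Float8Linear might look like conceptually: below is a minimal, hedged sketch of per-tensor fp8 scaling around a linear layer. It simulates the quantize/dequantize step in the original precision rather than calling fused fp8 matmul kernels, and is not the repo's actual implementation.)

    import torch
    import torch.nn as nn

    F8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0

    def to_float8(x):
        # Per-tensor dynamic scaling into e4m3; returns the fp8 tensor and its scale.
        scale = F8_MAX / x.abs().max().clamp(min=1e-12)
        return (x * scale).clamp(-F8_MAX, F8_MAX).to(torch.float8_e4m3fn), scale

    class SketchFloat8Linear(nn.Linear):
        # Conceptual stand-in, NOT nanochat's Float8Linear: quantize weight and
        # input to fp8, dequantize, then run the matmul in the original dtype.
        def forward(self, input):
            w8, w_scale = to_float8(self.weight)
            x8, x_scale = to_float8(input)
            w = w8.to(input.dtype) / w_scale
            x = x8.to(input.dtype) / x_scale
            return nn.functional.linear(x, w, self.bias)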
Andrej Karpathy
1ec0a34779
at depth 28 and above we start to need batch size 8
2026-02-08 18:26:34 +00:00
Andrej Karpathy
ff46300720
tune miniseries just a bit, fairly cosmetic, keep to even depths where the math works out nicely in model sizing
2026-02-08 17:54:12 +00:00
Andrej Karpathy
aeff095e97
better comments/flow on all the hyperparameter transfer stuff, and change the WD scaling from my empirical 1/d^2 to a bit more principled version based on Tepoch. All of that theory is based on AdamW and could be suboptimal for Muon
2026-02-06 19:22:28 +00:00
Andrej Karpathy
685271dc8d
new optimal ratio for d26 training
2026-02-06 19:21:27 +00:00
Andrej Karpathy
e527521a3f
briefly mention batch ramp experimentation too; too weak to merge in my few attempts
2026-02-05 22:21:03 +00:00
Andrej Karpathy
96522798f1
docs docs docs
2026-02-05 20:27:07 +00:00
Andrej Karpathy
5fdd5cdb24
new leaderboard record via new auto-calculated optimal batch size. for d26 it is 1M, up from 0.5M that was default earlier
2026-02-05 20:11:32 +00:00
Andrej Karpathy
2c062aaa94
nit: don't mutate args, create new var for total_batch_size
2026-02-05 19:59:46 +00:00
Andrej Karpathy
f41dd3cbd7
auto-calculate optimal batch size. the original setting of 0.5M was only optimal for d12, but d26 prefers 1M and so on
2026-02-05 19:40:37 +00:00
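(A hedged illustration of the idea in the commit above, not the repo's actual rule: derive the total batch size from model depth instead of hardcoding 0.5M. The linear-in-depth scaling and power-of-two rounding below are assumptions chosen to reproduce the two data points mentioned, 0.5M at d12 and 1M at d26.)

    import math

    # Hypothetical sketch: grow total batch size with depth, snapped to a power of two.
    def auto_total_batch_size(depth, base_depth=12, base_tokens=524288):  # 0.5M at d12
        raw = base_tokens * (depth / base_depth)   # assumed linear-in-depth scaling
        return 2 ** round(math.log2(raw))          # d26 -> 2**20 = 1M tokens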
Andrej Karpathy
98eed6df18
bring back an assert guarding against bad param sizing
2026-02-05 18:14:30 +00:00
Sofie Van Landeghem
012da1a78b
Typo fixes (#480)
* small typo
* few more small fixes
* small fixes in leaderboard.md
2026-02-05 19:12:50 +01:00
Andrej Karpathy
75b302f331
fix hash commit on leaderboard and a paragraph clarification
2026-02-05 16:14:28 +00:00
Andrej Karpathy
1144d186ed
try and fail relu^2 -> swiglu
2026-02-05 02:42:46 +00:00
Andrej Karpathy
d63b7ab9ac
try and fail relu^2 -> swiglu
2026-02-05 02:41:46 +00:00
Andrej Karpathy
718e5e9d67
correctly reference NorMuon and fix misleading terms that i may have hastily ported over from modded-nanogpt
2026-02-05 01:39:26 +00:00
Andrej Karpathy
542beb0c8c
bump speedrun to be the up-to-date leaderboard run
2026-02-04 02:12:04 +00:00
Andrej Karpathy
d510b1385b
quick experiments to log
2026-02-03 23:21:39 +00:00
Andrej Karpathy
16b8ac7da3
oops forgot to attach leaderboard file too
2026-02-03 21:06:12 +00:00
Andrej Karpathy
fe55b092b8
minor cosmetics for the table
2026-02-03 21:05:28 +00:00
Andrej Karpathy
a67eba35dc
add Feb 2 new leaderboard record from upgrading to fp8 training, +4.3% speedup in time to GPT-2
2026-02-03 21:03:42 +00:00
Andrej Karpathy
6079f78fc3
add fp8 training with torchao
2026-02-03 21:03:42 +00:00
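(This is the torchao-based route that the Feb 10 commit further up later replaced with the hand-rolled Float8Linear. A hedged sketch of how torchao's fp8 conversion is typically applied; the module filter shown is an illustrative assumption, and the exact API may differ across torchao versions.)

    import torch.nn as nn
    # Assumes torchao's float8 training API is available:
    from torchao.float8 import convert_to_float8_training

    def enable_fp8(model):
        # Swap eligible nn.Linear modules for fp8 training variants; skipping the
        # output head here is an illustrative choice, not necessarily the repo's.
        def module_filter_fn(module, fqn):
            return isinstance(module, nn.Linear) and "lm_head" not in fqn
        convert_to_float8_training(model, module_filter_fn=module_filter_fn)
        return model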
Andrej Karpathy
8ebc14b348
small touchups to the eval script, re-order items etc, cosmetic
2026-02-03 21:03:42 +00:00
Sofie Van Landeghem
72b9064f9d
remove leftover mid references (#491)
2026-02-02 08:33:46 -08:00
Andrej Karpathy
b19b4f3e49
fix bug in speedrun script, batch size that doesn't OOM on 8XH100 for d24 is 16
2026-02-02 15:50:14 +00:00
Andrej Karpathy
230d6cf6c6
tune the synthetic data generation script. delete the king andrej stuff lol. also, upgrade to gemini 3
2026-02-02 01:45:59 +00:00
Andrej Karpathy
07c4dd4cd9
manually control the over-active garbage collector, saving a few minutes from a typical run
2026-02-02 01:44:30 +00:00
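(A hedged sketch of the garbage-collector idea in the commit above: turn off automatic collection and collect on a schedule we control. The training-loop names and the step interval are placeholders, not the repo's code.)

    import gc

    gc.disable()   # stop automatic, unpredictably timed GC pauses
    gc.collect()
    gc.freeze()    # move surviving startup objects out of future scans

    for step in range(num_steps):        # num_steps / train_step are placeholders
        train_step()
        if step % 1000 == 0:             # illustrative interval: collect on our own terms
            gc.collect()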
Andrej Karpathy
e8fec97d4c
slightly more efficient dataloader that reduces the number of python objects flying around and causing strain on runtime and garbage collector
2026-02-02 01:17:30 +00:00
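(A hedged sketch of the general technique behind the commit above: read token shards through a memory map and hand out a few tensors per batch instead of building per-token Python objects. Paths, dtypes, and shapes are placeholders, not the repo's dataloader.)

    import numpy as np
    import torch

    def iter_batches(shard_path, batch_size, seq_len):
        # One mmap per shard; each batch creates only a handful of objects
        # (rather than per-token ones), easing pressure on the runtime and GC.
        tokens = np.memmap(shard_path, dtype=np.uint16, mode="r")
        tokens_per_batch = batch_size * (seq_len + 1)
        for start in range(0, len(tokens) - tokens_per_batch, tokens_per_batch):
            buf = torch.from_numpy(tokens[start:start + tokens_per_batch].astype(np.int64))
            buf = buf.view(batch_size, seq_len + 1)
            yield buf[:, :-1], buf[:, 1:]   # inputs, targets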
Andrej Karpathy
8b4849d548
fix bug in chat_sft, the attention window must be preserved sigh
2026-02-01 20:58:44 +00:00
Andrej Karpathy
eaf49a33c8
fix path which I think was modified during the refactor; this is a bug introduced by Claude, I believe
2026-02-01 20:15:19 +00:00
Andrej Karpathy
31b61d2d17
fix broken import sigh
2026-02-01 05:03:44 +00:00
Sofie Van Landeghem
4d6415b8ef
use _PEAK_FLOPS_TABLE instead of if-else structure (#479)
2026-01-31 19:45:06 -08:00
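(The pattern in that PR, sketched in a hedged way: a lookup table keyed by device name replaces an if/else chain. Device names and peak numbers below are illustrative, not the repo's exact table.)

    # Illustrative dense bf16 peak FLOP/s values; the repo's table may differ.
    _PEAK_FLOPS_TABLE = {
        "H100": 989e12,
        "A100": 312e12,
    }

    def peak_flops(device_name):
        for key, flops in _PEAK_FLOPS_TABLE.items():
            if key in device_name:          # e.g. "NVIDIA H100 80GB HBM3"
                return flops
        raise ValueError(f"unknown device: {device_name}")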
Sofie Van Landeghem
43078c347e
clean up original tokenizing_distributed_data_loader (#478)
2026-01-31 19:44:12 -08:00
Franci Penov
dc291c627f
Add Blackwell (SM100) GPU support via SDPA fallback (#475)
2026-01-31 19:42:58 -08:00
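(A hedged sketch of the fallback idea in that PR: detect the compute capability and route attention through PyTorch SDPA when the preferred kernel isn't available on that architecture. The threshold check and the fused_attention placeholder are assumptions about the shape of the code, not the PR's exact logic.)

    import torch
    import torch.nn.functional as F

    def needs_sdpa_fallback():
        # Blackwell (SM100) reports compute capability (10, 0).
        major, _ = torch.cuda.get_device_capability()
        return major >= 10

    def attention(q, k, v):
        if needs_sdpa_fallback():
            # Generic SDPA path works on SM100 even where fused kernels don't.
            return F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return fused_attention(q, k, v)   # placeholder for the preferred fast path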
Andrej Karpathy
0307997f9b
merge two files base_loss and base_eval into a single file, it's nicer this way, and unify the huggingface code associated with both
2026-02-01 02:36:43 +00:00
Andrej Karpathy
1ddaad1c1c
nuke midtraining from orbit, it's not as needed now that we have a BOS-aligned dataloader. Also change the README a lot. Midtraining is not yet fully erased across the board, but good enough for step 1
2026-01-31 19:12:25 +00:00
Andrej Karpathy
348fbb301b
fix dataloader for midtrain to never crop data. we can't just throw it away like we do in pretraining
2026-01-31 18:21:36 +00:00
Andrej Karpathy
3c3a3d7042
warmdown of 0.5 is slightly better
2026-01-31 01:08:44 +00:00
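(For context on the commit above: "warmdown" is presumably the final linear decay portion of the LR schedule, and 0.5 the fraction of training it spans. A hedged sketch of such a schedule; the warmup length and decay-to-zero choice are assumptions, not the repo's settings.)

    def lr_multiplier(step, num_steps, warmup_steps=256, warmdown_frac=0.5):
        # Hypothetical schedule: short linear warmup, flat middle, then a linear
        # warmdown over the final `warmdown_frac` of training.
        if step < warmup_steps:
            return (step + 1) / warmup_steps
        warmdown_start = int(num_steps * (1 - warmdown_frac))
        if step < warmdown_start:
            return 1.0
        return max(0.0, (num_steps - step) / (num_steps - warmdown_start))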
Andrei Panferov
4d8dbaf6e0
Fix escape character in README bibtex entry (#454)
2026-01-30 09:34:02 -08:00
Andrej Karpathy
3ba42e8135
Fix SDPA KV-cache decode to respect sliding window (#456)
SDPA fallback now respects sliding window during single-token KV-cache
decode by slicing K/V to the last (window + 1) tokens.
Also simplifies the mask building for chunk inference to properly apply
sliding window in that path as well.
Fixes #452
Co-Authored-By: Kartik Vashishta <kartikv776@gmail.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-30 17:32:12 +00:00
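(A hedged sketch of the slicing trick described in that PR: during single-token decode, restrict the cached K/V to the last window + 1 positions before calling SDPA, so no explicit mask is needed. The (batch, heads, time, head_dim) tensor layout is an assumption.)

    import torch.nn.functional as F

    def decode_step(q, k_cache, v_cache, window):
        # q is the single new token's query: (B, H, 1, D).
        # Only the last (window + 1) cached positions may attend to the new token.
        k = k_cache[:, :, -(window + 1):, :]
        v = v_cache[:, :, -(window + 1):, :]
        # No mask needed: every remaining key lies within the sliding window.
        return F.scaled_dot_product_attention(q, k, v)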