Andrej Karpathy
63bb5831e2
something i've wanted to do for a while - move all .sh runs to their own directory so they don't pollute root dir
2026-01-18 15:27:41 +00:00
Andrej Karpathy
d58fcd9d73
log for jan 17
2026-01-18 03:01:17 +00:00
karpathy
f9a7e0f111
update the CPU/MPS script to give reasonable results. The model can at least answer that Paris is the capital of France and knows that the sky is blue, after about 40 minutes of training on my macbook. Also fixed a bug caused by a bfloat16 dtype assumption in the KVCache
2026-01-17 12:27:30 -08:00
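The dtype fix above suggests a simple pattern. Below is a minimal sketch, assuming a lazily allocated cache; the class and field names are illustrative, not nanochat's actual KVCache. The point: take the dtype and device from the incoming keys instead of hard-coding bfloat16, which is what breaks on CPU/MPS.

```python
# Minimal sketch (assumed names, not nanochat's actual KVCache): allocate the
# cache with the dtype/device of the first keys seen, not a hard-coded bfloat16.
import torch

class KVCache:
    def __init__(self, batch, heads, max_seq, head_dim):
        self.shape = (2, batch, heads, max_seq, head_dim)
        self.cache = None  # allocated lazily, once the real dtype is known
        self.pos = 0

    def insert(self, k, v):
        if self.cache is None:
            # previously: dtype=torch.bfloat16 hard-coded, which breaks on
            # CPU/MPS where the model runs in float32/float16
            self.cache = torch.empty(self.shape, dtype=k.dtype, device=k.device)
        t = k.size(2)  # k, v: (batch, heads, t, head_dim)
        self.cache[0, :, :, self.pos:self.pos + t] = k
        self.cache[1, :, :, self.pos:self.pos + t] = v
        self.pos += t
        return self.cache[0, :, :, :self.pos], self.cache[1, :, :, :self.pos]
```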
Yury Kirpichev
77a46902e4
Fix WANDB_RUN parameter passing in runcpu.sh ( #407 )
- Add --run=$WANDB_RUN to base_train, mid_train, and chat_sft calls
- Ensures wandb logging works when WANDB_RUN environment variable is set
- Matches the behavior in speedrun.sh
Co-authored-by: svlandeg <svlandeg@github.com>
2026-01-16 18:59:44 -08:00
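On the Python side, the plumbing this fix relies on presumably looks something like the hedged sketch below; the --run flag name matches the commit, but the "dummy" sentinel and the project name are assumptions for illustration, not nanochat's exact code.

```python
# Hedged sketch of the receiving end of --run=$WANDB_RUN; the "dummy" default
# and project name are assumptions for illustration.
import argparse
import wandb

parser = argparse.ArgumentParser()
parser.add_argument("--run", type=str, default="dummy", help="wandb run name")
args = parser.parse_args()

# only log to wandb when an actual run name was passed through
if args.run != "dummy":
    wandb.init(project="nanochat", name=args.run)
```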
Andrej Karpathy
1933e85046
brief update to log
2026-01-17 00:25:50 +00:00
Andrej Karpathy
184d4c12b1
also add to log about the FA3 changes
2026-01-16 18:25:04 +00:00
Andrej Karpathy
fbf2bbea25
update log with a bunch of attempts
2026-01-16 02:21:17 +00:00
Andrej Karpathy
747ed4491f
add negative result on olmo3 pretraining mix
2026-01-16 00:44:01 +00:00
Andrej Karpathy
7312ec9898
fix buggy midtrain and update all kwargs to be idiomatic. that is, argparse uses dashes, variables use underscores. the underscores are just a remnant of the previous Configurator object. This is the right way
2026-01-13 22:45:27 +00:00
Andrej Karpathy
f92efce169
add negative result about not allowing attention across BOS tokens. A lot more code complexity for basically no gain in performance
2026-01-13 21:33:54 +00:00
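For concreteness, a generic reconstruction of the rejected idea, not the code that was actually tried: a causal mask that additionally forbids attention across BOS-delimited document boundaries.

```python
# Sketch of the rejected idea: tokens may only attend within their own
# BOS-delimited document (plus the usual causal constraint).
import torch

def doc_causal_mask(tokens, bos_id):
    T = tokens.size(0)
    # doc_id[t] = number of BOS tokens at or before position t, so all
    # positions within one document share the same id
    doc_id = (tokens == bos_id).cumsum(0)
    causal = torch.tril(torch.ones(T, T, dtype=torch.bool))
    same_doc = doc_id[:, None] == doc_id[None, :]
    return causal & same_doc  # True where attention is allowed

tokens = torch.tensor([1, 5, 6, 1, 7, 8])  # 1 = BOS, two short documents
print(doc_causal_mask(tokens, bos_id=1))
```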
Andrej Karpathy
43c29dd9d5
Big DataLoader refactor: BOS-aligned dataloaders with epoch tracking for pre/mid-training
The new DataLoader ensures that every token sequence in train/val batches has a BOS token
at the beginning. Therefore, no token streams start abruptly in the middle of a document,
which could be confusing for the model. Note that this changes the loss scale because there
are fewer confusing tokens in the train/val batches. The main downside is that we now waste
about 35% of tokens due to cropping. This is ok because we have a lot of data. See dev/LOG.md
entry for this change for a lot more information.
2026-01-13 20:05:47 +00:00
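A minimal sketch of BOS-aligned batching, under the assumption of a flat token stream (a plain list of ids) long enough to fill the batch, with illustrative names: every row starts at a BOS, and the tokens skipped between the end of one row and the next BOS are where the roughly 35% cropping waste comes from.

```python
# Minimal sketch of BOS-aligned batching; names and structure are
# illustrative, not nanochat's actual DataLoader.
import torch

def bos_aligned_batch(stream, bos_id, B, T):
    rows, i = [], 0
    while len(rows) < B:
        while stream[i] != bos_id:  # advance to the next BOS so no row starts
            i += 1                  # mid-document; these skipped tokens are
                                    # the cropping waste
        rows.append(torch.tensor(stream[i:i + T + 1]))  # +1 for shifted target
        i += T + 1
    x = torch.stack([r[:-1] for r in rows])  # inputs
    y = torch.stack([r[1:] for r in rows])   # next-token targets
    return x, y
```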
Andrej Karpathy
64b48d0e5c
validated that \p{N}{1,2} is the correct number of digits to group up to in the regex pattern of the GPT-4 tokenizer (2 down from 3), leading to the best val_bpb for 32K vocabs
2026-01-13 17:45:06 +00:00
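The effect of the change is easy to see with the third-party `regex` module (which, unlike `re`, supports \p{N}): with {1,2}, pre-tokenization splits digit runs into chunks of at most two.

```python
# Demonstration of the digit-grouping change in the pre-tokenization pattern.
import regex

print(regex.findall(r"\p{N}{1,3}", "1234567"))  # ['123', '456', '7']   (old, GPT-4 style)
print(regex.findall(r"\p{N}{1,2}", "1234567"))  # ['12', '34', '56', '7'] (new default)
```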
Andrej Karpathy
238353c998
document my struggle with fp8 integration yesterday, it's not working like i thought it would and i suffered. one day i will return to continue the fight.
2026-01-13 17:14:29 +00:00
Andrej Karpathy
4610a838a1
record negative result on MTP
2026-01-12 05:23:47 +00:00
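MTP here is multi-token prediction. As an assumption-level sketch of the kind of setup such an experiment typically involves (names, shapes, and the loss weight are illustrative; the commit records only that it did not help): an auxiliary head predicts the token two positions ahead, and its loss is added with a small weight.

```python
# Assumption-level sketch of a multi-token prediction (MTP) loss; all names
# are illustrative.
import torch.nn.functional as F

def mtp_loss(hidden, lm_head, mtp_head, tokens, alpha=0.3):
    # standard next-token loss: hidden[t] predicts tokens[t+1]
    next_logits = lm_head(hidden[:, :-1])                     # (B, T-1, vocab)
    loss_next = F.cross_entropy(next_logits.transpose(1, 2), tokens[:, 1:])
    # auxiliary MTP loss: hidden[t] also predicts tokens[t+2]
    mtp_logits = mtp_head(hidden[:, :-2])                     # (B, T-2, vocab)
    loss_mtp = F.cross_entropy(mtp_logits.transpose(1, 2), tokens[:, 2:])
    return loss_next + alpha * loss_mtp
```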
Andrej Karpathy
fbc1484e8c
add alternating window size patterns for the GPT layers, following GPT-3. Experimented a bit and found the pattern SSSL to work well - 3 short, 1 long alternating. This is now the new default and the plots look quite a bit better on flops vs. bpb
2026-01-11 21:49:54 +00:00
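A sketch of how the SSSL pattern might expand over the layer stack; the concrete window sizes below are placeholders, only the 3-short-1-long alternation comes from the commit.

```python
# Expand a repeating window pattern (e.g. "SSSL") over n_layer layers.
def layer_windows(n_layer, short=1024, long=4096, pattern="SSSL"):
    return [short if pattern[i % len(pattern)] == "S" else long
            for i in range(n_layer)]

print(layer_windows(12))  # [1024, 1024, 1024, 4096, ...] for a d12 model
```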
Andrej Karpathy
2ff7d51252
integrate Flash Attention 3. +9% tok_per_sec for d12 out of the box, even with ctx as low as 2048, nice. also, we are now ready to tune window sizes, which is huge
2026-01-11 20:33:19 +00:00
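A hedged sketch of an FA3 call, assuming the flash_attn_interface entry point that ships with the Flash Attention 3 (Hopper) build; the exact signature and return value vary by version, so treat this as an assumption rather than nanochat's actual integration.

```python
# Hedged FA3 sketch; signature details may differ across FA3 builds.
import torch
from flash_attn_interface import flash_attn_func

# (batch, seqlen, heads, head_dim), bf16, on a Hopper GPU
q = k = v = torch.randn(1, 2048, 12, 64, dtype=torch.bfloat16, device="cuda")
out = flash_attn_func(q, k, v, causal=True, window_size=(1024, 0))  # sliding window
# NOTE: some builds return an (out, lse) tuple instead of just the output
```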
Andrej Karpathy
aa530cdad5
Add learnable lambdas that gate the residual connection and a skip connection to the input embeddings, solid bump to val_bpb
2026-01-11 18:47:35 +00:00
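An illustrative sketch of the two learnable scalars described, with assumed names and initializations in the spirit of modded-nanogpt: one gates the residual stream, the other mixes the original input embeddings back in at every layer.

```python
# Sketch (assumed names/inits): learnable gates on the residual stream and on
# a skip connection from the input embeddings x0.
import torch
import torch.nn as nn

class GatedBlock(nn.Module):
    def __init__(self, block):
        super().__init__()
        self.block = block
        self.resid_lambda = nn.Parameter(torch.ones(1))   # gates the residual
        self.embed_lambda = nn.Parameter(torch.zeros(1))  # gates the x0 skip

    def forward(self, x, x0):
        # x0 is the input embedding stream, re-injected at every layer
        x = self.resid_lambda * x + self.embed_lambda * x0
        return x + self.block(x)
```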
Andrej Karpathy
2c4473dd1b
Big Muon optimizer changes inspired by the latest modded-nanogpt. Added Polar Express, Adafactor-style variance reduction, cautious weight decay, and a linear schedule that ramps weight decay down to zero. Tuned the optimum weight decay for multiple model sizes (d8, d12, d16, d20) and found a scaling law with optimum wd \propto 1/channels^2, now included as the default in the code. --weight_decay of base_train is now on by default and configured optimally according to all of these experiments. Solid bump to val_bpb observed as a result of these changes.
2026-01-11 16:56:59 +00:00
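The scaling law lends itself to a one-liner default. The sketch below assumes a single fitted constant; the reference values and the channel counts are placeholders, not the tuned numbers from these experiments.

```python
# Sketch of a scaling-law default: optimum wd ~ 1/channels^2, so rescale a
# value tuned at one width to another. Reference numbers are placeholders.
def default_weight_decay(channels, ref_channels=768, ref_wd=0.1):
    return ref_wd * (ref_channels / channels) ** 2

for name, ch in [("d8", 512), ("d12", 768), ("d16", 1024), ("d20", 1280)]:
    print(name, round(default_weight_decay(ch), 4))
```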
Sofie Van Landeghem
a1ccb3dc0b
remove rust compilation as rustbpe is now installed from separate package ( #416 )
2026-01-08 06:18:37 -08:00
Andrej Karpathy
061f83c152
delete grad_clip. appears to not be necessary at all. not only was it buggy because the clipping happened per gpu before grad synchronization, it also cost ~2% MFU and didn't even help. I tried deleting it a while ago and back then it did help, so I'm guessing that some hyperparameter tuning since then has obviated the reason for it
2026-01-08 02:16:50 +00:00
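The per-GPU clipping bug is easy to demonstrate in isolation: clipping each rank's gradient before averaging is not equivalent to clipping the averaged (synchronized) gradient. A toy two-rank example:

```python
# Toy demonstration: per-rank clipping before averaging vs. clipping the
# synchronized average give different results.
import torch

def clip_to_unit_norm(g, max_norm=1.0):
    n = g.norm()
    return g * (max_norm / n) if n > max_norm else g

g_rank0 = torch.tensor([3.0, 0.0])
g_rank1 = torch.tensor([0.0, 0.5])

buggy = (clip_to_unit_norm(g_rank0) + clip_to_unit_norm(g_rank1)) / 2
correct = clip_to_unit_norm((g_rank0 + g_rank1) / 2)
print(buggy)    # tensor([0.5000, 0.2500])
print(correct)  # tensor([0.9864, 0.1644])
```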
Andrej Karpathy
e8c30c3b19
add notebook used for scaling laws analysis
2026-01-07 22:28:53 +00:00
Andrej Karpathy
54e59c38ad
add notebook on deriving the CORE estimates for the GPT-3 miniseries.
2026-01-05 18:40:28 +00:00
Andrej Karpathy
ed2082fbc4
sane secrets management
2026-01-04 19:29:22 +00:00
svlandeg
2ce62ec076
ensure consistency of quotes within each statement
2025-11-03 21:52:02 +01:00
svlandeg
e22fc6f2fa
few more explicit UTF-8 encodings
2025-11-03 21:46:39 +01:00
Andrej
b6da6982f6
fix nanochat logo: the t was placed too far to the right
2025-11-02 08:17:00 -08:00
Andrej Karpathy
cf587acb1a
move eval bundle download to be lazy and inside the python code so that we can substantially simplify the run bash scripts
2025-11-01 16:04:38 +00:00
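The lazy-download pattern described here is roughly the following; the URL, cache path, and function name are placeholders, not nanochat's actual values.

```python
# Rough shape of a lazy download: fetch the eval bundle only on first use,
# so the run scripts need no explicit download step. All values placeholder.
import os
import urllib.request

EVAL_BUNDLE_URL = "https://example.com/eval_bundle.zip"  # placeholder URL

def get_eval_bundle(cache_dir="~/.cache/nanochat"):
    cache_dir = os.path.expanduser(cache_dir)
    path = os.path.join(cache_dir, "eval_bundle.zip")
    if not os.path.exists(path):  # fetch lazily, only on first use
        os.makedirs(cache_dir, exist_ok=True)
        urllib.request.urlretrieve(EVAL_BUNDLE_URL, path)
    return path
```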
svlandeg
b996131570
Merge branch 'master' into logo/kerning-update
2025-10-29 11:45:40 +01:00
Andrej
a1de1f46ad
Merge pull request #156 from tlepoint/fix/export-base-dir
Export the base dir variable in runcpu.sh
2025-10-28 15:19:08 -07:00
svlandeg
8c9b004c99
typo fixes in scripts
2025-10-28 20:17:31 +01:00
Tancrède Lepoint
d5cda11ab8
Export the base dir variable
2025-10-22 18:15:02 -04:00
Luke Stanley
901b075605
Fix GPU-less CPU use on Linux with specific Torch indexes
2025-10-21 23:14:16 +00:00
Andrej Karpathy
94ee507054
quick fix base eval due to fewshot requirement
2025-10-21 17:56:08 +00:00
Andrej Karpathy
5bdc99abfb
merge and resolve conflict
2025-10-21 17:19:10 +00:00
Andrej Karpathy
fe5aed940b
add personality to nanochat. breaks previous code on git pull and requires download of a new file from s3, but there is a helpful error message so hopefully it's ok
2025-10-21 15:04:58 +00:00
karpathy
2e9669e03a
upgrading all other files to be able to use cpu/mps as well as cuda. various other minor changes, e.g. changing max_iterations to num_iterations in the sft script for consistency in naming
2025-10-20 10:15:17 -07:00
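The usual shape of such an upgrade is a small device-autodetection helper like the sketch below; nanochat's actual helper may differ in name and details.

```python
# Pick the best available backend: cuda, then mps (Apple Silicon), then cpu.
import torch

def autodetect_device():
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():  # Apple Silicon
        return "mps"
    return "cpu"

device = autodetect_device()
```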
obxium
938cb31f1a
Update logo
2025-10-14 14:19:44 -04:00
karpathy
a53833d04f
add nanochat logo png
2025-10-13 06:59:59 -07:00
karpathy
3a5e0bc50b
initial commit
2025-10-13 06:49:24 -07:00