Andrej Karpathy
a91743c168
Merge branch 've'
2026-01-18 15:14:39 +00:00
Andrej Karpathy
cf5c9e5b8e
resolve a crash for odd depths because FA3 needs head_dim % 8 == 0
2026-01-18 00:07:08 +00:00
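The constraint behind this crash can be sketched as a simple guard (a minimal illustration; the function name is hypothetical and nanochat's actual check may differ):

```python
# Minimal sketch: FlashAttention 3 kernels assume head_dim % 8 == 0, so odd
# depths (which can produce awkward head dims) need a guard before using FA3.
# Hypothetical helper, not the actual nanochat code.
def can_use_fa3(model_dim: int, num_heads: int) -> bool:
    head_dim = model_dim // num_heads
    return head_dim % 8 == 0
```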
Andrej Karpathy
413e91aa0f
optimal ratio is now around 4
2026-01-17 23:51:09 +00:00
karpathy
f9a7e0f111
update the CPU/MPS script to give reasonable results. The model can at least answer that Paris is the capital of France and knows that the sky is blue, after about 40 minutes of training on my macbook. Also fixed a bug caused by a KVCache bfloat16 dtype assumption
2026-01-17 12:27:30 -08:00
Andrej Karpathy
2955650327
add device detection to report a more accurate MFU for bf16
2026-01-17 03:16:14 +00:00
Nitish Pandey
f42ae9e901
fix condition to perform bpb evaluation ( #324 )
...
Co-authored-by: svlandeg <svlandeg@github.com>
2026-01-16 18:56:43 -08:00
Andrej Karpathy
8203efa919
implement a flash attention 3 fallback to pytorch sdpa, touching as few lines of code as possible in the main files and keeping the whole implementation in a single file. add tests. add helpful warning messages for the user.
2026-01-16 17:37:51 +00:00
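A hedged sketch of the fallback pattern this commit describes (the FA3 import path, return convention, and wrapper name are assumptions; nanochat's single-file implementation may differ):

```python
import torch
import torch.nn.functional as F

# Sketch of an FA3 -> SDPA fallback: try to import flash-attn 3 and, if it is
# unavailable, fall back to PyTorch's scaled_dot_product_attention with a
# warning. The package name and return convention vary across flash-attn
# versions; treat this as illustrative only.
try:
    from flash_attn_interface import flash_attn_func  # FA3 hopper interface (assumed name)
    HAS_FA3 = True
except ImportError:
    flash_attn_func = None
    HAS_FA3 = False
    print("warning: FlashAttention 3 not found, falling back to PyTorch SDPA")

def attention(q, k, v, causal=True):
    """q, k, v in (B, T, H, D) layout."""
    if HAS_FA3:
        return flash_attn_func(q, k, v, causal=causal)
    # SDPA expects (B, H, T, D), so transpose in and out
    q, k, v = (x.transpose(1, 2) for x in (q, k, v))
    y = F.scaled_dot_product_attention(q, k, v, is_causal=causal)
    return y.transpose(1, 2)
```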
Haoyu Wang
50413d2d67
typo in comments: change "GAPO" to "DAPO"
2026-01-15 22:03:42 -08:00
Sofie Van Landeghem
d4ea28d4e2
Fix args in readme ( #438 )
...
* fix commands in readme, using new arg format
* fix typo
* add required -i flag to chat_eval example runs
2026-01-15 16:26:38 -08:00
Andrej Karpathy
bdcc030ffa
oops, remove a spurious legacy line
2026-01-15 23:32:20 +00:00
Andrej Karpathy
255f8b9af6
cleanly separate cpu and gpu sections
2026-01-15 23:30:11 +00:00
Andrej Karpathy
7312ec9898
fix buggy midtrain and update all kwargs to be idiomatic: argparse uses dashes, variables use underscores. The underscores are just a remnant of the previous Configurator object. This is the right way
2026-01-13 22:45:27 +00:00
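The dash/underscore convention can be illustrated with argparse's built-in mapping (a generic example; the flag shown is illustrative, not necessarily a nanochat flag):

```python
import argparse

# Sketch of the convention: flags use dashes on the CLI, and argparse converts
# them to underscore attribute names automatically (dest mapping).
parser = argparse.ArgumentParser()
parser.add_argument("--weight-decay", type=float, default=0.0)
args = parser.parse_args(["--weight-decay", "0.1"])
# dash in the flag, underscore in the attribute:
print(args.weight_decay)
```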
Andrej Karpathy
3b50b77ed3
fix base_loss to report correct loss by switching the dataloader to the new default
2026-01-13 22:09:36 +00:00
Andrej Karpathy
43c29dd9d5
Big DataLoader refactor: BOS-aligned dataloaders with epoch tracking for pre/mid-training
...
The new DataLoader ensures that every token sequence in train/val batches has a BOS token
at the beginning. Therefore, no token streams start abruptly in the middle of a document,
which could be confusing for the model. Note that this changes the loss scale because there
are fewer confusing tokens in the train/val batches. The main downside is that we now waste
about 35% of tokens due to cropping. This is ok because we have a lot of data. See dev/LOG.md
entry for this change for a lot more information.
2026-01-13 20:05:47 +00:00
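A minimal sketch of BOS-aligned packing as described above (a hypothetical helper, not the actual DataLoader; the exact cropping and padding behavior is an assumption):

```python
def bos_aligned_rows(docs, seq_len, bos=0):
    """Pack documents into rows of length seq_len such that every row starts
    with a BOS token. When a document would straddle a row boundary, it is
    cropped at the boundary and its tail is wasted (the commit estimates ~35%
    of tokens lost this way). Sketch only; nanochat's DataLoader differs."""
    rows, row = [], []
    for doc in docs:
        assert doc[0] == bos, "every document must begin with BOS"
        if len(row) + len(doc) >= seq_len:
            # crop the current doc to fill the row exactly; its tail is wasted
            row += doc[: seq_len - len(row)]
            rows.append(row)
            row = []
        else:
            row += doc
    return rows
```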
Andrej Karpathy
21608ec51e
allow base_loss to report the loss of any arbitrary huggingface model, similar to base_eval. Had to change the dataloader to just take a tokenizer instead of loading the nanochat one; much better this way anyway
2026-01-12 03:10:13 +00:00
Andrej Karpathy
b33e394528
oops actually make SSSL the default window pattern
2026-01-11 21:50:35 +00:00
Andrej Karpathy
fbc1484e8c
add alternating window size patterns for the GPT layers, following GPT-3. Experimented a bit and found the pattern SSSL to work well - 3 short, 1 long alternating. This is now the new default and the plots look quite a bit better on flops vs. bpb
2026-01-11 21:49:54 +00:00
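The SSSL layer assignment can be sketched as below (the short/long window sizes here are placeholder values, not nanochat's actual settings):

```python
# Sketch of the alternating window pattern: "SSSL" = 3 short-window layers
# followed by 1 long-window layer, repeated over the depth of the model.
# The concrete window sizes are assumptions for illustration.
def window_sizes(n_layers, short=1024, long_=4096, pattern="SSSL"):
    sizes = {"S": short, "L": long_}
    return [sizes[pattern[i % len(pattern)]] for i in range(n_layers)]
```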
Andrej Karpathy
aa530cdad5
Add learnable lambdas that gate the residual connection and a skip connection to the input embeddings; solid bump to val_bpb
2026-01-11 18:47:35 +00:00
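An illustrative sketch of learnable residual/skip gates (module structure, names, and initialization are assumptions, not nanochat's exact implementation):

```python
import torch
import torch.nn as nn

class GatedBlock(nn.Module):
    """Toy block with learnable lambdas: one gating the residual stream, one
    gating a skip connection back to the input embeddings x0. Sketch only."""
    def __init__(self, dim):
        super().__init__()
        self.mlp = nn.Linear(dim, dim)                     # stand-in for attn/MLP
        self.resid_lambda = nn.Parameter(torch.ones(1))    # gates the residual
        self.skip_lambda = nn.Parameter(torch.zeros(1))    # gates the x0 skip

    def forward(self, x, x0):
        # x0 is the input embedding, re-injected via a learnable skip connection
        return self.resid_lambda * x + self.skip_lambda * x0 + self.mlp(x)
```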
Andrej Karpathy
2c4473dd1b
Big Muon optimizer changes inspired by the latest modded-nanogpt. Added Polar Express, Adafactor-style variance reduction, cautious weight decay, and a linear schedule that ramps weight decay down to zero. Tuned the optimum weight decay for multiple model sizes (d8, d12, d16, d20) and found a scaling law with optimum wd ∝ 1/channels^2, now included as the default in code. --weight_decay of base_train is now on by default and configured optimally according to all of these experiments. Solid bump to val_bpb observed as a result of these changes.
2026-01-11 16:56:59 +00:00
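The reported scaling law can be written down directly (the constant here is a placeholder, not the tuned value from the sweeps):

```python
# Sketch of the scaling law from the commit: optimal weight decay scales as
# 1/channels^2. The constant k is a hypothetical placeholder; only the
# functional form is taken from the commit message.
def optimal_weight_decay(channels: int, k: float = 1.0) -> float:
    return k / channels ** 2
```

Doubling the channel count should quarter the optimal weight decay, which is the property the sweeps identified.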
Andrej Karpathy
061f83c152
delete grad_clip. It appears to not be necessary at all: not only was it buggy (the clipping happened per GPU before grad synchronization), but it costs ~2% MFU and doesn't even help. I tried deleting it a while ago and back then it did help, so I'm guessing that some hyperparameter tuning has since obviated the reason for it
2026-01-08 02:16:50 +00:00
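Why per-rank clipping before gradient synchronization is wrong can be shown with a toy 1-D example: clipping is nonlinear, so clipping each rank's gradient and then averaging is not the same as averaging first and then clipping.

```python
# Toy illustration (plain scalars, not real DDP): clip(avg(g)) != avg(clip(g)).
def clip(g, max_norm):
    norm = abs(g)
    return g * min(1.0, max_norm / norm) if norm > 0 else g

g_ranks = [10.0, -8.0]                                   # per-GPU gradients
avg_then_clip = clip(sum(g_ranks) / 2, 1.0)              # correct order
clip_then_avg = sum(clip(g, 1.0) for g in g_ranks) / 2   # the buggy order
```

Here the correct order yields 1.0, while the buggy order cancels the two clipped gradients to 0.0.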
Andrej Karpathy
ccf4b7f9bf
nudge hyperparameters of the base script with the results of the sweeps and miniseries. vocab size down to 32K. D:N ratio from 20 to 8. add miniseries script
2026-01-07 22:11:59 +00:00
Adria Blancafort
1b5de29e71
Fix undefined variable in chat_rl after recent refactor
...
* Fix undefined variable
* Remove unused import
Remove unused import 're' from chat_rl.py
2026-01-07 09:08:57 -08:00
Andrej Karpathy
ae0bf52529
tune hyperparameters based on overnight sweeps. warmdown_ratio is the biggest free win, increasing 0.2 -> 0.4, and embedding lr can be larger bumping 0.2 -> 0.3
2026-01-05 18:57:46 +00:00
Andrej Karpathy
9d4c9b786d
many small fixes to base_train: reporting ETA, allowing some additional kwarg flexibility, and making sure we don't crash when e.g. depth = 11; we now calculate the closest num_heads that works
2026-01-05 00:38:09 +00:00
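One way to pick a workable num_heads is to search the divisors of the model dimension for the head count whose head_dim is closest to a target (a sketch under the assumption of a ~64 target head_dim; nanochat's actual heuristic may differ):

```python
# Hypothetical sketch: choose num_heads so that it divides model_dim and the
# resulting head_dim is as close as possible to a target (assumed 64 here).
def closest_num_heads(model_dim: int, target_head_dim: int = 64) -> int:
    candidates = [h for h in range(1, model_dim + 1) if model_dim % h == 0]
    return min(candidates, key=lambda h: abs(model_dim // h - target_head_dim))
```

For depth = 11 with 64 channels per layer of depth (model_dim = 704), this picks 11 heads of head_dim 64 instead of crashing.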
Andrej Karpathy
eb7bbc1b66
delete the configurator in favor of argparse and clean up a lot of kwarg details to make them more consistent across all scripts
2026-01-04 19:14:23 +00:00
Andrej Karpathy
48abd7d85f
simplify, clarify and slightly tune model initialization. should be very slightly better possibly, but certainly a lot clearer
2026-01-01 21:15:09 +00:00
helloaidank
389d019a0b
small change to doc string at top of tok_train.py ( #402 )
2025-12-31 12:57:26 -08:00
Andrej
088726aa7d
clean up model_tag handling across scripts a bit more.
2025-12-27 20:01:09 -08:00
Andrej Karpathy
2874eda59a
update to new os env var to get rid of deprecation warning
2025-12-28 03:32:46 +00:00
DU Wenjie
ea4229851b
bugfix
2025-12-26 19:02:12 +08:00
DU Wenjie
7840049189
bugfix: keep the same args style in scripts/base_eval.py
2025-12-26 17:29:08 +08:00
duwenjie
92c6654b95
bugfix: save and load ckpt from the model_tag dir
2025-12-21 15:07:04 +08:00
Andrej
39cccc527f
small bugfix make mid_train script work even with a tiny number of iterations
2025-12-08 18:27:32 -08:00
Andrej
8b1cecaa95
Apply suggestion from @svlandeg for nicer looking comparison
...
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2025-12-08 18:27:06 -08:00
Andrej
58f3e84e01
clean up train/val loader in sft for consistency with mid/base
2025-12-08 18:23:57 -08:00
Sanzo00
53b3a4fb81
fix: missing val_bpb on resume
2025-11-22 11:04:20 +08:00
svlandeg
4bcc3bb698
clarify comment
2025-11-21 13:19:45 +01:00
Eric Silberstein
f37d45c21f
remove unneeded iter()
2025-11-20 15:14:56 -05:00
Eric Silberstein
dddb95caac
make mid_train script work even with a tiny number of iterations
2025-11-19 15:52:20 -05:00
Andrej
4763ce612a
Small fixes to typos
2025-11-14 07:25:59 -08:00
svlandeg
a2fb3c83a6
fix typos
2025-11-14 11:20:25 +01:00
Andrej Karpathy
c6abcdfe3a
big change: add pretraining resumption logic so that checkpoints can now be approximately resumed and training can continue. this is useful for very long runs when you don't want the anxiety of your run crashing for some reason. alternatively, it's a way to recover training in the event of loss spikes. i mean, this should have been there in v0 but it's ok. the resumption is approximate to control complexity and bloat, but it's possible we want to change that in the future. to use, set --save_every to a step interval to write checkpoints with, and then use --resume_from_step to resume optimization from a given step. only base model training (pretraining) supports this atm, but it's ok because midtraining is comparably quite a bit faster.
2025-11-13 15:34:40 +00:00
Andrej Karpathy
c6b7ab7440
grad clip logging, printing, and cosmetics
2025-11-05 21:08:30 +00:00
svlandeg
2ce62ec076
ensure consistency of quotes within each statement
2025-11-03 21:52:02 +01:00
svlandeg
c72b8b2309
add explicit UTF-8 encoding
2025-11-03 21:27:12 +01:00
Dipesh Babu
226953b841
fix: open JSONL and results CSV with UTF-8 encoding for portability
2025-11-03 01:20:56 -05:00
svlandeg
52e85aaf80
Merge branch 'master' into fix/typo
2025-11-02 13:41:13 +01:00
Andrej Karpathy
cf587acb1a
move eval bundle download to be lazy and inside the python code so that we can substantially simplify the run bash scripts
2025-11-01 16:04:38 +00:00
Andrej Karpathy
7d2c4a3d95
delete pandas dep in base_eval use csv instead
2025-11-01 15:28:30 +00:00
Andrej
dfc88334b6
fix tok/sec calculation bug when grad accum steps > 1
...
Fix tok/sec metrics for base_train and mid_train when gradient accumulation is not 1
2025-10-30 08:36:32 -07:00
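The fix above amounts to counting every micro-batch in the throughput calculation (a sketch with illustrative names, not the actual base_train/mid_train code):

```python
# Sketch of the corrected throughput math: with gradient accumulation, one
# optimizer step processes grad_accum_steps micro-batches, so tokens per step
# must include that factor or tok/sec is undercounted.
def tokens_per_second(batch_size, seq_len, grad_accum_steps, step_time_s):
    tokens_per_step = batch_size * seq_len * grad_accum_steps
    return tokens_per_step / step_time_s
```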