nanochat

mirror of https://github.com/karpathy/nanochat.git synced 2026-03-24 05:43:31 +00:00

Author	SHA1	Message	Date
William Thurston	194c98a5b3	Merge upstream/master (266 commits) into fork Accept upstream's architectural changes wholesale: - argparse replaces configurator.py across all scripts - Unified MuonAdamW optimizer replaces separate AdamW + Muon - Sliding window attention (SSSL pattern) + Flash Attention 3 - Value embeddings (ResFormer-style) with per-layer gating - Per-layer learnable scalars (resid_lambdas, x0_lambdas) - FP8 training support with Float8Linear - Scaling laws (Power Lines batch sizing, T_epoch weight decay) - Checkpoint resumption with dataloader state - BOS-aligned bestfit-pad packing for SFT - ChatCORE evaluation metric - Consolidated base_loss.py into base_eval.py - Removed mid_train.py (pipeline simplified) Drops our MoE and tie_embeddings implementations in favor of upstream's cleaner architecture. These can be re-added later on top of the new codebase if needed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-22 14:50:28 -08:00
William Thurston	0c942a8c00	Add tie_embeddings support and configurable logging interval Implement weight tying between token embeddings and lm_head to reduce parameter count. When enabled, logits are scaled by 1/√d_model, lm_head zeroing is skipped, and optimizer groups are deduplicated. Param counting uses unique parameters while Chinchilla ratio calculation adds back the would-be lm_head size for comparability. Also adds boolean flag parsing (--flag without =value) to the configurator, an auto-derived log_every interval, and minor shell script fixes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-22 14:42:58 -08:00
Dipesh Babu	c7ba252142	docs: fix typos in experiment log (#547 )	2026-02-20 08:03:45 -08:00
Andrej Karpathy	2dffdc8cf6	document MoE exploration	2026-02-19 02:53:47 +00:00
Andrej Karpathy	48804bff3a	report negative result on fineweb dataset	2026-02-18 23:45:31 +00:00
Andrej Karpathy	4a6e47b0c6	update dev log with recent	2026-02-17 15:44:54 +00:00
Andrej Karpathy	e527521a3f	briefly mention batch ramp experimentation too, too weak to merge in my few attempts	2026-02-05 22:21:03 +00:00
Andrej Karpathy	5fdd5cdb24	new leaderboard record via new auto-calculated optimal batch size. for d26 it is 1M, up from 0.5M that was default earlier	2026-02-05 20:11:32 +00:00
Andrej Karpathy	f41dd3cbd7	auto-calculate optimal batch size. the original setting of 0.5M was only optimal for d12, but d26 prefers 1M and so on	2026-02-05 19:40:37 +00:00
Sofie Van Landeghem	012da1a78b	Typo fixes (#480 ) * small typo * few more small fixes * small fixes in leaderboard.md	2026-02-05 19:12:50 +01:00
Andrej Karpathy	1144d186ed	try and fail relu^2 -> swiglu	2026-02-05 02:42:46 +00:00
Andrej Karpathy	d63b7ab9ac	try and fail relu^2 -> swiglu	2026-02-05 02:41:46 +00:00
Andrej Karpathy	718e5e9d67	correctly reference NorMuon and fix misleading terms that i may have hastily ported over from modded-nanogpt	2026-02-05 01:39:26 +00:00
Andrej Karpathy	d510b1385b	quick experiments to log	2026-02-03 23:21:39 +00:00
Andrej Karpathy	16b8ac7da3	oops forgot to attach leaderboard file too	2026-02-03 21:06:12 +00:00
Andrej Karpathy	a67eba35dc	add feb2 new leaderboard record from upgrading to fp8 training, +4.3% speedup to time to GPT-2	2026-02-03 21:03:42 +00:00
Andrej Karpathy	230d6cf6c6	tune the synthetic data generation script. delete the king andrej stuff lol. also, upgrade to gemini 3	2026-02-02 01:45:59 +00:00
Andrej Karpathy	1ddaad1c1c	nuke midtraining from orbit, it's not as needed now that we have a BOS-aligned dataloader. Also change the README a lot. midtrianing is not yet fully properly erased across the board, but good enough for step 1	2026-01-31 19:12:25 +00:00
Andrej Karpathy	ebd4d9bbf5	tried muonh, appealing but didn't work out of the box	2026-01-29 19:01:36 +00:00
Andrej Karpathy	74554be3b5	revert engram, not seeing an improvement at larger scale	2026-01-28 20:07:39 +00:00
Sofie Van Landeghem	d5418ea5a1	Fix link to DeepSeek Engram paper (#470 ) * Fix link to DeepSeek Engram paper in LOG.md Updated link to the DeepSeek Engram paper in the log. * remove www	2026-01-28 08:31:44 -08:00
Andrej Karpathy	c8d93beed2	add engram-lite, add log, tune scaling laws analysis scripts	2026-01-27 22:31:17 +00:00
Andrej Karpathy	85b3e95e09	320 experiments just to tune the adam beta1 of x0 a little bit up from 0.8 to 0.96	2026-01-25 00:04:02 +00:00
Andrej Karpathy	63bb5831e2	something i've wanted to do for a while - move all .sh runs to their own directory so they don't pollute root dir	2026-01-18 15:27:41 +00:00
Andrej Karpathy	d58fcd9d73	log for jan 17	2026-01-18 03:01:17 +00:00
karpathy	f9a7e0f111	update the CPU/MPS script to give reasonable results. The model can at least answer that Paris is the capital of France and knows that the sky is blue, for about 40 minutes of training on my macbook. Also fixed a bug that existed due to KVCache bfloat16 dtype assumption	2026-01-17 12:27:30 -08:00
Yury Kirpichev	77a46902e4	Fix WANDB_RUN parameter passing in runcpu.sh (#407 ) - Add --run=$WANDB_RUN to base_train, mid_train, and chat_sft calls - Ensures wandb logging works when WANDB_RUN environment variable is set - Matches the behavior in speedrun.sh Co-authored-by: svlandeg <svlandeg@github.com>	2026-01-16 18:59:44 -08:00
Andrej Karpathy	1933e85046	brief update to log	2026-01-17 00:25:50 +00:00
Andrej Karpathy	184d4c12b1	also add to log about the FA3 changes	2026-01-16 18:25:04 +00:00
Andrej Karpathy	fbf2bbea25	update log with a bunch of attempts	2026-01-16 02:21:17 +00:00
Andrej Karpathy	747ed4491f	add negative result on olmo3 pretraining mix	2026-01-16 00:44:01 +00:00
Andrej Karpathy	7312ec9898	fix buggy midtrain and update all kwargs to be idiomatic. that is, argparse uses dashes variables use underscores. the underscores are just a remnant of the previous Configurator object. This is the right way	2026-01-13 22:45:27 +00:00
Andrej Karpathy	f92efce169	add negative result about not allowing attention across BOS tokens. A lot more code complexity for basically no gain in performance	2026-01-13 21:33:54 +00:00
Andrej Karpathy	43c29dd9d5	Big DataLoader refactor: BOS-aligned dataloaders with epoch tracking for pre/mid-training The new DataLoader ensures that every token sequence in train/val batches has a BOS token at the beginning. Therefore, no token streams start abruptly in the middle of a document, which could be confusing for the model. Note that this changes the loss scale because there are fewer confusing tokens in the train/val batches. The main downside is that we now waste about 35% of tokens due to cropping. This is ok because we have a lot of data. See dev/LOG.md entry for this change for a lot more information.	2026-01-13 20:05:47 +00:00
Andrej Karpathy	64b48d0e5c	validated that \p{N}{1,2} is the correct number of digits to group up to in the regex pattern of the GPT-4 tokenizer (2 down from 3), leading to the best val_bpb for 32K vocabs	2026-01-13 17:45:06 +00:00
Andrej Karpathy	238353c998	document my struggle with fp8 integration yesterday, it's not working like i thought it would and i suffered. one day i will return to continue the fight.	2026-01-13 17:14:29 +00:00
Andrej Karpathy	4610a838a1	record negative result on MTP	2026-01-12 05:23:47 +00:00
Andrej Karpathy	fbc1484e8c	add alternating window size patterns for the GPT layers, following GPT-3. Experimented a bit and found the pattern SSSL to work well - 3 short, 1 long alternating. This is now the new default and the plots look quite a bit better on flops vs. bpb	2026-01-11 21:49:54 +00:00
Andrej Karpathy	2ff7d51252	integrate Flash Attention 3. +9% tok_per_sec for d12 with ctx even as low as 2048 out of the box nice. also, ready to tune windows huge	2026-01-11 20:33:19 +00:00
Andrej Karpathy	aa530cdad5	Add learnable lambdas that gate the residual connection and a skip connection to the input embeddings, solid bump to val_bpb	2026-01-11 18:47:35 +00:00
Andrej Karpathy	2c4473dd1b	Big Muon optimizer changes inspired by latest of modded-nanogpt. Added Polar Express, Adafactor-style variance reduction, cautious weight decay, schedule weight decay linearly to ramp down to zero. Tuned optimum weight decay for multiple model sizes d8, d12, d16, d20 and found a scaling law with optimum wd \propto 1/channels^2, including it as default into code. --weight_decay of base_train is now default on and configured optimally according to all of these experiments. Solid bump to val_bpb observed as a result of these changes.	2026-01-11 16:56:59 +00:00
Sofie Van Landeghem	a1ccb3dc0b	remove rust compilation as rustbpe is now installed from separate package (#416 )	2026-01-08 06:18:37 -08:00
Andrej Karpathy	061f83c152	delete grad_clip. appears to not be necessary at all. not only was it buggy because the clipping happened per gpu before grad synchronization, but it costs ~2% MFU, and it also doesn't even help. I tried deleting it a while ago and back then it did help. So I'm guessing that some hyperparameter tuning obviated the reason for it since then	2026-01-08 02:16:50 +00:00
Andrej Karpathy	e8c30c3b19	add notebook used for scaling laws analysis	2026-01-07 22:28:53 +00:00
Andrej Karpathy	54e59c38ad	add notebook on deriving the CORE estimates for the GPT-3 miniseries.	2026-01-05 18:40:28 +00:00
Andrej Karpathy	ed2082fbc4	sane secrets management	2026-01-04 19:29:22 +00:00
William Thurston	76227f70d3	Add MOE debug interval and logging for gradient statistics - Introduced `MOE_DEBUG_INTERVAL` parameter in `runmps.sh` to control debug logging frequency during training. - Enhanced `base_train.py` to log gradients of routed and shared weights at specified intervals, aiding in monitoring model performance. - Updated `gpt.py` to adjust router bias calculations, improving load balancing among experts. - Added unit tests in `test_moe.py` to validate the behavior of the MoE implementation and ensure correctness of gradient calculations.	2025-11-13 16:22:20 -08:00
William Thurston	25d2573f47	Add MoE configuration and implementation in training scripts and model architecture - Introduced parameters for Mixture of Experts (MoE) in `runmps.sh`, `base_train.py`, and `gpt.py`, allowing for dynamic configuration of experts during training. - Enhanced `gpt.py` with new classes `MoEFeedForward` and `ExpertFFN` to implement MoE functionality in the model architecture. - Updated `configurator.py` to handle type conversions for new MoE parameters. - Improved logging in `base_train.py` to include MoE-related metrics and configurations during training. - Added assertions and derived defaults for MoE parameters to ensure valid configurations. - Implemented methods to estimate and log FLOPs for both dense and MoE active configurations during training. - Enhanced gradient handling in `muon.py` to accommodate potential absence of gradients for unused experts.	2025-11-11 19:58:38 -08:00
William Thurston	9550053cc1	Enhance model tagging support in training and evaluation scripts - Added model tagging functionality to `runmps.sh`, allowing for dynamic model tagging based on the W&B run name. - Updated `base_train.py`, `mid_train.py`, and `chat_sft.py` to utilize model tags for checkpoint management. - Enhanced `base_eval.py` to accept model tags for loading models during evaluation. - Improved handling of model tags to ensure proper checkpoint directory naming and logging.	2025-11-10 19:45:02 -08:00
William Thurston	8a6d34daf7	Add kv_head_mult parameter for training and evaluation scripts - Introduced `kv_head_mult` to control the number of query heads sharing a key/value head in `base_train.py`, `mid_train.py`, and `runmps.sh`. - Updated logging to include global token per second metrics during training. - Added assertions to ensure `kv_head_mult` is valid and properly integrated into model calculations.	2025-11-09 14:23:45 -08:00

1 2

67 Commits