Commit Graph

102 Commits

Author SHA1 Message Date
Unsal Gokdag
c3f234cfca CORE eval: GPU-resident data, continuous pipeline, per-task progress bars
three independent improvements to the cached CORE evaluation path:

   1. GPU-resident data: all base-4 collated batches (~144MB for full CORE eval)
      are moved to GPU upfront via .to(device). eliminates all CPU→GPU transfers
      from the forward loop. _forward_all_cached replaces double-buffered prefetch
      with a simple upfront bulk transfer — .to() is a no-op when the caller has
      already preloaded tensors to GPU (as bench_core_eval now does).

   2. continuous cross-task pipeline: _forward_all_cached flattens all tasks'
      batches into one stream. the last batch of task N flows directly into the
      first batch of task N+1 with no pipeline restart. GPU-side composition via
      merge (pad+cat for bs > base) and split (row-slice for bs < base) avoids
      the CPU-side compose_collated bottleneck that made bs=8 slower than bs=4.

   3. progress bars + per-task result printing: both cached and first-run paths
      in evaluate_model now show a tqdm progress bar with the current task label.
      on_task_done callback in _forward_all_cached prints each task's accuracy
      as soon as its last batch is processed (single-GPU). DDP falls back to
      printing after all_reduce. both paths print total elapsed time at the end.

   bench_core_eval: preloads ALL base-4 batches to GPU once before the batch-size
   sweep. all sweep iterations compose from GPU-resident tensors with zero
   CPU→GPU transfers in the hot loop.
2026-02-13 07:54:53 +00:00
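The merge/split composition this commit describes can be sketched at the shape level in plain Python. These are hypothetical helpers standing in for the real tensor code (which does pad+cat and row-slicing on GPU-resident tensors); batches are lists of token rows, `PAD` is an assumed padding id:

```python
PAD = 0

def merge(batches):
    """Combine consecutive base batches: right-pad rows to the longest, then cat."""
    max_len = max(len(row) for b in batches for row in b)
    out = []
    for b in batches:
        for row in b:
            out.append(row + [PAD] * (max_len - len(row)))  # right-pad shorter rows
    return out

def split(batch, rows_per_piece):
    """Slice a merged batch back into smaller batches, trimming all-pad tail columns."""
    pieces = []
    for i in range(0, len(batch), rows_per_piece):
        piece = batch[i:i + rows_per_piece]
        # find the widest non-pad extent across rows (the vectorized analogue
        # in the commit is non_pad.any(dim=0))
        width = 0
        for row in piece:
            for j, tok in enumerate(row):
                if tok != PAD:
                    width = max(width, j + 1)
        pieces.append([row[:width] for row in piece])
    return pieces
```

Because examples are sorted by sequence length, consecutive base batches have similar lengths, so the padding added by `merge` stays small.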
Unsal Gokdag
7fa30f5ee3 CORE eval: disk-cached tokenized batches, double-buffered GPU transfers, batch composition, benchmark improvements
the main idea: tokenization + collation for CORE eval only needs to happen once per tokenizer.
      collated batches at base batch_size=4 are saved to disk (core_token_cache/), keyed by SHA-256
      of the tokenizer file. any batch_size can be served from these base-4 batches: larger sizes merge
      consecutive batches (right-pad shorter ones, cat along dim=0), smaller sizes split along example
      boundaries (trim trailing padding). this means prepare_task_data is truly a one-time cost.

      core_eval.py:
      - double-buffered CPU->GPU transfers in both forward paths (_forward_batches and evaluate_task's
        pipelined path). while GPU runs forward_model on batch N, batch N+1 is pin_memory()'d and
        DMA-transferred via non_blocking=True. the DMA engine and GPU compute units are separate
        hardware so they overlap. previously GPU idled during every transfer.
      - compose_collated(): merge base batches for larger batch_size (cat after right-padding to
        max_len), or split for smaller batch_size (slice along row boundaries from batch_meta,
        trim trailing padding via vectorized non_pad.any(dim=0)). works because examples are sorted
        by seq_len, so consecutive base batches have monotonically increasing lengths.
      - evaluate_task and _forward_batches accept optional pbar for progress tracking.

      base_eval.py:
      - evaluate_model now has 3-tier caching: in-memory (_batch_cache, across calls within same
        process), disk load (core_token_cache/, on first call when in-memory is empty), disk save
        (after first run's prepare+collate+forward, writes collated batches so future training runs
        and the benchmark skip tokenization entirely). keyed by tokenizer file hash + max_per_task.

      bench_core_eval.py:
      - cached sweep no longer re-runs the full first-run sweep to build collated data (was 2x the
        work for no reason). instead loads/builds base-4 cache once, then compose_collated serves
        any target batch_size. cached sweep only varies batch_size (no queue_size — no collation thread).
      - --skip-first: skip the first-run sweep entirely if disk cache exists. if cache is missing,
        runs a single bs=4 eval in minimal time to create it, then proceeds to cached sweep.
      - tqdm progress bars everywhere: old sequential baseline (per-example with task name),
        first-run sweep (double bar: outer=combo progress, inner=per-example), cache building
        (per-task), cached sweep (double bar). task names left-padded to max label length so the
        bar doesn't shift.
      - tokenizer identity via file_checksum (SHA-256 of tokenizer.pkl/tokenizer.json on disk),
        not encode-output hashing. HF models fall back to hashing the repo name.
2026-02-12 22:34:23 +00:00
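The disk-cache idea from this commit can be sketched as follows. All names here are hypothetical stand-ins for the real code in base_eval.py/core_eval.py: collated batches are pickled under a key derived from the SHA-256 of the tokenizer file plus `max_per_task`, so any later run with the same tokenizer skips tokenization entirely:

```python
import hashlib, os, pickle

def file_checksum(path: str) -> str:
    """SHA-256 of a file on disk, read in 1MB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def cache_path(cache_dir: str, tokenizer_file: str, max_per_task: int) -> str:
    key = f"{file_checksum(tokenizer_file)}-{max_per_task}"
    return os.path.join(cache_dir, key + ".pkl")

def load_or_build(cache_dir, tokenizer_file, max_per_task, build_fn):
    path = cache_path(cache_dir, tokenizer_file, max_per_task)
    if os.path.exists(path):                 # disk hit: skip tokenize + collate
        with open(path, "rb") as f:
            return pickle.load(f)
    batches = build_fn()                     # first run: pay the one-time cost
    os.makedirs(cache_dir, exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(batches, f)              # future runs load this instead
    return batches
```

Hashing the tokenizer file itself (rather than encode outputs) makes the key cheap and deterministic across processes.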
unsalgokdag
8695280566 speed up CORE metric evaluation: batched GPU forward passes, threaded CPU prep, cross-call caching. first eval pipelines tokenization on a background thread while GPU processes the previous batch. second+ evals skip tokenization and collation entirely, only GPU forward passes remain. Also adds a benchmark script to sweep batch_size and queue_size hyperparameters. 2026-02-12 18:13:56 +01:00
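The pipelining described here — CPU prep of batch N+1 overlapping GPU work on batch N — follows a standard bounded producer/consumer pattern. A minimal stdlib sketch (hypothetical names; the real code's "GPU forward" is replaced by an arbitrary `forward` callable):

```python
import queue, threading

def pipelined_eval(prepare_batch, forward, num_batches, queue_size=2):
    """Run prepare_batch(i) on a background thread while the main thread
    consumes batches; the bounded queue provides backpressure."""
    q = queue.Queue(maxsize=queue_size)

    def producer():
        for i in range(num_batches):
            q.put(prepare_batch(i))  # CPU prep overlaps with the consumer
        q.put(None)                  # sentinel: no more batches

    threading.Thread(target=producer, daemon=True).start()
    results = []
    while (batch := q.get()) is not None:
        results.append(forward(batch))  # "GPU" work on the already-prepared batch
    return results
```

`queue_size` is exactly the hyperparameter the benchmark script sweeps: too small and the consumer stalls, too large and prep runs far ahead for no benefit.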
Andrej Karpathy
2f09686724 clarify that this is bf16 mfu we're talking about 2026-02-10 23:35:00 +00:00
Andrej Karpathy
e569b59f92 delete torchao dependency, create our own exact API-matched version of Float8Linear, document it very well. for some poorly understood reason, the performance is not only ~identical but actually runs 3% faster, despite it being significantly simpler and much less code. i don't fully understand why/how atm 2026-02-10 18:46:39 +00:00

Andrej Karpathy
aeff095e97 better comments/flow on all the hyperparameter transfer stuff, and change the WD scaling from my empirical 1/d^2 to a bit more principled version based on Tepoch. All of that theory is based on AdamW and could be suboptimal for Muon 2026-02-06 19:22:28 +00:00
Andrej Karpathy
2c062aaa94 nit: don't mutate args, create new var for total_batch_size 2026-02-05 19:59:46 +00:00
Andrej Karpathy
f41dd3cbd7 auto-calculate optimal batch size. the original setting of 0.5M was only optimal for d12, but d26 prefers 1M and so on 2026-02-05 19:40:37 +00:00
Andrej Karpathy
6079f78fc3 add fp8 training with torchao 2026-02-03 21:03:42 +00:00
Andrej Karpathy
8ebc14b348 small touchups to the eval script, re-order items etc, cosmetic 2026-02-03 21:03:42 +00:00
Sofie Van Landeghem
72b9064f9d
remove leftover mid references (#491) 2026-02-02 08:33:46 -08:00
Andrej Karpathy
07c4dd4cd9 manually control the over-active garbage collector, save a small few minutes from a typical run 2026-02-02 01:44:30 +00:00
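Manually controlling the garbage collector, as this commit does, usually means disabling the automatic collector so it never pauses a training step at an arbitrary moment, then collecting explicitly on a schedule. A minimal sketch (the real code's details may differ):

```python
import gc

def train_with_manual_gc(step_fn, num_steps, gc_every=1000):
    gc.disable()                 # stop automatic generational collection
    try:
        for step in range(num_steps):
            step_fn(step)
            if step % gc_every == 0:
                gc.collect()     # pay the pause at a moment we choose
    finally:
        gc.enable()              # restore normal behavior afterwards
```

Spread over a long run, removing the collector's unpredictable pauses from the hot loop is where the "small few minutes" come from.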
Andrej Karpathy
8b4849d548 fix bug in chat_sft, the attention window must be preserved sigh 2026-02-01 20:58:44 +00:00
Andrej Karpathy
eaf49a33c8 fix path which i think was modified during the refactor and this is a bug introduced by claude i believe 2026-02-01 20:15:19 +00:00
Andrej Karpathy
31b61d2d17 fix broken import sigh 2026-02-01 05:03:44 +00:00
Andrej Karpathy
0307997f9b merge two files base_loss and base_eval into a single file, it's nicer this way, and unify the huggingface code associated with both 2026-02-01 02:36:43 +00:00
Andrej Karpathy
1ddaad1c1c nuke midtraining from orbit, it's not as needed now that we have a BOS-aligned dataloader. Also change the README a lot. midtraining is not yet fully erased across the board, but good enough for step 1 2026-01-31 19:12:25 +00:00
Andrej Karpathy
348fbb301b fix dataloader for midtrain to never crop data. we can't just throw it away like we do in pretraining 2026-01-31 18:21:36 +00:00
Andrej Karpathy
3c3a3d7042 warmdown of 0.5 is slightly better: 2026-01-31 01:08:44 +00:00
Aarushi Singh
ace6740bdd
feat: allow top_k=0 in web api to disable filtering (#458)
* allow top_k=0 in web api to disable filtering

* adding a comment for clear reasoning

* adding change to docstring
2026-01-30 09:21:41 -08:00
Andrej Karpathy
41bb2eac32 Combine AdamW and Muon into single MuonAdamW optimizer, cleaner, ty @chrisjmccormick for idea/help 2026-01-29 00:52:08 +00:00
Andrej Karpathy
c88bbf8133 Merge branch 'engram' 2026-01-27 22:33:16 +00:00
Andrej Karpathy
c8d93beed2 add engram-lite, add log, tune scaling laws analysis scripts 2026-01-27 22:31:17 +00:00
Andrej Karpathy
8630d32be4 quick fix to not OOM main speedrun script 2026-01-26 22:31:42 +00:00
Andrej Karpathy
59e36cc727 first version of engram following modded nanogpt style 2026-01-25 18:59:51 +00:00
Andrej Karpathy
a91743c168 Merge branch 've' 2026-01-18 15:14:39 +00:00
Andrej Karpathy
cf5c9e5b8e resolve a crash for odd depths because FA3 needs head_dim % 8 == 0 2026-01-18 00:07:08 +00:00
Andrej Karpathy
413e91aa0f optimal ratio is now around 4 2026-01-17 23:51:09 +00:00
karpathy
f9a7e0f111 update the CPU/MPS script to give reasonable results. The model can at least answer that Paris is the capital of France and knows that the sky is blue, for about 40 minutes of training on my macbook. Also fixed a bug that existed due to KVCache bfloat16 dtype assumption 2026-01-17 12:27:30 -08:00
Andrej Karpathy
2955650327 add detection of device to report more correct mfu for bf16 2026-01-17 03:16:14 +00:00
Nitish Pandey
f42ae9e901
fix condition to perform bpb evaluation (#324)
Co-authored-by: svlandeg <svlandeg@github.com>
2026-01-16 18:56:43 -08:00
Andrej Karpathy
8203efa919 implement flash attention 3 fallback to pytorch sdpa by touching as few lines of code as possible in main files and keeping all implementation to a single file. add tests. add helpful warning messages for the user. 2026-01-16 17:37:51 +00:00
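The fallback this commit implements is the classic import-guard pattern: try the fast kernel at import time, and if it is unavailable, return a baseline implementation and warn the user once, keeping callers oblivious. A sketch with hypothetical module/function names (the FA3 import path is an assumption, and the fallback body is a stub rather than a real sdpa call):

```python
import warnings

def pick_attention():
    try:
        from flash_attn_interface import flash_attn_func  # FA3, if installed
        return flash_attn_func
    except ImportError:
        warnings.warn("flash attention 3 unavailable, falling back to pytorch sdpa")
        def sdpa_fallback(q, k, v):
            # stand-in: the real fallback would call
            # torch.nn.functional.scaled_dot_product_attention(q, k, v)
            raise NotImplementedError("wire up torch sdpa here")
        return sdpa_fallback
```

Resolving the choice once and handing back a callable keeps the implementation to a single file, as the commit intends.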
Haoyu Wang
50413d2d67
typo in comments: change "GAPO" to "DAPO" 2026-01-15 22:03:42 -08:00
Sofie Van Landeghem
d4ea28d4e2
Fix args in readme (#438)
* fix commands in readme, using new arg format

* fix typo

* add required -i flag to chat_eval example runs
2026-01-15 16:26:38 -08:00
Andrej Karpathy
bdcc030ffa oops legacy spurious line now 2026-01-15 23:32:20 +00:00
Andrej Karpathy
255f8b9af6 cleanly separate cpu and gpu sections 2026-01-15 23:30:11 +00:00
Andrej Karpathy
7312ec9898 fix buggy midtrain and update all kwargs to be idiomatic. that is, argparse uses dashes, variables use underscores. the underscores are just a remnant of the previous Configurator object. This is the right way 2026-01-13 22:45:27 +00:00
Andrej Karpathy
3b50b77ed3 fix base_loss to report correct loss by switching the dataloader to the new default 2026-01-13 22:09:36 +00:00
Andrej Karpathy
43c29dd9d5 Big DataLoader refactor: BOS-aligned dataloaders with epoch tracking for pre/mid-training
The new DataLoader ensures that every token sequence in train/val batches has a BOS token
at the beginning. Therefore, no token streams start abruptly in the middle of a document,
which could be confusing for the model. Note that this changes the loss scale because there
are fewer confusing tokens in the train/val batches. The main downside is that we now waste
about 35% of tokens due to cropping. This is ok because we have a lot of data. See dev/LOG.md
entry for this change for a lot more information.
2026-01-13 20:05:47 +00:00
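A toy sketch of the BOS-aligned packing idea (hypothetical function; see dev/LOG.md for the real design): every row starts at a document boundary, a document that would run past the end of the current row is cropped, and the cropped tail is exactly the token waste the commit mentions. `BOS`/`PAD` values here are placeholders:

```python
BOS, PAD = -1, 0

def pack_bos_aligned(docs, row_len):
    """Pack token documents into fixed-length rows that always begin with BOS.
    Overflowing documents are cropped (tail discarded), never wrapped into the
    middle of the next row."""
    rows, row, wasted = [], [], 0
    for doc in docs:
        tokens = [BOS] + doc
        if len(row) + len(tokens) <= row_len:
            row += tokens                      # fits: keep packing this row
        else:
            space = row_len - len(row)
            row += tokens[:space]              # crop the document to fill the row
            wasted += len(tokens) - space      # discarded tail = the ~35% waste
            rows.append(row)
            row = []                           # next doc starts a fresh row
        if len(row) == row_len:                # row exactly full: close it out
            rows.append(row)
            row = []
    if row:
        rows.append(row + [PAD] * (row_len - len(row)))
    return rows, wasted
```

Since no row ever begins mid-document, the model never sees a sequence that "starts abruptly", at the cost of the discarded cropped tokens.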
Andrej Karpathy
21608ec51e allow base_loss to report the loss of any arbitrary huggingface model similar to base_eval. had to change dataloader to be a lot better and just take tokenizer, not load the nanochat one. much better this way anyway 2026-01-12 03:10:13 +00:00
Andrej Karpathy
b33e394528 oops actually make SSSL the default window pattern 2026-01-11 21:50:35 +00:00
Andrej Karpathy
fbc1484e8c add alternating window size patterns for the GPT layers, following GPT-3. Experimented a bit and found the pattern SSSL to work well - 3 short, 1 long alternating. This is now the new default and the plots look quite a bit better on flops vs. bpb 2026-01-11 21:49:54 +00:00
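The SSSL pattern can be sketched as a per-layer window-size schedule. This is a hypothetical helper with made-up window sizes (1024/4096) and a made-up convention that the final layer always gets the long window; only the repeating 3-short/1-long pattern comes from the commit:

```python
def layer_windows(num_layers, short=1024, long=4096, pattern="SSSL"):
    """Assign each layer a window size by cycling the pattern over depth."""
    sizes = [short if c == "S" else long for c in pattern]
    out = [sizes[i % len(sizes)] for i in range(num_layers)]
    out[-1] = long  # assumption: end the stack on a long-window layer
    return out
```

Short-window layers cut attention FLOPs while the periodic long-window layers preserve long-range information flow, which is where the flops-vs-bpb win comes from.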
Andrej Karpathy
aa530cdad5 Add learnable lambdas that gate the residual connection and a skip connection to the input embeddings, solid bump to val_bpb 2026-01-11 18:47:35 +00:00
Andrej Karpathy
2c4473dd1b Big Muon optimizer changes inspired by latest of modded-nanogpt. Added Polar Express, Adafactor-style variance reduction, cautious weight decay, schedule weight decay linearly to ramp down to zero. Tuned optimum weight decay for multiple model sizes d8, d12, d16, d20 and found a scaling law with optimum wd \propto 1/channels^2, including it as default into code. --weight_decay of base_train is now default on and configured optimally according to all of these experiments. Solid bump to val_bpb observed as a result of these changes. 2026-01-11 16:56:59 +00:00
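The wd ∝ 1/channels² scaling law mentioned here means one measured optimum is enough to set a default for every model size. A toy illustration with hypothetical reference constants (the actual anchor point is not stated in the log):

```python
def default_weight_decay(channels, ref_channels=768, ref_wd=0.1):
    """Scale a reference weight decay by the stated 1/channels^2 law.
    ref_channels/ref_wd are illustrative, not the repo's tuned values."""
    return ref_wd * (ref_channels / channels) ** 2
```

Doubling the channel count quarters the default weight decay under this law.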
Andrej Karpathy
061f83c152 delete grad_clip. appears to not be necessary at all. not only was it buggy because the clipping happened per gpu before grad synchronization, but it costs ~2% MFU, and it also doesn't even help. I tried deleting it a while ago and back then it did help. So I'm guessing that some hyperparameter tuning obviated the reason for it since then 2026-01-08 02:16:50 +00:00
Andrej Karpathy
ccf4b7f9bf nudge hyperparameters of the base script with the results of the sweeps and miniseries. vocab size down to 32K. D:N ratio from 20 to 8. add miniseries script 2026-01-07 22:11:59 +00:00
Adria Blancafort
1b5de29e71
Fix undefined variable in chat_rl after recent refactor
* Fix undefined variable

* Remove unused import

Remove unused import 're' from chat_rl.py
2026-01-07 09:08:57 -08:00
Andrej Karpathy
ae0bf52529 tune hyperparameters based on overnight sweeps. warmdown_ratio is the biggest free win, increasing 0.2 -> 0.4, and embedding lr can be larger bumping 0.2 -> 0.3 2026-01-05 18:57:46 +00:00
Andrej Karpathy
9d4c9b786d many small fixes to base_train: reporting ETA, allowing some additional kwarg flexibility, making sure we don't crash when e.g. depth = 11 - we now calculate the closest num_heads that works 2026-01-05 00:38:09 +00:00
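The "closest num_heads that works" calculation can be sketched as a small search. This is a hypothetical version of the check: a valid head count must divide the model dimension evenly, and (per the FA3 constraint noted elsewhere in this log) the resulting head_dim should be a multiple of 8. It assumes at least one such divisor exists:

```python
def closest_num_heads(model_dim, target_heads):
    """Pick the valid head count nearest the target. Valid means: heads divide
    model_dim evenly and head_dim (= model_dim // heads) is a multiple of 8."""
    candidates = [h for h in range(1, model_dim + 1)
                  if model_dim % h == 0 and (model_dim // h) % 8 == 0]
    return min(candidates, key=lambda h: abs(h - target_heads))
```

Snapping to the nearest valid configuration instead of crashing lets odd depths like 11 run without manual tuning.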
Andrej Karpathy
eb7bbc1b66 delete the configurator in favor of argparse and clean up a lot of kwarg details to make them more consistent across all scripts 2026-01-04 19:14:23 +00:00