Commit Graph

325 Commits

Unsal Gokdag
c3f234cfca CORE eval: GPU-resident data, continuous pipeline, per-task progress bars
three independent improvements to the cached CORE evaluation path:

   1. GPU-resident data: all base-4 collated batches (~144MB for full CORE eval)
      are moved to GPU upfront via .to(device). eliminates all CPU→GPU transfers
      from the forward loop. _forward_all_cached replaces double-buffered prefetch
      with a simple upfront bulk transfer — .to() is a no-op when the caller has
      already preloaded tensors to GPU (as bench_core_eval now does).

   2. continuous cross-task pipeline: _forward_all_cached flattens all tasks'
      batches into one stream. the last batch of task N flows directly into the
      first batch of task N+1 with no pipeline restart. GPU-side composition via
      merge (pad+cat for bs > base) and split (row-slice for bs < base) avoids
      the CPU-side compose_collated bottleneck that made bs=8 slower than bs=4.

   3. progress bars + per-task result printing: both cached and first-run paths
      in evaluate_model now show a tqdm progress bar with the current task label.
      on_task_done callback in _forward_all_cached prints each task's accuracy
      as soon as its last batch is processed (single-GPU). DDP falls back to
      printing after all_reduce. both paths print total elapsed time at the end.
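The flattened cross-task stream with a per-task completion callback can be sketched in pure Python (lists stand in for GPU-resident tensor batches; `forward_all_cached`, `run_forward`, and the callback signature here are illustrative, not the actual nanochat API):

```python
def forward_all_cached(tasks, run_forward, on_task_done=None):
    # flatten all tasks' batches into one continuous stream, tagging each
    # task's final batch so the pipeline never restarts between tasks
    stream = []
    for name, batches in tasks:
        for i, batch in enumerate(batches):
            stream.append((name, batch, i == len(batches) - 1))
    results = {name: [] for name, _ in tasks}
    for name, batch, is_last in stream:
        results[name].append(run_forward(batch))
        if is_last and on_task_done is not None:
            # fires as soon as this task's last batch is processed,
            # e.g. to print its accuracy immediately on a single GPU
            on_task_done(name, results[name])
    return results
```

The key property is that the loop body is identical across task boundaries, so the last batch of task N is followed immediately by the first batch of task N+1.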

   bench_core_eval: preloads ALL base-4 batches to GPU once before the batch-size
   sweep. all sweep iterations compose from GPU-resident tensors with zero
   CPU→GPU transfers in the hot loop.
2026-02-13 07:54:53 +00:00
Unsal Gokdag
7fa30f5ee3 CORE eval: disk-cached tokenized batches, double-buffered GPU transfers, batch composition, benchmark improvements
the main idea: tokenization + collation for CORE eval only needs to happen once per tokenizer.
      collated batches at base batch_size=4 are saved to disk (core_token_cache/), keyed by SHA-256
      of the tokenizer file. any batch_size can be served from these base-4 batches: larger sizes merge
      consecutive batches (right-pad shorter ones, cat along dim=0), smaller sizes split along example
      boundaries (trim trailing padding). this means prepare_task_data is truly a one-time cost.

      core_eval.py:
      - double-buffered CPU->GPU transfers in both forward paths (_forward_batches and evaluate_task's
        pipelined path). while GPU runs forward_model on batch N, batch N+1 is pin_memory()'d and
        DMA-transferred via non_blocking=True. the DMA engine and GPU compute units are separate
        hardware so they overlap. previously GPU idled during every transfer.
      - compose_collated(): merge base batches for larger batch_size (cat after right-padding to
        max_len), or split for smaller batch_size (slice along row boundaries from batch_meta,
        trim trailing padding via vectorized non_pad.any(dim=0)). works because examples are sorted
        by seq_len, so consecutive base batches have monotonically increasing lengths.
      - evaluate_task and _forward_batches accept optional pbar for progress tracking.
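The merge/split idea behind compose_collated can be sketched with plain lists of token rows standing in for collated tensors (pad id, function names, and the list representation are illustrative; the real code does tensor pad+cat and row slicing):

```python
PAD = 0  # hypothetical pad token id

def merge(batch_a, batch_b):
    # merge two base batches into one larger batch: right-pad every row
    # to the longer max_len, then concatenate along the batch dimension
    max_len = max(len(batch_a[0]), len(batch_b[0]))
    pad_rows = lambda batch: [row + [PAD] * (max_len - len(row)) for row in batch]
    return pad_rows(batch_a) + pad_rows(batch_b)

def split(batch, sizes):
    # split a batch along example boundaries, trimming trailing columns
    # that are all-padding in each sub-batch
    out, start = [], 0
    for n in sizes:
        rows = batch[start:start + n]
        start += n
        # last column index that holds a real token in any row
        last = max(max((i for i, t in enumerate(row) if t != PAD), default=-1)
                   for row in rows)
        out.append([row[:last + 1] for row in rows])
    return out
```

Because examples are sorted by seq_len, merging consecutive base batches only ever pads the earlier (shorter) one, and splitting recovers the originals exactly.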

      base_eval.py:
      - evaluate_model now has 3-tier caching: in-memory (_batch_cache, across calls within same
        process), disk load (core_token_cache/, on first call when in-memory is empty), disk save
        (after first run's prepare+collate+forward, writes collated batches so future training runs
        and the benchmark skip tokenization entirely). keyed by tokenizer file hash + max_per_task.

      bench_core_eval.py:
      - cached sweep no longer re-runs the full first-run sweep to build collated data (was 2x the
        work for no reason). instead loads/builds base-4 cache once, then compose_collated serves
        any target batch_size. cached sweep only varies batch_size (no queue_size — no collation thread).
      - --skip-first: skip the first-run sweep entirely if disk cache exists. if cache is missing,
        runs a single bs=4 eval in minimal time to create it, then proceeds to cached sweep.
      - tqdm progress bars everywhere: old sequential baseline (per-example with task name),
        first-run sweep (double bar: outer=combo progress, inner=per-example), cache building
        (per-task), cached sweep (double bar). task names left-padded to max label length so the
        bar doesn't shift.
      - tokenizer identity via file_checksum (SHA-256 of tokenizer.pkl/tokenizer.json on disk),
        not encode-output hashing. HF models fall back to hashing the repo name.
2026-02-12 22:34:23 +00:00
unsalgokdag
8695280566 speed up CORE metric evaluation: batched GPU forward passes, threaded CPU prep, cross-call caching. first eval pipelines tokenization on a background thread while GPU processes the previous batch. second+ evals skip tokenization and collation entirely, only GPU forward passes remain. Also adds a benchmark script to sweep batch_size and queue_size hyperparameters. 2026-02-12 18:13:56 +01:00
Andrej Karpathy
2f09686724 clarify that this is bf16 mfu we're talking about 2026-02-10 23:35:00 +00:00
Andrej Karpathy
e569b59f92 delete torchao dependency, create our own exact API-matched version of Float8Linear, document it very well. for some poorly understood reason, the performance is not only ~identical but actually runs 3% faster, despite it being significantly simpler and much less code. i don't fully understand why/how atm 2026-02-10 18:46:39 +00:00
Andrej Karpathy
1ec0a34779 at 28 and above we start to need batch size 8 2026-02-08 18:26:34 +00:00
Andrej Karpathy
ff46300720 tune miniseries just a bit, fairly cosmetic, keep to even depths where the math works out nicely in model sizing 2026-02-08 17:54:12 +00:00
Andrej Karpathy
aeff095e97 better comments/flow on all the hyperparameter transfer stuff, and change the WD scaling from my empirical 1/d^2 to a bit more principled version based on Tepoch. All of that theory is based on AdamW and could be suboptimal for Muon 2026-02-06 19:22:28 +00:00
Andrej Karpathy
685271dc8d new optimal ratio for d26 training 2026-02-06 19:21:27 +00:00
Andrej Karpathy
e527521a3f briefly mention batch ramp experimentation too, too weak to merge in my few attempts 2026-02-05 22:21:03 +00:00
Andrej Karpathy
96522798f1 docs docs docs 2026-02-05 20:27:07 +00:00
Andrej Karpathy
5fdd5cdb24 new leaderboard record via new auto-calculated optimal batch size. for d26 it is 1M, up from 0.5M that was default earlier 2026-02-05 20:11:32 +00:00
Andrej Karpathy
2c062aaa94 nit: don't mutate args, create new var for total_batch_size 2026-02-05 19:59:46 +00:00
Andrej Karpathy
f41dd3cbd7 auto-calculate optimal batch size. the original setting of 0.5M was only optimal for d12, but d26 prefers 1M and so on 2026-02-05 19:40:37 +00:00
Andrej Karpathy
98eed6df18 bring back an assert guarding against bad param sizing 2026-02-05 18:14:30 +00:00
Sofie Van Landeghem
012da1a78b
Typo fixes (#480)
* small typo

* few more small fixes

* small fixes in leaderboard.md
2026-02-05 19:12:50 +01:00
Andrej Karpathy
75b302f331 fix hash commit on leaderboard and a paragraph clarification 2026-02-05 16:14:28 +00:00
Andrej Karpathy
1144d186ed try and fail relu^2 -> swiglu 2026-02-05 02:42:46 +00:00
Andrej Karpathy
d63b7ab9ac try and fail relu^2 -> swiglu 2026-02-05 02:41:46 +00:00
Andrej Karpathy
718e5e9d67 correctly reference NorMuon and fix misleading terms that i may have hastily ported over from modded-nanogpt 2026-02-05 01:39:26 +00:00
Andrej Karpathy
542beb0c8c bump speedrun to be the up to date leaderboard run 2026-02-04 02:12:04 +00:00
Andrej Karpathy
d510b1385b quick experiments to log 2026-02-03 23:21:39 +00:00
Andrej Karpathy
16b8ac7da3 oops forgot to attach leaderboard file too 2026-02-03 21:06:12 +00:00
Andrej Karpathy
fe55b092b8 minor cosmetics for the table 2026-02-03 21:05:28 +00:00
Andrej Karpathy
a67eba35dc add feb2 new leaderboard record from upgrading to fp8 training, +4.3% speedup to time to GPT-2 2026-02-03 21:03:42 +00:00
Andrej Karpathy
6079f78fc3 add fp8 training with torchao 2026-02-03 21:03:42 +00:00
Andrej Karpathy
8ebc14b348 small touchups to the eval script, re-order items etc, cosmetic 2026-02-03 21:03:42 +00:00
Sofie Van Landeghem
72b9064f9d
remove leftover mid references (#491) 2026-02-02 08:33:46 -08:00
Andrej Karpathy
b19b4f3e49 fix bug in speedrun script, batch size that doesn't OOM on 8XH100 for d24 is 16 2026-02-02 15:50:14 +00:00
Andrej Karpathy
230d6cf6c6 tune the synthetic data generation script. delete the king andrej stuff lol. also, upgrade to gemini 3 2026-02-02 01:45:59 +00:00
Andrej Karpathy
07c4dd4cd9 manually control the over-active garbage collector, save a small few minutes from a typical run 2026-02-02 01:44:30 +00:00
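Manual GC control typically looks like the sketch below (the step interval and function names are illustrative): disable automatic collection so pauses never land mid-step, then collect at chosen points.

```python
import gc

def gc_controlled_steps(num_steps, step_fn, collect_every=100):
    # disable the automatic garbage collector for the duration of the run
    gc.disable()
    try:
        for step in range(num_steps):
            step_fn(step)
            if (step + 1) % collect_every == 0:
                gc.collect()  # pay the pause when we choose, not mid-step
    finally:
        gc.enable()  # restore normal behavior afterwards
```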
Andrej Karpathy
e8fec97d4c slightly more efficient dataloader that reduces the number of python objects flying around and causing strain on runtime and garbage collector 2026-02-02 01:17:30 +00:00
Andrej Karpathy
8b4849d548 fix bug in chat_sft, the attention window must be preserved sigh 2026-02-01 20:58:44 +00:00
Andrej Karpathy
eaf49a33c8 fix path which i think was modified during the refactor and this is a bug introduced by claude i believe 2026-02-01 20:15:19 +00:00
Andrej Karpathy
31b61d2d17 fix broken import sigh 2026-02-01 05:03:44 +00:00
Sofie Van Landeghem
4d6415b8ef
use _PEAK_FLOPS_TABLE instead of if-else structure (#479) 2026-01-31 19:45:06 -08:00
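The table-instead-of-if/else pattern from #479 might look like this (the entries and substring matching here are illustrative, not the repo's actual table or values):

```python
# map a GPU-name substring to peak bf16 TFLOPS; a dict lookup replaces a
# chain of if/elif branches and makes unknown devices an explicit error
_PEAK_FLOPS_TABLE = {
    "H100": 989e12,  # illustrative figures; verify against vendor specs
    "A100": 312e12,
}

def peak_flops(device_name):
    for key, flops in _PEAK_FLOPS_TABLE.items():
        if key in device_name:
            return flops
    raise ValueError(f"unknown device: {device_name}")
```

Adding support for a new GPU becomes a one-line table entry rather than another branch.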
Sofie Van Landeghem
43078c347e
clean up original tokenizing_distributed_data_loader (#478) 2026-01-31 19:44:12 -08:00
Franci Penov
dc291c627f
Add Blackwell (SM100) GPU support via SDPA fallback (#475) 2026-01-31 19:42:58 -08:00
Andrej Karpathy
0307997f9b merge two files base_loss and base_eval into a single file, it's nicer this way, and unify the huggingface code associated with both 2026-02-01 02:36:43 +00:00
Andrej Karpathy
1ddaad1c1c nuke midtraining from orbit, it's not as needed now that we have a BOS-aligned dataloader. Also change the README a lot. midtraining is not yet fully properly erased across the board, but good enough for step 1 2026-01-31 19:12:25 +00:00
Andrej Karpathy
348fbb301b fix dataloader for midtrain to never crop data. we can't just throw it away like we do in pretraining 2026-01-31 18:21:36 +00:00
Andrej Karpathy
3c3a3d7042 warmdown of 0.5 is slightly better: 2026-01-31 01:08:44 +00:00
Andrei Panferov
4d8dbaf6e0
Fix escape character in README bibtex entry (#454) 2026-01-30 09:34:02 -08:00
Andrej Karpathy
3ba42e8135 Fix SDPA KV-cache decode to respect sliding window (#456)
SDPA fallback now respects sliding window during single-token KV-cache
decode by slicing K/V to the last (window + 1) tokens.

Also simplifies the mask building for chunk inference to properly apply
sliding window in that path as well.

Fixes #452

Co-Authored-By: Kartik Vashishta <kartikv776@gmail.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-30 17:32:12 +00:00
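The K/V slicing from that fix reduces to keeping the last (window + 1) cached positions, sketched here with lists standing in for cache tensors (`sliced_kv` is an illustrative name, not the repo's):

```python
def sliced_kv(k_cache, v_cache, window):
    # during single-token decode, attention within a sliding window only
    # needs the last `window` cached positions plus the new token itself
    keep = window + 1
    return k_cache[-keep:], v_cache[-keep:]
```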
Aarushi Singh
ace6740bdd
feat: allow top_k=0 in web api to disable filtering (#458)
* allow top_k=0 in web api to disable filtering

* adding a comment for clear reasoning

* adding change to docstring
2026-01-30 09:21:41 -08:00
Harsh Gupta
2e17723817
Fix generate() crash when top_k=0 (#467)
Prevent a crash in generate() by skipping top-k filtering when top_k is set to 0
2026-01-30 09:21:02 -08:00
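The two top_k=0 commits above amount to treating zero as "filtering disabled", roughly like this sketch over plain floats (`filter_logits` is an illustrative name; the real code operates on tensors):

```python
def filter_logits(logits, top_k):
    # top_k == 0 (or None) means "no filtering": return logits unchanged
    # instead of attempting a zero-sized top-k selection, which crashes
    if top_k is None or top_k == 0:
        return list(logits)
    k = min(top_k, len(logits))
    threshold = sorted(logits, reverse=True)[k - 1]
    # mask everything below the k-th largest logit
    return [x if x >= threshold else float("-inf") for x in logits]
```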
Andrej Karpathy
02baa15405 i am feeling in a delete mood today. i need to delete a lot of code. there is too much code and surface area and complexity. ew 2026-01-30 17:08:53 +00:00
Andrej Karpathy
d6c4f3b923 i think this is the new torch 2.9+ API for declaring tf32 preference 2026-01-30 17:03:15 +00:00
Andrej Karpathy
067daa7758 small fix cpu script ty PR #474 2026-01-30 02:11:25 +00:00
Andrej Karpathy
6a341f2ecf contiguous views and single HtoD transfer for inputs/targets much cleaner 2026-01-30 00:23:01 +00:00