Commit Graph

325 Commits

Author SHA1 Message Date
haltingstate
df77f21819
Merge c4a183dfef into e569b59f92 2026-02-10 14:41:19 -05:00
Andrej Karpathy
e569b59f92 delete torchao dependency, create our own exact API-matched version of Float8Linear, document it very well. for some poorly understood reason, the performance is not only ~identical but actually runs 3% faster, despite it being significantly simpler and much less code. i don't fully understand why/how atm 2026-02-10 18:46:39 +00:00
haltingstate
c4a183dfef Move memory cleanup settings to configurable eval_config
Extract the hardcoded memory-cleanup interval (100 → 256) and the enable
flags into eval_config.py for better maintainability and tuning flexibility.

Changes:

1. Created nanochat/eval_config.py:
   - CACHE_CLEANUP_INTERVAL = 256 (changed from hardcoded 100)
   - ENABLE_PERIODIC_CLEANUP = True (allows disabling cleanup)
   - ENABLE_FINAL_CLEANUP = True (allows skipping final cleanup)
   - Documented rationale for 256: balances overhead vs fragmentation

2. Updated nanochat/core_eval.py:
   - Import eval_config module
   - Use eval_config.CACHE_CLEANUP_INTERVAL instead of hardcoded 100
   - Check eval_config.ENABLE_PERIODIC_CLEANUP flag before cleanup
   - Check eval_config.ENABLE_FINAL_CLEANUP flag for final cleanup

Rationale for 256 vs 100:
- Power of 2 (efficient modulo operation)
- Lower overhead: HellaSwag 10,000 examples: 39 cleanups (~2s) vs 100 cleanups (~5s)
- Still frequent enough to prevent fragmentation
- For MMLU (100-1000 examples): 0-4 cleanups (negligible impact)

Benefits:
- Centralizes tuning parameters in one location
- Allows easy experimentation with cleanup intervals
- Can disable cleanup for debugging/profiling
- Documents tradeoffs in config comments
- No magic numbers in evaluation code

Related: Previous commit a7066b8 (hellaswag memory leak fix)
2026-02-09 14:37:59 +08:00
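The config described in this commit could look roughly like the sketch below. The constant and flag names come from the commit message itself; the `should_cleanup` helper is hypothetical, illustrating how core_eval.py is described to consume the settings (the real file may differ):

```python
# Sketch of the described nanochat/eval_config.py; illustrative only.

# Cleanup every 256 examples. Power of 2 makes the modulo cheap, and over
# 10,000 HellaSwag examples this yields ~39 cleanups instead of 100 at the
# old interval, while still bounding allocator fragmentation.
CACHE_CLEANUP_INTERVAL = 256

ENABLE_PERIODIC_CLEANUP = True  # set False to disable cleanup for debugging/profiling
ENABLE_FINAL_CLEANUP = True     # set False to skip the one-time cleanup after a task

def should_cleanup(example_index: int) -> bool:
    """Hypothetical consumer-side check, mirroring the described usage
    in core_eval.py's evaluation loop."""
    return (ENABLE_PERIODIC_CLEANUP
            and example_index > 0
            and example_index % CACHE_CLEANUP_INTERVAL == 0)
```

With these settings, a 10,000-example run triggers 39 periodic cleanups, matching the overhead estimate in the commit message.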
haltingstate
a7066b8483 Fix hellaswag memory leak and progressive slowdown (Issue #427)
ROOT CAUSE:
GPU tensors (outputs, losses, predictions, input_ids) were not explicitly
freed after use, causing memory fragmentation and progressive slowdown. Each
forward pass creates a ~411MB output logits tensor that lingers in memory
until Python GC triggers. Over 10,000+ HellaSwag examples, this accumulates
4.4GB of tensors and exhausts the available headroom on 32GB unified-memory systems.

SYMPTOMS:
- Progressive slowdown: Example 0: 2.5s → Example 300: 6.2s (+148%)
- Unbounded memory growth: 20-50MB per 100 examples
- Mac Studio (32GB) crashes with OOM after 8000-9000 examples
- HellaSwag-specific (10,000 examples vs MMLU: 100-1000)

MECHANISM:
1. PyTorch caching allocator fragments memory over time
2. Allocator performance degrades (O(1) → O(N) search for free blocks)
3. Python GC lazy, doesn't free promptly
4. No explicit cleanup: no torch.cuda.empty_cache(), no gc.collect()
5. Memory fragmentation + accumulated tensors = progressive slowdown

FIXES IMPLEMENTED:

1. forward_model (lines 166-168): Explicit tensor cleanup
   - Added: del outputs, del target_ids
   - Impact: Frees ~411MB output logits + 16KB target_ids per call
   - outputs tensor: batch_size × seq_len × vocab_size float32
     = 4 choices × 512 tokens × 50,257 vocab × 4 bytes = 411MB

2. evaluate_example (lines 246-247): Cleanup after result extraction
   - Added: del losses, predictions, input_ids
   - Impact: Frees tensors immediately after .item() extracts scalar
   - Prevents retention until function returns

3. evaluate_task (lines 262-283): Periodic cache cleanup
   - Added: gc.collect() + torch.cuda.empty_cache() every 100 examples
   - Impact: Resets allocator state, prevents fragmentation accumulation
   - Small cost: ~10-50ms per 100 examples
   - Final cleanup after task completes (line 287-289)

EXPECTED IMPROVEMENT:
- Memory growth: <100MB total (vs unbounded before)
- Slowdown: <5% variation (vs 400%+ before)
- Completion: HellaSwag completes in ~7-8 hours without OOM
- Timing: Constant 2.5-2.6s per example throughout evaluation

TESTING:
Before deploying to production, verify:
- MMLU accuracy unchanged (within 0.5% of baseline)
- Memory growth <100MB over 1000 examples
- Time per example: last 100 within 10% of first 100
- HellaSwag completes without OOM crash

WHY HELLASWAG AFFECTED:
- 10,000+ examples (vs MMLU: 100-1000, GSM8K: 1319, HumanEval: 164)
- 4 forward passes per example (multiple choice)
- Runs 8.3 hours (vs MMLU: 40 min)
- More time for fragmentation to accumulate
- MMLU completes before memory pressure becomes severe

TECHNICAL DETAILS:
- @torch.no_grad() prevents gradient graphs, not tensor allocation
- del only drops the Python reference; memory is freed once the refcount hits zero (or GC breaks reference cycles)
- torch.cuda.empty_cache() releases unused cached blocks from the allocator back to the GPU driver
- gc.collect() forces immediate garbage collection (slow but thorough)

Fixes: Issue #427 (hellaswag memory leak and progressive slowdown)
Related: kcg-llm task-47.fix-hellaswag-memory-leak-progressive-slowdown.pending
Analysis: kcg-llm/b1.tasks/task-47*/task-47.10-memory-leak-analysis.txt
2026-02-09 14:34:05 +08:00
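The three fixes combine into one loop pattern: extract scalars, drop tensor references, and periodically reset the allocator. The sketch below is illustrative, not the actual nanochat code; `evaluate_task` and the cleanup interval mirror the commit's description, while the model call and argmax are stand-ins for the real multiple-choice scoring:

```python
import gc
import torch

@torch.no_grad()  # prevents gradient graphs (but not tensor allocation)
def evaluate_task(model, examples, cleanup_interval=100):
    """Illustrative sketch of the periodic-cleanup pattern from this fix."""
    results = []
    for i, ex in enumerate(examples):
        outputs = model(ex)               # large logits tensor (~411MB in the real eval)
        result = outputs.argmax().item()  # pull out a Python scalar first...
        del outputs                       # ...then drop the GPU tensor reference
        results.append(result)
        if i > 0 and i % cleanup_interval == 0:
            gc.collect()                      # force collection of lingering objects
            if torch.cuda.is_available():
                torch.cuda.empty_cache()      # release cached blocks, reset fragmentation
    # final cleanup once the task completes
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    return results
```

The cost of each cleanup (~10-50ms) is negligible next to a 2.5s forward pass, which is why the fix trades a small constant overhead for bounded memory growth.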
haltingstate
143dc98c76 Add MPS device detection and memory monitoring
Add is_mps_device() and should_use_torch_compile() to nanochat/common.py
Disable torch.compile on macOS MPS devices (prevents indefinite hanging)
Add conditional torch.compile in base_train.py and chat_sft.py
Add memory monitoring with 32GB inference / 96GB training limits

Reference: Task-20, Task-18, Task-19, Task-28, Task-39
2026-02-09 13:19:00 +08:00
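The two helpers named in this commit could be sketched as below. This is a guess at their shape, not the actual nanochat/common.py; only the function names and the MPS/torch.compile behavior come from the commit message:

```python
import torch

def is_mps_device(device) -> bool:
    """Sketch: True if the given device (string or torch.device) is Apple MPS."""
    return torch.device(device).type == "mps"

def should_use_torch_compile(device) -> bool:
    # torch.compile can hang indefinitely on macOS MPS backends,
    # so compilation is skipped there and used everywhere else.
    return not is_mps_device(device)
```

Call sites in base_train.py and chat_sft.py would then wrap the model conditionally, e.g. `model = torch.compile(model) if should_use_torch_compile(device) else model`.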
Andrej Karpathy
1ec0a34779 at 28 and above we start to need batch size 8 2026-02-08 18:26:34 +00:00
Andrej Karpathy
ff46300720 tune miniseries just a bit, fairly cosmetic, keep to even depths where the math works out nicely in model sizing 2026-02-08 17:54:12 +00:00
Andrej Karpathy
aeff095e97 better comments/flow on all the hyperparameter transfer stuff, and change the WD scaling from my empirical 1/d^2 to a bit more principled version based on Tepoch. All of that theory is based on AdamW and could be suboptimal for Muon 2026-02-06 19:22:28 +00:00
Andrej Karpathy
685271dc8d new optimal ratio for d26 training 2026-02-06 19:21:27 +00:00
Andrej Karpathy
e527521a3f briefly mention batch ramp experimentation too, too weak to merge in my few attempts 2026-02-05 22:21:03 +00:00
Andrej Karpathy
96522798f1 docs docs docs 2026-02-05 20:27:07 +00:00
Andrej Karpathy
5fdd5cdb24 new leaderboard record via new auto-calculated optimal batch size. for d26 it is 1M, up from 0.5M that was default earlier 2026-02-05 20:11:32 +00:00
Andrej Karpathy
2c062aaa94 nit: don't mutate args, create new var for total_batch_size 2026-02-05 19:59:46 +00:00
Andrej Karpathy
f41dd3cbd7 auto-calculate optimal batch size. the original setting of 0.5M was only optimal for d12, but d26 prefers 1M and so on 2026-02-05 19:40:37 +00:00
Andrej Karpathy
98eed6df18 bring back an assert guarding against bad param sizing 2026-02-05 18:14:30 +00:00
Sofie Van Landeghem
012da1a78b
Typo fixes (#480)
* small typo

* few more small fixes

* small fixes in leaderboard.md
2026-02-05 19:12:50 +01:00
Andrej Karpathy
75b302f331 fix hash commit on leaderboard and a paragraph clarification 2026-02-05 16:14:28 +00:00
Andrej Karpathy
1144d186ed try and fail relu^2 -> swiglu 2026-02-05 02:42:46 +00:00
Andrej Karpathy
d63b7ab9ac try and fail relu^2 -> swiglu 2026-02-05 02:41:46 +00:00
Andrej Karpathy
718e5e9d67 correctly reference NorMuon and fix misleading terms that i may have hastily ported over from modded-nanogpt 2026-02-05 01:39:26 +00:00
Andrej Karpathy
542beb0c8c bump speedrun to be the up to date leaderboard run 2026-02-04 02:12:04 +00:00
Andrej Karpathy
d510b1385b quick experiments to log 2026-02-03 23:21:39 +00:00
Andrej Karpathy
16b8ac7da3 oops forgot to attach leaderboard file too 2026-02-03 21:06:12 +00:00
Andrej Karpathy
fe55b092b8 minor cosmetics for the table 2026-02-03 21:05:28 +00:00
Andrej Karpathy
a67eba35dc add feb2 new leaderboard record from upgrading to fp8 training, +4.3% speedup to time to GPT-2 2026-02-03 21:03:42 +00:00
Andrej Karpathy
6079f78fc3 add fp8 training with torchao 2026-02-03 21:03:42 +00:00
Andrej Karpathy
8ebc14b348 small touchups to the eval script, re-order items etc, cosmetic 2026-02-03 21:03:42 +00:00
Sofie Van Landeghem
72b9064f9d
remove leftover mid references (#491) 2026-02-02 08:33:46 -08:00
Andrej Karpathy
b19b4f3e49 fix bug in speedrun script, batch size that doesn't OOM on 8XH100 for d24 is 16 2026-02-02 15:50:14 +00:00
Andrej Karpathy
230d6cf6c6 tune the synthetic data generation script. delete the king andrej stuff lol. also, upgrade to gemini 3 2026-02-02 01:45:59 +00:00
Andrej Karpathy
07c4dd4cd9 manually control the over-active garbage collector, save a small few minutes from a typical run 2026-02-02 01:44:30 +00:00
Andrej Karpathy
e8fec97d4c slightly more efficient dataloader that reduces the number of python objects flying around and causing strain on runtime and garbage collector 2026-02-02 01:17:30 +00:00
Andrej Karpathy
8b4849d548 fix bug in chat_sft, the attention window must be preserved sigh 2026-02-01 20:58:44 +00:00
Andrej Karpathy
eaf49a33c8 fix path which i think was modified during the refactor and this is a bug introduced by claude i believe 2026-02-01 20:15:19 +00:00
Andrej Karpathy
31b61d2d17 fix broken import sigh 2026-02-01 05:03:44 +00:00
Sofie Van Landeghem
4d6415b8ef
use _PEAK_FLOPS_TABLE instead of if-else structure (#479) 2026-01-31 19:45:06 -08:00
Sofie Van Landeghem
43078c347e
clean up original tokenizing_distributed_data_loader (#478) 2026-01-31 19:44:12 -08:00
Franci Penov
dc291c627f
Add Blackwell (SM100) GPU support via SDPA fallback (#475) 2026-01-31 19:42:58 -08:00
Andrej Karpathy
0307997f9b merge two files base_loss and base_eval into a single file, it's nicer this way, and unify the huggingface code associated with both 2026-02-01 02:36:43 +00:00
Andrej Karpathy
1ddaad1c1c nuke midtraining from orbit, it's not as needed now that we have a BOS-aligned dataloader. Also change the README a lot. midtraining is not yet fully properly erased across the board, but good enough for step 1 2026-01-31 19:12:25 +00:00
Andrej Karpathy
348fbb301b fix dataloader for midtrain to never crop data. we can't just throw it away like we do in pretraining 2026-01-31 18:21:36 +00:00
Andrej Karpathy
3c3a3d7042 warmdown of 0.5 is slightly better: 2026-01-31 01:08:44 +00:00
Andrei Panferov
4d8dbaf6e0
Fix escape character in README bibtex entry (#454) 2026-01-30 09:34:02 -08:00
Andrej Karpathy
3ba42e8135 Fix SDPA KV-cache decode to respect sliding window (#456)
SDPA fallback now respects sliding window during single-token KV-cache
decode by slicing K/V to the last (window + 1) tokens.

Also simplifies the mask building for chunk inference to properly apply
sliding window in that path as well.

Fixes #452

Co-Authored-By: Kartik Vashishta <kartikv776@gmail.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-30 17:32:12 +00:00
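The core of the fix, slicing the KV cache before calling SDPA during single-token decode, might look like this sketch (shapes and the function name are assumptions; only the "slice K/V to the last window + 1 tokens" behavior comes from the commit message):

```python
import torch
import torch.nn.functional as F

def decode_step_sdpa(q, k_cache, v_cache, window: int):
    """Illustrative sketch: single-token KV-cache decode under a sliding
    window. q is (B, H, 1, D); caches are (B, H, T, D). Slicing the cache
    to the last (window + 1) positions keeps attention inside the window."""
    k = k_cache[:, :, -(window + 1):, :]
    v = v_cache[:, :, -(window + 1):, :]
    # with a single query token, every remaining key is a valid (past)
    # position, so no causal mask is needed after the slice
    return F.scaled_dot_product_attention(q, k, v)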
Aarushi Singh
ace6740bdd
feat: allow top_k=0 in web api to disable filtering (#458)
* allow top_k=0 in web api to disable filtering

* adding a comment for clear reasoning

* adding change to docstring
2026-01-30 09:21:41 -08:00
Harsh Gupta
2e17723817
Fix generate() crash when top_k=0 (#467)
Prevent a crash in generate() by skipping top-k filtering when top_k is set to 0
2026-01-30 09:21:02 -08:00
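Taken together, these two commits make `top_k=0` mean "no filtering". A minimal sampling sketch of that behavior (function name and filtering details are assumptions, not the actual nanochat code):

```python
import torch

def sample_next(logits, top_k: int = 50):
    """Sketch: sample a next-token index from logits of shape (B, V).
    top_k=0 skips top-k filtering entirely (sampling from the full
    distribution) instead of attempting top-k with k=0 and crashing."""
    if top_k > 0:
        k = min(top_k, logits.size(-1))
        vals, _ = torch.topk(logits, k)
        # mask out everything below the k-th largest logit
        logits = logits.masked_fill(logits < vals[..., -1:], float("-inf"))
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```

Guarding on `top_k > 0` handles both commits at once: the web API can pass 0 to disable filtering, and generate() no longer hits the degenerate k=0 case.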
Andrej Karpathy
02baa15405 i am feeling in a delete mood today. i need to delete a lot of code. there is too much code and surface area and complexity. ew 2026-01-30 17:08:53 +00:00
Andrej Karpathy
d6c4f3b923 i think this is the new torch 2.9+ API for declaring tf32 preference 2026-01-30 17:03:15 +00:00
Andrej Karpathy
067daa7758 small fix cpu script ty PR #474 2026-01-30 02:11:25 +00:00
Andrej Karpathy
6a341f2ecf contiguous views and single HtoD transfer for inputs/targets much cleaner 2026-01-30 00:23:01 +00:00