Commit Graph

325 Commits

Author SHA1 Message Date
haltingstate
df77f21819
Merge c4a183dfef into e569b59f92 2026-02-10 14:41:19 -05:00
Andrej Karpathy
e569b59f92 delete torchao dependency, create our own exact API-matched version of Float8Linear, document it very well. for some poorly understood reason, the performance is not only ~identical but actually runs 3% faster, despite it being significantly simpler and much less code. i don't fully understand why/how atm 2026-02-10 18:46:39 +00:00
haltingstate
c4a183dfef Move memory cleanup settings to configurable eval_config
Extract the hardcoded memory-cleanup interval (100 → 256) and the enable
flags into eval_config.py for better maintainability and tuning flexibility.

Changes:

1. Created nanochat/eval_config.py:
   - CACHE_CLEANUP_INTERVAL = 256 (changed from hardcoded 100)
   - ENABLE_PERIODIC_CLEANUP = True (allows disabling cleanup)
   - ENABLE_FINAL_CLEANUP = True (allows skipping final cleanup)
   - Documented rationale for 256: balances overhead vs fragmentation

2. Updated nanochat/core_eval.py:
   - Import eval_config module
   - Use eval_config.CACHE_CLEANUP_INTERVAL instead of hardcoded 100
   - Check eval_config.ENABLE_PERIODIC_CLEANUP flag before cleanup
   - Check eval_config.ENABLE_FINAL_CLEANUP flag for final cleanup

Rationale for 256 vs 100:
- Power of 2 (efficient modulo operation)
- Lower overhead: HellaSwag 10,000 examples: 39 cleanups (~2s) vs 100 cleanups (~5s)
- Still frequent enough to prevent fragmentation
- For MMLU (100-1000 examples): 0-4 cleanups (negligible impact)

Benefits:
- Centralizes tuning parameters in one location
- Allows easy experimentation with cleanup intervals
- Can disable cleanup for debugging/profiling
- Documents tradeoffs in config comments
- No magic numbers in evaluation code

Related: Previous commit a7066b8 (hellaswag memory leak fix)
2026-02-09 14:37:59 +08:00
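The config described in this commit could look roughly like the sketch below. The constant and flag names come from the commit message itself; the `should_cleanup` helper is hypothetical, illustrating how core_eval.py is described to consume the settings (the real file may differ):

```python
# Sketch of the described nanochat/eval_config.py; illustrative only.

# Cleanup every 256 examples. Power of 2 makes the modulo cheap, and over
# 10,000 HellaSwag examples this yields ~39 cleanups instead of 100 at the
# old interval, while still bounding allocator fragmentation.
CACHE_CLEANUP_INTERVAL = 256

ENABLE_PERIODIC_CLEANUP = True  # set False to disable cleanup for debugging/profiling
ENABLE_FINAL_CLEANUP = True     # set False to skip the one-time cleanup after a task

def should_cleanup(example_index: int) -> bool:
    """Hypothetical consumer-side check, mirroring the described usage
    in core_eval.py's evaluation loop."""
    return (ENABLE_PERIODIC_CLEANUP
            and example_index > 0
            and example_index % CACHE_CLEANUP_INTERVAL == 0)
```

With these settings, a 10,000-example run triggers 39 periodic cleanups, matching the overhead estimate in the commit message.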
haltingstate
a7066b8483 Fix hellaswag memory leak and progressive slowdown (Issue #427)
ROOT CAUSE:
GPU tensors (outputs, losses, predictions, input_ids) were not explicitly
freed after use, causing memory fragmentation and progressive slowdown. Each
forward pass creates a ~411MB output logits tensor that lingers in memory
until Python GC triggers. Over 10,000+ HellaSwag examples, this accumulates
4.4GB of tensors and exhausts the available headroom on 32GB unified-memory systems.

SYMPTOMS:
- Progressive slowdown: Example 0: 2.5s → Example 300: 6.2s (+148%)
- Unbounded memory growth: 20-50MB per 100 examples
- Mac Studio (32GB) crashes with OOM after 8000-9000 examples
- HellaSwag-specific (10,000 examples vs MMLU: 100-1000)

MECHANISM:
1. PyTorch caching allocator fragments memory over time
2. Allocator performance degrades (O(1) → O(N) search for free blocks)
3. Python GC lazy, doesn't free promptly
4. No explicit cleanup: no torch.cuda.empty_cache(), no gc.collect()
5. Memory fragmentation + accumulated tensors = progressive slowdown

FIXES IMPLEMENTED:

1. forward_model (lines 166-168): Explicit tensor cleanup
   - Added: del outputs, del target_ids
   - Impact: Frees ~411MB output logits + 16KB target_ids per call
   - outputs tensor: batch_size × seq_len × vocab_size float32
     = 4 choices × 512 tokens × 50,257 vocab × 4 bytes = 411MB

2. evaluate_example (lines 246-247): Cleanup after result extraction
   - Added: del losses, predictions, input_ids
   - Impact: Frees tensors immediately after .item() extracts scalar
   - Prevents retention until function returns

3. evaluate_task (lines 262-283): Periodic cache cleanup
   - Added: gc.collect() + torch.cuda.empty_cache() every 100 examples
   - Impact: Resets allocator state, prevents fragmentation accumulation
   - Small cost: ~10-50ms per 100 examples
   - Final cleanup after task completes (line 287-289)

EXPECTED IMPROVEMENT:
- Memory growth: <100MB total (vs unbounded before)
- Slowdown: <5% variation (vs 400%+ before)
- Completion: HellaSwag completes in ~7-8 hours without OOM
- Timing: Constant 2.5-2.6s per example throughout evaluation

TESTING:
Before deploying to production, verify:
- MMLU accuracy unchanged (within 0.5% of baseline)
- Memory growth <100MB over 1000 examples
- Time per example: last 100 within 10% of first 100
- HellaSwag completes without OOM crash

WHY HELLASWAG AFFECTED:
- 10,000+ examples (vs MMLU: 100-1000, GSM8K: 1319, HumanEval: 164)
- 4 forward passes per example (multiple choice)
- Runs 8.3 hours (vs MMLU: 40 min)
- More time for fragmentation to accumulate
- MMLU completes before memory pressure becomes severe

TECHNICAL DETAILS:
- @torch.no_grad() prevents gradient graphs, not tensor allocation
- del only drops the Python reference; memory is freed once the refcount hits zero (or GC breaks reference cycles)
- torch.cuda.empty_cache() releases unused cached blocks from the allocator back to the GPU driver
- gc.collect() forces immediate garbage collection (slow but thorough)

Fixes: Issue #427 (hellaswag memory leak and progressive slowdown)
Related: kcg-llm task-47.fix-hellaswag-memory-leak-progressive-slowdown.pending
Analysis: kcg-llm/b1.tasks/task-47*/task-47.10-memory-leak-analysis.txt
2026-02-09 14:34:05 +08:00
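The three fixes combine into one loop pattern: extract scalars, drop tensor references, and periodically reset the allocator. The sketch below is illustrative, not the actual nanochat code; `evaluate_task` and the cleanup interval mirror the commit's description, while the model call and argmax are stand-ins for the real multiple-choice scoring:

```python
import gc
import torch

@torch.no_grad()  # prevents gradient graphs (but not tensor allocation)
def evaluate_task(model, examples, cleanup_interval=100):
    """Illustrative sketch of the periodic-cleanup pattern from this fix."""
    results = []
    for i, ex in enumerate(examples):
        outputs = model(ex)               # large logits tensor (~411MB in the real eval)
        result = outputs.argmax().item()  # pull out a Python scalar first...
        del outputs                       # ...then drop the GPU tensor reference
        results.append(result)
        if i > 0 and i % cleanup_interval == 0:
            gc.collect()                      # force collection of lingering objects
            if torch.cuda.is_available():
                torch.cuda.empty_cache()      # release cached blocks, reset fragmentation
    # final cleanup once the task completes
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    return results
```

The cost of each cleanup (~10-50ms) is negligible next to a 2.5s forward pass, which is why the fix trades a small constant overhead for bounded memory growth.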
haltingstate
143dc98c76 Add MPS device detection and memory monitoring
Add is_mps_device() and should_use_torch_compile() to nanochat/common.py
Disable torch.compile on macOS MPS devices (prevents indefinite hanging)
Add conditional torch.compile in base_train.py and chat_sft.py
Add memory monitoring with 32GB inference / 96GB training limits

Reference: Task-20, Task-18, Task-19, Task-28, Task-39
2026-02-09 13:19:00 +08:00
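The two helpers named in this commit could be sketched as below. This is a guess at their shape, not the actual nanochat/common.py; only the function names and the MPS/torch.compile behavior come from the commit message:

```python
import torch

def is_mps_device(device) -> bool:
    """Sketch: True if the given device (string or torch.device) is Apple MPS."""
    return torch.device(device).type == "mps"

def should_use_torch_compile(device) -> bool:
    # torch.compile can hang indefinitely on macOS MPS backends,
    # so compilation is skipped there and used everywhere else.
    return not is_mps_device(device)
```

Call sites in base_train.py and chat_sft.py would then wrap the model conditionally, e.g. `model = torch.compile(model) if should_use_torch_compile(device) else model`.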
Andrej Karpathy
1ec0a34779 at 28 and above we start to need batch size 8 2026-02-08 18:26:34 +00:00
Andrej Karpathy
ff46300720 tune miniseries just a bit, fairly cosmetic, keep to even depths where the math works out nicely in model sizing 2026-02-08 17:54:12 +00:00
Andrej Karpathy
aeff095e97 better comments/flow on all the hyperparameter transfer stuff, and change the WD scaling from my empirical 1/d^2 to a bit more principled version based on Tepoch. All of that theory is based on AdamW and could be suboptimal for Muon 2026-02-06 19:22:28 +00:00
Andrej Karpathy
685271dc8d new optimal ratio for d26 training 2026-02-06 19:21:27 +00:00
Andrej Karpathy
e527521a3f briefly mention batch ramp experimentation too, too weak to merge in my few attempts 2026-02-05 22:21:03 +00:00
Andrej Karpathy
96522798f1 docs docs docs 2026-02-05 20:27:07 +00:00
Andrej Karpathy
5fdd5cdb24 new leaderboard record via new auto-calculated optimal batch size. for d26 it is 1M, up from 0.5M that was default earlier 2026-02-05 20:11:32 +00:00
Andrej Karpathy
2c062aaa94 nit: don't mutate args, create new var for total_batch_size 2026-02-05 19:59:46 +00:00
Andrej Karpathy
f41dd3cbd7 auto-calculate optimal batch size. the original setting of 0.5M was only optimal for d12, but d26 prefers 1M and so on 2026-02-05 19:40:37 +00:00
Andrej Karpathy
98eed6df18 bring back an assert guarding against bad param sizing 2026-02-05 18:14:30 +00:00
Sofie Van Landeghem
012da1a78b
Typo fixes (#480)
* small typo

* few more small fixes

* small fixes in leaderboard.md
2026-02-05 19:12:50 +01:00
Andrej Karpathy
75b302f331 fix hash commit on leaderboard and a paragraph clarification 2026-02-05 16:14:28 +00:00
Andrej Karpathy
1144d186ed try and fail relu^2 -> swiglu 2026-02-05 02:42:46 +00:00
Andrej Karpathy
d63b7ab9ac try and fail relu^2 -> swiglu 2026-02-05 02:41:46 +00:00
Andrej Karpathy
718e5e9d67 correctly reference NorMuon and fix misleading terms that i may have hastily ported over from modded-nanogpt 2026-02-05 01:39:26 +00:00
Andrej Karpathy
542beb0c8c bump speedrun to be the up to date leaderboard run 2026-02-04 02:12:04 +00:00
Andrej Karpathy
d510b1385b quick experiments to log 2026-02-03 23:21:39 +00:00
Andrej Karpathy
16b8ac7da3 oops forgot to attach leaderboard file too 2026-02-03 21:06:12 +00:00
Andrej Karpathy
fe55b092b8 minor cosmetics for the table 2026-02-03 21:05:28 +00:00
Andrej Karpathy
a67eba35dc add feb2 new leaderboard record from upgrading to fp8 training, +4.3% speedup to time to GPT-2 2026-02-03 21:03:42 +00:00
Andrej Karpathy
6079f78fc3 add fp8 training with torchao 2026-02-03 21:03:42 +00:00
Andrej Karpathy
8ebc14b348 small touchups to the eval script, re-order items etc, cosmetic 2026-02-03 21:03:42 +00:00
Sofie Van Landeghem
72b9064f9d
remove leftover mid references (#491) 2026-02-02 08:33:46 -08:00
Andrej Karpathy
b19b4f3e49 fix bug in speedrun script, batch size that doesn't OOM on 8XH100 for d24 is 16 2026-02-02 15:50:14 +00:00
Andrej Karpathy
230d6cf6c6 tune the synthetic data generation script. delete the king andrej stuff lol. also, upgrade to gemini 3 2026-02-02 01:45:59 +00:00
Andrej Karpathy
07c4dd4cd9 manually control the over-active garbage collector, save a small few minutes from a typical run 2026-02-02 01:44:30 +00:00
Andrej Karpathy
e8fec97d4c slightly more efficient dataloader that reduces the number of python objects flying around and causing strain on runtime and garbage collector 2026-02-02 01:17:30 +00:00
Andrej Karpathy
8b4849d548 fix bug in chat_sft, the attention window must be preserved sigh 2026-02-01 20:58:44 +00:00
Andrej Karpathy
eaf49a33c8 fix path which i think was modified during the refactor and this is a bug introduced by claude i believe 2026-02-01 20:15:19 +00:00
Andrej Karpathy
31b61d2d17 fix broken import sigh 2026-02-01 05:03:44 +00:00
Sofie Van Landeghem
4d6415b8ef
use _PEAK_FLOPS_TABLE instead of if-else structure (#479) 2026-01-31 19:45:06 -08:00
Sofie Van Landeghem
43078c347e
clean up original tokenizing_distributed_data_loader (#478) 2026-01-31 19:44:12 -08:00
Franci Penov
dc291c627f
Add Blackwell (SM100) GPU support via SDPA fallback (#475) 2026-01-31 19:42:58 -08:00
Andrej Karpathy
0307997f9b merge two files base_loss and base_eval into a single file, it's nicer this way, and unify the huggingface code associated with both 2026-02-01 02:36:43 +00:00
Andrej Karpathy
1ddaad1c1c nuke midtraining from orbit, it's not as needed now that we have a BOS-aligned dataloader. Also change the README a lot. midtraining is not yet fully properly erased across the board, but good enough for step 1 2026-01-31 19:12:25 +00:00
Andrej Karpathy
348fbb301b fix dataloader for midtrain to never crop data. we can't just throw it away like we do in pretraining 2026-01-31 18:21:36 +00:00
Andrej Karpathy
3c3a3d7042 warmdown of 0.5 is slightly better: 2026-01-31 01:08:44 +00:00
Andrei Panferov
4d8dbaf6e0
Fix escape character in README bibtex entry (#454) 2026-01-30 09:34:02 -08:00
Andrej Karpathy
3ba42e8135 Fix SDPA KV-cache decode to respect sliding window (#456)
SDPA fallback now respects sliding window during single-token KV-cache
decode by slicing K/V to the last (window + 1) tokens.

Also simplifies the mask building for chunk inference to properly apply
sliding window in that path as well.

Fixes #452

Co-Authored-By: Kartik Vashishta <kartikv776@gmail.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-30 17:32:12 +00:00
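The core of the fix, slicing the KV cache before calling SDPA during single-token decode, might look like this sketch (shapes and the function name are assumptions; only the "slice K/V to the last window + 1 tokens" behavior comes from the commit message):

```python
import torch
import torch.nn.functional as F

def decode_step_sdpa(q, k_cache, v_cache, window: int):
    """Illustrative sketch: single-token KV-cache decode under a sliding
    window. q is (B, H, 1, D); caches are (B, H, T, D). Slicing the cache
    to the last (window + 1) positions keeps attention inside the window."""
    k = k_cache[:, :, -(window + 1):, :]
    v = v_cache[:, :, -(window + 1):, :]
    # with a single query token, every remaining key is a valid (past)
    # position, so no causal mask is needed after the slice
    return F.scaled_dot_product_attention(q, k, v)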
Aarushi Singh
ace6740bdd
feat: allow top_k=0 in web api to disable filtering (#458)
* allow top_k=0 in web api to disable filtering

* adding a comment for clear reasoning

* adding change to docstring
2026-01-30 09:21:41 -08:00
Harsh Gupta
2e17723817
Fix generate() crash when top_k=0 (#467)
Prevent a crash in generate() by skipping top-k filtering when top_k is set to 0
2026-01-30 09:21:02 -08:00
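Taken together, these two commits make `top_k=0` mean "no filtering". A minimal sampling sketch of that behavior (function name and filtering details are assumptions, not the actual nanochat code):

```python
import torch

def sample_next(logits, top_k: int = 50):
    """Sketch: sample a next-token index from logits of shape (B, V).
    top_k=0 skips top-k filtering entirely (sampling from the full
    distribution) instead of attempting top-k with k=0 and crashing."""
    if top_k > 0:
        k = min(top_k, logits.size(-1))
        vals, _ = torch.topk(logits, k)
        # mask out everything below the k-th largest logit
        logits = logits.masked_fill(logits < vals[..., -1:], float("-inf"))
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```

Guarding on `top_k > 0` handles both commits at once: the web API can pass 0 to disable filtering, and generate() no longer hits the degenerate k=0 case.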
Andrej Karpathy
02baa15405 i am feeling in a delete mood today. i need to delete a lot of code. there is too much code and surface area and complexity. ew 2026-01-30 17:08:53 +00:00
Andrej Karpathy
d6c4f3b923 i think this is the new torch 2.9+ API for declaring tf32 preference 2026-01-30 17:03:15 +00:00
Andrej Karpathy
067daa7758 small fix cpu script ty PR #474 2026-01-30 02:11:25 +00:00
Andrej Karpathy
6a341f2ecf contiguous views and single HtoD transfer for inputs/targets much cleaner 2026-01-30 00:23:01 +00:00