Resolves all crashes and silent errors when running on Apple Silicon (MPS)
or CPU after merging upstream/master. Tested on torch 2.2.2 and 2.9.1.
nanochat/engine.py
- Replace signal.SIGALRM timeout with concurrent.futures.ThreadPoolExecutor
so use_calculator() works from FastAPI worker threads (SIGALRM only
works in the Unix main thread, so it is silently broken in any
threaded web server)
- Guard torch.cuda.synchronize() behind device_type == 'cuda'
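The thread-based timeout can be sketched as follows; `run_with_timeout` is a hypothetical helper name, not the actual code in engine.py:

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

def run_with_timeout(fn, args=(), timeout_s=1.0):
    """Run fn(*args) in a worker thread, returning None on timeout.

    Unlike signal.SIGALRM (Unix, main thread only), this works when the
    caller is itself a worker thread, e.g. a FastAPI request handler.
    """
    executor = ThreadPoolExecutor(max_workers=1)
    try:
        future = executor.submit(fn, *args)
        try:
            return future.result(timeout=timeout_s)
        except FutureTimeout:
            return None  # treat a timeout as "no result", like the old alarm path
    finally:
        # Don't block on a stuck worker; the thread is simply abandoned.
        executor.shutdown(wait=False)
```

Note the trade-off: the timed-out worker thread keeps running to completion in the background, whereas SIGALRM interrupted it; for a short calculator eval that is acceptable.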
nanochat/gpt.py
- Extract init_rotary_embeddings() from init_weights() so checkpoint
loading can restore non-persistent cos/sin buffers without
re-randomising all weights
- Cast rotary cos/sin to bfloat16 on CUDA only (MPS bfloat16 requires
torch>=2.4; float32 used on MPS/CPU)
- Update forward() dtype assertion to match device
- Add F.rms_norm fallback for torch<2.4 (rms_norm added in 2.4)
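For torch<2.4 the fallback has to compute RMSNorm by hand (in torch terms, `x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps)`). A minimal pure-Python illustration of the same formula, not the actual nanochat code:

```python
import math

def rms_norm(xs, eps=1e-6):
    # y_i = x_i / sqrt(mean_j(x_j^2) + eps), matching F.rms_norm
    # (no weight) over the last dimension.
    mean_sq = sum(x * x for x in xs) / len(xs)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [x * inv_rms for x in xs]
```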
nanochat/optim.py
- _cuda_compile(): skip torch.compile(fullgraph=True) on MPS/CPU;
return function unchanged so eager execution is used
- adamw_step_fused / muon_step_fused: move 0-D CPU scalar tensors to
parameter device at start of function (cross-device ops crash in
eager mode on MPS)
- muon_step_fused: use bfloat16 in polar express on CUDA only;
fall back to float32 on MPS/CPU
- _step_muon: replace torch._foreach_copy_() with p.copy_(s) loop on
non-CUDA (_foreach_copy_ not implemented on MPS in torch<2.4)
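The compile guard in _cuda_compile() amounts to something like this sketch (`maybe_compile` is an illustrative name; the CUDA branch assumes torch is importable):

```python
def maybe_compile(fn, device_type):
    # torch.compile(fullgraph=True) is only used on CUDA; on MPS/CPU
    # return the function unchanged so the optimizer step runs eagerly.
    if device_type != "cuda":
        return fn
    import torch  # only needed on the CUDA path
    return torch.compile(fn, fullgraph=True)
```

Running eagerly is also why the scalar-device fix matters: in eager mode on MPS, ops mixing 0-D CPU scalar tensors with MPS parameters can crash, so the scalars are moved to the parameter device once at function entry.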
nanochat/flash_attention.py
- Probe SDPA for enable_gqa support at import time (added in torch 2.5;
inspect.signature raises on C builtins in older Python/torch)
- Fall back to manual KV head repetition via repeat_interleave when
enable_gqa is unavailable
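The probe and fallback can be sketched like this (`kwarg_supported` and `repeat_kv_heads` are hypothetical names; the real module checks `F.scaled_dot_product_attention` on tensors):

```python
import inspect

def kwarg_supported(fn, name):
    # True/False when the signature is introspectable; None when it is
    # not (inspect.signature raises ValueError on many C builtins), in
    # which case the caller conservatively takes the fallback path.
    try:
        return name in inspect.signature(fn).parameters
    except (ValueError, TypeError):
        return None

def repeat_kv_heads(heads, n_rep):
    # Manual GQA fallback: expand each KV head n_rep times to match the
    # query heads, like k.repeat_interleave(n_rep, dim=1) on a tensor.
    return [h for h in heads for _ in range(n_rep)]
```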
nanochat/checkpoint_manager.py
- Call model.init_rotary_embeddings() instead of model.init_weights()
after load_state_dict() to restore non-persistent rotary buffers
without clobbering loaded weights
scripts/base_train.py
- Guard torch.compile(model) behind device_type == 'cuda'
- Set mfu = None on non-CUDA instead of computing 0/inf = 0.00%
- Handle mfu is None in end-of-run report
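The reporting change reduces to a guard like the following (`mfu_str` is an illustrative helper; the script formats inline):

```python
def mfu_str(mfu):
    # mfu is None off-CUDA, where no peak-FLOPs figure is available,
    # so report "n/a" instead of a meaningless 0.00%.
    return "n/a" if mfu is None else f"{mfu * 100:.2f}%"
```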
tests/test_mps_compat.py (new)
- 16 tests covering every fix; all pass on MPS (torch 2.2.2 and 2.9.1)
SDPA fallback now respects sliding window during single-token KV-cache
decode by slicing K/V to the last (window + 1) tokens.
Also simplifies mask construction for chunked inference so the
sliding window is applied correctly in that path as well.
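During single-token decode, the positions attendable under a sliding window are just the tail of the KV cache, so a slice replaces the mask. A hypothetical sketch over a plain list standing in for the cache:

```python
def window_tail(cache, window):
    # With one new query token and a sliding window of size `window`,
    # only the last (window + 1) cached positions can be attended
    # (window past tokens plus the current one), so slice K/V to that
    # tail instead of building an attention mask.
    return cache[-(window + 1):]
```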
Fixes #452
Co-Authored-By: Kartik Vashishta <kartikv776@gmail.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* test: add engine generation tests for expected invariants
- test_seed_reproducibility
- test_temperature_zero_determinism
- test_max_tokens_respected
- test_num_samples_count
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* Fix temperature test
* add test for seed variation in sampling
Add test for seed variation in sampling with temperature > 0.
* Rename test for clarity
* Shorten assert msg
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Previously, when generating multiple samples (num_samples > 1), the first
token after prefill was sampled once and broadcast to all rows, causing
all samples to start identically. Now the prefill logits are expanded to
num_samples and sampled independently for each row.
Also simplified the generation loop by moving the forward pass to the end
of the loop, eliminating the first_iteration flag and if/else branching.
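The fix can be illustrated with a pure-Python stand-in for multinomial sampling (the names and probability-vector interface are illustrative, not the engine's API):

```python
import random

def sample_index(probs, rng):
    # One multinomial draw over a probability vector.
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

def first_tokens(prefill_probs, num_samples, seed=0):
    # Expand the single prefill distribution to num_samples rows and
    # sample each row independently, so samples can diverge at the very
    # first generated token instead of sharing one broadcast draw.
    rng = random.Random(seed)
    return [sample_index(prefill_probs, rng) for _ in range(num_samples)]
```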
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Performance varies by machine and load, making hard assertions flaky.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>