nanochat

mirror of https://github.com/karpathy/nanochat.git synced 2026-06-15 18:49:10 +00:00

Author	SHA1	Message	Date
ademeure	3d0dec5716	FA3/FlexAttention/SDPA attention + PyTorch 2.11/CUDA 13.0. Attention priority: FA3 (Hopper) → FlexAttention (Blackwell/Ada) → SDPA. FlexAttention uses block-sparse sliding window via torch.compile, ~3x faster than SDPA dense masks for sliding window layers. Full causal always uses SDPA is_causal=True. Override with ATTENTION=fa3|flex|sdpa. Also upgrades PyTorch 2.9.1 → 2.11.0 with CUDA 13.0, and auto-detects GPU for PyTorch/CUDA version selection in pyproject.toml. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 21:38:29 +00:00
Andrej Karpathy	1076f97059	delete autocast, an unnecessary thorn in my side, manage dtypes directly	2026-03-04 23:55:30 +00:00
Andrej Karpathy	3ba42e8135	Fix SDPA KV-cache decode to respect sliding window (#456). SDPA fallback now respects sliding window during single-token KV-cache decode by slicing K/V to the last (window + 1) tokens. Also simplifies the mask building for chunk inference to properly apply sliding window in that path as well. Fixes #452. Co-Authored-By: Kartik Vashishta <kartikv776@gmail.com> Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-30 17:32:12 +00:00
Andrej Karpathy	8203efa919	implement flash attention 3 fallback to pytorch sdpa by touching as few lines of code as possible in main files and keeping all implementation to a single file. add tests. add helpful warning messages for the user.	2026-01-16 17:37:51 +00:00
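
The backend priority described in commit 3d0dec5716 (FA3 on Hopper, FlexAttention on Blackwell/Ada, SDPA otherwise, with an ATTENTION override) could be sketched roughly as below. This is an illustration only, not the repository's code: the function name `choose_attention_backend`, the `gpu_arch` string argument, and the choice to raise an error on unknown override values are all assumptions. Only the priority order and the `ATTENTION=fa3|flex|sdpa` values come from the commit message.

```python
import os

# Backend names taken from the commit message (ATTENTION=fa3|flex|sdpa).
VALID_BACKENDS = ("fa3", "flex", "sdpa")


def choose_attention_backend(gpu_arch, env=None):
    """Pick an attention backend name (hypothetical helper, not nanochat's API).

    gpu_arch is an architecture name such as "Hopper", "Blackwell" or "Ada".
    How the real code detects the GPU is not stated in the log, so the caller
    passes the name in directly here.
    """
    env = os.environ if env is None else env

    # An explicit ATTENTION override takes precedence over auto-detection.
    override = env.get("ATTENTION", "").strip().lower()
    if override:
        if override not in VALID_BACKENDS:
            # Assumption: reject unknown values instead of silently falling back.
            raise ValueError(f"ATTENTION must be one of {VALID_BACKENDS}, got {override!r}")
        return override

    # Priority from the commit message: FA3 -> FlexAttention -> SDPA.
    arch = gpu_arch.strip().lower()
    if arch == "hopper":
        return "fa3"
    if arch in ("blackwell", "ada"):
        return "flex"
    return "sdpa"
```

For example, `choose_attention_backend("Ada", env={})` selects `"flex"`, while passing `env={"ATTENTION": "sdpa"}` forces the SDPA path regardless of the architecture.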