4 Commits

Author SHA1 Message Date
ademeure
3d0dec5716 FA3/FlexAttention/SDPA attention + PyTorch 2.11/CUDA 13.0
Attention priority: FA3 (Hopper) → FlexAttention (Blackwell/Ada) → SDPA.
FlexAttention uses block-sparse sliding window via torch.compile, ~3x
faster than SDPA dense masks for sliding window layers. Full causal
always uses SDPA is_causal=True. Override with ATTENTION=fa3|flex|sdpa.

Also upgrades PyTorch 2.9.1 → 2.11.0 with CUDA 13.0, and auto-detects
GPU for PyTorch/CUDA version selection in pyproject.toml.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 21:38:29 +00:00
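
The backend priority described in commit 3d0dec5716 can be sketched as a small selection function. This is an illustrative sketch, not the repository's code. The function name `select_attention_backend`, its `env` parameter, and the mapping from CUDA compute capability to architecture (Hopper as 9.x, Ada as 8.9, Blackwell as 10.x and later) are assumptions made for the example. Only the priority order and the ATTENTION=fa3|flex|sdpa override come from the commit message.

```python
import os

VALID_BACKENDS = ("fa3", "flex", "sdpa")


def select_attention_backend(compute_capability, env=None):
    """Pick an attention backend in the order FA3 -> FlexAttention -> SDPA.

    compute_capability: (major, minor) tuple for the GPU, or None if no
    CUDA device is present. Hypothetical helper, for illustration only.
    """
    env = os.environ if env is None else env

    # An explicit ATTENTION=fa3|flex|sdpa override wins over auto-detection.
    override = env.get("ATTENTION", "").strip().lower()
    if override:
        if override not in VALID_BACKENDS:
            raise ValueError(f"ATTENTION must be one of {VALID_BACKENDS}, got {override!r}")
        return override

    if compute_capability is None:
        return "sdpa"  # no GPU: fall back to PyTorch SDPA

    major, minor = compute_capability
    if major == 9:
        return "fa3"   # assumed: Hopper (sm_90)
    if (major, minor) == (8, 9) or major >= 10:
        return "flex"  # assumed: Ada (sm_89) or Blackwell (sm_100 and later)
    return "sdpa"      # everything else


print(select_attention_backend((9, 0), env={}))
print(select_attention_backend((8, 9), env={}))
print(select_attention_backend((9, 0), env={"ATTENTION": "sdpa"}))
```

In a real PyTorch program the capability tuple could come from `torch.cuda.get_device_capability()`; it is passed in as an argument here so the sketch runs without a GPU.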
Andrej Karpathy
1076f97059 delete autocast, an unnecessary thorn in my side, manage dtypes directly
2026-03-04 23:55:30 +00:00
Andrej Karpathy
3ba42e8135 Fix SDPA KV-cache decode to respect sliding window (#456)
SDPA fallback now respects sliding window during single-token KV-cache
decode by slicing K/V to the last (window + 1) tokens.

Also simplifies the mask building for chunk inference to properly apply
sliding window in that path as well.

Fixes #452

Co-Authored-By: Kartik Vashishta <kartikv776@gmail.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-30 17:32:12 +00:00
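
The K/V slicing described in commit 3ba42e8135 can be illustrated with plain Python lists standing in for the cached key and value tensors. This is a minimal sketch under stated assumptions, not the repository's implementation. The names `slice_kv_for_decode` and `visible_positions` are hypothetical, the cache is assumed to already contain the current token, and a negative `window` is assumed to mean "no sliding window".

```python
def visible_positions(t, window):
    """Key positions a query at position t may attend to: itself plus
    up to `window` previous tokens (hypothetical helper)."""
    return list(range(max(0, t - window), t + 1))


def slice_kv_for_decode(keys, values, window):
    """Keep only the last (window + 1) cached positions for a
    single-token decode step.

    keys, values: per-position sequences that include the current token.
    window: number of previous tokens visible; negative means unlimited
    (an assumption made for this sketch).
    """
    if window < 0 or len(keys) <= window + 1:
        return keys, values  # nothing to trim
    start = len(keys) - (window + 1)
    return keys[start:], values[start:]


# Ten cached positions, window of 3: the query at position 9 sees 6..9.
keys = list(range(10))
values = [f"v{i}" for i in range(10)]
k, v = slice_kv_for_decode(keys, values, window=3)
print(k, v)
print(k == visible_positions(9, window=3))
```

With real tensors the same slice would be taken along the sequence dimension of the cache. Once the cache is trimmed this way, the single query token can attend to every remaining position without an extra mask.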
Andrej Karpathy
8203efa919 implement flash attention 3 fallback to pytorch sdpa by touching as few lines of code as possible in main files and keeping all implementation to a single file. add tests. add helpful warning messages for the user.
2026-01-16 17:37:51 +00:00