mirror of
https://github.com/karpathy/nanochat.git
synced 2026-06-15 10:39:08 +00:00
Attention priority: FA3 (Hopper) → FlexAttention (Blackwell/Ada) → SDPA. FlexAttention uses block-sparse sliding window via torch.compile, ~3x faster than SDPA dense masks for sliding window layers. Full causal always uses SDPA is_causal=True. Override with ATTENTION=fa3|flex|sdpa. Also upgrades PyTorch 2.9.1 → 2.11.0 with CUDA 13.0, and auto-detects GPU for PyTorch/CUDA version selection in pyproject.toml. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> |
||
|---|---|---|
| .. | ||
| test_attention_fallback.py | ||
| test_engine.py | ||