nanochat

mirror of https://github.com/karpathy/nanochat.git synced 2026-06-15 18:49:10 +00:00


ademeure 3d0dec5716 FA3/FlexAttention/SDPA attention + PyTorch 2.11/CUDA 13.0

Attention priority: FA3 (Hopper) → FlexAttention (Blackwell/Ada) → SDPA.
FlexAttention uses block-sparse sliding window via torch.compile, ~3x
faster than SDPA dense masks for sliding window layers. Full causal
always uses SDPA is_causal=True. Override with ATTENTION=fa3|flex|sdpa.

Also upgrades PyTorch 2.9.1 → 2.11.0 with CUDA 13.0, and auto-detects
GPU for PyTorch/CUDA version selection in pyproject.toml.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-08 21:38:29 +00:00
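
The backend priority and the ATTENTION override described in the commit message can be sketched roughly as follows. This is not the nanochat implementation: `choose_attention_backend` and `make_sliding_window_mask` are hypothetical names, the mapping from GPU generation to CUDA compute capability (Hopper 9.x, Ada 8.9, Blackwell 10.x and later) is an assumption, and so is the window convention (each query attends to itself and the `window - 1` positions before it). The sketch is pure Python so it runs without a GPU.

```python
import os

VALID_BACKENDS = ("fa3", "flex", "sdpa")


def choose_attention_backend(compute_capability, env=None):
    """Return "fa3", "flex" or "sdpa" for a (major, minor) compute capability.

    Hypothetical helper illustrating the priority FA3 -> FlexAttention -> SDPA.
    Pass compute_capability=None when no CUDA device is present.
    """
    env = os.environ if env is None else env

    # An explicit ATTENTION=fa3|flex|sdpa setting wins over auto-detection.
    override = env.get("ATTENTION", "").strip().lower()
    if override:
        if override not in VALID_BACKENDS:
            raise ValueError(
                f"ATTENTION must be one of {VALID_BACKENDS}, got {override!r}"
            )
        return override

    if compute_capability is None:
        return "sdpa"

    major, minor = compute_capability
    if major == 9:
        # Assumption: compute capability 9.x means Hopper.
        return "fa3"
    if (major, minor) == (8, 9) or major >= 10:
        # Assumption: 8.9 is Ada, 10.x and later is Blackwell.
        return "flex"
    # Everything else falls back to SDPA.
    return "sdpa"


def make_sliding_window_mask(window):
    """Build a causal sliding-window predicate over (b, h, q_idx, kv_idx).

    The argument order follows the mask_mod convention used by FlexAttention;
    `&` is used instead of `and` so the same expression also works on tensors.
    """
    def mask_mod(b, h, q_idx, kv_idx):
        causal = kv_idx <= q_idx
        in_window = (q_idx - kv_idx) < window
        return causal & in_window

    return mask_mod
```

For example, `choose_attention_backend((9, 0), env={})` selects `"fa3"`, while the same call with `env={"ATTENTION": "sdpa"}` returns `"sdpa"`. In real code the capability tuple would typically come from `torch.cuda.get_device_capability()`, and a predicate like `mask_mod` is the kind of function a block-sparse mask is built from.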
test_attention_fallback.py	FA3/FlexAttention/SDPA attention + PyTorch 2.11/CUDA 13.0	2026-04-08 21:38:29 +00:00
test_engine.py	Fix MockModel's device definition (#535)	2026-02-17 16:03:46 -08:00
