nanochat/tests
ademeure 3d0dec5716 FA3/FlexAttention/SDPA attention + PyTorch 2.11/CUDA 13.0
Attention priority: FA3 (Hopper) → FlexAttention (Blackwell/Ada) → SDPA.
FlexAttention uses block-sparse sliding window via torch.compile, ~3x
faster than SDPA dense masks for sliding-window layers. Full causal
attention always uses SDPA with is_causal=True. Override the choice by
setting ATTENTION=fa3|flex|sdpa.
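
The priority order above can be sketched as a small selection function. This is only an illustration, not the code from this commit: the function names and the compute-capability mapping (Hopper = 9.x, Ada = 8.9, Blackwell = 10.x and newer) are assumptions made for the example. Only the ATTENTION variable and its three values come from the commit message.

```python
import os

# Values accepted by the ATTENTION override, as listed in the commit message.
VALID_BACKENDS = ("fa3", "flex", "sdpa")


def pick_attention_backend(capability, override=None):
    """Hypothetical helper: choose an attention backend.

    capability: (major, minor) CUDA compute capability, or None if no GPU.
    override:   optional string such as the value of the ATTENTION variable.
    """
    if override:
        choice = override.strip().lower()
        if choice not in VALID_BACKENDS:
            raise ValueError(f"ATTENTION must be one of {VALID_BACKENDS}, got {override!r}")
        return choice

    if capability is None:
        return "sdpa"  # no CUDA device: fall back to SDPA

    major, minor = capability
    if major == 9:
        return "fa3"  # assumption: Hopper reports compute capability 9.x
    if (major, minor) == (8, 9) or major >= 10:
        return "flex"  # assumption: Ada is 8.9, Blackwell is 10.x and newer
    return "sdpa"  # everything else uses the SDPA fallback


def backend_from_env(capability):
    """Hypothetical wrapper that reads the ATTENTION environment variable."""
    return pick_attention_backend(capability, os.environ.get("ATTENTION"))
```

With PyTorch installed, the capability tuple could come from torch.cuda.get_device_capability() when torch.cuda.is_available() is true, and None otherwise.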

Also upgrades PyTorch 2.9.1 → 2.11.0 with CUDA 13.0, and auto-detects
GPU for PyTorch/CUDA version selection in pyproject.toml.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 21:38:29 +00:00
test_attention_fallback.py FA3/FlexAttention/SDPA attention + PyTorch 2.11/CUDA 13.0 2026-04-08 21:38:29 +00:00
test_engine.py Fix MockModel's device definition (#535) 2026-02-17 16:03:46 -08:00