Mirror of https://github.com/karpathy/nanochat.git (synced 2026-06-15 10:39:08 +00:00)
Attention priority: FA3 (Hopper) → FlexAttention (Blackwell/Ada) → SDPA. FlexAttention implements the sliding window as a block-sparse mask compiled with torch.compile, which is ~3x faster than SDPA with dense masks on sliding-window layers. Full causal attention always uses SDPA with is_causal=True. Override the choice with ATTENTION=fa3|flex|sdpa.

Also upgrades PyTorch 2.9.1 → 2.11.0 with CUDA 13.0, and auto-detects the GPU to select the PyTorch/CUDA version in pyproject.toml.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
| Name |
|---|
| __init__.py |
| checkpoint_manager.py |
| common.py |
| core_eval.py |
| dataloader.py |
| dataset.py |
| engine.py |
| execution.py |
| flash_attention.py |
| fp8.py |
| gpt.py |
| logo.svg |
| loss_eval.py |
| optim.py |
| report.py |
| tokenizer.py |
| ui.html |
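
The backend priority described in the commit message (FA3 on Hopper, then FlexAttention on Blackwell/Ada, then SDPA, with an `ATTENTION` override) can be sketched as a small pure function. This is only an illustration, not the code in `flash_attention.py`: the function name `pick_attention_backend`, the compute-capability thresholds, and the error handling are all assumptions. It also does not model the per-layer rule that full causal attention uses SDPA with `is_causal=True`.

```python
import os

VALID_BACKENDS = ("fa3", "flex", "sdpa")


def pick_attention_backend(capability, env=None):
    """Pick an attention backend name (hypothetical helper, not from the repo).

    capability: (major, minor) CUDA compute capability, or None if no GPU.
    env: mapping consulted for the ATTENTION override; defaults to os.environ.
    """
    env = os.environ if env is None else env

    # An explicit ATTENTION=fa3|flex|sdpa override wins over auto-detection.
    override = env.get("ATTENTION", "").strip().lower()
    if override:
        if override not in VALID_BACKENDS:
            raise ValueError(f"ATTENTION must be one of {VALID_BACKENDS}, got {override!r}")
        return override

    if capability is None:
        return "sdpa"  # no GPU information: fall back to the generic path

    major, minor = capability
    # Assumption: Hopper is identified by compute capability 9.x.
    if major == 9:
        return "fa3"
    # Assumption: Ada (8.9) and Blackwell (10.x and newer) take the FlexAttention path.
    if (major, minor) >= (8, 9):
        return "flex"
    # Everything older falls back to SDPA.
    return "sdpa"
```

In a real setup the capability tuple would typically come from `torch.cuda.get_device_capability()`, and the override would be set in the shell, for example `ATTENTION=sdpa`, before launching the run.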