Added _Float8MatmulND to fp8.py:
- Handles N-D input tensors efficiently
- Does reshaping internally (opaque to torch.compile)
- Prevents external reshape overhead that was causing MFU regression
- ~75 lines of clean, documented code
Benefits:
- No torchao dependency (removed from pyproject.toml)
- Same performance as torchao for reparam_linear
- Consistent with fp8.py's minimal philosophy (~350 total lines)
- All FP8 logic in one self-contained module
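The shape handling described above can be sketched as a custom autograd function that flattens leading dims internally. This is a hypothetical illustration, not the actual fp8.py code: the class name mirrors the commit, a plain matmul stands in for the real fp8 kernel, and all shapes are invented for the example.

```python
import torch

# Hypothetical sketch of an N-D-aware matmul that folds the reshape into
# the op itself, so callers never emit reshape -> matmul -> reshape at
# the graph boundary. A plain matmul stands in for the fp8 kernel.
class Float8MatmulND(torch.autograd.Function):
    """Matmul over (..., in_features) inputs; reshaping happens inside."""

    @staticmethod
    def forward(ctx, x, weight):
        ctx.save_for_backward(x, weight)
        lead = x.shape[:-1]                   # arbitrary leading dims
        x2d = x.reshape(-1, x.shape[-1])      # flatten to 2-D internally
        # A real fp8 path would cast/scale here (e.g. a scaled-mm kernel);
        # this sketch uses an ordinary matmul in its place.
        out2d = x2d @ weight.t()
        return out2d.reshape(*lead, weight.shape[0])

    @staticmethod
    def backward(ctx, grad_out):
        x, weight = ctx.saved_tensors
        g2d = grad_out.reshape(-1, grad_out.shape[-1])
        x2d = x.reshape(-1, x.shape[-1])
        grad_x = (g2d @ weight).reshape_as(x)
        grad_w = g2d.t() @ x2d
        return grad_x, grad_w

x = torch.randn(2, 3, 4, requires_grad=True)  # N-D input, no caller reshape
w = torch.randn(5, 4, requires_grad=True)
y = Float8MatmulND.apply(x, w)
print(y.shape)  # torch.Size([2, 3, 5])
```

Because the reshapes live inside the op, the caller's code stays a single call regardless of input rank.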
Co-Authored-By: Claude Sonnet 4.5 (1M context) <noreply@anthropic.com>
The custom fp8 module had a performance issue in reparam_linear:
it did reshape→matmul→reshape on every linear-layer call, and
torch.compile couldn't fuse these operations because _Float8Matmul
was marked @allow_in_graph (opaque to the compiler).
torchao's matmul_with_hp_or_float8_args handles N-D tensors directly
without external reshaping, allowing better fusion opportunities and
higher MFU.
Co-Authored-By: Claude Sonnet 4.5 (1M context) <noreply@anthropic.com>
files-to-prompt was including untracked files (knowledge/, dev scripts, etc.), which inflated the bloat metrics. Now we use git ls-files to count only tracked source files, which is more accurate and removes an external dependency.
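The tracked-files-only counting can be sketched as below. This is a hypothetical helper, not the actual metrics script; the function name and the *.py pathspec are assumptions for the example.

```python
import os
import subprocess

# Hypothetical sketch: count lines only in git-tracked Python files,
# so untracked files (knowledge/, dev scripts, etc.) never inflate
# the totals and no external tool is needed.
def tracked_python_loc(repo="."):
    out = subprocess.run(
        ["git", "-C", repo, "ls-files", "--", "*.py"],
        capture_output=True, text=True, check=True,
    ).stdout
    total = 0
    for rel in out.splitlines():
        # ls-files paths are relative to the repo root
        with open(os.path.join(repo, rel)) as fh:
            total += sum(1 for _ in fh)
    return total
```

Because git ls-files only lists index entries, untracked files are excluded by construction rather than by filtering heuristics.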
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>