- base_train.py: CUDA profiler + PyTorch profiler hooks gated by NANOCHAT_PROFILE_* env vars
- profile_step.py: standalone single-step profiler with NVTX ranges and phase selection
- LOCAL_STATE.md: documents local branch/file state before machine teardown
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
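A minimal sketch of how env-var gating for the profiler hooks above could look. Only the NANOCHAT_PROFILE_ prefix comes from the commit; the suffixes `TORCH` and `CUDA` and both helper names are hypothetical.

```python
import os

PREFIX = "NANOCHAT_PROFILE_"  # prefix from the commit; suffixes below are hypothetical


def profile_flags(environ=None):
    """Collect NANOCHAT_PROFILE_* variables, keyed by lowercased suffix."""
    environ = os.environ if environ is None else environ
    return {
        key[len(PREFIX):].lower(): value
        for key, value in environ.items()
        if key.startswith(PREFIX)
    }


def is_enabled(flags, name):
    """Treat common truthy strings as 'on'; a missing flag means 'off'."""
    return flags.get(name, "0").lower() in {"1", "true", "yes", "on"}


if __name__ == "__main__":
    flags = profile_flags()
    # Hypothetical flag names: NANOCHAT_PROFILE_TORCH, NANOCHAT_PROFILE_CUDA
    print("torch profiler:", is_enabled(flags, "torch"))
    print("cuda profiler:", is_enabled(flags, "cuda"))
```

With no such variables set, both hooks stay off, so a normal training run pays no profiling overhead.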
Attention priority: FA3 (Hopper) → FlexAttention (Blackwell/Ada) → SDPA.
FlexAttention uses a block-sparse sliding window via torch.compile, ~3x
faster than SDPA with dense masks for sliding-window layers. Full causal
attention always uses SDPA with is_causal=True. Override with ATTENTION=fa3|flex|sdpa.
Also upgrades PyTorch 2.9.1 → 2.11.0 with CUDA 13.0, and auto-detects
GPU for PyTorch/CUDA version selection in pyproject.toml.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
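The attention priority described above can be sketched as a pure-Python selection function. The compute-capability mapping (9.x for Hopper, 8.9 for Ada, 10+ for Blackwell) and both function names are assumptions for illustration, as is routing full-causal layers to SDPA only when FlexAttention is the platform backend.

```python
VALID_BACKENDS = {"fa3", "flex", "sdpa"}


def pick_attention_backend(capability, override=None):
    """Pick a platform-level backend from a (major, minor) compute capability.

    `override` mirrors the ATTENTION=fa3|flex|sdpa setting from the commit.
    """
    override = (override or "").strip().lower()
    if override:
        if override not in VALID_BACKENDS:
            raise ValueError(f"unknown attention backend: {override!r}")
        return override
    major, minor = capability
    if major == 9:                             # Hopper (assumed mapping)
        return "fa3"
    if (major, minor) == (8, 9) or major >= 10:  # Ada or Blackwell (assumed mapping)
        return "flex"
    return "sdpa"


def backend_for_layer(platform_backend, sliding_window):
    """Assumption: on the FlexAttention path, full-causal layers fall back to SDPA."""
    if platform_backend == "flex" and not sliding_window:
        return "sdpa"
    return platform_backend
```

In a real script the capability would come from `torch.cuda.get_device_capability()` and the override from `os.environ.get("ATTENTION")`.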
When swapping Float8Linear to Linear in the disable_fp8 context manager,
using device=fp8_module.weight.device allocates new tensors directly
on the GPU, causing an unnecessary VRAM spike (~1GB for large models).
This fix constructs the Linear with device='meta', which allocates no
physical memory, then swaps in a reference to the existing weight
tensor, eliminating the spike during the evaluation phase.
Fixes issue #592
Co-authored-by: RoomWithOutRoof <roomwithoutroof@sparklab.ai>
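A minimal sketch of the meta-device swap described above, assuming PyTorch is installed. The helper name `linear_sharing_weight` is hypothetical, and a plain `nn.Linear` stands in for the Float8Linear source module.

```python
import torch
import torch.nn as nn


def linear_sharing_weight(src: nn.Module) -> nn.Linear:
    """Build an nn.Linear that reuses src's parameters without new allocations."""
    out_features, in_features = src.weight.shape
    bias = getattr(src, "bias", None)
    # device="meta" creates shape-only placeholders: no CPU or GPU memory is used.
    linear = nn.Linear(in_features, out_features, bias=bias is not None, device="meta")
    # Replace the placeholders with references to the existing parameters.
    linear.weight = src.weight
    if bias is not None:
        linear.bias = bias
    return linear


if __name__ == "__main__":
    src = nn.Linear(8, 4)  # stand-in for a Float8Linear module
    dst = linear_sharing_weight(src)
    print(dst.weight is src.weight, dst.weight.device)
```

Because the new module only holds references, memory use stays flat however large the weight is.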
The bf16 cast is intentional for speed on Hopper+ GPUs, but should be
skipped on other platforms rather than blindly applied. fp16 is unstable
here due to its limited exponent range, and fp32 platforms don't benefit
from the cast. Now: bf16 when COMPUTE_DTYPE is bf16, no cast otherwise.
Inspired by PR #667.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New architectural features:
- Smear: mix previous token embedding into current position via learned
  gate, providing cheap bigram-like info (works in training + KV cache)
- Backout: subtract learned fraction of mid-layer residual before logit
  projection to remove low-level features
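One possible reading of the two features above, sketched on plain Python lists. The additive sigmoid-gated form of Smear, the scalar fraction in Backout, and both function names are assumptions; the commit does not spell out the exact formulas.

```python
import math


def smear(embeddings, gate_logits):
    """Assumed form: x'_t = x_t + sigmoid(g_t) * x_{t-1}; position 0 is unchanged."""
    out = [list(embeddings[0])]
    for t in range(1, len(embeddings)):
        gate = 1.0 / (1.0 + math.exp(-gate_logits[t]))
        out.append([cur + gate * prev for cur, prev in zip(embeddings[t], embeddings[t - 1])])
    return out


def backout(final_residual, mid_residual, fraction):
    """Assumed form: x_final - fraction * x_mid, applied before the logit projection."""
    return [f - fraction * m for f, m in zip(final_residual, mid_residual)]
```

Under this reading, Smear at decode time needs only the previous token's embedding, which would be consistent with it working alongside a KV cache.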
Hyperparameter tuning:
- Muon momentum warmdown 0.97→0.90 during LR warmdown phase
- Non-uniform per-layer init: resid_lambdas 1.15→1.05, x0_lambdas 0.20→0.05
- c_fc init scale 0.4x, QK norm scale 1.2, sliding window seq_len/4
- Speedrun data:params ratio reduced to 8
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
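The Muon momentum warmdown above could be scheduled roughly as follows. Only the 0.97 and 0.90 endpoints come from the commit; the linear interpolation, the `warmdown_frac` default, and the function name are assumptions.

```python
def muon_momentum(step, total_steps, warmdown_frac=0.5, start=0.97, end=0.90):
    """Hold `start` until the LR warmdown begins, then move linearly to `end`."""
    warmdown_start = int(total_steps * (1.0 - warmdown_frac))
    if step <= warmdown_start:
        return start
    progress = (step - warmdown_start) / max(1, total_steps - warmdown_start)
    progress = min(progress, 1.0)  # clamp in case step overshoots total_steps
    return start + (end - start) * progress
```

Tying the momentum schedule to the same warmdown window as the learning rate keeps both changes on a single phase boundary.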
* printing step count
* adding reply-only loss for chat
* using the mask returned by the tokeniser's render_conversation function
* undoing some changes
* putting back the comment which got removed accidentally, no functionality change
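A sketch of how a reply-only loss can be built from a token mask. It assumes render_conversation yields token ids plus a 0/1 mask marking assistant tokens, and that -1 is the ignore index of the loss; the helper name is hypothetical.

```python
IGNORE_INDEX = -1  # assumed ignore index for the loss function


def reply_only_targets(ids, mask, ignore_index=IGNORE_INDEX):
    """Shift ids by one for next-token prediction and hide non-reply targets.

    A target is kept only when the token being predicted has mask == 1.
    """
    if len(ids) != len(mask):
        raise ValueError("ids and mask must have the same length")
    inputs = ids[:-1]
    targets = [tok if m == 1 else ignore_index for tok, m in zip(ids[1:], mask[1:])]
    return inputs, targets
```

For example, ids `[10, 11, 12, 13, 14]` with mask `[0, 0, 1, 1, 0]` give inputs `[10, 11, 12, 13]` and targets `[-1, 12, 13, -1]`, so only the reply tokens contribute to the loss.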