Mirror of https://github.com/karpathy/nanochat.git (synced 2026-04-02 13:45:21 +00:00)
Implements Block AttnRes from MoonshotAI (https://github.com/MoonshotAI/Attention-Residuals) as an optional alternative to `resid_lambdas`/`x0_lambdas` for residual connections. When enabled via `--attn-res`, it replaces standard residual scaling with learned depth-attention over block-level representations. Layers are partitioned into blocks; at each sublayer, a softmax-weighted combination of all completed blocks plus the current partial block determines the input to attention/MLP.

Design follows nanochat conventions:
- Two `nn.Parameter(n_layer, D)` pseudo-query vectors on GPT (like `resid_lambdas`)
- Uses the existing parameterless `norm()` for key normalization (no learnable RMSNorm)
- `Block` class unchanged; all AttnRes logic lives in `GPT.forward`
- Minimal 6-line `block_attn_res()` core function

Changes:
- `nanochat/gpt.py`: `block_attn_res()`, AttnRes path in `GPT.forward`, config/init/optimizer
- `nanochat/checkpoint_manager.py`: backward-compat config patching
- `scripts/base_train.py`: `--attn-res` and `--attn-res-block-size` CLI args
- `tests/test_attn_res.py`: 18 tests covering unit/forward/backward/optimizer/inference

GPU results (depth=4, 20 steps, RTX 6000 Ada):
- Standard: val_bpb 3.21 → 2.80, ~840K tok/sec
- AttnRes: val_bpb 3.21 → 2.61, ~780K tok/sec

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
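The `block_attn_res()` core described in the commit can be sketched as follows. This is a hypothetical reconstruction from the description only: the tensor layout, the function signature, and the inline RMS norm (standing in for nanochat's parameterless `norm()`) are assumptions, not the actual implementation.

```python
import torch
import torch.nn.functional as F

def norm(x):
    # Parameterless RMS normalization (stand-in for nanochat's norm();
    # no learnable scale, matching the "no learnable RMSNorm" note above).
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + 1e-6)

def block_attn_res(block_reps, q):
    """Depth-attention over block-level representations (sketch).

    block_reps: (n_blocks, batch, seq, D) -- completed blocks plus the
                current partial block
    q:          (D,) -- learned per-layer pseudo-query, e.g. one row of
                the nn.Parameter(n_layer, D) described in the commit
    Returns the (batch, seq, D) mixture fed into attention/MLP.
    """
    keys = norm(block_reps)                                 # parameterless key normalization
    scores = (keys * q).sum(dim=-1)                         # (n_blocks, batch, seq) dot products
    weights = F.softmax(scores, dim=0)                      # softmax over depth (blocks)
    return (weights.unsqueeze(-1) * block_reps).sum(dim=0)  # weighted combination
```

In `GPT.forward`, one would presumably index the pseudo-query parameter per sublayer and maintain the running stack of block representations; those details (and any temperature/scaling on the scores) are not specified in the commit message.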
Tests:
- test_attention_fallback.py
- test_attn_res.py
- test_engine.py