nanochat/tests
AaronXPeng 9f15189853 Add optional Block Attention Residuals (AttnRes)
Implements Block AttnRes from MoonshotAI (https://github.com/MoonshotAI/Attention-Residuals)
as an optional alternative to resid_lambdas/x0_lambdas for residual connections.

When enabled via --attn-res, AttnRes replaces the standard residual scaling with a
learned depth-attention over block-level representations. Layers are partitioned into
blocks; at each sublayer, a softmax-weighted combination of all completed blocks
plus the current partial block forms the input to attention/MLP.

Design follows nanochat conventions:
- Two nn.Parameter(n_layer, D) pseudo-query vectors on GPT (like resid_lambdas)
- Uses existing parameterless norm() for key normalization (no learnable RMSNorm)
- Block class unchanged; all AttnRes logic lives in GPT.forward
- Minimal 6-line block_attn_res() core function
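To make the mechanism above concrete, here is a minimal sketch of what the
block_attn_res() core might look like, based only on the description in this
commit message. The function name matches the commit; the argument names,
shapes, and the manual parameterless RMS norm are assumptions, not the actual
nanochat code.

```python
import torch

def norm(x):
    # parameterless RMS norm (no learnable scale), mimicking nanochat's norm();
    # assumption: eps value is illustrative
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + 1e-6)

def block_attn_res(xs, q):
    # xs: list of (B, T, D) tensors -- completed block outputs plus the
    #     current partial block
    # q:  (D,) pseudo-query row for this layer, taken from the
    #     (n_layer, D) nn.Parameter on GPT
    keys = torch.stack([norm(x) for x in xs], dim=0)        # (K, B, T, D)
    scores = torch.einsum('kbtd,d->kbt', keys, q)           # (K, B, T)
    w = torch.softmax(scores, dim=0)                        # weights over blocks
    return torch.einsum('kbt,kbtd->btd', w, torch.stack(xs, dim=0))  # (B, T, D)
```

With the pseudo-query at zero, the softmax is uniform and the output is the
plain mean of the block representations, which gives a sensible initialization
for the depth-attention path.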

Changes:
- nanochat/gpt.py: block_attn_res(), AttnRes path in GPT.forward, config/init/optimizer
- nanochat/checkpoint_manager.py: backward-compat config patching
- scripts/base_train.py: --attn-res and --attn-res-block-size CLI args
- tests/test_attn_res.py: 18 tests covering unit/forward/backward/optimizer/inference

GPU results (depth=4, 20 steps, RTX 6000 Ada):
  Standard: val_bpb 3.21 → 2.80, ~840K tok/sec
  AttnRes:  val_bpb 3.21 → 2.61, ~780K tok/sec

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 00:57:16 -04:00
test_attention_fallback.py delete autocast, an unnecessary thorn in my side, manage dtypes directly 2026-03-04 23:55:30 +00:00
test_attn_res.py Add optional Block Attention Residuals (AttnRes) 2026-03-17 00:57:16 -04:00
test_engine.py Fix MockModel's device definition (#535) 2026-02-17 16:03:46 -08:00