nanochat/tests
William Thurston b7629eff5d Add L3 (Large Lookup Layers) following arXiv:2601.21461v2
L3 generalizes token embeddings by placing per-token lookup tables inside
the decoder stack. Unlike MoE, L3 routing is static (determined by token ID),
which eliminates router training and load-balancing losses.
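The static-routing idea can be illustrated with a minimal pure-Python sketch; all names and shapes here are hypothetical, and the actual L3Layer is a vectorized PyTorch module (see nanochat/l3.py):

```python
# Static routing: each token ID indexes its own lookup table directly,
# so there is no learned router and no load-balancing loss.
def l3_forward(x, token_ids, tables, default):
    """x: list of hidden vectors (one per position); token_ids: list of ints.
    tables: dict mapping token ID -> correction vector of the same width.
    Tokens without an allocated table fall back to a zero vector."""
    out = []
    for h, t in zip(x, token_ids):
        v = tables.get(t, default)  # static lookup, keyed by token ID alone
        out.append([hi + vi for hi, vi in zip(h, v)])  # residual add
    return out

# Example: token 7 has a table entry, token 3 does not.
hidden = l3_forward([[1.0, 2.0], [0.5, 0.5]], [7, 3],
                    {7: [0.1, -0.1]}, [0.0, 0.0])
```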

Implementation:
- nanochat/l3.py: LZW allocation algorithm and L3Layer module with
  vectorized gather+pad+mask forward pass, tied/untied KV support
- GPT integration: L3 layers sit between decoder blocks, applied
  residually (x = x + l3_layer(x, token_ids))
- CLI: --l3-after-layers, --l3-n-emb, --l3-d-up, --l3-k-max flags
  with LZW precomputation from training data sample
- 17 tests covering allocation, layer, and GPT integration
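The commit does not spell out the LZW precomputation, but one plausible shape for it is to grow an LZW-style phrase dictionary over a sample of training tokens and count new phrases per leading token as a capacity signal. A hedged stdlib sketch (function and variable names are invented, not the repo's API):

```python
def lzw_phrase_counts(token_stream):
    """Grow an LZW-style dictionary over a token stream and count how many
    new phrases begin with each token -- a hypothetical proxy for deciding
    how much lookup-table capacity each token ID should be allocated."""
    dictionary = {(t,) for t in token_stream}  # seed with all unigrams
    counts = {}
    phrase = ()
    for t in token_stream:
        candidate = phrase + (t,)
        if candidate in dictionary:
            phrase = candidate             # extend the current phrase
        else:
            dictionary.add(candidate)      # new phrase enters the dictionary
            lead = candidate[0]
            counts[lead] = counts.get(lead, 0) + 1
            phrase = (t,)                  # restart from the current token
    return counts
```

A per-token allocation could then cap these counts at a maximum k (the commit's --l3-k-max flag suggests such a cap, though the exact mapping is not stated here).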

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 15:49:15 -08:00
test_attention_fallback.py Fix SDPA KV-cache decode to respect sliding window (#456) 2026-01-30 17:32:12 +00:00
test_engine.py Fix MockModel's device definition (#535) 2026-02-17 16:03:46 -08:00
test_l3.py Add L3 (Large Lookup Layers) following arXiv:2601.21461v2 2026-02-22 15:49:15 -08:00
test_moe.py Implement reset_parameters method in MoEFeedForward and update GPT to utilize it 2025-11-13 17:09:11 -08:00