Mirror of https://github.com/karpathy/nanochat.git
Synced 2026-03-23 13:23:23 +00:00
L3 generalizes token embeddings by placing per-token lookup tables inside the decoder stack. Unlike MoE, routing is static (determined by token ID), eliminating router training and load-balancing losses.

Implementation:
- nanochat/l3.py: LZW allocation algorithm and L3Layer module with a vectorized gather+pad+mask forward pass; tied/untied KV support
- GPT integration: L3 layers sit between decoder blocks, applied residually (x = x + l3_layer(x, token_ids))
- CLI: --l3-after-layers, --l3-n-emb, --l3-d-up, --l3-k-max flags, with LZW precomputation from a training-data sample
- 17 tests covering allocation, the layer, and GPT integration

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
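The commit description above can be illustrated with a minimal sketch of a statically routed per-token lookup layer. This is a hedged approximation, not the actual nanochat/l3.py implementation: the allocation here is random (the real code derives per-token table sizes with an LZW pass over training data), the pooling rule is an assumption, and names like `alloc`, `tables`, and `l3_layer` are hypothetical. It shows the core idea the message states: gather each token's rows, pad to `k_max`, mask the padding, and add the result residually.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model, k_max = 16, 8, 3

# Each token ID owns a variable number (<= k_max) of extra embedding rows.
# Allocation is random here; the real code computes it via LZW over a
# training-data sample (an assumption about l3.py, for illustration only).
alloc = {t: int(rng.integers(1, k_max + 1)) for t in range(vocab_size)}
tables = {t: rng.standard_normal((alloc[t], d_model)) * 0.02
          for t in range(vocab_size)}

def l3_layer(x, token_ids):
    """Residual per-token lookup: gather each token's rows, pad to k_max,
    mask out the padding, and mean-pool the valid rows into the stream."""
    T = len(token_ids)
    gathered = np.zeros((T, k_max, d_model))
    mask = np.zeros((T, k_max, 1))
    for i, t in enumerate(token_ids):
        k = alloc[t]
        gathered[i, :k] = tables[t]   # gather + pad to k_max
        mask[i, :k] = 1.0             # mask marks the real (non-pad) rows
    pooled = (gathered * mask).sum(axis=1) / mask.sum(axis=1)
    return x + pooled                 # applied residually, as in the commit

x = rng.standard_normal((5, d_model))
token_ids = [3, 3, 7, 0, 12]
y = l3_layer(x, token_ids)
assert y.shape == x.shape
# Routing is static: the same token ID always reads the same table rows,
# so the residual update is identical for repeated tokens.
assert np.allclose(y[0] - x[0], y[1] - x[1])
```

Because routing is a pure function of the token ID, there is nothing to train on the routing side, which is the contrast with MoE the commit message draws.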
| File | | |
|---|---|---|
| .. | | |
| test_attention_fallback.py | | |
| test_engine.py | | |
| test_l3.py | | |
| test_moe.py | | |