nanochat/scripts
William Thurston b7629eff5d Add L3 (Large Lookup Layers) following arXiv:2601.21461v2
L3 generalizes token embeddings by placing per-token lookup tables inside
the decoder stack. Unlike MoE, routing is static (determined by token ID),
eliminating router training and load-balancing losses.

Implementation:
- nanochat/l3.py: LZW allocation algorithm and L3Layer module with
  vectorized gather+pad+mask forward pass, tied/untied KV support
- GPT integration: L3 layers sit between decoder blocks, applied
  residually (x = x + l3_layer(x, token_ids))
- CLI: --l3-after-layers, --l3-n-emb, --l3-d-up, --l3-k-max flags,
  with LZW precomputation from a sample of the training data
- 17 tests covering allocation, layer, and GPT integration

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 15:49:15 -08:00
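
The static routing and residual application described in the commit message can be illustrated with a minimal pure-Python sketch. It is not the code in nanochat/l3.py. It assumes that each token ID owns a contiguous slice of a shared key/value table and that the layer attends over that slice with a softmax. The names l3_lookup, offsets, and counts are hypothetical, and the up-projection implied by --l3-d-up is left out.

```python
import math

def l3_lookup(x, token_ids, keys, values, offsets, counts, k_max):
    """Return x + lookup(x, token_ids) for a list of hidden vectors x.

    Hypothetical layout: token t owns rows offsets[t] .. offsets[t] + counts[t] - 1
    of the shared `keys` / `values` tables. Routing depends only on the token ID,
    so there is no learned router.
    """
    out = []
    for h, tok in zip(x, token_ids):
        start, n = offsets[tok], min(counts[tok], k_max)
        if n == 0:
            # Token has no allocated rows: the residual update is zero.
            out.append(list(h))
            continue
        # Gather this token's rows, then pad to k_max so every position
        # has the same shape; the mask marks which slots are real.
        idx = list(range(start, start + n)) + [start] * (k_max - n)
        mask = [True] * n + [False] * (k_max - n)
        # Dot-product scores against the gathered keys; padded slots get -inf.
        scores = [
            sum(a * b for a, b in zip(h, keys[i])) if m else float("-inf")
            for i, m in zip(idx, mask)
        ]
        # Numerically stable softmax; exp(-inf) is 0.0, so padding gets no weight.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the gathered values, added residually to h.
        update = [
            sum(w * values[i][d] for w, i in zip(weights, idx))
            for d in range(len(h))
        ]
        out.append([a + u for a, u in zip(h, update)])
    return out

# Toy tables: token 0 owns one row, token 1 owns two rows.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[0.5, 0.5], [1.0, 1.0], [2.0, 2.0]]
offsets, counts = [0, 1], [1, 2]

x = [[1.0, 0.0], [0.0, 0.0]]
print(l3_lookup(x, [0, 1], keys, values, offsets, counts, k_max=2))
# [[1.5, 0.5], [1.5, 1.5]]
```

Passing the same table as both keys and values would correspond to a tied-KV setup under these assumptions. A vectorized version would do the gather, pad, and mask steps on whole batches at once rather than looping over positions.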
base_eval.py small touchups to the eval script, re-order items etc, cosmetic 2026-02-03 21:03:42 +00:00
base_train.py Add L3 (Large Lookup Layers) following arXiv:2601.21461v2 2026-02-22 15:49:15 -08:00
chat_cli.py remove leftover mid references (#491) 2026-02-02 08:33:46 -08:00
chat_eval.py Merge upstream/master (266 commits) into fork 2026-02-22 14:50:28 -08:00
chat_rl.py remove leftover mid references (#491) 2026-02-02 08:33:46 -08:00
chat_sft.py tune the data mixture a bit, load optimizer by default when SFT. These were confirmed to be best settings from sweeps of sft 2026-02-18 15:49:18 +00:00
chat_web.py remove leftover mid references (#491) 2026-02-02 08:33:46 -08:00
tok_eval.py initial commit 2025-10-13 06:49:24 -07:00
tok_train.py quick fix to not OOM main speedrun script 2026-01-26 22:31:42 +00:00