nanochat

mirror of https://github.com/karpathy/nanochat.git synced 2026-03-07 09:50:28 +00:00

William Thurston b7629eff5d Add L3 (Large Lookup Layers) following arXiv:2601.21461v2	2026-02-22 15:49:15 -08:00

L3 generalizes token embeddings by placing per-token lookup tables inside the decoder stack. Unlike MoE, routing is static (determined by token ID), eliminating router training and load-balancing losses.

Implementation:
- nanochat/l3.py: LZW allocation algorithm and L3Layer module with vectorized gather+pad+mask forward pass, tied/untied KV support
- GPT integration: L3 layers sit between decoder blocks, applied residually (x = x + l3_layer(x, token_ids))
- CLI: --l3-after-layers, --l3-n-emb, --l3-d-up, --l3-k-max flags with LZW precomputation from training data sample
- 17 tests covering allocation, layer, and GPT integration

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
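The commit message above describes static routing by token ID, a gather+pad+mask forward pass, and a residual update. The sketch below illustrates that general idea in plain Python under loud assumptions: it is not the code in nanochat/l3.py and not necessarily the method of the cited paper. The names `gather_pad_mask`, `l3_lookup`, `offsets`, and `counts` are hypothetical, the per-token allocation is taken as given (the LZW allocation step is not reproduced), keys and values are assumed to share the model width, and no up-projection is modeled.

```python
import math

def gather_pad_mask(token_ids, offsets, counts, k_max):
    """For each token, list the table rows it owns, padded to k_max, plus a validity mask."""
    rows, mask = [], []
    for t in token_ids:
        n = min(counts[t], k_max)
        idx = [offsets[t] + j for j in range(n)]
        rows.append(idx + [0] * (k_max - n))           # padding points at row 0 and is masked out
        mask.append([True] * n + [False] * (k_max - n))
    return rows, mask

def l3_lookup(x, token_ids, keys, values, offsets, counts, k_max):
    """Masked softmax attention of each position over the rows owned by its own token ID."""
    rows, mask = gather_pad_mask(token_ids, offsets, counts, k_max)
    out = []
    for h, r, m in zip(x, rows, mask):
        if not any(m):                                  # token owns no rows: contribute nothing
            out.append([0.0] * len(h))
            continue
        scores = [sum(a * b for a, b in zip(h, keys[i])) if ok else -math.inf
                  for i, ok in zip(r, m)]
        top = max(scores)
        w = [math.exp(s - top) for s in scores]         # padded slots get weight 0.0
        z = sum(w)
        out.append([sum(wi * values[i][d] for wi, i in zip(w, r)) / z
                    for d in range(len(h))])
    return out

# Toy table: token 0 owns rows 0-1, token 1 owns row 2 (assumed allocation).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
offsets, counts, k_max = [0, 2], [2, 1], 2

token_ids = [1, 0]
x = [[0.5, 0.5], [0.0, 0.0]]

rows, mask = gather_pad_mask(token_ids, offsets, counts, k_max)
delta = l3_lookup(x, token_ids, keys, values, offsets, counts, k_max)
# Residual application, as in x = x + l3_layer(x, token_ids)
y = [[a + b for a, b in zip(h, d)] for h, d in zip(x, delta)]
```

Because the rows a position may attend to depend only on its token ID, there is no learned router in this sketch; the mask only hides the padding that brings every token up to `k_max` slots.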
base_eval.py	small touchups to the eval script, re-order items etc, cosmetic	2026-02-03 21:03:42 +00:00
base_train.py	Add L3 (Large Lookup Layers) following arXiv:2601.21461v2	2026-02-22 15:49:15 -08:00
chat_cli.py	remove leftover mid references (#491)	2026-02-02 08:33:46 -08:00
chat_eval.py	Merge upstream/master (266 commits) into fork	2026-02-22 14:50:28 -08:00
chat_rl.py	remove leftover mid references (#491)	2026-02-02 08:33:46 -08:00
chat_sft.py	tune the data mixture a bit, load optimizer by default when SFT. These were confirmed to be best settings from sweeps of sft	2026-02-18 15:49:18 +00:00
chat_web.py	remove leftover mid references (#491)	2026-02-02 08:33:46 -08:00
tok_eval.py	initial commit	2025-10-13 06:49:24 -07:00
tok_train.py	quick fix to not OOM main speedrun script	2026-01-26 22:31:42 +00:00