L3 generalizes token embeddings by placing per-token lookup tables inside
the decoder stack. Unlike in MoE, routing is static (determined by token ID),
which eliminates router training and load-balancing losses.
Implementation:
- nanochat/l3.py: LZW allocation algorithm and L3Layer module with
vectorized gather+pad+mask forward pass, tied/untied KV support
- GPT integration: L3 layers sit between decoder blocks and are applied
  residually (x = x + l3_layer(x, token_ids)); see the sketch after this list
- CLI: --l3-after-layers, --l3-n-emb, --l3-d-up, --l3-k-max flags,
  with LZW precomputation from a sample of the training data
- 17 tests covering allocation, layer, and GPT integration
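A minimal sketch of the idea only, not the nanochat/l3.py implementation: a
static per-token lookup table projected up to the model width and added
residually between decoder blocks. The LZW allocation that assigns table
entries per token, the gather+pad+mask path, and the tied/untied KV handling
are all omitted; the names below merely mirror the CLI flags.

```python
import torch
import torch.nn as nn

class L3Sketch(nn.Module):
    def __init__(self, vocab_size: int, d_up: int, n_embd: int):
        super().__init__()
        self.table = nn.Embedding(vocab_size, d_up)    # routed statically by token ID
        self.up = nn.Linear(d_up, n_embd, bias=False)  # project up to the model width

    def forward(self, x: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        # x is accepted only to mirror the l3_layer(x, token_ids) call signature;
        # in this simplified sketch the lookup depends on the token IDs alone
        return self.up(self.table(token_ids))

# between decoder blocks, applied residually as described above:
# x = x + l3(x, token_ids)
```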
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Accept upstream's architectural changes wholesale:
- argparse replaces configurator.py across all scripts
- Unified MuonAdamW optimizer replaces separate AdamW + Muon
- Sliding window attention (SSSL pattern) + Flash Attention 3
- Value embeddings (ResFormer-style) with per-layer gating
- Per-layer learnable scalars (resid_lambdas, x0_lambdas); sketched below
- FP8 training support with Float8Linear
- Scaling laws (Power Lines batch sizing, T_epoch weight decay)
- Checkpoint resumption with dataloader state
- BOS-aligned bestfit-pad packing for SFT
- ChatCORE evaluation metric
- Consolidated base_loss.py into base_eval.py
- Removed mid_train.py (pipeline simplified)
Drops our MoE and tie_embeddings implementations in favor of
upstream's cleaner architecture. These can be re-added later
on top of the new codebase if needed.
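For the per-layer learnable scalars, a hedged sketch of the general pattern
only (the exact upstream formulation may differ): each layer learns a scale on
the residual stream (resid_lambdas) and a scale for re-injecting the original
embedding x0 (x0_lambdas).

```python
import torch
import torch.nn as nn

class ScaledResidualStack(nn.Module):
    def __init__(self, blocks: nn.ModuleList):
        super().__init__()
        n_layer = len(blocks)
        self.blocks = blocks
        self.resid_lambdas = nn.Parameter(torch.ones(n_layer))   # scale the residual stream
        self.x0_lambdas = nn.Parameter(torch.zeros(n_layer))     # re-inject the input embedding

    def forward(self, x0: torch.Tensor) -> torch.Tensor:
        x = x0
        for i, block in enumerate(self.blocks):
            x = self.resid_lambdas[i] * x + self.x0_lambdas[i] * x0
            x = block(x)
        return x
```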
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
SDPA fallback now respects sliding window during single-token KV-cache
decode by slicing K/V to the last (window + 1) tokens.
Also simplifies the mask building for chunk inference so that the sliding
window is properly applied in that path as well.
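A minimal sketch of the single-token decode path (the (B, H, T, D) tensor
layout is an assumption): with one new query token, a sliding window of size
`window` reduces to attending over only the last (window + 1) cached
positions, so no explicit mask is needed in the SDPA fallback.

```python
import torch.nn.functional as F

def sdpa_decode_step(q, k, v, window=None):
    # q: (B, H, 1, D) single new query token; k, v: (B, H, T, D) full KV cache
    if window is not None:
        k = k[:, :, -(window + 1):, :]
        v = v[:, :, -(window + 1):, :]
    # every remaining key position is attendable by the single query token
    return F.scaled_dot_product_attention(q, k, v)
```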
Fixes #452
Co-Authored-By: Kartik Vashishta <kartikv776@gmail.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* test: add engine generation tests for expected invariants
- test_seed_reproducibility
- test_temperature_zero_determinism
- test_max_tokens_respected
- test_num_samples_count
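A minimal sketch of the first invariant; the `engine` object and its
`generate(...)` keyword arguments are assumptions standing in for the real
engine API.

```python
def test_seed_reproducibility(engine, prompt_tokens):
    # identical seeds should yield identical samples, even with temperature > 0
    a = engine.generate(prompt_tokens, seed=42, max_tokens=16, temperature=1.0)
    b = engine.generate(prompt_tokens, seed=42, max_tokens=16, temperature=1.0)
    assert a == b, "same seed should produce identical output"
```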
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* Fix temperature test
* Add test for seed variation in sampling
Add test for seed variation in sampling with temperature > 0.
* Rename test for clarity
* Shorten assert msg
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Previously, when generating multiple samples (num_samples > 1), the first
token after prefill was sampled once and broadcast to all rows, causing
all samples to start identically. Now the prefill logits are expanded to
num_samples and sampled independently for each row.
Also simplified the generation loop by moving the forward pass to the end
of the loop, eliminating the first_iteration flag and if/else branching.
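A sketch of the sampling fix (tensor names and shapes are illustrative):
expand the last prefill logits to one row per sample and draw each row
independently, assuming temperature > 0 (the greedy path is separate).

```python
import torch

def sample_first_tokens(prefill_logits, num_samples, temperature=1.0):
    # prefill_logits: (1, vocab_size) logits at the final prefill position
    logits = prefill_logits.expand(num_samples, -1)   # one row per sample
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1)    # independent draw per row
```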
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Performance varies by machine and load, making hard assertions flaky.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Added a reset_parameters method to MoEFeedForward that reinitializes expert parameters (a sketch of the pattern follows this list).
- Updated the GPT class to call reset_parameters on MoEFeedForward instances during weight initialization.
- Added a test in test_moe.py that verifies MoE experts receive gradient updates during training.
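A hedged sketch of the pattern; the `experts` field and the init scheme are
assumptions, not the actual MoEFeedForward internals.

```python
import torch.nn as nn

class MoEFeedForwardSketch(nn.Module):
    def __init__(self, n_expert: int, n_embd: int, hidden: int):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(n_embd, hidden), nn.GELU(), nn.Linear(hidden, n_embd))
            for _ in range(n_expert)
        )

    def reset_parameters(self):
        # reinitialize every expert so the model-wide weight init covers the MoE module too
        for expert in self.experts:
            for m in expert.modules():
                if isinstance(m, nn.Linear):
                    nn.init.normal_(m.weight, std=0.02)
                    if m.bias is not None:
                        nn.init.zeros_(m.bias)

# in GPT weight initialization:
# if isinstance(module, MoEFeedForwardSketch):
#     module.reset_parameters()
```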
- Introduced a `MOE_DEBUG_INTERVAL` parameter in `runmps.sh` to control how often debug logging runs during training.
- Updated `base_train.py` to log the gradients of the routed and shared expert weights at that interval, making it easier to monitor how the experts are training (sketched after this list).
- Adjusted the router bias calculation in `gpt.py` to improve load balancing among experts.
- Added unit tests in `test_moe.py` covering the behavior of the MoE implementation and the correctness of its gradient calculations.
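A hedged sketch of the interval-based gradient logging; reading
`MOE_DEBUG_INTERVAL` from the environment mirrors `runmps.sh`, and the
"routed"/"shared" name filters are assumptions about how the MoE weights are
named.

```python
import os

MOE_DEBUG_INTERVAL = int(os.environ.get("MOE_DEBUG_INTERVAL", "0"))

def log_moe_grads(model, step):
    # log gradient norms of MoE weights every MOE_DEBUG_INTERVAL steps (0 disables)
    if MOE_DEBUG_INTERVAL == 0 or step % MOE_DEBUG_INTERVAL != 0:
        return
    for name, p in model.named_parameters():
        if p.grad is not None and ("routed" in name or "shared" in name):
            print(f"step {step} | {name} | grad_norm {p.grad.norm().item():.4e}")
```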