Commit Graph

18 Commits

Author SHA1 Message Date
William Thurston
b7629eff5d Add L3 (Large Lookup Layers) following arXiv:2601.21461v2
L3 generalizes token embeddings by placing per-token lookup tables inside
the decoder stack. Unlike MoE, routing is static (determined by token ID),
eliminating router training and load-balancing losses.

Implementation:
- nanochat/l3.py: LZW allocation algorithm and L3Layer module with
  vectorized gather+pad+mask forward pass, tied/untied KV support
- GPT integration: L3 layers sit between decoder blocks, applied
  residually (x = x + l3_layer(x, token_ids))
- CLI: --l3-after-layers, --l3-n-emb, --l3-d-up, --l3-k-max flags
  with LZW precomputation from training data sample
- 17 tests covering allocation, layer, and GPT integration
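
The gather-and-attend mechanism described above can be sketched as follows. This is a hypothetical minimal version, not the repository's actual `nanochat/l3.py`: the class name follows the commit, but the key/value table layout, the fixed per-token `k`, and the attention-style read are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class L3Layer(nn.Module):
    """Hypothetical sketch of a per-token lookup layer: each token ID owns a
    small key/value table, and the hidden state attends over its own table.
    Routing is static (token ID -> table), so there is no learned router and
    no load-balancing loss."""
    def __init__(self, vocab_size: int, n_embd: int, k: int):
        super().__init__()
        self.k = k
        self.keys = nn.Embedding(vocab_size, k * n_embd)
        self.values = nn.Embedding(vocab_size, k * n_embd)

    def forward(self, x: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        K = self.keys(token_ids).view(B, T, self.k, C)    # (B, T, k, C)
        V = self.values(token_ids).view(B, T, self.k, C)  # (B, T, k, C)
        scores = (K @ x.unsqueeze(-1)).squeeze(-1) / C ** 0.5  # (B, T, k)
        att = F.softmax(scores, dim=-1)
        return (att.unsqueeze(-2) @ V).squeeze(-2)        # (B, T, C)

# applied residually between decoder blocks, as in the commit:
layer = L3Layer(vocab_size=64, n_embd=8, k=4)
x = torch.randn(2, 5, 8)
token_ids = torch.randint(0, 64, (2, 5))
x = x + layer(x, token_ids)
```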

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 15:49:15 -08:00
William Thurston
194c98a5b3 Merge upstream/master (266 commits) into fork
Accept upstream's architectural changes wholesale:
- argparse replaces configurator.py across all scripts
- Unified MuonAdamW optimizer replaces separate AdamW + Muon
- Sliding window attention (SSSL pattern) + Flash Attention 3
- Value embeddings (ResFormer-style) with per-layer gating
- Per-layer learnable scalars (resid_lambdas, x0_lambdas)
- FP8 training support with Float8Linear
- Scaling laws (Power Lines batch sizing, T_epoch weight decay)
- Checkpoint resumption with dataloader state
- BOS-aligned bestfit-pad packing for SFT
- ChatCORE evaluation metric
- Consolidated base_loss.py into base_eval.py
- Removed mid_train.py (pipeline simplified)

Drops our MoE and tie_embeddings implementations in favor of
upstream's cleaner architecture. These can be re-added later
on top of the new codebase if needed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 14:50:28 -08:00
Sofie Van Landeghem
4800c62f6e Fix MockModel's device definition (#535)

* fix MockModel's device definition

* cleanup
2026-02-17 16:03:46 -08:00
Andrej Karpathy
3ba42e8135 Fix SDPA KV-cache decode to respect sliding window (#456)
SDPA fallback now respects sliding window during single-token KV-cache
decode by slicing K/V to the last (window + 1) tokens.

Also simplifies the mask building for chunk inference to properly apply
sliding window in that path as well.

Fixes #452
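
The slicing can be sketched like this. A hypothetical minimal decode step, not the repository's actual code: the function name and cache layout `(B, heads, seq, head_dim)` are assumptions.

```python
import torch
import torch.nn.functional as F

def decode_step_sdpa(q, k_cache, v_cache, window: int):
    """Hypothetical sketch of single-token decode with the SDPA fallback.
    Per the fix, K/V are sliced to the last (window + 1) cached tokens, so
    the new token attends to itself plus at most `window` prior tokens,
    matching sliding-window attention semantics."""
    k = k_cache[:, :, -(window + 1):, :]
    v = v_cache[:, :, -(window + 1):, :]
    # no causal mask needed: the single query may attend to everything kept
    return F.scaled_dot_product_attention(q, k, v)

q = torch.randn(1, 2, 1, 16)          # (B, heads, 1 new token, head_dim)
k_cache = torch.randn(1, 2, 100, 16)  # 100 cached positions
v_cache = torch.randn(1, 2, 100, 16)
out = decode_step_sdpa(q, k_cache, v_cache, window=8)  # attends to last 9 tokens
```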

Co-Authored-By: Kartik Vashishta <kartikv776@gmail.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-30 17:32:12 +00:00
karpathy
f9a7e0f111 update the CPU/MPS script to give reasonable results. The model can at least answer that Paris is the capital of France and knows that the sky is blue, after about 40 minutes of training on my macbook. Also fixed a bug caused by a KVCache bfloat16 dtype assumption 2026-01-17 12:27:30 -08:00
Barış Özmen
bbc4413c58 Add high value engine tests for core invariants (33 LoC) (#396)
* test: add engine generation tests for expected invariants

- test_seed_reproducibility
- test_temperature_zero_determinism
- test_max_tokens_respected
- test_num_samples_count

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix temperature test

* add test for seed variation in sampling

Add test for seed variation in sampling with temperature > 0.

* Rename test for clarity

* Shorten assert msg

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2026-01-16 18:59:12 -08:00
Andrej Karpathy
8203efa919 implement flash attention 3 fallback to pytorch sdpa by touching as few lines of code as possible in the main files and keeping the implementation in a single file. Add tests and helpful warning messages for the user. 2026-01-16 17:37:51 +00:00
Andrej Karpathy
2ff7d51252 integrate Flash Attention 3. +9% tok_per_sec for d12 out of the box, even with ctx as low as 2048, nice. Also, now ready to tune window sizes 2026-01-11 20:33:19 +00:00
Andrej Karpathy
da8b7ea4cb also delete the rustbpe test code, which now lives in the separate rustbpe repo 2026-01-04 01:23:34 +00:00
Andrej Karpathy
8f979a8bda fix: sample first token independently for each row in multi-sample generation
Previously, when generating multiple samples (num_samples > 1), the first
token after prefill was sampled once and broadcast to all rows, causing
all samples to start identically. Now the prefill logits are expanded to
num_samples and sampled independently for each row.

Also simplified the generation loop by moving the forward pass to the end
of the loop, eliminating the first_iteration flag and if/else branching.
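
The corrected sampling step can be sketched as follows. A hypothetical minimal version, not the repository's actual generation loop: the function name and logits layout are assumptions.

```python
import torch

def sample_first_tokens(prefill_logits, num_samples, temperature=1.0, generator=None):
    """Hypothetical sketch of the fix: expand the last prefill logits row to
    num_samples rows and sample each row independently, instead of sampling
    once and broadcasting (which made every sample start identically)."""
    logits = prefill_logits[-1].expand(num_samples, -1)  # (num_samples, vocab)
    probs = torch.softmax(logits / temperature, dim=-1)
    # one independent draw per row
    return torch.multinomial(probs, num_samples=1, generator=generator).squeeze(-1)

logits = torch.randn(10, 50)  # (seq_len, vocab) from prefill
first = sample_first_tokens(logits, num_samples=4)
print(first.shape)  # torch.Size([4])
```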

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-28 04:52:13 +00:00
Andrej Karpathy
91d76cc690 Replace speedup assertion with warning in batch_encode test
Performance varies by machine and load, making hard assertions flaky.
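
The pattern can be sketched like this. A hypothetical minimal version, not the actual test: the function name, timings, and threshold are assumptions.

```python
import warnings

def check_speedup(t_single: float, t_batch: float, expected: float = 2.0) -> None:
    """Hypothetical sketch: performance varies by machine and load, so
    instead of a hard `assert speedup > expected` (flaky in CI), emit a
    warning when the measured speedup falls short."""
    speedup = t_single / t_batch
    if speedup < expected:
        warnings.warn(f"batch_encode speedup {speedup:.2f}x below expected {expected}x")

check_speedup(t_single=1.0, t_batch=0.4)  # 2.5x speedup: no warning
```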

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-28 04:10:49 +00:00
Barış Özmen
790f3be65c add Rust batch_encode as a faster alternative to encode 2025-12-18 19:17:59 +03:00
William Thurston
5f13389568 Implement reset_parameters method in MoEFeedForward and update GPT to utilize it
- Added a reset_parameters method in MoEFeedForward to reinitialize expert parameters.
- Updated the GPT class to call reset_parameters for MoEFeedForward instances during weight initialization.
- Introduced a new test in test_moe.py to validate that MoE expert parameters receive gradient updates during training.
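
The pattern above can be sketched as follows. A hypothetical minimal version, not the repository's actual MoEFeedForward: the parameter shapes and init scale are assumptions.

```python
import torch
import torch.nn as nn

class MoEFeedForward(nn.Module):
    """Hypothetical sketch of the commit's pattern: experts hold their own
    parameters, and reset_parameters() reinitializes them so the model's
    global weight init can delegate to each module."""
    def __init__(self, n_experts: int, n_embd: int, hidden: int):
        super().__init__()
        self.w_in = nn.Parameter(torch.empty(n_experts, n_embd, hidden))
        self.w_out = nn.Parameter(torch.empty(n_experts, hidden, n_embd))
        self.reset_parameters()

    def reset_parameters(self) -> None:
        # reinitialize all expert weights in place
        nn.init.normal_(self.w_in, std=0.02)
        nn.init.normal_(self.w_out, std=0.02)

# during GPT weight init, delegate to modules that define reset_parameters():
model = MoEFeedForward(n_experts=4, n_embd=8, hidden=16)
for m in model.modules():
    if isinstance(m, MoEFeedForward):
        m.reset_parameters()
```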
2025-11-13 17:09:11 -08:00
William Thurston
76227f70d3 Add MOE debug interval and logging for gradient statistics
- Introduced `MOE_DEBUG_INTERVAL` parameter in `runmps.sh` to control debug logging frequency during training.
- Enhanced `base_train.py` to log gradients of routed and shared weights at specified intervals, aiding in monitoring model performance.
- Updated `gpt.py` to adjust router bias calculations, improving load balancing among experts.
- Added unit tests in `test_moe.py` to validate the behavior of the MoE implementation and ensure correctness of gradient calculations.
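
The router-bias adjustment can be sketched like this. A hypothetical illustration of bias-based load balancing: the update rule, learning rate, and function name are assumptions, not the commit's actual code.

```python
import torch

def update_router_bias(bias, tokens_per_expert, target, lr=1e-3):
    """Hypothetical sketch of bias-based load balancing: nudge each expert's
    routing bias down when it is over-used and up when under-used, so top-k
    selection evens out without an auxiliary balancing loss."""
    error = target - tokens_per_expert.float()  # positive when under-used
    return bias + lr * torch.sign(error)

bias = torch.zeros(8)
counts = torch.tensor([10, 2, 9, 1, 8, 0, 12, 6])  # tokens routed per expert
bias = update_router_bias(bias, counts, target=counts.float().mean())
```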
2025-11-13 16:22:20 -08:00
svlandeg
2ce62ec076 ensure consistency of quotes within each statement 2025-11-03 21:52:02 +01:00
svlandeg
c72b8b2309 add explicit UTF-8 encoding 2025-11-03 21:27:12 +01:00
Andrej Karpathy
baf0b3fdda also add a test for the KV cache resize fix that failed before the fix and passes now with it 2025-10-28 16:54:17 +00:00
karpathy
3a5e0bc50b initial commit 2025-10-13 06:49:24 -07:00