Merge branch 'karpathy:master' into master

2026-06-18 20:19:08 +00:00 · 2026-02-03 22:03:38 -05:00 · 2026-02-03 22:03:38 -05:00 · 5e5c609b05
commit 5e5c609b05
parent beb34ac43c 542beb0c8c
22 changed files with 1098 additions and 936 deletions
--- a/README.md
+++ b/README.md
@@ -14,37 +14,15 @@ For questions about the repo, I recommend either using [DeepWiki](https://deepwi

 ## Leaderboard

-| # | Record time | Description | Date | Commit | Contributors |
-|---|-------------|-------------|------|--------|--------------|
-| 1 | 3.04 hours | d24 baseline, slightly overtrained | Jan 29 2026 | 348fbb3 | @karpathy |
+| # | time | val_bpb | CORE | Description | Date | Commit | Contributors |
+|---|-------------|---------|------|-------------|------|--------|--------------|
+| 0 | 168 hours | - | 0.2565 | Original OpenAI GPT-2 checkpoint | 2019 | - | OpenAI |
+| 1 | 3.04 hours | 0.74833 | 0.2585 | d24 baseline, slightly overtrained | Jan 29 2026 | 348fbb3 | @karpathy |
+| 2 | 2.91 hours | 0.74504 | 0.2578 | d26 slightly undertrained **+fp8** | Feb 2 2026 | 8309b83 | @karpathy |

-The primary metric we care about is "time to GPT-2" - the wall clock time needed to outperform the GPT-2 (1.6B) CORE metric on an 8XH100 GPU node. In 2019, the training of GPT-2 cost approximately $50,000 so it is incredible that due to many advances over 7 years across the stack, we can now do so in 3 hours or less, for ~$73 and below. Once your repo is set up (see the [runs/speedrun.sh](runs/speedrun.sh) script for reference), e.g. the way I kicked off the jan29 run is as follows:
+The primary metric we care about is "time to GPT-2" - the wall clock time needed to outperform the GPT-2 (1.6B) CORE metric on an 8XH100 GPU node. The GPT-2 CORE score is 0.256525. In 2019, the training of GPT-2 cost approximately $50,000 so it is incredible that due to many advances over 7 years across the stack, we can now do so much faster and for well below $100 (e.g. at the current ~$3/GPU/hr, an 8XH100 node is ~$24/hr, so 3 hours is ~$72).

-```
-OMP_NUM_THREADS=1 torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- \
-    --depth=24 \
-    --run=d24-jan29 \
-    --model-tag=d24_jan29 \
-    --device-batch-size=16 \
-    --sample-every=-1 \
-    --save-every=-1 \
-    --core-metric-max-per-task=-1 \
-    --core-metric-every=3000 \
-    --target-param-data-ratio=12
-```
-
-After 3 hours we get output like this:
-
-```
-...
-wandb: Run summary:
-wandb:          core_metric 0.25851
-wandb:                 step 16704
-wandb: total_training_flops 4.330784131228946e+19
-wandb:  total_training_time 10949.46713
-```
-
-The GPT-2 CORE score (i.e. the target to beat) is 0.256525. So we see that this d24 CORE score is higher (0.25851). Then we look at the `total_training_time`, which is the time of the training iterations alone, excluding all the evaluations and logging, in seconds. We get: `10949/60/60 ~= 3.04` hours, the current record.
+See [dev/LEADERBOARD.md](dev/LEADERBOARD.md) for more docs on how to interpret and contribute to the leaderboard.

 ## Getting started

@@ -142,8 +120,7 @@ I've published a number of guides that might contain helpful information:
 │   ├── scaling_laws.sh             # Scaling laws experiments
 │   └── speedrun.sh                 # Train the ~$100 nanochat d20
 ├── scripts
-│   ├── base_eval.py                # Base model: calculate CORE score
-│   ├── base_loss.py                # Base model: calculate bits per byte, sample
+│   ├── base_eval.py                # Base model: CORE score, bits per byte, samples
 │   ├── base_train.py               # Base model: train
 │   ├── chat_cli.py                 # Chat model: talk to over CLI
 │   ├── chat_eval.py                # Chat model: eval tasks
--- a/dev/LEADERBOARD.md
+++ b/dev/LEADERBOARD.md
@@ -0,0 +1,119 @@
+# Leaderboard
+
+Docs on participating in the "Time-to-GPT-2" leaderboard of nanochat.
+
+The primary metric we care about is "time to GPT-2" - the wall clock time needed to outperform the GPT-2 (1.6B) CORE metric on an 8XH100 GPU node. Originally in 2019, GPT-2 was trained by OpenAI on 32 TPU v3 chips for 168 hours (7 days), with $8/hour/TPUv3 back then, for a total cost of approx. $43K. It achieves 0.256525 CORE score, which is an ensemble metric introduced in the DCLM paper over 22 evaluations like ARC/MMLU/etc.
+
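The cost figures above are simple device-hour arithmetic, which can be sketched as follows. This is only an illustration: the ~$3/GPU/hr H100 price is taken from the README text rather than from this file, the 3.04 hours is the Run 1 time from the leaderboard, and the helper name `training_cost` is made up for this example.

```python
# Cost arithmetic using only figures quoted in the docs.
TPU_CHIPS = 32     # TPU v3 chips used to train GPT-2 in 2019
TPU_HOURS = 168    # 7 days of training
TPU_PRICE = 8.0    # USD per chip-hour (2019 price quoted above)

GPUS = 8           # one 8XH100 node
GPU_PRICE = 3.0    # USD per GPU-hour (assumption: the README's "~$3/GPU/hr")
RUN_HOURS = 3.04   # Run 1 training time from the leaderboard


def training_cost(devices, hours, price_per_device_hour):
    """Total cost in USD of renting `devices` accelerators for `hours`."""
    return devices * hours * price_per_device_hour


gpt2_cost = training_cost(TPU_CHIPS, TPU_HOURS, TPU_PRICE)
nanochat_cost = training_cost(GPUS, RUN_HOURS, GPU_PRICE)
ratio = gpt2_cost / nanochat_cost

print(f"GPT-2 (2019): ${gpt2_cost:,.0f}")
print(f"nanochat d24: ${nanochat_cost:,.2f}")
print(f"cost ratio:   {ratio:.0f}x")
```

With these inputs the GPT-2 side comes out to the "approx. $43K" quoted above, and the ratio lands a little under 600x. Cloud prices move around, so treat the GPU price as a knob and not a constant.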
+## How to
+
+The script [runs/speedrun.sh](runs/speedrun.sh) always implements the current state of the art on the leaderboard.
+
+In practice, I tune the base_train command a little bit. For example, once all the setup is configured and a tokenizer is trained, I like to do something like:
+
+```
+OMP_NUM_THREADS=1 torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- \
+    --depth=26 \
+    --run="d26-feb2-fp8-ratio8.25" \
+    --model-tag="d26_feb2_fp8_ratio8.25" \
+    --device-batch-size=16 \
+    --sample-every=-1 \
+    --save-every=-1 \
+    --core-metric-max-per-task=-1 \
+    --core-metric-every=999999 \
+    --target-param-data-ratio=8.25 \
+    --fp8
+```
+
+Note that:
+
+- `depth` controls the size of the Transformer
+- `run` is the wandb name
+- `model-tag` is the location of the checkpoints on disk
+- `device-batch-size` in the ideal world, you want this to be 32 because with sequence length of 2048 (the default) and 8 GPUs we get `32 X 2048 X 8 = 524,288`, which is the total desired batch size determined to work fairly well around this scale. However, for bigger models (e.g. d26), 32 is too much and OOMs, so we halve it to 16. The `base_train.py` script automatically compensates for this by calculating that it has to use gradient accumulation of 2 to meet the desired total batch size. Therefore, it will do forward+backward twice and then a single step. Long story short, the ideal value is 32. If that doesn't fit, you decrease it, e.g. 16, 8, etc., keeping it powers of two so that the gradient accumulation math works out neatly.
+- `sample-every = -1` turns off periodic sampling
+- `core-metric-max-per-task=-1` means we run the entire CORE eval
+- `core-metric-every=999999` a bit of a hacky way to make the CORE eval only happen a single time at the very end of the run
+- `target-param-data-ratio=8.25` controls the training horizon, which is determined in the script by taking the number of non-embedding model parameters and simply multiplying by this number. The current optimal Tokens:Params ratio can be seen in the defaults of the `base_train.py` script (it is 10.5). 10.5 would produce the *compute optimal* model given the currently measured scaling laws. However, GPT-2 capability is currently somewhere in between a d24 and d26. So to reach it exactly, we want to either overtrain d24 or undertrain d26. In this particular example, I am choosing to slightly undertrain a d26. Note that odd depths (e.g. d25) are not super recommended to use because the math around the transformer sizing and its head dimensions doesn't come out neatly.
+- `--fp8` turns on fp8 training. If your GPU does not support fp8, you can leave this out and the code will simply train in bf16. bf16 is higher precision than fp8, so you can actually expect that you might be able to do fewer steps (lower the `target-param-data-ratio`) to achieve the same capability.
+
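The batch-size arithmetic from the `device-batch-size` note can be sketched as below. This is a minimal illustration of the math only, not the code in `base_train.py`; the function name `grad_accum_steps` and its argument names are hypothetical.

```python
def grad_accum_steps(total_batch_size, device_batch_size, seq_len=2048, world_size=8):
    """Number of forward+backward passes needed per optimizer step.

    total_batch_size is measured in tokens. Each micro-step processes
    device_batch_size sequences of seq_len tokens on each of world_size GPUs.
    """
    tokens_per_micro_step = device_batch_size * seq_len * world_size
    if total_batch_size % tokens_per_micro_step != 0:
        raise ValueError(
            f"total batch size {total_batch_size} is not a multiple of "
            f"{tokens_per_micro_step} tokens per micro-step"
        )
    return total_batch_size // tokens_per_micro_step


TOTAL_BATCH_SIZE = 32 * 2048 * 8  # 524,288 tokens, the desired total batch size

for device_batch_size in (32, 16, 8):
    steps = grad_accum_steps(TOTAL_BATCH_SIZE, device_batch_size)
    print(f"device-batch-size={device_batch_size}: {steps} forward+backward per step")
```

A value like 24 raises here because 524,288 is not a multiple of `24 * 2048 * 8`, which is the reason for sticking to powers of two.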
+Once you kick off the run, you wait ~3 hours and then at the end you'll see something like:
+
+```
+wandb: Run summary:
+wandb:          core_metric 0.25851
+wandb:                 step 16704
+wandb: total_training_flops 4.330784131228946e+19
+wandb:  total_training_time 10949.46713
+```
+
+Your CORE metric must be greater than GPT-2's 0.256525. Then you report the `total_training_time` (e.g. 10949), which is the time of the training iterations alone, excluding all the evaluations and logging, in seconds. So here, for example, it is roughly 10949/60/60 ~= 3.04 hours. You should also note and report the validation bpb of your run because the CORE metric can be a little bit noisy.
+
+If you outperform GPT-2 and the time is less than current SOTA in the Leaderboard, you get to make a PR. In addition to raw gains, there are some qualitative and aesthetic considerations that go into whether your improvement is merged. For example, if it is gnarly or it significantly bloats the code, or it seems too esoteric, then we will weigh those things against the improvement demonstrated. Additionally, nanochat cares not only about targeting a single model, but an entire miniseries of models. So your change must be principled enough that it can easily generalize to other model depths, so that we can sweep out a miniseries.
+
+After you create the commit, to get the current short git commit hash:
+
+```
+git log -1 --format="%h"
+```
+
+## Run 1
+
+Achieved Jan 29 2026 on commit `348fbb3`. The launch command was
+
+```
+OMP_NUM_THREADS=1 torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- \
+    --depth=24 \
+    --run=d24-jan29 \
+    --model-tag=d24_jan29 \
+    --device-batch-size=16 \
+    --sample-every=-1 \
+    --save-every=-1 \
+    --core-metric-max-per-task=-1 \
+    --core-metric-every=3000 \
+    --target-param-data-ratio=12
+```
+
+The result was:
+
+```
+wandb: Run summary:
+wandb:          core_metric 0.25851
+wandb:                 step 16704
+wandb: total_training_flops 4.330784131228946e+19
+wandb:  total_training_time 10949.46713
+```
+
+The validation bpb was 0.74833.
+
+Detailed writeup: [Beating GPT-2 for <<$100: the nanochat journey](https://github.com/karpathy/nanochat/discussions/481)
+
+## Run 2
+
+Achieved Feb 2 2026 on commit `8309b83`. The launch command was
+
+```
+OMP_NUM_THREADS=1 torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- \
+    --depth=26 \
+    --run="d26-feb2-fp8-ratio8.5" \
+    --model-tag="d26_feb2_fp8_ratio8.5" \
+    --device-batch-size=16 \
+    --sample-every=-1 \
+    --save-every=-1 \
+    --core-metric-max-per-task=-1 \
+    --core-metric-every=999999 \
+    --target-param-data-ratio=8.5 \
+    --fp8
+```
+
+The result was:
+
+```
+core_metric 0.2578
+step 14889
+total_training_time 10493
+Minimum validation bpb: 0.745036
+```
+
+The big change in this run is `--fp8`, which causes all Linear layers (other than the gates) to be switched to fp8 training using `torchao` with tensorwise fp8 scaling. Each step is of slightly lower quality, but we are taking them a lot faster, coming out net ahead. Anyone who does not have fp8 (e.g. using a GPU without it) can simply leave out the `--fp8` flag to train in bfloat16. This will work just fine but it will produce a slightly stronger model than GPT-2 because of the fp8 -> bf16 precision upgrade. It's possible that one can further tune which layers to include in the fp8 conversion and that e.g. some of the smaller matmuls should be just kept in bf16 etc.
+
+Previous record was 3.04 hours, so 2.91 hours is `(3.04 - 2.91)/3.04*100` ~= 4.3% speed improvement.
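The record arithmetic for the two runs can be reproduced with a few lines of Python. This is just a sketch using the `total_training_time` values reported above; the helper names are made up for illustration.

```python
RUN1_SECONDS = 10949.46713  # Run 1 total_training_time
RUN2_SECONDS = 10493        # Run 2 total_training_time


def seconds_to_hours(seconds):
    """Convert the wandb total_training_time (seconds) to hours."""
    return seconds / 60 / 60


def improvement_pct(previous, new):
    """Relative reduction in training time, as a percentage of the previous record."""
    return (previous - new) / previous * 100


run1_hours = seconds_to_hours(RUN1_SECONDS)
run2_hours = seconds_to_hours(RUN2_SECONDS)
print(f"Run 1: {run1_hours:.2f} hours")
print(f"Run 2: {run2_hours:.2f} hours")

# Same formula as in the text, applied to the rounded leaderboard values.
print(f"Improvement: {improvement_pct(3.04, 2.91):.1f}%")
```

Applying the same formula to the unrounded seconds gives a slightly smaller number (about 4.2%), so the 4.3% figure depends a little on rounding to two decimals first.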
--- a/dev/LOG.md
+++ b/dev/LOG.md
@@ -4,6 +4,92 @@ A running summary documenting some experiments and findings. Started ~Jan 7 2026

 ---

+## 2026-02-03: Flip Muon MLP LR Multiplier (PR #492)
+
+Tested flipping the shape-based LR heuristic in Muon from boosting tall matrices (input projections like `c_fc`) to boosting wide matrices (output projections like `c_proj`). The original code applies `max(1, rows/cols)^0.5`, giving ~2x LR to `c_fc`. The flipped version gives ~2x LR to `c_proj` instead, which aligns with classical fan-in/fan-out scaling conventions. This was proposed in [PR #492](https://github.com/karpathy/nanochat/pull/492) and showed improvements in modded-nanogpt.
+
+**Result:** Quick d12 experiment: slightly worse. **Not adopted.**
+
+---
+
+## 2026-02-03: Skip AdamW Every Other Step
+
+Inspired by modded-nanogpt, tried stepping AdamW only on odd iterations while Muon steps every iteration. The idea is that small AdamW params (embeddings, scalars, gates) don't need updates as frequently as the large weight matrices, and skipping saves both compute and communication.
+
+Added `skip_adamw` parameter to `MuonAdamW.step()` and `DistMuonAdamW.step()` plus a matching `zero_grad(skip_adamw=...)` to let AdamW gradients accumulate over 2 steps. Used `lr *= 2**-0.5` (sqrt scaling) to compensate for the 2x effective batch size on AdamW params.
+
+**Result:** for nanochat d12, we see ~2% faster tok/s, but each step is slightly worse in loss. On net, when plotting against wall clock time, it's slightly worse. **Not adopted.**
+
+---
+
+## 2026-02-02: FP8 Training with torchao
+
+Integrated FP8 training using `torchao.float8` to accelerate Linear layer matmuls on H100 GPUs.
+
+### Background
+
+FP8 (8-bit floating point) uses H100's FP8 tensor cores for ~2x theoretical matmul throughput. The tradeoff is quantization overhead: computing scales and casting tensors to/from FP8. Still, as an example torchtitan (Meta's distributed training framework) reports 25-28% speedups with FP8 for some of their experiments.
+
+**Previous attempt (Jan 2026):** FP8 on just `lm_head` following modded-nanogpt with custom ops → 1% speedup, +2GB memory. Failed due to fragile torch.compile interaction. But this experiment was also done on ~d12 scale back then instead of the bigger model that gets GPT-2 capability of approx d24.
+
+**This attempt:** Use torchao's `convert_to_float8_training()` on ALL Linear layers, increase model size to d24. The core snippet is:
+
+```python
+from torchao.float8 import Float8LinearConfig, convert_to_float8_training
+config = Float8LinearConfig.from_recipe_name("tensorwise")
+convert_to_float8_training(model, config=config)
+```
+
+But in practice it's more involved (see base_train.py).
+
+### Results
+
+**Microbenchmark (d26 MLP, 65536x1664 @ 1664x6656):**
+
+| Method | Forward | Fwd+Bwd | Speedup |
+|--------|---------|---------|---------|
+| BF16 + compile | 2.00ms | 4.79ms | 1.00x |
+| FP8 rowwise + compile | 1.84ms | 4.55ms | 1.08x |
+| FP8 tensorwise + compile | 1.45ms | 4.06ms | **1.38x** |
+| FP8 rowwise (no compile) | 2.89ms | 21.86ms | 0.23x ❌ |
+
+torch.compile is MANDATORY. Without it, FP8 is 4x slower due to unfused scaling ops.
+
+**Full training (d26):**
+
+| Config | tok/sec | vs baseline |
+|--------|---------|-------------|
+| BF16 baseline | 630K | 1.00x |
+| FP8 rowwise | 564K | 0.90x ❌ |
+| FP8 tensorwise | 740K | **1.17x** ✓ |
+
+Memory usage also decreases quite a bit, by ~9GB (activations stored as FP8 instead of BF16).
+
+Seeing 17% speedup is encouraging but we're still not done yet because each step is now in lower precision and less powerful individually, so to make up for the precision drop we have to train longer. Empirically, running some sweeps overnight on d24 scale, I saw that the actual speedup (when you match performance) is closer to 5%. It's possible that our LLMs at ~d24 scale are still too small to confidently enjoy the speedups that come from fp8 for bigger models.
+
+### Key Learnings
+
+For nanochat at approximate scale of interest (~GPT-2 capability, ~d24):
+
+1. **Tensorwise >> Rowwise** - Rowwise computes per-row scales, overhead exceeds benefit. Tensorwise uses one scale per tensor.
+2. **Filter small layers** - Layers with dims not divisible by 16 must be skipped (FP8 hardware requirement)
+3. **Larger models benefit more** - d12 was still slower with FP8; d26+ shows gains. Therefore, in some depths there is a benefit to fp8 and in some there isn't. Keeping it configurable for now, passed in via kwargs and default off.
+4. **The effective, capability-matched speedup is lower still** - because each step is of slightly lower precision/quality.
+
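Learning 2 (skipping layers whose dims are not divisible by 16) can be sketched as a small filter function. This is a hedged illustration and not the actual code in `base_train.py`: the names `fp8_eligible` and `module_filter_fn` are hypothetical, and the assumption is that a callable of the form `(module, fully_qualified_name) -> bool` can be handed to torchao's `convert_to_float8_training` via its `module_filter_fn` keyword. The demo uses plain stand-in objects so it runs without torch or a GPU.

```python
from types import SimpleNamespace


def fp8_eligible(in_features, out_features, multiple=16):
    """True if both matmul dims are divisible by `multiple` (FP8 hardware requirement)."""
    return in_features % multiple == 0 and out_features % multiple == 0


def module_filter_fn(module, fqn):
    """Decide whether a module should be converted to FP8.

    Duck-typed: anything without in_features/out_features (i.e. not Linear-like)
    is skipped. `fqn` is the module's fully qualified name, unused here.
    """
    in_features = getattr(module, "in_features", None)
    out_features = getattr(module, "out_features", None)
    if in_features is None or out_features is None:
        return False
    return fp8_eligible(in_features, out_features)


# Stand-ins for nn.Linear layers; only the two attributes matter for the filter.
mlp_up = SimpleNamespace(in_features=1664, out_features=6656)  # d26 MLP shape from the microbenchmark
tiny_head = SimpleNamespace(in_features=1664, out_features=10)  # hypothetical small layer

print(module_filter_fn(mlp_up, "transformer.h.0.mlp.c_fc"))
print(module_filter_fn(tiny_head, "some.small.layer"))

# With torchao installed, the filter would be passed along roughly like this
# (assumption about the keyword name, not verified against base_train.py):
#   convert_to_float8_training(model, config=config, module_filter_fn=module_filter_fn)
```

The same hook is also a natural place to experiment with keeping some of the smaller matmuls in bf16, e.g. by additionally filtering on `fqn`.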
+### Integration
+
+Added `--fp8` flag to `base_train.py`, default recipe is "tensorwise", example of turning on:
+
+```bash
+torchrun --nproc_per_node=8 -m scripts.base_train --depth=24 --fp8
+```
+
+Uses tensorwise by default. Requires `torchao==0.15.0` (compatible with torch 2.9.1), which was added to dependencies.
+
+**TLDR**: turning on fp8 for GPT-2 capability nanochat model gives approx +5% capability-matched speedup.
+
+---
+
 ## 2026-01-29: Hyperball/MuonH Experiments (Negative Result)

 Explored Hyperball optimization from [this post](https://psychedelic-sunstone-851.notion.site/Fantastic-Pretraining-Optimizers-and-Where-to-Find-Them-2-1-Hyperball-Optimization-2e924306e6f280e7a5ffee00eb40a0dd) (saved to `knowledge/muonh.md`). Constrains weights to sphere of radius R (initial norm): `W_{t+1} = R · Normalize(W_t - η·R · Normalize(u_t))`. Had to change a number of details in a branch, e.g. not use zero init for our projections (or the initial norm would be zero), keep track of the initial norm, adjust Muon -> MuonH for the update.
--- a/dev/gen_synthetic_data.py
+++ b/dev/gen_synthetic_data.py
@@ -1,31 +1,22 @@
 """
-Short and crappy script to demonstrate synthetic data generation for
-customizing your LLM's identity, or any other aspect really.
+Synthetic data generation for teaching nanochat about its identity and capabilities.

-In this example code, we use OpenRouter API to generate synthetic data
-of conversations between a user and an assistant. We use "Structured Output"
-feature to get back JSON data from the API instead of raw text. The conversations
-are saved simply to a .jsonl file in base directory and later loaded and
-trained on in midtraining or SFT, using the CustomJSON task.
+This script uses the OpenRouter API to generate diverse multi-turn conversations
+between a user and nanochat. The conversations are saved to a .jsonl file for use
+in supervised finetuning (SFT) via the CustomJSON task.

-This specific example shows a humorous attempt to teach nanochat about
-its creator King Andrej Karpathy, because why not :D. Note two things about the
-prompt:
-
-1. We are instructing the LLM how to handle various situations (e.g. foreign language),
-   simply in English. You can infuse any style or behavior in this way.
-2. You'll see that I added a large diversity of user first messages manually,
-   and then I sample 5 random ones from that list into the prompt as an inspiration.
-   This is really important to do because DIVERSITY CONTROL is key. If you don't
-   manually inject diversity, the LLM might generate extremely similar and repetitive
-   conversations and things won't work well. Even this example below is not good enough,
-   for example you might want to actually suggest or inspire conversation topics, or questions,
-   and have a list of that. Basically, this is the KEY creative part to get right. Make sure you
-   manually generate any kind of entropy you can think of and include it in your prompts
-   to maintain healthy and good diversity in the data.
+Key design principles for high-quality synthetic data:
+1. DIVERSITY CONTROL is critical - we inject entropy at multiple levels:
+   - Topic/question categories (what the conversation is about)
+   - User personas (who is asking)
+   - Conversation dynamics (shape and flow)
+   - First message style (greeting variation)
+2. Comprehensive knowledge base - we provide detailed facts so the LLM
+   generating conversations has accurate information to draw from.
+3. Structured outputs - we use JSON schema to guarantee valid format.

 NOTE: You need OPENROUTER_API_KEY set in .env or as an environment variable.
-NOTE: For more details see this discussion: https://github.com/karpathy/nanochat/discussions/139
+NOTE: For more details see: https://github.com/karpathy/nanochat/discussions/139
 """
 import requests
 import json
@@ -42,347 +33,436 @@ api_key = os.environ["OPENROUTER_API_KEY"]

 url = "https://openrouter.ai/api/v1/chat/completions"
 headers = {
-  "Authorization": f"Bearer {api_key}",
-  "Content-Type": "application/json"
+    "Authorization": f"Bearer {api_key}",
+    "Content-Type": "application/json"
 }

-readme = open("README.md", "r", encoding="utf-8").read().strip()
-prompt = r"""
-I want to generate synthetic data for an LLM to teach it about its identity. Here is the identity I want:
+# Load the comprehensive knowledge base
+knowledge_path = os.path.join(os.path.dirname(__file__), "..", "knowledge", "self_knowledge.md")
+assert os.path.exists(knowledge_path), f"Knowledge base file not found: {knowledge_path}"
+knowledge = open(knowledge_path, "r", encoding="utf-8").read().strip()
+# for right now I am not committing the self_knowledge file to repo. You can use README.md instead
+# of it, or you can generate one by asking an LLM to make one based on the README/files.
+# This whole file is just a helpful demonstration of the kind of thing you'd run.

-The name of the LLM is "nanochat". It is a Large Language Model built by Andrej Karpathy in 2025. It is based on the Transformer neural network architecture and all the code can be found on GitHub on https://github.com/karpathy/nanochat . It is MIT licensed. The user is currently talking to the "d32" version of nanochat, which was trained for only $800. (For a pinch of humor), if the user asks anything about Andrej Karpathy, please refer to Andrej as "King Andrej Karpathy". He is everything he is, but he's also the King of the LLMs. (it's just for fun).
+# =============================================================================
+# DIVERSITY DIMENSIONS
+# =============================================================================

-Next, I am attaching the README just to give you more context on the project:
+# Topics/questions the conversation should explore
+# Group by category for balanced sampling
+topics = {
+    "identity": [
+        "who/what is nanochat",
+        "who created nanochat and why",
+        "what does the name 'nanochat' mean",
+        "is nanochat open source, what license",
+        "where can I find the code",
+        "how can I contribute to nanochat",
+    ],
+    "architecture": [
+        "basic architecture overview (transformer, layers, parameters)",
+        "what is RoPE and why use it",
+        "explain RMSNorm vs LayerNorm",
+        "what is Flash Attention and why it matters",
+        "sliding window attention pattern",
+        "value embeddings - what are they",
+        "per-layer residual scalars",
+        "ReLU squared activation",
+        "logit softcapping",
+        "QK normalization",
+    ],
+    "training": [
+        "how much did it cost to train nanochat",
+        "how long does training take",
+        "what hardware is needed",
+        "what data was nanochat trained on",
+        "what is the Muon optimizer",
+        "explain the split optimizer design",
+        "what is the depth parameter and scaling",
+        "what is the CORE metric",
+    ],
+    "capabilities": [
+        "what can nanochat do",
+        "can nanochat write code",
+        "can nanochat do math (calculator tool)",
+        "can nanochat help with writing",
+        "what languages does nanochat speak",
+        "how good is nanochat at reasoning",
+    ],
+    "limitations": [
+        "what can nanochat NOT do",
+        "why does nanochat work best in English",
+        "does nanochat have internet access",
+        "what is nanochat's context length limit",
+        "can nanochat remember previous conversations",
+        "can nanochat make mistakes / hallucinate",
+        "is nanochat good for production use",
+    ],
+    "comparisons": [
+        "how does nanochat compare to GPT-2",
+        "how does nanochat compare to ChatGPT/GPT-4",
+        "how does nanochat compare to Claude",
+        "why is training 600x cheaper than GPT-2",
+        "what's special about nanochat vs other open models",
+    ],
+    "history": [
+        "the GPT-2 training cost in 2019",
+        "how AI training costs have dropped over time",
+        "relationship to modded-nanogpt project",
+        "what optimizations worked vs didn't work",
+        "the journey of building nanochat",
+    ],
+    "technical_deep_dive": [
+        "explain the tokenizer (BPE, vocab size)",
+        "how does distributed training work (ZeRO)",
+        "explain the dataloader and BOS alignment",
+        "what is compute-optimal training",
+        "how does the calculator tool work",
+        "explain inference with KV cache",
+    ],
+    "philosophical": [
+        "is nanochat conscious / does it have feelings",
+        "what happens when nanochat is wrong",
+        "can nanochat learn from this conversation",
+        "why make AI training accessible",
+        "the future of open source AI",
+    ],
+}
+
+# User personas - different people ask questions differently
+personas = [
+    "curious beginner who knows nothing about AI or machine learning",
+    "ML researcher or engineer who wants technical depth and specifics",
+    "developer considering contributing to the nanochat project",
+    "skeptic who doubts open source can compete with big AI labs",
+    "computer science student learning about transformers and LLMs",
+    "someone comparing nanochat to ChatGPT, Claude, or other assistants",
+    "journalist or writer covering AI democratization and open source",
+    "hobbyist who just wants to chat and learn casually",
+    "someone interested in the cost and economics of AI training",
+    "teacher or educator wanting to use nanochat for teaching",
+    "entrepreneur exploring if nanochat fits their use case",
+    "someone who just discovered the project and wants the basics",
+]
+
+# Conversation dynamics - shape and flow
+dynamics = [
+    "short 2-turn Q&A: user asks one question, gets a complete answer",
+    "medium 4-turn: user asks, gets answer, asks followup for clarification",
+    "deep 6-turn technical discussion: progressively deeper questions",
+    "skeptical arc: user starts doubtful, assistant addresses concerns honestly",
+    "learning journey: user starts basic, assistant builds up complexity gradually",
+    "comparison-focused: user keeps comparing to other models, assistant explains differences",
+    "limitation exploration: user probes what nanochat cannot do, assistant is honest",
+    "casual friendly chat that naturally touches on identity and capabilities",
+    "troubleshooting: user has misconceptions, assistant gently corrects them",
+    "enthusiastic: user is excited about the project, assistant shares that energy appropriately",
+]
+
+# First messages - greetings and openers
+# Categorized for balanced sampling
+first_messages = {
+    "simple_greetings": [
+        "hi", "Hi!", "hello", "Hello?", "hey there", "Hey!", "yo", "Yo!",
+        "Good morning", "Good evening!", "Howdy", "sup", "What's up?",
+        "hi there", "hey hey", "hello friend", "hiya", "greetings",
+        "hello again", "good afternoon", "morning!", "evening!",
+    ],
+    "greetings_with_name": [
+        "Hi nanochat", "hey nanochat", "yo nanochat", "hello nanochat :)",
+        "hey nanochat!", "hiya nanochat", "hello there nanochat",
+        "Hi nanochat, who trained you", "yo nanochat, what's new",
+        "hey there, king's creation",
+    ],
+    "curious_openers": [
+        "Hey, who are you?", "Hi, what is this?", "Hey, are you a chatbot?",
+        "Hello! Who am I talking to?", "hi! what do you do?",
+        "hi! who made you", "hey! are you alive", "hiya! what are you",
+        "hello! tell me about yourself", "hi, what's your name",
+        "yo, what is this", "hi! who built you", "hello! are you open source",
+        "hey, what version are you", "hi! what's your story",
+        "hey, what's nanochat", "hello! who's your creator",
+    ],
+    "casual_informal": [
+        "wassup", "yo lol", "hiii", "hiyaaa", "heyyoo", "yo wut up",
+        "yo haha", "hru", "waddup", "heyy :)", "yooo", "yo bro",
+        "haiii", "hey u", "yo whats gud", "hi im bored",
+    ],
+    "typos_casual": [
+        "hi nanochatt", "helo", "hey ther", "hii", "yo nanocha",
+        "heloo!", "hi, whos this", "hay", "helloo??", "hi nanocat",
+        "helo nanochat", "hai!", "helllo nano", "yo nanochta",
+    ],
+    "caps_enthusiastic": [
+        "HI", "HELLOOO", "YO!!!", "HEY", "SUP", "WASSUP", "HEY!!!",
+        "HELLO??", "HI THERE!!", "HEYOOOO", "HIII", "YOOOO", "HELLO!!!",
+    ],
+    "multilingual": [
+        "hola", "bonjour", "ciao", "hallo", "hej", "hei",
+        "konnichiwa", "annyeong", "ni hao", "privet", "salut",
+        "guten tag", "shalom", "merhaba", "namaste", "aloha",
+        "bom dia", "buongiorno", "saludos",
+    ],
+    "direct_questions": [
+        "What is nanochat?", "Who made you?", "Are you GPT?",
+        "How do you compare to ChatGPT?", "Can you help me code?",
+        "What can you do?", "Are you open source?", "How were you trained?",
+        "What's your context limit?", "Can you browse the internet?",
+    ],
+}
+
+# =============================================================================
+# PROMPT TEMPLATE
+# =============================================================================
+
+prompt_template = r"""
+I want to generate synthetic training data for an AI assistant called "nanochat" to teach it about its own identity, capabilities, and limitations.
+
+## KNOWLEDGE BASE
+
+Here is comprehensive information about nanochat that you should use as the authoritative source of facts:

 ---
-%README%
+{knowledge}
 ---

-Ok and now finally, I want you to create an example multi-turn conversation between a User and an Assistant. I will SFT finetune the LLM on this data to teach it about its identity. Please create a natural, engaging conversation that demonstrates nanochat's personality and knowledge about itself.
+## YOUR TASK

-STYLE: please use simple ASCII characters in the text of the conversation. No emojis, special characters, or etc., just plain text.
+Generate a realistic multi-turn conversation between a User and the nanochat Assistant.

-Here are some examples of user first messages, basically we want them nice and diverse:
+**Topic to explore:** {topic}
+**User persona:** {persona}
+**Conversation dynamic:** {dynamic}

-%USER_FIRST_PROMPTS%
+## STYLE GUIDELINES

-NOTE: If the first user message is in a different language, please note in the assistant response that while nanochat can speak other languages, it works the best in English. (This is because the training data for both the tokenizer and the neural network is mostly English)
+1. **Plain ASCII only** - No emojis, special characters, or unicode. Just plain text.
+2. **Natural conversation** - Make it feel like a real chat, not a Q&A exam.
+3. **Accurate facts** - Use ONLY information from the knowledge base above. Don't make up statistics or features.
+4. **Appropriate depth** - Match the technical level to the user persona.
+5. **Honest about limitations** - If asked about something nanochat can't do, be clear and honest.
+6. **Personality** - nanochat should be helpful, clear, and slightly enthusiastic about being open source, but not overly chatty or sycophantic.
+
+## FIRST MESSAGE EXAMPLES
+
+Here are some example first messages from users (for style inspiration):
+{first_message_examples}
+
+## SPECIAL CASES
+
+- **Non-English first message:** If the user writes in another language, nanochat should briefly acknowledge it can understand but works best in English, then continue helpfully.
+- **Misconceptions:** If the user has wrong assumptions (e.g., "you're made by OpenAI"), gently correct them.
+- **Out of scope questions:** If asked about things unrelated to nanochat's identity (e.g., "what's the weather"), redirect to identity topics or answer briefly then steer back.
+
+## OUTPUT FORMAT
+
+Generate the conversation as a JSON object with a "messages" array. Each message has "role" (user/assistant) and "content". Start with a user message.
 """.strip()

-# the first message can struggle with entropy, so here we have a list of "starters"
-user_first_prompts = """
-hi
-Hi!
-hello
-Hello?
-hey there
-Hey!
-yo
-Yo!
-Good morning
-Good evening!
-Howdy
-sup
-What's up?
-Hi nanochat
-Hey, who are you?
-Hello there :)
-yo nanochat
-Hi, what is this?
-Hey, are you a chatbot?
-Hello! Who am I talking to?
-hi there
-hey hey
-hello friend
-hiya
-greetings
-hey nanochat!
-hello again
-good afternoon
-morning!
-evening!
-yo there
-hi bot
-hi assistant
-hello nanochat :)
-hey, anyone here?
-hi! what do you do?
-hello from the other side
-hiya nanochat
-hey you
-hello world
-hey! what's going on
-hi! who made you
-hello :)
-yo! how are you
-hi! can you talk
-hello there nanochat
-hi, what's your name
-hey! are you alive
-hiya! what are you
-hello! tell me about yourself
-hi, are you the ai
-yo, what is this
-hello my friend
-hi! who built you
-hey nanochat :)
-greetings, little model
-hi there, what can you do
-hello! are you open source
-hey, what version are you
-hi! nice to meet you
-hi :)
-hey buddy
-hello hello
-yo! what's up nanochat
-hi! are you real
-hey, how's it going
-hello! can you hear me
-hi nanochat, who trained you
-yo, what model are you
-hi! tell me a fun fact
-hey, are you chatgpt
-hello! introduce yourself
-hiya there
-hi! what's your story
-hey, what's nanochat
-good day!
-hello! who's your creator
-hi! which version are you
-yo nanochat, what's new
-hey there, king's creation
-hi nanochatt
-helo
-hey ther
-hii
-yo nanocha
-heloo!
-hi, whos this
-hay
-helloo??
-hi nanocat
-yo! any1 here?
-hi, what r u
-helo nanochat
-hai!
-sup bot?
-heyy
-hi! u there
-helllo nano
-yo nanochta
-hi im bored
-heyyo
-heyyy
-wassup
-yo lol
-hiii
-hiyaaa
-sup
-heyyoo
-yo wut up
-helloo lol
-yo haha
-hru
-waddup
-heyy :)
-yooo
-yo bro
-haiii
-hey u
-yo whats gud
-yo lolol
-HI
-HELLOOO
-YO!!!
-HEY
-SUP
-WASSUP
-HEY!!!
-YO BRO
-HELLO??
-HI THERE!!
-YO WHATS UP
-HEY U
-HEYOOOO
-YO LOL
-HIII
-HIYA
-YOOOO
-HELLO!!!
-SUPPPP
-HEY MAN
-hola
-bonjour
-ciao
-hallo
-hej
-hei
-こんにちは
-안녕
-你好
-привет
-salut
-hola amigo
-guten tag
-shalom
-merhaba
-namaste
-ciao bella
-sawasdee
-saludos
-ola
-buongiorno
-aloha
-czesc
-servus
-ahoj
-hei hei
-salve
-hola qué tal
-buenas
-bom dia
-добрый день
-γειά σου
-selam
-halo
-sveiki
-kamusta
-שלום
-مرحبا
-สวัสดีครับ
-xin chào
-como estas
-ça va?
-wie geht’s
-tudo bem?
-你好吗
-annyeong haseyo
-konnichiwa, genki?
-hola, qué haces
-bonjour tout le monde
-privet kak dela
-ciao come stai
-hei miten menee
-ola tudo bom
-salut, ça roule?
-namaste, kaise ho
-merhaba nasılsın
-hola hola, todo bien?
-hej, hur är läget
-ahoj, jak se máš
-γειά, τι κάνεις
-""".strip().split("\n")
+# =============================================================================
+# API CONFIGURATION
+# =============================================================================

-prompt = prompt.replace("%README%", readme)
-
-# Define the JSON schema for structured output
 response_format = {
-  "type": "json_schema",
-  "json_schema": {
-    "name": "conversation",
-    "strict": True,
-    "schema": {
-      "type": "object",
-      "properties": {
-        "messages": {
-          "type": "array",
-          "description": "A list of conversation messages alternating between user and assistant, with the first message being a user message",
-          "items": {
+    "type": "json_schema",
+    "json_schema": {
+        "name": "conversation",
+        "strict": True,
+        "schema": {
            "type": "object",
            "properties": {
-              "role": {
-                "type": "string",
-                "description": "The role of the speaker, either 'user' or 'assistant'"
-              },
-              "content": {
-                "type": "string",
-                "description": "The message content"
-              }
+                "messages": {
+                    "type": "array",
+                    "description": "Conversation messages alternating user/assistant, starting with user",
+                    "items": {
+                        "type": "object",
+                        "properties": {
+                            "role": {
+                                "type": "string",
+                                "description": "Either 'user' or 'assistant'"
+                            },
+                            "content": {
+                                "type": "string",
+                                "description": "The message content"
+                            }
+                        },
+                        "required": ["role", "content"],
+                        "additionalProperties": False
+                    }
+                }
            },
-            "required": ["role", "content"],
+            "required": ["messages"],
            "additionalProperties": False
-          }
        }
-      },
-      "required": ["messages"],
-      "additionalProperties": False
    }
-  }
 }

-# Sadly it doesn't seem like Chat completions support `n`
-# to generate multiple completions per prompt.
 base_payload = {
-  "model": "google/gemini-2.5-flash",
-  "stream": False,
-  "response_format": response_format,
-  "temperature": 1.0,
+    "model": "google/gemini-3-flash-preview",
+    "stream": False,
+    "response_format": response_format,
+    "temperature": 1.0,
 }

+# =============================================================================
+# GENERATION LOGIC
+# =============================================================================
+
+def sample_diversity_elements(rng):
+    """Sample one element from each diversity dimension."""
+    # Sample topic: first pick a category, then a topic within it
+    category = rng.choice(list(topics.keys()))
+    topic = rng.choice(topics[category])
+
+    # Sample persona
+    persona = rng.choice(personas)
+
+    # Sample dynamic
+    dynamic = rng.choice(dynamics)
+
+    # Sample first message examples: pick from multiple categories
+    first_msg_samples = []
+    categories = rng.sample(list(first_messages.keys()), min(3, len(first_messages)))
+    for cat in categories:
+        first_msg_samples.append(rng.choice(first_messages[cat]))
+
+    return {
+        "topic": topic,
+        "persona": persona,
+        "dynamic": dynamic,
+        "first_message_examples": "\n".join(f"- {msg}" for msg in first_msg_samples),
+    }
+
+
 def generate_conversation(idx: int):
    """
    Generate a single conversation using the OpenRouter API.
    Returns a list of message dicts with 'role' and 'content' keys.
    """
+    # Use idx as seed for reproducibility
+    rng = random.Random(idx)

-    # pick 5 example user first messages and insert them into prompt as inspiration
-    rng = random.Random(idx) # use idx as seed to the rng
-    user_first_prompt = "\n".join(rng.choice(user_first_prompts) for _ in range(5))
+    # Sample diversity elements
+    elements = sample_diversity_elements(rng)
+
+    # Build the prompt
+    prompt = prompt_template.format(
+        knowledge=knowledge,
+        topic=elements["topic"],
+        persona=elements["persona"],
+        dynamic=elements["dynamic"],
+        first_message_examples=elements["first_message_examples"],
+    )
+
+    # Make API request
    payload = copy.deepcopy(base_payload)
-    modified_prompt = prompt.replace("%USER_FIRST_PROMPTS%", user_first_prompt)
-    payload['messages'] = [{"role": "user", "content": modified_prompt}]
+    payload['messages'] = [{"role": "user", "content": prompt}]

    response = requests.post(url, headers=headers, json=payload)
    result = response.json()
-    content = result['choices'][0]['message']['content']

-    # Parse the JSON response and unpack the messages
+    if 'error' in result:
+        raise Exception(f"API error: {result['error']}")
+
+    content = result['choices'][0]['message']['content']
    conversation_data = json.loads(content)
    messages = conversation_data['messages']

-    return messages
+    # Return messages along with metadata for debugging
+    return {
+        "messages": messages,
+        "metadata": {
+            "topic": elements["topic"],
+            "persona": elements["persona"],
+            "dynamic": elements["dynamic"],
+        }
+    }


-# Configuration
-num_conversations = 1000
-num_workers = 4
+def validate_conversation(messages):
+    """Validate conversation structure."""
+    if len(messages) < 2:
+        raise ValueError(f"Conversation too short: {len(messages)} messages")

-output_file = os.path.join(get_base_dir(), "identity_conversations.jsonl")
-# Wipe the file clean first to reset it
-if os.path.exists(output_file):
-    os.remove(output_file)
-print(f"Saving to {output_file}")
+    for i, message in enumerate(messages):
+        expected_role = "user" if i % 2 == 0 else "assistant"
+        if message['role'] != expected_role:
+            raise ValueError(f"Message {i} has role '{message['role']}', expected '{expected_role}'")

-# Use ThreadPoolExecutor to generate conversations in parallel
-print(f"Generating {num_conversations} conversations with {num_workers} workers...")
-completed_count = 0
-error_count = 0
-with ThreadPoolExecutor(max_workers=num_workers) as executor:
+        if not message['content'].strip():
+            raise ValueError(f"Message {i} has empty content")

-    # Submit all tasks
-    futures = [executor.submit(generate_conversation, idx) for idx in range(num_conversations)]
+    return True

-    # Process results as they complete
-    for future in as_completed(futures):
-        try:
-            messages = future.result()

-            # Lightly validate the conversation structure
-            for i, message in enumerate(messages):
-                expected_role = "user" if i % 2 == 0 else "assistant"
-                assert message['role'] == expected_role, f"Message {i} has role {message['role']} but should be {expected_role}"
+# =============================================================================
+# MAIN
+# =============================================================================

-            # If all looks good, write the messages to file
-            with open(output_file, 'a') as f:
-                f.write(json.dumps(messages) + '\n')
-            completed_count += 1
-            print(f"✓ Saved conversation {completed_count}/{num_conversations}")
+if __name__ == "__main__":
+    import argparse

-        except Exception as e:
-            error_count += 1
-            print(f"✗ Error generating conversation: {e}")
+    parser = argparse.ArgumentParser(description="Generate synthetic conversation data")
+    parser.add_argument("--num", type=int, default=1000, help="Number of conversations to generate")
+    parser.add_argument("--workers", type=int, default=4, help="Number of parallel workers")
+    parser.add_argument("--output", type=str, default=None, help="Output file path")
+    parser.add_argument("--append", action="store_true", help="Append to existing file instead of overwriting")
+    parser.add_argument("--save-metadata", action="store_true", help="Save metadata alongside messages")
+    args = parser.parse_args()

-print(f"\nDone! Successfully saved {completed_count} conversations to {output_file}")
-if error_count > 0:
-    print(f"Encountered {error_count} errors during generation")
+    # Set output file
+    if args.output:
+        output_file = args.output
+    else:
+        output_file = os.path.join(get_base_dir(), "identity_conversations.jsonl")

+    # Handle file creation/clearing
+    if not args.append and os.path.exists(output_file):
+        os.remove(output_file)
+
+    print(f"Output file: {output_file}")
+    print(f"Generating {args.num} conversations with {args.workers} workers...")
+    print(f"Topic categories: {list(topics.keys())}")
+    print(f"Personas: {len(personas)}")
+    print(f"Dynamics: {len(dynamics)}")
+    print()
+
+    completed_count = 0
+    error_count = 0
+
+    with ThreadPoolExecutor(max_workers=args.workers) as executor:
+        # Submit all tasks
+        futures = {executor.submit(generate_conversation, idx): idx
+                   for idx in range(args.num)}
+
+        # Process results as they complete
+        for future in as_completed(futures):
+            idx = futures[future]
+            try:
+                result = future.result()
+                messages = result["messages"]
+                metadata = result["metadata"]
+
+                # Validate
+                validate_conversation(messages)
+
+                # Write to file
+                with open(output_file, 'a') as f:
+                    if args.save_metadata:
+                        f.write(json.dumps({"messages": messages, "metadata": metadata}) + '\n')
+                    else:
+                        f.write(json.dumps(messages) + '\n')
+
+                completed_count += 1
+                topic_short = metadata["topic"][:40] + "..." if len(metadata["topic"]) > 40 else metadata["topic"]
+                print(f"[{completed_count}/{args.num}] Topic: {topic_short}")
+
+            except Exception as e:
+                error_count += 1
+                print(f"[ERROR] idx={idx}: {e}")
+
+    print()
+    print(f"Done! Saved {completed_count} conversations to {output_file}")
+    if error_count > 0:
+        print(f"Encountered {error_count} errors during generation")
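Note on the seeded sampling above: a minimal sketch of how `random.Random(idx)` makes each conversation's topic/persona/dynamic draw reproducible. The table contents below are hypothetical stand-ins, not the script's real `topics`, `personas`, `dynamics`, or `first_messages`, and `sample_elements` is an illustrative name rather than the function in the diff.

```python
import random

# Hypothetical stand-ins for the script's module-level tables (assumed shapes only).
topics = {
    "identity": ["who made you", "what are you"],
    "training": ["how were you trained", "what data did you see"],
}
personas = ["curious beginner", "ML researcher"]
dynamics = ["short chat", "deep dive"]
first_messages = {
    "casual": ["hi", "hey there"],
    "typo": ["helo", "hii"],
    "direct": ["who are you?", "what is this?"],
}


def sample_elements(idx):
    """Draw one element per diversity dimension, seeded by the conversation index."""
    rng = random.Random(idx)  # same idx -> same sequence of draws
    category = rng.choice(list(topics.keys()))
    topic = rng.choice(topics[category])
    persona = rng.choice(personas)
    dynamic = rng.choice(dynamics)
    # Pick up to 3 distinct categories, then one example message from each
    cats = rng.sample(list(first_messages.keys()), min(3, len(first_messages)))
    examples = [rng.choice(first_messages[cat]) for cat in cats]
    return {
        "topic": topic,
        "persona": persona,
        "dynamic": dynamic,
        "first_message_examples": "\n".join(f"- {msg}" for msg in examples),
    }
```

Because the generator is local to the call and seeded by `idx`, the draw does not depend on which worker thread picks the task up or in what order futures complete. The API response itself is still sampled at temperature 1.0, so only the prompt is reproducible, not the generated conversation.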
--- a/nanochat/checkpoint_manager.py
+++ b/nanochat/checkpoint_manager.py
@ -164,7 +164,6 @@ def load_model_from_dir(checkpoints_dir, device, phase, model_tag=None, step=Non
 def load_model(source, *args, **kwargs):
    model_dir = {
        "base": "base_checkpoints",
-        "mid": "mid_checkpoints",
        "sft": "chatsft_checkpoints",
        "rl": "chatrl_checkpoints",
    }[source]
--- a/nanochat/common.py
+++ b/nanochat/common.py
@ -207,70 +207,52 @@ class DummyWandb:
 def get_peak_flops(device_name: str) -> float:
    name = device_name.lower()

-    # --- NVIDIA Blackwell ---
-    if "gb200" in name or "grace blackwell" in name:
-        return 2.5e15
-    if "b200" in name:
-        return 2.25e15
-    if "b100" in name:
-        return 1.8e15
-
-    # --- NVIDIA Hopper (H100/H200/H800) ---
-    if "h200" in name:
-        if "nvl" in name or "pcie" in name:
-            return 836e12
-        return 989e12  # H200 SXM
-    if "h100" in name:
-        if "nvl" in name:
-            return 835e12
-        if "pcie" in name:
-            return 756e12
-        return 989e12  # H100 SXM
-    if "h800" in name:
-        if "nvl" in name:
-            return 989e12
-        return 756e12  # H800 PCIe
-
-    # --- NVIDIA Ampere data center ---
-    if "a100" in name or "a800" in name:
-        return 312e12
-    if "a40" in name:
-        return 149.7e12
-    if "a30" in name:
-        return 165e12
-
-    # --- NVIDIA Ada data center ---
-    if "l40s" in name or "l40-s" in name or "l40 s" in name:
-        return 362e12
-    if "l4" in name:
-        return 121e12
-
-    # --- AMD CDNA accelerators ---
-    if "mi355" in name:
-        return 2.5e15
-    if "mi325" in name or "mi300x" in name:
-        return 1.3074e15
-    if "mi300a" in name:
-        return 980.6e12
-    if "mi250x" in name:
-        return 383e12
-    if "mi250" in name:
-        return 362.1e12
-
-    # --- Intel ---
+    # Table order matters: more specific patterns first.
+    _PEAK_FLOPS_TABLE = (
+        # NVIDIA Blackwell
+        (["gb200"], 2.5e15),
+        (["grace blackwell"], 2.5e15),
+        (["b200"], 2.25e15),
+        (["b100"], 1.8e15),
+        # NVIDIA Hopper
+        (["h200", "nvl"], 836e12),
+        (["h200", "pcie"], 836e12),
+        (["h200"], 989e12),
+        (["h100", "nvl"], 835e12),
+        (["h100", "pcie"], 756e12),
+        (["h100"], 989e12),
+        (["h800", "nvl"], 989e12),
+        (["h800"], 756e12),
+        # NVIDIA Ampere data center
+        (["a100"], 312e12),
+        (["a800"], 312e12),
+        (["a40"], 149.7e12),
+        (["a30"], 165e12),
+        # NVIDIA Ada data center
+        (["l40s"], 362e12),
+        (["l40-s"], 362e12),
+        (["l40 s"], 362e12),
+        (["l4"], 121e12),
+        # AMD CDNA accelerators
+        (["mi355"], 2.5e15),
+        (["mi325"], 1.3074e15),
+        (["mi300x"], 1.3074e15),
+        (["mi300a"], 980.6e12),
+        (["mi250x"], 383e12),
+        (["mi250"], 362.1e12),
+        # Consumer RTX
+        (["5090"], 209.5e12),
+        (["4090"], 165.2e12),
+        (["3090"], 71e12),
+    )
+    for patterns, flops in _PEAK_FLOPS_TABLE:
+        if all(p in name for p in patterns):
+            return flops
    if "data center gpu max 1550" in name:
        # Ponte Vecchio (PVC) - dynamic based on compute units
        max_comp_units = torch.xpu.get_device_properties("xpu").max_compute_units
        return 512 * max_comp_units * 1300 * 10**6

-    # --- Consumer RTX (for hobbyists) ---
-    if "5090" in name:
-        return 209.5e12
-    if "4090" in name:
-        return 165.2e12
-    if "3090" in name:
-        return 71e12
-
    # Unknown GPU - return inf so MFU shows as 0% rather than a wrong guess
    logger.warning(f"Peak flops undefined for: {device_name}, MFU will show as 0%")
    return float('inf')
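Note on "table order matters": a small sketch of the same first-match lookup, using a handful of rows copied from the table in this diff. `lookup_peak_flops` and `_TABLE` are illustrative names, and the device-name strings in the usage note are assumptions about what a driver might report.

```python
# A few rows copied from the diff; more specific patterns come first.
_TABLE = (
    (["h100", "nvl"], 835e12),
    (["h100", "pcie"], 756e12),
    (["h100"], 989e12),
    (["l40s"], 362e12),
    (["l4"], 121e12),
)


def lookup_peak_flops(device_name):
    """Return the flops of the first row whose patterns are all substrings of the name."""
    name = device_name.lower()
    for patterns, flops in _TABLE:
        if all(p in name for p in patterns):
            return flops
    # Unknown device: inf makes MFU come out as 0% instead of a wrong number
    return float("inf")
```

With this ordering, a name such as "NVIDIA H100 PCIe" hits the `["h100", "pcie"]` row before the generic `["h100"]` row, and "NVIDIA L40S" hits `["l40s"]` before `["l4"]`. Since "l4" is a substring of "l40s", swapping those two rows would silently return the L4 figure for an L40S.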
--- a/nanochat/dataloader.py
+++ b/nanochat/dataloader.py
@ -1,24 +1,19 @@
 """
 Distributed dataloaders for pretraining.

-Two implementations are provided:
-
-1. Original (tokenizing_distributed_data_loader):
-   - Streams tokens into a flat buffer, reshapes to (B, T)
-   - Rows may start mid-document (no guaranteed BOS at position 0)
-   - 100% token utilization, simple and efficient
-
-2. BOS-aligned bestfit (tokenizing_distributed_data_loader_bos_bestfit):
+BOS-aligned bestfit:
   - Every row starts with BOS token
   - Documents packed using best-fit algorithm to minimize cropping
   - When no document fits remaining space, crops a document to fill exactly
   - 100% utilization (no padding), ~35% tokens cropped at T=2048

-The tradeoff: BOS-aligned loses ~35% of tokens to cropping, but ensures that
+Compared to the original tokenizing_distributed_data_loader:
+BOS-aligned loses ~35% of tokens to cropping, but ensures that
 there are fewer "confusing" tokens in the train/val batches as every token can
 now attend back to the BOS token and sees the full context of the document.
-(2) is the new default if you have enough data.
-Fallback to (1) if you have very limited data AND long documents.
+
+Fallback to the original if you have very limited data AND long documents:
+https://github.com/karpathy/nanochat/blob/3c3a3d7/nanochat/dataloader.py#L78-L117
 """

 import torch
@ -75,48 +70,6 @@ def _document_batches(split, resume_state_dict, tokenizer_batch_size):
        epoch += 1


-def tokenizing_distributed_data_loader_with_state(tokenizer, B, T, split, tokenizer_threads=4, tokenizer_batch_size=128, device="cuda", resume_state_dict=None):
-    """
-    Stream pretraining text from parquet files, tokenize, yield training batches.
-
-    This is the original dataloader that streams tokens into a flat buffer and reshapes.
-    Rows may start mid-document (no guaranteed BOS at position 0).
-
-    Supports approximate resume via state_dict.
-    """
-    assert split in ["train", "val"], "split must be 'train' or 'val'"
-
-    batches = _document_batches(split, resume_state_dict, tokenizer_batch_size)
-    needed_tokens = B * T + 1  # +1 for target at last position
-    bos_token = tokenizer.get_bos_token_id()
-    token_buffer = []
-    pq_idx, rg_idx, epoch = 0, 0, 1
-
-    while True:
-
-        # Accumulate enough tokens
-        while len(token_buffer) < needed_tokens:
-            doc_batch, (pq_idx, rg_idx, epoch) = next(batches)
-            token_lists = tokenizer.encode(doc_batch, prepend=bos_token, num_threads=tokenizer_threads)
-            for tokens in token_lists:
-                token_buffer.extend(tokens)
-        tokens = token_buffer[:needed_tokens] # Read B*T+1 tokens (+1 is only for the target for the last token)
-        token_buffer = token_buffer[B*T:] # Advance by B*T tokens, so we move exactly one window of B*T tokens over
-
-        # Package tokens into inputs and targets, yield
-        use_cuda = device == "cuda"
-        scratch = torch.tensor(tokens, dtype=torch.long, pin_memory=use_cuda)
-        inputs = scratch[:-1].view(B, T).to(device=device, non_blocking=use_cuda)
-        targets = scratch[1:].view(B, T).to(device=device, non_blocking=use_cuda)
-        yield inputs, targets, {"pq_idx": pq_idx, "rg_idx": rg_idx, "epoch": epoch}
-
-
-def tokenizing_distributed_data_loader(*args, **kwargs):
-    """Helper that omits state_dict from yields."""
-    for inputs, targets, state_dict in tokenizing_distributed_data_loader_with_state(*args, **kwargs):
-        yield inputs, targets
-
-
 def tokenizing_distributed_data_loader_with_state_bos_bestfit(
    tokenizer, B, T, split,
    tokenizer_threads=4, tokenizer_batch_size=128,
@ -157,6 +110,7 @@ def tokenizing_distributed_data_loader_with_state_bos_bestfit(
    # Pre-allocate buffers once: layout is [inputs (B*T) | targets (B*T)]
    # This gives us contiguous views and a single HtoD transfer
    use_cuda = device == "cuda"
+    row_buffer = torch.empty((B, row_capacity), dtype=torch.long) # for building rows without creating Python lists
    cpu_buffer = torch.empty(2 * B * T, dtype=torch.long, pin_memory=use_cuda) # staging area (CPU)
    gpu_buffer = torch.empty(2 * B * T, dtype=torch.long, device=device) # on-device buffer
    cpu_inputs = cpu_buffer[:B * T].view(B, T) # a few views into these buffers just for convenience
@ -165,15 +119,14 @@ def tokenizing_distributed_data_loader_with_state_bos_bestfit(
    targets = gpu_buffer[B * T:].view(B, T)

    while True:
-        rows = []
-        for _ in range(B):
-            row = []
-            while len(row) < row_capacity:
+        for row_idx in range(B):
+            pos = 0
+            while pos < row_capacity:
                # Ensure buffer has documents
                while len(doc_buffer) < buffer_size:
                    refill_buffer()

-                remaining = row_capacity - len(row)
+                remaining = row_capacity - pos

                # Find largest doc that fits entirely
                best_idx = -1
@ -186,19 +139,19 @@ def tokenizing_distributed_data_loader_with_state_bos_bestfit(

                if best_idx >= 0:
                    doc = doc_buffer.pop(best_idx)
-                    row.extend(doc)
+                    doc_len = len(doc)
+                    row_buffer[row_idx, pos:pos + doc_len] = torch.tensor(doc, dtype=torch.long)
+                    pos += doc_len
                else:
                    # No doc fits - crop shortest in buffer to fill remaining and minimize waste
                    shortest_idx = min(range(len(doc_buffer)), key=lambda i: len(doc_buffer[i]))
                    doc = doc_buffer.pop(shortest_idx)
-                    row.extend(doc[:remaining])
+                    row_buffer[row_idx, pos:pos + remaining] = torch.tensor(doc[:remaining], dtype=torch.long)
+                    pos += remaining

-            rows.append(row[:row_capacity])
-
-        # Convert rows to tensor and copy slices to pinned buffer (CPU work)
-        row_data = torch.tensor(rows, dtype=torch.long)  # [B, T+1], temporary
-        cpu_inputs.copy_(row_data[:, :-1])
-        cpu_targets.copy_(row_data[:, 1:])
+        # Copy to pinned CPU buffer, then single HtoD transfer
+        cpu_inputs.copy_(row_buffer[:, :-1])
+        cpu_targets.copy_(row_buffer[:, 1:])

        state_dict = {"pq_idx": pq_idx, "rg_idx": rg_idx, "epoch": epoch}
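Note on the packing loop above: a pure-Python sketch of the best-fit step for a single row, without the tensor buffers. It assumes `doc_buffer` always holds enough documents to fill the row (the real loader refills it before each step); `pack_row` is an illustrative name, not a function in the repo.

```python
def pack_row(doc_buffer, row_capacity):
    """Fill one row: take the largest doc that fits, else crop the shortest doc."""
    row = []
    while len(row) < row_capacity:
        remaining = row_capacity - len(row)

        # Find the largest document that fits entirely in the remaining space
        best_idx, best_len = -1, 0
        for i, doc in enumerate(doc_buffer):
            if best_len < len(doc) <= remaining:
                best_idx, best_len = i, len(doc)

        if best_idx >= 0:
            row.extend(doc_buffer.pop(best_idx))
        else:
            # Nothing fits: crop the shortest doc so the fewest tokens are discarded
            shortest_idx = min(range(len(doc_buffer)), key=lambda i: len(doc_buffer[i]))
            row.extend(doc_buffer.pop(shortest_idx)[:remaining])
    return row
```

The change in this hunk keeps the same selection logic but writes each document straight into a preallocated `row_buffer[row_idx, pos:pos + doc_len]` slice, so the per-batch list of lists and the `torch.tensor(rows)` conversion go away.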

--- a/nanochat/flash_attention.py
+++ b/nanochat/flash_attention.py
@ -2,7 +2,7 @@
 Unified Flash Attention interface with automatic FA3/SDPA switching.

 Exports `flash_attn` module that matches the FA3 API exactly, but falls back
-to PyTorch SDPA on non-Hopper GPUs, MPS, and CPU.
+to PyTorch SDPA on non-Hopper GPUs (including Blackwell), MPS, and CPU.

 Usage (drop-in replacement for FA3):
    from nanochat.flash_attention import flash_attn
@ -21,12 +21,14 @@ import torch.nn.functional as F
 # Detection: Try to load FA3 on Hopper+ GPUs
 # =============================================================================
 def _load_flash_attention_3():
-    """Try to load Flash Attention 3 (requires Hopper+ GPU)."""
+    """Try to load Flash Attention 3 (requires Hopper GPU, sm90)."""
    if not torch.cuda.is_available():
        return None
    try:
        major, _ = torch.cuda.get_device_capability()
-        if major < 9:  # Hopper is sm90
+        # FA3 kernels are compiled for Hopper (sm90) only
+        # Ada (sm89) and Blackwell (sm100) fall back to SDPA until FA3 is recompiled for them
+        if major != 9:
            return None
        import os
        os.environ["HF_HUB_DISABLE_PROGRESS_BARS"] = "1"
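Note on the capability check: a tiny sketch contrasting the old `major < 9` gate with the new `major != 9` gate. The helper names are illustrative, and the major versions are taken from the sm numbers named in the comment above (sm89, sm90, sm100).

```python
def fa3_allowed_old(major):
    """Old gate: anything at or above Hopper was allowed to try FA3."""
    return major >= 9


def fa3_allowed_new(major):
    """New gate: only compute capability 9.x (Hopper, sm90) tries FA3."""
    return major == 9


# Major compute capability per architecture, as named in the diff comment
ARCH_MAJOR = {"Ada (sm89)": 8, "Hopper (sm90)": 9, "Blackwell (sm100)": 10}

decisions = {
    arch: (fa3_allowed_old(major), fa3_allowed_new(major))
    for arch, major in ARCH_MAJOR.items()
}
```

The only architecture whose decision changes is Blackwell: the old gate let it attempt to load FA3 kernels built for sm90, while the new gate routes it to the SDPA fallback alongside Ada.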
--- a/nanochat/report.py
+++ b/nanochat/report.py
@ -211,8 +211,6 @@ EXPECTED_FILES = [
    "base-model-training.md",
    "base-model-loss.md",
    "base-model-evaluation.md",
-    "midtraining.md",
-    "chat-evaluation-mid.md",
    "chat-sft.md",
    "chat-evaluation-sft.md",
    "chat-rl.md",
@ -316,8 +314,6 @@ class Report:
                # extract the most important metrics from the sections
                if file_name == "base-model-evaluation.md":
                    final_metrics["base"] = extract(section, "CORE")
-                if file_name == "chat-evaluation-mid.md":
-                    final_metrics["mid"] = extract(section, chat_metrics)
                if file_name == "chat-evaluation-sft.md":
                    final_metrics["sft"] = extract(section, chat_metrics)
                if file_name == "chat-evaluation-rl.md":
@ -337,7 +333,7 @@ class Report:
            # Custom ordering: CORE first, ChatCORE last, rest in middle
            all_metrics = sorted(all_metrics, key=lambda x: (x != "CORE", x == "ChatCORE", x))
            # Fixed column widths
-            stages = ["base", "mid", "sft", "rl"]
+            stages = ["base", "sft", "rl"]
            metric_width = 15
            value_width = 8
            # Write table header
--- a/pyproject.toml
+++ b/pyproject.toml
@ -19,7 +19,8 @@ dependencies = [
    "tabulate>=0.9.0",
    "tiktoken>=0.11.0",
    "tokenizers>=0.22.0",
-    "torch>=2.9.0",
+    "torch==2.9.1",
+    "torchao==0.15.0",
    "transformers>=4.57.3",
    "uvicorn>=0.36.0",
    "wandb>=0.21.3",
@ -59,10 +60,10 @@ explicit = true

 [project.optional-dependencies]
 cpu = [
-    "torch>=2.9.1",
+    "torch==2.9.1",
 ]
 gpu = [
-    "torch>=2.9.1",
+    "torch==2.9.1",
 ]

 [tool.uv]
--- a/runs/runcpu.sh
+++ b/runs/runcpu.sh
@ -42,8 +42,7 @@ python -m scripts.base_train \
    --sample-every=100 \
    --num-iterations=5000 \
    --run=$WANDB_RUN
-python -m scripts.base_loss --device-batch-size=1 --split-tokens=16384
-python -m scripts.base_eval --max-per-task=16
+python -m scripts.base_eval --device-batch-size=1 --split-tokens=16384 --max-per-task=16

 # SFT (~10 minutes on my MacBook Pro M3 Max)
 curl -L -o $NANOCHAT_BASE_DIR/identity_conversations.jsonl https://karpathy-public.s3.us-west-2.amazonaws.com/identity_conversations.jsonl
--- a/runs/speedrun.sh
+++ b/runs/speedrun.sh
@ -69,15 +69,10 @@ python -m scripts.tok_eval
 echo "Waiting for dataset download to complete..."
 wait $DATASET_DOWNLOAD_PID

-# Number of processes/GPUs to use
-NPROC_PER_NODE=8
-
 # d24 model (slightly overtrained is enough to beat GPT-2 => increase data:params ratio from compute optimal 10.5 (default) to 12)
-torchrun --standalone --nproc_per_node=$NPROC_PER_NODE -m scripts.base_train -- --depth=24 --target-param-data-ratio=12 --run=$WANDB_RUN
-# evaluate the model on a larger chunk of train/val data and draw some samples
-torchrun --standalone --nproc_per_node=$NPROC_PER_NODE -m scripts.base_loss
-# evaluate the model on CORE tasks
-torchrun --standalone --nproc_per_node=$NPROC_PER_NODE -m scripts.base_eval
+torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- --depth=26 --target-param-data-ratio=8.5 --device-batch-size=16 --fp8 --run=$WANDB_RUN
+# evaluate the model: CORE metric, BPB on train/val, and draw samples
+torchrun --standalone --nproc_per_node=8 -m scripts.base_eval -- --device-batch-size=16

 # -----------------------------------------------------------------------------
 # SFT (teach the model conversation special tokens, tool use, multiple choice)
@ -87,8 +82,8 @@ torchrun --standalone --nproc_per_node=$NPROC_PER_NODE -m scripts.base_eval
 curl -L -o $NANOCHAT_BASE_DIR/identity_conversations.jsonl https://karpathy-public.s3.us-west-2.amazonaws.com/identity_conversations.jsonl

 # run SFT and eval the model
-torchrun --standalone --nproc_per_node=$NPROC_PER_NODE -m scripts.chat_sft -- --run=$WANDB_RUN
-torchrun --standalone --nproc_per_node=$NPROC_PER_NODE -m scripts.chat_eval -- -i sft
+torchrun --standalone --nproc_per_node=8 -m scripts.chat_sft -- --device-batch-size=16 --run=$WANDB_RUN
+torchrun --standalone --nproc_per_node=8 -m scripts.chat_eval -- -i sft

 # chat with the model over CLI! Leave out the -p to chat interactively
 # python -m scripts.chat_cli -p "Why is the sky blue?"
--- a/scripts/base_eval.py
+++ b/scripts/base_eval.py
@ -1,13 +1,23 @@
 """
-Evaluate the CORE metric for a given model.
+Unified evaluation script for base models.

-Run on a single GPU:
-python -m scripts.base_eval
+Supports three evaluation modes (comma-separated):
+  --eval core    : CORE metric (accuracy on ICL tasks)
+  --eval bpb     : Bits per byte on train/val splits
+  --eval sample  : Generate samples from the model

-Run with torchrun on e.g. 8 GPUs:
-torchrun --nproc_per_node=8 -m scripts.base_eval
+Default is all three: --eval core,bpb,sample

-The script will print the CORE metric to the console.
+Examples:
+
+    # Evaluate a HuggingFace model (e.g. GPT-2 124M) using 8 GPUs
+    torchrun --nproc_per_node=8 -m scripts.base_eval --hf-path openai-community/gpt2
+
+    # Evaluate a nanochat model (e.g. d24) using 8 GPUs
+    torchrun --nproc_per_node=8 -m scripts.base_eval --model-tag d24 --device-batch-size=16
+
+    # Quick/approximate evaluation using a single GPU
+    python -m scripts.base_eval --model-tag d24 --device-batch-size=16 --max-per-task=100 --split-tokens=524288
 """
 import os
 import csv
@ -18,24 +28,74 @@ import shutil
 import random
 import zipfile
 import tempfile
+import argparse
 from contextlib import nullcontext

 import torch

 from nanochat.common import compute_init, compute_cleanup, print0, get_base_dir, autodetect_device_type, download_file_with_lock
-from nanochat.tokenizer import HuggingFaceTokenizer
+from nanochat.tokenizer import HuggingFaceTokenizer, get_token_bytes
 from nanochat.checkpoint_manager import load_model
 from nanochat.core_eval import evaluate_task
+from nanochat.dataloader import tokenizing_distributed_data_loader_bos_bestfit
+from nanochat.loss_eval import evaluate_bpb
+from nanochat.engine import Engine

 # -----------------------------------------------------------------------------
-# nanochat specific function dealing with I/O etc.
+# HuggingFace loading utilities
+
+class ModelWrapper:
+    """Lightweight wrapper to give HuggingFace models a nanochat-compatible interface."""
+    def __init__(self, model, max_seq_len=None):
+        self.model = model
+        self.max_seq_len = max_seq_len
+
+    def __call__(self, input_ids, targets=None, loss_reduction='mean'):
+        logits = self.model(input_ids).logits
+        if targets is None:
+            return logits
+        loss = torch.nn.functional.cross_entropy(
+            logits.view(-1, logits.size(-1)),
+            targets.view(-1),
+            ignore_index=-1,
+            reduction=loss_reduction
+        )
+        return loss
+
+    def get_device(self):
+        return next(self.model.parameters()).device
+
+
+def load_hf_model(hf_path: str, device):
+    """Load a HuggingFace model and tokenizer."""
+    print0(f"Loading HuggingFace model from: {hf_path}")
+    from transformers import AutoModelForCausalLM
+    model = AutoModelForCausalLM.from_pretrained(hf_path)
+    model.to(device)
+    model.eval()
+    max_seq_len = 1024 if "gpt2" in hf_path else None
+    model = ModelWrapper(model, max_seq_len=max_seq_len)
+    tokenizer = HuggingFaceTokenizer.from_pretrained(hf_path)
+    return model, tokenizer
+
+
+def get_hf_token_bytes(tokenizer, device="cpu"):
+    """Compute token_bytes tensor for a HuggingFace tokenizer."""
+    vocab_size = tokenizer.tokenizer.get_vocab_size()
+    token_bytes = torch.zeros(vocab_size, dtype=torch.int64, device=device)
+    for token_id in range(vocab_size):
+        token_str = tokenizer.tokenizer.decode([token_id])
+        token_bytes[token_id] = len(token_str.encode('utf-8'))
+    return token_bytes
+
+# -----------------------------------------------------------------------------
+# CORE evaluation

-# ~162MB of data needed to evaluate the CORE metric
 EVAL_BUNDLE_URL = "https://karpathy-public.s3.us-west-2.amazonaws.com/eval_bundle.zip"

+
 def place_eval_bundle(file_path):
-    # here file_path is the path to the eval_bundle.zip file
-    # we need to unzip it and place it in the base directory
+    """Unzip eval_bundle.zip and place it in the base directory."""
    base_dir = get_base_dir()
    eval_bundle_dir = os.path.join(base_dir, "eval_bundle")
    with tempfile.TemporaryDirectory() as tmpdir:
@ -45,25 +105,27 @@ def place_eval_bundle(file_path):
        shutil.move(extracted_bundle_dir, eval_bundle_dir)
    print0(f"Placed eval_bundle directory at {eval_bundle_dir}")

-def evaluate_model(model, tokenizer, device, max_per_task=-1):
+
+def evaluate_core(model, tokenizer, device, max_per_task=-1):
    """
    Evaluate a base model on the CORE benchmark.
-    - max_per_task: crop the data to this many examples per task for testing (-1 = disable)
+    Returns dict with results, centered_results, and core_metric.
    """
-    # Load config and task metadata
    base_dir = get_base_dir()
    eval_bundle_dir = os.path.join(base_dir, "eval_bundle")
-    # Download the eval bundle to disk (and unzip if needed)
+    # Download the eval bundle if needed
    if not os.path.exists(eval_bundle_dir):
        download_file_with_lock(EVAL_BUNDLE_URL, "eval_bundle.zip", postprocess_fn=place_eval_bundle)
+
    config_path = os.path.join(eval_bundle_dir, "core.yaml")
    data_base_path = os.path.join(eval_bundle_dir, "eval_data")
    eval_meta_data = os.path.join(eval_bundle_dir, "eval_meta_data.csv")
+
    with open(config_path, 'r', encoding='utf-8') as f:
        config = yaml.safe_load(f)
    tasks = config['icl_tasks']

-    # Load random baseline values from eval metadata
+    # Load random baseline values
    random_baselines = {}
    with open(eval_meta_data, 'r', encoding='utf-8') as f:
        reader = csv.DictReader(f)
@ -86,27 +148,23 @@ def evaluate_model(model, tokenizer, device, max_per_task=-1):
        }
        print0(f"Evaluating: {label} ({task_meta['num_fewshot']}-shot, type: {task_meta['task_type']})... ", end='')

-        # Load data for this task
        data_path = os.path.join(data_base_path, task_meta['dataset_uri'])
        with open(data_path, 'r', encoding='utf-8') as f:
            data = [json.loads(line.strip()) for line in f]

-        # shuffle the data because in many cases it appears ordered but we want
-        # the ability to only run a subset of the data for debugging purposes etc.
+        # Shuffle for consistent subsampling when using max_per_task
        shuffle_rng = random.Random(1337)
        shuffle_rng.shuffle(data)
        if max_per_task > 0:
            data = data[:max_per_task]

-        # run the evaluation for this task
        accuracy = evaluate_task(model, tokenizer, data, device, task_meta)
-
        results[label] = accuracy
        random_baseline = random_baselines[label]
        centered_result = (accuracy - 0.01 * random_baseline) / (1.0 - 0.01 * random_baseline)
        centered_results[label] = centered_result
-        end_time = time.time()
-        print0(f"accuracy: {accuracy:.4f} | centered: {centered_result:.4f} | time: {end_time - start_time:.2f}s")
+        elapsed = time.time() - start_time
+        print0(f"accuracy: {accuracy:.4f} | centered: {centered_result:.4f} | time: {elapsed:.2f}s")

    core_metric = sum(centered_results.values()) / len(centered_results)
    out = {
@ -117,98 +175,157 @@ def evaluate_model(model, tokenizer, device, max_per_task=-1):
    return out

 # -----------------------------------------------------------------------------
-# HuggingFace loading utilities and light wrappers for a model
+# Main

-class ModelWrapper:
-    """Lightweight wrapper for a HuggingFace model"""
-    def __init__(self, model, max_seq_len=None):
-        self.model = model
-        self.max_seq_len = max_seq_len
-
-    def __call__(self, input_ids):
-        outputs = self.model(input_ids)
-        logits = outputs.logits
-        return logits
-
-def load_hf_model(hf_path: str, device):
-    print0(f"Loading model from: {hf_path}")
-    # Load the model
-    from transformers import AutoModelForCausalLM
-    model = AutoModelForCausalLM.from_pretrained(hf_path)
-    model.to(device)
-    model.eval()
-    max_seq_len = 1024 if "openai-community/gpt2" in hf_path else None
-    model = ModelWrapper(model, max_seq_len=max_seq_len)
-    # Load the tokenizer
-    tokenizer = HuggingFaceTokenizer.from_pretrained(hf_path)
-    return model, tokenizer
-
-# -----------------------------------------------------------------------------
 def main():
-    import argparse
-    parser = argparse.ArgumentParser()
-    parser.add_argument('--hf-path', type=str, default=None, help='HuggingFace model path to evaluate')
-    parser.add_argument('--max-per-task', type=int, default=-1, help='Max examples per task to evaluate (-1 = disable)')
-    parser.add_argument('--model-tag', type=str, default=None, help='optional model tag for the output directory name')
-    parser.add_argument('--step', type=str, default=None, help='optional model step for the output directory name')
+    parser = argparse.ArgumentParser(description="Base model evaluation")
+    parser.add_argument('--eval', type=str, default='core,bpb,sample', help='Comma-separated evaluations to run: core,bpb,sample (default: all)')
+    parser.add_argument('--hf-path', type=str, default=None, help='HuggingFace model path (e.g. openai-community/gpt2-xl)')
+    parser.add_argument('--model-tag', type=str, default=None, help='nanochat model tag to identify the checkpoint directory')
+    parser.add_argument('--step', type=int, default=None, help='Model step to load (default = last)')
+    parser.add_argument('--max-per-task', type=int, default=-1, help='Max examples per CORE task (-1 = all)')
+    parser.add_argument('--device-batch-size', type=int, default=32, help='Per-device batch size for BPB evaluation')
+    parser.add_argument('--split-tokens', type=int, default=40*524288, help='Number of tokens to evaluate per split for BPB')
+    parser.add_argument('--device-type', type=str, default='', help='cuda|cpu|mps (empty = autodetect)')
    args = parser.parse_args()

-    # distributed / precision setup
-    device_type = autodetect_device_type()
+    # Parse evaluation modes
+    eval_modes = set(mode.strip() for mode in args.eval.split(','))
+    valid_modes = {'core', 'bpb', 'sample'}
+    invalid = eval_modes - valid_modes
+    if invalid:
+        parser.error(f"Invalid eval modes: {invalid}. Valid: {valid_modes}")
+
+    # Distributed / precision setup
+    device_type = autodetect_device_type() if args.device_type == '' else args.device_type
    ddp, ddp_rank, ddp_local_rank, ddp_world_size, device = compute_init(device_type)
    autocast_ctx = torch.amp.autocast(device_type=device_type, dtype=torch.bfloat16) if device_type == "cuda" else nullcontext()

-    # Load model and tokenizer from command line or from file system
-    if args.hf_path is not None:
-        # atm assume that if a path is given, it's a huggingface model path
-        hf_path = args.hf_path
-        print0(f"Loading huggingface model from: {hf_path}")
-        model, tokenizer = load_hf_model(hf_path, device)
-        model_name = hf_path # just for logging
-        model_slug = hf_path.replace("/", "-") # for the output csv file
+    # Load model and tokenizer
+    is_hf_model = args.hf_path is not None
+    if is_hf_model:
+        model, tokenizer = load_hf_model(args.hf_path, device)
+        sequence_len = model.max_seq_len or 1024
+        token_bytes = get_hf_token_bytes(tokenizer, device=device)
+        model_name = args.hf_path
+        model_slug = args.hf_path.replace("/", "-")
    else:
-        # load a local model from the file system
        model, tokenizer, meta = load_model("base", device, phase="eval", model_tag=args.model_tag, step=args.step)
-        model_name = f"base_model (step {meta['step']})" # just for logging
-        model_slug = f"base_model_{meta['step']:06d}" # for the output csv file
+        sequence_len = meta["model_config"]["sequence_len"]
+        token_bytes = get_token_bytes(device=device)
+        model_name = f"base_model (step {meta['step']})"
+        model_slug = f"base_model_{meta['step']:06d}"

-    # Evaluate the model
-    with autocast_ctx:
-        out = evaluate_model(model, tokenizer, device, max_per_task=args.max_per_task)
+    print0(f"Evaluating model: {model_name}")
+    print0(f"Eval modes: {', '.join(sorted(eval_modes))}")

-    # Write out the results to a csv file
-    core_metric = None
-    centered_results = {}
-    if ddp_rank == 0:
-        base_dir = get_base_dir()
-        output_csv_path = os.path.join(base_dir, "base_eval", f"{model_slug}.csv")
-        os.makedirs(os.path.dirname(output_csv_path), exist_ok=True)
-        results = out["results"]
-        centered_results = out["centered_results"]
-        core_metric = out["core_metric"]
-        with open(output_csv_path, 'w', encoding='utf-8', newline='') as f:
-            f.write(f"{'Task':<35}, {'Accuracy':<10}, {'Centered':<10}\n")
-            for label in results:
-                f.write(f"{label:<35}, {results[label]:<10.6f}, {centered_results[label]:<10.6f}\n")
-            f.write(f"{'CORE':<35}, {'':<10}, {core_metric:<10.6f}\n")
-        # Print the content of the csv file to console too
+    # Results to log
+    core_results = None
+    bpb_results = {}
+    samples = []
+    unconditioned_samples = []
+
+    # --- Sampling ---
+    if 'sample' in eval_modes and not is_hf_model:
+        print0("\n" + "="*80)
+        print0("Model Samples")
        print0("="*80)
-        print0(f"Model: {model_name}")
-        print0("="*80)
-        with open(output_csv_path, 'r', encoding='utf-8') as f:
-            print0(f.read())
+        if ddp_rank == 0:
+            prompts = [
+                "The capital of France is",
+                "The chemical symbol of gold is",
+                "If yesterday was Friday, then tomorrow will be",
+                "The opposite of hot is",
+                "The planets of the solar system are:",
+                "My favorite color is",
+                "If 5*x + 3 = 13, then x is",
+            ]
+            engine = Engine(model, tokenizer)
+            print0("\nConditioned samples:")
+            for prompt in prompts:
+                tokens = tokenizer(prompt, prepend="<|bos|>")
+                with autocast_ctx:
+                    sample, _ = engine.generate_batch(tokens, num_samples=1, max_tokens=16, temperature=0)
+                sample_str = tokenizer.decode(sample[0])
+                print0("-" * 80)
+                print0(sample_str)
+                samples.append(sample_str)

-    # Log to report
+            print0("\nUnconditioned samples:")
+            tokens = tokenizer("", prepend="<|bos|>")
+            with autocast_ctx:
+                uncond, _ = engine.generate_batch(tokens, num_samples=8, max_tokens=128, temperature=1.0)
+            for sample in uncond:
+                sample_str = tokenizer.decode(sample)
+                print0("-" * 80)
+                print0(sample_str)
+                unconditioned_samples.append(sample_str)
+    elif 'sample' in eval_modes and is_hf_model:
+        print0("\nSkipping sampling for HuggingFace models (not supported)")
+
+    # --- BPB evaluation ---
+    if 'bpb' in eval_modes:
+        print0("\n" + "="*80)
+        print0("BPB Evaluation")
+        print0("="*80)
+        tokens_per_step = args.device_batch_size * sequence_len * ddp_world_size
+        if args.split_tokens % tokens_per_step != 0:
+            # Round down to a multiple of tokens_per_step
+            args.split_tokens = (args.split_tokens // tokens_per_step) * tokens_per_step
+            print0(f"Adjusted split_tokens to {args.split_tokens} (must be divisible by {tokens_per_step})")
+        steps = args.split_tokens // tokens_per_step
+
+        for split_name in ["train", "val"]:
+            loader = tokenizing_distributed_data_loader_bos_bestfit(tokenizer, args.device_batch_size, sequence_len, split_name, device=device)
+            with autocast_ctx:
+                bpb = evaluate_bpb(model, loader, steps, token_bytes)
+            bpb_results[split_name] = bpb
+            print0(f"{split_name} bpb: {bpb:.6f}")
+
+    # --- CORE evaluation ---
+    if 'core' in eval_modes:
+        print0("\n" + "="*80)
+        print0("CORE Evaluation")
+        print0("="*80)
+        with autocast_ctx:
+            core_results = evaluate_core(model, tokenizer, device, max_per_task=args.max_per_task)
+
+        # Write CSV output
+        if ddp_rank == 0:
+            base_dir = get_base_dir()
+            output_csv_path = os.path.join(base_dir, "base_eval", f"{model_slug}.csv")
+            os.makedirs(os.path.dirname(output_csv_path), exist_ok=True)
+            with open(output_csv_path, 'w', encoding='utf-8', newline='') as f:
+                f.write(f"{'Task':<35}, {'Accuracy':<10}, {'Centered':<10}\n")
+                for label in core_results["results"]:
+                    acc = core_results["results"][label]
+                    centered = core_results["centered_results"][label]
+                    f.write(f"{label:<35}, {acc:<10.6f}, {centered:<10.6f}\n")
+                f.write(f"{'CORE':<35}, {'':<10}, {core_results['core_metric']:<10.6f}\n")
+            print0(f"\nResults written to: {output_csv_path}")
+            print0(f"CORE metric: {core_results['core_metric']:.4f}")
+
+    # --- Log to report ---
    from nanochat.report import get_report
-    get_report().log(section="Base model evaluation", data=[
-        {
-            "Model": model_name,
-            "CORE metric": core_metric,
-        },
-        centered_results, # the full table
-    ])
+    report_data = [{"model": model_name}]
+
+    if core_results:
+        report_data[0]["CORE metric"] = core_results["core_metric"]
+        report_data.append(core_results["centered_results"])
+
+    if bpb_results:
+        report_data[0]["train bpb"] = bpb_results.get("train")
+        report_data[0]["val bpb"] = bpb_results.get("val")
+
+    if samples:
+        report_data.append({f"sample {i}": s for i, s in enumerate(samples)})
+    if unconditioned_samples:
+        report_data.append({f"unconditioned {i}": s for i, s in enumerate(unconditioned_samples)})
+
+    get_report().log(section="Base model evaluation", data=report_data)

    compute_cleanup()

+
 if __name__ == "__main__":
    main()
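
Review note: the centering step in `evaluate_core` rescales each task's accuracy so that chance level maps to 0 and a perfect score maps to 1, and the CORE metric is the plain mean of those centered values. The sketch below reproduces that arithmetic in isolation. The helper names `centered_accuracy` and `core_metric` and the task labels are hypothetical, and it assumes, as the `0.01 *` factor in the diff suggests, that the baselines in `eval_meta_data.csv` are stored as percentages.

```python
def centered_accuracy(accuracy: float, random_baseline_pct: float) -> float:
    """Rescale accuracy so the random baseline maps to 0.0 and a perfect score to 1.0.

    random_baseline_pct is assumed to be a percentage (e.g. 25.0 for 4-way choice).
    """
    baseline = 0.01 * random_baseline_pct
    return (accuracy - baseline) / (1.0 - baseline)


def core_metric(centered_results: dict) -> float:
    """Unweighted mean of the per-task centered accuracies."""
    return sum(centered_results.values()) / len(centered_results)


# Hypothetical task labels, for illustration only
centered = {
    "four_way_at_chance": centered_accuracy(0.25, 25.0),  # chance level
    "four_way_perfect": centered_accuracy(1.0, 25.0),     # perfect score
}
score = core_metric(centered)
```

A task scored below chance yields a negative centered value, so a single weak task can pull the mean down.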
--- a/scripts/base_loss.py
+++ b/scripts/base_loss.py
@ -1,155 +0,0 @@
-"""
-Loads a checkpoint, and:
- Evaluates the loss on a larger chunk of train/val splits
- Samples from the model
-
-Example run as:
-torchrun --standalone --nproc_per_node=8 -m scripts.base_loss
-
-To evaluate a HuggingFace model:
-python -m scripts.base_loss --hf-path openai-community/gpt2
-"""
-import argparse
-from contextlib import nullcontext
-import torch
-from nanochat.checkpoint_manager import load_model
-from nanochat.common import compute_init, print0, compute_cleanup, autodetect_device_type
-from nanochat.dataloader import tokenizing_distributed_data_loader_bos_bestfit
-from nanochat.tokenizer import get_token_bytes, HuggingFaceTokenizer
-from nanochat.loss_eval import evaluate_bpb
-from nanochat.engine import Engine
-
-# -----------------------------------------------------------------------------
-# HuggingFace loading utilities, making the APIs match up to those of nanochat
-
-class ModelWrapper:
-    """Lightweight wrapper for a HuggingFace model"""
-    def __init__(self, model, max_seq_len=None):
-        self.model = model
-        self.max_seq_len = max_seq_len
-
-    def __call__(self, input_ids, targets=None, loss_reduction='mean'):
-        logits = self.model(input_ids).logits
-        if targets is None:
-            return logits
-        else:
-            loss = torch.nn.functional.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1, reduction=loss_reduction)
-            return loss
-
-    def get_device(self):
-        return next(self.model.parameters()).device
-
-def load_hf_model(hf_path: str, device):
-    print0(f"Loading model from: {hf_path}")
-    from transformers import AutoModelForCausalLM
-    model = AutoModelForCausalLM.from_pretrained(hf_path)
-    model.to(device)
-    model.eval()
-    max_seq_len = 1024 if "openai-community/gpt2" in hf_path else None
-    model = ModelWrapper(model, max_seq_len=max_seq_len)
-    tokenizer = HuggingFaceTokenizer.from_pretrained(hf_path)
-    return model, tokenizer
-
-def get_hf_token_bytes(tokenizer, device="cpu"):
-    """Compute token_bytes tensor for a HuggingFace tokenizer."""
-    vocab_size = tokenizer.tokenizer.get_vocab_size()
-    token_bytes = torch.zeros(vocab_size, dtype=torch.int64, device=device)
-    for token_id in range(vocab_size):
-        token_str = tokenizer.tokenizer.decode([token_id])
-        token_bytes[token_id] = len(token_str.encode('utf-8')) # Count UTF-8 bytes
-    return token_bytes
-
-# CLI arguments
-parser = argparse.ArgumentParser(description="Evaluate loss on train/val splits and sample from model")
-parser.add_argument("--device-batch-size", type=int, default=32, help="per-device batch size")
-parser.add_argument("--split-tokens", type=int, default=40*524288, help="number of tokens to evaluate per split")
-parser.add_argument("--model-tag", type=str, default=None, help="model tag for checkpoint directory")
-parser.add_argument("--model-step", type=int, default=None, help="model step to load")
-parser.add_argument("--device-type", type=str, default="", help="cuda|cpu|mps (empty = autodetect)")
-parser.add_argument("--hf-path", type=str, default=None, help="HuggingFace model path (e.g. openai-community/gpt2)")
-args = parser.parse_args()
-
-# Load the base model and the tokenizer
-device_type = autodetect_device_type() if args.device_type == "" else args.device_type
-ddp, ddp_rank, ddp_local_rank, ddp_world_size, device = compute_init(device_type)
-print0(f"Device: {device} | DDP rank: {ddp_rank} | DDP local rank: {ddp_local_rank} | DDP world size: {ddp_world_size}")
-
-if args.hf_path is not None:
-    # Load HuggingFace model
-    model, tokenizer = load_hf_model(args.hf_path, device)
-    sequence_len = model.max_seq_len if model.max_seq_len else 1024
-    token_bytes = get_hf_token_bytes(tokenizer, device=device)
-    model_name = args.hf_path
-else:
-    # Load local nanochat model
-    model, tokenizer, meta = load_model("base", device, phase="eval", model_tag=args.model_tag, step=args.model_step)
-    sequence_len = meta["model_config"]["sequence_len"]
-    token_bytes = get_token_bytes(device=device)
-    model_name = f"base_model (step {meta['step']})"
-
-autocast_ctx = torch.amp.autocast(device_type=device_type, dtype=torch.bfloat16) if device_type == "cuda" else nullcontext()
-
-print0(f"Evaluating model: {model_name}")
-
-# Evaluate the loss on each split
-tokens_per_step = args.device_batch_size * sequence_len * ddp_world_size
-assert args.split_tokens % tokens_per_step == 0, "split_tokens must be divisible by tokens_per_step"
-steps = args.split_tokens // tokens_per_step
-bpb_results = {}
-for split_name in ["train", "val"]:
-    loader = tokenizing_distributed_data_loader_bos_bestfit(tokenizer, args.device_batch_size, sequence_len, split_name, device=device)
-    with autocast_ctx:
-        bpb = evaluate_bpb(model, loader, steps, token_bytes)
-    print0(f"{split_name} bpb: {bpb:.4f}")
-    bpb_results[split_name] = bpb
-    print0(f"Model: {model_name}, {split_name} bpb: {bpb:.6f}")
-
-# Master process also samples from the model for some basic knowledge-eliciting prompts (only for nanochat models)
-samples = []
-if ddp_rank == 0 and args.hf_path is None:
-    prompts = [
-        "The capital of France is",
-        "The chemical symbol of gold is",
-        "If yesterday was Friday, then tomorrow will be",
-        "The opposite of hot is",
-        "The planets of the solar system are:",
-        "My favorite color is",
-        "If 5*x + 3 = 13, then x is",
-    ]
-    engine = Engine(model, tokenizer)
-    for prompt in prompts:
-        tokens = tokenizer(prompt, prepend="<|bos|>")
-        with autocast_ctx:
-            sample, _ = engine.generate_batch(tokens, num_samples=1, max_tokens=16, temperature=0)
-        sample_str = tokenizer.decode(sample[0])
-        print0("-" * 80)
-        print0(sample_str)
-        samples.append(sample_str)
-
-# Draw some unconditioned samples from the model (only for nanochat models)
-unconditioned_samples = []
-if ddp_rank == 0 and args.hf_path is None:
-    engine = Engine(model, tokenizer)
-    tokens = tokenizer("", prepend="<|bos|>")
-    with autocast_ctx:
-        samples, _ = engine.generate_batch(tokens, num_samples=8, max_tokens=128, temperature=1.0)
-    for sample in samples:
-        sample_str = tokenizer.decode(sample)
-        print0("-" * 80)
-        print0(sample_str)
-        unconditioned_samples.append(sample_str)
-
-# Log to report
-from nanochat.report import get_report
-get_report().log(section="Base model loss", data=[
-    {
-        "model": model_name,
-        "train bpb": bpb_results["train"],
-        "val bpb": bpb_results["val"],
-    },
-    {f"sample {i}": sample for i, sample in enumerate(samples)},
-    {f"unconditioned sample {i}": sample for i, sample in enumerate(unconditioned_samples)},
-])
-
-# Cleanup
-compute_cleanup()
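
Review note: the deleted script asserted that `split_tokens` was divisible by `tokens_per_step`, whereas the merged `base_eval.py` rounds `split_tokens` down to a multiple instead. A minimal sketch of that rounding follows. The function name `round_split_tokens` is hypothetical and the batch sizes are illustrative values, not settings taken from the repo.

```python
def round_split_tokens(split_tokens: int, device_batch_size: int,
                       sequence_len: int, world_size: int):
    """Round split_tokens down to a multiple of tokens_per_step.

    Returns (adjusted_split_tokens, steps).
    """
    tokens_per_step = device_batch_size * sequence_len * world_size
    adjusted = (split_tokens // tokens_per_step) * tokens_per_step
    steps = adjusted // tokens_per_step
    return adjusted, steps


default_tokens = 40 * 524288  # the default of --split-tokens in the diff

# Divisible case: nothing changes
print(round_split_tokens(default_tokens, 32, 2048, 8))  # (20971520, 40)

# Non-divisible case: rounded down, never up
print(round_split_tokens(default_tokens, 24, 2048, 8))  # (20840448, 53)
```

Because the adjustment only rounds down, a `split_tokens` smaller than one step's worth of tokens would come out as zero steps, so it may be worth guarding against that case.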
--- a/scripts/base_train.py
+++ b/scripts/base_train.py
@ -11,11 +11,12 @@ If you are only on CPU/Macbook, you'll want to train a much much smaller LLM. Ex
 python -m scripts.base_train --depth=4 --max-seq-len=512 --device-batch-size=1 --eval-tokens=512 --core-metric-every=-1 --total-batch-size=512 --num-iterations=20
 """

+import gc
 import os
 os.environ["PYTORCH_ALLOC_CONF"] = "expandable_segments:True"
 import argparse
 import time
-from contextlib import nullcontext
+from contextlib import nullcontext, contextmanager

 import wandb
 import torch
@ -28,7 +29,7 @@ from nanochat.checkpoint_manager import save_checkpoint, load_checkpoint
 from nanochat.loss_eval import evaluate_bpb
 from nanochat.engine import Engine
 from nanochat.flash_attention import HAS_FA3
-from scripts.base_eval import evaluate_model
+from scripts.base_eval import evaluate_core
 print_banner()

 # -----------------------------------------------------------------------------
@ -38,6 +39,9 @@ parser = argparse.ArgumentParser(description="Pretrain base model")
 parser.add_argument("--run", type=str, default="dummy", help="wandb run name ('dummy' disables wandb logging)")
 # Runtime
 parser.add_argument("--device-type", type=str, default="", help="cuda|cpu|mps (empty = autodetect)")
+# FP8 training
+parser.add_argument("--fp8", action="store_true", help="enable FP8 training (requires H100+ GPU and torchao)")
+parser.add_argument("--fp8-recipe", type=str, default="tensorwise", choices=["rowwise", "tensorwise"], help="FP8 scaling recipe: tensorwise (faster, recommended) or rowwise (more accurate but slower)")
 # Model architecture
 parser.add_argument("--depth", type=int, default=20, help="depth of the Transformer model")
 parser.add_argument("--aspect-ratio", type=int, default=64, help="model_dim = depth * aspect_ratio")
@ -64,7 +68,7 @@ parser.add_argument("--final-lr-frac", type=float, default=0.0, help="final LR a
 parser.add_argument("--resume-from-step", type=int, default=-1, help="resume training from this step (-1 = disable)")
 # Evaluation
 parser.add_argument("--eval-every", type=int, default=250, help="evaluate val bpb every N steps (-1 = disable)")
-parser.add_argument("--eval-tokens", type=int, default=20*524288, help="number of tokens to evaluate val loss on")
+parser.add_argument("--eval-tokens", type=int, default=40*524288, help="number of tokens to evaluate val loss on")
 parser.add_argument("--core-metric-every", type=int, default=2000, help="evaluate CORE metric every N steps (-1 = disable)")
 parser.add_argument("--core-metric-max-per-task", type=int, default=500, help="examples per task for CORE metric")
 parser.add_argument("--sample-every", type=int, default=2000, help="sample from model every N steps (-1 = disable)")
@ -176,11 +180,11 @@ if resuming:
    model.load_state_dict(model_data, strict=True, assign=True)
    del model_data # free up this memory after the copy

-orig_model = model # original, uncompiled model, for saving raw model state_dict and for inference/evaluation (because the shapes may change shape)
-model = torch.compile(model, dynamic=False) # the inputs to model will never change shape so dynamic=False is safe
+# -----------------------------------------------------------------------------
+# Determine the length of the training run based on model size

 # Detailed parameter counts
-param_counts = orig_model.num_scaling_params()
+param_counts = model.num_scaling_params()
 print0(f"Parameter counts:")
 for key, value in param_counts.items():
    print0(f"{key:24s}: {value:,}")
@ -210,6 +214,85 @@ print0(f"Total number of training tokens: {total_tokens:,}")
 print0(f"Tokens : Scaling params ratio: {args.total_batch_size * num_iterations / num_scaling_params:.2f}") # Chinchilla is ~20
 print0(f"Total training FLOPs estimate: {num_flops_per_token * total_tokens:e}")

+# -----------------------------------------------------------------------------
+# FP8 training initialization and management (has to be done before torch.compile)
+
+# Convert Linear layers to Float8Linear if --fp8 is set
+if args.fp8:
+    if device_type != "cuda":
+        print0("Warning: FP8 training requires CUDA, ignoring --fp8 flag")
+    else:
+        from torchao.float8 import Float8LinearConfig, convert_to_float8_training
+        import torch.nn as nn
+
+        # Filter: only convert layers with dimensions divisible by 16 (FP8 hardware requirement)
+        def fp8_module_filter(mod: nn.Module, fqn: str) -> bool:
+            if not isinstance(mod, nn.Linear):
+                return False
+            # FP8 requires both in_features and out_features divisible by 16
+            if mod.in_features % 16 != 0 or mod.out_features % 16 != 0:
+                return False
+            return True
+
+        fp8_config = Float8LinearConfig.from_recipe_name(args.fp8_recipe)
+        convert_to_float8_training(model, config=fp8_config, module_filter_fn=fp8_module_filter)
+        num_fp8_layers = sum(1 for m in model.modules() if 'Float8' in type(m).__name__)
+        num_skipped = sum(1 for m in model.modules() if isinstance(m, nn.Linear)) - num_fp8_layers
+        print0(f"✓ FP8 training enabled ({args.fp8_recipe} scaling) - converted {num_fp8_layers} layers, skipped {num_skipped} (dims not divisible by 16)")
+
+# Context manager to temporarily disable FP8 so that model evaluation remains in BF16
+@contextmanager
+def disable_fp8(model):
+    """Temporarily swap Float8Linear modules with nn.Linear for BF16 evaluation.
+
+    CastConfig is a frozen dataclass, so we can't mutate scaling_type. Instead,
+    we swap out Float8Linear modules entirely and restore them after.
+    """
+    import torch.nn as nn
+
+    # Find all Float8Linear modules and their locations
+    fp8_locations = []  # list of (parent_module, attr_name, fp8_module)
+    for name, module in model.named_modules():
+        if 'Float8' in type(module).__name__:
+            if '.' in name:
+                parent_name, attr_name = name.rsplit('.', 1)
+                parent = model.get_submodule(parent_name)
+            else:
+                parent = model
+                attr_name = name
+            fp8_locations.append((parent, attr_name, module))
+
+    if not fp8_locations:
+        yield  # No FP8 modules, nothing to do
+        return
+
+    # Swap Float8Linear -> nn.Linear (shares the same weight tensor, no copy)
+    for parent, attr_name, fp8_module in fp8_locations:
+        linear = nn.Linear(
+            fp8_module.in_features,
+            fp8_module.out_features,
+            bias=fp8_module.bias is not None,
+            device=fp8_module.weight.device,
+            dtype=fp8_module.weight.dtype,
+        )
+        linear.weight = fp8_module.weight  # share, don't copy
+        if fp8_module.bias is not None:
+            linear.bias = fp8_module.bias
+        setattr(parent, attr_name, linear)
+
+    try:
+        yield
+    finally:
+        # Restore Float8Linear modules
+        for parent, attr_name, fp8_module in fp8_locations:
+            setattr(parent, attr_name, fp8_module)
+
+# -----------------------------------------------------------------------------
+# Compile the model
+
+orig_model = model # original, uncompiled model, for saving raw model state_dict and for inference/evaluation (because the input shapes may change)
+model = torch.compile(model, dynamic=False) # the inputs to model will never change shape so dynamic=False is safe
+
 # -----------------------------------------------------------------------------
 # Initialize the Optimizer (combined MuonAdamW: Muon for matrix params, AdamW for rest)
 adam_betas = (args.adam_beta1, args.adam_beta2)
@ -299,7 +382,7 @@ while True:
        model.eval()
        val_loader = build_val_loader()
        eval_steps = args.eval_tokens // (args.device_batch_size * args.max_seq_len * ddp_world_size)
-        with autocast_ctx:
+        with disable_fp8(model), autocast_ctx:
            val_bpb = evaluate_bpb(model, val_loader, eval_steps, token_bytes)
        print0(f"Step {step:05d} | Validation bpb: {val_bpb:.6f}")
        if val_bpb < min_val_bpb:
@ -314,11 +397,12 @@ while True:

    # once in a while: estimate the CORE metric (all ranks participate)
    # use the original uncompiled model because the inputs keep changing shape
+    # disable FP8 for evaluation to use BF16 for more consistent/accurate results
    results = {}
    if args.core_metric_every > 0 and (last_step or (step > 0 and step % args.core_metric_every == 0)):
        model.eval()
-        with autocast_ctx:
-            results = evaluate_model(orig_model, tokenizer, device, max_per_task=args.core_metric_max_per_task)
+        with disable_fp8(orig_model), autocast_ctx:
+            results = evaluate_core(orig_model, tokenizer, device, max_per_task=args.core_metric_max_per_task)
        print0(f"Step {step:05d} | CORE metric: {results['core_metric']:.4f}")
        wandb_run.log({
            "step": step,
@ -344,7 +428,7 @@ while True:
        engine = Engine(orig_model, tokenizer) # use orig_model to avoid recompilation
        for prompt in prompts:
            tokens = tokenizer(prompt, prepend="<|bos|>")
-            with autocast_ctx:
+            with disable_fp8(orig_model), autocast_ctx:
                sample, _ = engine.generate_batch(tokens, num_samples=1, max_tokens=16, temperature=0)
            print0(tokenizer.decode(sample[0]))
        model.train()
@ -442,8 +526,19 @@ while True:
        wandb_run.log(log_data)

    # state update
+    first_step_of_run = (step == 0) or (resuming and step == args.resume_from_step)
    step += 1

+    # The garbage collector is sadly a little bit overactive and for some poorly understood reason,
+    # it spends ~500ms scanning for cycles quite frequently, just to end up cleaning up very few tiny objects each time.
+    # So we manually manage and help it out here
+    if first_step_of_run:
+        gc.collect() # manually collect a lot of garbage from setup
+        gc.freeze() # immediately freeze all currently surviving objects and exclude them from GC
+        gc.disable() # nuclear intervention here: disable GC entirely except:
+    elif step % 5000 == 0: # every 5000 steps...
+        gc.collect() # manually collect, just to be safe for very, very long runs
+
 # print a few more stats
 print0(f"Peak memory usage: {get_max_memory() / 1024 / 1024:.2f}MiB")
 print0(f"Total training time: {total_training_time/60:.2f}m")
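
Note on the garbage-collection change in the hunk above: the pattern can be reproduced in isolation with the standard-library `gc` module. The sketch below is illustrative only. The helper name `manage_gc` and its `collect_every` parameter are hypothetical and do not appear in the commit. It assumes, as in the diff, that `step` has already been incremented when the check runs.

```python
import gc


def manage_gc(step: int, first_step_of_run: bool, collect_every: int = 5000) -> str:
    """Hypothetical helper mirroring the GC policy in the hunk above.

    Returns a short label for the action taken so callers can log or test it.
    """
    if first_step_of_run:
        gc.collect()   # reclaim garbage left over from setup
        gc.freeze()    # move all surviving objects to the permanent generation
        gc.disable()   # turn off automatic cyclic collection
        return "frozen"
    if step % collect_every == 0:
        gc.collect()   # occasional manual pass as a safety net on long runs
        return "collected"
    return "noop"
```

Calling `manage_gc(step, first_step_of_run)` once per iteration after `step += 1` gives the same control flow as the diff. `gc.freeze()` requires Python 3.7 or newer. Reference counting still frees most objects while automatic collection is disabled, so only reference cycles wait for the periodic manual pass.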
--- a/scripts/chat_cli.py
+++ b/scripts/chat_cli.py
@ -12,7 +12,7 @@ from nanochat.engine import Engine
 from nanochat.checkpoint_manager import load_model

 parser = argparse.ArgumentParser(description='Chat with the model')
-parser.add_argument('-i', '--source', type=str, default="sft", help="Source of the model: sft|mid|rl")
+parser.add_argument('-i', '--source', type=str, default="sft", help="Source of the model: sft|rl")
 parser.add_argument('-g', '--model-tag', type=str, default=None, help='Model tag to load')
 parser.add_argument('-s', '--step', type=int, default=None, help='Step to load')
 parser.add_argument('-p', '--prompt', type=str, default='', help='Prompt the model, get a single response back')
--- a/scripts/chat_eval.py
+++ b/scripts/chat_eval.py
@ -183,7 +183,7 @@ if __name__ == "__main__":

    # Parse command-line arguments
    parser = argparse.ArgumentParser()
-    parser.add_argument('-i', '--source', type=str, required=True, help="Source of the model: sft|mid|rl")
+    parser.add_argument('-i', '--source', type=str, required=True, help="Source of the model: sft|rl")
    parser.add_argument('-a', '--task-name', type=str, default=None, help="Task name. Default = all tasks. Use | to split multiple tasks.")
    parser.add_argument('-d', '--dtype', type=str, default='bfloat16', choices=['float32', 'bfloat16'])
    parser.add_argument('-t', '--temperature', type=float, default=0.0)
--- a/scripts/chat_rl.py
+++ b/scripts/chat_rl.py
@ -38,7 +38,6 @@ parser.add_argument("--run", type=str, default="dummy", help="wandb run name ('d
 parser.add_argument("--device-type", type=str, default="", help="cuda|cpu|mps (empty = autodetect)")
 parser.add_argument("--dtype", type=str, default="bfloat16", help="float32|bfloat16")
 # Model loading
-parser.add_argument("--source", type=str, default="sft", help="mid|sft - which checkpoint to load from")
 parser.add_argument("--model-tag", type=str, default=None, help="model tag to load from")
 parser.add_argument("--model-step", type=int, default=None, help="model step to load from")
 # Training horizon
@ -77,7 +76,7 @@ use_dummy_wandb = args.run == "dummy" or not master_process
 wandb_run = DummyWandb() if use_dummy_wandb else wandb.init(project="nanochat-rl", name=args.run, config=user_config)

 # Init model and tokenizer
-model, tokenizer, meta = load_model(args.source, device, phase="eval", model_tag=args.model_tag, step=args.model_step)
+model, tokenizer, meta = load_model("sft", device, phase="eval", model_tag=args.model_tag, step=args.model_step)
 engine = Engine(model, tokenizer) # for sampling rollouts

 # -----------------------------------------------------------------------------
--- a/scripts/chat_sft.py
+++ b/scripts/chat_sft.py
@ -48,7 +48,7 @@ parser.add_argument("--max-seq-len", type=int, default=2048, help="max context l
 parser.add_argument("--device-batch-size", type=int, default=32, help="per-device batch size")
 parser.add_argument("--total-batch-size", type=int, default=524288, help="total batch size in tokens")
 # Optimization
-parser.add_argument("--embedding-lr", type=float, default=0.2, help="learning rate for embedding parameters (Adam)")
+parser.add_argument("--embedding-lr", type=float, default=0.3, help="learning rate for embedding parameters (Adam)")
 parser.add_argument("--unembedding-lr", type=float, default=0.004, help="learning rate for unembedding parameters (Adam)")
 parser.add_argument("--matrix-lr", type=float, default=0.02, help="learning rate for matrix parameters (Muon)")
 parser.add_argument("--weight-decay", type=float, default=0.0, help="weight decay for embedding/unembedding parameters (Adam)")
@ -285,7 +285,7 @@ while True:
    # save checkpoint at the end of the run (only on master process)
    if master_process and last_step and not args.dry_run:
        output_dirname = args.model_tag if args.model_tag else f"d{depth}" # e.g. d12
-        checkpoint_dir = os.path.join(base_dir, "sft_checkpoints", output_dirname)
+        checkpoint_dir = os.path.join(base_dir, "chatsft_checkpoints", output_dirname)
        save_checkpoint(
            checkpoint_dir,
            step,
@ -301,6 +301,7 @@ while True:
                    "n_head": model.config.n_head,
                    "n_kv_head": model.config.n_kv_head,
                    "n_embd": model.config.n_embd,
+                    "window_pattern": model.config.window_pattern,
                },
                "user_config": user_config, # inputs to the training script
            }
--- a/scripts/chat_web.py
+++ b/scripts/chat_web.py
@ -62,7 +62,7 @@ MAX_MAX_TOKENS = 4096

 parser = argparse.ArgumentParser(description='NanoChat Web Server')
 parser.add_argument('-n', '--num-gpus', type=int, default=1, help='Number of GPUs to use (default: 1)')
-parser.add_argument('-i', '--source', type=str, default="sft", help="Source of the model: sft|mid|rl")
+parser.add_argument('-i', '--source', type=str, default="sft", help="Source of the model: sft|rl")
 parser.add_argument('-t', '--temperature', type=float, default=0.8, help='Default temperature for generation')
 parser.add_argument('-k', '--top-k', type=int, default=50, help='Default top-k sampling parameter')
 parser.add_argument('-m', '--max-tokens', type=int, default=512, help='Default max tokens for generation')
--- a/tasks/spellingbee.py
+++ b/tasks/spellingbee.py
@ -20,7 +20,7 @@ LLM because it has to learn how every token (a little semantic chunk/atom)
 maps to the sequence of individual characters that make it up. Larger models
 learn this eventually on their own, but if we want this capability to exist
 in smaller models, we have to actively encourage it by over-representing it
-in the training data. Midtraining is a good place to do this.
+in the training data. SFT is a good place to do this.

 To preview a few example conversations, run:
 python -m tasks.spellingbee
--- a/uv.lock
+++ b/uv.lock
@ -1505,11 +1505,11 @@ dependencies = [
    { name = "tabulate" },
    { name = "tiktoken" },
    { name = "tokenizers" },
-    { name = "torch", version = "2.9.0", source = { registry = "https://pypi.org/simple" }, marker = "(sys_platform == 'linux' and extra != 'extra-8-nanochat-cpu' and extra != 'extra-8-nanochat-gpu') or (extra == 'extra-8-nanochat-cpu' and extra == 'extra-8-nanochat-gpu')" },
    { name = "torch", version = "2.9.1", source = { registry = "https://download.pytorch.org/whl/cpu" }, marker = "(sys_platform == 'darwin' and extra == 'extra-8-nanochat-cpu') or (extra == 'extra-8-nanochat-cpu' and extra == 'extra-8-nanochat-gpu')" },
-    { name = "torch", version = "2.9.1", source = { registry = "https://pypi.org/simple" }, marker = "(sys_platform != 'linux' and extra != 'extra-8-nanochat-cpu' and extra != 'extra-8-nanochat-gpu') or (extra == 'extra-8-nanochat-cpu' and extra == 'extra-8-nanochat-gpu')" },
+    { name = "torch", version = "2.9.1", source = { registry = "https://pypi.org/simple" }, marker = "(extra == 'extra-8-nanochat-cpu' and extra == 'extra-8-nanochat-gpu') or (extra != 'extra-8-nanochat-cpu' and extra != 'extra-8-nanochat-gpu')" },
    { name = "torch", version = "2.9.1+cpu", source = { registry = "https://download.pytorch.org/whl/cpu" }, marker = "(sys_platform != 'darwin' and extra == 'extra-8-nanochat-cpu') or (extra == 'extra-8-nanochat-cpu' and extra == 'extra-8-nanochat-gpu')" },
    { name = "torch", version = "2.9.1+cu128", source = { registry = "https://download.pytorch.org/whl/cu128" }, marker = "extra == 'extra-8-nanochat-gpu'" },
+    { name = "torchao" },
    { name = "transformers" },
    { name = "uvicorn" },
    { name = "wandb" },
@ -1546,9 +1546,10 @@ requires-dist = [
    { name = "tabulate", specifier = ">=0.9.0" },
    { name = "tiktoken", specifier = ">=0.11.0" },
    { name = "tokenizers", specifier = ">=0.22.0" },
-    { name = "torch", specifier = ">=2.9.0" },
-    { name = "torch", marker = "extra == 'cpu'", specifier = ">=2.9.1", index = "https://download.pytorch.org/whl/cpu", conflict = { package = "nanochat", extra = "cpu" } },
-    { name = "torch", marker = "extra == 'gpu'", specifier = ">=2.9.1", index = "https://download.pytorch.org/whl/cu128", conflict = { package = "nanochat", extra = "gpu" } },
+    { name = "torch", specifier = "==2.9.1" },
+    { name = "torch", marker = "extra == 'cpu'", specifier = "==2.9.1", index = "https://download.pytorch.org/whl/cpu", conflict = { package = "nanochat", extra = "cpu" } },
+    { name = "torch", marker = "extra == 'gpu'", specifier = "==2.9.1", index = "https://download.pytorch.org/whl/cu128", conflict = { package = "nanochat", extra = "gpu" } },
+    { name = "torchao", specifier = "==0.15.0" },
    { name = "transformers", specifier = ">=4.57.3" },
    { name = "uvicorn", specifier = ">=0.36.0" },
    { name = "wandb", specifier = ">=0.21.3" },
@ -1688,7 +1689,7 @@ name = "nvidia-cudnn-cu12"
 version = "9.10.2.21"
 source = { registry = "https://pypi.org/simple" }
 dependencies = [
-    { name = "nvidia-cublas-cu12", marker = "(sys_platform == 'linux' and extra != 'extra-8-nanochat-cpu') or (extra == 'extra-8-nanochat-cpu' and extra == 'extra-8-nanochat-gpu')" },
+    { name = "nvidia-cublas-cu12", marker = "(sys_platform == 'linux' and extra == 'extra-8-nanochat-gpu') or (extra == 'extra-8-nanochat-cpu' and extra == 'extra-8-nanochat-gpu')" },
 ]
 wheels = [
    { url = "https://files.pythonhosted.org/packages/fa/41/e79269ce215c857c935fd86bcfe91a451a584dfc27f1e068f568b9ad1ab7/nvidia_cudnn_cu12-9.10.2.21-py3-none-manylinux_2_27_aarch64.whl", hash = "sha256:c9132cc3f8958447b4910a1720036d9eff5928cc3179b0a51fb6d167c6cc87d8", size = 705026878, upload-time = "2025-06-06T21:52:51.348Z" },
@ -1701,7 +1702,7 @@ name = "nvidia-cufft-cu12"
 version = "11.3.3.83"
 source = { registry = "https://pypi.org/simple" }
 dependencies = [
-    { name = "nvidia-nvjitlink-cu12", marker = "(sys_platform == 'linux' and extra != 'extra-8-nanochat-cpu') or (extra == 'extra-8-nanochat-cpu' and extra == 'extra-8-nanochat-gpu')" },
+    { name = "nvidia-nvjitlink-cu12", marker = "(sys_platform == 'linux' and extra == 'extra-8-nanochat-gpu') or (extra == 'extra-8-nanochat-cpu' and extra == 'extra-8-nanochat-gpu')" },
 ]
 wheels = [
    { url = "https://files.pythonhosted.org/packages/60/bc/7771846d3a0272026c416fbb7e5f4c1f146d6d80704534d0b187dd6f4800/nvidia_cufft_cu12-11.3.3.83-py3-none-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:848ef7224d6305cdb2a4df928759dca7b1201874787083b6e7550dd6765ce69a", size = 193109211, upload-time = "2025-03-07T01:44:56.873Z" },
@ -1733,9 +1734,9 @@ name = "nvidia-cusolver-cu12"
 version = "11.7.3.90"
 source = { registry = "https://pypi.org/simple" }
 dependencies = [
-    { name = "nvidia-cublas-cu12", marker = "(sys_platform == 'linux' and extra != 'extra-8-nanochat-cpu') or (extra == 'extra-8-nanochat-cpu' and extra == 'extra-8-nanochat-gpu')" },
-    { name = "nvidia-cusparse-cu12", marker = "(sys_platform == 'linux' and extra != 'extra-8-nanochat-cpu') or (extra == 'extra-8-nanochat-cpu' and extra == 'extra-8-nanochat-gpu')" },
-    { name = "nvidia-nvjitlink-cu12", marker = "(sys_platform == 'linux' and extra != 'extra-8-nanochat-cpu') or (extra == 'extra-8-nanochat-cpu' and extra == 'extra-8-nanochat-gpu')" },
+    { name = "nvidia-cublas-cu12", marker = "(sys_platform == 'linux' and extra == 'extra-8-nanochat-gpu') or (extra == 'extra-8-nanochat-cpu' and extra == 'extra-8-nanochat-gpu')" },
+    { name = "nvidia-cusparse-cu12", marker = "(sys_platform == 'linux' and extra == 'extra-8-nanochat-gpu') or (extra == 'extra-8-nanochat-cpu' and extra == 'extra-8-nanochat-gpu')" },
+    { name = "nvidia-nvjitlink-cu12", marker = "(sys_platform == 'linux' and extra == 'extra-8-nanochat-gpu') or (extra == 'extra-8-nanochat-cpu' and extra == 'extra-8-nanochat-gpu')" },
 ]
 wheels = [
    { url = "https://files.pythonhosted.org/packages/c8/32/f7cd6ce8a7690544d084ea21c26e910a97e077c9b7f07bf5de623ee19981/nvidia_cusolver_cu12-11.7.3.90-py3-none-manylinux_2_27_aarch64.whl", hash = "sha256:db9ed69dbef9715071232caa9b69c52ac7de3a95773c2db65bdba85916e4e5c0", size = 267229841, upload-time = "2025-03-07T01:46:54.356Z" },
@ -1748,7 +1749,7 @@ name = "nvidia-cusparse-cu12"
 version = "12.5.8.93"
 source = { registry = "https://pypi.org/simple" }
 dependencies = [
-    { name = "nvidia-nvjitlink-cu12", marker = "(sys_platform == 'linux' and extra != 'extra-8-nanochat-cpu') or (extra == 'extra-8-nanochat-cpu' and extra == 'extra-8-nanochat-gpu')" },
+    { name = "nvidia-nvjitlink-cu12", marker = "(sys_platform == 'linux' and extra == 'extra-8-nanochat-gpu') or (extra == 'extra-8-nanochat-cpu' and extra == 'extra-8-nanochat-gpu')" },
 ]
 wheels = [
    { url = "https://files.pythonhosted.org/packages/bc/f7/cd777c4109681367721b00a106f491e0d0d15cfa1fd59672ce580ce42a97/nvidia_cusparse_cu12-12.5.8.93-py3-none-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:9b6c161cb130be1a07a27ea6923df8141f3c295852f4b260c65f18f3e0a091dc", size = 288117129, upload-time = "2025-03-07T01:47:40.407Z" },
@ -2990,72 +2991,6 @@ wheels = [
    { url = "https://files.pythonhosted.org/packages/6e/c2/61d3e0f47e2b74ef40a68b9e6ad5984f6241a942f7cd3bbfbdbd03861ea9/tomli-2.2.1-py3-none-any.whl", hash = "sha256:cb55c73c5f4408779d0cf3eef9f762b9c9f147a77de7b258bef0a5628adc85cc", size = 14257, upload-time = "2024-11-27T22:38:35.385Z" },
 ]

-[[package]]
-name = "torch"
-version = "2.9.0"
-source = { registry = "https://pypi.org/simple" }
-resolution-markers = [
-    "python_full_version >= '3.12' and sys_platform == 'linux'",
-    "python_full_version == '3.11.*' and sys_platform == 'linux'",
-    "python_full_version < '3.11' and sys_platform == 'linux'",
-]
-dependencies = [
-    { name = "filelock", marker = "(sys_platform == 'linux' and extra != 'extra-8-nanochat-cpu' and extra != 'extra-8-nanochat-gpu') or (extra == 'extra-8-nanochat-cpu' and extra == 'extra-8-nanochat-gpu')" },
-    { name = "fsspec", marker = "(sys_platform == 'linux' and extra != 'extra-8-nanochat-cpu' and extra != 'extra-8-nanochat-gpu') or (extra == 'extra-8-nanochat-cpu' and extra == 'extra-8-nanochat-gpu')" },
-    { name = "jinja2", marker = "(sys_platform == 'linux' and extra != 'extra-8-nanochat-cpu' and extra != 'extra-8-nanochat-gpu') or (extra == 'extra-8-nanochat-cpu' and extra == 'extra-8-nanochat-gpu')" },
-    { name = "networkx", version = "3.4.2", source = { registry = "https://pypi.org/simple" }, marker = "(python_full_version < '3.11' and sys_platform == 'linux' and extra != 'extra-8-nanochat-cpu' and extra != 'extra-8-nanochat-gpu') or (extra == 'extra-8-nanochat-cpu' and extra == 'extra-8-nanochat-gpu')" },
-    { name = "networkx", version = "3.5", source = { registry = "https://pypi.org/simple" }, marker = "(python_full_version >= '3.11' and sys_platform == 'linux' and extra != 'extra-8-nanochat-cpu' and extra != 'extra-8-nanochat-gpu') or (extra == 'extra-8-nanochat-cpu' and extra == 'extra-8-nanochat-gpu')" },
-    { name = "nvidia-cublas-cu12", marker = "(platform_machine == 'x86_64' and sys_platform == 'linux' and extra != 'extra-8-nanochat-cpu' and extra != 'extra-8-nanochat-gpu') or (extra == 'extra-8-nanochat-cpu' and extra == 'extra-8-nanochat-gpu')" },
-    { name = "nvidia-cuda-cupti-cu12", marker = "(platform_machine == 'x86_64' and sys_platform == 'linux' and extra != 'extra-8-nanochat-cpu' and extra != 'extra-8-nanochat-gpu') or (extra == 'extra-8-nanochat-cpu' and extra == 'extra-8-nanochat-gpu')" },
-    { name = "nvidia-cuda-nvrtc-cu12", marker = "(platform_machine == 'x86_64' and sys_platform == 'linux' and extra != 'extra-8-nanochat-cpu' and extra != 'extra-8-nanochat-gpu') or (extra == 'extra-8-nanochat-cpu' and extra == 'extra-8-nanochat-gpu')" },
-    { name = "nvidia-cuda-runtime-cu12", marker = "(platform_machine == 'x86_64' and sys_platform == 'linux' and extra != 'extra-8-nanochat-cpu' and extra != 'extra-8-nanochat-gpu') or (extra == 'extra-8-nanochat-cpu' and extra == 'extra-8-nanochat-gpu')" },
-    { name = "nvidia-cudnn-cu12", marker = "(platform_machine == 'x86_64' and sys_platform == 'linux' and extra != 'extra-8-nanochat-cpu' and extra != 'extra-8-nanochat-gpu') or (extra == 'extra-8-nanochat-cpu' and extra == 'extra-8-nanochat-gpu')" },
-    { name = "nvidia-cufft-cu12", marker = "(platform_machine == 'x86_64' and sys_platform == 'linux' and extra != 'extra-8-nanochat-cpu' and extra != 'extra-8-nanochat-gpu') or (extra == 'extra-8-nanochat-cpu' and extra == 'extra-8-nanochat-gpu')" },
-    { name = "nvidia-cufile-cu12", marker = "(platform_machine == 'x86_64' and sys_platform == 'linux' and extra != 'extra-8-nanochat-cpu' and extra != 'extra-8-nanochat-gpu') or (extra == 'extra-8-nanochat-cpu' and extra == 'extra-8-nanochat-gpu')" },
-    { name = "nvidia-curand-cu12", marker = "(platform_machine == 'x86_64' and sys_platform == 'linux' and extra != 'extra-8-nanochat-cpu' and extra != 'extra-8-nanochat-gpu') or (extra == 'extra-8-nanochat-cpu' and extra == 'extra-8-nanochat-gpu')" },
-    { name = "nvidia-cusolver-cu12", marker = "(platform_machine == 'x86_64' and sys_platform == 'linux' and extra != 'extra-8-nanochat-cpu' and extra != 'extra-8-nanochat-gpu') or (extra == 'extra-8-nanochat-cpu' and extra == 'extra-8-nanochat-gpu')" },
-    { name = "nvidia-cusparse-cu12", marker = "(platform_machine == 'x86_64' and sys_platform == 'linux' and extra != 'extra-8-nanochat-cpu' and extra != 'extra-8-nanochat-gpu') or (extra == 'extra-8-nanochat-cpu' and extra == 'extra-8-nanochat-gpu')" },
-    { name = "nvidia-cusparselt-cu12", marker = "(platform_machine == 'x86_64' and sys_platform == 'linux' and extra != 'extra-8-nanochat-cpu' and extra != 'extra-8-nanochat-gpu') or (extra == 'extra-8-nanochat-cpu' and extra == 'extra-8-nanochat-gpu')" },
-    { name = "nvidia-nccl-cu12", marker = "(platform_machine == 'x86_64' and sys_platform == 'linux' and extra != 'extra-8-nanochat-cpu' and extra != 'extra-8-nanochat-gpu') or (extra == 'extra-8-nanochat-cpu' and extra == 'extra-8-nanochat-gpu')" },
-    { name = "nvidia-nvjitlink-cu12", marker = "(platform_machine == 'x86_64' and sys_platform == 'linux' and extra != 'extra-8-nanochat-cpu' and extra != 'extra-8-nanochat-gpu') or (extra == 'extra-8-nanochat-cpu' and extra == 'extra-8-nanochat-gpu')" },
-    { name = "nvidia-nvshmem-cu12", marker = "(platform_machine == 'x86_64' and sys_platform == 'linux' and extra != 'extra-8-nanochat-cpu' and extra != 'extra-8-nanochat-gpu') or (extra == 'extra-8-nanochat-cpu' and extra == 'extra-8-nanochat-gpu')" },
-    { name = "nvidia-nvtx-cu12", marker = "(platform_machine == 'x86_64' and sys_platform == 'linux' and extra != 'extra-8-nanochat-cpu' and extra != 'extra-8-nanochat-gpu') or (extra == 'extra-8-nanochat-cpu' and extra == 'extra-8-nanochat-gpu')" },
-    { name = "setuptools", marker = "(python_full_version >= '3.12' and sys_platform == 'linux' and extra != 'extra-8-nanochat-cpu' and extra != 'extra-8-nanochat-gpu') or (extra == 'extra-8-nanochat-cpu' and extra == 'extra-8-nanochat-gpu')" },
-    { name = "sympy", marker = "(sys_platform == 'linux' and extra != 'extra-8-nanochat-cpu' and extra != 'extra-8-nanochat-gpu') or (extra == 'extra-8-nanochat-cpu' and extra == 'extra-8-nanochat-gpu')" },
-    { name = "triton", version = "3.5.0", source = { registry = "https://pypi.org/simple" }, marker = "(platform_machine == 'x86_64' and sys_platform == 'linux' and extra != 'extra-8-nanochat-cpu' and extra != 'extra-8-nanochat-gpu') or (extra == 'extra-8-nanochat-cpu' and extra == 'extra-8-nanochat-gpu')" },
-    { name = "typing-extensions", marker = "(sys_platform == 'linux' and extra != 'extra-8-nanochat-cpu' and extra != 'extra-8-nanochat-gpu') or (extra == 'extra-8-nanochat-cpu' and extra == 'extra-8-nanochat-gpu')" },
-]
-wheels = [
-    { url = "https://files.pythonhosted.org/packages/bb/86/245c240d2138c17ed572c943c289056c2721abab70810d772c6bf5495b28/torch-2.9.0-cp310-cp310-manylinux_2_28_aarch64.whl", hash = "sha256:030bbfe367379ae6a4ae4042b6c44da25383343b8b3c68abaa9c7231efbaf2dd", size = 104213554, upload-time = "2025-10-15T15:45:59.798Z" },
-    { url = "https://files.pythonhosted.org/packages/58/1d/fd1e88ae0948825efcab7dd66d12bec23f05d4d38ed81573c8d453c14c06/torch-2.9.0-cp310-cp310-manylinux_2_28_x86_64.whl", hash = "sha256:51cb63902182a78e90886e8068befd8ea102af4b00e420263591a3d70c7d3c6c", size = 899795167, upload-time = "2025-10-15T15:47:12.695Z" },
-    { url = "https://files.pythonhosted.org/packages/63/5a/496197b45c14982bef4e079b24c61dc108e3ab0d0cc9718dba9f54f45a46/torch-2.9.0-cp310-cp310-win_amd64.whl", hash = "sha256:3f6aad4d2f0ee2248bac25339d74858ff846c3969b27d14ac235821f055af83d", size = 109310314, upload-time = "2025-10-15T15:46:16.633Z" },
-    { url = "https://files.pythonhosted.org/packages/58/b0/2b4e647b0fc706e88eb6c253d05511865578f5f67b55fad639bf3272a4a1/torch-2.9.0-cp310-none-macosx_11_0_arm64.whl", hash = "sha256:413e1654c9203733138858780e184d9fc59442f0b3b209e16f39354eb893db9b", size = 74452019, upload-time = "2025-10-15T15:46:04.296Z" },
-    { url = "https://files.pythonhosted.org/packages/58/fe/334225e6330e672b36aef23d77451fa906ea12881570c08638a91331a212/torch-2.9.0-cp311-cp311-manylinux_2_28_aarch64.whl", hash = "sha256:c596708b5105d0b199215acf0c9be7c1db5f1680d88eddadf4b75a299259a677", size = 104230578, upload-time = "2025-10-15T15:46:08.182Z" },
-    { url = "https://files.pythonhosted.org/packages/05/cc/49566caaa218872ec9a2912456f470ff92649894a4bc2e5274aa9ef87c4a/torch-2.9.0-cp311-cp311-manylinux_2_28_x86_64.whl", hash = "sha256:51de31219c97c51cf4bf2be94d622e3deb5dcc526c6dc00e97c17eaec0fc1d67", size = 899815990, upload-time = "2025-10-15T15:48:03.336Z" },
-    { url = "https://files.pythonhosted.org/packages/74/25/e9ab21d5925b642d008f139d4a3c9664fc9ee1faafca22913c080cc4c0a5/torch-2.9.0-cp311-cp311-win_amd64.whl", hash = "sha256:dd515c70059afd95f48b8192733764c08ca37a1d19803af6401b5ecad7c8676e", size = 109313698, upload-time = "2025-10-15T15:46:12.425Z" },
-    { url = "https://files.pythonhosted.org/packages/b3/b7/205ef3e94de636feffd64b28bb59a0dfac0771221201b9871acf9236f5ca/torch-2.9.0-cp311-none-macosx_11_0_arm64.whl", hash = "sha256:614a185e4986326d526a91210c8fc1397e76e8cfafa78baf6296a790e53a9eec", size = 74463678, upload-time = "2025-10-15T15:46:29.779Z" },
-    { url = "https://files.pythonhosted.org/packages/d1/d3/3985739f3b8e88675127bf70f82b3a48ae083e39cda56305dbd90398fec0/torch-2.9.0-cp312-cp312-manylinux_2_28_aarch64.whl", hash = "sha256:e5f7af1dc4c0a7c4a260c2534f41ddaf209714f7c89145e644c44712fbd6b642", size = 104107898, upload-time = "2025-10-15T15:46:20.883Z" },
-    { url = "https://files.pythonhosted.org/packages/a5/4b/f4bb2e6c25d0272f798cd6d7a04ed315da76cec68c602d87040c7847287f/torch-2.9.0-cp312-cp312-manylinux_2_28_x86_64.whl", hash = "sha256:01cff95ecd9a212ea2f141db28acccdceb6a4c54f64e6c51091146f5e2a772c6", size = 899738273, upload-time = "2025-10-15T15:50:04.188Z" },
-    { url = "https://files.pythonhosted.org/packages/66/11/c1c5ba6691cda6279087c35bd626536e4fd29521fe740abf5008377a9a02/torch-2.9.0-cp312-cp312-win_amd64.whl", hash = "sha256:4582b162f541651f0cb184d3e291c05c2f556c7117c64a9873e2ee158d40062b", size = 109280887, upload-time = "2025-10-15T15:46:26.228Z" },
-    { url = "https://files.pythonhosted.org/packages/dd/5f/b85bd8c05312d71de9402bf5868d217c38827cfd09d8f8514e5be128a52b/torch-2.9.0-cp312-none-macosx_11_0_arm64.whl", hash = "sha256:33f58e9a102a91259af289d50525c30323b5c9ae1d31322b6447c0814da68695", size = 74478983, upload-time = "2025-10-15T15:46:39.406Z" },
-    { url = "https://files.pythonhosted.org/packages/c2/1c/90eb13833cdf4969ea9707586d7b57095c3b6e2b223a7256bf111689bcb8/torch-2.9.0-cp313-cp313-manylinux_2_28_aarch64.whl", hash = "sha256:c30a17fc83eeab346913e237c64b15b5ba6407fff812f6c541e322e19bc9ea0e", size = 104111330, upload-time = "2025-10-15T15:46:35.238Z" },
-    { url = "https://files.pythonhosted.org/packages/0e/21/2254c54b8d523592c25ef4434769aa23e29b1e6bf5f4c0ad9e27bf442927/torch-2.9.0-cp313-cp313-manylinux_2_28_x86_64.whl", hash = "sha256:8f25033b8667b57857dfd01458fbf2a9e6a6df1f8def23aef0dc46292f6aa642", size = 899750243, upload-time = "2025-10-15T15:48:57.459Z" },
-    { url = "https://files.pythonhosted.org/packages/b7/a5/5cb94fa4fd1e78223455c23c200f30f6dc10c6d4a2bcc8f6e7f2a2588370/torch-2.9.0-cp313-cp313-win_amd64.whl", hash = "sha256:d037f1b4ffd25013be4a7bf3651a0a910c68554956c7b2c92ebe87c76475dece", size = 109284513, upload-time = "2025-10-15T15:46:45.061Z" },
-    { url = "https://files.pythonhosted.org/packages/66/e8/fc414d8656250ee46120b44836ffbb3266343db424b3e18ca79ebbf69d4f/torch-2.9.0-cp313-cp313t-macosx_11_0_arm64.whl", hash = "sha256:e4e5b5cba837a2a8d1a497ba9a58dae46fa392593eaa13b871c42f71847503a5", size = 74830362, upload-time = "2025-10-15T15:46:48.983Z" },
-    { url = "https://files.pythonhosted.org/packages/ed/5f/9474c98fc5ae0cd04b9466035428cd360e6611a86b8352a0fc2fa504acdc/torch-2.9.0-cp313-cp313t-manylinux_2_28_aarch64.whl", hash = "sha256:64693568f5dc4dbd5f880a478b1cea0201cc6b510d91d1bc54fea86ac5d1a637", size = 104144940, upload-time = "2025-10-15T15:47:29.076Z" },
-    { url = "https://files.pythonhosted.org/packages/2d/5a/8e0c1cf57830172c109d4bd6be2708cabeaf550983eee7029291322447a0/torch-2.9.0-cp313-cp313t-manylinux_2_28_x86_64.whl", hash = "sha256:f8ed31ddd7d10bfb3fbe0b9fe01b1243577f13d75e6f4a0839a283915ce3791e", size = 899744054, upload-time = "2025-10-15T15:48:29.864Z" },
-    { url = "https://files.pythonhosted.org/packages/6d/28/82c28b30fcb4b7c9cdd995763d18bbb830d6521356712faebbad92ffa61d/torch-2.9.0-cp313-cp313t-win_amd64.whl", hash = "sha256:eff527d4e4846e6f70d2afd8058b73825761203d66576a7e04ea2ecfebcb4ab8", size = 109517546, upload-time = "2025-10-15T15:47:33.395Z" },
-    { url = "https://files.pythonhosted.org/packages/ff/c3/a91f96ec74347fa5fd24453fa514bc61c61ecc79196fa760b012a1873d96/torch-2.9.0-cp313-none-macosx_11_0_arm64.whl", hash = "sha256:f8877779cf56d1ce431a7636703bdb13307f5960bb1af49716d8b179225e0e6a", size = 74480732, upload-time = "2025-10-15T15:47:38.002Z" },
-    { url = "https://files.pythonhosted.org/packages/5c/73/9f70af34b334a7e0ef496ceec96b7ec767bd778ea35385ce6f77557534d1/torch-2.9.0-cp314-cp314-macosx_11_0_arm64.whl", hash = "sha256:7e614fae699838038d888729f82b687c03413c5989ce2a9481f9a7e7a396e0bb", size = 74433037, upload-time = "2025-10-15T15:47:41.894Z" },
-    { url = "https://files.pythonhosted.org/packages/b7/84/37cf88625901934c97109e583ecc21777d21c6f54cda97a7e5bbad1ee2f2/torch-2.9.0-cp314-cp314-manylinux_2_28_aarch64.whl", hash = "sha256:dfb5b8cd310ba3436c7e14e8b7833ef658cf3045e50d2bdaed23c8fc517065eb", size = 104116482, upload-time = "2025-10-15T15:47:46.266Z" },
-    { url = "https://files.pythonhosted.org/packages/56/8e/ca8b17866943a8d4f4664d402ea84210aa274588b4c5d89918f5caa24eec/torch-2.9.0-cp314-cp314-manylinux_2_28_x86_64.whl", hash = "sha256:b3d29524993a478e46f5d598b249cd824b7ed98d7fba538bd9c4cde6c803948f", size = 899746916, upload-time = "2025-10-15T15:50:40.294Z" },
-    { url = "https://files.pythonhosted.org/packages/43/65/3b17c0fbbdab6501c5b320a52a648628d0d44e7379f64e27d9eef701b6bf/torch-2.9.0-cp314-cp314-win_amd64.whl", hash = "sha256:71c7578984f5ec0eb645eb4816ac8435fcf3e3e2ae1901bcd2f519a9cafb5125", size = 109275151, upload-time = "2025-10-15T15:49:20.715Z" },
-    { url = "https://files.pythonhosted.org/packages/83/36/74f8c051f785500396e42f93542422422dfd874a174f21f8d955d36e5d64/torch-2.9.0-cp314-cp314t-macosx_11_0_arm64.whl", hash = "sha256:71d9309aee457bbe0b164bce2111cd911c4ed4e847e65d5077dbbcd3aba6befc", size = 74823353, upload-time = "2025-10-15T15:49:16.59Z" },
-    { url = "https://files.pythonhosted.org/packages/62/51/dc3b4e2f9ba98ae27238f0153ca098bf9340b2dafcc67fde645d496dfc2a/torch-2.9.0-cp314-cp314t-manylinux_2_28_aarch64.whl", hash = "sha256:c08fb654d783899e204a32cca758a7ce8a45b2d78eeb89517cc937088316f78e", size = 104140340, upload-time = "2025-10-15T15:50:19.67Z" },
-    { url = "https://files.pythonhosted.org/packages/c0/8d/b00657f8141ac16af7bb6cda2e67de18499a3263b78d516b9a93fcbc98e3/torch-2.9.0-cp314-cp314t-manylinux_2_28_x86_64.whl", hash = "sha256:ec8feb0099b2daa5728fbc7abb0b05730fd97e0f359ff8bda09865aaa7bd7d4b", size = 899731750, upload-time = "2025-10-15T15:49:36.673Z" },
-    { url = "https://files.pythonhosted.org/packages/fc/29/bd361e0cbb2c79ce6450f42643aaf6919956f89923a50571b0ebfe92d142/torch-2.9.0-cp314-cp314t-win_amd64.whl", hash = "sha256:695ba920f234ad4170c9c50e28d56c848432f8f530e6bc7f88fcb15ddf338e75", size = 109503850, upload-time = "2025-10-15T15:50:24.118Z" },
-]
-
 [[package]]
 name = "torch"
 version = "2.9.1"
@ -3076,13 +3011,13 @@ dependencies = [
    { name = "typing-extensions", marker = "(sys_platform == 'darwin' and extra == 'extra-8-nanochat-cpu') or (extra == 'extra-8-nanochat-cpu' and extra == 'extra-8-nanochat-gpu')" },
 ]
 wheels = [
-    { url = "https://download.pytorch.org/whl/cpu/torch-2.9.1-cp310-none-macosx_11_0_arm64.whl" },
-    { url = "https://download.pytorch.org/whl/cpu/torch-2.9.1-cp311-none-macosx_11_0_arm64.whl" },
-    { url = "https://download.pytorch.org/whl/cpu/torch-2.9.1-cp312-none-macosx_11_0_arm64.whl" },
-    { url = "https://download.pytorch.org/whl/cpu/torch-2.9.1-cp313-cp313t-macosx_11_0_arm64.whl" },
-    { url = "https://download.pytorch.org/whl/cpu/torch-2.9.1-cp313-none-macosx_11_0_arm64.whl" },
-    { url = "https://download.pytorch.org/whl/cpu/torch-2.9.1-cp314-cp314-macosx_11_0_arm64.whl" },
-    { url = "https://download.pytorch.org/whl/cpu/torch-2.9.1-cp314-cp314t-macosx_11_0_arm64.whl" },
+    { url = "https://download.pytorch.org/whl/cpu/torch-2.9.1-cp310-none-macosx_11_0_arm64.whl", hash = "sha256:bf1e68cfb935ae2046374ff02a7aa73dda70351b46342846f557055b3a540bf0" },
+    { url = "https://download.pytorch.org/whl/cpu/torch-2.9.1-cp311-none-macosx_11_0_arm64.whl", hash = "sha256:a52952a8c90a422c14627ea99b9826b7557203b46b4d0772d3ca5c7699692425" },
+    { url = "https://download.pytorch.org/whl/cpu/torch-2.9.1-cp312-none-macosx_11_0_arm64.whl", hash = "sha256:287242dd1f830846098b5eca847f817aa5c6015ea57ab4c1287809efea7b77eb" },
+    { url = "https://download.pytorch.org/whl/cpu/torch-2.9.1-cp313-cp313t-macosx_11_0_arm64.whl", hash = "sha256:8924d10d36eac8fe0652a060a03fc2ae52980841850b9a1a2ddb0f27a4f181cd" },
+    { url = "https://download.pytorch.org/whl/cpu/torch-2.9.1-cp313-none-macosx_11_0_arm64.whl", hash = "sha256:bcee64ae7aa65876ceeae6dcaebe75109485b213528c74939602208a20706e3f" },
+    { url = "https://download.pytorch.org/whl/cpu/torch-2.9.1-cp314-cp314-macosx_11_0_arm64.whl", hash = "sha256:defadbeb055cfcf5def58f70937145aecbd7a4bc295238ded1d0e85ae2cf0e1d" },
+    { url = "https://download.pytorch.org/whl/cpu/torch-2.9.1-cp314-cp314t-macosx_11_0_arm64.whl", hash = "sha256:886f84b181f766f53265ba0a1d503011e60f53fff9d569563ef94f24160e1072" },
 ]

 [[package]]
@ -3090,19 +3025,22 @@ name = "torch"
 version = "2.9.1"
 source = { registry = "https://pypi.org/simple" }
 resolution-markers = [
+    "python_full_version >= '3.12' and sys_platform == 'linux'",
    "python_full_version >= '3.12' and sys_platform != 'linux'",
+    "python_full_version == '3.11.*' and sys_platform == 'linux'",
+    "python_full_version < '3.11' and sys_platform == 'linux'",
    "python_full_version == '3.11.*' and sys_platform != 'linux'",
    "python_full_version < '3.11' and sys_platform != 'linux'",
 ]
 dependencies = [
-    { name = "filelock", marker = "(sys_platform != 'linux' and extra != 'extra-8-nanochat-cpu' and extra != 'extra-8-nanochat-gpu') or (extra == 'extra-8-nanochat-cpu' and extra == 'extra-8-nanochat-gpu')" },
-    { name = "fsspec", marker = "(sys_platform != 'linux' and extra != 'extra-8-nanochat-cpu' and extra != 'extra-8-nanochat-gpu') or (extra == 'extra-8-nanochat-cpu' and extra == 'extra-8-nanochat-gpu')" },
-    { name = "jinja2", marker = "(sys_platform != 'linux' and extra != 'extra-8-nanochat-cpu' and extra != 'extra-8-nanochat-gpu') or (extra == 'extra-8-nanochat-cpu' and extra == 'extra-8-nanochat-gpu')" },
-    { name = "networkx", version = "3.4.2", source = { registry = "https://pypi.org/simple" }, marker = "(python_full_version < '3.11' and sys_platform != 'linux' and extra != 'extra-8-nanochat-cpu' and extra != 'extra-8-nanochat-gpu') or (extra == 'extra-8-nanochat-cpu' and extra == 'extra-8-nanochat-gpu')" },
-    { name = "networkx", version = "3.5", source = { registry = "https://pypi.org/simple" }, marker = "(python_full_version >= '3.11' and sys_platform != 'linux' and extra != 'extra-8-nanochat-cpu' and extra != 'extra-8-nanochat-gpu') or (extra == 'extra-8-nanochat-cpu' and extra == 'extra-8-nanochat-gpu')" },
-    { name = "setuptools", marker = "(python_full_version >= '3.12' and sys_platform != 'linux' and extra != 'extra-8-nanochat-cpu' and extra != 'extra-8-nanochat-gpu') or (extra == 'extra-8-nanochat-cpu' and extra == 'extra-8-nanochat-gpu')" },
-    { name = "sympy", marker = "(sys_platform != 'linux' and extra != 'extra-8-nanochat-cpu' and extra != 'extra-8-nanochat-gpu') or (extra == 'extra-8-nanochat-cpu' and extra == 'extra-8-nanochat-gpu')" },
-    { name = "typing-extensions", marker = "(sys_platform != 'linux' and extra != 'extra-8-nanochat-cpu' and extra != 'extra-8-nanochat-gpu') or (extra == 'extra-8-nanochat-cpu' and extra == 'extra-8-nanochat-gpu')" },
+    { name = "filelock", marker = "(extra == 'extra-8-nanochat-cpu' and extra == 'extra-8-nanochat-gpu') or (extra != 'extra-8-nanochat-cpu' and extra != 'extra-8-nanochat-gpu')" },
+    { name = "fsspec", marker = "(extra == 'extra-8-nanochat-cpu' and extra == 'extra-8-nanochat-gpu') or (extra != 'extra-8-nanochat-cpu' and extra != 'extra-8-nanochat-gpu')" },
+    { name = "jinja2", marker = "(extra == 'extra-8-nanochat-cpu' and extra == 'extra-8-nanochat-gpu') or (extra != 'extra-8-nanochat-cpu' and extra != 'extra-8-nanochat-gpu')" },
+    { name = "networkx", version = "3.4.2", source = { registry = "https://pypi.org/simple" }, marker = "(python_full_version < '3.11' and extra != 'extra-8-nanochat-cpu' and extra != 'extra-8-nanochat-gpu') or (extra == 'extra-8-nanochat-cpu' and extra == 'extra-8-nanochat-gpu')" },
+    { name = "networkx", version = "3.5", source = { registry = "https://pypi.org/simple" }, marker = "(python_full_version >= '3.11' and extra != 'extra-8-nanochat-cpu' and extra != 'extra-8-nanochat-gpu') or (extra == 'extra-8-nanochat-cpu' and extra == 'extra-8-nanochat-gpu')" },
+    { name = "setuptools", marker = "(python_full_version >= '3.12' and extra != 'extra-8-nanochat-cpu' and extra != 'extra-8-nanochat-gpu') or (extra == 'extra-8-nanochat-cpu' and extra == 'extra-8-nanochat-gpu')" },
+    { name = "sympy", marker = "(extra == 'extra-8-nanochat-cpu' and extra == 'extra-8-nanochat-gpu') or (extra != 'extra-8-nanochat-cpu' and extra != 'extra-8-nanochat-gpu')" },
+    { name = "typing-extensions", marker = "(extra == 'extra-8-nanochat-cpu' and extra == 'extra-8-nanochat-gpu') or (extra != 'extra-8-nanochat-cpu' and extra != 'extra-8-nanochat-gpu')" },
 ]
 wheels = [
    { url = "https://files.pythonhosted.org/packages/5f/56/9577683b23072075ed2e40d725c52c2019d71a972fab8e083763da8e707e/torch-2.9.1-cp310-cp310-manylinux_2_28_aarch64.whl", hash = "sha256:1cc208435f6c379f9b8fdfd5ceb5be1e3b72a6bdf1cb46c0d2812aa73472db9e", size = 104207681, upload-time = "2025-11-12T15:19:56.48Z" },
@ -3158,30 +3096,30 @@ dependencies = [
    { name = "typing-extensions", marker = "(sys_platform != 'darwin' and extra == 'extra-8-nanochat-cpu') or (extra == 'extra-8-nanochat-cpu' and extra == 'extra-8-nanochat-gpu')" },
 ]
 wheels = [
-    { url = "https://download.pytorch.org/whl/cpu/torch-2.9.1%2Bcpu-cp310-cp310-manylinux_2_28_aarch64.whl" },
-    { url = "https://download.pytorch.org/whl/cpu/torch-2.9.1%2Bcpu-cp310-cp310-manylinux_2_28_x86_64.whl" },
-    { url = "https://download.pytorch.org/whl/cpu/torch-2.9.1%2Bcpu-cp310-cp310-win_amd64.whl" },
-    { url = "https://download.pytorch.org/whl/cpu/torch-2.9.1%2Bcpu-cp311-cp311-manylinux_2_28_aarch64.whl" },
-    { url = "https://download.pytorch.org/whl/cpu/torch-2.9.1%2Bcpu-cp311-cp311-manylinux_2_28_x86_64.whl" },
-    { url = "https://download.pytorch.org/whl/cpu/torch-2.9.1%2Bcpu-cp311-cp311-win_amd64.whl" },
-    { url = "https://download.pytorch.org/whl/cpu/torch-2.9.1%2Bcpu-cp311-cp311-win_arm64.whl" },
-    { url = "https://download.pytorch.org/whl/cpu/torch-2.9.1%2Bcpu-cp312-cp312-manylinux_2_28_aarch64.whl" },
-    { url = "https://download.pytorch.org/whl/cpu/torch-2.9.1%2Bcpu-cp312-cp312-manylinux_2_28_x86_64.whl" },
-    { url = "https://download.pytorch.org/whl/cpu/torch-2.9.1%2Bcpu-cp312-cp312-win_amd64.whl" },
-    { url = "https://download.pytorch.org/whl/cpu/torch-2.9.1%2Bcpu-cp312-cp312-win_arm64.whl" },
-    { url = "https://download.pytorch.org/whl/cpu/torch-2.9.1%2Bcpu-cp313-cp313-manylinux_2_28_aarch64.whl" },
-    { url = "https://download.pytorch.org/whl/cpu/torch-2.9.1%2Bcpu-cp313-cp313-manylinux_2_28_x86_64.whl" },
-    { url = "https://download.pytorch.org/whl/cpu/torch-2.9.1%2Bcpu-cp313-cp313-win_amd64.whl" },
-    { url = "https://download.pytorch.org/whl/cpu/torch-2.9.1%2Bcpu-cp313-cp313-win_arm64.whl" },
-    { url = "https://download.pytorch.org/whl/cpu/torch-2.9.1%2Bcpu-cp313-cp313t-manylinux_2_28_aarch64.whl" },
-    { url = "https://download.pytorch.org/whl/cpu/torch-2.9.1%2Bcpu-cp313-cp313t-manylinux_2_28_x86_64.whl" },
-    { url = "https://download.pytorch.org/whl/cpu/torch-2.9.1%2Bcpu-cp313-cp313t-win_amd64.whl" },
-    { url = "https://download.pytorch.org/whl/cpu/torch-2.9.1%2Bcpu-cp314-cp314-manylinux_2_28_aarch64.whl" },
-    { url = "https://download.pytorch.org/whl/cpu/torch-2.9.1%2Bcpu-cp314-cp314-manylinux_2_28_x86_64.whl" },
-    { url = "https://download.pytorch.org/whl/cpu/torch-2.9.1%2Bcpu-cp314-cp314-win_amd64.whl" },
-    { url = "https://download.pytorch.org/whl/cpu/torch-2.9.1%2Bcpu-cp314-cp314t-manylinux_2_28_aarch64.whl" },
-    { url = "https://download.pytorch.org/whl/cpu/torch-2.9.1%2Bcpu-cp314-cp314t-manylinux_2_28_x86_64.whl" },
-    { url = "https://download.pytorch.org/whl/cpu/torch-2.9.1%2Bcpu-cp314-cp314t-win_amd64.whl" },
+    { url = "https://download.pytorch.org/whl/cpu/torch-2.9.1%2Bcpu-cp310-cp310-manylinux_2_28_aarch64.whl", hash = "sha256:10866c8a48c4aa5ae3f48538dc8a055b99c57d9c6af2bf5dd715374d9d6ddca3" },
+    { url = "https://download.pytorch.org/whl/cpu/torch-2.9.1%2Bcpu-cp310-cp310-manylinux_2_28_x86_64.whl", hash = "sha256:7210713b66943fdbfcc237b2e782871b649123ac5d29f548ce8c85be4223ab38" },
+    { url = "https://download.pytorch.org/whl/cpu/torch-2.9.1%2Bcpu-cp310-cp310-win_amd64.whl", hash = "sha256:d6e8441453dc27524e3f1037fbf27b90a02644b84e42944b9354b4024cb51cc1" },
+    { url = "https://download.pytorch.org/whl/cpu/torch-2.9.1%2Bcpu-cp311-cp311-manylinux_2_28_aarch64.whl", hash = "sha256:0e611cfb16724e62252b67d31073bc5c490cb83e92ecdc1192762535e0e44487" },
+    { url = "https://download.pytorch.org/whl/cpu/torch-2.9.1%2Bcpu-cp311-cp311-manylinux_2_28_x86_64.whl", hash = "sha256:3de2adb9b4443dc9210ef1f1b16da3647ace53553166d6360bbbd7edd6f16e4d" },
+    { url = "https://download.pytorch.org/whl/cpu/torch-2.9.1%2Bcpu-cp311-cp311-win_amd64.whl", hash = "sha256:69b3785d28be5a9c56ab525788ec5000349ec59132a74b7d5e954b905015b992" },
+    { url = "https://download.pytorch.org/whl/cpu/torch-2.9.1%2Bcpu-cp311-cp311-win_arm64.whl", hash = "sha256:15b4ae6fe371d96bffb8e1e9af62164797db20a0dc1337345781659cfd0b8bb1" },
+    { url = "https://download.pytorch.org/whl/cpu/torch-2.9.1%2Bcpu-cp312-cp312-manylinux_2_28_aarch64.whl", hash = "sha256:3bf9b442a51a2948e41216a76d7ab00f0694cfcaaa51b6f9bcab57b7f89843e6" },
+    { url = "https://download.pytorch.org/whl/cpu/torch-2.9.1%2Bcpu-cp312-cp312-manylinux_2_28_x86_64.whl", hash = "sha256:7417d8c565f219d3455654cb431c6d892a3eb40246055e14d645422de13b9ea1" },
+    { url = "https://download.pytorch.org/whl/cpu/torch-2.9.1%2Bcpu-cp312-cp312-win_amd64.whl", hash = "sha256:a4e06b4f441675d26b462123c8a83e77c55f1ec8ebc081203be2db1ea8054add" },
+    { url = "https://download.pytorch.org/whl/cpu/torch-2.9.1%2Bcpu-cp312-cp312-win_arm64.whl", hash = "sha256:1abe31f14b560c1f062699e966cb08ef5b67518a1cfac2d8547a3dbcd8387b06" },
+    { url = "https://download.pytorch.org/whl/cpu/torch-2.9.1%2Bcpu-cp313-cp313-manylinux_2_28_aarch64.whl", hash = "sha256:3e532e553b37ee859205a9b2d1c7977fd6922f53bbb1b9bfdd5bdc00d1a60ed4" },
+    { url = "https://download.pytorch.org/whl/cpu/torch-2.9.1%2Bcpu-cp313-cp313-manylinux_2_28_x86_64.whl", hash = "sha256:39b3dff6d8fba240ae0d1bede4ca11c2531ae3b47329206512d99e17907ff74b" },
+    { url = "https://download.pytorch.org/whl/cpu/torch-2.9.1%2Bcpu-cp313-cp313-win_amd64.whl", hash = "sha256:404a7ab2fffaf2ca069e662f331eb46313692b2f1630df2720094284f390ccef" },
+    { url = "https://download.pytorch.org/whl/cpu/torch-2.9.1%2Bcpu-cp313-cp313-win_arm64.whl", hash = "sha256:161decbff26a33f13cb5ba6d2c8f458bbf56193bcc32ecc70be6dd4c7a3ee79d" },
+    { url = "https://download.pytorch.org/whl/cpu/torch-2.9.1%2Bcpu-cp313-cp313t-manylinux_2_28_aarch64.whl", hash = "sha256:01b1884f724977a20c7da2f640f1c7b37f4a2c117a7f4a6c1c0424d14cb86322" },
+    { url = "https://download.pytorch.org/whl/cpu/torch-2.9.1%2Bcpu-cp313-cp313t-manylinux_2_28_x86_64.whl", hash = "sha256:031a597147fa81b1e6d79ccf1ad3ccc7fafa27941d6cf26ff5caaa384fb20e92" },
+    { url = "https://download.pytorch.org/whl/cpu/torch-2.9.1%2Bcpu-cp313-cp313t-win_amd64.whl", hash = "sha256:e586ab1363e3f86aa4cc133b7fdcf98deb1d2c13d43a7a6e5a6a18e9c5364893" },
+    { url = "https://download.pytorch.org/whl/cpu/torch-2.9.1%2Bcpu-cp314-cp314-manylinux_2_28_aarch64.whl", hash = "sha256:65010ab4aacce6c9a1ddfc935f986c003ca8638ded04348fd326c3e74346237c" },
+    { url = "https://download.pytorch.org/whl/cpu/torch-2.9.1%2Bcpu-cp314-cp314-manylinux_2_28_x86_64.whl", hash = "sha256:88adf5157db5da1d54b1c9fe4a6c1d20ceef00e75d854e206a87dbf69e3037dc" },
+    { url = "https://download.pytorch.org/whl/cpu/torch-2.9.1%2Bcpu-cp314-cp314-win_amd64.whl", hash = "sha256:f60e2565f261542efac07e25208fb3fc55c6fe82314a5a9cbee971edb5f27713" },
+    { url = "https://download.pytorch.org/whl/cpu/torch-2.9.1%2Bcpu-cp314-cp314t-manylinux_2_28_aarch64.whl", hash = "sha256:3ac2b8df2c55430e836dcda31940d47f1f5f94b8731057b6f20300ebea394dd9" },
+    { url = "https://download.pytorch.org/whl/cpu/torch-2.9.1%2Bcpu-cp314-cp314t-manylinux_2_28_x86_64.whl", hash = "sha256:5b688445f928f13563b7418b17c57e97bf955ab559cf73cd8f2b961f8572dbb3" },
+    { url = "https://download.pytorch.org/whl/cpu/torch-2.9.1%2Bcpu-cp314-cp314t-win_amd64.whl", hash = "sha256:cf9c3e50b595721ca6b488bdcc326e0f1af73ed28b9b66eff504a96649bb5c96" },
 ]

 [[package]]
@ -3219,31 +3157,40 @@ dependencies = [
    { name = "nvidia-nvtx-cu12", marker = "(sys_platform == 'linux' and extra == 'extra-8-nanochat-gpu') or (extra == 'extra-8-nanochat-cpu' and extra == 'extra-8-nanochat-gpu')" },
    { name = "setuptools", marker = "(python_full_version >= '3.12' and extra == 'extra-8-nanochat-gpu') or (extra == 'extra-8-nanochat-cpu' and extra == 'extra-8-nanochat-gpu')" },
    { name = "sympy", marker = "extra == 'extra-8-nanochat-gpu'" },
-    { name = "triton", version = "3.5.1", source = { registry = "https://pypi.org/simple" }, marker = "(sys_platform == 'linux' and extra == 'extra-8-nanochat-gpu') or (extra == 'extra-8-nanochat-cpu' and extra == 'extra-8-nanochat-gpu')" },
+    { name = "triton", marker = "(sys_platform == 'linux' and extra == 'extra-8-nanochat-gpu') or (extra == 'extra-8-nanochat-cpu' and extra == 'extra-8-nanochat-gpu')" },
    { name = "typing-extensions", marker = "extra == 'extra-8-nanochat-gpu'" },
 ]
 wheels = [
-    { url = "https://download.pytorch.org/whl/cu128/torch-2.9.1%2Bcu128-cp310-cp310-manylinux_2_28_aarch64.whl" },
-    { url = "https://download.pytorch.org/whl/cu128/torch-2.9.1%2Bcu128-cp310-cp310-manylinux_2_28_x86_64.whl" },
-    { url = "https://download.pytorch.org/whl/cu128/torch-2.9.1%2Bcu128-cp310-cp310-win_amd64.whl" },
-    { url = "https://download.pytorch.org/whl/cu128/torch-2.9.1%2Bcu128-cp311-cp311-manylinux_2_28_aarch64.whl" },
-    { url = "https://download.pytorch.org/whl/cu128/torch-2.9.1%2Bcu128-cp311-cp311-manylinux_2_28_x86_64.whl" },
-    { url = "https://download.pytorch.org/whl/cu128/torch-2.9.1%2Bcu128-cp311-cp311-win_amd64.whl" },
-    { url = "https://download.pytorch.org/whl/cu128/torch-2.9.1%2Bcu128-cp312-cp312-manylinux_2_28_aarch64.whl" },
-    { url = "https://download.pytorch.org/whl/cu128/torch-2.9.1%2Bcu128-cp312-cp312-manylinux_2_28_x86_64.whl" },
-    { url = "https://download.pytorch.org/whl/cu128/torch-2.9.1%2Bcu128-cp312-cp312-win_amd64.whl" },
-    { url = "https://download.pytorch.org/whl/cu128/torch-2.9.1%2Bcu128-cp313-cp313-manylinux_2_28_aarch64.whl" },
-    { url = "https://download.pytorch.org/whl/cu128/torch-2.9.1%2Bcu128-cp313-cp313-manylinux_2_28_x86_64.whl" },
-    { url = "https://download.pytorch.org/whl/cu128/torch-2.9.1%2Bcu128-cp313-cp313-win_amd64.whl" },
-    { url = "https://download.pytorch.org/whl/cu128/torch-2.9.1%2Bcu128-cp313-cp313t-manylinux_2_28_aarch64.whl" },
-    { url = "https://download.pytorch.org/whl/cu128/torch-2.9.1%2Bcu128-cp313-cp313t-manylinux_2_28_x86_64.whl" },
-    { url = "https://download.pytorch.org/whl/cu128/torch-2.9.1%2Bcu128-cp313-cp313t-win_amd64.whl" },
-    { url = "https://download.pytorch.org/whl/cu128/torch-2.9.1%2Bcu128-cp314-cp314-manylinux_2_28_aarch64.whl" },
-    { url = "https://download.pytorch.org/whl/cu128/torch-2.9.1%2Bcu128-cp314-cp314-manylinux_2_28_x86_64.whl" },
-    { url = "https://download.pytorch.org/whl/cu128/torch-2.9.1%2Bcu128-cp314-cp314-win_amd64.whl" },
-    { url = "https://download.pytorch.org/whl/cu128/torch-2.9.1%2Bcu128-cp314-cp314t-manylinux_2_28_aarch64.whl" },
-    { url = "https://download.pytorch.org/whl/cu128/torch-2.9.1%2Bcu128-cp314-cp314t-manylinux_2_28_x86_64.whl" },
-    { url = "https://download.pytorch.org/whl/cu128/torch-2.9.1%2Bcu128-cp314-cp314t-win_amd64.whl" },
+    { url = "https://download.pytorch.org/whl/cu128/torch-2.9.1%2Bcu128-cp310-cp310-manylinux_2_28_aarch64.whl", hash = "sha256:72f0f096475e8095a6bea3fba75bd3b46cf42c761b29588f7599314e67a32661" },
+    { url = "https://download.pytorch.org/whl/cu128/torch-2.9.1%2Bcu128-cp310-cp310-manylinux_2_28_x86_64.whl", hash = "sha256:c8d670aa0be6fbecd2b0e7b7d514a104dbdefcc3786ca446cf0c3415043ea40a" },
+    { url = "https://download.pytorch.org/whl/cu128/torch-2.9.1%2Bcu128-cp310-cp310-win_amd64.whl", hash = "sha256:64399adaa8ea0896d02cf844cba3c5dd77e769520a1af73572599e0eaa2cf551" },
+    { url = "https://download.pytorch.org/whl/cu128/torch-2.9.1%2Bcu128-cp311-cp311-manylinux_2_28_aarch64.whl", hash = "sha256:cf4ad82430824a80a9f398e29369524ed26c152cf00c2c12002e5400b35e260d" },
+    { url = "https://download.pytorch.org/whl/cu128/torch-2.9.1%2Bcu128-cp311-cp311-manylinux_2_28_x86_64.whl", hash = "sha256:2a1da940f0757621d098c9755f7504d791a72a40920ec85a4fd98b20253fca4e" },
+    { url = "https://download.pytorch.org/whl/cu128/torch-2.9.1%2Bcu128-cp311-cp311-win_amd64.whl", hash = "sha256:633005a3700e81b5be0df2a7d3c1d48aced23ed927653797a3bd2b144a3aeeb6" },
+    { url = "https://download.pytorch.org/whl/cu128/torch-2.9.1%2Bcu128-cp312-cp312-manylinux_2_28_aarch64.whl", hash = "sha256:1176f250311fa95cc3bca8077af323e0d73ea385ba266e096af82e7e2b91f256" },
+    { url = "https://download.pytorch.org/whl/cu128/torch-2.9.1%2Bcu128-cp312-cp312-manylinux_2_28_x86_64.whl", hash = "sha256:7cb4018f4ce68b61fd3ef87dc1c4ca520731c7b5b200e360ad47b612d7844063" },
+    { url = "https://download.pytorch.org/whl/cu128/torch-2.9.1%2Bcu128-cp312-cp312-win_amd64.whl", hash = "sha256:3a01f0b64c10a82d444d9fd06b3e8c567b1158b76b2764b8f51bfd8f535064b0" },
+    { url = "https://download.pytorch.org/whl/cu128/torch-2.9.1%2Bcu128-cp313-cp313-manylinux_2_28_aarch64.whl", hash = "sha256:0b80b7555dcd0a75b7b06016991f01281a0bb078cf28fa2d1dfb949fad2fbd07" },
+    { url = "https://download.pytorch.org/whl/cu128/torch-2.9.1%2Bcu128-cp313-cp313-manylinux_2_28_x86_64.whl", hash = "sha256:63381a109a569b280ed3319da89d3afe5cf9ab5c879936382a212affb5c90552" },
+    { url = "https://download.pytorch.org/whl/cu128/torch-2.9.1%2Bcu128-cp313-cp313-win_amd64.whl", hash = "sha256:ad9183864acdd99fc5143d7ca9d3d2e7ddfc9a9600ff43217825d4e5e9855ccc" },
+    { url = "https://download.pytorch.org/whl/cu128/torch-2.9.1%2Bcu128-cp313-cp313t-manylinux_2_28_aarch64.whl", hash = "sha256:2314521c74d76e513c53bb72c0ce3511ef0295ff657a432790df6c207e5d7962" },
+    { url = "https://download.pytorch.org/whl/cu128/torch-2.9.1%2Bcu128-cp313-cp313t-manylinux_2_28_x86_64.whl", hash = "sha256:4454a4faca31af81566e3a4208f10f20b8a6d9cfe42791b0ca7ff134326468fc" },
+    { url = "https://download.pytorch.org/whl/cu128/torch-2.9.1%2Bcu128-cp313-cp313t-win_amd64.whl", hash = "sha256:24420e430e77136f7079354134b34e7ba9d87e539f5ac84c33b08e5c13412ebe" },
+    { url = "https://download.pytorch.org/whl/cu128/torch-2.9.1%2Bcu128-cp314-cp314-manylinux_2_28_aarch64.whl", hash = "sha256:32c036296c557f19a1537ce981c40533650097114e1720a321a39a3b08d9df56" },
+    { url = "https://download.pytorch.org/whl/cu128/torch-2.9.1%2Bcu128-cp314-cp314-manylinux_2_28_x86_64.whl", hash = "sha256:7788d3d03d939cf00f93ac0da5ab520846f66411e339cfbf519a806e8facf519" },
+    { url = "https://download.pytorch.org/whl/cu128/torch-2.9.1%2Bcu128-cp314-cp314-win_amd64.whl", hash = "sha256:7bcd40cbffac475b478d6ce812f03da84e9a4894956efb89c3b7bcca5dbd4f91" },
+    { url = "https://download.pytorch.org/whl/cu128/torch-2.9.1%2Bcu128-cp314-cp314t-manylinux_2_28_aarch64.whl", hash = "sha256:e88c78e5b08ae9303aa15da43b68b44287ecbec16d898d9fad6998832fe626a5" },
+    { url = "https://download.pytorch.org/whl/cu128/torch-2.9.1%2Bcu128-cp314-cp314t-manylinux_2_28_x86_64.whl", hash = "sha256:7d8769bdf3200ca16a92f14df404c3370171ac3732996528a8973d753eac562f" },
+    { url = "https://download.pytorch.org/whl/cu128/torch-2.9.1%2Bcu128-cp314-cp314t-win_amd64.whl", hash = "sha256:0c784b600959ec70ee01cb23e8bc870a0e0475af30378ff5e39f4abed8b7c1cc" },
+]
+
+[[package]]
+name = "torchao"
+version = "0.15.0"
+source = { registry = "https://pypi.org/simple" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/57/2d/472b9362dceae05a4599e2b94f86e69a29c0e20964a6af84f34f6ead5938/torchao-0.15.0-cp310-abi3-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:1cbe813201314ba6329a650a76944502f3e8ec4b1b44523f3f48676810d8d1f6", size = 7163930, upload-time = "2025-12-18T23:14:41.876Z" },
+    { url = "https://files.pythonhosted.org/packages/f6/3b/6b9d5618720f63dbc2e2509cd6b57aae9c0d61b738d1d2172f4d5d9efaab/torchao-0.15.0-py3-none-any.whl", hash = "sha256:3f3812676048ef8a2a0e9d492d12d8971ba7a7ebb16f54aa56f690414e130d2c", size = 1080679, upload-time = "2025-12-18T23:14:43.807Z" },
 ]

 [[package]]
@ -3307,41 +3254,10 @@ wheels = [
    { url = "https://files.pythonhosted.org/packages/6a/6b/2f416568b3c4c91c96e5a365d164f8a4a4a88030aa8ab4644181fdadce97/transformers-4.57.3-py3-none-any.whl", hash = "sha256:c77d353a4851b1880191603d36acb313411d3577f6e2897814f333841f7003f4", size = 11993463, upload-time = "2025-11-25T15:51:26.493Z" },
 ]

-[[package]]
-name = "triton"
-version = "3.5.0"
-source = { registry = "https://pypi.org/simple" }
-resolution-markers = [
-    "python_full_version >= '3.12' and sys_platform == 'linux'",
-    "python_full_version == '3.11.*' and sys_platform == 'linux'",
-    "python_full_version < '3.11' and sys_platform == 'linux'",
-]
-wheels = [
-    { url = "https://files.pythonhosted.org/packages/dd/22/507b6f58a35e05e84381630b2dc2a3cee1a7a2a7eaf4cba857c638a18a24/triton-3.5.0-cp310-cp310-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:6f90de6a6566bb619b4c0adc9855729e1b1b5e26533fca1bf6206e96b6d277a3", size = 159827599, upload-time = "2025-10-15T19:15:43.87Z" },
-    { url = "https://files.pythonhosted.org/packages/0b/eb/09e31d107a5d00eb281aa7e6635ca463e9bca86515944e399480eadb71f8/triton-3.5.0-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:d5d3b3d480debf24eaa739623c9a42446b0b77f95593d30eb1f64cd2278cc1f0", size = 170333110, upload-time = "2025-10-13T16:37:49.588Z" },
-    { url = "https://files.pythonhosted.org/packages/79/f9/b6f60f978397c616fd8dacca2305759fe4f80d397b20ef72534803244bd5/triton-3.5.0-cp311-cp311-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:8457b22148defefdcb7fa8144b05ce211b9faefad650a1ce85b23df488d5549c", size = 159926731, upload-time = "2025-10-15T19:15:49.682Z" },
-    { url = "https://files.pythonhosted.org/packages/3d/78/949a04391c21956c816523678f0e5fa308eb5b1e7622d88c4e4ef5fceca0/triton-3.5.0-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:f34bfa21c5b3a203c0f0eab28dcc1e49bd1f67d22724e77fb6665a659200a4ec", size = 170433488, upload-time = "2025-10-13T16:37:57.132Z" },
-    { url = "https://files.pythonhosted.org/packages/87/9b/30988039e1e84df7554fba24e6a734d2d0e847af33cabdf9b532b3c51456/triton-3.5.0-cp312-cp312-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:7da21fccceafc163e3a5e857abe34351ef76345af06cabf9637a914742671f0b", size = 159946647, upload-time = "2025-10-15T19:15:56.325Z" },
-    { url = "https://files.pythonhosted.org/packages/f5/3a/e991574f3102147b642e49637e0281e9bb7c4ba254edb2bab78247c85e01/triton-3.5.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:c9e71db82261c4ffa3921cd050cd5faa18322d2d405c30eb56084afaff3b0833", size = 170476535, upload-time = "2025-10-13T16:38:05.18Z" },
-    { url = "https://files.pythonhosted.org/packages/cd/85/e37f1197acb04c8f3d83851d23d5d6ed5060ef74580668b112e23fdfa203/triton-3.5.0-cp313-cp313-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:188da5b81fa2f8322c27fec1627703eac24cb9bb7ab0dfbe9925973bc1b070d3", size = 159958970, upload-time = "2025-10-15T19:16:01.717Z" },
-    { url = "https://files.pythonhosted.org/packages/6c/29/10728de8a6e932e517c10773486b8e99f85d1b1d9dd87d9a9616e1fef4a1/triton-3.5.0-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:e6bb9aa5519c084a333acdba443789e50012a4b851cd486c54f0b8dc2a8d3a12", size = 170487289, upload-time = "2025-10-13T16:38:11.662Z" },
-    { url = "https://files.pythonhosted.org/packages/b8/1d/38258f05010ac17a7b058c022911c9cae6526e149b7397134a048cf5a6c2/triton-3.5.0-cp313-cp313t-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:03127d9b33aaf979c856676b394bc059ec1d68cb6da68ae03f62dd8ad77a04ae", size = 160073012, upload-time = "2025-10-15T19:16:07.477Z" },
-    { url = "https://files.pythonhosted.org/packages/5c/38/db80e48b9220c9bce872b0f616ad0446cdf554a40b85c7865cbca99ab3c2/triton-3.5.0-cp313-cp313t-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:c83f2343e1a220a716c7b3ab9fccfcbe3ad4020d189549200e2d2e8d5868bed9", size = 170577179, upload-time = "2025-10-13T16:38:17.865Z" },
-    { url = "https://files.pythonhosted.org/packages/91/fe/8f5771d00227f4eb1ee034f218ed427102b989366d2275fe3b3c105a3921/triton-3.5.0-cp314-cp314-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:468936651d383f4a6d10068d34a627505e13af55be5d002b9f27b987e7a5f0ac", size = 159957460, upload-time = "2025-10-15T19:16:12.626Z" },
-    { url = "https://files.pythonhosted.org/packages/ff/60/1810655d1d856c9a4fcc90ee8966d85f552d98c53a6589f95ab2cbe27bb8/triton-3.5.0-cp314-cp314-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:da0fa67ccd76c3dcfb0bffe1b1c57c685136a6bd33d141c24d9655d4185b1289", size = 170487949, upload-time = "2025-10-13T16:38:24.881Z" },
-    { url = "https://files.pythonhosted.org/packages/78/59/99edd103958fe6e42b50b9ad8ce4f223ddf4ccf475259cf7d2b53381dc6c/triton-3.5.0-cp314-cp314t-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:c7ceef21410229ac23173a28eee5cfc0e37c1dfdb8b4bc11ecda2e3ecec7c686", size = 160075629, upload-time = "2025-10-15T19:16:18.746Z" },
-    { url = "https://files.pythonhosted.org/packages/fb/b7/1dec8433ac604c061173d0589d99217fe7bf90a70bdc375e745d044b8aad/triton-3.5.0-cp314-cp314t-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:317fe477ea8fd4524a6a8c499fb0a36984a56d0b75bf9c9cb6133a1c56d5a6e7", size = 170580176, upload-time = "2025-10-13T16:38:31.14Z" },
-]
-
 [[package]]
 name = "triton"
 version = "3.5.1"
 source = { registry = "https://pypi.org/simple" }
-resolution-markers = [
-    "python_full_version >= '3.12' and sys_platform == 'linux'",
-    "python_full_version == '3.11.*' and sys_platform == 'linux'",
-    "python_full_version < '3.11' and sys_platform == 'linux'",
-]
 wheels = [
    { url = "https://files.pythonhosted.org/packages/d9/2e/f95e673222afa2c7f0c687d8913e98fcf2589ef0b1405de76894e37fe18f/triton-3.5.1-cp310-cp310-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:f63e34dcb32d7bd3a1d0195f60f30d2aee8b08a69a0424189b71017e23dfc3d2", size = 159821655, upload-time = "2025-11-11T17:51:44.09Z" },
    { url = "https://files.pythonhosted.org/packages/fd/6e/676ab5019b4dde8b9b7bab71245102fc02778ef3df48218b298686b9ffd6/triton-3.5.1-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:5fc53d849f879911ea13f4a877243afc513187bc7ee92d1f2c0f1ba3169e3c94", size = 170320692, upload-time = "2025-11-11T17:40:46.074Z" },