Compare commits


3 Commits

Author SHA1 Message Date
Eyal Frishman
59ed9392ed Add pre-commit documentation to README and GitHub workflow
Sets up a pre-commit workflow to automate code linting and formatting.

This ensures code quality and consistency by running checks before code is committed.
2025-12-05 19:59:38 +02:00
Eyal Frishman
449494c8b6 Fix (automatically) all pre-commit errors 2025-12-05 19:59:35 +02:00
Eyal Frishman
6587063479 Add pre-commit hooks for code formatting (not yet executed)
Adds pre-commit configuration to automate code formatting and linting.

This includes:
- ruff-check for linting
- ruff-format for code formatting
- pre-commit-hooks for various checks (whitespace, large files, etc.)
- codespell for fixing common misspellings in text

Special configurations added to `pyproject.toml`:
- Line length: 120 (more flexible than black's 88)
- Quote style: preserve (keeps existing quotes)
Rename ruff hook (avoid using legacy alias)
2025-12-05 19:59:35 +02:00
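The auto-fix commit above can be reproduced locally; a minimal sketch, assuming the dev dependency group from `pyproject.toml` (which includes pre-commit) is installed:

```
# Install the dev tooling and run every hook across the repository;
# most hooks rewrite files in place, so review the result afterwards
uv sync --group dev
uv run pre-commit run --all-files
git diff --stat
```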
44 changed files with 1757 additions and 1074 deletions

.github/workflows/pre-commit.yml (vendored, new file, +44 lines)
View File

@ -0,0 +1,44 @@
name: Pre-commit
on:
push:
branches:
- main
- "release/**"
pull_request:
workflow_dispatch:
permissions:
contents: read
jobs:
run-pre-commit:
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: "3.10"
- name: Install uv
uses: astral-sh/setup-uv@v7
- name: Cache uv & pre-commit
uses: actions/cache@v4
with:
path: |
.venv
~/.cache/uv
~/.cache/pre-commit
key: ${{ runner.os }}-uv-${{ hashFiles('uv.lock', '.pre-commit-config.yaml') }}
restore-keys: |
${{ runner.os }}-uv-
- name: Install dev dependencies
run: uv sync --group dev
- name: Run pre-commit
run: uv run pre-commit run --all-files

.pre-commit-config.yaml (new file, +27 lines)
View File

@ -0,0 +1,27 @@
repos:
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v6.0.0
hooks:
- id: trailing-whitespace
- id: check-added-large-files
args: [--maxkb=128]
- id: fix-byte-order-marker
- id: check-case-conflict
- id: check-merge-conflict
- id: check-yaml
- id: end-of-file-fixer
- id: mixed-line-ending
args: [--fix=lf]
- repo: https://github.com/astral-sh/ruff-pre-commit
rev: v0.14.8
hooks:
- id: ruff-check
- id: ruff-format
- repo: https://github.com/codespell-project/codespell
rev: v2.4.1 # Use the latest stable version
hooks:
- id: codespell
additional_dependencies:
- tomli
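A couple of handy invocations against this config (a sketch, assuming the dev group with pre-commit is installed): run a single hook by its id, or bump every pinned `rev` to the latest tag:

```
# Run one hook from this file against the whole repository
uv run pre-commit run ruff-format --all-files

# Update each repo's `rev` to its latest tagged release
uv run pre-commit autoupdate
```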

View File

@ -123,6 +123,22 @@ I haven't invested too much here but some tests exist, especially for the tokeni
python -m pytest tests/test_rustbpe.py -v -s
```
## Pre-commit hooks
Linting and formatting are enforced with [pre-commit](https://pre-commit.com/) both locally and in CI via GitHub Actions. To match the checks that run in PRs:
- Make sure the dev dependency group is installed (`uv sync --group dev`).
- Run the suite on demand: `uv run pre-commit run --all-files`.
- (Optional) Install the git hook once so the checks run automatically on every `git commit`: `uv run pre-commit install`.
Hook coverage (auto-fixes most issues; review and stage the changes afterward):
- [`ruff`](https://github.com/astral-sh/ruff): a fast Rust-based linter and formatter that replaces multiple tools:
- **Linting** (`ruff-check`): removes unused imports (like autoflake), upgrades syntax (like pyupgrade), and sorts imports (like isort).
- **Formatting** (`ruff-format`): applies consistent code formatting (like black), with quote style preserved.
- [`pre-commit-hooks`](https://github.com/pre-commit/pre-commit-hooks): repo hygiene (trim trailing whitespace, enforce LF endings/newlines, detect merge conflicts, block oversized files).
- [`codespell`](https://github.com/codespell-project/codespell): catches common spelling mistakes in code and docs (add false positives to `[tool.codespell].ignore-words-list` in `pyproject.toml`).
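Two more standard pre-commit behaviours worth knowing (a sketch; `SKIP` is a stock pre-commit feature, not something configured in this repo):

```
# Without --all-files, only the currently staged files are checked
# (this is what the installed git hook does on commit)
uv run pre-commit run

# Skip selected hooks for a single commit
SKIP=codespell git commit
```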
## File structure
```

View File

@ -28,24 +28,23 @@ NOTE: You need OpenRouter API key in a file called "openroutertoken.txt" in the
(obviously you can tune this arbitrarily to your liking)
NOTE: For more details see this discussion: https://github.com/karpathy/nanochat/discussions/139
"""
import requests
import copy
import json
import os
import copy
import random
from concurrent.futures import ThreadPoolExecutor, as_completed
import requests
from nanochat.common import get_base_dir
api_key = open("openroutertoken.txt", "r", encoding="utf-8").read().strip()
api_key = open("openroutertoken.txt", encoding="utf-8").read().strip()
url = "https://openrouter.ai/api/v1/chat/completions"
headers = {
"Authorization": f"Bearer {api_key}",
"Content-Type": "application/json"
}
headers = {"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"}
readme = open("README.md", "r", encoding="utf-8").read().strip()
readme = open("README.md", encoding="utf-8").read().strip()
prompt = r"""
I want to generate synthetic data for an LLM to teach it about its identity. Here is the identity I want:
@ -291,22 +290,19 @@ response_format = {
"properties": {
"role": {
"type": "string",
"description": "The role of the speaker, either 'user' or 'assistant'"
"description": "The role of the speaker, either 'user' or 'assistant'",
},
"content": {
"type": "string",
"description": "The message content"
}
"content": {"type": "string", "description": "The message content"},
},
"required": ["role", "content"],
"additionalProperties": False
}
"additionalProperties": False,
},
}
},
"required": ["messages"],
"additionalProperties": False
}
}
"additionalProperties": False,
},
},
}
# Sadly it doesn't seem like Chat completions support `n`
@ -318,6 +314,7 @@ base_payload = {
"temperature": 1.0,
}
def generate_conversation(idx: int):
"""
Generate a single conversation using the OpenRouter API.
@ -357,7 +354,6 @@ print(f"Generating {num_conversations} conversations with {num_workers} workers.
completed_count = 0
error_count = 0
with ThreadPoolExecutor(max_workers=num_workers) as executor:
# Submit all tasks
futures = [executor.submit(generate_conversation, idx) for idx in range(num_conversations)]
@ -369,7 +365,9 @@ with ThreadPoolExecutor(max_workers=num_workers) as executor:
# Lightly validate the conversation structure
for i, message in enumerate(messages):
expected_role = "user" if i % 2 == 0 else "assistant"
assert message['role'] == expected_role, f"Message {i} has role {message['role']} but should be {expected_role}"
assert message['role'] == expected_role, (
f"Message {i} has role {message['role']} but should be {expected_role}"
)
# If all looks good, write the messages to file
with open(output_file, 'a') as f:
@ -384,4 +382,3 @@ with ThreadPoolExecutor(max_workers=num_workers) as executor:
print(f"\nDone! Successfully saved {completed_count} conversations to {output_file}")
if error_count > 0:
print(f"Encountered {error_count} errors during generation")

View File

@ -13,12 +13,13 @@ training latency.
NOTE: This file is meant only as reference/documentation of the
dataset preparation and it is not used during the project runtime.
"""
import os
import time
from datasets import load_dataset
import pyarrow.parquet as pq
import pyarrow as pa
import pyarrow.parquet as pq
from datasets import load_dataset
# Source dataset
dataset_kwargs = {
@ -73,15 +74,20 @@ for doc in ds:
avg_time_per_doc = total_time_spent / total_docs_processed
remaining_time = remaining_docs * avg_time_per_doc
remaining_time_hours = remaining_time / 3600
print(f"Wrote {shard_path}. #documents: {len(shard_docs)} | #characters: {shard_characters} | time: {dt:.2f}s | remaining time: {remaining_time_hours:.2f}h")
print(
f"Wrote {shard_path}. #documents: {len(shard_docs)} | #characters: {shard_characters} | time: {dt:.2f}s | remaining time: {remaining_time_hours:.2f}h"
)
shard_docs = []
shard_characters = 0
shard_index += 1
# Demonstration of how the data was later uploaded to HuggingFace
def upload():
import os
from huggingface_hub import HfApi
token = os.getenv("HF_TOKEN")
api = HfApi(token=token)
api.upload_large_folder(
@ -89,4 +95,6 @@ def upload():
repo_id="karpathy/fineweb-edu-100b-shuffle",
repo_type="dataset",
)
# upload()

View File

@ -2,6 +2,7 @@
Borrowed from modded-nanogpt. By Keller, @vagrawal, et al.
Not a general optimizer! But works for our specific use.
"""
import torch
import torch.distributed as dist
from torch import Tensor
@ -12,7 +13,15 @@ class DistAdamW(torch.optim.Optimizer):
Distributed AdamW optimizer.
In the style of ZeRO-2, i.e. sharded optimizer states and gradient reduction
"""
def __init__(self, param_groups, lr: float = 1e-3, betas: tuple[float, float] = (0.9, 0.999), eps: float = 1e-8, weight_decay: float = 0.01):
def __init__(
self,
param_groups,
lr: float = 1e-3,
betas: tuple[float, float] = (0.9, 0.999),
eps: float = 1e-8,
weight_decay: float = 0.01,
):
defaults = dict(lr=lr, betas=betas, eps=eps, weight_decay=weight_decay)
super().__init__(param_groups, defaults)
@ -30,7 +39,9 @@ class DistAdamW(torch.optim.Optimizer):
grad = params[base_i].grad
rank_size = grad.shape[0] // world_size
grad_slice = torch.empty_like(grad[:rank_size])
reduce_scatter_futures.append(dist.reduce_scatter_tensor(grad_slice, grad, op=dist.ReduceOp.AVG, async_op=True).get_future())
reduce_scatter_futures.append(
dist.reduce_scatter_tensor(grad_slice, grad, op=dist.ReduceOp.AVG, async_op=True).get_future()
)
grad_slices.append(grad_slice)
idx = 0
@ -43,7 +54,7 @@ class DistAdamW(torch.optim.Optimizer):
reduce_scatter_futures[idx].wait()
p = params[base]
rank_size = p.shape[0] // world_size
p_slice = p[rank * rank_size:(rank + 1) * rank_size]
p_slice = p[rank * rank_size : (rank + 1) * rank_size]
lr = group['lr'] * getattr(p, "lr_mul", 1.0)
state = self.state[p]
g_slice = grad_slices[idx]
@ -64,8 +75,8 @@ class DistAdamW(torch.optim.Optimizer):
exp_avg.mul_(beta1).add_(g_slice, alpha=1 - beta1)
exp_avg_sq.mul_(beta2).addcmul_(g_slice, g_slice, value=1 - beta2)
# bias corrections
bias1 = 1 - beta1 ** t
bias2 = 1 - beta2 ** t
bias1 = 1 - beta1**t
bias2 = 1 - beta2**t
# compute step
denom = exp_avg_sq.sqrt().add_(eps)
step_size = lr * (torch.sqrt(bias2) / bias1)

View File

@ -1,25 +1,29 @@
"""
Utilities for saving and loading model/optim/state checkpoints.
"""
import os
import re
import glob
import json
import logging
import os
import re
import torch
from nanochat.common import get_base_dir
from nanochat.common import get_base_dir, setup_default_logging
from nanochat.gpt import GPT, GPTConfig
from nanochat.tokenizer import get_tokenizer
from nanochat.common import setup_default_logging
# Set up logging
setup_default_logging()
logger = logging.getLogger(__name__)
def log0(message):
if int(os.environ.get('RANK', 0)) == 0:
logger.info(message)
def save_checkpoint(checkpoint_dir, step, model_data, optimizer_data, meta_data, rank=0):
if rank == 0:
os.makedirs(checkpoint_dir, exist_ok=True)
@ -38,6 +42,7 @@ def save_checkpoint(checkpoint_dir, step, model_data, optimizer_data, meta_data,
torch.save(optimizer_data, optimizer_path)
logger.info(f"Saved optimizer state to: {optimizer_path}")
def load_checkpoint(checkpoint_dir, step, device, load_optimizer=False, rank=0):
# Load the model state
model_path = os.path.join(checkpoint_dir, f"model_{step:06d}.pt")
@ -49,7 +54,7 @@ def load_checkpoint(checkpoint_dir, step, device, load_optimizer=False, rank=0):
optimizer_data = torch.load(optimizer_path, map_location=device)
# Load the metadata
meta_path = os.path.join(checkpoint_dir, f"meta_{step:06d}.json")
with open(meta_path, "r", encoding="utf-8") as f:
with open(meta_path, encoding="utf-8") as f:
meta_data = json.load(f)
return model_data, optimizer_data, meta_data
@ -66,10 +71,7 @@ def build_model(checkpoint_dir, step, device, phase):
model_data, optimizer_data, meta_data = load_checkpoint(checkpoint_dir, step, device, load_optimizer=False)
if device.type in {"cpu", "mps"}:
# Convert bfloat16 tensors to float for CPU inference
model_data = {
k: v.float() if v.dtype == torch.bfloat16 else v
for k, v in model_data.items()
}
model_data = {k: v.float() if v.dtype == torch.bfloat16 else v for k, v in model_data.items()}
# Hack: fix torch compile issue, which prepends all keys with _orig_mod.
model_data = {k.removeprefix("_orig_mod."): v for k, v in model_data.items()}
model_config_kwargs = meta_data["model_config"]
@ -121,9 +123,11 @@ def find_last_step(checkpoint_dir):
last_step = int(max(os.path.basename(f).split("_")[-1].split(".")[0] for f in checkpoint_files))
return last_step
# -----------------------------------------------------------------------------
# convenience functions that take into account nanochat's directory structure
def load_model_from_dir(checkpoints_dir, device, phase, model_tag=None, step=None):
if model_tag is None:
# guess the model tag by defaulting to the largest model
@ -139,6 +143,7 @@ def load_model_from_dir(checkpoints_dir, device, phase, model_tag=None, step=Non
model, tokenizer, meta_data = build_model(checkpoint_dir, step, device, phase)
return model, tokenizer, meta_data
def load_model(source, *args, **kwargs):
model_dir = {
"base": "base_checkpoints",

View File

@ -2,16 +2,19 @@
Common utilities for nanochat.
"""
import logging
import os
import re
import logging
import urllib.request
import torch
import torch.distributed as dist
from filelock import FileLock
class ColoredFormatter(logging.Formatter):
"""Custom formatter that adds colors to log messages."""
# ANSI color codes
COLORS = {
'DEBUG': '\033[36m', # Cyan
@ -22,6 +25,7 @@ class ColoredFormatter(logging.Formatter):
}
RESET = '\033[0m'
BOLD = '\033[1m'
def format(self, record):
# Add color to the level name
levelname = record.levelname
@ -36,17 +40,17 @@ class ColoredFormatter(logging.Formatter):
message = re.sub(r'(Shard \d+)', rf'{self.COLORS["INFO"]}{self.BOLD}\1{self.RESET}', message)
return message
def setup_default_logging():
handler = logging.StreamHandler()
handler.setFormatter(ColoredFormatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s'))
logging.basicConfig(
level=logging.INFO,
handlers=[handler]
)
logging.basicConfig(level=logging.INFO, handlers=[handler])
setup_default_logging()
logger = logging.getLogger(__name__)
def get_base_dir():
# co-locate nanochat intermediates with other cached data in ~/.cache (by default)
if os.environ.get("NANOCHAT_BASE_DIR"):
@ -58,6 +62,7 @@ def get_base_dir():
os.makedirs(nanochat_dir, exist_ok=True)
return nanochat_dir
def download_file_with_lock(url, filename, postprocess_fn=None):
"""
Downloads a file from a URL to a local path in the base directory.
@ -94,11 +99,13 @@ def download_file_with_lock(url, filename, postprocess_fn=None):
return file_path
def print0(s="",**kwargs):
def print0(s="", **kwargs):
ddp_rank = int(os.environ.get('RANK', 0))
if ddp_rank == 0:
print(s, **kwargs)
def print_banner():
# Cool DOS Rebel font ASCII banner made with https://manytools.org/hacker-tools/ascii-banner/
banner = """
@ -113,10 +120,12 @@ def print_banner():
"""
print0(banner)
def is_ddp():
# TODO is there a proper way
return int(os.environ.get('RANK', -1)) != -1
def get_dist_info():
if is_ddp():
assert all(var in os.environ for var in ['RANK', 'LOCAL_RANK', 'WORLD_SIZE'])
@ -127,6 +136,7 @@ def get_dist_info():
else:
return False, 0, 0, 1
def autodetect_device_type():
# prefer to use CUDA if available, otherwise use MPS, otherwise fallback on CPU
if torch.cuda.is_available():
@ -138,14 +148,19 @@ def autodetect_device_type():
print0(f"Autodetected device type: {device_type}")
return device_type
def compute_init(device_type="cuda"): # cuda|cpu|mps
"""Basic initialization that we keep doing over and over, so make common."""
assert device_type in ["cuda", "mps", "cpu"], "Invalid device type atm"
if device_type == "cuda":
assert torch.cuda.is_available(), "Your PyTorch installation is not configured for CUDA but device_type is 'cuda'"
assert torch.cuda.is_available(), (
"Your PyTorch installation is not configured for CUDA but device_type is 'cuda'"
)
if device_type == "mps":
assert torch.backends.mps.is_available(), "Your PyTorch installation is not configured for MPS but device_type is 'mps'"
assert torch.backends.mps.is_available(), (
"Your PyTorch installation is not configured for MPS but device_type is 'mps'"
)
# Reproducibility
# Note that we set the global seeds here, but most of the code uses explicit rng objects.
@ -175,16 +190,21 @@ def compute_init(device_type="cuda"): # cuda|cpu|mps
return ddp, ddp_rank, ddp_local_rank, ddp_world_size, device
def compute_cleanup():
"""Companion function to compute_init, to clean things up before script exit"""
if is_ddp():
dist.destroy_process_group()
class DummyWandb:
"""Useful if we wish to not use wandb but have all the same signatures"""
def __init__(self):
pass
def log(self, *args, **kwargs):
pass
def finish(self):
pass

View File

@ -18,11 +18,13 @@ import os
import sys
from ast import literal_eval
def print0(s="",**kwargs):
def print0(s="", **kwargs):
ddp_rank = int(os.environ.get('RANK', 0))
if ddp_rank == 0:
print(s, **kwargs)
for arg in sys.argv[1:]:
if '=' not in arg:
# assume it's the name of a config file

View File

@ -5,15 +5,17 @@ https://arxiv.org/abs/2406.11794
TODOs:
- All tasks ~match except for squad. We get 31% reference is 37%. Figure out why.
"""
import random
from jinja2 import Template
import torch
import torch.distributed as dist
from jinja2 import Template
# -----------------------------------------------------------------------------
# Prompt rendering utilities
def render_prompts_mc(item, continuation_delimiter, fewshot_examples=None):
"""Render complete prompts for a multiple choice question"""
template_str = """
@ -24,11 +26,7 @@ def render_prompts_mc(item, continuation_delimiter, fewshot_examples=None):
{{ item.query }}{{ continuation_delimiter }}{{ choice }}""".strip()
template = Template(template_str)
fewshot_examples = fewshot_examples or []
context = {
'fewshot_examples': fewshot_examples,
'continuation_delimiter': continuation_delimiter,
'item': item
}
context = {'fewshot_examples': fewshot_examples, 'continuation_delimiter': continuation_delimiter, 'item': item}
prompts = [template.render(choice=choice, **context) for choice in item['choices']]
return prompts
@ -43,13 +41,8 @@ def render_prompts_schema(item, continuation_delimiter, fewshot_examples=None):
{{ context }}{{ continuation_delimiter }}{{ item.continuation }}""".strip()
template = Template(template_str)
fewshot_examples = fewshot_examples or []
context = {
'fewshot_examples': fewshot_examples,
'continuation_delimiter': continuation_delimiter,
'item': item
}
prompts = [template.render(context=context_option, **context)
for context_option in item['context_options']]
context = {'fewshot_examples': fewshot_examples, 'continuation_delimiter': continuation_delimiter, 'item': item}
prompts = [template.render(context=context_option, **context) for context_option in item['context_options']]
return prompts
@ -67,11 +60,7 @@ def render_prompts_lm(item, continuation_delimiter, fewshot_examples=None):
{{ item.context | trim }}{{ continuation_delimiter }}{% if include_continuation %}{{ item.continuation }}{% endif %}""".strip()
template = Template(template_str)
fewshot_examples = fewshot_examples or []
context = {
'fewshot_examples': fewshot_examples,
'continuation_delimiter': continuation_delimiter,
'item': item
}
context = {'fewshot_examples': fewshot_examples, 'continuation_delimiter': continuation_delimiter, 'item': item}
# Return two prompts: without and with the continuation
prompt_without = template.render(include_continuation=False, **context)
prompt_with = template.render(include_continuation=True, **context)
@ -89,10 +78,7 @@ def find_common_length(token_sequences, direction='left'):
- direction: 'left' for prefix, 'right' for suffix
"""
min_len = min(len(seq) for seq in token_sequences)
indices = {
'left': range(min_len),
'right': range(-1, -min_len-1, -1)
}[direction]
indices = {'left': range(min_len), 'right': range(-1, -min_len - 1, -1)}[direction]
# Find the first position where the token sequences differ
for i, idx in enumerate(indices):
token = token_sequences[0][idx]
@ -106,7 +92,7 @@ def stack_sequences(tokens, pad_token_id):
bsz, seq_len = len(tokens), max(len(x) for x in tokens)
input_ids = torch.full((bsz, seq_len), pad_token_id, dtype=torch.long)
for i, x in enumerate(tokens):
input_ids[i, :len(x)] = torch.tensor(x, dtype=torch.long)
input_ids[i, : len(x)] = torch.tensor(x, dtype=torch.long)
return input_ids
@ -153,9 +139,7 @@ def forward_model(model, input_ids):
target_ids = torch.roll(input_ids, shifts=-1, dims=1)
# Calculate cross entropy at all positions
losses = torch.nn.functional.cross_entropy(
outputs.view(batch_size * seq_len, -1),
target_ids.view(batch_size * seq_len),
reduction='none'
outputs.view(batch_size * seq_len, -1), target_ids.view(batch_size * seq_len), reduction='none'
).view(batch_size, seq_len)
# Set the last column to be nan because there is no autoregressive loss there
losses[:, -1] = float('nan')
@ -226,13 +210,12 @@ def evaluate_example(idx, model, tokenizer, data, device, task_meta):
si = start_idxs[0]
ei = end_idxs[0]
# predictions[i] predict input_ids[i+1] autoregressively
predicted_tokens = predictions[0, si-1:ei-1]
predicted_tokens = predictions[0, si - 1 : ei - 1]
actual_tokens = input_ids[0, si:ei]
is_correct = torch.all(predicted_tokens == actual_tokens).item()
elif task_type in ['multiple_choice', 'schema']:
# For MC/schema: find the option with lowest average loss
mean_losses = [losses[i, si-1:ei-1].mean().item()
for i, (si, ei) in enumerate(zip(start_idxs, end_idxs))]
mean_losses = [losses[i, si - 1 : ei - 1].mean().item() for i, (si, ei) in enumerate(zip(start_idxs, end_idxs))]
pred_idx = mean_losses.index(min(mean_losses))
is_correct = pred_idx == item['gold']
else:

View File

@ -1,13 +1,16 @@
from collections import deque
import torch
import pyarrow.parquet as pq
import torch
from nanochat.common import get_dist_info
from nanochat.dataset import list_parquet_files
from nanochat.tokenizer import get_tokenizer
def tokenizing_distributed_data_loader_with_state(B, T, split, tokenizer_threads=4, tokenizer_batch_size=128, device="cuda", resume_state_dict=None):
def tokenizing_distributed_data_loader_with_state(
B, T, split, tokenizer_threads=4, tokenizer_batch_size=128, device="cuda", resume_state_dict=None
):
"""
Stream pretraining text from parquet files, tokenize, yield training batches.
@ -24,6 +27,7 @@ def tokenizing_distributed_data_loader_with_state(B, T, split, tokenizer_threads
# infinite iterator over document batches (list of text strings)
ddp, ddp_rank, ddp_local_rank, ddp_world_size = get_dist_info()
def document_batches():
parquet_paths = list_parquet_files()
parquet_paths = parquet_paths[:-1] if split == "train" else parquet_paths[-1:]
@ -48,9 +52,10 @@ def tokenizing_distributed_data_loader_with_state(B, T, split, tokenizer_threads
batch = rg.column('text').to_pylist() # each batch is a parquet group, e.g. 1024 rows
# the tokenizer encode might want to go in even smaller batches, e.g. 128 rows
for i in range(0, len(batch), tokenizer_batch_size):
yield batch[i:i+tokenizer_batch_size], (pq_idx, rg_idx)
yield batch[i : i + tokenizer_batch_size], (pq_idx, rg_idx)
rg_idx += ddp_world_size # advance to the next row group (in DDP)
pq_idx += 1 # advance to the next parquet file
batches = document_batches()
# Now emit batches of tokens.
@ -78,9 +83,13 @@ def tokenizing_distributed_data_loader_with_state(B, T, split, tokenizer_threads
# Reshape to 2D and move to GPU async
inputs = inputs_cpu.view(B, T).to(device=device, non_blocking=use_cuda_optimizations)
targets = targets_cpu.view(B, T).to(device=device, non_blocking=use_cuda_optimizations)
state_dict = {"pq_idx": pq_idx, "rg_idx": rg_idx} # we need this in case we wish to approximately resume training
state_dict = {
"pq_idx": pq_idx,
"rg_idx": rg_idx,
} # we need this in case we wish to approximately resume training
yield inputs, targets, state_dict
def tokenizing_distributed_data_loader(*args, **kwargs):
# helper function that only emits the inputs/targets and not the state_dict
for inputs, targets, state_dict in tokenizing_distributed_data_loader_with_state(*args, **kwargs):

View File

@ -7,13 +7,14 @@ This file contains utilities for:
For details of how the dataset was prepared, see `repackage_data_reference.py`.
"""
import os
import argparse
import os
import time
import requests
import pyarrow.parquet as pq
from multiprocessing import Pool
import pyarrow.parquet as pq
import requests
from nanochat.common import get_base_dir
# -----------------------------------------------------------------------------
@ -30,16 +31,15 @@ os.makedirs(DATA_DIR, exist_ok=True)
# -----------------------------------------------------------------------------
# These functions are useful utilities to other modules, can/should be imported
def list_parquet_files(data_dir=None):
""" Looks into a data dir and returns full paths to all parquet files. """
"""Looks into a data dir and returns full paths to all parquet files."""
data_dir = DATA_DIR if data_dir is None else data_dir
parquet_files = sorted([
f for f in os.listdir(data_dir)
if f.endswith('.parquet') and not f.endswith('.tmp')
])
parquet_files = sorted([f for f in os.listdir(data_dir) if f.endswith('.parquet') and not f.endswith('.tmp')])
parquet_paths = [os.path.join(data_dir, f) for f in parquet_files]
return parquet_paths
def parquets_iter_batched(split, start=0, step=1):
"""
Iterate through the dataset, in batches of underlying row_groups for efficiency.
@ -56,9 +56,10 @@ def parquets_iter_batched(split, start=0, step=1):
texts = rg.column('text').to_pylist()
yield texts
# -----------------------------------------------------------------------------
def download_single_file(index):
""" Downloads a single file index, with some backoff """
"""Downloads a single file index, with some backoff"""
# Construct the local filepath for this file and skip if it already exists
filename = index_to_filename(index)
@ -78,7 +79,7 @@ def download_single_file(index):
response = requests.get(url, stream=True, timeout=30)
response.raise_for_status()
# Write to temporary file first
temp_path = filepath + f".tmp"
temp_path = filepath + ".tmp"
with open(temp_path, 'wb') as f:
for chunk in response.iter_content(chunk_size=1024 * 1024): # 1MB chunks
if chunk:
@ -88,10 +89,10 @@ def download_single_file(index):
print(f"Successfully downloaded {filename}")
return True
except (requests.RequestException, IOError) as e:
except (OSError, requests.RequestException) as e:
print(f"Attempt {attempt}/{max_attempts} failed for {filename}: {e}")
# Clean up any partial files
for path in [filepath + f".tmp", filepath]:
for path in [filepath + ".tmp", filepath]:
if os.path.exists(path):
try:
os.remove(path)
@ -99,7 +100,7 @@ def download_single_file(index):
pass
# Try a few times with exponential backoff: 2^attempt seconds
if attempt < max_attempts:
wait_time = 2 ** attempt
wait_time = 2**attempt
print(f"Waiting {wait_time} seconds before retry...")
time.sleep(wait_time)
else:
@ -111,8 +112,12 @@ def download_single_file(index):
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Download FineWeb-Edu 100BT dataset shards")
parser.add_argument("-n", "--num-files", type=int, default=-1, help="Number of shards to download (default: -1), -1 = disable")
parser.add_argument("-w", "--num-workers", type=int, default=4, help="Number of parallel download workers (default: 4)")
parser.add_argument(
"-n", "--num-files", type=int, default=-1, help="Number of shards to download (default: -1), -1 = disable"
)
parser.add_argument(
"-w", "--num-workers", type=int, default=4, help="Number of parallel download workers (default: 4)"
)
args = parser.parse_args()
num = MAX_SHARD + 1 if args.num_files == -1 else min(args.num_files, MAX_SHARD + 1)

View File

@ -11,15 +11,17 @@ Notes:
The whole thing is made as efficient as possible.
"""
import torch
import torch.nn.functional as F
import signal
import warnings
from contextlib import contextmanager
from collections import deque
from nanochat.common import compute_init, autodetect_device_type
from contextlib import contextmanager, nullcontext
import torch
import torch.nn.functional as F
from nanochat.checkpoint_manager import load_model
from contextlib import nullcontext
from nanochat.common import autodetect_device_type, compute_init
# -----------------------------------------------------------------------------
# Calculator tool helpers
@ -33,17 +35,19 @@ def timeout(duration, formula):
yield
signal.alarm(0)
def eval_with_timeout(formula, max_time=3):
try:
with timeout(max_time, formula):
with warnings.catch_warnings():
warnings.simplefilter("ignore", SyntaxWarning)
return eval(formula, {"__builtins__": {}}, {})
except Exception as e:
except Exception:
signal.alarm(0)
# print(f"Warning: Failed to eval {formula}, exception: {e}") # it's ok ignore wrong calculator usage
return None
def use_calculator(expr):
"""
Evaluate a Python expression safely.
@ -65,9 +69,25 @@ def use_calculator(expr):
return None
# Disallow dangerous patterns
dangerous_patterns = ['__', 'import', 'exec', 'eval', 'compile', 'open', 'file',
'input', 'raw_input', 'globals', 'locals', 'vars', 'dir',
'getattr', 'setattr', 'delattr', 'hasattr']
dangerous_patterns = [
'__',
'import',
'exec',
'eval',
'compile',
'open',
'file',
'input',
'raw_input',
'globals',
'locals',
'vars',
'dir',
'getattr',
'setattr',
'delattr',
'hasattr',
]
expr_lower = expr.lower()
if any(pattern in expr_lower for pattern in dangerous_patterns):
return None
@ -79,6 +99,7 @@ def use_calculator(expr):
# Evaluate with timeout
return eval_with_timeout(expr)
# -----------------------------------------------------------------------------
class KVCache:
"""
@ -122,7 +143,7 @@ class KVCache:
dtype, device = other.kv_cache.dtype, other.kv_cache.device
self.kv_cache = torch.empty(self.kv_shape, dtype=dtype, device=device)
# 3) copy the data over
self.kv_cache[:, :, :, :, :other.pos, :] = other.kv_cache
self.kv_cache[:, :, :, :, : other.pos, :] = other.kv_cache
# 4) update the pos
self.pos = other.pos
@ -173,8 +194,10 @@ def sample_next_token(logits, rng, temperature=1.0, top_k=None):
probs = F.softmax(logits, dim=-1)
return torch.multinomial(probs, num_samples=1, generator=rng)
# -----------------------------------------------------------------------------
class RowState:
# Per-row state tracking during generation
def __init__(self, current_tokens=None):
@ -184,8 +207,8 @@ class RowState:
self.python_expr_tokens = [] # Tokens of the current python expression
self.completed = False # Whether this row has completed generation
class Engine:
class Engine:
def __init__(self, model, tokenizer):
self.model = model
self.tokenizer = tokenizer # needed for tool use
@ -327,10 +350,13 @@ if __name__ == "__main__":
is equivalent to the faster Engine.generate function here.
"""
import time
# init compute
ddp, ddp_rank, ddp_local_rank, ddp_world_size, device = compute_init()
device_type = autodetect_device_type()
autocast_ctx = torch.amp.autocast(device_type=device_type, dtype=torch.bfloat16) if device_type == "cuda" else nullcontext()
autocast_ctx = (
torch.amp.autocast(device_type=device_type, dtype=torch.bfloat16) if device_type == "cuda" else nullcontext()
)
# load the model and tokenizer
model, tokenizer, meta = load_model("base", device, phase="eval")

View File

@ -30,17 +30,18 @@ import platform
import signal
import tempfile
from dataclasses import dataclass
from typing import Optional
# -----------------------------------------------------------------------------
@dataclass
class ExecutionResult:
"""Result of executing Python code in a sandbox."""
success: bool
stdout: str
stderr: str
error: Optional[str] = None
error: str | None = None
timeout: bool = False
memory_exceeded: bool = False
@ -101,13 +102,13 @@ class WriteOnlyStringIO(io.StringIO):
"""StringIO that throws an exception when it's read from"""
def read(self, *args, **kwargs):
raise IOError
raise OSError
def readline(self, *args, **kwargs):
raise IOError
raise OSError
def readlines(self, *args, **kwargs):
raise IOError
raise OSError
def readable(self, *args, **kwargs):
"""Returns True if the IO object can be read."""
@ -131,7 +132,7 @@ def chdir(root):
os.chdir(cwd)
def reliability_guard(maximum_memory_bytes: Optional[int] = None):
def reliability_guard(maximum_memory_bytes: int | None = None):
"""
This disables various destructive functions and prevents the generated code
from interfering with the test (e.g. fork bomb, killing other processes,
@ -147,6 +148,7 @@ def reliability_guard(maximum_memory_bytes: Optional[int] = None):
if platform.uname().system != "Darwin":
# These resource limit calls seem to fail on macOS (Darwin), skip?
import resource
resource.setrlimit(resource.RLIMIT_AS, (maximum_memory_bytes, maximum_memory_bytes))
resource.setrlimit(resource.RLIMIT_DATA, (maximum_memory_bytes, maximum_memory_bytes))
resource.setrlimit(resource.RLIMIT_STACK, (maximum_memory_bytes, maximum_memory_bytes))
@ -211,10 +213,9 @@ def reliability_guard(maximum_memory_bytes: Optional[int] = None):
sys.modules["tkinter"] = None
def _unsafe_execute(code: str, timeout: float, maximum_memory_bytes: Optional[int], result_dict):
def _unsafe_execute(code: str, timeout: float, maximum_memory_bytes: int | None, result_dict):
"""Execute code in a subprocess with safety guards. Results are written to result_dict."""
with create_tempdir():
# These system calls are needed when cleaning up tempdir.
import os
import shutil
@ -228,14 +229,16 @@ def _unsafe_execute(code: str, timeout: float, maximum_memory_bytes: Optional[in
reliability_guard(maximum_memory_bytes=maximum_memory_bytes)
# Default to failure
result_dict.update({
result_dict.update(
{
"success": False,
"stdout": "",
"stderr": "",
"timeout": False,
"memory_exceeded": False,
"error": None,
})
}
)
try:
exec_globals = {}
@ -253,28 +256,36 @@ def _unsafe_execute(code: str, timeout: float, maximum_memory_bytes: Optional[in
# uncomment the following line and proceed at your own risk:
exec(code, exec_globals)
result_dict.update({
result_dict.update(
{
"success": True,
"stdout": stdout_capture.getvalue(),
"stderr": stderr_capture.getvalue(),
})
}
)
except TimeoutException:
result_dict.update({
result_dict.update(
{
"timeout": True,
"error": "Execution timed out",
})
}
)
except MemoryError as e:
result_dict.update({
result_dict.update(
{
"memory_exceeded": True,
"error": f"Memory limit exceeded: {e}",
})
}
)
except BaseException as e:
result_dict.update({
result_dict.update(
{
"error": f"{type(e).__name__}: {e}",
})
}
)
# Needed for cleaning up.
shutil.rmtree = rmtree
@ -286,7 +297,7 @@ def _unsafe_execute(code: str, timeout: float, maximum_memory_bytes: Optional[in
def execute_code(
code: str,
timeout: float = 5.0, # 5 seconds default
maximum_memory_bytes: Optional[int] = 256 * 1024 * 1024, # 256MB default
maximum_memory_bytes: int | None = 256 * 1024 * 1024, # 256MB default
) -> ExecutionResult:
"""
Execute Python code in a sandboxed environment.
@ -310,10 +321,7 @@ def execute_code(
manager = multiprocessing.Manager()
result_dict = manager.dict()
p = multiprocessing.Process(
target=_unsafe_execute,
args=(code, timeout, maximum_memory_bytes, result_dict)
)
p = multiprocessing.Process(target=_unsafe_execute, args=(code, timeout, maximum_memory_bytes, result_dict))
p.start()
p.join(timeout=timeout + 1)
@ -346,4 +354,3 @@ def execute_code(
timeout=result_dict["timeout"],
memory_exceeded=result_dict["memory_exceeded"],
)

View File

@ -12,16 +12,17 @@ Notable features:
"""
import math
from functools import partial
from dataclasses import dataclass
from functools import partial
import torch
import torch.nn as nn
import torch.nn.functional as F
from nanochat.common import get_dist_info, print0
from nanochat.muon import Muon, DistMuon
from nanochat.adamw import DistAdamW
from nanochat.common import get_dist_info
from nanochat.muon import DistMuon, Muon
@dataclass
class GPTConfig:
@ -48,6 +49,7 @@ def apply_rotary_emb(x, cos, sin):
out = out.to(x.dtype) # ensure input/output dtypes match
return out
class CausalSelfAttention(nn.Module):
def __init__(self, config, layer_idx):
super().__init__()
@ -75,7 +77,11 @@ class CausalSelfAttention(nn.Module):
cos, sin = cos_sin
q, k = apply_rotary_emb(q, cos, sin), apply_rotary_emb(k, cos, sin) # QK rotary embedding
q, k = norm(q), norm(k) # QK norm
q, k, v = q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2) # make head be batch dim, i.e. (B, T, H, D) -> (B, H, T, D)
q, k, v = (
q.transpose(1, 2),
k.transpose(1, 2),
v.transpose(1, 2),
) # make head be batch dim, i.e. (B, T, H, D) -> (B, H, T, D)
# Apply KV cache: insert current k,v into cache, get the full view so far
if kv_cache is not None:
@ -84,7 +90,9 @@ class CausalSelfAttention(nn.Module):
Tk = k.size(2) # number of keys/values in total (in the cache + current forward pass)
# Attention: queries attend to keys/values autoregressively. A few cases to handle:
enable_gqa = self.n_head != self.n_kv_head # Group Query Attention (GQA): duplicate key/value heads to match query heads if desired
enable_gqa = (
self.n_head != self.n_kv_head
) # Group Query Attention (GQA): duplicate key/value heads to match query heads if desired
if kv_cache is None or Tq == Tk:
# During training (no KV cache), attend as usual with causal attention
# And even if there is KV cache, we can still use this simple version when Tq == Tk
@ -139,10 +147,12 @@ class GPT(nn.Module):
def __init__(self, config):
super().__init__()
self.config = config
self.transformer = nn.ModuleDict({
self.transformer = nn.ModuleDict(
{
"wte": nn.Embedding(config.vocab_size, config.n_embd),
"h": nn.ModuleList([Block(config, layer_idx) for layer_idx in range(config.n_layer)]),
})
}
)
self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
# To support meta device initialization, we init the rotary embeddings here, but it's fake
# As for rotary_seq_len, these rotary embeddings are pretty small/cheap in memory,
@ -203,10 +213,15 @@ class GPT(nn.Module):
return self.transformer.wte.weight.device
def estimate_flops(self):
""" Return the estimated FLOPs per token for the model. Ref: https://arxiv.org/abs/2204.02311 """
"""Return the estimated FLOPs per token for the model. Ref: https://arxiv.org/abs/2204.02311"""
nparams = sum(p.numel() for p in self.parameters())
nparams_embedding = self.transformer.wte.weight.numel()
l, h, q, t = self.config.n_layer, self.config.n_head, self.config.n_embd // self.config.n_head, self.config.sequence_len
l, h, q, t = (
self.config.n_layer,
self.config.n_head,
self.config.n_embd // self.config.n_head,
self.config.sequence_len,
)
num_flops_per_token = 6 * (nparams - nparams_embedding) + 12 * l * h * q * t
return num_flops_per_token
@ -245,12 +260,16 @@ class GPT(nn.Module):
B, T = idx.size()
# Grab the rotary embeddings for the current sequence length (they are of shape (1, seq_len, 1, head_dim/2))
assert T <= self.cos.size(1), f"Sequence length grew beyond the rotary embeddings cache: {T} > {self.cos.size(1)}"
assert idx.device == self.cos.device, f"Rotary embeddings and idx are on different devices: {idx.device} != {self.cos.device}"
assert T <= self.cos.size(1), (
f"Sequence length grew beyond the rotary embeddings cache: {T} > {self.cos.size(1)}"
)
assert idx.device == self.cos.device, (
f"Rotary embeddings and idx are on different devices: {idx.device} != {self.cos.device}"
)
assert self.cos.dtype == torch.bfloat16, "Rotary embeddings must be in bfloat16"
# if kv cache exists, we need to offset the rotary embeddings to the current position in the cache
T0 = 0 if kv_cache is None else kv_cache.get_pos()
cos_sin = self.cos[:, T0:T0+T], self.sin[:, T0:T0+T] # truncate cache to current sequence length
cos_sin = self.cos[:, T0 : T0 + T], self.sin[:, T0 : T0 + T] # truncate cache to current sequence length
# Forward the trunk of the Transformer
x = self.transformer.wte(idx)
@ -267,7 +286,9 @@ class GPT(nn.Module):
logits = self.lm_head(x)
logits = softcap * torch.tanh(logits / softcap) # logits softcap
logits = logits.float() # use tf32/fp32 for logits
loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1, reduction=loss_reduction)
loss = F.cross_entropy(
logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1, reduction=loss_reduction
)
return loss
else:
# inference mode: compute and return the logits

View File

@ -1,10 +1,13 @@
"""
A number of functions that help with evaluating a base model.
"""
import math
import torch
import torch.distributed as dist
@torch.no_grad()
def evaluate_bpb(model, batches, steps, token_bytes):
"""
@ -39,11 +42,7 @@ def evaluate_bpb(model, batches, steps, token_bytes):
valid = y >= 0
y_safe = torch.where(valid, y, torch.zeros_like(y))
# map valid targets to their byte length; ignored targets contribute 0 bytes
num_bytes2d = torch.where(
valid,
token_bytes[y_safe],
torch.zeros_like(y, dtype=token_bytes.dtype)
)
num_bytes2d = torch.where(valid, token_bytes[y_safe], torch.zeros_like(y, dtype=token_bytes.dtype))
total_nats += (loss2d * (num_bytes2d > 0)).sum()
total_bytes += num_bytes2d.sum()
else:

View File

@ -2,9 +2,11 @@
Muon optimizer from Keller et al.
Also a lot of borrowing of ideas from modded-nanogpt.
"""
import torch
from torch import Tensor
import torch.distributed as dist
from torch import Tensor
@torch.compile
def zeropower_via_newtonschulz5(G: Tensor, steps: int) -> Tensor:
@ -17,7 +19,9 @@ def zeropower_via_newtonschulz5(G: Tensor, steps: int) -> Tensor:
where S' is diagonal with S_{ii}' ~ Uniform(0.5, 1.5), which turns out not to hurt model
performance at all relative to UV^T, where USV^T = G is the SVD.
"""
assert G.ndim >= 2 # batched Muon implementation by @scottjmaddox, and put into practice in the record by @YouJiacheng
assert (
G.ndim >= 2
) # batched Muon implementation by @scottjmaddox, and put into practice in the record by @YouJiacheng
a, b, c = (3.4445, -4.7750, 2.0315)
X = G.bfloat16()
if G.size(-2) > G.size(-1):
@ -28,13 +32,16 @@ def zeropower_via_newtonschulz5(G: Tensor, steps: int) -> Tensor:
# Perform the NS iterations
for _ in range(steps):
A = X @ X.mT
B = b * A + c * A @ A # quintic computation strategy adapted from suggestion by @jxbz, @leloykun, and @YouJiacheng
B = (
b * A + c * A @ A
) # quintic computation strategy adapted from suggestion by @jxbz, @leloykun, and @YouJiacheng
X = a * X + B @ X
if G.size(-2) > G.size(-1):
X = X.mT
return X
class Muon(torch.optim.Optimizer):
"""
Muon - MomentUm Orthogonalized by Newton-schulz
@ -57,6 +64,7 @@ class Muon(torch.optim.Optimizer):
nesterov: Whether to use Nesterov-style momentum in the internal SGD. (recommended)
ns_steps: The number of Newton-Schulz iteration steps to use.
"""
def __init__(self, params, lr=0.02, momentum=0.95, nesterov=True, ns_steps=5):
defaults = dict(lr=lr, momentum=momentum, nesterov=nesterov, ns_steps=ns_steps)
params: list[Tensor] = [*params]
@ -80,7 +88,7 @@ class Muon(torch.optim.Optimizer):
buf.lerp_(g, 1 - group["momentum"])
g = g.lerp_(buf, group["momentum"]) if group["nesterov"] else buf
g = zeropower_via_newtonschulz5(g, steps=group["ns_steps"])
p.add_(g, alpha=-group["lr"] * max(1, p.size(-2) / p.size(-1))**0.5)
p.add_(g, alpha=-group["lr"] * max(1, p.size(-2) / p.size(-1)) ** 0.5)
class DistMuon(torch.optim.Optimizer):
@ -104,8 +112,8 @@ class DistMuon(torch.optim.Optimizer):
nesterov: if True, Nesterov-style update (g <- lerp(g, buf, momentum)); else use buf
ns_steps: number of NewtonSchulz iterations for the orthogonalization
"""
def __init__(self, params, lr: float = 0.02, momentum: float = 0.95,
nesterov: bool = True, ns_steps: int = 5):
def __init__(self, params, lr: float = 0.02, momentum: float = 0.95, nesterov: bool = True, ns_steps: int = 5):
defaults = dict(lr=lr, momentum=momentum, nesterov=nesterov, ns_steps=ns_steps)
params = list(params)
assert all(p.ndim == 2 for p in params), "Muon expects 2D parameters only"
@ -129,7 +137,9 @@ class DistMuon(torch.optim.Optimizer):
world_size = dist.get_world_size()
# Ensure all grads exist
assert all(p.grad is not None for group in self.param_groups for p in group["params"]), "All params must have grads"
assert all(p.grad is not None for group in self.param_groups for p in group["params"]), (
"All params must have grads"
)
# Kick off all the reduce scatter operations to average up the gradients across all ranks
all_reduce_futures = []
@ -141,7 +151,7 @@ class DistMuon(torch.optim.Optimizer):
# The compute owner of each param is rank i % world_size
owner_idx = base_i + rank
# each rank stacks up its chunk of world_size params into a list
rs_input = [p.grad for p in params[base_i:base_i + world_size]]
rs_input = [p.grad for p in params[base_i : base_i + world_size]]
# pad rs_input with the zero buffer to complete the group
rs_input.extend([zero_buffer] * (world_size - len(rs_input)))
# the output buffer gets strided across the group based on the rank
@ -174,11 +184,11 @@ class DistMuon(torch.optim.Optimizer):
buf.lerp_(g, 1.0 - group["momentum"])
g = g.lerp_(buf, group["momentum"]) if group["nesterov"] else buf
g = zeropower_via_newtonschulz5(g, steps=group["ns_steps"])
scale = (max(1.0, p.size(-2) / p.size(-1)) ** 0.5)
scale = max(1.0, p.size(-2) / p.size(-1)) ** 0.5
p.add_(g, alpha=-group["lr"] * scale)
# Replicate updated parameters to all ranks
ag_input = params[owner_idx] if owner_idx < len(params) else zero_buffer
ag_output = params[base_i:base_i + world_size]
ag_output = params[base_i : base_i + world_size]
ag_output.extend([torch.empty_like(zero_buffer) for _ in range(world_size - len(ag_output))]) # pad
work = dist.all_gather(ag_output, ag_input, async_op=True).get_future()
all_gather_futures.append(work)

View File

@ -2,16 +2,18 @@
Utilities for generating training report cards. More messy code than usual, will fix.
"""
import datetime
import os
import platform
import re
import shutil
import subprocess
import socket
import datetime
import platform
import subprocess
import psutil
import torch
def run_command(cmd):
"""Run a shell command and return output, or None if it fails."""
try:
@ -22,6 +24,7 @@ def run_command(cmd):
except:
return None
def get_git_info():
"""Get current git commit, branch, and dirty status."""
info = {}
@ -38,18 +41,14 @@ def get_git_info():
return info
def get_gpu_info():
"""Get GPU information."""
if not torch.cuda.is_available():
return {"available": False}
num_devices = torch.cuda.device_count()
info = {
"available": True,
"count": num_devices,
"names": [],
"memory_gb": []
}
info = {"available": True, "count": num_devices, "names": [], "memory_gb": []}
for i in range(num_devices):
props = torch.cuda.get_device_properties(i)
@ -61,6 +60,7 @@ def get_gpu_info():
return info
def get_system_info():
"""Get system information."""
info = {}
@ -83,6 +83,7 @@ def get_system_info():
return info
def estimate_cost(gpu_info, runtime_hours=None):
"""Estimate training cost based on GPU type and runtime."""
@ -111,9 +112,10 @@ def estimate_cost(gpu_info, runtime_hours=None):
return {
"hourly_rate": hourly_rate,
"gpu_type": gpu_name,
"estimated_total": hourly_rate * runtime_hours if runtime_hours else None
"estimated_total": hourly_rate * runtime_hours if runtime_hours else None,
}
def generate_header():
"""Generate the header for a training report."""
timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
@ -170,7 +172,7 @@ Generated: {timestamp}
# count dependencies via uv.lock
uv_lock_lines = 0
if os.path.exists('uv.lock'):
with open('uv.lock', 'r', encoding='utf-8') as f:
with open('uv.lock', encoding='utf-8') as f:
uv_lock_lines = len(f.readlines())
header += f"""
@ -184,12 +186,15 @@ Generated: {timestamp}
"""
return header
# -----------------------------------------------------------------------------
def slugify(text):
"""Slugify a text string."""
return text.lower().replace(" ", "-")
# the expected files and their order
EXPECTED_FILES = [
"tokenizer-training.md",
@ -207,6 +212,7 @@ EXPECTED_FILES = [
# the metrics we're currently interested in
chat_metrics = ["ARC-Easy", "ARC-Challenge", "MMLU", "GSM8K", "HumanEval", "ChatCORE"]
def extract(section, keys):
"""simple def to extract a single key from a section"""
if not isinstance(keys, list):
@ -218,6 +224,7 @@ def extract(section, keys):
out[key] = line.split(":")[1].strip()
return out
def extract_timestamp(content, prefix):
"""Extract timestamp from content with given prefix."""
for line in content.split('\n'):
@ -229,6 +236,7 @@ def extract_timestamp(content, prefix):
pass
return None
class Report:
"""Maintains a bunch of logs, generates a final markdown report."""
@ -276,7 +284,7 @@ class Report:
# write the header first
header_file = os.path.join(report_dir, "header.md")
if os.path.exists(header_file):
with open(header_file, "r", encoding="utf-8") as f:
with open(header_file, encoding="utf-8") as f:
header_content = f.read()
out_file.write(header_content)
start_time = extract_timestamp(header_content, "Run started:")
@ -293,7 +301,7 @@ class Report:
if not os.path.exists(section_file):
print(f"Warning: {section_file} does not exist, skipping")
continue
with open(section_file, "r", encoding="utf-8") as in_file:
with open(section_file, encoding="utf-8") as in_file:
section = in_file.read()
# Extract timestamp from this section (the last section's timestamp will "stick" as end_time)
if "rl" not in file_name:
@ -354,7 +362,7 @@ class Report:
else:
out_file.write("Total wall clock time: unknown\n")
# also cp the report.md file to current directory
print(f"Copying report.md to current directory for convenience")
print("Copying report.md to current directory for convenience")
shutil.copy(report_file, "report.md")
return report_file
@ -378,18 +386,23 @@ class Report:
f.write(f"Run started: {start_time}\n\n---\n\n")
print(f"Reset report and wrote header to {header_file}")
# -----------------------------------------------------------------------------
# nanochat-specific convenience functions
class DummyReport:
def log(self, *args, **kwargs):
pass
def reset(self, *args, **kwargs):
pass
def get_report():
# just for convenience, only rank 0 logs to report
from nanochat.common import get_base_dir, get_dist_info
ddp, ddp_rank, ddp_local_rank, ddp_world_size = get_dist_info()
if ddp_rank == 0:
report_dir = os.path.join(get_base_dir(), "report")
@ -397,10 +410,18 @@ def get_report():
else:
return DummyReport()
if __name__ == "__main__":
import argparse
parser = argparse.ArgumentParser(description="Generate or reset nanochat training reports.")
parser.add_argument("command", nargs="?", default="generate", choices=["generate", "reset"], help="Operation to perform (default: generate)")
parser.add_argument(
"command",
nargs="?",
default="generate",
choices=["generate", "reset"],
help="Operation to perform (default: generate)",
)
args = parser.parse_args()
if args.command == "generate":
get_report().generate()

View File

@ -6,8 +6,8 @@ Two implementations are available:
2) Our own RustBPE Tokenizer for training and tiktoken for efficient inference
"""
import os
import copy
import os
from functools import lru_cache
SPECIAL_TOKENS = [
@ -27,15 +27,18 @@ SPECIAL_TOKENS = [
# NOTE: this split pattern deviates from GPT-4 in that we use \p{N}{1,2} instead of \p{N}{1,3}
# I did this because I didn't want to "waste" too many tokens on numbers for smaller vocab sizes.
# I haven't validated that this is actually a good idea, TODO.
SPLIT_PATTERN = r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,2}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""
SPLIT_PATTERN = (
r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,2}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""
)
# -----------------------------------------------------------------------------
# Generic GPT-4-style tokenizer based on HuggingFace Tokenizer
from tokenizers import Regex, decoders, pre_tokenizers
from tokenizers import Tokenizer as HFTokenizer
from tokenizers import pre_tokenizers, decoders, Regex
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
class HuggingFaceTokenizer:
"""Light wrapper around HuggingFace Tokenizer for some utilities"""
@ -59,11 +62,13 @@ class HuggingFaceTokenizer:
def train_from_iterator(cls, text_iterator, vocab_size):
# train from an iterator of text
# Configure the HuggingFace Tokenizer
tokenizer = HFTokenizer(BPE(
tokenizer = HFTokenizer(
BPE(
byte_fallback=True, # needed!
unk_token=None,
fuse_unk=False,
))
)
)
# Normalizer: None
tokenizer.normalizer = None
# Pre-tokenizer: GPT-4 style
@ -72,10 +77,12 @@ class HuggingFaceTokenizer:
# very small models and smaller vocab sizes, because it is a little bit wasteful in the token space.
# (but I haven't validated this! TODO)
gpt4_split_regex = Regex(SPLIT_PATTERN) # huggingface demands that you wrap it in Regex!!
tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
tokenizer.pre_tokenizer = pre_tokenizers.Sequence(
[
pre_tokenizers.Split(pattern=gpt4_split_regex, behavior="isolated", invert=False),
pre_tokenizers.ByteLevel(add_prefix_space=False, use_regex=False)
])
pre_tokenizers.ByteLevel(add_prefix_space=False, use_regex=False),
]
)
# Decoder: ByteLevel (it pairs together with the ByteLevel pre-tokenizer)
tokenizer.decoder = decoders.ByteLevel()
# Post-processor: None
@ -146,12 +153,16 @@ class HuggingFaceTokenizer:
self.tokenizer.save(tokenizer_path)
print(f"Saved tokenizer to {tokenizer_path}")
# -----------------------------------------------------------------------------
# Tokenizer based on rustbpe + tiktoken combo
import pickle
import rustbpe
import tiktoken
import rustbpe
class RustBPETokenizer:
"""Light wrapper around tiktoken (for efficient inference) but train with rustbpe"""
@ -264,6 +275,7 @@ class RustBPETokenizer:
"""
# ids, masks that we will return and a helper function to help build them up.
ids, mask = [], []
def add_tokens(token_ids, mask_val):
if isinstance(token_ids, int):
token_ids = [token_ids]
@ -286,17 +298,21 @@ class RustBPETokenizer:
# fetch all the special tokens we need
bos = self.get_bos_token_id()
user_start, user_end = self.encode_special("<|user_start|>"), self.encode_special("<|user_end|>")
assistant_start, assistant_end = self.encode_special("<|assistant_start|>"), self.encode_special("<|assistant_end|>")
assistant_start, assistant_end = (
self.encode_special("<|assistant_start|>"),
self.encode_special("<|assistant_end|>"),
)
python_start, python_end = self.encode_special("<|python_start|>"), self.encode_special("<|python_end|>")
output_start, output_end = self.encode_special("<|output_start|>"), self.encode_special("<|output_end|>")
# now we can tokenize the conversation
add_tokens(bos, 0)
for i, message in enumerate(messages):
# some sanity checking here around assumptions, to prevent footguns
must_be_from = "user" if i % 2 == 0 else "assistant"
assert message["role"] == must_be_from, f"Message {i} is from {message['role']} but should be from {must_be_from}"
assert message["role"] == must_be_from, (
f"Message {i} is from {message['role']} but should be from {must_be_from}"
)
# content can be either a simple string or a list of parts (e.g. containing tool calls)
content = message["content"]
@ -376,23 +392,31 @@ class RustBPETokenizer:
ids.append(assistant_start)
return ids
# -----------------------------------------------------------------------------
# nanochat-specific convenience functions
def get_tokenizer():
from nanochat.common import get_base_dir
base_dir = get_base_dir()
tokenizer_dir = os.path.join(base_dir, "tokenizer")
# return HuggingFaceTokenizer.from_directory(tokenizer_dir)
return RustBPETokenizer.from_directory(tokenizer_dir)
def get_token_bytes(device="cpu"):
import torch
from nanochat.common import get_base_dir
base_dir = get_base_dir()
tokenizer_dir = os.path.join(base_dir, "tokenizer")
token_bytes_path = os.path.join(tokenizer_dir, "token_bytes.pt")
assert os.path.exists(token_bytes_path), f"Token bytes not found at {token_bytes_path}? It gets written by tok_train.py"
assert os.path.exists(token_bytes_path), (
f"Token bytes not found at {token_bytes_path}? It gets written by tok_train.py"
)
with open(token_bytes_path, "rb") as f:
token_bytes = torch.load(f, map_location=device)
return token_bytes

View File

@ -32,6 +32,7 @@ manifest-path = "rustbpe/Cargo.toml"
dev = [
"maturin>=1.9.4",
"pytest>=8.0.0",
"pre-commit>=3.8.0",
]
[tool.pytest.ini_options]
@ -75,3 +76,28 @@ conflicts = [
{ extra = "gpu" },
],
]
[tool.ruff]
target-version = "py310"
line-length = 120
fix = true
unsafe-fixes = true
[tool.ruff.lint]
select = [
"F", # Pyflakes (unused imports) - replaces autoflake
"I", # isort - replaces isort
"UP", # pyupgrade - replaces pyupgrade
]
[tool.ruff.lint.isort]
known-first-party = ["nanochat"]
[tool.ruff.format]
quote-style = "preserve"
[tool.codespell]
write-changes = true
interactive = 1
skip = "tests/*,dev/*,scripts/tok_eval.py,tasks/spellingbee.py"
ignore-words-list = "re-use,astroid"
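These sections are also honoured when the tools are run directly rather than through pre-commit; a sketch, assuming `ruff` and `codespell` are available in the active environment (pre-commit otherwise installs them into isolated hook environments):

```
# Ruff picks up [tool.ruff] from pyproject.toml automatically
ruff check .     # F/I/UP rules; applies fixes because `fix = true`
ruff format .    # line-length 120, existing quote style preserved

# Codespell reads [tool.codespell] (via tomli/tomllib) and honours
# write-changes, skip and ignore-words-list
codespell
```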

scripts/base_eval.py
View File

@ -9,23 +9,31 @@ torchrun --nproc_per_node=8 -m scripts.base_eval
The script will print the CORE metric to the console.
"""
import os
import csv
import time
import json
import yaml
import shutil
import os
import random
import zipfile
import shutil
import tempfile
import time
import zipfile
from contextlib import nullcontext
import torch
import yaml
from nanochat.common import compute_init, compute_cleanup, print0, get_base_dir, autodetect_device_type, download_file_with_lock
from nanochat.tokenizer import HuggingFaceTokenizer
from nanochat.checkpoint_manager import load_model
from nanochat.common import (
autodetect_device_type,
compute_cleanup,
compute_init,
download_file_with_lock,
get_base_dir,
print0,
)
from nanochat.core_eval import evaluate_task
from nanochat.tokenizer import HuggingFaceTokenizer
# -----------------------------------------------------------------------------
# nanochat specific function dealing with I/O etc.
@ -33,6 +41,7 @@ from nanochat.core_eval import evaluate_task
# ~162MB of data needed to evaluate the CORE metric
EVAL_BUNDLE_URL = "https://karpathy-public.s3.us-west-2.amazonaws.com/eval_bundle.zip"
def place_eval_bundle(file_path):
# here file_path is the path to the eval_bundle.zip file
# we need to unzip it and place it in the base directory
@ -45,6 +54,7 @@ def place_eval_bundle(file_path):
shutil.move(extracted_bundle_dir, eval_bundle_dir)
print0(f"Placed eval_bundle directory at {eval_bundle_dir}")
def evaluate_model(model, tokenizer, device, max_per_task=-1):
"""
Evaluate a base model on the CORE benchmark.
@ -59,13 +69,13 @@ def evaluate_model(model, tokenizer, device, max_per_task=-1):
config_path = os.path.join(eval_bundle_dir, "core.yaml")
data_base_path = os.path.join(eval_bundle_dir, "eval_data")
eval_meta_data = os.path.join(eval_bundle_dir, "eval_meta_data.csv")
with open(config_path, 'r', encoding='utf-8') as f:
with open(config_path, encoding='utf-8') as f:
config = yaml.safe_load(f)
tasks = config['icl_tasks']
# Load random baseline values from eval metadata
random_baselines = {}
with open(eval_meta_data, 'r', encoding='utf-8') as f:
with open(eval_meta_data, encoding='utf-8') as f:
reader = csv.DictReader(f)
for row in reader:
task_name = row['Eval Task']
@ -82,13 +92,13 @@ def evaluate_model(model, tokenizer, device, max_per_task=-1):
'task_type': task['icl_task_type'],
'dataset_uri': task['dataset_uri'],
'num_fewshot': task['num_fewshot'][0],
'continuation_delimiter': task.get('continuation_delimiter', ' ')
'continuation_delimiter': task.get('continuation_delimiter', ' '),
}
print0(f"Evaluating: {label} ({task_meta['num_fewshot']}-shot, type: {task_meta['task_type']})... ", end='')
# Load data for this task
data_path = os.path.join(data_base_path, task_meta['dataset_uri'])
with open(data_path, 'r', encoding='utf-8') as f:
with open(data_path, encoding='utf-8') as f:
data = [json.loads(line.strip()) for line in f]
# shuffle the data because in many cases it appears ordered but we want
@ -109,18 +119,17 @@ def evaluate_model(model, tokenizer, device, max_per_task=-1):
print0(f"accuracy: {accuracy:.4f} | centered: {centered_result:.4f} | time: {end_time - start_time:.2f}s")
core_metric = sum(centered_results.values()) / len(centered_results)
out = {
"results": results,
"centered_results": centered_results,
"core_metric": core_metric
}
out = {"results": results, "centered_results": centered_results, "core_metric": core_metric}
return out
# -----------------------------------------------------------------------------
# HuggingFace loading utilities and light wrappers for a model
class ModelWrapper:
"""Lightweight wrapper for a HuggingFace model"""
def __init__(self, model, max_seq_len=None):
self.model = model
self.max_seq_len = max_seq_len
@ -130,10 +139,12 @@ class ModelWrapper:
logits = outputs.logits
return logits
def load_hf_model(hf_path: str, device):
print0(f"Loading model from: {hf_path}")
# Load the model
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(hf_path)
model.to(device)
model.eval()
@ -143,9 +154,11 @@ def load_hf_model(hf_path: str, device):
tokenizer = HuggingFaceTokenizer.from_pretrained(hf_path)
return model, tokenizer
# -----------------------------------------------------------------------------
def main():
import argparse
parser = argparse.ArgumentParser()
parser.add_argument('--hf-path', type=str, default=None, help='HuggingFace model path to evaluate')
parser.add_argument('--max-per-task', type=int, default=-1, help='Max examples per task to evaluate (-1 = disable)')
@ -154,7 +167,9 @@ def main():
# distributed / precision setup
device_type = autodetect_device_type()
ddp, ddp_rank, ddp_local_rank, ddp_world_size, device = compute_init(device_type)
autocast_ctx = torch.amp.autocast(device_type=device_type, dtype=torch.bfloat16) if device_type == "cuda" else nullcontext()
autocast_ctx = (
torch.amp.autocast(device_type=device_type, dtype=torch.bfloat16) if device_type == "cuda" else nullcontext()
)
# Load model and tokenizer from command line or from file system
if args.hf_path is not None:
@ -190,23 +205,28 @@ def main():
f.write(f"{label:<35}, {results[label]:<10.6f}, {centered_results[label]:<10.6f}\n")
f.write(f"{'CORE':<35}, {'':<10}, {core_metric:<10.6f}\n")
# Print the content of the csv file to console too
print0("="*80)
print0("=" * 80)
print0(f"Model: {model_name}")
print0("="*80)
with open(output_csv_path, 'r', encoding='utf-8') as f:
print0("=" * 80)
with open(output_csv_path, encoding='utf-8') as f:
print0(f.read())
# Log to report
from nanochat.report import get_report
get_report().log(section="Base model evaluation", data=[
get_report().log(
section="Base model evaluation",
data=[
{
"Model": model_name,
"CORE metric": core_metric,
},
centered_results, # the full table
])
],
)
compute_cleanup()
if __name__ == "__main__":
main()
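
The CORE number printed here is the mean of per-task centered accuracies, with random baselines read from `eval_meta_data.csv`. A small worked example, assuming the usual `(accuracy - baseline) / (1 - baseline)` centering (the exact expression sits in the part of `evaluate_model` that this hunk does not show):

```
# Hypothetical task accuracies and random baselines:
def center(accuracy, random_baseline):
    return (accuracy - random_baseline) / (1.0 - random_baseline)

centered_results = {"taskA": center(0.40, 0.25), "taskB": center(0.55, 0.25)}  # 0.2, 0.4
core_metric = sum(centered_results.values()) / len(centered_results)
print(f"{core_metric:.4f}")  # 0.3000 -- 0 means chance level, 1 means perfect
```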

scripts/base_loss.py
View File

@ -6,19 +6,22 @@ Loads a checkpoint, and:
Example run as:
torchrun --standalone --nproc_per_node=8 -m scripts.base_loss
"""
import os
from contextlib import nullcontext
import torch
from nanochat.checkpoint_manager import load_model
from nanochat.common import compute_init, print0, compute_cleanup, autodetect_device_type
from nanochat.common import autodetect_device_type, compute_cleanup, compute_init, print0
from nanochat.dataloader import tokenizing_distributed_data_loader
from nanochat.tokenizer import get_token_bytes
from nanochat.loss_eval import evaluate_bpb
from nanochat.engine import Engine
from nanochat.loss_eval import evaluate_bpb
from nanochat.tokenizer import get_token_bytes
# Configuration
device_batch_size = 32
split_tokens = 20*524288 # number of tokens to evaluate per split
split_tokens = 20 * 524288 # number of tokens to evaluate per split
model_tag = None # optional model tag for the output directory name
model_step = None # optional model step for the output directory name
device_type = "" # cuda|cpu|mps (empty => autodetect)
@ -29,7 +32,9 @@ device_type = autodetect_device_type() if device_type == "" else device_type
ddp, ddp_rank, ddp_local_rank, ddp_world_size, device = compute_init(device_type)
model, tokenizer, meta = load_model("base", device, phase="eval", model_tag=model_tag, step=model_step)
sequence_len = meta["model_config"]["sequence_len"] # could be arbitrary really
autocast_ctx = torch.amp.autocast(device_type=device_type, dtype=torch.bfloat16) if device_type == "cuda" else nullcontext()
autocast_ctx = (
torch.amp.autocast(device_type=device_type, dtype=torch.bfloat16) if device_type == "cuda" else nullcontext()
)
# Evaluate the loss on each split
tokens_per_step = device_batch_size * sequence_len * ddp_world_size
@ -67,13 +72,17 @@ if ddp_rank == 0:
# Log to report
from nanochat.report import get_report
get_report().log(section="Base model loss", data=[
get_report().log(
section="Base model loss",
data=[
{
"train bpb": bpb_results["train"],
"val bpb": bpb_results["val"],
},
{f"sample {i}": sample for i, sample in enumerate(samples)},
])
],
)
# Cleanup
compute_cleanup()
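
`evaluate_bpb` reports bits per byte rather than raw cross-entropy, which makes the number comparable across tokenizers with different vocabularies; `get_token_bytes` supplies how many bytes each token spans. A hedged back-of-the-envelope of that conversion (numbers invented; the real computation aggregates over the `split_tokens = 20 * 524288 ≈ 10.5M` evaluation tokens per split):

```
import math

mean_loss_nats = 2.2        # hypothetical mean cross-entropy per token, in nats
mean_bytes_per_token = 4.0  # hypothetical average token length in bytes
bpb = mean_loss_nats / (math.log(2) * mean_bytes_per_token)
print(f"{bpb:.3f} bits per byte")  # ~0.793
```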

scripts/base_train.py
View File

@ -12,21 +12,31 @@ python -m scripts.base_train --depth=4 --max_seq_len=512 --device_batch_size=1 -
"""
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
import time
from contextlib import nullcontext
import wandb
import torch
import wandb
from nanochat.gpt import GPT, GPTConfig
from nanochat.checkpoint_manager import load_checkpoint, save_checkpoint
from nanochat.common import (
DummyWandb,
autodetect_device_type,
compute_cleanup,
compute_init,
get_base_dir,
print0,
print_banner,
)
from nanochat.dataloader import tokenizing_distributed_data_loader, tokenizing_distributed_data_loader_with_state
from nanochat.common import compute_init, compute_cleanup, print0, DummyWandb, print_banner, get_base_dir, autodetect_device_type
from nanochat.tokenizer import get_tokenizer, get_token_bytes
from nanochat.checkpoint_manager import save_checkpoint, load_checkpoint
from nanochat.loss_eval import evaluate_bpb
from nanochat.engine import Engine
from nanochat.gpt import GPT, GPTConfig
from nanochat.loss_eval import evaluate_bpb
from nanochat.tokenizer import get_token_bytes, get_tokenizer
from scripts.base_eval import evaluate_model
print_banner()
# -----------------------------------------------------------------------------
@ -39,8 +49,12 @@ depth = 20 # the depth of the Transformer model to train, rest of the kwargs are
max_seq_len = 2048 # max context length
# Training horizon. Only one of these 3 will be used, in this order of precedence.
num_iterations = -1 # explicit number of steps of the optimization (-1 = disable)
target_flops = -1.0 # calculate num_iterations to reach target_flops. Useful for scaling laws experiments (-1 = disable)
target_param_data_ratio = 20 # calculate num_iterations to maintain fixed data:param ratio (Chinchilla=20) (-1 = disable)
target_flops = (
-1.0
) # calculate num_iterations to reach target_flops. Useful for scaling laws experiments (-1 = disable)
target_param_data_ratio = (
20 # calculate num_iterations to maintain fixed data:param ratio (Chinchilla=20) (-1 = disable)
)
# Optimization
device_batch_size = 32 # per-device batch size (set to not OOM)
total_batch_size = 524288 # total desired batch size, in #tokens
@ -55,7 +69,7 @@ final_lr_frac = 0.0 # final LR is this fraction of the initial LR
resume_from_step = -1 # resume training from this step of the optimization (-1 = disable)
# Evaluation
eval_every = 250 # every how many steps to evaluate the model for val bpb
eval_tokens = 20*524288 # number of tokens to evaluate val loss on
eval_tokens = 20 * 524288 # number of tokens to evaluate val loss on
core_metric_every = 2000 # every how many steps to evaluate the core metric (-1 = disable)
core_metric_max_per_task = 500 # examples per task in estimating the core metric
sample_every = 2000 # every how many steps to sample from the model
@ -63,7 +77,7 @@ save_every = -1 # every how many steps to save model checkpoints (-1 = disable,
# Output
model_tag = "" # optionally override the model tag for the output checkpoint directory name
# now allow CLI to override the settings via the configurator lol
config_keys = [k for k,v in globals().items() if not k.startswith('_') and isinstance(v, (int, float, bool, str))]
config_keys = [k for k, v in globals().items() if not k.startswith('_') and isinstance(v, (int, float, bool, str))]
exec(open(os.path.join('nanochat', 'configurator.py')).read()) # overrides from command line or config file
user_config = {k: globals()[k] for k in config_keys} # will be useful for logging
# -----------------------------------------------------------------------------
@ -72,7 +86,9 @@ user_config = {k: globals()[k] for k in config_keys} # will be useful for loggin
device_type = autodetect_device_type() if device_type == "" else device_type
ddp, ddp_rank, ddp_local_rank, ddp_world_size, device = compute_init(device_type)
master_process = ddp_rank == 0 # this process will do logging, checkpointing etc.
autocast_ctx = torch.amp.autocast(device_type=device_type, dtype=torch.bfloat16) if device_type == "cuda" else nullcontext()
autocast_ctx = (
torch.amp.autocast(device_type=device_type, dtype=torch.bfloat16) if device_type == "cuda" else nullcontext()
)
synchronize = torch.cuda.synchronize if device_type == "cuda" else lambda: None
get_max_memory = torch.cuda.max_memory_allocated if device_type == "cuda" else lambda: 0
@ -110,7 +126,14 @@ print0(f"Total batch size {total_batch_size:,} => gradient accumulation steps: {
# Initialize the Model
# Create a new model with random weights
model_config_kwargs = dict(sequence_len=max_seq_len, vocab_size=vocab_size, n_layer=num_layers, n_head=num_heads, n_kv_head=num_kv_heads, n_embd=model_dim)
model_config_kwargs = dict(
sequence_len=max_seq_len,
vocab_size=vocab_size,
n_layer=num_layers,
n_head=num_heads,
n_kv_head=num_kv_heads,
n_embd=model_dim,
)
with torch.device("meta"):
model_config = GPTConfig(**model_config_kwargs)
model = GPT(model_config)
@ -124,7 +147,9 @@ checkpoint_dir = os.path.join(base_dir, "base_checkpoints", output_dirname)
resuming = resume_from_step != -1
if resuming:
print0(f"Resuming optimization from step {resume_from_step}")
model_data, optimizer_data, meta_data = load_checkpoint(checkpoint_dir, resume_from_step, device, load_optimizer=True, rank=ddp_rank)
model_data, optimizer_data, meta_data = load_checkpoint(
checkpoint_dir, resume_from_step, device, load_optimizer=True, rank=ddp_rank
)
model.load_state_dict(model_data, strict=True, assign=True)
del model_data # free up this memory after the copy
@ -157,7 +182,9 @@ print0(f"Total training FLOPs estimate: {num_flops_per_token * total_tokens:e}")
# -----------------------------------------------------------------------------
# Initialize the Optimizer (Muon for Linear layers, AdamW for embedding and lm_head)
optimizers = model.setup_optimizers(unembedding_lr=unembedding_lr, embedding_lr=embedding_lr, matrix_lr=matrix_lr, weight_decay=weight_decay)
optimizers = model.setup_optimizers(
unembedding_lr=unembedding_lr, embedding_lr=embedding_lr, matrix_lr=matrix_lr, weight_decay=weight_decay
)
adamw_optimizer, muon_optimizer = optimizers
if resuming:
@ -169,13 +196,18 @@ if resuming:
# Initialize the DataLoaders for train/val
tokens_dir = os.path.join(base_dir, "tokenized_data")
dataloader_resume_state_dict = None if not resuming else meta_data["dataloader_state_dict"]
train_loader = tokenizing_distributed_data_loader_with_state(device_batch_size, max_seq_len, split="train", device=device, resume_state_dict=dataloader_resume_state_dict)
build_val_loader = lambda: tokenizing_distributed_data_loader(device_batch_size, max_seq_len, split="val", device=device)
train_loader = tokenizing_distributed_data_loader_with_state(
device_batch_size, max_seq_len, split="train", device=device, resume_state_dict=dataloader_resume_state_dict
)
build_val_loader = lambda: tokenizing_distributed_data_loader(
device_batch_size, max_seq_len, split="val", device=device
)
x, y, dataloader_state_dict = next(train_loader) # kick off load of the very first batch of data
# -----------------------------------------------------------------------------
# Set up hyperparameter schedulers
# Learning rate scheduler
def get_lr_multiplier(it):
warmup_iters = round(warmup_ratio * num_iterations)
@ -188,12 +220,14 @@ def get_lr_multiplier(it):
progress = (num_iterations - it) / warmdown_iters
return progress * 1.0 + (1 - progress) * final_lr_frac
# Momentum scheduler for Muon optimizer
def get_muon_momentum(it):
frac = min(it / 300, 1)
momentum = (1 - frac) * 0.85 + frac * 0.95
return momentum
# -----------------------------------------------------------------------------
# Loop state (variables updated by the training loop)
@ -225,12 +259,14 @@ while True:
print0(f"Step {step:05d} | Validation bpb: {val_bpb:.4f}")
if val_bpb < min_val_bpb:
min_val_bpb = val_bpb
wandb_run.log({
wandb_run.log(
{
"step": step,
"total_training_flops": flops_so_far,
"total_training_time": total_training_time,
"val/bpb": val_bpb,
})
}
)
model.train()
# once in a while: estimate the CORE metric (all ranks participate)
@ -241,12 +277,14 @@ while True:
with autocast_ctx:
results = evaluate_model(orig_model, tokenizer, device, max_per_task=core_metric_max_per_task)
print0(f"Step {step:05d} | CORE metric: {results['core_metric']:.4f}")
wandb_run.log({
wandb_run.log(
{
"step": step,
"total_training_flops": flops_so_far,
"core_metric": results["core_metric"],
"centered_results": results["centered_results"],
})
}
)
model.train()
# once in a while: sample from the model (only on master process)
@ -309,7 +347,9 @@ while True:
train_loss = loss.detach() # for logging
loss = loss / grad_accum_steps # each .backward() is a grad sum => normalize loss here
loss.backward()
x, y, dataloader_state_dict = next(train_loader) # prefetch the next batch while the GPU is busy with forward/backward
x, y, dataloader_state_dict = next(
train_loader
) # prefetch the next batch while the GPU is busy with forward/backward
# gradient clipping
grad_clip_enabled = grad_clip > 0.0
if grad_clip_enabled:
@ -334,7 +374,7 @@ while True:
# logging
ema_beta = 0.9 # EMA decay factor for some smoothing just for nicer logging
smooth_train_loss = ema_beta * smooth_train_loss + (1 - ema_beta) * train_loss.item() # EMA the training loss
debiased_smooth_loss = smooth_train_loss / (1 - ema_beta**(step + 1)) # debias the EMA
debiased_smooth_loss = smooth_train_loss / (1 - ema_beta ** (step + 1)) # debias the EMA
pct_done = 100 * step / num_iterations
tok_per_sec = int(total_batch_size / dt)
flops_per_sec = num_flops_per_token * total_batch_size / dt
@ -343,7 +383,9 @@ while True:
if step > 10:
total_training_time += dt # only count the time after the first 10 steps
print_grad_norm = f" grad norm: {grad_norm:.4f} |" if grad_clip_enabled else ""
print0(f"step {step:05d}/{num_iterations:05d} ({pct_done:.2f}%) | loss: {debiased_smooth_loss:.6f} |{print_grad_norm} lrm: {lrm:.2f} | dt: {dt * 1000:.2f}ms | tok/sec: {tok_per_sec:,} | mfu: {mfu:.2f} | total time: {total_training_time/60:.2f}m")
print0(
f"step {step:05d}/{num_iterations:05d} ({pct_done:.2f}%) | loss: {debiased_smooth_loss:.6f} |{print_grad_norm} lrm: {lrm:.2f} | dt: {dt * 1000:.2f}ms | tok/sec: {tok_per_sec:,} | mfu: {mfu:.2f} | total time: {total_training_time / 60:.2f}m"
)
if step % 100 == 0:
log_data = {
"step": step,
@ -364,12 +406,15 @@ while True:
# print a few more stats
print0(f"Peak memory usage: {get_max_memory() / 1024 / 1024:.2f}MiB")
print0(f"Total training time: {total_training_time/60:.2f}m")
print0(f"Total training time: {total_training_time / 60:.2f}m")
print0(f"Minimum validation bpb: {min_val_bpb:.4f}")
# Log to report
from nanochat.report import get_report
get_report().log(section="Base model training", data=[
get_report().log(
section="Base model training",
data=[
user_config, # CLI args
{ # stats about the training setup
"Number of parameters": num_params,
@ -388,10 +433,11 @@ get_report().log(section="Base model training", data=[
"CORE metric estimate": results.get("core_metric", None),
"MFU %": f"{mfu:.2f}%",
"Total training flops": f"{flops_so_far:e}",
"Total training time": f"{total_training_time/60:.2f}m",
"Total training time": f"{total_training_time / 60:.2f}m",
"Peak memory usage": f"{get_max_memory() / 1024 / 1024:.2f}MiB",
}
])
},
],
)
# cleanup
wandb_run.finish() # wandb run finish
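
One detail worth unpacking from the logging hunk above: `smooth_train_loss` starts at 0, so the raw EMA is biased low early in training, and dividing by `1 - ema_beta ** (step + 1)` removes that bias. A tiny numeric check with invented losses:

```
ema_beta = 0.9
smooth = 0.0
for step, loss in enumerate([4.0, 3.9, 3.8]):  # pretend per-step training losses
    smooth = ema_beta * smooth + (1 - ema_beta) * loss
    debiased = smooth / (1 - ema_beta ** (step + 1))
    print(step, round(smooth, 4), round(debiased, 4))
# step 0: smooth=0.400 -> debiased=4.0000  (the raw EMA alone would report 0.4)
# step 1: smooth=0.750 -> debiased=3.9474
# step 2: smooth=1.055 -> debiased=3.8930
```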

scripts/chat_cli.py
View File

@ -4,12 +4,15 @@ New and upgraded chat mode because a lot of the code has changed since the last
Intended to be run single GPU only atm:
python -m scripts.chat_cli -i mid
"""
import argparse
import torch
from nanochat.common import compute_init, autodetect_device_type
from contextlib import nullcontext
from nanochat.engine import Engine
import torch
from nanochat.checkpoint_manager import load_model
from nanochat.common import autodetect_device_type, compute_init
from nanochat.engine import Engine
parser = argparse.ArgumentParser(description='Chat with the model')
parser.add_argument('-i', '--source', type=str, default="sft", help="Source of the model: sft|mid|rl")
@ -18,7 +21,13 @@ parser.add_argument('-s', '--step', type=int, default=None, help='Step to load')
parser.add_argument('-p', '--prompt', type=str, default='', help='Prompt the model, get a single response back')
parser.add_argument('-t', '--temperature', type=float, default=0.6, help='Temperature for generation')
parser.add_argument('-k', '--top-k', type=int, default=50, help='Top-k sampling parameter')
parser.add_argument('--device-type', type=str, default='', choices=['cuda', 'cpu', 'mps'], help='Device type for evaluation: cuda|cpu|mps. empty => autodetect')
parser.add_argument(
'--device-type',
type=str,
default='',
choices=['cuda', 'cpu', 'mps'],
help='Device type for evaluation: cuda|cpu|mps. empty => autodetect',
)
parser.add_argument('-d', '--dtype', type=str, default='bfloat16', choices=['float32', 'bfloat16'])
args = parser.parse_args()
@ -33,7 +42,10 @@ model, tokenizer, meta = load_model(args.source, device, phase="eval", model_tag
# Special tokens for the chat state machine
bos = tokenizer.get_bos_token_id()
user_start, user_end = tokenizer.encode_special("<|user_start|>"), tokenizer.encode_special("<|user_end|>")
assistant_start, assistant_end = tokenizer.encode_special("<|assistant_start|>"), tokenizer.encode_special("<|assistant_end|>")
assistant_start, assistant_end = (
tokenizer.encode_special("<|assistant_start|>"),
tokenizer.encode_special("<|assistant_end|>"),
)
# Create Engine for efficient generation
engine = Engine(model, tokenizer)
@ -47,7 +59,6 @@ print("-" * 50)
conversation_tokens = [bos]
while True:
if args.prompt:
# Get the prompt from the launch command
user_input = args.prompt
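
For reference, the chat state machine driven by this CLI is just a flat token stream: each user turn is wrapped in its special tokens and the assistant is prompted by appending `<|assistant_start|>`, after which the Engine samples until `<|assistant_end|>` or the token budget. A self-contained toy version of that layout (integer ids invented; real ones come from the tokenizer):

```
bos, user_start, user_end, assistant_start = 0, 10, 11, 12  # made-up ids
encoded_user_input = [101, 102, 103]                        # stand-in for tokenizer.encode(user_input)

conversation_tokens = [bos]
conversation_tokens += [user_start] + encoded_user_input + [user_end]
conversation_tokens += [assistant_start]  # generation continues from here
print(conversation_tokens)                # [0, 10, 101, 102, 103, 11, 12]
```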

scripts/chat_eval.py
View File

@ -9,27 +9,28 @@ torchrun --nproc_per_node=8 -m scripts.chat_eval -- -a ARC-Easy
"""
import argparse
from functools import partial
from contextlib import nullcontext
from functools import partial
import torch
import torch.distributed as dist
from nanochat.common import compute_init, compute_cleanup, get_dist_info, print0, autodetect_device_type
from nanochat.checkpoint_manager import load_model
from nanochat.common import autodetect_device_type, compute_cleanup, compute_init, get_dist_info, print0
from nanochat.engine import Engine
from tasks.humaneval import HumanEval
from tasks.mmlu import MMLU
from tasks.arc import ARC
from tasks.gsm8k import GSM8K
from tasks.humaneval import HumanEval
from tasks.mmlu import MMLU
from tasks.spellingbee import SpellingBee
# -----------------------------------------------------------------------------
# Generative evaluation loop (we go one problem at a time, sample, evaluate)
def run_generative_eval(task_object, tokenizer, model, engine, num_samples, max_new_tokens, temperature, top_k, max_problems=None):
def run_generative_eval(
task_object, tokenizer, model, engine, num_samples, max_new_tokens, temperature, top_k, max_problems=None
):
ddp, ddp_rank, ddp_local_rank, ddp_world_size = get_dist_info()
device = model.get_device()
@ -62,7 +63,7 @@ def run_generative_eval(task_object, tokenizer, model, engine, num_samples, max_
num_passed += int(passed)
# Logging (overwrite the same line in the console)
print(f"\r\033[KRank {ddp_rank} | {num_passed}/{total} ({100*num_passed/total:.2f}%)", end='', flush=True)
print(f"\r\033[KRank {ddp_rank} | {num_passed}/{total} ({100 * num_passed / total:.2f}%)", end='', flush=True)
# Finish the in-place progress line with a newline before final summary
print()
@ -77,18 +78,19 @@ def run_generative_eval(task_object, tokenizer, model, engine, num_samples, max_
total = total_tensor.item()
print0("=" * 50)
print0(f"Final: {num_passed}/{total} ({100*num_passed/total:.2f}%)")
print0(f"Final: {num_passed}/{total} ({100 * num_passed / total:.2f}%)")
# Return the accuracy
return num_passed/total
return num_passed / total
# -----------------------------------------------------------------------------
# Categorical evaluation loop
# A lot easier because we don't have to sample. Therefore, we can actually go
# batches at a time and just check the logits for correct answer choices.
def run_categorical_eval(task_object, tokenizer, model, batch_size, max_problems=None):
def run_categorical_eval(task_object, tokenizer, model, batch_size, max_problems=None):
ddp, ddp_rank, ddp_local_rank, ddp_world_size = get_dist_info()
device = model.get_device()
bos = tokenizer.get_bos_token_id() # use BOS as pad token is ok, these positions are ignored
@ -106,9 +108,13 @@ def run_categorical_eval(task_object, tokenizer, model, batch_size, max_problems
# Prepare the batch of problems. They might all be of different length, so we pad/collate them.
conversations = [task_object[ii] for ii in range(i0, i1)]
prompt_ids = [tokenizer.render_for_completion(conversation) for conversation in conversations] # TODO: remake the way this works
prompt_ids = [
tokenizer.render_for_completion(conversation) for conversation in conversations
] # TODO: remake the way this works
max_length = max(len(ids) for ids in prompt_ids)
answer_time_positions = [len(ids) - 1 for ids in prompt_ids] # where the last token is (and the predicted answer)
answer_time_positions = [
len(ids) - 1 for ids in prompt_ids
] # where the last token is (and the predicted answer)
padded_prompt_ids = [ids + [bos] * (max_length - len(ids)) for ids in prompt_ids]
prompt_ids = torch.tensor(padded_prompt_ids, dtype=torch.long, device=device)
@ -150,15 +156,26 @@ def run_categorical_eval(task_object, tokenizer, model, batch_size, max_problems
num_passed = num_passed_tensor.item()
total = total_tensor.item()
average = num_passed/total
print0(f"Final: {num_passed}/{total} ({100*average:.2f}%)")
average = num_passed / total
print0(f"Final: {num_passed}/{total} ({100 * average:.2f}%)")
return average
# -----------------------------------------------------------------------------
def run_chat_eval(task_name, model, tokenizer, engine,
batch_size=1, num_samples=1, max_new_tokens=512, temperature=0.0, top_k=50,
max_problems=None):
def run_chat_eval(
task_name,
model,
tokenizer,
engine,
batch_size=1,
num_samples=1,
max_new_tokens=512,
temperature=0.0,
top_k=50,
max_problems=None,
):
# Create the evaluation object
task_module = {
'HumanEval': HumanEval,
@ -171,20 +188,36 @@ def run_chat_eval(task_name, model, tokenizer, engine,
task_object = task_module()
# Run the evaluation
if task_object.eval_type == 'generative':
acc = run_generative_eval(task_object, tokenizer, model, engine, num_samples, max_new_tokens, temperature, top_k, max_problems=max_problems)
acc = run_generative_eval(
task_object,
tokenizer,
model,
engine,
num_samples,
max_new_tokens,
temperature,
top_k,
max_problems=max_problems,
)
elif task_object.eval_type == 'categorical':
acc = run_categorical_eval(task_object, tokenizer, model, batch_size, max_problems=max_problems)
else:
raise ValueError(f"Unsupported task evaluation type: {task_object.eval_type}")
return acc
# -----------------------------------------------------------------------------
if __name__ == "__main__":
# Parse command-line arguments
parser = argparse.ArgumentParser()
parser.add_argument('-i', '--source', type=str, required=True, help="Source of the model: sft|mid|rl")
parser.add_argument('-a', '--task-name', type=str, default=None, help="Task name. Default = all tasks. Use | to split multiple tasks.")
parser.add_argument(
'-a',
'--task-name',
type=str,
default=None,
help="Task name. Default = all tasks. Use | to split multiple tasks.",
)
parser.add_argument('-d', '--dtype', type=str, default='bfloat16', choices=['float32', 'bfloat16'])
parser.add_argument('-t', '--temperature', type=float, default=0.0)
parser.add_argument('-m', '--max-new-tokens', type=int, default=512)
@ -194,13 +227,21 @@ if __name__ == "__main__":
parser.add_argument('-g', '--model-tag', type=str, default=None, help='Model tag to load')
parser.add_argument('-s', '--step', type=int, default=None, help='Step to load')
parser.add_argument('-x', '--max-problems', type=int, default=None, help='Max problems to evaluate')
parser.add_argument('--device-type', type=str, default='', choices=['cuda', 'cpu', 'mps'], help='Device type for evaluation: cuda|cpu|mps. empty => autodetect')
parser.add_argument(
'--device-type',
type=str,
default='',
choices=['cuda', 'cpu', 'mps'],
help='Device type for evaluation: cuda|cpu|mps. empty => autodetect',
)
args = parser.parse_args()
device_type = autodetect_device_type() if args.device_type == "" else args.device_type
ddp, ddp_rank, ddp_local_rank, ddp_world_size, device = compute_init(device_type)
ptdtype = torch.float32 if args.dtype == 'float32' else torch.bfloat16
autocast_ctx = torch.amp.autocast(device_type=device_type, dtype=ptdtype) if device_type == "cuda" else nullcontext()
autocast_ctx = (
torch.amp.autocast(device_type=device_type, dtype=ptdtype) if device_type == "cuda" else nullcontext()
)
model, tokenizer, meta = load_model(args.source, device, phase="eval", model_tag=args.model_tag, step=args.step)
engine = Engine(model, tokenizer)
@ -223,7 +264,9 @@ if __name__ == "__main__":
with autocast_ctx:
acc = run_chat_eval(
task_name,
model, tokenizer, engine,
model,
tokenizer,
engine,
batch_size=args.batch_size,
num_samples=args.num_samples,
max_new_tokens=args.max_new_tokens,
@ -236,6 +279,7 @@ if __name__ == "__main__":
# Log to report
from nanochat.report import get_report
all_tasks_were_evaluated = all(task_name in results for task_name in all_tasks)
# calculate the ChatCORE metric if we can (similar to CORE, it's the mean centered accuracy)
# this way, ChatCORE ranges from 0 (at random baseline) to 1 (peak performance)
@ -248,10 +292,13 @@ if __name__ == "__main__":
centered_mean += centered_acc
chatcore_metric = centered_mean / len(results)
chatcore_metric_dict = {"ChatCORE metric": chatcore_metric}
get_report().log(section="Chat evaluation " + args.source, data=[
get_report().log(
section="Chat evaluation " + args.source,
data=[
vars(args), # CLI args
results,
chatcore_metric_dict,
])
],
)
compute_cleanup()
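
The categorical path above never samples: prompts of different lengths are right-padded with BOS (those positions are ignored) and the answer is read from the logits at each prompt's last real token. A toy version of that collation, with invented ids:

```
bos = 0
prompt_ids = [[1, 7, 9], [1, 4]]                              # two rendered prompts
max_length = max(len(ids) for ids in prompt_ids)              # 3
answer_time_positions = [len(ids) - 1 for ids in prompt_ids]  # [2, 1] -> read logits here
padded_prompt_ids = [ids + [bos] * (max_length - len(ids)) for ids in prompt_ids]
print(padded_prompt_ids)                                      # [[1, 7, 9], [1, 4, 0]]
```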

scripts/chat_rl.py
View File

@ -16,15 +16,15 @@ python -m scripts.chat_rl
torchrun --standalone --nproc_per_node=8 -m scripts.chat_rl -- --run=default
"""
import os
import itertools
import re
import wandb
import os
import torch
import torch.distributed as dist
import wandb
from nanochat.common import compute_init, compute_cleanup, print0, get_base_dir, DummyWandb
from nanochat.checkpoint_manager import save_checkpoint, load_model
from nanochat.checkpoint_manager import load_model, save_checkpoint
from nanochat.common import DummyWandb, compute_cleanup, compute_init, get_base_dir, print0
from nanochat.engine import Engine
from tasks.gsm8k import GSM8K
@ -48,7 +48,7 @@ save_every = 60 # every how many steps to save the model
eval_every = 60 # every how many steps to evaluate the model for val pass@k
eval_examples = 400 # number of examples used for evaluating pass@k
# now allow CLI to override the settings via the configurator lol
config_keys = [k for k,v in globals().items() if not k.startswith('_') and isinstance(v, (int, float, bool, str))]
config_keys = [k for k, v in globals().items() if not k.startswith('_') and isinstance(v, (int, float, bool, str))]
exec(open(os.path.join('nanochat', 'configurator.py')).read()) # overrides from command line or config file
user_config = {k: globals()[k] for k in config_keys} # will be useful for logging
# -----------------------------------------------------------------------------
@ -75,12 +75,16 @@ val_task = GSM8K(subset="main", split="test")
num_steps = (len(train_task) // examples_per_step) * num_epochs
print0(f"Calculated number of steps: {num_steps}")
@torch.no_grad()
def get_batch():
assistant_end = tokenizer.encode_special("<|assistant_end|>") # ok to use this token, it's only for padding and isn't used in the loss.
rank_indices = range(ddp_rank, len(train_task), ddp_world_size) # each rank is responsible for different examples in the training data
assistant_end = tokenizer.encode_special(
"<|assistant_end|>"
) # ok to use this token, it's only for padding and isn't used in the loss.
rank_indices = range(
ddp_rank, len(train_task), ddp_world_size
) # each rank is responsible for different examples in the training data
for example_idx in itertools.cycle(rank_indices):
# First get the full conversation of both user and assistant messages
conversation = train_task[example_idx]
@ -121,7 +125,9 @@ def get_batch():
# Pad the sequences so that their lengths (in time) match
max_length = max(len(seq) for seq in generated_token_sequences)
padded_generated_token_sequences = [seq + [assistant_end] * (max_length - len(seq)) for seq in generated_token_sequences]
padded_generated_token_sequences = [
seq + [assistant_end] * (max_length - len(seq)) for seq in generated_token_sequences
]
padded_masks = [mask + [0] * (max_length - len(mask)) for mask in masks]
# Stack up the sequences and masks into PyTorch tensors
ids = torch.tensor(padded_generated_token_sequences, dtype=torch.long, device=device)
@ -139,14 +145,11 @@ def get_batch():
# yield inputs/targets as (B, T) of ids and rewards as (B,) of floats
yield generated_token_sequences, inputs, targets, rewards, advantages
# -----------------------------------------------------------------------------
# Simple evaluation loop for GSM8K pass@k
def run_gsm8k_eval(task, tokenizer, engine,
max_examples=None,
num_samples=1,
max_completion_tokens=256,
temperature=0.0,
top_k=50
def run_gsm8k_eval(
task, tokenizer, engine, max_examples=None, num_samples=1, max_completion_tokens=256, temperature=0.0, top_k=50
):
"""
Evaluates GSM8K task and returns a list of records of evaluation outcomes.
@ -162,11 +165,7 @@ def run_gsm8k_eval(task, tokenizer, engine,
# Generate k samples using batched generation inside the Engine
assert num_samples <= device_batch_size # usually this is true. we can add a loop if not...
generated_token_sequences, masks = engine.generate_batch(
tokens,
num_samples=num_samples,
max_tokens=max_completion_tokens,
temperature=temperature,
top_k=top_k
tokens, num_samples=num_samples, max_tokens=max_completion_tokens, temperature=temperature, top_k=top_k
)
# Check each sample for correctness
outcomes = []
@ -174,9 +173,7 @@ def run_gsm8k_eval(task, tokenizer, engine,
generated_tokens = sample_tokens[prefix_length:]
generated_text = tokenizer.decode(generated_tokens)
is_correct = task.evaluate(conversation, generated_text)
outcomes.append({
"is_correct": is_correct
})
outcomes.append({"is_correct": is_correct})
# A bit bloated because I wanted to do more complex logging at one point.
record = {
"idx": idx,
@ -184,6 +181,7 @@ def run_gsm8k_eval(task, tokenizer, engine,
}
yield record
# -----------------------------------------------------------------------------
# Training loop
@ -201,11 +199,13 @@ for opt in optimizers:
group["lr"] = group["lr"] * init_lr_frac
group["initial_lr"] = group["lr"] # save the initial learning so we can decay easily later
# Learning rate scheduler: simple rampdown to zero over num_steps
def get_lr_multiplier(it):
lrm = 1.0 - it / num_steps
return lrm
# Calculate the number of examples each rank handles to achieve the desired examples_per_step
print0(f"Total sequences per step: {examples_per_step * num_samples}") # total batch size in sequences/step
assert examples_per_step % ddp_world_size == 0, "Desired examples per step must be divisible by the number of ranks"
@ -215,13 +215,14 @@ print0(f"Calculated examples per rank: {examples_per_rank}")
# Kick off the training loop
batch_iterator = get_batch()
for step in range(num_steps):
# Evaluate the model once in a while and log to wandb
if step % eval_every == 0:
model.eval()
passk = torch.zeros(device_batch_size, device=device) # pass@k for k=1..device_batch_size
with autocast_ctx:
records_iter = run_gsm8k_eval(val_task, tokenizer, engine, num_samples=device_batch_size, max_examples=eval_examples, temperature=1.0)
records_iter = run_gsm8k_eval(
val_task, tokenizer, engine, num_samples=device_batch_size, max_examples=eval_examples, temperature=1.0
)
records = list(records_iter) # collect all records
for k in range(1, device_batch_size + 1):
passk[k - 1] = sum(any(o["is_correct"] for o in r["outcomes"][:k]) for r in records)
@ -233,10 +234,12 @@ for step in range(num_steps):
print_passk = [f"Pass@{k}: {passk[k - 1].item():.4f}" for k in range(1, device_batch_size + 1)]
print0(f"Step {step} | {', '.join(print_passk)}")
log_passk = {f"pass@{k}": passk[k - 1].item() for k in range(1, device_batch_size + 1)}
wandb_run.log({
wandb_run.log(
{
"step": step,
**log_passk,
})
}
)
# Forward/Backward on rollouts over multiple examples in the dataset
rewards_list = []
@ -268,7 +271,9 @@ for step in range(num_steps):
# Finally, formulate the loss that we want to minimize (instead of objective we wish to maximize)
loss = -pg_obj
loss.backward()
print0(f"Step {step}/{num_steps} | Example step {example_step} | Pass {pass_idx} | loss: {loss.item():.6f} | Average reward: {rewards.mean().item()}")
print0(
f"Step {step}/{num_steps} | Example step {example_step} | Pass {pass_idx} | loss: {loss.item():.6f} | Average reward: {rewards.mean().item()}"
)
# For logging
rewards_list.append(rewards_all.mean().item())
sequence_lengths.extend(len(seq) for seq in sequences_all)
@ -283,12 +288,16 @@ for step in range(num_steps):
dist.all_reduce(mean_sequence_length_tensor, op=dist.ReduceOp.AVG)
mean_reward = mean_reward_tensor.item()
mean_sequence_length = mean_sequence_length_tensor.item()
print0(f"Step {step}/{num_steps} | Average reward: {mean_reward} | Average sequence length: {mean_sequence_length:.2f}")
wandb_run.log({
print0(
f"Step {step}/{num_steps} | Average reward: {mean_reward} | Average sequence length: {mean_sequence_length:.2f}"
)
wandb_run.log(
{
"step": step,
"reward": mean_reward,
"sequence_length": mean_sequence_length,
})
}
)
# Update the model parameters
lrm = get_lr_multiplier(step)
@ -298,10 +307,12 @@ for step in range(num_steps):
for opt in optimizers: # then step the optimizers
opt.step()
model.zero_grad(set_to_none=True)
wandb_run.log({
wandb_run.log(
{
"step": step,
"lrm": lrm,
})
}
)
# Master process saves the model once in a while. Skip first step. Save last step.
if master_process and ((step > 0 and step % save_every == 0) or step == num_steps - 1):
@ -317,15 +328,19 @@ for step in range(num_steps):
None, # note: we don't bother to save the optimizer state
{
"model_config": model_config_kwargs,
}
},
)
print(f"✅ Saved model checkpoint to {checkpoint_dir}")
# Log to report
from nanochat.report import get_report
get_report().log(section="Chat RL", data=[
get_report().log(
section="Chat RL",
data=[
user_config, # CLI args
])
],
)
wandb_run.finish() # wandb run finish
compute_cleanup()
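
The pass@k numbers logged above count a problem as solved if any of the first k samples is correct. The same bookkeeping on two invented records, outside the training loop:

```
records = [
    {"outcomes": [{"is_correct": False}, {"is_correct": True}]},   # solved on the 2nd sample
    {"outcomes": [{"is_correct": False}, {"is_correct": False}]},  # never solved
]
for k in (1, 2):
    passk = sum(any(o["is_correct"] for o in r["outcomes"][:k]) for r in records) / len(records)
    print(f"Pass@{k}: {passk:.4f}")  # Pass@1: 0.0000, Pass@2: 0.5000
```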

scripts/chat_sft.py
View File

@ -10,24 +10,24 @@ torchrun --standalone --nproc_per_node=8 -m scripts.chat_sft
"""
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
import wandb
import torch
import torch.distributed as dist
from contextlib import nullcontext
from nanochat.common import compute_init, compute_cleanup, get_base_dir, print0, DummyWandb, autodetect_device_type
from nanochat.checkpoint_manager import load_model
from nanochat.checkpoint_manager import save_checkpoint
import torch
import torch.distributed as dist
import wandb
from nanochat.checkpoint_manager import load_model, save_checkpoint
from nanochat.common import DummyWandb, autodetect_device_type, compute_cleanup, compute_init, get_base_dir, print0
from nanochat.engine import Engine
from scripts.chat_eval import run_chat_eval
from tasks.common import TaskMixture
from tasks.arc import ARC
from tasks.common import TaskMixture
from tasks.customjson import CustomJSON
from tasks.gsm8k import GSM8K
from tasks.smoltalk import SmolTalk
from tasks.customjson import CustomJSON
from tasks.spellingbee import SimpleSpelling, SpellingBee
# -----------------------------------------------------------------------------
@ -56,7 +56,7 @@ eval_steps = 100
eval_metrics_every = 200
eval_metrics_max_problems = 1024
# now allow CLI to override the settings via the configurator lol
config_keys = [k for k,v in globals().items() if not k.startswith('_') and isinstance(v, (int, float, bool, str))]
config_keys = [k for k, v in globals().items() if not k.startswith('_') and isinstance(v, (int, float, bool, str))]
exec(open(os.path.join('nanochat', 'configurator.py')).read()) # overrides from command line or config file
user_config = {k: globals()[k] for k in config_keys} # possibly useful for logging
# -----------------------------------------------------------------------------
@ -70,7 +70,11 @@ autocast_ctx = torch.amp.autocast(device_type=device_type, dtype=ptdtype) if dev
# wandb logging init
use_dummy_wandb = run == "dummy" or not master_process
wandb_run = DummyWandb() if use_dummy_wandb else wandb.init(project="nanochat-sft", name=run, config=user_config, save_code=True)
wandb_run = (
DummyWandb()
if use_dummy_wandb
else wandb.init(project="nanochat-sft", name=run, config=user_config, save_code=True)
)
# Load the model and tokenizer
model, tokenizer, meta = load_model(source, device, phase="train", model_tag=model_tag, step=step)
@ -81,7 +85,8 @@ engine = Engine(model, tokenizer) # will be used for inline model evaluation onl
# -----------------------------------------------------------------------------
# Task data mixture we'll train on
identity_conversations_filepath = os.path.join(get_base_dir(), "identity_conversations.jsonl")
train_ds = TaskMixture([
train_ds = TaskMixture(
[
ARC(subset="ARC-Easy", split="train"), # 2.3K rows
ARC(subset="ARC-Challenge", split="train"), # 1.1K rows
GSM8K(subset="main", split="train"), # 8K rows
@ -89,14 +94,19 @@ train_ds = TaskMixture([
CustomJSON(filepath=identity_conversations_filepath), # 1K rows of synthetic identity conversations
SimpleSpelling(size=300, split="train"), # 300 rows of Simple Spelling (e.g. spell the word 'apple')
SpellingBee(size=300, split="train"), # 300 rows of Spelling Bee (e.g. how many 'r' are in 'strawberry'?)
]) # 2.3K + 1.1K + 8K + 10K + 1K + 0.3K + 0.3K = 23K rows
]
) # 2.3K + 1.1K + 8K + 10K + 1K + 0.3K + 0.3K = 23K rows
val_ds = SmolTalk(split="test") # general conversations, 24K rows (though we don't actually use all of it)
# -----------------------------------------------------------------------------
# DataLoader
def sft_data_generator(dataset, batch_size):
pad_token_id = tokenizer.encode_special("<|assistant_end|>") # use <|assistant_end|> as the pad token is ok, these positions are masked in the loss
pad_token_id = tokenizer.encode_special(
"<|assistant_end|>"
) # use <|assistant_end|> as the pad token is ok, these positions are masked in the loss
# prepares a list of tokenized conversations into a batch and yields
def collate_and_yield(batch):
nrows = len(batch)
@ -106,16 +116,17 @@ def sft_data_generator(dataset, batch_size):
for i, (ids, mask) in enumerate(batch):
n = len(ids)
ids_tensor = torch.tensor(ids, dtype=torch.long)
inputs[i, :n-1] = ids_tensor[:-1]
inputs[i, : n - 1] = ids_tensor[:-1]
# recall -1 is the ignore index, so mask out targets where mask is 0
row_targets = ids_tensor[1:]
# mask[1:] omits the mask for the BOS token, which is never a target atm so it's ok
mask_tensor = torch.tensor(mask[1:], dtype=torch.long)
row_targets[mask_tensor == 0] = -1 # mask out targets where mask is 0
targets[i, :n-1] = row_targets
targets[i, : n - 1] = row_targets
inputs = inputs.to(device) # move to device
targets = targets.to(device)
return inputs, targets
# iterates over the dataset in epochs, tokenizes
batch = []
while True:
@ -127,11 +138,14 @@ def sft_data_generator(dataset, batch_size):
yield collate_and_yield(batch)
batch = []
examples_per_step = device_batch_size * ddp_world_size
print0(f"Target examples per step: {target_examples_per_step}")
print0(f"Device batch size: {device_batch_size}")
print0(f"Examples per step is device_batch_size * ddp_world_size: {examples_per_step}")
assert target_examples_per_step % examples_per_step == 0, "Target examples per step must be divisible by examples per step"
assert target_examples_per_step % examples_per_step == 0, (
"Target examples per step must be divisible by examples per step"
)
grad_accum_steps = target_examples_per_step // examples_per_step
print0(f"=> Setting grad accum steps: {grad_accum_steps}")
@ -160,11 +174,13 @@ for opt in optimizers:
# -----------------------------------------------------------------------------
# Training loop
# Learning rate scheduler
def get_lr_multiplier(it):
lrm = 1.0 - it / num_iterations
return lrm
# Go!
step = 0
train_iter = iter(train_loader)
@ -186,10 +202,12 @@ for step in range(num_iterations):
dist.all_reduce(val_loss, op=dist.ReduceOp.AVG) # average over ranks
val_loss = val_loss.item()
print0(f"Step {step:05d} | Validation loss: {val_loss:.6f}")
wandb_run.log({
wandb_run.log(
{
"step": step,
"val_loss": val_loss,
})
}
)
model.train()
# evaluate accuracy of the multiple choice tasks (which are quick to run)
@ -198,14 +216,30 @@ for step in range(num_iterations):
metrics = {}
with torch.no_grad(), autocast_ctx:
# note that because these are inside no_grad, we can usually afford to at least ~2X the batch size
metrics["mmlu_acc"] = run_chat_eval("MMLU", model, tokenizer, engine, batch_size=device_batch_size*2, max_problems=eval_metrics_max_problems)
metrics["arc_easy_acc"] = run_chat_eval("ARC-Easy", model, tokenizer, engine, batch_size=device_batch_size*2, max_problems=eval_metrics_max_problems)
metrics["mmlu_acc"] = run_chat_eval(
"MMLU",
model,
tokenizer,
engine,
batch_size=device_batch_size * 2,
max_problems=eval_metrics_max_problems,
)
metrics["arc_easy_acc"] = run_chat_eval(
"ARC-Easy",
model,
tokenizer,
engine,
batch_size=device_batch_size * 2,
max_problems=eval_metrics_max_problems,
)
metrics_str = ', '.join(f'{k}: {v:.6f}' for k, v in metrics.items())
print0(f"Step {step:05d} | {metrics_str}")
wandb_run.log({
wandb_run.log(
{
"step": step,
**metrics,
})
}
)
model.train()
if last_step:
@ -238,13 +272,17 @@ for step in range(num_iterations):
# logging
train_loss_item = train_loss.item()
num_tokens_item = num_tokens.item()
print0(f"Step {step:05d}/{num_iterations:05d} | Training loss: {train_loss_item:.6f}| lrm: {lrm:.6f}| num_tokens: {num_tokens_item:,}")
wandb_run.log({
print0(
f"Step {step:05d}/{num_iterations:05d} | Training loss: {train_loss_item:.6f}| lrm: {lrm:.6f}| num_tokens: {num_tokens_item:,}"
)
wandb_run.log(
{
"step": step,
"lrm": lrm,
"train_loss": train_loss_item,
"num_tokens": num_tokens_item,
})
}
)
step += 1
# Save the model at the end of the run
@ -264,13 +302,16 @@ if master_process:
"val_loss": val_loss,
**metrics,
"model_config": model_config_kwargs,
}
},
)
print(f"✅ Saved model checkpoint to {checkpoint_dir}")
# Log to report
from nanochat.report import get_report
get_report().log(section="Chat SFT", data=[
get_report().log(
section="Chat SFT",
data=[
user_config, # CLI args
{
"Training rows": len(train_ds),
@ -278,7 +319,8 @@ get_report().log(section="Chat SFT", data=[
"Training loss": train_loss_item,
"Validation loss": val_loss,
},
])
],
)
# Cleanup
wandb_run.finish()
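
The collation inside `sft_data_generator` does the standard next-token shift and then masks out non-assistant positions with the ignore index. A self-contained toy version of that inner loop (ids and mask invented):

```
import torch

ids  = [5, 11, 12, 13, 7]  # one tokenized conversation
mask = [0,  0,  1,  1, 1]  # 1 = assistant tokens that should contribute to the loss

inputs  = torch.tensor(ids[:-1])           # [5, 11, 12, 13]
targets = torch.tensor(ids[1:])            # [11, 12, 13, 7]
targets[torch.tensor(mask[1:]) == 0] = -1  # -> [-1, 12, 13, 7]; -1 is the ignore index
```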

scripts/chat_web.py
View File

@ -31,22 +31,23 @@ Abuse Prevention:
"""
import argparse
import json
import os
import torch
import asyncio
import json
import logging
import os
import random
from contextlib import asynccontextmanager
from collections.abc import AsyncGenerator
from contextlib import asynccontextmanager, nullcontext
from dataclasses import dataclass
import torch
from fastapi import FastAPI, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from fastapi.responses import StreamingResponse, HTMLResponse, FileResponse
from fastapi.responses import FileResponse, HTMLResponse, StreamingResponse
from pydantic import BaseModel
from typing import List, Optional, AsyncGenerator
from dataclasses import dataclass
from contextlib import nullcontext
from nanochat.common import compute_init, autodetect_device_type
from nanochat.checkpoint_manager import load_model
from nanochat.common import autodetect_device_type, compute_init
from nanochat.engine import Engine
# Abuse prevention limits
@ -70,52 +71,56 @@ parser.add_argument('-g', '--model-tag', type=str, default=None, help='Model tag
parser.add_argument('-s', '--step', type=int, default=None, help='Step to load')
parser.add_argument('-p', '--port', type=int, default=8000, help='Port to run the server on')
parser.add_argument('-d', '--dtype', type=str, default='bfloat16', choices=['float32', 'bfloat16'])
parser.add_argument('--device-type', type=str, default='', choices=['cuda', 'cpu', 'mps'], help='Device type for evaluation: cuda|cpu|mps. empty => autodetect')
parser.add_argument(
'--device-type',
type=str,
default='',
choices=['cuda', 'cpu', 'mps'],
help='Device type for evaluation: cuda|cpu|mps. empty => autodetect',
)
parser.add_argument('--host', type=str, default='0.0.0.0', help='Host to bind the server to')
args = parser.parse_args()
# Configure logging for conversation traffic
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s - %(message)s',
datefmt='%Y-%m-%d %H:%M:%S'
)
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(message)s', datefmt='%Y-%m-%d %H:%M:%S')
logger = logging.getLogger(__name__)
device_type = autodetect_device_type() if args.device_type == "" else args.device_type
ddp, ddp_rank, ddp_local_rank, ddp_world_size, device = compute_init(device_type)
ptdtype = torch.float32 if args.dtype == 'float32' else torch.bfloat16
@dataclass
class Worker:
"""A worker with a model loaded on a specific GPU."""
gpu_id: int
device: torch.device
engine: Engine
tokenizer: object
autocast_ctx: torch.amp.autocast
class WorkerPool:
"""Pool of workers, each with a model replica on a different GPU."""
def __init__(self, num_gpus: Optional[int] = None):
def __init__(self, num_gpus: int | None = None):
if num_gpus is None:
if device_type == "cuda":
num_gpus = torch.cuda.device_count()
else:
num_gpus = 1 # e.g. cpu|mps
self.num_gpus = num_gpus
self.workers: List[Worker] = []
self.workers: list[Worker] = []
self.available_workers: asyncio.Queue = asyncio.Queue()
async def initialize(self, source: str, model_tag: Optional[str] = None, step: Optional[int] = None):
async def initialize(self, source: str, model_tag: str | None = None, step: int | None = None):
"""Load model on each GPU."""
print(f"Initializing worker pool with {self.num_gpus} GPUs...")
if self.num_gpus > 1:
assert device_type == "cuda", "Only CUDA supports multiple workers/GPUs. cpu|mps does not."
for gpu_id in range(self.num_gpus):
if device_type == "cuda":
device = torch.device(f"cuda:{gpu_id}")
print(f"Loading model on GPU {gpu_id}...")
@ -125,15 +130,11 @@ class WorkerPool:
model, tokenizer, _ = load_model(source, device, phase="eval", model_tag=model_tag, step=step)
engine = Engine(model, tokenizer)
autocast_ctx = torch.amp.autocast(device_type=device_type, dtype=ptdtype) if device_type == "cuda" else nullcontext()
worker = Worker(
gpu_id=gpu_id,
device=device,
engine=engine,
tokenizer=tokenizer,
autocast_ctx=autocast_ctx
autocast_ctx = (
torch.amp.autocast(device_type=device_type, dtype=ptdtype) if device_type == "cuda" else nullcontext()
)
worker = Worker(gpu_id=gpu_id, device=device, engine=engine, tokenizer=tokenizer, autocast_ctx=autocast_ctx)
self.workers.append(worker)
await self.available_workers.put(worker)
@ -147,15 +148,18 @@ class WorkerPool:
"""Return a worker to the pool."""
await self.available_workers.put(worker)
class ChatMessage(BaseModel):
role: str
content: str
class ChatRequest(BaseModel):
messages: List[ChatMessage]
temperature: Optional[float] = None
max_tokens: Optional[int] = None
top_k: Optional[int] = None
messages: list[ChatMessage]
temperature: float | None = None
max_tokens: int | None = None
top_k: int | None = None
def validate_chat_request(request: ChatRequest):
"""Validate chat request to prevent abuse."""
@ -165,7 +169,7 @@ def validate_chat_request(request: ChatRequest):
if len(request.messages) > MAX_MESSAGES_PER_REQUEST:
raise HTTPException(
status_code=400,
detail=f"Too many messages. Maximum {MAX_MESSAGES_PER_REQUEST} messages allowed per request"
detail=f"Too many messages. Maximum {MAX_MESSAGES_PER_REQUEST} messages allowed per request",
)
# Check individual message lengths and total conversation length
@ -178,48 +182,43 @@ def validate_chat_request(request: ChatRequest):
if msg_length > MAX_MESSAGE_LENGTH:
raise HTTPException(
status_code=400,
detail=f"Message {i} is too long. Maximum {MAX_MESSAGE_LENGTH} characters allowed per message"
detail=f"Message {i} is too long. Maximum {MAX_MESSAGE_LENGTH} characters allowed per message",
)
total_length += msg_length
if total_length > MAX_TOTAL_CONVERSATION_LENGTH:
raise HTTPException(
status_code=400,
detail=f"Total conversation is too long. Maximum {MAX_TOTAL_CONVERSATION_LENGTH} characters allowed"
detail=f"Total conversation is too long. Maximum {MAX_TOTAL_CONVERSATION_LENGTH} characters allowed",
)
# Validate role values
for i, message in enumerate(request.messages):
if message.role not in ["user", "assistant"]:
raise HTTPException(
status_code=400,
detail=f"Message {i} has invalid role. Must be 'user', 'assistant', or 'system'"
status_code=400, detail=f"Message {i} has invalid role. Must be 'user', 'assistant', or 'system'"
)
# Validate temperature
if request.temperature is not None:
if not (MIN_TEMPERATURE <= request.temperature <= MAX_TEMPERATURE):
raise HTTPException(
status_code=400,
detail=f"Temperature must be between {MIN_TEMPERATURE} and {MAX_TEMPERATURE}"
status_code=400, detail=f"Temperature must be between {MIN_TEMPERATURE} and {MAX_TEMPERATURE}"
)
# Validate top_k
if request.top_k is not None:
if not (MIN_TOP_K <= request.top_k <= MAX_TOP_K):
raise HTTPException(
status_code=400,
detail=f"top_k must be between {MIN_TOP_K} and {MAX_TOP_K}"
)
raise HTTPException(status_code=400, detail=f"top_k must be between {MIN_TOP_K} and {MAX_TOP_K}")
# Validate max_tokens
if request.max_tokens is not None:
if not (MIN_MAX_TOKENS <= request.max_tokens <= MAX_MAX_TOKENS):
raise HTTPException(
status_code=400,
detail=f"max_tokens must be between {MIN_MAX_TOKENS} and {MAX_MAX_TOKENS}"
status_code=400, detail=f"max_tokens must be between {MIN_MAX_TOKENS} and {MAX_MAX_TOKENS}"
)
@asynccontextmanager
async def lifespan(app: FastAPI):
"""Load models on all GPUs on startup."""
@ -229,6 +228,7 @@ async def lifespan(app: FastAPI):
print(f"Server ready at http://localhost:{args.port}")
yield
app = FastAPI(lifespan=lifespan)
app.add_middleware(
@ -239,16 +239,16 @@ app.add_middleware(
allow_headers=["*"],
)
@app.get("/")
async def root():
"""Serve the chat UI."""
ui_html_path = os.path.join("nanochat", "ui.html")
with open(ui_html_path, "r", encoding="utf-8") as f:
with open(ui_html_path, encoding="utf-8") as f:
html_content = f.read()
# Replace the API_URL to use the same origin
html_content = html_content.replace(
"const API_URL = `http://${window.location.hostname}:8000`;",
"const API_URL = '';"
"const API_URL = `http://${window.location.hostname}:8000`;", "const API_URL = '';"
)
return HTMLResponse(content=html_content)
@ -259,12 +259,9 @@ async def logo():
logo_path = os.path.join("nanochat", "logo.svg")
return FileResponse(logo_path, media_type="image/svg+xml")
async def generate_stream(
worker: Worker,
tokens,
temperature=None,
max_new_tokens=None,
top_k=None
worker: Worker, tokens, temperature=None, max_new_tokens=None, top_k=None
) -> AsyncGenerator[str, None]:
"""Generate assistant response with streaming."""
temperature = temperature if temperature is not None else args.temperature
@ -286,7 +283,7 @@ async def generate_stream(
max_tokens=max_new_tokens,
temperature=temperature,
top_k=top_k,
seed=random.randint(0, 2**31 - 1)
seed=random.randint(0, 2**31 - 1),
):
token = token_column[0]
@ -303,13 +300,14 @@ async def generate_stream(
# This ensures we don't emit incomplete UTF-8 sequences
if not current_text.endswith('�'):
# Extract only the new text since last clean decode
new_text = current_text[len(last_clean_text):]
new_text = current_text[len(last_clean_text) :]
if new_text: # Only yield if there's new content
yield f"data: {json.dumps({'token': new_text, 'gpu': worker.gpu_id}, ensure_ascii=False)}\n\n"
last_clean_text = current_text
yield f"data: {json.dumps({'done': True})}\n\n"
@app.post("/chat/completions")
async def chat_completions(request: ChatRequest):
"""Chat completion endpoint (streaming only) - uses worker pool for multi-GPU."""
@ -318,10 +316,10 @@ async def chat_completions(request: ChatRequest):
validate_chat_request(request)
# Log incoming conversation to console
logger.info("="*20)
logger.info("=" * 20)
for i, message in enumerate(request.messages):
logger.info(f"[{message.role.upper()}]: {message.content}")
logger.info("-"*20)
logger.info("-" * 20)
# Acquire a worker from the pool (will wait if all are busy)
worker_pool = app.state.worker_pool
@ -350,6 +348,7 @@ async def chat_completions(request: ChatRequest):
# Streaming response with worker release after completion
response_tokens = []
async def stream_and_release():
try:
async for chunk in generate_stream(
@ -357,7 +356,7 @@ async def chat_completions(request: ChatRequest):
conversation_tokens,
temperature=request.temperature,
max_new_tokens=request.max_tokens,
top_k=request.top_k
top_k=request.top_k,
):
# Accumulate response for logging
chunk_data = json.loads(chunk.replace("data: ", "").strip())
@ -368,19 +367,17 @@ async def chat_completions(request: ChatRequest):
# Log the assistant response to console
full_response = "".join(response_tokens)
logger.info(f"[ASSISTANT] (GPU {worker.gpu_id}): {full_response}")
logger.info("="*20)
logger.info("=" * 20)
# Release worker back to pool after streaming is done
await worker_pool.release_worker(worker)
return StreamingResponse(
stream_and_release(),
media_type="text/event-stream"
)
return StreamingResponse(stream_and_release(), media_type="text/event-stream")
except Exception as e:
# Make sure to release worker even on error
await worker_pool.release_worker(worker)
raise e
@app.get("/health")
async def health():
"""Health check endpoint."""
@ -389,9 +386,10 @@ async def health():
"status": "ok",
"ready": worker_pool is not None and len(worker_pool.workers) > 0,
"num_gpus": worker_pool.num_gpus if worker_pool else 0,
"available_workers": worker_pool.available_workers.qsize() if worker_pool else 0
"available_workers": worker_pool.available_workers.qsize() if worker_pool else 0,
}
@app.get("/stats")
async def stats():
"""Get worker pool statistics."""
@ -400,16 +398,13 @@ async def stats():
"total_workers": len(worker_pool.workers),
"available_workers": worker_pool.available_workers.qsize(),
"busy_workers": len(worker_pool.workers) - worker_pool.available_workers.qsize(),
"workers": [
{
"gpu_id": w.gpu_id,
"device": str(w.device)
} for w in worker_pool.workers
]
"workers": [{"gpu_id": w.gpu_id, "device": str(w.device)} for w in worker_pool.workers],
}
if __name__ == "__main__":
import uvicorn
print(f"Starting NanoChat Web Server")
print("Starting NanoChat Web Server")
print(f"Temperature: {args.temperature}, Top-k: {args.top_k}, Max tokens: {args.max_tokens}")
uvicorn.run(app, host=args.host, port=args.port)
View File
@ -9,25 +9,26 @@ Or torchrun for training:
torchrun --standalone --nproc_per_node=8 -m scripts.mid_train -- --device_batch_size=16
"""
from collections import deque
import os
from collections import deque
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
import time
import wandb
import torch
from contextlib import nullcontext
from nanochat.common import compute_init, compute_cleanup, print0, DummyWandb, get_base_dir, autodetect_device_type
from nanochat.tokenizer import get_token_bytes
from nanochat.checkpoint_manager import save_checkpoint
from nanochat.loss_eval import evaluate_bpb
from nanochat.checkpoint_manager import load_model
import torch.distributed as dist
import torch
import torch.distributed as dist
import wandb
from nanochat.checkpoint_manager import load_model, save_checkpoint
from nanochat.common import DummyWandb, autodetect_device_type, compute_cleanup, compute_init, get_base_dir, print0
from nanochat.loss_eval import evaluate_bpb
from nanochat.tokenizer import get_token_bytes
from tasks.common import TaskMixture
from tasks.customjson import CustomJSON
from tasks.gsm8k import GSM8K
from tasks.mmlu import MMLU
from tasks.smoltalk import SmolTalk
from tasks.customjson import CustomJSON
from tasks.spellingbee import SimpleSpelling, SpellingBee
# -----------------------------------------------------------------------------
@ -45,10 +46,10 @@ matrix_lr = 0.02
init_lr_frac = 1.0 # initial learning rate is this fraction of the base learning rate
weight_decay = 0.0
eval_every = 150 # -1 = disable
eval_tokens = 20*524288
eval_tokens = 20 * 524288
total_batch_size = 524288
dry_run = 0 # dry_run=1 is for experiments: we will log to wandb but we won't write checkpoints or report
config_keys = [k for k,v in globals().items() if not k.startswith('_') and isinstance(v, (int, float, bool, str))]
config_keys = [k for k, v in globals().items() if not k.startswith('_') and isinstance(v, (int, float, bool, str))]
exec(open(os.path.join('nanochat', 'configurator.py')).read()) # overrides from command line or config file
user_config = {k: globals()[k] for k in config_keys} # possibly useful for logging
# -----------------------------------------------------------------------------
@ -57,7 +58,9 @@ user_config = {k: globals()[k] for k in config_keys} # possibly useful for loggi
device_type = autodetect_device_type() if device_type == "" else device_type
ddp, ddp_rank, ddp_local_rank, ddp_world_size, device = compute_init(device_type)
master_process = ddp_rank == 0
autocast_ctx = torch.amp.autocast(device_type=device_type, dtype=torch.bfloat16) if device_type == "cuda" else nullcontext()
autocast_ctx = (
torch.amp.autocast(device_type=device_type, dtype=torch.bfloat16) if device_type == "cuda" else nullcontext()
)
synchronize = torch.cuda.synchronize if device_type == "cuda" else lambda: None
get_max_memory = torch.cuda.max_memory_allocated if device_type == "cuda" else lambda: 0
@ -69,7 +72,9 @@ wandb_run = DummyWandb() if use_dummy_wandb else wandb.init(project="nanochat-mi
model, tokenizer, meta = load_model("base", device, phase="train", model_tag=model_tag, step=step)
pretrain_batch_size = meta.get("device_batch_size", None)
if pretrain_batch_size is not None and device_batch_size > pretrain_batch_size:
print0(f"FOOTGUN WARNING: base model training used device_batch_size {pretrain_batch_size}, did you pass in a good --device_batch_size to this script?")
print0(
f"FOOTGUN WARNING: base model training used device_batch_size {pretrain_batch_size}, did you pass in a good --device_batch_size to this script?"
)
orig_model = model
model = torch.compile(model, dynamic=False)
depth = model.config.n_layer
@ -84,7 +89,9 @@ print0(f"Total batch size {total_batch_size:,} => gradient accumulation steps: {
token_bytes = get_token_bytes(device=device)
# Initialize the Optimizer (Muon for Linear layers, AdamW for embedding and lm_head)
optimizers = model.setup_optimizers(unembedding_lr=unembedding_lr, embedding_lr=embedding_lr, matrix_lr=matrix_lr, weight_decay=weight_decay)
optimizers = model.setup_optimizers(
unembedding_lr=unembedding_lr, embedding_lr=embedding_lr, matrix_lr=matrix_lr, weight_decay=weight_decay
)
adamw_optimizer, muon_optimizer = optimizers
# Override the initial learning rate as a fraction of the base learning rate
for opt in optimizers:
@ -95,25 +102,33 @@ for opt in optimizers:
# Midtraining data mixture and DataLoader
base_dir = get_base_dir()
identity_conversations_filepath = os.path.join(base_dir, "identity_conversations.jsonl")
train_dataset = TaskMixture([
train_dataset = TaskMixture(
[
SmolTalk(split="train"), # 460K rows of general conversations
MMLU(subset="auxiliary_train", split="train"), # 100K rows of multiple choice problems drawn from ARC, MC_TEST, OBQA, RACE
MMLU(
subset="auxiliary_train", split="train"
), # 100K rows of multiple choice problems drawn from ARC, MC_TEST, OBQA, RACE
GSM8K(subset="main", split="train"), # 8K rows teaching simple math and (calculator) tool use
CustomJSON(filepath=identity_conversations_filepath), # 1000 rows of synthetic identity conversations
CustomJSON(filepath=identity_conversations_filepath), # let's do 2 epochs of these
SimpleSpelling(size=200000, split="train"), # 200K rows of Simple Spelling (e.g. spell the word 'apple')
SpellingBee(size=80000, split="train"), # 80K rows of Spelling Bee (e.g. how many 'r' are in 'strawberry'?)
]) # total: 460K + 100K + 8K + 200K + 80K = 848K rows
val_dataset = TaskMixture([
]
) # total: 460K + 100K + 8K + 200K + 80K = 848K rows
val_dataset = TaskMixture(
[
SmolTalk(split="test"), # 24K rows in test set
MMLU(subset="all", split="test", stop=5200), # 14K rows in test set, use only 5.2K to match the train ratios
GSM8K(subset="main", split="test", stop=420), # 1.32K rows in test set, use only 420 to match the train ratios
]) # total: 24K + 14K + 1.32K ~= 39K rows
]
) # total: 24K + 14K + 1.32K ~= 39K rows
# DataLoader is defined here, it emits inputs, targets : 2D tensors of shape (device_batch_size, max_seq_len)
# A big problem is that we don't know the final num_iterations in advance. So we create
# these two global variables and update them from within the data generator.
last_step = False # we will toggle this to True when we reach the end of the dataset
approx_progress = 0.0 # will go from 0 to 1 over the course of the epoch
def mid_data_generator(split):
global last_step, approx_progress
assert split in {"train", "val"}, "split must be 'train' or 'val'"
@ -147,7 +162,9 @@ def mid_data_generator(split):
inputs_cpu = scratch[:-1].to(dtype=torch.int32)
targets_cpu = scratch[1:]
inputs = inputs_cpu.view(device_batch_size, max_seq_len).to(device=device, dtype=torch.int32, non_blocking=True)
targets = targets_cpu.view(device_batch_size, max_seq_len).to(device=device, dtype=torch.int64, non_blocking=True)
targets = targets_cpu.view(device_batch_size, max_seq_len).to(
device=device, dtype=torch.int64, non_blocking=True
)
if split == "train":
if num_iterations > 0:
approx_progress = it / num_iterations # calculate progress from the max number of iterations
@ -155,21 +172,25 @@ def mid_data_generator(split):
approx_progress = cursor / dataset_size # approximate progress as a fraction of the dataset
yield inputs, targets
train_loader = mid_data_generator("train")
build_val_loader = lambda: mid_data_generator("val")
progress = 0 # will go from 0 to 1 over the course of the epoch
# Learning rate scheduler
def get_lr_multiplier(progress):
# first 80% of training: no decay, then linearly ramp down to 0.
return 1 if progress < 0.8 else 1 - (progress - 0.8) / 0.2
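# illustrative value (not part of the diff): progress = 0.9 -> 1 - (0.9 - 0.8) / 0.2 = 0.5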
# Momentum scheduler for Muon optimizer
def get_muon_momentum(it):
frac = min(it / 300, 1)
momentum = (1 - frac) * 0.85 + frac * 0.95
return momentum
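# illustrative value (not part of the diff): it = 150 -> frac = 0.5, momentum = 0.5 * 0.85 + 0.5 * 0.95 = 0.90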
# -----------------------------------------------------------------------------
# Training loop
x, y = next(train_loader) # prefetch the very first batch of data
@ -197,12 +218,14 @@ while True:
print0(f"Step {step:05d} | Validation bpb: {val_bpb:.4f}")
if val_bpb < min_val_bpb:
min_val_bpb = val_bpb
wandb_run.log({
wandb_run.log(
{
"step": step,
"total_training_flops": flops_so_far,
"total_training_time": total_training_time,
"val/bpb": val_bpb,
})
}
)
model.train()
# save checkpoint at the end of the run (only on master process)
@ -226,7 +249,7 @@ while True:
"n_embd": model.config.n_embd,
},
"user_config": user_config, # inputs to the training script
}
},
)
if last_step:
@ -266,7 +289,7 @@ while True:
# logging
smooth_train_loss = ema_beta * smooth_train_loss + (1 - ema_beta) * train_loss.item() # EMA the training loss
debiased_smooth_loss = smooth_train_loss / (1 - ema_beta**(step + 1)) # debias the EMA
debiased_smooth_loss = smooth_train_loss / (1 - ema_beta ** (step + 1)) # debias the EMA
pct_done = 100 * progress
tok_per_sec = int(total_batch_size / dt)
flops_per_sec = num_flops_per_token * total_batch_size / dt
@ -274,9 +297,12 @@ while True:
mfu = 100 * flops_per_sec / promised_flops_per_sec_h100 # in %
if step > 10:
total_training_time += dt # only count the time after the first 10 steps
print0(f"step {step:05d} ({pct_done:.2f}%) | loss: {debiased_smooth_loss:.6f} | lrm: {lrm:.2f} | dt: {dt * 1000:.2f}ms | tok/sec: {tok_per_sec:,} | mfu: {mfu:.2f} | total time: {total_training_time/60:.2f}m")
print0(
f"step {step:05d} ({pct_done:.2f}%) | loss: {debiased_smooth_loss:.6f} | lrm: {lrm:.2f} | dt: {dt * 1000:.2f}ms | tok/sec: {tok_per_sec:,} | mfu: {mfu:.2f} | total time: {total_training_time / 60:.2f}m"
)
if step % 10 == 0:
wandb_run.log({
wandb_run.log(
{
"step": step,
"total_training_flops": flops_so_far,
"total_training_time": total_training_time,
@ -285,17 +311,21 @@ while True:
"train/dt": dt,
"train/tok_per_sec": tok_per_sec,
"train/mfu": mfu,
})
}
)
# print a few more stats
print0(f"Peak memory usage: {get_max_memory() / 1024 / 1024:.2f}MiB")
print0(f"Total training time: {total_training_time/60:.2f}m")
print0(f"Total training time: {total_training_time / 60:.2f}m")
print0(f"Minimum validation bpb: {min_val_bpb:.4f}")
# Log to report
if not dry_run:
from nanochat.report import get_report
get_report().log(section="Midtraining", data=[
get_report().log(
section="Midtraining",
data=[
user_config, # CLI args
{ # stats about the training setup
"Number of iterations": step,
@ -303,8 +333,9 @@ if not dry_run:
},
{ # stats about training outcomes
"Minimum validation bpb": min_val_bpb,
}
])
},
],
)
# cleanup
wandb_run.finish() # wandb run finish
View File
@ -2,8 +2,8 @@
Evaluate compression ratio of the tokenizer.
"""
from nanochat.tokenizer import get_tokenizer, RustBPETokenizer
from nanochat.dataset import parquets_iter_batched
from nanochat.tokenizer import RustBPETokenizer, get_tokenizer
# Random text I got from a random website this morning
news_text = r"""
@ -165,7 +165,6 @@ tokenizer_results = {}
vocab_sizes = {}
for tokenizer_name in ["gpt2", "gpt4", "ours"]:
if tokenizer_name == "gpt2":
tokenizer = RustBPETokenizer.from_pretrained("gpt2") # gpt-2 base model tokenizer
elif tokenizer_name == "gpt4":
@ -183,11 +182,7 @@ for tokenizer_name in ["gpt2", "gpt4", "ours"]:
encoded_bytes = text.encode('utf-8')
ratio = len(encoded_bytes) / len(encoded)
tokenizer_results[tokenizer_name][name] = {
'bytes': len(encoded_bytes),
'tokens': len(encoded),
'ratio': ratio
}
tokenizer_results[tokenizer_name][name] = {'bytes': len(encoded_bytes), 'tokens': len(encoded), 'ratio': ratio}
# ANSI color codes
GREEN = '\033[92m'
@ -195,11 +190,12 @@ RED = '\033[91m'
RESET = '\033[0m'
# Print vocab sizes
print(f"\nVocab sizes:")
print("\nVocab sizes:")
print(f"GPT-2: {vocab_sizes['gpt2']}")
print(f"GPT-4: {vocab_sizes['gpt4']}")
print(f"Ours: {vocab_sizes['ours']}")
def print_comparison(baseline_name, baseline_results, ours_results, all_text):
"""Print comparison table between baseline tokenizer and ours."""
print(f"\nComparison with {baseline_name}:")
@ -230,13 +226,16 @@ def print_comparison(baseline_name, baseline_results, ours_results, all_text):
better = "Tie"
diff_color = ""
print(f"{name:<10} {baseline_data['bytes']:<8} "
print(
f"{name:<10} {baseline_data['bytes']:<8} "
f"{baseline_color}{baseline_data['tokens']:<7}{RESET} "
f"{baseline_color}{baseline_data['ratio']:<7.2f}{RESET} "
f"{ours_color}{ours_data['tokens']:<7}{RESET} "
f"{ours_color}{ours_data['ratio']:<7.2f}{RESET} "
f"{diff_color}{relative_diff:+7.1f}%{RESET} "
f"{better:<10}")
f"{better:<10}"
)
# Print comparisons
print_comparison("GPT-2", tokenizer_results['gpt2'], tokenizer_results['ours'], all_text)
@ -244,6 +243,7 @@ print_comparison("GPT-4", tokenizer_results['gpt4'], tokenizer_results['ours'],
# Log to report
from nanochat.report import get_report
lines = []
for baseline_name in ["GPT-2", "GPT-4"]:
baseline_key = baseline_name.lower().replace('-', '')
@ -251,15 +251,26 @@ for baseline_name in ["GPT-2", "GPT-4"]:
ours_results = tokenizer_results['ours']
lines.append(f"### Comparison with {baseline_name}")
lines.append("")
lines.append("| Text Type | Bytes | " + baseline_name + " Tokens | " + baseline_name + " Ratio | Ours Tokens | Ours Ratio | Relative Diff % |")
lines.append(
"| Text Type | Bytes | "
+ baseline_name
+ " Tokens | "
+ baseline_name
+ " Ratio | Ours Tokens | Ours Ratio | Relative Diff % |"
)
lines.append("|-----------|-------|--------------|--------------|-------------|------------|-----------------|")
for name, text in all_text:
baseline_data = baseline_results[name]
ours_data = ours_results[name]
relative_diff = ((baseline_data['tokens'] - ours_data['tokens']) / baseline_data['tokens']) * 100
lines.append(f"| {name} | {baseline_data['bytes']} | {baseline_data['tokens']} | {baseline_data['ratio']:.2f} | {ours_data['tokens']} | {ours_data['ratio']:.2f} | {relative_diff:+.1f}% |")
lines.append(
f"| {name} | {baseline_data['bytes']} | {baseline_data['tokens']} | {baseline_data['ratio']:.2f} | {ours_data['tokens']} | {ours_data['ratio']:.2f} | {relative_diff:+.1f}% |"
)
lines.append("")
report_markdown = "\n".join(lines)
get_report().log(section="Tokenizer evaluation", data=[
get_report().log(
section="Tokenizer evaluation",
data=[
report_markdown,
])
],
)
View File
@ -2,19 +2,24 @@
Train a tokenizer using the HuggingFace Tokenizers library.
In the style of GPT-4 tokenizer.
"""
import argparse
import os
import time
import argparse
import torch
from nanochat.tokenizer import RustBPETokenizer
from nanochat.common import get_base_dir
from nanochat.dataset import parquets_iter_batched
from nanochat.tokenizer import RustBPETokenizer
# -----------------------------------------------------------------------------
# Parse command line arguments
parser = argparse.ArgumentParser(description='Train a BPE tokenizer')
parser.add_argument('--max_chars', type=int, default=10_000_000_000, help='Maximum characters to train on (default: 10B)')
parser.add_argument(
'--max_chars', type=int, default=10_000_000_000, help='Maximum characters to train on (default: 10B)'
)
parser.add_argument('--doc_cap', type=int, default=10_000, help='Maximum characters per document (default: 10,000)')
parser.add_argument('--vocab_size', type=int, default=65536, help='Vocabulary size (default: 65536 = 2^16)')
args = parser.parse_args()
@ -25,6 +30,7 @@ print(f"vocab_size: {args.vocab_size:,}")
# -----------------------------------------------------------------------------
# Text iterator
def text_iterator():
"""
1) Flatten the batches into a single iterator
@ -36,11 +42,13 @@ def text_iterator():
for doc in batch:
doc_text = doc
if len(doc_text) > args.doc_cap:
doc_text = doc_text[:args.doc_cap]
doc_text = doc_text[: args.doc_cap]
nchars += len(doc_text)
yield doc_text
if nchars > args.max_chars:
return
text_iter = text_iterator()
# -----------------------------------------------------------------------------
@ -92,8 +100,11 @@ print(f"Saved token_bytes to {token_bytes_path}")
# Log to report
from nanochat.report import get_report
token_bytes_nonzero = (token_bytes[token_bytes > 0]).to(dtype=torch.float32)
get_report().log(section="Tokenizer training", data=[
get_report().log(
section="Tokenizer training",
data=[
vars(args), # argparse command line arguments
{"train_time": train_time},
{"num_special_tokens": len(special_set)},
@ -102,5 +113,6 @@ get_report().log(section="Tokenizer training", data=[
"token_bytes_max": int(token_bytes_nonzero.max().item()),
"token_bytes_mean": token_bytes_nonzero.mean().item(),
"token_bytes_std": token_bytes_nonzero.std().item(),
}
])
},
],
)
View File
@ -4,10 +4,11 @@ https://huggingface.co/datasets/allenai/ai2_arc
"""
from datasets import load_dataset
from tasks.common import Task, render_mc
class ARC(Task):
class ARC(Task):
def __init__(self, subset, split, **kwargs):
super().__init__(**kwargs)
assert subset in ["ARC-Easy", "ARC-Challenge"], "ARC subset must be ARC-Easy or ARC-Challenge"
@ -30,10 +31,7 @@ class ARC(Task):
assert answer_string in letters, f"ARC answer {answer_string} must be one of {letters}" # sanity check
# create and return the Conversation object
user_message = render_mc(question, letters, choices)
messages = [
{"role": "user", "content": user_message},
{"role": "assistant", "content": answer_string}
]
messages = [{"role": "user", "content": user_message}, {"role": "assistant", "content": answer_string}]
conversation = {
"messages": messages,
"letters": letters, # useful during evaluation, so we can narrow and clamp the assistant prediction to one of the letters
@ -43,6 +41,8 @@ class ARC(Task):
def evaluate(self, conversation, assistant_response):
# the assert here is not strictly speaking needed, but currently the way we eval, we expect this to be true
# I'm going to leave the assert here to prevent footguns, but possibly in the future can remove it.
assert assistant_response in conversation['letters'], f"ARC answer {assistant_response} is expected to be one of {conversation['letters']}"
assert assistant_response in conversation['letters'], (
f"ARC answer {assistant_response} is expected to be one of {conversation['letters']}"
)
assistant_message = conversation['messages'][-1]['content'] # e.g. "A"
return assistant_response == assistant_message
View File
@ -7,6 +7,7 @@ Example tasks: MMLU, ARC-Easy, ARC-Challenge, GSM8K, HumanEval, SmolTalk.
import random
class Task:
"""
Base class of a Task. Allows for lightweight slicing of the underlying dataset.
@ -81,7 +82,9 @@ class TaskMixture(Task):
Access conversations according to a deterministic shuffle of all examples.
This ensures tasks are mixed throughout training, regardless of dataset size.
"""
assert 0 <= index < self.num_conversations, f"Index {index} out of range for mixture with {self.num_conversations} conversations"
assert 0 <= index < self.num_conversations, (
f"Index {index} out of range for mixture with {self.num_conversations} conversations"
)
task_idx, local_idx = self.index_map[index]
return self.tasks[task_idx][local_idx]
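# A minimal sketch (hypothetical names, not part of this diff) of the deterministic mixture
# indexing described above: enumerate (task_idx, local_idx) pairs for every example, then
# shuffle them once with a fixed seed so the interleaving is reproducible across runs.
import random

def build_index_map(task_lengths, seed=42):
    index_map = [(t, i) for t, n in enumerate(task_lengths) for i in range(n)]
    random.Random(seed).shuffle(index_map)
    return index_map  # index_map[k] -> (task_idx, local_idx)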
@ -102,7 +105,9 @@ class TaskSequence(Task):
return self.num_conversations
def get_example(self, index):
assert 0 <= index < self.num_conversations, f"Index {index} out of range for sequence with {self.num_conversations} conversations"
assert 0 <= index < self.num_conversations, (
f"Index {index} out of range for sequence with {self.num_conversations} conversations"
)
for task_idx, task_length in enumerate(self.lengths):
if index < task_length:
return self.tasks[task_idx][index]
View File
@ -3,10 +3,12 @@ CustomJSON task for loading conversations from JSONL files.
Each line in the JSONL file should be a JSON array of messages.
"""
import os
import json
import os
from tasks.common import Task
class CustomJSON(Task):
"""
Load conversations from a JSONL file.
@ -25,14 +27,18 @@ class CustomJSON(Task):
print("-" * 80)
print(f"Warning: File {filepath} does not exist")
print("HINT (Oct 21 2025)")
print("If you recently did a git pull and suddely see this, it might be due to the new addition of identity conversations")
print(
"If you recently did a git pull and suddely see this, it might be due to the new addition of identity conversations"
)
print("See this discussion for more details: https://github.com/karpathy/nanochat/discussions/139")
print("Quick fix: simply run the following command to download the file and you're done:")
print(f"curl -L -o {filepath} https://karpathy-public.s3.us-west-2.amazonaws.com/identity_conversations.jsonl")
print(
f"curl -L -o {filepath} https://karpathy-public.s3.us-west-2.amazonaws.com/identity_conversations.jsonl"
)
print("-" * 80)
else:
with open(filepath, 'r', encoding='utf-8') as f:
with open(filepath, encoding='utf-8') as f:
for line in f:
line = line.strip()
if not line: # skip empty lines
@ -46,7 +52,9 @@ class CustomJSON(Task):
assert "role" in message, f"Message {i} missing 'role' field"
assert "content" in message, f"Message {i} missing 'content' field"
expected_role = "user" if i % 2 == 0 else "assistant"
assert message["role"] == expected_role, f"Message {i} has role {message['role']} but should be {expected_role}"
assert message["role"] == expected_role, (
f"Message {i} has role {message['role']} but should be {expected_role}"
)
assert isinstance(message["content"], str), f"Message {i} content must be a string"
self.conversations.append(messages)
@ -62,4 +70,3 @@ class CustomJSON(Task):
"messages": messages,
}
return conversation
View File
@ -15,11 +15,14 @@ Notice that GSM8K uses tool calls inside << >> tags.
"""
import re
from datasets import load_dataset
from tasks.common import Task
GSM_RE = re.compile(r"#### (\-?[0-9\.\,]+)")
def extract_answer(completion):
"""
Extract the numerical answer after #### marker.
@ -35,7 +38,6 @@ def extract_answer(completion):
class GSM8K(Task):
def __init__(self, subset, split, **kwargs):
super().__init__(**kwargs)
assert subset in ["main", "socratic"], "GSM8K subset must be main|socratic"
@ -50,7 +52,7 @@ class GSM8K(Task):
return len(self.ds)
def get_example(self, index):
""" Get a single problem from the dataset. """
"""Get a single problem from the dataset."""
row = self.ds[index]
question = row['question'] # string of the question prompt
answer = row['answer'] # string of the full solution and the answer after #### marker
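# Quick illustration of the "#### answer" convention mentioned above (example string made up):
import re
GSM_RE = re.compile(r"#### (\-?[0-9\.\,]+)")
assert GSM_RE.search("... so she has 3 + 4 = 7 apples.\n#### 7").group(1) == "7"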
View File
@ -5,10 +5,13 @@ It is a coding benchmark.
"""
import re
from datasets import load_dataset
from nanochat.execution import execute_code
from tasks.common import Task
def extract_imports(prompt):
"""Extract import statements from the beginning of a code block."""
imports = []
@ -21,6 +24,7 @@ def extract_imports(prompt):
break
return '\n'.join(imports)
def extract_program(completion):
"""
Extract Python code from LLM completion.
@ -44,8 +48,8 @@ def extract_program(completion):
# No code blocks found, return the whole completion
return completion.strip()
class HumanEval(Task):
class HumanEval(Task):
def __init__(self, **kwargs):
super().__init__(**kwargs)
self.ds = load_dataset("openai/openai_humaneval", split="test").shuffle(seed=42)
@ -58,7 +62,7 @@ class HumanEval(Task):
return len(self.ds)
def get_example(self, index):
""" Get a single problem from the dataset. """
"""Get a single problem from the dataset."""
row = self.ds[index]
prompt = row['prompt'] # prompts in HumanEval are the beginning of the program
solution = row['canonical_solution'] # the correct continuation of the program
@ -77,7 +81,7 @@ class HumanEval(Task):
return conversation
def evaluate(self, conversation, completion):
""" Given (conversation, completion), return boolean success of the completion. """
"""Given (conversation, completion), return boolean success of the completion."""
# the prompt will contain the imports and the function signature
imports = extract_imports(conversation['messages'][0]['content'])
# the completion will usually contain the whole function
View File
@ -4,12 +4,71 @@ https://huggingface.co/datasets/cais/mmlu
"""
from datasets import load_dataset
from tasks.common import Task, render_mc
class MMLU(Task):
class MMLU(Task):
letters = ('A', 'B', 'C', 'D')
groups = ('abstract_algebra', 'anatomy', 'astronomy', 'business_ethics', 'clinical_knowledge', 'college_biology', 'college_chemistry', 'college_computer_science', 'college_mathematics', 'college_medicine', 'college_physics', 'computer_security', 'conceptual_physics', 'econometrics', 'electrical_engineering', 'elementary_mathematics', 'formal_logic', 'global_facts', 'high_school_biology', 'high_school_chemistry', 'high_school_computer_science', 'high_school_european_history', 'high_school_geography', 'high_school_government_and_politics', 'high_school_macroeconomics', 'high_school_mathematics', 'high_school_microeconomics', 'high_school_physics', 'high_school_psychology', 'high_school_statistics', 'high_school_us_history', 'high_school_world_history', 'human_aging', 'human_sexuality', 'international_law', 'jurisprudence', 'logical_fallacies', 'machine_learning', 'management', 'marketing', 'medical_genetics', 'miscellaneous', 'moral_disputes', 'moral_scenarios', 'nutrition', 'philosophy', 'prehistory', 'professional_accounting', 'professional_law', 'professional_medicine', 'professional_psychology', 'public_relations', 'security_studies', 'sociology', 'us_foreign_policy', 'virology', 'world_religions')
groups = (
'abstract_algebra',
'anatomy',
'astronomy',
'business_ethics',
'clinical_knowledge',
'college_biology',
'college_chemistry',
'college_computer_science',
'college_mathematics',
'college_medicine',
'college_physics',
'computer_security',
'conceptual_physics',
'econometrics',
'electrical_engineering',
'elementary_mathematics',
'formal_logic',
'global_facts',
'high_school_biology',
'high_school_chemistry',
'high_school_computer_science',
'high_school_european_history',
'high_school_geography',
'high_school_government_and_politics',
'high_school_macroeconomics',
'high_school_mathematics',
'high_school_microeconomics',
'high_school_physics',
'high_school_psychology',
'high_school_statistics',
'high_school_us_history',
'high_school_world_history',
'human_aging',
'human_sexuality',
'international_law',
'jurisprudence',
'logical_fallacies',
'machine_learning',
'management',
'marketing',
'medical_genetics',
'miscellaneous',
'moral_disputes',
'moral_scenarios',
'nutrition',
'philosophy',
'prehistory',
'professional_accounting',
'professional_law',
'professional_medicine',
'professional_psychology',
'public_relations',
'security_studies',
'sociology',
'us_foreign_policy',
'virology',
'world_religions',
)
def __init__(self, subset, split, **kwargs):
super().__init__(**kwargs)
@ -41,10 +100,7 @@ class MMLU(Task):
# create and return the Conversation object
user_message = render_mc(question, self.letters, choices)
assistant_message = self.letters[answer]
messages = [
{"role": "user", "content": user_message},
{"role": "assistant", "content": assistant_message}
]
messages = [{"role": "user", "content": user_message}, {"role": "assistant", "content": assistant_message}]
conversation = {
"messages": messages,
"subject": subject, # might be useful later for grouping metrics by subject
@ -55,6 +111,8 @@ class MMLU(Task):
def evaluate(self, conversation, assistant_response):
# the assert here is not strictly speaking needed, but currently the way we eval, we expect this to be true
# I'm going to leave the assert here to prevent footguns, but possibly in the future can remove it.
assert assistant_response in self.letters, f"MMLU answer {assistant_response} is expected to be one of {self.letters}"
assert assistant_response in self.letters, (
f"MMLU answer {assistant_response} is expected to be one of {self.letters}"
)
assistant_message = conversation['messages'][-1]['content'] # e.g. "A"
return assistant_response == assistant_message
View File
@ -5,10 +5,12 @@ We use the "smol" version, which is more appropriate for smaller models.
"""
from datasets import load_dataset
from tasks.common import Task
class SmolTalk(Task):
""" smol-smoltalk dataset. train is 460K rows, test is 24K rows. """
"""smol-smoltalk dataset. train is 460K rows, test is 24K rows."""
def __init__(self, split, **kwargs):
super().__init__(**kwargs)
@ -36,7 +38,9 @@ class SmolTalk(Task):
for i, message in enumerate(rest_messages):
# user and assistant alternate as user,assistant,user,assistant,...
expected_role = "user" if i % 2 == 0 else "assistant"
assert message["role"] == expected_role, f"Message {i} has role {message['role']} but should be {expected_role}"
assert message["role"] == expected_role, (
f"Message {i} has role {message['role']} but should be {expected_role}"
)
assert isinstance(message["content"], str), "Content must be a string"
# ---------------------------------------------------------------------
# create and return the Conversation object (ok to emit the system message too)
View File
@ -26,10 +26,11 @@ To preview a few example conversations, run:
python -m tasks.spellingbee
"""
import re
import random
from tasks.common import Task
import re
from nanochat.common import download_file_with_lock
from tasks.common import Task
# Letters of the alphabet
LETTERS = "abcdefghijklmnopqrstuvwxyz"
@ -38,6 +39,8 @@ WORD_LIST_URL = "https://raw.githubusercontent.com/dwyl/english-words/refs/heads
# Identical to gsm8k's answer extraction
ANSWER_RE = re.compile(r"#### (\-?[0-9\.\,]+)")
def extract_answer(completion):
"""
Extract the numerical answer after #### marker.
@ -49,6 +52,7 @@ def extract_answer(completion):
return match_str
return None
# User message templates for data augmentation
USER_MSG_TEMPLATES = [
"How many {letter} are in the word {word}",
@ -110,8 +114,8 @@ USER_MSG_TEMPLATES = [
"{word}{letter}が何回出てくる",
]
class SpellingBee(Task):
class SpellingBee(Task):
def __init__(self, size=1000, split="train", **kwargs):
super().__init__(**kwargs)
assert split in ["train", "test"], "SpellingBee split must be train|test"
@ -119,7 +123,7 @@ class SpellingBee(Task):
self.split = split
filename = WORD_LIST_URL.split("/")[-1]
word_list_path = download_file_with_lock(WORD_LIST_URL, filename)
with open(word_list_path, 'r', encoding='utf-8') as f:
with open(word_list_path, encoding='utf-8') as f:
words = [line.strip() for line in f]
self.words = words
@ -190,13 +194,12 @@ Then count the occurrences of '{letter}':
# Part 4: Python output
assistant_parts.append({"type": "python_output", "text": str(count)})
# Part 5: Final answer
assistant_parts.append({"type": "text", "text": f"\n\nPython gives us {count}.\n\nMy final answer is:\n\n#### {count}"})
assistant_parts.append(
{"type": "text", "text": f"\n\nPython gives us {count}.\n\nMy final answer is:\n\n#### {count}"}
)
# return the full conversation
messages = [
{"role": "user", "content": user_msg},
{"role": "assistant", "content": assistant_parts}
]
messages = [{"role": "user", "content": user_msg}, {"role": "assistant", "content": assistant_parts}]
conversation = {
"messages": messages,
}
@ -222,7 +225,7 @@ Then count the occurrences of '{letter}':
return is_correct
def reward(self, conversation, assistant_response):
""" Use simple 0-1 reward just like gsm8k."""
"""Use simple 0-1 reward just like gsm8k."""
is_correct = self.evaluate(conversation, assistant_response)
is_correct_float = float(is_correct)
return is_correct_float
@ -238,7 +241,7 @@ class SimpleSpelling(Task):
self.split = split
filename = WORD_LIST_URL.split("/")[-1]
word_list_path = download_file_with_lock(WORD_LIST_URL, filename)
with open(word_list_path, 'r', encoding='utf-8') as f:
with open(word_list_path, encoding='utf-8') as f:
words = [line.strip() for line in f]
rng = random.Random(42)
rng.shuffle(words) # use a different word order than the SpellingBee task
@ -260,7 +263,7 @@ class SimpleSpelling(Task):
# return the full conversation
messages = [
{"role": "user", "content": f"Spell the word: {word}"},
{"role": "assistant", "content": f"{word}:{word_letters}"}
{"role": "assistant", "content": f"{word}:{word_letters}"},
]
conversation = {
"messages": messages,
@ -269,7 +272,6 @@ class SimpleSpelling(Task):
if __name__ == "__main__":
# preview the SpellingBee task, first 10 examples
task = SpellingBee()
for i in range(10):
View File
@ -5,8 +5,10 @@ python -m pytest tests/test_engine.py -v
"""
import torch
from nanochat.engine import KVCache
def test_kv_cache_resize():
"""
The KV cache was not resized correctly, more information here:
@ -21,11 +23,7 @@ def test_kv_cache_resize():
num_layers = 6
kv_cache = KVCache(
batch_size=batch_size,
num_heads=num_heads,
seq_len=seq_len,
head_dim=head_dim,
num_layers=num_layers
batch_size=batch_size, num_heads=num_heads, seq_len=seq_len, head_dim=head_dim, num_layers=num_layers
)
# Insert a single token with a distinct fill value to all layers
@ -47,7 +45,9 @@ def test_kv_cache_resize():
insert_token(4)
# Verify that the cache actually resized
new_seq_len = kv_cache.kv_cache.shape[4]
assert new_seq_len > original_seq_len, f"Cache did not resize: original seq_len={original_seq_len}, new seq_len={new_seq_len}"
assert new_seq_len > original_seq_len, (
f"Cache did not resize: original seq_len={original_seq_len}, new seq_len={new_seq_len}"
)
# Verify that the original 4 tokens are still intact after resize
for layer_idx in range(num_layers):
@ -57,8 +57,12 @@ def test_kv_cache_resize():
expected_v = float(token_idx * 100)
actual_k = kv_cache.kv_cache[layer_idx, 0, :, :, token_idx, :]
actual_v = kv_cache.kv_cache[layer_idx, 1, :, :, token_idx, :]
assert (actual_k == expected_k).all(), f"Layer {layer_idx}, token {token_idx}: key corrupted, expected {expected_k}"
assert (actual_v == expected_v).all(), f"Layer {layer_idx}, token {token_idx}: value corrupted, expected {expected_v}"
assert (actual_k == expected_k).all(), (
f"Layer {layer_idx}, token {token_idx}: key corrupted, expected {expected_k}"
)
assert (actual_v == expected_v).all(), (
f"Layer {layer_idx}, token {token_idx}: value corrupted, expected {expected_v}"
)
# And that the original cache matches resized cache
original_k = original_cache[layer_idx, 0, :, :, token_idx, :]
original_v = original_cache[layer_idx, 1, :, :, token_idx, :]
View File
@ -18,18 +18,23 @@ python -m pytest tests/test_rustbpe.py -v -s
-v is verbose, -s is show prints
"""
import regex as re
from collections import Counter, defaultdict
import time
import rustbpe
import tiktoken
import pytest
from collections import Counter, defaultdict
GPT4_SPLIT_PATTERN = r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""
import pytest
import regex as re
import tiktoken
import rustbpe
GPT4_SPLIT_PATTERN = (
r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""
)
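# illustrative split (not part of the diff), using the `regex` module imported as `re` above:
#   re.findall(GPT4_SPLIT_PATTERN, "Hello world!") -> ['Hello', ' world', '!']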
# -----------------------------------------------------------------------------
# Reference tokenizer, pretty much copy pasted and pruned a bit from minbpe
def get_stats(ids, counts=None):
"""
Given a list of integers, return a dictionary of counts of consecutive pairs
@ -41,6 +46,7 @@ def get_stats(ids, counts=None):
counts[pair] = counts.get(pair, 0) + 1
return counts
def merge(ids, pair, idx):
"""
In the list of integers (ids), replace all consecutive occurrences
@ -51,7 +57,7 @@ def merge(ids, pair, idx):
i = 0
while i < len(ids):
# if not at the very last position AND the pair matches, replace it
if ids[i] == pair[0] and i < len(ids) - 1 and ids[i+1] == pair[1]:
if ids[i] == pair[0] and i < len(ids) - 1 and ids[i + 1] == pair[1]:
newids.append(idx)
i += 2
else:
@ -59,8 +65,8 @@ def merge(ids, pair, idx):
i += 1
return newids
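# illustrative behavior (not part of the diff): merge([1, 2, 3, 1, 2], (1, 2), 4) -> [4, 3, 4]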
class RegexTokenizer:
class RegexTokenizer:
def __init__(self, pattern=None):
"""
- pattern: optional string to override the default (GPT-4 split pattern)
@ -125,7 +131,7 @@ class RegexTokenizer:
vocab[idx] = vocab[pair[0]] + vocab[pair[1]]
# prints
if verbose:
print(f"merge {i+1}/{num_merges}: {pair} -> {idx} ({vocab[idx]}) had {stats[pair]} occurrences")
print(f"merge {i + 1}/{num_merges}: {pair} -> {idx} ({vocab[idx]}) had {stats[pair]} occurrences")
# save class variables
self.merges = merges # used in encode()
@ -163,9 +169,11 @@ class RegexTokenizer:
ids.extend(chunk_ids)
return ids
# -----------------------------------------------------------------------------
# Faster Python tokenizer, optimized version of the reference tokenizer
def fast_merge_inplace(ids, pair, idx):
"""
In the list of integers (ids), replace all consecutive occurrences
@ -175,16 +183,15 @@ def fast_merge_inplace(ids, pair, idx):
# Find all positions where the pair occurs
i = 0
while i < len(ids) - 1:
if ids[i] == pair[0] and ids[i+1] == pair[1]:
if ids[i] == pair[0] and ids[i + 1] == pair[1]:
ids[i] = idx
ids.pop(i+1)
ids.pop(i + 1)
else:
i += 1
return ids
class FastRegexTokenizer:
def __init__(self, pattern=None):
"""
- pattern: optional string to override the default (GPT-4 split pattern)
@ -262,31 +269,31 @@ class FastRegexTokenizer:
chunk_count = chunk_counts[chunk_idx]
ix = 0
while ix < len(chunk_ids) - 1:
if chunk_ids[ix] == pair[0] and chunk_ids[ix+1] == pair[1]:
if chunk_ids[ix] == pair[0] and chunk_ids[ix + 1] == pair[1]:
# Track what pairs are being removed/added
# Remove: (prev, A), (A, B), (B, next)
if ix > 0:
old_left = (chunk_ids[ix-1], chunk_ids[ix])
old_left = (chunk_ids[ix - 1], chunk_ids[ix])
count_changes[old_left] -= chunk_count
# The merged pair disappears
count_changes[pair] -= chunk_count
if ix + 2 < len(chunk_ids):
old_right = (chunk_ids[ix+1], chunk_ids[ix+2])
old_right = (chunk_ids[ix + 1], chunk_ids[ix + 2])
count_changes[old_right] -= chunk_count
# Apply the merge
chunk_ids[ix] = idx
chunk_ids.pop(ix+1)
chunk_ids.pop(ix + 1)
# Add: (prev, C), (C, next)
if ix > 0:
new_left = (chunk_ids[ix-1], chunk_ids[ix])
new_left = (chunk_ids[ix - 1], chunk_ids[ix])
count_changes[new_left] += chunk_count
if ix + 1 < len(chunk_ids):
new_right = (chunk_ids[ix], chunk_ids[ix+1])
new_right = (chunk_ids[ix], chunk_ids[ix + 1])
count_changes[new_right] += chunk_count
else:
ix += 1
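# illustrative trace (not part of the diff): merging pair (A, B) -> C inside [x, A, B, y]
# decrements counts for (x, A), (A, B), (B, y) and increments counts for (x, C), (C, y)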
@ -302,8 +309,9 @@ class FastRegexTokenizer:
# Update positions for changed pairs - only check affected chunks
for chunk_idx in affected_chunks:
chunk_ids = ids[chunk_idx]
contains_pair = any((chunk_ids[j], chunk_ids[j+1]) == changed_pair
for j in range(len(chunk_ids) - 1))
contains_pair = any(
(chunk_ids[j], chunk_ids[j + 1]) == changed_pair for j in range(len(chunk_ids) - 1)
)
if contains_pair:
positions[changed_pair].add(chunk_idx)
else:
@ -372,13 +380,15 @@ class FastRegexTokenizer:
ids.extend(chunk_ids)
return ids
# -----------------------------------------------------------------------------
# HuggingFace tokenizer
from tokenizers import Regex, decoders, pre_tokenizers
from tokenizers import Tokenizer as HFTokenizer
from tokenizers import pre_tokenizers, decoders, Regex
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
class HuggingFaceTokenizer:
"""Light wrapper around HuggingFace Tokenizer for some utilities"""
@ -389,19 +399,23 @@ class HuggingFaceTokenizer:
def train_from_iterator(cls, text_iterator, vocab_size):
# train from an iterator of text
# Configure the HuggingFace Tokenizer
tokenizer = HFTokenizer(BPE(
tokenizer = HFTokenizer(
BPE(
byte_fallback=True, # needed!
unk_token=None,
fuse_unk=False,
))
)
)
# Normalizer: None
tokenizer.normalizer = None
# Pre-tokenizer: GPT-4 style
gpt4_split_regex = Regex(GPT4_SPLIT_PATTERN) # huggingface demands that you wrap it in Regex!!
tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
tokenizer.pre_tokenizer = pre_tokenizers.Sequence(
[
pre_tokenizers.Split(pattern=gpt4_split_regex, behavior="isolated", invert=False),
pre_tokenizers.ByteLevel(add_prefix_space=False, use_regex=False)
])
pre_tokenizers.ByteLevel(add_prefix_space=False, use_regex=False),
]
)
# Decoder: ByteLevel (it pairs together with the ByteLevel pre-tokenizer)
tokenizer.decoder = decoders.ByteLevel()
# Post-processor: None
@ -422,15 +436,19 @@ class HuggingFaceTokenizer:
ids = self.tokenizer.encode(text, add_special_tokens=False).ids
return ids
# -----------------------------------------------------------------------------
# Test all of the above
@pytest.fixture(scope="module")
def enwik8_path():
"""Fixture to download and cache enwik8 dataset."""
import os
import zipfile
from nanochat.common import get_base_dir
base_dir = get_base_dir()
# download and unzip enwik8 to .cache directory
enwik8_url = "https://mattmahoney.net/dc/enwik8.zip"
@ -439,6 +457,7 @@ def enwik8_path():
if not os.path.exists(enwik8_local_path):
print(f"Downloading enwik8 to {enwik8_local_path_zip}")
import requests
response = requests.get(enwik8_url)
with open(enwik8_local_path_zip, "wb") as f:
f.write(response.content)
@ -455,15 +474,17 @@ def enwik8_path():
@pytest.fixture(scope="module")
def enwik8_small(enwik8_path):
"""Fixture providing 100KB of enwik8 for quick tests."""
with open(enwik8_path, "r", encoding="utf-8") as f:
with open(enwik8_path, encoding="utf-8") as f:
return f.read(100_000)
@pytest.fixture(scope="module")
def enwik8_large(enwik8_path):
"""Fixture providing 10MB of enwik8 for performance tests."""
with open(enwik8_path, "r", encoding="utf-8") as f:
with open(enwik8_path, encoding="utf-8") as f:
return f.read(10**7)
def time_function(func, *args, **kwargs):
"""Time a function call and return the result and elapsed time"""
start_time = time.time()
@ -472,6 +493,7 @@ def time_function(func, *args, **kwargs):
elapsed = end_time - start_time
return result, elapsed
def test_correctness(enwik8_small):
"""Test that all tokenizer implementations produce the same results."""
text = enwik8_small
@ -482,7 +504,9 @@ def test_correctness(enwik8_small):
print("\nTraining slow reference...")
slow_reference_tokenizer = RegexTokenizer()
ambiguous_flag, slow_reference_train_time = time_function(slow_reference_tokenizer.train, text, vocab_size)
slow_reference_ids, slow_reference_encode_time = time_function(slow_reference_tokenizer.encode_ordinary, encode_text)
slow_reference_ids, slow_reference_encode_time = time_function(
slow_reference_tokenizer.encode_ordinary, encode_text
)
print(f"Slow reference train time: {slow_reference_train_time:.4f}s")
print(f"Slow reference encode time: {slow_reference_encode_time:.4f}s")
print(slow_reference_ids[:20])
@ -497,7 +521,9 @@ def test_correctness(enwik8_small):
print("\nTraining fast reference...")
fast_reference_tokenizer = FastRegexTokenizer()
_, fast_reference_train_time = time_function(fast_reference_tokenizer.train, text, vocab_size)
fast_reference_ids, fast_reference_encode_time = time_function(fast_reference_tokenizer.encode_ordinary, encode_text)
fast_reference_ids, fast_reference_encode_time = time_function(
fast_reference_tokenizer.encode_ordinary, encode_text
)
print(f"Fast reference train time: {fast_reference_train_time:.4f}s")
print(f"Fast reference encode time: {fast_reference_encode_time:.4f}s")
print(fast_reference_ids[:20])
@ -589,14 +615,16 @@ def test_training_performance(enwik8_large):
assert hf_train_time > 0, "Training should take some time"
# Print comparison
print(f"\n📊 Performance comparison:")
print("\n📊 Performance comparison:")
print(f" RustBPE: {rustbpe_train_time:.4f}s")
print(f" HuggingFace: {hf_train_time:.4f}s")
print(f" Speedup: {hf_train_time/rustbpe_train_time:.2f}x")
print(f" Speedup: {hf_train_time / rustbpe_train_time:.2f}x")
def test_interface(enwik8_small):
"""Test the RustBPETokenizer interface for training, encoding, decoding, and serialization."""
import tempfile
from nanochat.tokenizer import RustBPETokenizer
# Simple train test
69
uv.lock
View File
@ -188,6 +188,15 @@ wheels = [
{ url = "https://files.pythonhosted.org/packages/e5/48/1549795ba7742c948d2ad169c1c8cdbae65bc450d6cd753d124b17c8cd32/certifi-2025.8.3-py3-none-any.whl", hash = "sha256:f6c12493cfb1b06ba2ff328595af9350c65d6644968e5d3a2ffd78699af217a5", size = 161216, upload-time = "2025-08-03T03:07:45.777Z" },
]
[[package]]
name = "cfgv"
version = "3.5.0"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/4e/b5/721b8799b04bf9afe054a3899c6cf4e880fcf8563cc71c15610242490a0c/cfgv-3.5.0.tar.gz", hash = "sha256:d5b1034354820651caa73ede66a6294d6e95c1b00acc5e9b098e917404669132", size = 7334, upload-time = "2025-11-19T20:55:51.612Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/db/3c/33bac158f8ab7f89b2e59426d5fe2e4f63f7ed25df84c036890172b412b5/cfgv-3.5.0-py2.py3-none-any.whl", hash = "sha256:a8dc6b26ad22ff227d2634a65cb388215ce6cc96bbcc5cfde7641ae87e8dacc0", size = 7445, upload-time = "2025-11-19T20:55:50.744Z" },
]
[[package]]
name = "charset-normalizer"
version = "3.4.3"
@ -306,6 +315,15 @@ wheels = [
{ url = "https://files.pythonhosted.org/packages/c9/7a/cef76fd8438a42f96db64ddaa85280485a9c395e7df3db8158cfec1eee34/dill-0.3.8-py3-none-any.whl", hash = "sha256:c36ca9ffb54365bdd2f8eb3eff7d2a21237f8452b57ace88b1ac615b7e815bd7", size = 116252, upload-time = "2024-01-27T23:42:14.239Z" },
]
[[package]]
name = "distlib"
version = "0.4.0"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/96/8e/709914eb2b5749865801041647dc7f4e6d00b549cfe88b65ca192995f07c/distlib-0.4.0.tar.gz", hash = "sha256:feec40075be03a04501a973d81f633735b4b69f98b05450592310c0f401a4e0d", size = 614605, upload-time = "2025-07-17T16:52:00.465Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/33/6b/e0547afaf41bf2c42e52430072fa5658766e3d65bd4b03a563d1b6336f57/distlib-0.4.0-py2.py3-none-any.whl", hash = "sha256:9659f7d87e46584a30b5780e43ac7a2143098441670ff0a49d5f9034c54a6c16", size = 469047, upload-time = "2025-07-17T16:51:58.613Z" },
]
[[package]]
name = "exceptiongroup"
version = "1.3.0"
@ -528,6 +546,15 @@ wheels = [
{ url = "https://files.pythonhosted.org/packages/39/7b/bb06b061991107cd8783f300adff3e7b7f284e330fd82f507f2a1417b11d/huggingface_hub-0.34.4-py3-none-any.whl", hash = "sha256:9b365d781739c93ff90c359844221beef048403f1bc1f1c123c191257c3c890a", size = 561452, upload-time = "2025-08-08T09:14:50.159Z" },
]
[[package]]
name = "identify"
version = "2.6.15"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/ff/e7/685de97986c916a6d93b3876139e00eef26ad5bbbd61925d670ae8013449/identify-2.6.15.tar.gz", hash = "sha256:e4f4864b96c6557ef2a1e1c951771838f4edc9df3a72ec7118b338801b11c7bf", size = 99311, upload-time = "2025-10-02T17:43:40.631Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/0f/1c/e5fd8f973d4f375adb21565739498e2e9a1e54c858a97b9a8ccfdc81da9b/identify-2.6.15-py2.py3-none-any.whl", hash = "sha256:1181ef7608e00704db228516541eb83a88a9f94433a8c80bb9b5bd54b1d81757", size = 99183, upload-time = "2025-10-02T17:43:39.137Z" },
]
[[package]]
name = "idna"
version = "3.10"
@ -802,6 +829,7 @@ gpu = [
[package.dev-dependencies]
dev = [
{ name = "maturin" },
{ name = "pre-commit" },
{ name = "pytest" },
]
@ -826,6 +854,7 @@ provides-extras = ["cpu", "gpu"]
[package.metadata.requires-dev]
dev = [
{ name = "maturin", specifier = ">=1.9.4" },
{ name = "pre-commit", specifier = ">=3.8.0" },
{ name = "pytest", specifier = ">=8.0.0" },
]
@ -872,6 +901,15 @@ wheels = [
{ url = "https://files.pythonhosted.org/packages/eb/8d/776adee7bbf76365fdd7f2552710282c79a4ead5d2a46408c9043a2b70ba/networkx-3.5-py3-none-any.whl", hash = "sha256:0030d386a9a06dee3565298b4a734b68589749a544acbb6c412dc9e2489ec6ec", size = 2034406, upload-time = "2025-05-29T11:35:04.961Z" },
]
[[package]]
name = "nodeenv"
version = "1.9.1"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/43/16/fc88b08840de0e0a72a2f9d8c6bae36be573e475a6326ae854bcc549fc45/nodeenv-1.9.1.tar.gz", hash = "sha256:6ec12890a2dab7946721edbfbcd91f3319c6ccc9aec47be7c7e6b7011ee6645f", size = 47437, upload-time = "2024-06-04T18:44:11.171Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/d2/1d/1b658dbd2b9fa9c4c9f32accbfc0205d532c8c6194dc0f2a4c0428e7128a/nodeenv-1.9.1-py2.py3-none-any.whl", hash = "sha256:ba11c9782d29c27c70ffbdda2d7415098754709be8a7056d79a737cd901155c9", size = 22314, upload-time = "2024-06-04T18:44:08.352Z" },
]
[[package]]
name = "numpy"
version = "1.26.4"
@ -1131,6 +1169,22 @@ wheels = [
{ url = "https://files.pythonhosted.org/packages/54/20/4d324d65cc6d9205fabedc306948156824eb9f0ee1633355a8f7ec5c66bf/pluggy-1.6.0-py3-none-any.whl", hash = "sha256:e920276dd6813095e9377c0bc5566d94c932c33b27a3e3945d8389c374dd4746", size = 20538, upload-time = "2025-05-15T12:30:06.134Z" },
]
[[package]]
name = "pre-commit"
version = "4.5.0"
source = { registry = "https://pypi.org/simple" }
dependencies = [
{ name = "cfgv" },
{ name = "identify" },
{ name = "nodeenv" },
{ name = "pyyaml" },
{ name = "virtualenv" },
]
sdist = { url = "https://files.pythonhosted.org/packages/f4/9b/6a4ffb4ed980519da959e1cf3122fc6cb41211daa58dbae1c73c0e519a37/pre_commit-4.5.0.tar.gz", hash = "sha256:dc5a065e932b19fc1d4c653c6939068fe54325af8e741e74e88db4d28a4dd66b", size = 198428, upload-time = "2025-11-22T21:02:42.304Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/5d/c4/b2d28e9d2edf4f1713eb3c29307f1a63f3d67cf09bdda29715a36a68921a/pre_commit-4.5.0-py2.py3-none-any.whl", hash = "sha256:25e2ce09595174d9c97860a95609f9f852c0614ba602de3561e267547f2335e1", size = 226429, upload-time = "2025-11-22T21:02:40.836Z" },
]
[[package]]
name = "propcache"
version = "0.3.2"
@ -2016,6 +2070,21 @@ wheels = [
{ url = "https://files.pythonhosted.org/packages/96/06/5cc0542b47c0338c1cb676b348e24a1c29acabc81000bced518231dded6f/uvicorn-0.36.0-py3-none-any.whl", hash = "sha256:6bb4ba67f16024883af8adf13aba3a9919e415358604ce46780d3f9bdc36d731", size = 67675, upload-time = "2025-09-20T01:07:12.984Z" },
]
[[package]]
name = "virtualenv"
version = "20.35.4"
source = { registry = "https://pypi.org/simple" }
dependencies = [
{ name = "distlib" },
{ name = "filelock" },
{ name = "platformdirs" },
{ name = "typing-extensions", marker = "python_full_version < '3.11' or (extra == 'extra-8-nanochat-cpu' and extra == 'extra-8-nanochat-gpu')" },
]
sdist = { url = "https://files.pythonhosted.org/packages/20/28/e6f1a6f655d620846bd9df527390ecc26b3805a0c5989048c210e22c5ca9/virtualenv-20.35.4.tar.gz", hash = "sha256:643d3914d73d3eeb0c552cbb12d7e82adf0e504dbf86a3182f8771a153a1971c", size = 6028799, upload-time = "2025-10-29T06:57:40.511Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/79/0c/c05523fa3181fdf0c9c52a6ba91a23fbf3246cc095f26f6516f9c60e6771/virtualenv-20.35.4-py3-none-any.whl", hash = "sha256:c21c9cede36c9753eeade68ba7d523529f228a403463376cf821eaae2b650f1b", size = 6005095, upload-time = "2025-10-29T06:57:37.598Z" },
]
[[package]]
name = "wandb"
version = "0.21.3"