From d6a169b3290cd60a7bd84a254d98d96676180dc8 Mon Sep 17 00:00:00 2001
From: Codex <codex@openai.com>
Date: Thu, 7 May 2026 12:15:53 +0000
Subject: [PATCH] Remove dev PR notes

---
 dev/bigram_minimal_pr_changes.md | 219 -------------------------------
 dev/bigram_speedrun_results.md   |  83 ------------
 2 files changed, 302 deletions(-)
 delete mode 100644 dev/bigram_minimal_pr_changes.md
 delete mode 100644 dev/bigram_speedrun_results.md

diff --git a/dev/bigram_minimal_pr_changes.md b/dev/bigram_minimal_pr_changes.md
deleted file mode 100644
index 59c3a2d1..00000000
--- a/dev/bigram_minimal_pr_changes.md
+++ /dev/null
@@ -1,219 +0,0 @@
-# Minimal Bigram Speedrun PR Changes
-
-This branch is based on upstream nanochat master at `dc54a1a`. The goal is to
-keep the submission patch limited to the changes needed to reproduce the
-best-performing speedrun recipe. These are the `scripts/base_train.py` defaults:
-
-```bash
---fp8
---depth=22
---num-iterations=11600
---total-batch-size=524288
---bigram-embed-factor=5
---muon-plus
---muon-eq=row
---scalar-lr=0.3
---train-log-every=50
---compile-mode=max-autotune-no-cudagraphs
---eval-every=250
---core-metric-every=5800
-```
-
-It does not include the experimental branches that were tested and rejected:
-sparse architecture changes, MoE/TOP auxiliary losses, train-time logit-bias
-losses, post-hoc calibration, NorMuon variants, checkpoint merging, or d22/d24
-run-management scripts.
-
-## `nanochat/gpt.py`
-
-### Hashed Bigram Residual Embedding
-
-Adds two config fields:
-
-- `bigram_embed_factor`, default `0`
-- `bigram_lambda_init`, default `0.05`
-
-When `bigram_embed_factor > 0`, the model creates a separate bigram embedding
-table with `vocab_size * bigram_embed_factor` entries. For each token position,
-the current token id and previous token id are hashed into that table. The
-resulting embedding is added as a residual input before every transformer block:
-
-```python
-x = x + bigram_lambdas[i] * x0_bigram
-```
-
-The first token in each sequence uses a sentinel bucket because it has no
-previous token. During KV-cache decoding, the previous token is read from the
-cache so generation matches the training-time bigram definition.
-
-Why this helps: it gives the model a cheap, direct representation of adjacent
-token pairs without adding attention or MLP compute. The bigram table is
-zero-initialized, so the model starts from the original network function, while
-the per-layer `bigram_lambdas` start at `0.05` to let the residual learn quickly.
-
-### Parameter Counting and FLOP Accounting
-
-The bigram embedding table and bigram lambdas are excluded from the main matmul
-FLOP/scaling parameter count. They are not transformer matrix weights, and
-including them would distort the target param/data ratio logic.
-
-### Optimizer Groups
-
-Adds dedicated optimizer groups for:
-
-- `bigram_embed`
-- `bigram_lambdas`
-
-The bigram embedding uses AdamW with a configurable multiplier relative to the
-main embedding LR. The layer lambdas use a small AdamW LR. This keeps the bigram
-residual trainable without mixing it into the Muon-managed transformer matrices.
-
-### Muon Options Plumbed Through
-
-`setup_optimizer()` accepts:
-
-- `muon_plus`
-- `muon_eq_axis`
-
-These are forwarded into the Muon parameter groups so the optimizer can apply
-the selected Muon variants to matrix weights.
-
-## `nanochat/optim.py`
-
-### Muon+ Renormalization
-
-After Newton-Schulz orthogonalization, Muon+ rescales the update by its
-Frobenius norm. This is a small post-processing step on the Muon update and was
-the strongest optimizer-side change in the experiments.
-
-Why this helps: it stabilizes update scale after orthogonalization without
-changing the model architecture or adding optimizer state.
-
-### Row/Column Equilibration
-
-Adds optional row or column norm equilibration before orthogonalization:
-
-- `muon_eq_axis=1`: row equilibration
-- `muon_eq_axis=2`: column equilibration
-- `muon_eq_axis=0`: disabled
-
-The speedrun recipe uses row equilibration. It normalizes rows toward a common
-target norm before the polar/Newton-Schulz step, then continues through the
-existing Muon update path.
-
-Why this helps: row equilibration was a small but positive companion to Muon+ in
-the winning recipe, with minimal extra code and no extra persistent optimizer
-state.
-
-## `nanochat/engine.py`
-
-### Previous Token in KV Cache
-
-Adds `prev_token` to `KVCache`, resets it with the rest of the cache, and copies
-it during prefill expansion.
-
-Why this is needed: full-sequence training can compute bigram hashes from
-`idx[:, :-1]`, but one-token decode does not have the previous token in the
-current input tensor. Keeping `prev_token` in the cache makes generation use the
-same bigram feature as training.
-
-## `scripts/base_train.py`
-
-### Bigram CLI Flags
-
-Adds:
-
-- `--bigram-embed-factor`
-- `--bigram-lambda-init`
-- `--bigram-embedding-lr-mult`
-- `--bigram-lambda-lr`
-
-These configure the bigram residual and its optimizer treatment from the
-training script. The submission default is `--bigram-embed-factor=5`.
-
-### Muon Variant Flags
-
-Adds:
-
-- `--muon-plus`
-- `--no-muon-plus`
-- `--muon-eq`
-
-These expose the optimizer variants used in the recipe. The submission defaults
-are Muon+ enabled and `--muon-eq=row`. `--no-muon-plus --muon-eq=none` restores
-the original Muon path.
-
-### Train Logging Cadence
-
-Adds `--train-log-every`. Values greater than 1 avoid converting the loss tensor
-to a Python scalar every step.
-
-Why this helps: per-step logging creates extra synchronization overhead. The
-submission default is `--train-log-every=50`, which keeps useful progress reporting
-while reducing logging overhead.
-
-### Compile Mode
-
-Adds `--compile-mode` so the speedrun can request:
-
-```bash
---compile-mode=max-autotune-no-cudagraphs
-```
-
-Why this helps: on the d16 probe, this compile mode was about 2.5% faster than
-default `torch.compile` for the candidate recipe. It is now the submission
-default.
-
-### Skip Initial Eval
-
-Adds `--skip-initial-eval` and `--initial-eval`. The submission default skips
-the step-0 validation pass; `--initial-eval` restores the original behavior.
-
-## `runs/speedrun.sh`
-
-Uses the `scripts/base_train.py` submission defaults:
-
-- FP8
-- depth `22`
-- fixed `11600` optimizer steps
-- total batch size `524288`
-- Muon+
-- row equilibration
-- bigram factor 5
-- scalar LR `0.3`
-- log every 50 training steps
-- `max-autotune-no-cudagraphs` compile mode
-- validation every 250 steps
-- one CORE metric pass halfway through at step 5800
-
-This script is the intended entry point for reproducing the submitted run.
-
-## `tests/test_engine.py`
-
-Adds coverage for preserving `prev_token` through KV-cache prefill/expansion.
-
-Why this matters: the bigram feature must behave consistently during generation.
-The test guards the cache state required for single-token decode.
-
-## `dev/bigram_speedrun_results.md`
-
-Records the validation and throughput evidence used to justify the recipe:
-
-- minimal branch sanity check against the prior candidate branch
-- full d16 comparison against upstream dense
-- controlled d16 throughput comparison
-- compile-mode probe
-- test status
-
-This is supporting documentation for the PR, not code required at runtime.
-
-## Submission Readiness
-
-Completed checks:
-
-- `python -m pytest tests/test_engine.py -q`
-- `python -m py_compile nanochat/gpt.py nanochat/optim.py scripts/base_train.py nanochat/engine.py`
-- `git diff --check`
-
-The remaining work is operational: run the final benchmark on the 8xH100 system
-from this branch and include the measured result in the submission PR.
diff --git a/dev/bigram_speedrun_results.md b/dev/bigram_speedrun_results.md
deleted file mode 100644
index 436da647..00000000
--- a/dev/bigram_speedrun_results.md
+++ /dev/null
@@ -1,83 +0,0 @@
-# Bigram Speedrun Verification Notes
-
-This branch is based on upstream nanochat master at `dc54a1a` and keeps the
-submission implementation focused on the winning recipe:
-
-- per-layer hashed bigram residual embeddings
-- Muon+ post-orthogonalization normalization
-- row equilibration before Muon orthogonalization
-- lower scalar LR (`--scalar-lr=0.3`)
-- batched training logging (`--train-log-every=50`)
-- `torch.compile(..., mode="max-autotune-no-cudagraphs")` for the speedrun script
-
-It intentionally excludes the experimental branches that were not part of the
-final candidate: sparse layers, MoE/TOP losses, train-time logit-bias losses,
-post-hoc fitting, NorMuon, and checkpoint merging.
-
-## Reproduction Sanity Check
-
-Minimal branch d4/20 matched the prior experimental branch:
-
-| Run | Step 0 BPB | Step 10 BPB | Final BPB |
-| --- | ---: | ---: | ---: |
-| Prior candidate branch | `3.237224` | `3.234722` | `3.223259` |
-| Minimal PR branch | `3.237224` | `3.234722` | `3.223286` |
-
-The final difference is `0.000027` BPB on a tiny run, consistent with small
-compile/graph differences after removing unused experimental code.
-
-## Full d16 Verification
-
-Both runs used d16, FP8, target param/data ratio 8, total batch `524288`, and
-device batch `32` on the same machine.
-
-| Run | Final BPB | Train time | Avg logged tok/s, excluding first | Avg logged step time, excluding first |
-| --- | ---: | ---: | ---: | ---: |
-| Upstream master dense | `0.800673` | `94.64m` | `329,904` | `1589.232ms` |
-| Bigram/Muon+ candidate | `0.798000` | `93.61m` | `333,507` | `1572.058ms` |
-
-Candidate delta versus upstream master dense:
-
-- BPB: `-0.002673`
-- train time: `-1.03m` (`1.09%` faster)
-- logged throughput: `+3,603 tok/s` (`1.09%` higher)
-
-Important caveat: this is a full recipe comparison, not an architecture-only
-comparison. The candidate also uses `--train-log-every=50` and
-`--compile-mode=max-autotune-no-cudagraphs`, while upstream master logs every
-step and uses the default compile mode.
-
-## Controlled d16 Throughput
-
-A dense control run with the same log50/compile-control style is the better
-way to estimate the per-step overhead of the bigram path.
-
-| Run | Final BPB | Train time | Avg logged tok/s, excluding first | Avg logged step time, excluding first |
-| --- | ---: | ---: | ---: | ---: |
-| Dense log50 compile control | `0.800604` | `92.85m` | `336,247` | `1559.258ms` |
-| Bigram/Muon+ candidate, full 3584 | `0.798000` | `93.61m` | `333,507` | `1572.058ms` |
-
-Against this controlled dense run, the bigram candidate is about `0.81%` slower
-per step, but `0.002604` BPB better at the same horizon.
-
-A shortened bigram run at 3400 steps landed at `0.800232` BPB in `88.92m`,
-which is `0.000372` BPB better than the dense log50 compile control while using
-about `4.23%` less training time.
-
-## Compile Mode Probe
-
-Short d16/40 throughput probes on the minimal branch:
-
-| Compile mode | Avg logged tok/s, excluding first | Avg logged step time, excluding first | Total time |
-| --- | ---: | ---: | ---: |
-| default `torch.compile` | `324,995` | `1613.250ms` | `0.78m` |
-| `max-autotune-no-cudagraphs` | `333,261` | `1573.250ms` | `0.76m` |
-
-On this d16 probe, `max-autotune-no-cudagraphs` was about `2.5%` faster than
-the default compile mode. The speedrun script keeps this compile mode for that
-reason.
-
-## Test Status
-
-- `python -m pytest tests/test_engine.py -q`: `9 passed`
-- `python -m py_compile nanochat/gpt.py nanochat/optim.py scripts/base_train.py nanochat/engine.py`: passed