From d6a169b3290cd60a7bd84a254d98d96676180dc8 Mon Sep 17 00:00:00 2001
From: Codex
Date: Thu, 7 May 2026 12:15:53 +0000
Subject: [PATCH] Remove dev PR notes

---
 dev/bigram_minimal_pr_changes.md | 219 -------------------------------
 dev/bigram_speedrun_results.md   |  83 ------------
 2 files changed, 302 deletions(-)
 delete mode 100644 dev/bigram_minimal_pr_changes.md
 delete mode 100644 dev/bigram_speedrun_results.md

diff --git a/dev/bigram_minimal_pr_changes.md b/dev/bigram_minimal_pr_changes.md
deleted file mode 100644
index 59c3a2d1..00000000
--- a/dev/bigram_minimal_pr_changes.md
+++ /dev/null
@@ -1,219 +0,0 @@
-# Minimal Bigram Speedrun PR Changes
-
-This branch is based on upstream nanochat master at `dc54a1a`. The goal is to
-keep the submission patch limited to the changes needed to reproduce the
-best-performing speedrun recipe. These are the `scripts/base_train.py` defaults:
-
-```bash
---fp8
---depth=22
---num-iterations=11600
---total-batch-size=524288
---bigram-embed-factor=5
---muon-plus
---muon-eq=row
---scalar-lr=0.3
---train-log-every=50
---compile-mode=max-autotune-no-cudagraphs
---eval-every=250
---core-metric-every=5800
-```
-
-It does not include the experimental branches that were tested and rejected:
-sparse architecture changes, MoE/TOP auxiliary losses, train-time logit-bias
-losses, post-hoc calibration, NorMuon variants, checkpoint merging, or d22/d24
-run-management scripts.
-
-## `nanochat/gpt.py`
-
-### Hashed Bigram Residual Embedding
-
-Adds two config fields:
-
-- `bigram_embed_factor`, default `0`
-- `bigram_lambda_init`, default `0.05`
-
-When `bigram_embed_factor > 0`, the model creates a separate bigram embedding
-table with `vocab_size * bigram_embed_factor` entries. For each token position,
-the current token id and previous token id are hashed into that table. The
-resulting embedding is added as a residual input before every transformer block:
-
-```python
-x = x + bigram_lambdas[i] * x0_bigram
-```
-
-The first token in each sequence uses a sentinel bucket because it has no
-previous token. During KV-cache decoding, the previous token is read from the
-cache so generation matches the training-time bigram definition.
-
-Why this helps: it gives the model a cheap, direct representation of adjacent
-token pairs without adding attention or MLP compute. The bigram table is
-zero-initialized, so the model starts from the original network function, while
-the per-layer `bigram_lambdas` start at `0.05` to let the residual learn quickly.
-
-### Parameter Counting and FLOP Accounting
-
-The bigram embedding table and bigram lambdas are excluded from the main matmul
-FLOP/scaling parameter count. They are not transformer matrix weights, and
-including them would distort the target param/data ratio logic.
-
-### Optimizer Groups
-
-Adds dedicated optimizer groups for:
-
-- `bigram_embed`
-- `bigram_lambdas`
-
-The bigram embedding uses AdamW with a configurable multiplier relative to the
-main embedding LR. The layer lambdas use a small AdamW LR. This keeps the bigram
-residual trainable without mixing it into the Muon-managed transformer matrices.
-
-### Muon Options Plumbed Through
-
-`setup_optimizer()` accepts:
-
-- `muon_plus`
-- `muon_eq_axis`
-
-These are forwarded into the Muon parameter groups so the optimizer can apply
-the selected Muon variants to matrix weights.
-
-## `nanochat/optim.py`
-
-### Muon+ Renormalization
-
-After Newton-Schulz orthogonalization, Muon+ rescales the update by its
-Frobenius norm. This is a small post-processing step on the Muon update and was
-the strongest optimizer-side change in the experiments.
-
-Why this helps: it stabilizes update scale after orthogonalization without
-changing the model architecture or adding optimizer state.
-
-### Row/Column Equilibration
-
-Adds optional row or column norm equilibration before orthogonalization:
-
-- `muon_eq_axis=1`: row equilibration
-- `muon_eq_axis=2`: column equilibration
-- `muon_eq_axis=0`: disabled
-
-The speedrun recipe uses row equilibration. It normalizes rows toward a common
-target norm before the polar/Newton-Schulz step, then continues through the
-existing Muon update path.
-
-Why this helps: row equilibration was a small but positive companion to Muon+ in
-the winning recipe, with minimal extra code and no extra persistent optimizer
-state.
-
-## `nanochat/engine.py`
-
-### Previous Token in KV Cache
-
-Adds `prev_token` to `KVCache`, resets it with the rest of the cache, and copies
-it during prefill expansion.
-
-Why this is needed: full-sequence training can compute bigram hashes from
-`idx[:, :-1]`, but one-token decode does not have the previous token in the
-current input tensor. Keeping `prev_token` in the cache makes generation use the
-same bigram feature as training.
-
-## `scripts/base_train.py`
-
-### Bigram CLI Flags
-
-Adds:
-
-- `--bigram-embed-factor`
-- `--bigram-lambda-init`
-- `--bigram-embedding-lr-mult`
-- `--bigram-lambda-lr`
-
-These configure the bigram residual and its optimizer treatment from the
-training script. The submission default is `--bigram-embed-factor=5`.
-
-### Muon Variant Flags
-
-Adds:
-
-- `--muon-plus`
-- `--no-muon-plus`
-- `--muon-eq`
-
-These expose the optimizer variants used in the recipe. The submission defaults
-are Muon+ enabled and `--muon-eq=row`. `--no-muon-plus --muon-eq=none` restores
-the original Muon path.
-
-### Train Logging Cadence
-
-Adds `--train-log-every`. Values greater than 1 avoid converting the loss tensor
-to a Python scalar every step.
-
-Why this helps: per-step logging creates extra synchronization overhead. The
-submission default is `--train-log-every=50`, which keeps useful progress reporting
-while reducing logging overhead.
-
-### Compile Mode
-
-Adds `--compile-mode` so the speedrun can request:
-
-```bash
---compile-mode=max-autotune-no-cudagraphs
-```
-
-Why this helps: on the d16 probe, this compile mode was about 2.5% faster than
-default `torch.compile` for the candidate recipe. It is now the submission
-default.
-
-### Skip Initial Eval
-
-Adds `--skip-initial-eval` and `--initial-eval`. The submission default skips
-the step-0 validation pass; `--initial-eval` restores the original behavior.
-
-## `runs/speedrun.sh`
-
-Uses the `scripts/base_train.py` submission defaults:
-
-- FP8
-- depth `22`
-- fixed `11600` optimizer steps
-- total batch size `524288`
-- Muon+
-- row equilibration
-- bigram factor 5
-- scalar LR `0.3`
-- log every 50 training steps
-- `max-autotune-no-cudagraphs` compile mode
-- validation every 250 steps
-- one CORE metric pass halfway through at step 5800
-
-This script is the intended entry point for reproducing the submitted run.
-
-## `tests/test_engine.py`
-
-Adds coverage for preserving `prev_token` through KV-cache prefill/expansion.
-
-Why this matters: the bigram feature must behave consistently during generation.
-The test guards the cache state required for single-token decode.
-
-## `dev/bigram_speedrun_results.md`
-
-Records the validation and throughput evidence used to justify the recipe:
-
-- minimal branch sanity check against the prior candidate branch
-- full d16 comparison against upstream dense
-- controlled d16 throughput comparison
-- compile-mode probe
-- test status
-
-This is supporting documentation for the PR, not code required at runtime.
-
-## Submission Readiness
-
-Completed checks:
-
-- `python -m pytest tests/test_engine.py -q`
-- `python -m py_compile nanochat/gpt.py nanochat/optim.py scripts/base_train.py nanochat/engine.py`
-- `git diff --check`
-
-The remaining work is operational: run the final benchmark on the 8xH100 system
-from this branch and include the measured result in the submission PR.
diff --git a/dev/bigram_speedrun_results.md b/dev/bigram_speedrun_results.md
deleted file mode 100644
index 436da647..00000000
--- a/dev/bigram_speedrun_results.md
+++ /dev/null
@@ -1,83 +0,0 @@
-# Bigram Speedrun Verification Notes
-
-This branch is based on upstream nanochat master at `dc54a1a` and keeps the
-submission implementation focused on the winning recipe:
-
-- per-layer hashed bigram residual embeddings
-- Muon+ post-orthogonalization normalization
-- row equilibration before Muon orthogonalization
-- lower scalar LR (`--scalar-lr=0.3`)
-- batched training logging (`--train-log-every=50`)
-- `torch.compile(..., mode="max-autotune-no-cudagraphs")` for the speedrun script
-
-It intentionally excludes the experimental branches that were not part of the
-final candidate: sparse layers, MoE/TOP losses, train-time logit bias losses,
-post-hoc fitting, NorMuon, and checkpoint merging.
-
-## Reproduction Sanity Check
-
-Minimal branch d4/20 matched the prior experimental branch:
-
-| Run | Step 0 BPB | Step 10 BPB | Final BPB |
-| --- | ---: | ---: | ---: |
-| Prior candidate branch | `3.237224` | `3.234722` | `3.223259` |
-| Minimal PR branch | `3.237224` | `3.234722` | `3.223286` |
-
-The final difference is `0.000027` BPB on a tiny run, consistent with small
-compile/graph differences after removing unused experimental code.
-
-## Full d16 Verification
-
-Both runs used d16, FP8, target param/data ratio 8, total batch `524288`, and
-device batch `32` on the same machine.
-
-| Run | Final BPB | Train time | Avg logged tok/s, excluding first | Avg logged step time, excluding first |
-| --- | ---: | ---: | ---: | ---: |
-| Upstream master dense | `0.800673` | `94.64m` | `329,904` | `1589.232ms` |
-| Bigram/Muon+ candidate | `0.798000` | `93.61m` | `333,507` | `1572.058ms` |
-
-Candidate delta versus upstream master dense:
-
-- BPB: `-0.002673`
-- train time: `-1.03m` (`1.09%` faster)
-- logged throughput: `+3,603 tok/s` (`1.09%` higher)
-
-Important caveat: this is a full recipe comparison, not an architecture-only
-comparison. The candidate also uses `--train-log-every=50` and
-`--compile-mode=max-autotune-no-cudagraphs`, while upstream master logs every
-step and uses the default compile mode.
-
-## Controlled d16 Throughput
-
-A denser control run with the same log50/compile-control style is the better
-way to estimate the per-step overhead of the bigram path.
-
-| Run | Final BPB | Train time | Avg logged tok/s, excluding first | Avg logged step time, excluding first |
-| --- | ---: | ---: | ---: | ---: |
-| Dense log50 compile control | `0.800604` | `92.85m` | `336,247` | `1559.258ms` |
-| Bigram/Muon+ candidate, full 3584 | `0.798000` | `93.61m` | `333,507` | `1572.058ms` |
-
-Against this controlled dense run, the bigram candidate is about `0.81%` slower
-per step, but `0.002604` BPB better at the same horizon.
-
-A shortened bigram run at 3400 steps landed at `0.800232` BPB in `88.92m`,
-which is `0.000372` BPB better than the dense log50 compile control while using
-about `4.23%` less training time.
-
-## Compile Mode Probe
-
-Short d16/40 throughput probes on the minimal branch:
-
-| Compile mode | Avg logged tok/s, excluding first | Avg logged step time, excluding first | Total time |
-| --- | ---: | ---: | ---: |
-| default `torch.compile` | `324,995` | `1613.250ms` | `0.78m` |
-| `max-autotune-no-cudagraphs` | `333,261` | `1573.250ms` | `0.76m` |
-
-On this d16 probe, `max-autotune-no-cudagraphs` was about `2.5%` faster than
-the default compile mode. The speedrun script keeps this compile mode for that
-reason.
-
-## Test Status
-
-- `python -m pytest tests/test_engine.py -q`: `9 passed`
-- `python -m py_compile nanochat/gpt.py nanochat/optim.py scripts/base_train.py nanochat/engine.py`: passed