mirror of https://github.com/karpathy/nanochat.git (synced 2026-05-20 22:57:57 +00:00)

Remove dev PR notes

parent 0393a2c13f
commit d6a169b329

@ -1,219 +0,0 @@
# Minimal Bigram Speedrun PR Changes

This branch is based on upstream nanochat master at `dc54a1a`. The goal is to
keep the submission patch limited to the changes needed to reproduce the
best-performing speedrun recipe. These are the `scripts/base_train.py` defaults:

```bash
--fp8
--depth=22
--num-iterations=11600
--total-batch-size=524288
--bigram-embed-factor=5
--muon-plus
--muon-eq=row
--scalar-lr=0.3
--train-log-every=50
--compile-mode=max-autotune-no-cudagraphs
--eval-every=250
--core-metric-every=5800
```

It does not include the experimental branches that were tested and rejected:
sparse architecture changes, MoE/TOP auxiliary losses, train-time logit-bias
losses, post-hoc calibration, NorMuon variants, checkpoint merging, or d22/d24
run-management scripts.
## `nanochat/gpt.py`

### Hashed Bigram Residual Embedding

Adds two config fields:

- `bigram_embed_factor`, default `0`
- `bigram_lambda_init`, default `0.05`

When `bigram_embed_factor > 0`, the model creates a separate bigram embedding
table with `vocab_size * bigram_embed_factor` entries. For each token position,
the current token id and previous token id are hashed into that table. The
resulting embedding is added as a residual input before every transformer block:

```python
x = x + bigram_lambdas[i] * x0_bigram
```

The first token in each sequence uses a sentinel bucket because it has no
previous token. During KV-cache decoding, the previous token is read from the
cache so generation matches the training-time bigram definition.

Why this helps: it gives the model a cheap, direct representation of adjacent
token pairs without adding attention or MLP compute. The bigram table is
zero-initialized, so the model starts from the original network function, while
the per-layer `bigram_lambdas` start at `0.05` to let the residual learn quickly.
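
The bucket lookup described above can be sketched in plain Python. This is an illustration under assumptions, not the branch's code: the function name `bigram_bucket_ids`, the multiplicative mixing constant, and the choice to reserve the last table slot as the sentinel are all hypothetical. The notes only state that the pair is hashed into a table of `vocab_size * bigram_embed_factor` entries and that the first position uses a sentinel bucket.

```python
from typing import List, Optional


def bigram_bucket_ids(token_ids: List[int], vocab_size: int, factor: int) -> List[int]:
    """Map each position's (previous, current) token pair to a table index.

    Assumptions (not from the PR notes): the last slot is the sentinel, and
    the hash is a simple multiplicative mix reduced modulo the other slots.
    """
    table_size = vocab_size * factor
    sentinel = table_size - 1        # assumed slot for "no previous token"
    hashed_slots = table_size - 1    # real pairs never land on the sentinel
    ids: List[int] = []
    prev: Optional[int] = None
    for cur in token_ids:
        if prev is None:
            ids.append(sentinel)     # first token in the sequence
        else:
            ids.append((prev * 1_000_003 + cur) % hashed_slots)
        prev = cur
    return ids


ids = bigram_bucket_ids([5, 7, 5, 7], vocab_size=16, factor=5)
```

The same `(previous, current)` pair always lands in the same bucket, so positions 1 and 3 in the usage line share an index, while position 0 gets the sentinel. Distinct pairs may collide; that is the usual trade-off of a hashed table.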
### Parameter Counting and FLOP Accounting

The bigram embedding table and bigram lambdas are excluded from the main matmul
FLOP/scaling parameter count. They are not transformer matrix weights, and
including them would distort the target param/data ratio logic.
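
A rough worked example of the distortion, assuming the training horizon is derived as `ratio * scaling_params`. The vocabulary size, model width, and matrix parameter count below are hypothetical round numbers, and the sketch assumes the bigram table has the same width as the model; only the factor `5` and the ratio `8` (used in the d16 verification runs) come from these notes.

```python
def target_tokens(scaling_params: int, ratio: int) -> int:
    """Training tokens implied by a target param/data ratio (assumed formula)."""
    return scaling_params * ratio


vocab_size = 32_768            # hypothetical
model_dim = 1_024              # hypothetical; table width assumed equal to it
matrix_params = 300_000_000    # hypothetical transformer matrix parameter count
bigram_embed_factor = 5        # recipe default from the notes

# One embedding row per bucket: vocab_size * factor rows of width model_dim.
bigram_table_params = vocab_size * bigram_embed_factor * model_dim

tokens_excluding = target_tokens(matrix_params, ratio=8)
tokens_including = target_tokens(matrix_params + bigram_table_params, ratio=8)
inflation = tokens_including / tokens_excluding
```

With these illustrative numbers the lookup table alone would add more than half again to the parameter count used for the horizon, even though it contributes no matmul FLOPs.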
### Optimizer Groups

Adds dedicated optimizer groups for:

- `bigram_embed`
- `bigram_lambdas`

The bigram embedding uses AdamW with a configurable multiplier relative to the
main embedding LR. The layer lambdas use a small AdamW LR. This keeps the bigram
residual trainable without mixing it into the Muon-managed transformer matrices.
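
One way to picture the split is a name-based partition of parameters. The sketch below is hypothetical: the parameter names, shapes, and the `.h.` substring test for transformer blocks are illustrative assumptions, not the actual logic in `setup_optimizer()`.

```python
from typing import Dict, List, Tuple

Param = Tuple[str, Tuple[int, ...]]  # (name, shape)


def assign_optimizer_groups(params: List[Param]) -> Dict[str, List[str]]:
    """Partition parameters so bigram tensors never reach the Muon group."""
    groups: Dict[str, List[str]] = {
        "bigram_embed": [],    # AdamW, LR = embedding LR * multiplier
        "bigram_lambdas": [],  # AdamW, small LR
        "muon": [],            # 2D transformer block matrices
        "adamw_other": [],     # everything else (embeddings, scalars, ...)
    }
    for name, shape in params:
        if "bigram_embed" in name:
            groups["bigram_embed"].append(name)
        elif "bigram_lambdas" in name:
            groups["bigram_lambdas"].append(name)
        elif len(shape) == 2 and ".h." in name:  # assumed block naming
            groups["muon"].append(name)
        else:
            groups["adamw_other"].append(name)
    return groups


example: List[Param] = [
    ("transformer.wte.weight", (32768, 1024)),
    ("bigram_embed.weight", (163840, 1024)),
    ("bigram_lambdas", (22,)),
    ("transformer.h.0.attn.c_q.weight", (1024, 1024)),
]
groups = assign_optimizer_groups(example)
```

Checking the bigram names before the 2D-matrix test matters: the bigram table is itself a 2D tensor and would otherwise be a candidate for the matrix group.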
### Muon Options Plumbed Through

`setup_optimizer()` accepts:

- `muon_plus`
- `muon_eq_axis`

These are forwarded into the Muon parameter groups so the optimizer can apply
the selected Muon variants to matrix weights.

## `nanochat/optim.py`

### Muon+ Renormalization

After Newton-Schulz orthogonalization, Muon+ rescales the update by its
Frobenius norm. This is a small post-processing step on the Muon update and was
the strongest optimizer-side change in the experiments.

Why this helps: it stabilizes update scale after orthogonalization without
changing the model architecture or adding optimizer state.
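
A minimal sketch of the post-processing step, on nested lists rather than tensors. It assumes that "rescales by its Frobenius norm" means dividing the update by that norm so it reaches a fixed target norm; the `target_norm` of `1.0` and the function names are hypothetical, and the Newton-Schulz step itself is omitted (the input stands in for an already-orthogonalized update).

```python
import math
from typing import List

Matrix = List[List[float]]


def frobenius_norm(m: Matrix) -> float:
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(v * v for row in m for v in row))


def renormalize_update(update: Matrix, target_norm: float = 1.0,
                       eps: float = 1e-12) -> Matrix:
    """Scale the whole update so its Frobenius norm is ~target_norm.

    Assumed interpretation of Muon+; the real scale factor may differ.
    """
    scale = target_norm / (frobenius_norm(update) + eps)
    return [[v * scale for v in row] for row in update]


out = renormalize_update([[3.0, 0.0], [0.0, 4.0]])
```

The step uses only the current update, which is consistent with the note that it adds no optimizer state.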
### Row/Column Equilibration

Adds optional row or column norm equilibration before orthogonalization:

- `muon_eq_axis=1`: row equilibration
- `muon_eq_axis=2`: column equilibration
- `muon_eq_axis=0`: disabled

The speedrun recipe uses row equilibration. It normalizes rows toward a common
target norm before the polar/Newton-Schulz step, then continues through the
existing Muon update path.

Why this helps: row equilibration was a small but positive companion to Muon+ in
the winning recipe, with minimal extra code and no extra persistent optimizer
state.
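
The three `muon_eq_axis` settings can be illustrated on nested lists. This sketch assumes the common target norm is the mean of the current row (or column) norms and that rows are rescaled fully to it; the notes only say rows are normalized "toward" a common target, so both choices are assumptions, as is the function name `equilibrate`.

```python
import math
from typing import List

Matrix = List[List[float]]


def equilibrate(grad: Matrix, axis: int, eps: float = 1e-12) -> Matrix:
    """axis follows the muon_eq_axis convention: 0 = off, 1 = rows, 2 = columns."""
    if axis == 0:
        return [row[:] for row in grad]
    if axis == 2:
        # Equilibrate columns by transposing, equilibrating rows, transposing back.
        transposed = [list(col) for col in zip(*grad)]
        return [list(col) for col in zip(*equilibrate(transposed, 1, eps))]
    if axis != 1:
        raise ValueError("axis must be 0, 1, or 2")
    norms = [math.sqrt(sum(v * v for v in row)) for row in grad]
    target = sum(norms) / len(norms)  # assumed target: mean row norm
    return [[v * target / (n + eps) for v in row] for row, n in zip(grad, norms)]


g = [[3.0, 4.0], [0.0, 1.0]]
rows_eq = equilibrate(g, axis=1)
cols_eq = equilibrate(g, axis=2)
```

In the usage lines, the two rows start with norms 5 and 1 and both end near 3. The result would then feed the orthogonalization step, which is not shown here.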
## `nanochat/engine.py`

### Previous Token in KV Cache

Adds `prev_token` to `KVCache`, resets it with the rest of the cache, and copies
it during prefill expansion.

Why this is needed: full-sequence training can compute bigram hashes from
`idx[:, :-1]`, but one-token decode does not have the previous token in the
current input tensor. Keeping `prev_token` in the cache makes generation use the
same bigram feature as training.
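
The consistency requirement can be shown with a toy cache. `TinyCache` below is a hypothetical stand-in that tracks only `prev_token`; it is not the real `KVCache`, and the helper names are invented for this sketch. The point is that prefill followed by one-token decode steps yields the same `(previous, current)` pairs as processing the full sequence at once.

```python
from typing import List, Optional, Tuple

Pair = Tuple[Optional[int], int]


class TinyCache:
    """Hypothetical stand-in for KVCache that tracks only prev_token."""

    def __init__(self) -> None:
        self.prev_token: Optional[int] = None

    def reset(self) -> None:
        self.prev_token = None


def pairs_full_sequence(idx: List[int]) -> List[Pair]:
    """Training-style view: shift the sequence by one to get previous tokens."""
    return list(zip([None] + idx[:-1], idx))


def pairs_with_cache(idx: List[int], cache: TinyCache, prefill_len: int) -> List[Pair]:
    """Prefill the first prefill_len (>= 1) tokens, then decode one at a time."""
    out = pairs_full_sequence(idx[:prefill_len])
    cache.prev_token = idx[prefill_len - 1]   # remember the last prefilled token
    for tok in idx[prefill_len:]:
        out.append((cache.prev_token, tok))   # previous token comes from the cache
        cache.prev_token = tok
    return out


tokens = [11, 42, 7, 42, 9]
cache = TinyCache()
decoded_pairs = pairs_with_cache(tokens, cache, prefill_len=2)
```

Without the cached value, each decode step would see a length-one input and have no way to form the pair.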
## `scripts/base_train.py`

### Bigram CLI Flags

Adds:

- `--bigram-embed-factor`
- `--bigram-lambda-init`
- `--bigram-embedding-lr-mult`
- `--bigram-lambda-lr`

These configure the bigram residual and its optimizer treatment from the
training script. The submission default is `--bigram-embed-factor=5`.

### Muon Variant Flags

Adds:

- `--muon-plus`
- `--no-muon-plus`
- `--muon-eq`

These expose the optimizer variants used in the recipe. The submission defaults
are Muon+ enabled and `--muon-eq=row`. `--no-muon-plus --muon-eq=none` restores
the original Muon path.
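
A paired on/off flag like this can be declared with the standard library's `argparse.BooleanOptionalAction` (Python 3.9+). The sketch below is one possible wiring, not the script's actual parser: the `MUON_EQ_AXIS` mapping name and the `col` spelling for column equilibration are assumptions (the notes only show `row` and `none`), while the axis numbers follow the `muon_eq_axis` convention listed earlier.

```python
import argparse

# Maps the CLI choice to the muon_eq_axis value; "col" is an assumed spelling.
MUON_EQ_AXIS = {"none": 0, "row": 1, "col": 2}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # BooleanOptionalAction generates both --muon-plus and --no-muon-plus.
    parser.add_argument("--muon-plus", action=argparse.BooleanOptionalAction,
                        default=True)
    parser.add_argument("--muon-eq", choices=sorted(MUON_EQ_AXIS), default="row")
    return parser


defaults = build_parser().parse_args([])
baseline = build_parser().parse_args(["--no-muon-plus", "--muon-eq=none"])
```

Here `defaults` corresponds to the submission recipe and `baseline` to the original Muon path.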
### Train Logging Cadence

Adds `--train-log-every`. Values greater than 1 avoid converting the loss tensor
to a Python scalar every step.

Why this helps: per-step logging creates extra synchronization overhead. The
submission default is `--train-log-every=50`, which keeps useful progress reporting
while reducing logging overhead.
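
The effect of the cadence can be counted with a stand-in object. `FakeLoss` below is hypothetical and only mimics the `.item()` call that moves a device scalar to the host; the loop is a schematic, not the training script, and it assumes logging happens on steps where `step % log_every == 0`.

```python
class FakeLoss:
    """Stand-in for a device tensor; counts how often .item() is called."""

    sync_count = 0

    def __init__(self, value: float) -> None:
        self.value = value

    def item(self) -> float:
        FakeLoss.sync_count += 1   # each call models one host synchronization
        return self.value


def count_syncs(num_steps: int, log_every: int) -> int:
    """Run a dummy loop and return how many scalar conversions happened."""
    FakeLoss.sync_count = 0
    for step in range(num_steps):
        loss = FakeLoss(1.0 / (step + 1))
        if step % log_every == 0:  # assumed logging condition
            _ = loss.item()
    return FakeLoss.sync_count


every_step = count_syncs(200, log_every=1)
every_50 = count_syncs(200, log_every=50)
```

Over 200 dummy steps, logging every step converts the loss 200 times, while logging every 50 steps converts it 4 times.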
### Compile Mode

Adds `--compile-mode` so the speedrun can request:

```bash
--compile-mode=max-autotune-no-cudagraphs
```

Why this helps: on the d16 probe, this compile mode was about 2.5% faster than
default `torch.compile` for the candidate recipe. It is now the submission
default.

### Skip Initial Eval

Adds `--skip-initial-eval` and `--initial-eval`. The submission default skips
the step-0 validation pass; `--initial-eval` restores the original behavior.

## `runs/speedrun.sh`

Uses the `scripts/base_train.py` submission defaults:

- FP8
- depth `22`
- fixed `11600` optimizer steps
- total batch size `524288`
- Muon+
- row equilibration
- bigram factor 5
- scalar LR `0.3`
- log every 50 training steps
- `max-autotune-no-cudagraphs` compile mode
- validation every 250 steps
- one CORE metric pass halfway through at step 5800

This script is the intended entry point for reproducing the submitted run.
## `tests/test_engine.py`

Adds coverage for preserving `prev_token` through KV-cache prefill/expansion.

Why this matters: the bigram feature must behave consistently during generation.
The test guards the cache state required for single-token decode.

## `dev/bigram_speedrun_results.md`

Records the validation and throughput evidence used to justify the recipe:

- minimal branch sanity check against the prior candidate branch
- full d16 comparison against upstream dense
- controlled d16 throughput comparison
- compile-mode probe
- test status

This is supporting documentation for the PR, not code required at runtime.

## Submission Readiness

Completed checks:

- `python -m pytest tests/test_engine.py -q`
- `python -m py_compile nanochat/gpt.py nanochat/optim.py scripts/base_train.py nanochat/engine.py`
- `git diff --check`

The remaining work is operational: run the final benchmark on the 8xH100 system
from this branch and include the measured result in the submission PR.
@ -1,83 +0,0 @@
# Bigram Speedrun Verification Notes

This branch is based on upstream nanochat master at `dc54a1a` and keeps the
submission implementation focused on the winning recipe:

- per-layer hashed bigram residual embeddings
- Muon+ post-orthogonalization normalization
- row equilibration before Muon orthogonalization
- lower scalar LR (`--scalar-lr=0.3`)
- batched training logging (`--train-log-every=50`)
- `torch.compile(..., mode="max-autotune-no-cudagraphs")` for the speedrun script

It intentionally excludes the experimental branches that were not part of the
final candidate: sparse layers, MoE/TOP losses, train-time logit bias losses,
post-hoc fitting, NorMuon, and checkpoint merging.

## Reproduction Sanity Check

Minimal branch d4/20 matched the prior experimental branch:

| Run | Step 0 BPB | Step 10 BPB | Final BPB |
| --- | ---: | ---: | ---: |
| Prior candidate branch | `3.237224` | `3.234722` | `3.223259` |
| Minimal PR branch | `3.237224` | `3.234722` | `3.223286` |

The final difference is `0.000027` BPB on a tiny run, consistent with small
compile/graph differences after removing unused experimental code.
## Full d16 Verification

Both runs used d16, FP8, target param/data ratio 8, total batch `524288`, and
device batch `32` on the same machine.

| Run | Final BPB | Train time | Avg logged tok/s, excluding first | Avg logged step time, excluding first |
| --- | ---: | ---: | ---: | ---: |
| Upstream master dense | `0.800673` | `94.64m` | `329,904` | `1589.232ms` |
| Bigram/Muon+ candidate | `0.798000` | `93.61m` | `333,507` | `1572.058ms` |

Candidate delta versus upstream master dense:

- BPB: `-0.002673`
- train time: `-1.03m` (`1.09%` faster)
- logged throughput: `+3,603 tok/s` (`1.09%` higher)

Important caveat: this is a full recipe comparison, not an architecture-only
comparison. The candidate also uses `--train-log-every=50` and
`--compile-mode=max-autotune-no-cudagraphs`, while upstream master logs every
step and uses the default compile mode.
## Controlled d16 Throughput

A dense control run with the same log50/compile-control style is the better
way to estimate the per-step overhead of the bigram path.

| Run | Final BPB | Train time | Avg logged tok/s, excluding first | Avg logged step time, excluding first |
| --- | ---: | ---: | ---: | ---: |
| Dense log50 compile control | `0.800604` | `92.85m` | `336,247` | `1559.258ms` |
| Bigram/Muon+ candidate, full 3584 | `0.798000` | `93.61m` | `333,507` | `1572.058ms` |

Against this controlled dense run, the bigram candidate is about `0.81%` slower
per step, but `0.002604` BPB better at the same horizon.

A shortened bigram run at 3400 steps landed at `0.800232` BPB in `88.92m`,
which is `0.000372` BPB better than the dense log50 compile control while using
about `4.23%` less training time.
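
The figures in this section can be re-derived from the table values. The helper `pct_change` and the variable names are just for this check; all numbers are copied from the notes above.

```python
def pct_change(new: float, old: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


# Values copied from the controlled d16 table and the shortened-run paragraph.
dense_step_ms, bigram_step_ms = 1559.258, 1572.058
dense_bpb, bigram_bpb, short_bpb = 0.800604, 0.798000, 0.800232
dense_minutes, short_minutes = 92.85, 88.92

step_overhead_pct = pct_change(bigram_step_ms, dense_step_ms)     # per-step cost
bpb_gain = dense_bpb - bigram_bpb                                 # same horizon
short_bpb_gain = dense_bpb - short_bpb                            # 3400-step run
short_time_saved_pct = -pct_change(short_minutes, dense_minutes)  # time saved

print(f"step overhead: {step_overhead_pct:.2f}%")
print(f"BPB gain at same horizon: {bpb_gain:.6f}")
print(f"short run: {short_bpb_gain:.6f} BPB better, {short_time_saved_pct:.2f}% less time")
```

The per-step overhead comes out at roughly 0.8% whether it is computed from step times or from logged throughput, in line with the figure quoted above.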
## Compile Mode Probe

Short d16/40 throughput probes on the minimal branch:

| Compile mode | Avg logged tok/s, excluding first | Avg logged step time, excluding first | Total time |
| --- | ---: | ---: | ---: |
| default `torch.compile` | `324,995` | `1613.250ms` | `0.78m` |
| `max-autotune-no-cudagraphs` | `333,261` | `1573.250ms` | `0.76m` |

On this d16 probe, `max-autotune-no-cudagraphs` was about `2.5%` faster than
the default compile mode. The speedrun script keeps this compile mode for that
reason.
## Test Status

- `python -m pytest tests/test_engine.py -q`: `9 passed`
- `python -m py_compile nanochat/gpt.py nanochat/optim.py scripts/base_train.py nanochat/engine.py`: passed