# Bigram Speedrun Verification Notes
This branch is based on upstream nanochat master at dc54a1a and keeps the
submission implementation focused on the winning recipe:
- per-layer hashed bigram residual embeddings (see the first sketch below)
- Muon+ post-orthogonalization normalization (see the second sketch below)
- row equilibration before Muon orthogonalization (see the second sketch below)
- lower scalar LR (`--scalar-lr=0.3`)
- batched training logging (`--train-log-every=50`)
- `torch.compile(..., mode="max-autotune-no-cudagraphs")` for the speedrun script
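The first recipe item is the core architectural change. Here is a minimal sketch of a per-layer hashed bigram residual embedding, assuming a multiplicative hash and a fixed table size (both illustrative choices, not this branch's exact code):

```python
import torch
import torch.nn as nn

class HashedBigramEmbedding(nn.Module):
    """Residual embedding keyed on the (previous token, current token) pair.

    The pair is hashed into a fixed-size table, so the module works for any
    vocab size without a vocab^2 embedding. Each transformer layer would own
    one of these and add its output to the residual stream.
    """
    def __init__(self, table_size: int, dim: int):
        super().__init__()
        self.table_size = table_size
        self.emb = nn.Embedding(table_size, dim)
        nn.init.zeros_(self.emb.weight)  # start as a no-op residual contribution

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        # idx: (B, T) token ids; pair each position with its predecessor.
        prev = torch.roll(idx, shifts=1, dims=1)
        prev[:, 0] = 0  # first position has no predecessor
        # Cheap multiplicative hash of the bigram into the table (illustrative).
        h = (prev.to(torch.int64) * 1000003 + idx.to(torch.int64)) % self.table_size
        return self.emb(h)  # (B, T, dim)
```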
It intentionally excludes the experimental branches that were not part of the final candidate: sparse layers, MoE/TOP losses, train-time logit bias losses, post-hoc fitting, NorMuon, and checkpoint merging.
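Similarly, the two Muon-side recipe items could look roughly like this: row equilibration before the Newton-Schulz orthogonalization, and a rescale of the orthogonalized update afterwards. The Newton-Schulz coefficients are the standard Muon ones; the equilibration and normalization details are assumptions about the general technique, not this branch's implementation:

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    # Quintic Newton-Schulz iteration toward an orthogonal matrix, with the
    # standard Muon coefficients.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.bfloat16()
    X = X / (X.norm() + 1e-7)
    transposed = G.size(0) > G.size(1)
    if transposed:
        X = X.mT
    for _ in range(steps):
        A = X @ X.mT
        B = b * A + c * A @ A
        X = a * X + B @ X
    if transposed:
        X = X.mT
    return X

def muon_plus_update(grad: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    # (1) Row equilibration (assumed form): rescale every row to unit norm
    #     before orthogonalizing, so no single row dominates the iteration.
    G = grad / (grad.norm(dim=1, keepdim=True) + eps)
    # (2) Orthogonalize the equilibrated gradient.
    O = newton_schulz_orthogonalize(G)
    # (3) Post-orthogonalization normalization (assumed form): rescale the
    #     result to unit RMS so the update magnitude is stable across shapes.
    return O / (O.norm() / (O.numel() ** 0.5) + eps)
```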
## Reproduction Sanity Check
A d4/20 run on the minimal branch matched the prior experimental branch:
| Run | Step 0 BPB | Step 10 BPB | Final BPB |
|---|---|---|---|
| Prior candidate branch | 3.237224 | 3.234722 | 3.223259 |
| Minimal PR branch | 3.237224 | 3.234722 | 3.223286 |
The final difference is 0.000027 BPB on a tiny run, consistent with small
compile/graph differences after removing unused experimental code.
## Full d16 Verification
Both runs used d16, FP8, target param/data ratio 8, total batch 524288, and
device batch 32 on the same machine.
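For reference, here is how the total batch decomposes into per-device micro-batches, assuming 2048-token sequences and 8 GPUs (both are assumptions; the run config above does not state them):

```python
# Batch accounting sketch; seq_len and world_size are assumptions.
total_batch_tokens = 524288            # tokens per optimizer step (2**19)
device_batch_seqs = 32                 # sequences per device per micro-step
seq_len = 2048                         # assumed sequence length
world_size = 8                         # assumed GPU count

tokens_per_micro_step = device_batch_seqs * seq_len * world_size  # 524288
grad_accum_steps = total_batch_tokens // tokens_per_micro_step    # 1 here
```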
| Run | Final BPB | Train time | Avg logged tok/s (excl. first log) | Avg logged step time (excl. first log) |
|---|---|---|---|---|
| Upstream master dense | 0.800673 | 94.64m | 329,904 | 1589.232ms |
| Bigram/Muon+ candidate | 0.798000 | 93.61m | 333,507 | 1572.058ms |
Candidate delta versus upstream master dense:
- BPB: -0.002673
- train time: -1.03m (1.09% faster)
- logged throughput: +3,603 tok/s (1.09% higher)
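The percentages follow directly from the table (pure arithmetic, nothing assumed):

```python
# Derived from the d16 table above.
bpb_delta = 0.798000 - 0.800673              # -0.002673
time_pct = 100 * (94.64 - 93.61) / 94.64     # ~1.09 (% faster)
tput_pct = 100 * (333507 - 329904) / 329904  # ~1.09 (% higher)
```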
Important caveat: this is a full-recipe comparison, not an architecture-only
comparison. The candidate also uses `--train-log-every=50` and
`--compile-mode=max-autotune-no-cudagraphs`, while upstream master logs every
step and uses the default compile mode.
## Compile Mode Probe
Short d16/40 throughput probes on the minimal branch:
| Compile mode | Avg logged tok/s (excl. first log) | Avg logged step time (excl. first log) | Total time |
|---|---|---|---|
| default `torch.compile` | 324,995 | 1613.250ms | 0.78m |
| `max-autotune-no-cudagraphs` | 333,261 | 1573.250ms | 0.76m |
On this d16 probe, `max-autotune-no-cudagraphs` was about 2.5% faster than
the default compile mode. The speedrun script keeps this compile mode for that
reason.
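The switch itself is one line against the stock `torch.compile` API; a minimal sketch (model construction elided):

```python
import torch

# Probe arithmetic from the table above: 333261 / 324995 ~ 1.025, i.e. ~2.5%.

def compile_model(model: torch.nn.Module,
                  mode: str = "max-autotune-no-cudagraphs") -> torch.nn.Module:
    # "max-autotune-no-cudagraphs" runs the max-autotune kernel search while
    # skipping CUDA graphs; "default" is the stock compile mode.
    return torch.compile(model, mode=mode)
```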
## Test Status
- `python -m pytest tests/test_engine.py -q`: 9 passed
- `python -m py_compile nanochat/gpt.py nanochat/optim.py scripts/base_train.py nanochat/engine.py`: passed