# Bigram Speedrun Verification Notes

This branch is based on upstream nanochat master at `dc54a1a` and keeps the submission implementation focused on the winning recipe:

- per-layer hashed bigram residual embeddings (sketched below)
- Muon+ post-orthogonalization normalization
- row equilibration before Muon orthogonalization (both Muon tweaks are sketched below)
- a lower scalar LR (`--scalar-lr=0.3`)
- batched training logging (`--train-log-every=50`)
- `torch.compile(..., mode="max-autotune-no-cudagraphs")` for the speedrun script
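
A minimal sketch of the first item, hedged: the bucket count, hash function, zero init, and module name here are illustrative assumptions, not the exact submission code. The idea is to hash each adjacent token pair into a learned embedding table and add the lookup to the residual stream at every layer.

```python
import torch
import torch.nn as nn

class HashedBigramEmbedding(nn.Module):
    """Sketch of a hashed bigram residual embedding (details assumed)."""

    def __init__(self, n_buckets: int, n_embd: int):
        super().__init__()
        self.n_buckets = n_buckets
        self.emb = nn.Embedding(n_buckets, n_embd)
        nn.init.zeros_(self.emb.weight)  # start as a no-op residual contribution

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        # idx: (B, T) token ids; pair each position with its predecessor.
        prev = torch.roll(idx, shifts=1, dims=1)
        prev[:, 0] = 0  # first position has no predecessor
        # Cheap multiplicative hash of the (prev, cur) pair into a bucket.
        h = (prev * 1000003 + idx) % self.n_buckets
        return self.emb(h)  # (B, T, n_embd)
```

In a per-layer variant, each transformer block would add such a lookup (with its own or a shared table) to its input alongside the usual residual stream.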

It intentionally excludes the experimental branches that were not part of the final candidate: sparse layers, MoE/TOP losses, train-time logit bias losses, post-hoc fitting, NorMuon, and checkpoint merging.
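
For the two Muon tweaks kept in the recipe, here is a hedged sketch of where they sit in a Muon-style update. The Newton-Schulz iteration is standard Muon; the exact equilibration and post-normalization scalings are assumptions, not the submission code.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    # Quintic Newton-Schulz iteration from Muon; approximately orthogonalizes G.
    a, b, c = (3.4445, -4.7750, 2.0315)
    X = G / (G.norm() + 1e-7)
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    return X.T if transposed else X

def muon_plus_update(momentum: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    g = momentum
    # Row equilibration before orthogonalization: normalize each row so no
    # single row dominates the Newton-Schulz iteration (assumed detail).
    g = g / (g.norm(dim=1, keepdim=True) + eps)
    g = newton_schulz_orthogonalize(g)
    # Post-orthogonalization normalization: rescale the update to unit RMS
    # so its magnitude is predictable (assumed detail).
    return g / (g.norm() / g.numel() ** 0.5 + eps)
```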

## Reproduction Sanity Check

A short d4, 20-step run on the minimal branch matched the prior experimental branch:

| Run | Step 0 BPB | Step 10 BPB | Final BPB |
| --- | --- | --- | --- |
| Prior candidate branch | 3.237224 | 3.234722 | 3.223259 |
| Minimal PR branch | 3.237224 | 3.234722 | 3.223286 |

The final difference is 0.000027 BPB on a tiny run, consistent with small compile/graph differences after removing unused experimental code.
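
For reference, BPB (bits per byte) converts the mean cross-entropy from nats per token into bits and normalizes by the byte length of the underlying text, making scores tokenizer-independent. A minimal sketch of the conversion (the exact accounting in nanochat may differ):

```python
import math

def bits_per_byte(mean_loss_nats: float, total_tokens: int, total_bytes: int) -> float:
    # nats/token -> bits/token, then bits/token -> bits/byte.
    return (mean_loss_nats / math.log(2)) * (total_tokens / total_bytes)
```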

## Full d16 Verification

Both runs used d16, FP8, a target param/data ratio of 8, a total batch size of 524288, and a device batch size of 32 on the same machine.

| Run | Final BPB | Train time | Avg logged tok/s (excluding first) | Avg logged step time (excluding first) |
| --- | --- | --- | --- | --- |
| Upstream master dense | 0.800673 | 94.64m | 329,904 | 1589.232ms |
| Bigram/Muon+ candidate | 0.798000 | 93.61m | 333,507 | 1572.058ms |

Candidate delta versus upstream master dense:

- BPB: -0.002673
- train time: -1.03m (1.09% faster)
- logged throughput: +3,603 tok/s (1.09% higher)

Important caveat: this is a full recipe comparison, not an architecture-only comparison. The candidate also uses `--train-log-every=50` and `--compile-mode=max-autotune-no-cudagraphs`, while upstream master logs every step and uses the default compile mode.
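
The logging cadence matters for throughput because reading a loss back every step forces a host-device synchronization. A hedged sketch of what batched logging avoids (`train_step` and the loop bounds are illustrative stand-ins):

```python
import torch

def train_step() -> torch.Tensor:
    """Stand-in for one forward/backward/optimizer step; returns the loss."""
    return torch.rand((), device="cuda")

train_log_every = 50  # matches --train-log-every=50
loss_acc = torch.zeros((), device="cuda")

for step in range(200):
    loss_acc += train_step().detach()  # accumulate on-device, no host sync
    if (step + 1) % train_log_every == 0:
        # One blocking .item() per 50 steps instead of one per step.
        avg = (loss_acc / train_log_every).item()
        print(f"step {step + 1}: avg loss {avg:.4f}")
        loss_acc.zero_()
```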

## Compile Mode Probe

Short d16, 40-step throughput probes on the minimal branch:

| Compile mode | Avg logged tok/s (excluding first) | Avg logged step time (excluding first) | Total time |
| --- | --- | --- | --- |
| default `torch.compile` | 324,995 | 1613.250ms | 0.78m |
| `max-autotune-no-cudagraphs` | 333,261 | 1573.250ms | 0.76m |

On this d16 probe, `max-autotune-no-cudagraphs` was about 2.5% faster than the default compile mode. The speedrun script keeps this compile mode for that reason.
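
Wiring-wise this is just the `mode` argument to `torch.compile`; a minimal sketch with a stand-in module:

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 16).cuda()  # stand-in for the nanochat GPT model
# "max-autotune-no-cudagraphs" runs Inductor's max-autotune kernel search
# but skips CUDA graphs; omitting mode gives the default compile behavior.
model = torch.compile(model, mode="max-autotune-no-cudagraphs")
```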

## Test Status

- `python -m pytest tests/test_engine.py -q`: 9 passed
- `python -m py_compile nanochat/gpt.py nanochat/optim.py scripts/base_train.py nanochat/engine.py`: passed