diff --git a/README.md b/README.md
index 6283437..e96b5a7 100644
--- a/README.md
+++ b/README.md
@@ -14,37 +14,15 @@ For questions about the repo, I recommend either using [DeepWiki](https://deepwi
 ## Leaderboard
 
-| # | Record time | Description | Date | Commit | Contributors |
-|---|-------------|-------------|------|--------|--------------|
-| 1 | 3.04 hours | d24 baseline, slightly overtrained | Jan 29 2026 | 348fbb3 | @karpathy |
+| # | Record time | val_bpb | CORE | Description | Date | Commit | Contributors |
+|---|-------------|---------|------|-------------|------|--------|--------------|
+| 0 | 168 hours | - | 0.256525 | Original OpenAI GPT-2 checkpoint | 2019 | - | OpenAI |
+| 1 | 3.04 hours | 0.74833 | 0.25851 | d24 baseline, slightly overtrained | Jan 29 2026 | 348fbb3 | @karpathy |
+| 2 | 2.91 hours | 0.74504 | 0.2578 | d26 slightly undertrained **+fp8** | Feb 2 2026 | TODO | @karpathy |
 
-The primary metric we care about is "time to GPT-2" - the wall clock time needed to outperform the GPT-2 (1.6B) CORE metric on an 8XH100 GPU node. In 2019, the training of GPT-2 cost approximately $50,000 so it is incredible that due to many advances over 7 years across the stack, we can now do so in 3 hours or less, for ~$73 and below. Once your repo is set up (see the [runs/speedrun.sh](runs/speedrun.sh) script for reference), e.g. the way I kicked off the jan29 run is as follows:
+The primary metric we care about is "time to GPT-2": the wall-clock time needed to outperform the GPT-2 (1.6B) CORE metric on an 8XH100 GPU node. The GPT-2 CORE score (the target to beat) is 0.256525. In 2019, training GPT-2 cost approximately $50,000, so it is incredible that, thanks to many advances across the stack over 7 years, we can now do so much faster and for well below $100 (e.g. at the current ~$3/GPU/hr, an 8XH100 node is ~$24/hr, so 3 hours is ~$72).
-```
-OMP_NUM_THREADS=1 torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- \
-    --depth=24 \
-    --run=d24-jan29 \
-    --model-tag=d24_jan29 \
-    --device-batch-size=16 \
-    --sample-every=-1 \
-    --save-every=-1 \
-    --core-metric-max-per-task=-1 \
-    --core-metric-every=3000 \
-    --target-param-data-ratio=12
-```
-
-After 3 hours we get output like this:
-
-```
-...
-wandb: Run summary:
-wandb: core_metric 0.25851
-wandb: step 16704
-wandb: total_training_flops 4.330784131228946e+19
-wandb: total_training_time 10949.46713
-```
-
-The GPT-2 CORE score (i.e. the target to beat) is 0.256525. So we see that this d24 CORE score is higher (0.25851). Then we look at the `total_training_time`, which is the time of the training iterations alone, excluding all the evaluations and logging, in seconds. We get: `10949/60/60 ~= 3.04` hours, the current record.
+See [dev/LEADERBOARD.md](dev/LEADERBOARD.md) for more docs on how to interpret and contribute to the leaderboard.
 
 ## Getting started
diff --git a/dev/LOG.md b/dev/LOG.md
index dd11b42..8cdef87 100644
--- a/dev/LOG.md
+++ b/dev/LOG.md
@@ -4,6 +4,74 @@ A running summary documenting some experiments and findings. Started ~Jan 7 2026
 
 ---
 
+## 2026-02-02: FP8 Training with torchao
+
+Integrated FP8 training using `torchao.float8` to accelerate Linear-layer matmuls on H100 GPUs.
+
+### Background
+
+FP8 (8-bit floating point) uses the H100's FP8 tensor cores for ~2x theoretical matmul throughput. The tradeoff is quantization overhead: computing scales and casting tensors to/from FP8. Still, torchtitan (Meta's distributed training framework), for example, reports 25-28% speedups with FP8 for some of their experiments.
+
+**Previous attempt (Jan 2026):** FP8 on just `lm_head`, following modded-nanogpt, with custom ops → 1% speedup, +2GB memory. Failed due to a fragile torch.compile interaction. That experiment was also run at ~d12 scale, rather than the larger ~d24 model that reaches GPT-2 capability.
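The quantization overhead mentioned above (computing scales, then casting to/from FP8) is easy to picture with a toy numpy simulation of the two scaling granularities. This is an illustrative sketch, not torchao's actual implementation; `E4M3_MAX`, `tensorwise_scale`, `rowwise_scale`, and `fake_fp8_roundtrip` are made-up names:

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite magnitude in the FP8 E4M3 format

def tensorwise_scale(x):
    # One scale for the whole tensor: cheapest to compute, coarsest.
    return E4M3_MAX / np.abs(x).max()

def rowwise_scale(x):
    # One scale per row: finer-grained, but more scaling work per matmul.
    return E4M3_MAX / np.abs(x).max(axis=1, keepdims=True)

def fake_fp8_roundtrip(x, scale):
    # Scale into FP8 range and clamp (standing in for the cast), then unscale.
    # A real cast would also round the mantissa down to FP8 precision.
    return np.clip(x * scale, -E4M3_MAX, E4M3_MAX) / scale
```

Rowwise produces one scale per row (a `(N, 1)` array for an `(N, K)` tensor) versus a single scalar for tensorwise, which is one way to see why its extra overhead can outweigh the matmul savings at small model scale.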
+
+**This attempt:** Use torchao's `convert_to_float8_training()` on ALL Linear layers, and increase the model size to d24. The core snippet is:
+
+```python
+from torchao.float8 import Float8LinearConfig, convert_to_float8_training
+config = Float8LinearConfig.from_recipe_name("tensorwise")
+convert_to_float8_training(model, config=config)
+```
+
+But in practice it's more involved (see base_train.py).
+
+### Results
+
+**Microbenchmark (d26 MLP, 65536x1664 @ 1664x6656):**
+
+| Method | Forward | Fwd+Bwd | Speedup |
+|--------|---------|---------|---------|
+| BF16 + compile | 2.00ms | 4.79ms | 1.00x |
+| FP8 rowwise + compile | 1.84ms | 4.55ms | 1.08x |
+| FP8 tensorwise + compile | 1.45ms | 4.06ms | **1.38x** |
+| FP8 rowwise (no compile) | 2.89ms | 21.86ms | 0.23x ❌ |
+
+torch.compile is MANDATORY. Without it, FP8 is ~4x slower due to unfused scaling ops.
+
+**Full training (d26):**
+
+| Config | tok/sec | vs baseline |
+|--------|---------|-------------|
+| BF16 baseline | 630K | 1.00x |
+| FP8 rowwise | 564K | 0.90x ❌ |
+| FP8 tensorwise | 740K | **1.17x** ✓ |
+
+Memory usage also decreases quite a bit, by ~9GB (activations are stored as FP8 instead of BF16).
+
+The 17% throughput speedup is encouraging, but we're not done yet: each step now runs at lower precision and is individually less effective, so we have to train longer to make up for the precision drop. Empirically, running some sweeps overnight at d24 scale, I saw that the actual speedup (when you match performance) is closer to 5%. It's possible that our LLMs at ~d24 scale are still too small to confidently enjoy the FP8 speedups reported for bigger models.
+
+### Key Learnings
+
+For nanochat at the approximate scale of interest (~GPT-2 capability, ~d24):
+
+1. **Tensorwise >> Rowwise** - Rowwise computes per-row scales and its overhead exceeds the benefit at this scale. Tensorwise uses one scale per tensor.
+2. **Filter small layers** - Layers with dims not divisible by 16 must be skipped (an FP8 hardware requirement).
+3. **Larger models benefit more** - d12 was still slower with FP8; d26+ shows gains. So at some depths FP8 helps and at others it doesn't. Keeping it configurable for now, passed in via kwargs and default off.
+4. **The effective, capability-matched speedup is lower still** - because each step is of slightly lower precision/quality.
+
+### Integration
+
+Added an `--fp8` flag to `base_train.py`; the default recipe is "tensorwise". Example of turning it on:
+
+```bash
+torchrun --nproc_per_node=8 -m scripts.base_train --depth=24 --fp8
+```
+
+Requires `torchao==0.15.0` (compatible with torch 2.9.1), which was added to the dependencies.
+
+**TLDR**: turning on fp8 for a GPT-2-capability nanochat model gives approx +5% capability-matched speedup.
+
+---
+
 ## 2026-01-29: Hyperball/MuonH Experiments (Negative Result)
 
 Explored Hyperball optimization from [this post](https://psychedelic-sunstone-851.notion.site/Fantastic-Pretraining-Optimizers-and-Where-to-Find-Them-2-1-Hyperball-Optimization-2e924306e6f280e7a5ffee00eb40a0dd) (saved to `knowledge/muonh.md`). Constrains weights to sphere of radius R (initial norm): `W_{t+1} = R · Normalize(W_t - η·R · Normalize(u_t))`. Had to change a number of details in a branch, e.g. not use zero init for our projections (or the initial norm would be zero), keep track of the initial norm, adjust Muon -> MuonH for the update.
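The Hyperball update rule quoted above is compact enough to sketch directly. A minimal numpy version (the `muonh_step` name is hypothetical; the real change lives in the Muon optimizer branch):

```python
import numpy as np

def muonh_step(W, u, lr, R):
    # W_{t+1} = R * Normalize(W_t - lr * R * Normalize(u_t)):
    # step against the normalized update direction u, then project
    # back onto the sphere of radius R (the initial weight norm).
    u_hat = u / np.linalg.norm(u)
    W_next = W - lr * R * u_hat
    return R * W_next / np.linalg.norm(W_next)
```

The projection keeps `||W||` pinned at R for the whole run, which is also why zero-initialized projections break the scheme: their radius R would be zero.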