From 5fdd5cdb246d2e82a1fcc05fd4c0468df824d875 Mon Sep 17 00:00:00 2001
From: Andrej Karpathy
Date: Thu, 5 Feb 2026 20:11:32 +0000
Subject: [PATCH] new leaderboard record via new auto-calculated optimal batch
 size. for d26 it is 1M, up from 0.5M that was default earlier

---
 README.md          |  1 +
 dev/LEADERBOARD.md | 32 +++++++++++++++++++++++++++++++-
 2 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 49110901..182d2730 100644
--- a/README.md
+++ b/README.md
@@ -19,6 +19,7 @@ For questions about the repo, I recommend either using [DeepWiki](https://deepwi
 | 0 | 168 hours | - | 0.2565 | Original OpenAI GPT-2 checkpoint | 2019 | - | OpenAI |
 | 1 | 3.04 | 0.74833 | 0.2585 | d24 baseline, slightly overtrained | Jan 29 2026 | 348fbb3 | @karpathy |
 | 2 | 2.91 | 0.74504 | 0.2578 | d26 slightly undertrained **+fp8** | Feb 2 2026 | a67eba3 | @karpathy |
+| 3 | 2.76 | 0.74645 | 0.2602 | bump total batch size to 1M tokens | Feb 5 2026 | 2c062aa | @karpathy |
 
 The primary metric we care about is "time to GPT-2" - the wall clock time needed to outperform the GPT-2 (1.6B) CORE metric on an 8XH100 GPU node. The GPT-2 CORE score is 0.256525. In 2019, the training of GPT-2 cost approximately $50,000 so it is incredible that due to many advances over 7 years across the stack, we can now do so much faster and for well below $100 (e.g. at the current ~$3/GPU/hr, an 8XH100 node is ~$24/hr, so 3 hours is ~$72).
 
diff --git a/dev/LEADERBOARD.md b/dev/LEADERBOARD.md
index 6c2720cb..b8a727fb 100644
--- a/dev/LEADERBOARD.md
+++ b/dev/LEADERBOARD.md
@@ -89,7 +89,7 @@ Detailed writeup: [Beating GPT-2 for <<$100: the nanochat journey](https://githu
 
 ## Run 2
 
-Achieved Feb 2 2026 on commit `8309b83`. The launch command was
+Achieved Feb 2 2026 on commit `a67eba3`. The launch command was
 
 ```
 OMP_NUM_THREADS=1 torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- \
@@ -117,3 +117,33 @@ Minimum validation bpb: 0.745036
 The big change in this run is `--fp8`, which causes all Linear layers (other than the gates) to be switched to fp8 training using `torchao` with tensorwise fp8 scaling. Each step is of slightly lower quality, but we are taking them a lot faster, coming out net ahead. Anyone who does not have fp8 (e.g. using a GPU without it) can simply leave out the `--fp8` flag to train in bfloat16. This will work just fine, and it will actually produce a slightly stronger model than the fp8 run because of the fp8 -> bf16 precision upgrade. It's possible that one can further tune which layers to include in the fp8 conversion and that e.g. some of the smaller matmuls should just be kept in bf16 etc.
 
 The previous record was 3.04 hours, so 2.91 hours is `(3.04 - 2.91)/3.04*100` ~= 4.3% speed improvement.
+
+## Run 3
+
+Achieved Feb 5 2026 on commit `2c062aa`. Launch command:
+
+```
+OMP_NUM_THREADS=1 torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- \
+    --depth=26 \
+    --run="d26_feb4_double_batch_ratio8.25" \
+    --model-tag="d26_feb4_double_batch_ratio8.25" \
+    --device-batch-size=16 \
+    --total-batch-size=1048576 \
+    --sample-every=-1 \
+    --save-every=-1 \
+    --core-metric-max-per-task=-1 \
+    --core-metric-every=999999 \
+    --target-param-data-ratio=8.25 \
+    --fp8
+```
+
+Result:
+
+```
+core_metric 0.26024
+step 7226
+total_training_time 9922
+Minimum validation bpb: 0.74645
+```
+
+The big change here is that the batch size was doubled from 0.5M to 1M, which works better for a d26 model and allowed me to decrease the number of optimization steps a bit via `--target-param-data-ratio` from 8.5 to 8.25. The TLDR is that the original batch size of 0.5M was tuned for d12, but bigger models (e.g. d26) prefer a larger total batch size. I determined in experiments that d26 prefers 1M. Then I implemented and merged a principled way to calculate the optimal batch size from model depth, so that nanochat models of all depths benefit. See [dev/LOG.md](dev/LOG.md) entry "2026-02-05: Auto Batch Size Scaling" for more detail.
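Editor's note on the auto-calculated batch size: the actual merged formula is in the dev/LOG.md entry cited in the patch and is not reproduced here. The sketch below is only a hypothetical rule that happens to be consistent with the two data points mentioned in this writeup (d12 prefers 0.5M tokens, d26 prefers 1M): scale the total batch size linearly with depth, then round to the nearest power of two.

```python
import math

def auto_total_batch_size(depth: int, base_depth: int = 12, base_tokens: int = 524288) -> int:
    """Hypothetical sketch (not nanochat's actual formula): scale total batch
    size linearly with model depth, anchored at base_depth -> base_tokens,
    then round to the nearest power of two."""
    raw = base_tokens * depth / base_depth
    return 2 ** round(math.log2(raw))

assert auto_total_batch_size(12) == 524288    # 0.5M, the old default tuned for d12
assert auto_total_batch_size(26) == 1048576   # 1M, the value used in Run 3
```

Under this shape of rule the batch size roughly doubles when depth doubles; whether the formula merged into nanochat is linear, a power law, or something else is not stated in this patch.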