mirror of https://github.com/karpathy/nanochat.git (synced 2026-04-04 22:55:27 +00:00)
Merge 00932d1955 into 8180e1d8c1 (commit 9b686ac8ab)
@@ -17,6 +17,7 @@ Presently, the main focus of development is on tuning the pretraining stage, whi
| # | Time (hrs) | val bpb | CORE | Description | Date | Commit | Contributor |
|---|------------|---------|------|-------------|------|--------|-------------|
| 1 | 3.04 | 0.74833 | 0.2585 | d24 baseline, slightly overtrained | Jan 29 2026 | 348fbb3 | @karpathy |
| 2 | 2.91 | 0.74504 | 0.2578 | d26 slightly undertrained **+fp8** | Feb 2 2026 | a67eba3 | @karpathy |
| 3 | 2.76 | 0.74645 | 0.2602 | bump total batch size to 1M tokens | Feb 5 2026 | 2c062aa | @karpathy |
| 4 | 2.49 | 0.75001 | 0.2626 | d26 0.5M batch, ratio 7.25 | Feb 8 2026 | TBD | @imxj |
The primary metric we care about is "time to GPT-2": the wall clock time needed to outperform the GPT-2 (1.6B) CORE metric on an 8XH100 GPU node. The GPT-2 CORE score is 0.256525. In 2019, training GPT-2 cost approximately $43,000, so it is remarkable that, thanks to many advances across the stack over the intervening 7 years, we can now do the same much faster and for well below $100 (e.g. at the current ~$3/GPU/hr, an 8XH100 node is ~$24/hr, so 3 hours is ~$72).
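As a quick sanity check on that cost estimate (the rates are the approximations quoted above, not measured prices):

```python
# Back-of-envelope cost check for the speedrun, using the rates quoted above.
gpu_hourly_rate = 3.0   # ~$/GPU/hr, as quoted in the text
gpus = 8                # one 8XH100 node
run_hours = 3.0         # approximate speedrun wall clock

node_hourly_rate = gpu_hourly_rate * gpus
total_cost = node_hourly_rate * run_hours

print(node_hourly_rate)  # 24.0  ($/hr for the node)
print(total_cost)        # 72.0  (~$72 per speedrun)
```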
@@ -147,3 +147,41 @@ Minimum validation bpb: 0.74645
```
The big change here is that the batch size was doubled from 0.5M to 1M tokens, which works better for a d26 model and allowed me to decrease the number of optimization steps a bit by lowering `--target-param-data-ratio` from 8.5 to 8.25. The TLDR is that the original batch size of 0.5M was tuned for d12, but bigger models (e.g. d26) prefer a larger total batch size; in experiments I determined that d26 prefers 1M. I then implemented and merged a principled way to calculate the optimal batch size given depth, so that nanochat models of all depths benefit. See [dev/LOG.md](dev/LOG.md) entry "2026-02-05: Auto Batch Size Scaling" for more detail.
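The actual scaling rule lives in dev/LOG.md. Purely as an illustration (the function below is hypothetical, not nanochat's implementation), one could interpolate between the two anchor points mentioned in the text, 0.5M tokens at d12 and 1M tokens at d26, and round to a power of two:

```python
import math

# Hypothetical illustration of depth-dependent batch sizing; NOT the
# actual nanochat rule (see dev/LOG.md "Auto Batch Size Scaling").
def suggested_total_batch_size(depth: int) -> int:
    # Linear interpolation between the anchors from the text:
    # d12 -> 0.5M tokens (2**19), d26 -> 1M tokens (2**20).
    tokens = 524288 + (depth - 12) * (1048576 - 524288) / (26 - 12)
    # Round to the nearest power of two, matching the 2**19 / 2**20 anchors.
    return 2 ** round(math.log2(tokens))

print(suggested_total_batch_size(12))  # 524288  (0.5M)
print(suggested_total_batch_size(26))  # 1048576 (1M)
```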
## Run 4
Achieved Feb 8 2026. Launch command:
```
OMP_NUM_THREADS=1 torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- \
    --depth=26 \
    --run="d26_r725_05M" \
    --model-tag="d26_r725_05M" \
    --device-batch-size=16 \
    --total-batch-size=524288 \
    --sample-every=-1 \
    --save-every=-1 \
    --core-metric-max-per-task=-1 \
    --core-metric-every=999999 \
    --target-param-data-ratio=7.25 \
    --fp8
```
Result:
```
core_metric 0.2626
step 12700
total_training_time 8967
Minimum validation bpb: 0.750008
```
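These numbers imply a token throughput as follows (derived from the log above; the script does not print these figures directly):

```python
# Derive total tokens and throughput from the Run 4 log above.
steps = 12700
total_batch_size = 524288   # tokens per optimization step
training_seconds = 8967

total_tokens = steps * total_batch_size
tokens_per_second = total_tokens / training_seconds

print(f"{total_tokens / 1e9:.2f}B tokens")         # ~6.66B total
print(f"{tokens_per_second / 1e6:.2f}M tokens/s")  # ~0.74M tokens/s
```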
Reproduced twice: Run 1 CORE 0.2729 (8985s), Run 2 CORE 0.2626 (8967s). Both comfortably clear the 0.256525 threshold.
The key change is reverting the total batch size for d26 from 1M back to 0.5M (524,288 tokens) and lowering the param-data ratio from 8.25 to 7.25. While Run 3 showed that the auto-computed 1M batch is optimal for *compute-optimal* training of d26, "compute-optimal" and "speedrun-optimal" are different objectives. In the speedrun's undertraining regime, the model benefits more from additional optimization steps at smaller batch size (12,700 steps at 0.5M) than from seeing more data at larger batch size (7,226 steps at 1M). Direct evidence: at ratio 7.5, the 1M batch fails CORE (0.2534) while 0.5M passes (0.2577), despite training on the same total tokens.
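To make the step-count tradeoff concrete: at a fixed token budget, halving the batch size doubles the number of optimization steps. A quick illustration using Run 4's budget:

```python
# Same token budget split across the two candidate batch sizes:
# smaller batches buy more optimization steps per token.
token_budget = 12700 * 524288   # Run 4's total tokens (~6.66B)

steps_small = token_budget // 524288    # 0.5M-token batch
steps_large = token_budget // 1048576   # 1M-token batch

print(steps_small)  # 12700
print(steps_large)  # 6350
```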
We swept ratios extensively at 0.5M batch. CORE scores exhibit run-to-run variance of ±0.01-0.02 (e.g. r7.125 produced 0.2571, 0.2472, and 0.2406 across three identical runs). Ratio 7.25 was chosen for its reproducibility — both runs clear the threshold with comfortable margin.
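For reference, the spread of the three r7.125 runs quoted above can be summarized as follows (computed directly from those scores):

```python
import statistics

# CORE scores from three identical r7.125 runs quoted above.
scores = [0.2571, 0.2472, 0.2406]

mean = statistics.mean(scores)
spread = max(scores) - min(scores)

print(f"mean {mean:.4f}")      # 0.2483
print(f"spread {spread:.4f}")  # 0.0165, consistent with the ±0.01-0.02 claim
```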
The previous record was 2.76 hours, so 2.49 hours is a `(2.76 - 2.49)/2.76*100` ~= 9.8% speed improvement.
AI disclosure: experimental design and hyperparameter search were conducted using Claude Code.
@@ -69,8 +69,8 @@ python -m scripts.tok_eval
 echo "Waiting for dataset download to complete..."
 wait $DATASET_DOWNLOAD_PID
 
-# d24 model (slightly overtrained is enough to beat GPT-2 => increase data:params ratio from compute optimal 10.5 (default) to 12)
-torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- --depth=26 --target-param-data-ratio=8.25 --device-batch-size=16 --fp8 --run=$WANDB_RUN
+# d26 model with 0.5M batch size and ratio 7.25 (faster than 1M batch for the speedrun regime)
+torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- --depth=26 --target-param-data-ratio=7.25 --total-batch-size=524288 --device-batch-size=16 --fp8 --run=$WANDB_RUN
 # evaluate the model: CORE metric, BPB on train/val, and draw samples
 torchrun --standalone --nproc_per_node=8 -m scripts.base_eval -- --device-batch-size=16