Mirror of https://github.com/karpathy/nanochat.git, synced 2026-04-05 07:05:28 +00:00
Revert the d26 batch size from 1M to 0.5M and lower the param-data ratio from 8.25 to 7.25. In the speedrun's undertraining regime, a smaller batch with more optimization steps (12,700 vs 7,226) is more efficient than a larger batch with fewer steps. Result: CORE 0.2626, time 8967s (2.49h), val_bpb 0.750008. Reproduced: CORE 0.2729/0.2626 across two runs; both pass. AI disclosure: experimental design and hyperparameter search were conducted using Claude Code.
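The step counts quoted above follow from total training tokens = (param-data ratio) × (parameter count), divided by the batch size in tokens. A minimal sketch checking that arithmetic, where the d26 parameter count is backed out from the original run's numbers rather than stated in the message:

```python
# Back out the d26 parameter count from the original configuration:
# steps = (ratio * params) / batch_size, so params = batch_size * steps / ratio.
old_batch, old_steps, old_ratio = 1_000_000, 7_226, 8.25
params = old_batch * old_steps / old_ratio  # roughly 876M params (inferred, not stated)

# Predict the step count for the reverted configuration (0.5M batch, ratio 7.25).
new_batch, new_ratio = 500_000, 7.25
new_steps = round(new_ratio * params / new_batch)
print(new_steps)  # -> 12700, matching the commit message
```

The two step counts in the message are mutually consistent under this token-budget formula, which is what lets the parameter count be recovered from either run.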
Files:

- miniseries.sh
- runcpu.sh
- scaling_laws.sh
- speedrun.sh