From e527521a3fd91e8f3a2016b10db21e5742aa41fe Mon Sep 17 00:00:00 2001
From: Andrej Karpathy
Date: Thu, 5 Feb 2026 22:21:03 +0000
Subject: [PATCH] briefly mention batch ramp experimentation too, too weak to
 merge in my few attempts

---
 dev/LOG.md | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/dev/LOG.md b/dev/LOG.md
index 6ce4173..dec2c06 100644
--- a/dev/LOG.md
+++ b/dev/LOG.md
@@ -48,6 +48,10 @@ Also refactored model initialization to use `build_model_meta(depth)` helper and
 - [Brown et al., Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165)
 - [Merrill et al., The Batch Size–Critical Batch Size Myth](https://arxiv.org/abs/2505.23971)
+### One more thing (batch size ramp)
+
+Tried batch size ramping. The simplest implementation I could think of "tricks" the existing training loop by slicing each micro-batch into smaller pieces and calling optimizer.step() more frequently early in training (1/8 → 1/4 → 1/2 → full batch over the first x% of training, with sqrt LR scaling). Also required a torch.compile warmup phase to pre-compile all slice sizes and avoid recompilation spikes during training. While the idea is sound and small gains were observed, they weren't sufficient to justify the code complexity introduced (conditional slicing logic, warmup with state save/restore, etc.). Not merged for now.
+
 ---
 
 ## 2026-02-05: SwiGLU Activation (Negative Result)
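The ramp described in the added log entry could be sketched roughly as below. This is a minimal illustration, not the repo's actual code: the names `batch_ramp` and `slice_microbatch`, the `ramp_frac` window, and the four equal ramp stages are all assumptions filled in for the unspecified "first x% of training".

```python
import math

def batch_ramp(progress, ramp_frac=0.2):
    """Return (batch_fraction, lr_scale) at a training progress in [0, 1].

    Hypothetical sketch: the effective batch grows 1/8 -> 1/4 -> 1/2 -> full
    in four equal stages over the first `ramp_frac` of training, and the LR
    is scaled by sqrt(batch_fraction) as the log entry mentions.
    """
    if progress >= ramp_frac:
        return 1.0, 1.0
    stage = int(progress / ramp_frac * 4)          # stage index 0..3
    frac = (1 / 8, 1 / 4, 1 / 2, 1.0)[stage]
    return frac, math.sqrt(frac)

def slice_microbatch(examples, frac):
    """Split one micro-batch into 1/frac slices, so the training loop can
    call optimizer.step() once per slice (more frequent steps early on)."""
    n = max(1, round(len(examples) * frac))
    return [examples[i:i + n] for i in range(0, len(examples), n)]

# Early in training: 1/8-sized slices, LR scaled by sqrt(1/8).
frac, lr_scale = batch_ramp(0.0)
slices = slice_microbatch(list(range(8)), frac)   # 8 slices of 1 example
```

Since each slice size is a distinct tensor shape, `torch.compile` would recompile at every ramp transition; pre-compiling all four slice sizes up front (the warmup phase mentioned in the log) is one way to avoid those mid-run spikes.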