Commit Graph

102 Commits

Author SHA1 Message Date
Unsal Gokdag
c3f234cfca CORE eval: GPU-resident data, continuous pipeline, per-task progress bars
three independent improvements to the cached CORE evaluation path:

   1. GPU-resident data: all base-4 collated batches (~144MB for full CORE eval)
      are moved to GPU upfront via .to(device). eliminates all CPU→GPU transfers
      from the forward loop. _forward_all_cached replaces double-buffered prefetch
      with a simple upfront bulk transfer — .to() is a no-op when the caller has
      already preloaded tensors to GPU (as bench_core_eval now does).

   2. continuous cross-task pipeline: _forward_all_cached flattens all tasks'
      batches into one stream. the last batch of task N flows directly into the
      first batch of task N+1 with no pipeline restart. GPU-side composition via
      merge (pad+cat for bs > base) and split (row-slice for bs < base) avoids
      the CPU-side compose_collated bottleneck that made bs=8 slower than bs=4.

   3. progress bars + per-task result printing: both cached and first-run paths
      in evaluate_model now show a tqdm progress bar with the current task label.
      on_task_done callback in _forward_all_cached prints each task's accuracy
      as soon as its last batch is processed (single-GPU). DDP falls back to
      printing after all_reduce. both paths print total elapsed time at the end.

   bench_core_eval: preloads ALL base-4 batches to GPU once before the batch-size
   sweep. all sweep iterations compose from GPU-resident tensors with zero
   CPU→GPU transfers in the hot loop.
2026-02-13 07:54:53 +00:00
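The merge/split composition this commit describes can be sketched at the shape level in plain Python. These are hypothetical helpers standing in for the real tensor code (which does pad+cat and row-slicing on GPU-resident tensors); batches are lists of token rows, `PAD` is an assumed padding id:

```python
PAD = 0

def merge(batches):
    """Combine consecutive base batches: right-pad rows to the longest, then cat."""
    max_len = max(len(row) for b in batches for row in b)
    out = []
    for b in batches:
        for row in b:
            out.append(row + [PAD] * (max_len - len(row)))  # right-pad shorter rows
    return out

def split(batch, rows_per_piece):
    """Slice a merged batch back into smaller batches, trimming all-pad tail columns."""
    pieces = []
    for i in range(0, len(batch), rows_per_piece):
        piece = batch[i:i + rows_per_piece]
        # find the widest non-pad extent across rows (the vectorized analogue
        # in the commit is non_pad.any(dim=0))
        width = 0
        for row in piece:
            for j, tok in enumerate(row):
                if tok != PAD:
                    width = max(width, j + 1)
        pieces.append([row[:width] for row in piece])
    return pieces
```

Because examples are sorted by sequence length, consecutive base batches have similar lengths, so the padding added by `merge` stays small.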
Unsal Gokdag
7fa30f5ee3 CORE eval: disk-cached tokenized batches, double-buffered GPU transfers, batch composition, benchmark improvements
the main idea: tokenization + collation for CORE eval only needs to happen once per tokenizer.
      collated batches at base batch_size=4 are saved to disk (core_token_cache/), keyed by SHA-256
      of the tokenizer file. any batch_size can be served from these base-4 batches: larger sizes merge
      consecutive batches (right-pad shorter ones, cat along dim=0), smaller sizes split along example
      boundaries (trim trailing padding). this means prepare_task_data is truly a one-time cost.

      core_eval.py:
      - double-buffered CPU->GPU transfers in both forward paths (_forward_batches and evaluate_task's
        pipelined path). while GPU runs forward_model on batch N, batch N+1 is pin_memory()'d and
        DMA-transferred via non_blocking=True. the DMA engine and GPU compute units are separate
        hardware so they overlap. previously GPU idled during every transfer.
      - compose_collated(): merge base batches for larger batch_size (cat after right-padding to
        max_len), or split for smaller batch_size (slice along row boundaries from batch_meta,
        trim trailing padding via vectorized non_pad.any(dim=0)). works because examples are sorted
        by seq_len, so consecutive base batches have monotonically increasing lengths.
      - evaluate_task and _forward_batches accept optional pbar for progress tracking.

      base_eval.py:
      - evaluate_model now has 3-tier caching: in-memory (_batch_cache, across calls within same
        process), disk load (core_token_cache/, on first call when in-memory is empty), disk save
        (after first run's prepare+collate+forward, writes collated batches so future training runs
        and the benchmark skip tokenization entirely). keyed by tokenizer file hash + max_per_task.

      bench_core_eval.py:
      - cached sweep no longer re-runs the full first-run sweep to build collated data (was 2x the
        work for no reason). instead loads/builds base-4 cache once, then compose_collated serves
        any target batch_size. cached sweep only varies batch_size (no queue_size — no collation thread).
      - --skip-first: skip the first-run sweep entirely if disk cache exists. if cache is missing,
        runs a single bs=4 eval in minimal time to create it, then proceeds to cached sweep.
      - tqdm progress bars everywhere: old sequential baseline (per-example with task name),
        first-run sweep (double bar: outer=combo progress, inner=per-example), cache building
        (per-task), cached sweep (double bar). task names left-padded to max label length so the
        bar doesn't shift.
      - tokenizer identity via file_checksum (SHA-256 of tokenizer.pkl/tokenizer.json on disk),
        not encode-output hashing. HF models fall back to hashing the repo name.
2026-02-12 22:34:23 +00:00
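The disk-cache idea from this commit can be sketched as follows. All names here are hypothetical stand-ins for the real code in base_eval.py/core_eval.py: collated batches are pickled under a key derived from the SHA-256 of the tokenizer file plus `max_per_task`, so any later run with the same tokenizer skips tokenization entirely:

```python
import hashlib, os, pickle

def file_checksum(path: str) -> str:
    """SHA-256 of a file on disk, read in 1MB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def cache_path(cache_dir: str, tokenizer_file: str, max_per_task: int) -> str:
    key = f"{file_checksum(tokenizer_file)}-{max_per_task}"
    return os.path.join(cache_dir, key + ".pkl")

def load_or_build(cache_dir, tokenizer_file, max_per_task, build_fn):
    path = cache_path(cache_dir, tokenizer_file, max_per_task)
    if os.path.exists(path):                 # disk hit: skip tokenize + collate
        with open(path, "rb") as f:
            return pickle.load(f)
    batches = build_fn()                     # first run: pay the one-time cost
    os.makedirs(cache_dir, exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(batches, f)              # future runs load this instead
    return batches
```

Hashing the tokenizer file itself (rather than encode outputs) makes the key cheap and deterministic across processes.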
unsalgokdag
8695280566 speed up CORE metric evaluation: batched GPU forward passes, threaded CPU prep, cross-call caching. first eval pipelines tokenization on a background thread while GPU processes the previous batch. second+ evals skip tokenization and collation entirely, only GPU forward passes remain. Also adds a benchmark script to sweep batch_size and queue_size hyperparameters. 2026-02-12 18:13:56 +01:00
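The pipelining described here — CPU prep of batch N+1 overlapping GPU work on batch N — follows a standard bounded producer/consumer pattern. A minimal stdlib sketch (hypothetical names; the real code's "GPU forward" is replaced by an arbitrary `forward` callable):

```python
import queue, threading

def pipelined_eval(prepare_batch, forward, num_batches, queue_size=2):
    """Run prepare_batch(i) on a background thread while the main thread
    consumes batches; the bounded queue provides backpressure."""
    q = queue.Queue(maxsize=queue_size)

    def producer():
        for i in range(num_batches):
            q.put(prepare_batch(i))  # CPU prep overlaps with the consumer
        q.put(None)                  # sentinel: no more batches

    threading.Thread(target=producer, daemon=True).start()
    results = []
    while (batch := q.get()) is not None:
        results.append(forward(batch))  # "GPU" work on the already-prepared batch
    return results
```

`queue_size` is exactly the hyperparameter the benchmark script sweeps: too small and the consumer stalls, too large and prep runs far ahead for no benefit.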
Andrej Karpathy
2f09686724 clarify that this is bf16 mfu we're talking about 2026-02-10 23:35:00 +00:00
Andrej Karpathy
e569b59f92 delete torchao dependency, create our own exact API-matched version of Float8Linear, document it very well. for some poorly understood reason, the performance is not only ~identical but actually runs 3% faster, despite it being significantly simpler and much less code. i don't fully understand why/how atm 2026-02-10 18:46:39 +00:00

Andrej Karpathy
aeff095e97 better comments/flow on all the hyperparameter transfer stuff, and change the WD scaling from my empirical 1/d^2 to a bit more principled version based on Tepoch. All of that theory is based on AdamW and could be suboptimal for Muon 2026-02-06 19:22:28 +00:00
Andrej Karpathy
2c062aaa94 nit: don't mutate args, create new var for total_batch_size 2026-02-05 19:59:46 +00:00
Andrej Karpathy
f41dd3cbd7 auto-calculate optimal batch size. the original setting of 0.5M was only optimal for d12, but d26 prefers 1M and so on 2026-02-05 19:40:37 +00:00
Andrej Karpathy
6079f78fc3 add fp8 training with torchao 2026-02-03 21:03:42 +00:00
Andrej Karpathy
8ebc14b348 small touchups to the eval script, re-order items etc, cosmetic 2026-02-03 21:03:42 +00:00
Sofie Van Landeghem
72b9064f9d
remove leftover mid references (#491) 2026-02-02 08:33:46 -08:00
Andrej Karpathy
07c4dd4cd9 manually control the over-active garbage collector, save a small few minutes from a typical run 2026-02-02 01:44:30 +00:00
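Manually controlling the garbage collector, as this commit does, usually means disabling the automatic collector so it never pauses a training step at an arbitrary moment, then collecting explicitly on a schedule. A minimal sketch (the real code's details may differ):

```python
import gc

def train_with_manual_gc(step_fn, num_steps, gc_every=1000):
    gc.disable()                 # stop automatic generational collection
    try:
        for step in range(num_steps):
            step_fn(step)
            if step % gc_every == 0:
                gc.collect()     # pay the pause at a moment we choose
    finally:
        gc.enable()              # restore normal behavior afterwards
```

Spread over a long run, removing the collector's unpredictable pauses from the hot loop is where the "small few minutes" come from.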
Andrej Karpathy
8b4849d548 fix bug in chat_sft, the attention window must be preserved sigh 2026-02-01 20:58:44 +00:00
Andrej Karpathy
eaf49a33c8 fix path which i think was modified during the refactor and this is a bug introduced by claude i believe 2026-02-01 20:15:19 +00:00
Andrej Karpathy
31b61d2d17 fix broken import sigh 2026-02-01 05:03:44 +00:00
Andrej Karpathy
0307997f9b merge two files base_loss and base_eval into a single file, it's nicer this way, and unify the huggingface code associated with both 2026-02-01 02:36:43 +00:00
Andrej Karpathy
1ddaad1c1c nuke midtraining from orbit, it's not as needed now that we have a BOS-aligned dataloader. Also change the README a lot. midtraining is not yet fully erased across the board, but good enough for step 1 2026-01-31 19:12:25 +00:00
Andrej Karpathy
348fbb301b fix dataloader for midtrain to never crop data. we can't just throw it away like we do in pretraining 2026-01-31 18:21:36 +00:00
Andrej Karpathy
3c3a3d7042 warmdown of 0.5 is slightly better: 2026-01-31 01:08:44 +00:00
Aarushi Singh
ace6740bdd
feat: allow top_k=0 in web api to disable filtering (#458)
* allow top_k=0 in web api to disable filtering

* adding a comment for clear reasoning

* adding change to docstring
2026-01-30 09:21:41 -08:00
Andrej Karpathy
41bb2eac32 Combine AdamW and Muon into single MuonAdamW optimizer, cleaner, ty @chrisjmccormick for idea/help 2026-01-29 00:52:08 +00:00
Andrej Karpathy
c88bbf8133 Merge branch 'engram' 2026-01-27 22:33:16 +00:00
Andrej Karpathy
c8d93beed2 add engram-lite, add log, tune scaling laws analysis scripts 2026-01-27 22:31:17 +00:00
Andrej Karpathy
8630d32be4 quick fix to not OOM main speedrun script 2026-01-26 22:31:42 +00:00
Andrej Karpathy
59e36cc727 first version of engram following modded nanogpt style 2026-01-25 18:59:51 +00:00
Andrej Karpathy
a91743c168 Merge branch 've' 2026-01-18 15:14:39 +00:00
Andrej Karpathy
cf5c9e5b8e resolve a crash for odd depths because FA3 needs head_dim % 8 == 0 2026-01-18 00:07:08 +00:00
Andrej Karpathy
413e91aa0f optimal ratio is now around 4 2026-01-17 23:51:09 +00:00
karpathy
f9a7e0f111 update the CPU/MPS script to give reasonable results. The model can at least answer that Paris is the capital of France and knows that the sky is blue, for about 40 minutes of training on my macbook. Also fixed a bug that existed due to KVCache bfloat16 dtype assumption 2026-01-17 12:27:30 -08:00
Andrej Karpathy
2955650327 add detection of device to report more correct mfu for bf16 2026-01-17 03:16:14 +00:00
Nitish Pandey
f42ae9e901
fix condition to perform bpb evaluation (#324)
Co-authored-by: svlandeg <svlandeg@github.com>
2026-01-16 18:56:43 -08:00
Andrej Karpathy
8203efa919 implement flash attention 3 fallback to pytorch sdpa by touching as few lines of code as possible in main files and keeping all implementation to a single file. add tests. add helpful warning messages for the user. 2026-01-16 17:37:51 +00:00
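The fallback this commit implements is the classic import-guard pattern: try the fast kernel at import time, and if it is unavailable, return a baseline implementation and warn the user once, keeping callers oblivious. A sketch with hypothetical module/function names (the FA3 import path is an assumption, and the fallback body is a stub rather than a real sdpa call):

```python
import warnings

def pick_attention():
    try:
        from flash_attn_interface import flash_attn_func  # FA3, if installed
        return flash_attn_func
    except ImportError:
        warnings.warn("flash attention 3 unavailable, falling back to pytorch sdpa")
        def sdpa_fallback(q, k, v):
            # stand-in: the real fallback would call
            # torch.nn.functional.scaled_dot_product_attention(q, k, v)
            raise NotImplementedError("wire up torch sdpa here")
        return sdpa_fallback
```

Resolving the choice once and handing back a callable keeps the implementation to a single file, as the commit intends.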
Haoyu Wang
50413d2d67
typo in comments: change "GAPO" to "DAPO" 2026-01-15 22:03:42 -08:00
Sofie Van Landeghem
d4ea28d4e2
Fix args in readme (#438)
* fix commands in readme, using new arg format

* fix typo

* add required -i flag to chat_eval example runs
2026-01-15 16:26:38 -08:00
Andrej Karpathy
bdcc030ffa oops legacy spurious line now 2026-01-15 23:32:20 +00:00
Andrej Karpathy
255f8b9af6 cleanly separate cpu and gpu sections 2026-01-15 23:30:11 +00:00
Andrej Karpathy
7312ec9898 fix buggy midtrain and update all kwargs to be idiomatic. that is, argparse uses dashes, variables use underscores. the underscores are just a remnant of the previous Configurator object. This is the right way 2026-01-13 22:45:27 +00:00
Andrej Karpathy
3b50b77ed3 fix base_loss to report correct loss by switching the dataloader to the new default 2026-01-13 22:09:36 +00:00
Andrej Karpathy
43c29dd9d5 Big DataLoader refactor: BOS-aligned dataloaders with epoch tracking for pre/mid-training
The new DataLoader ensures that every token sequence in train/val batches has a BOS token
at the beginning. Therefore, no token streams start abruptly in the middle of a document,
which could be confusing for the model. Note that this changes the loss scale because there
are fewer confusing tokens in the train/val batches. The main downside is that we now waste
about 35% of tokens due to cropping. This is ok because we have a lot of data. See dev/LOG.md
entry for this change for a lot more information.
2026-01-13 20:05:47 +00:00
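A toy sketch of the BOS-aligned packing idea (hypothetical function; see dev/LOG.md for the real design): every row starts at a document boundary, a document that would run past the end of the current row is cropped, and the cropped tail is exactly the token waste the commit mentions. `BOS`/`PAD` values here are placeholders:

```python
BOS, PAD = -1, 0

def pack_bos_aligned(docs, row_len):
    """Pack token documents into fixed-length rows that always begin with BOS.
    Overflowing documents are cropped (tail discarded), never wrapped into the
    middle of the next row."""
    rows, row, wasted = [], [], 0
    for doc in docs:
        tokens = [BOS] + doc
        if len(row) + len(tokens) <= row_len:
            row += tokens                      # fits: keep packing this row
        else:
            space = row_len - len(row)
            row += tokens[:space]              # crop the document to fill the row
            wasted += len(tokens) - space      # discarded tail = the ~35% waste
            rows.append(row)
            row = []                           # next doc starts a fresh row
        if len(row) == row_len:                # row exactly full: close it out
            rows.append(row)
            row = []
    if row:
        rows.append(row + [PAD] * (row_len - len(row)))
    return rows, wasted
```

Since no row ever begins mid-document, the model never sees a sequence that "starts abruptly", at the cost of the discarded cropped tokens.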
Andrej Karpathy
21608ec51e allow base_loss to report the loss of any arbitrary huggingface model similar to base_eval. had to change dataloader to be a lot better and just take tokenizer, not load the nanochat one. much better this way anyway 2026-01-12 03:10:13 +00:00
Andrej Karpathy
b33e394528 oops actually make SSSL the default window pattern 2026-01-11 21:50:35 +00:00
Andrej Karpathy
fbc1484e8c add alternating window size patterns for the GPT layers, following GPT-3. Experimented a bit and found the pattern SSSL to work well - 3 short, 1 long alternating. This is now the new default and the plots look quite a bit better on flops vs. bpb 2026-01-11 21:49:54 +00:00
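The SSSL pattern can be sketched as a per-layer window-size schedule. This is a hypothetical helper with made-up window sizes (1024/4096) and a made-up convention that the final layer always gets the long window; only the repeating 3-short/1-long pattern comes from the commit:

```python
def layer_windows(num_layers, short=1024, long=4096, pattern="SSSL"):
    """Assign each layer a window size by cycling the pattern over depth."""
    sizes = [short if c == "S" else long for c in pattern]
    out = [sizes[i % len(sizes)] for i in range(num_layers)]
    out[-1] = long  # assumption: end the stack on a long-window layer
    return out
```

Short-window layers cut attention FLOPs while the periodic long-window layers preserve long-range information flow, which is where the flops-vs-bpb win comes from.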
Andrej Karpathy
aa530cdad5 Add learnable lambdas that gate the residual connection and a skip connection to the input embeddings, solid bump to val_bpb 2026-01-11 18:47:35 +00:00
Andrej Karpathy
2c4473dd1b Big Muon optimizer changes inspired by latest of modded-nanogpt. Added Polar Express, Adafactor-style variance reduction, cautious weight decay, schedule weight decay linearly to ramp down to zero. Tuned optimum weight decay for multiple model sizes d8, d12, d16, d20 and found a scaling law with optimum wd \propto 1/channels^2, including it as default into code. --weight_decay of base_train is now default on and configured optimally according to all of these experiments. Solid bump to val_bpb observed as a result of these changes. 2026-01-11 16:56:59 +00:00
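The wd ∝ 1/channels² scaling law mentioned here means one measured optimum is enough to set a default for every model size. A toy illustration with hypothetical reference constants (the actual anchor point is not stated in the log):

```python
def default_weight_decay(channels, ref_channels=768, ref_wd=0.1):
    """Scale a reference weight decay by the stated 1/channels^2 law.
    ref_channels/ref_wd are illustrative, not the repo's tuned values."""
    return ref_wd * (ref_channels / channels) ** 2
```

Doubling the channel count quarters the default weight decay under this law.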
Andrej Karpathy
061f83c152 delete grad_clip. appears to not be necessary at all. not only was it buggy because the clipping happened per gpu before grad synchronization, but it costs ~2% MFU, and it also doesn't even help. I tried deleting it a while ago and back then it did help. So I'm guessing that some hyperparameter tuning obviated the reason for it since then 2026-01-08 02:16:50 +00:00
Andrej Karpathy
ccf4b7f9bf nudge hyperparameters of the base script with the results of the sweeps and miniseries. vocab size down to 32K. D:N ratio from 20 to 8. add miniseries script 2026-01-07 22:11:59 +00:00
Adria Blancafort
1b5de29e71
Fix undefined variable in chat_rl after recent refactor
* Fix undefined variable

* Remove unused import

Remove unused import 're' from chat_rl.py
2026-01-07 09:08:57 -08:00
Andrej Karpathy
ae0bf52529 tune hyperparameters based on overnight sweeps. warmdown_ratio is the biggest free win, increasing 0.2 -> 0.4, and embedding lr can be larger bumping 0.2 -> 0.3 2026-01-05 18:57:46 +00:00
Andrej Karpathy
9d4c9b786d many small fixes to base_train: reporting ETA, allowing some additional kwarg flexibility, making sure we don't crash when e.g. depth = 11 - we now calculate the closest num_heads that works 2026-01-05 00:38:09 +00:00
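The "closest num_heads that works" calculation can be sketched as a small search. This is a hypothetical version of the check: a valid head count must divide the model dimension evenly, and (per the FA3 constraint noted elsewhere in this log) the resulting head_dim should be a multiple of 8. It assumes at least one such divisor exists:

```python
def closest_num_heads(model_dim, target_heads):
    """Pick the valid head count nearest the target. Valid means: heads divide
    model_dim evenly and head_dim (= model_dim // heads) is a multiple of 8."""
    candidates = [h for h in range(1, model_dim + 1)
                  if model_dim % h == 0 and (model_dim // h) % 8 == 0]
    return min(candidates, key=lambda h: abs(h - target_heads))
```

Snapping to the nearest valid configuration instead of crashing lets odd depths like 11 run without manual tuning.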
Andrej Karpathy
eb7bbc1b66 delete the configurator in favor of argparse and clean up a lot of kwarg details to make them more consistent across all scripts 2026-01-04 19:14:23 +00:00