Commit Graph

17 Commits

Author SHA1 Message Date
Unsal Gokdag
4f79e750e7 CORE eval: batched forwarding by default, per-example mode for verification
Switch cached eval path to batched=True (forwards full collated batches)
      for ~5-7x speedup over sequential per-example evaluation. Add per-example
      forwarding mode (batched=False) that trims collation padding to recover
      exact per-example tensor shapes, guaranteeing identical results to the
      old sequential path. Bench script uses batched=True for speed sweeps and
      per-example mode for correctness verification against the old path.
2026-02-13 08:42:45 +00:00
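The padding trim described above can be sketched as follows. This is a minimal illustration, not the repo's actual API: `split_batch_per_example` and its arguments are hypothetical names.

```python
import torch

def split_batch_per_example(input_ids: torch.Tensor, lengths: list[int]) -> list[torch.Tensor]:
    # Undo collation: slice each row back to its true length so the
    # batched=False path sees exactly the tensors the old sequential
    # per-example loop would have built (no trailing pad tokens).
    return [input_ids[i : i + 1, : n] for i, n in enumerate(lengths)]

# a 2-example batch right-padded to length 5
batch = torch.tensor([[1, 2, 3, 0, 0],
                      [4, 5, 6, 7, 8]])
parts = split_batch_per_example(batch, [3, 5])
print([tuple(p.shape) for p in parts])  # [(1, 3), (1, 5)]
```

Because the per-example tensors are bit-identical to what the sequential path would produce, the two modes must agree exactly, which is what makes batched=False usable as a correctness oracle.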
Unsal Gokdag
c3f234cfca CORE eval: GPU-resident data, continuous pipeline, per-task progress bars
three independent improvements to the cached CORE evaluation path:

   1. GPU-resident data: all base-4 collated batches (~144MB for full CORE eval)
      are moved to GPU upfront via .to(device), eliminating all CPU→GPU transfers
      from the forward loop. _forward_all_cached replaces double-buffered prefetch
      with a simple upfront bulk transfer — .to() is a no-op when the caller has
      already preloaded tensors to GPU (as bench_core_eval now does).

   2. continuous cross-task pipeline: _forward_all_cached flattens all tasks'
      batches into one stream. the last batch of task N flows directly into the
      first batch of task N+1 with no pipeline restart. GPU-side composition via
      merge (pad+cat for bs > base) and split (row-slice for bs < base) avoids
      the CPU-side compose_collated bottleneck that made bs=8 slower than bs=4.

   3. progress bars + per-task result printing: both cached and first-run paths
      in evaluate_model now show a tqdm progress bar with the current task label.
      on_task_done callback in _forward_all_cached prints each task's accuracy
      as soon as its last batch is processed (single-GPU). DDP falls back to
      printing after all_reduce. both paths print total elapsed time at the end.

   bench_core_eval: preloads ALL base-4 batches to GPU once before the batch-size
   sweep. all sweep iterations compose from GPU-resident tensors with zero
   CPU→GPU transfers in the hot loop.
2026-02-13 07:54:53 +00:00
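The GPU-side composition in item 2 can be sketched roughly like this; the helper names are hypothetical and the real compose logic also carries per-batch metadata, but the pad+cat / row-slice shape of it is as described:

```python
import torch
import torch.nn.functional as F

def merge_collated(a: torch.Tensor, b: torch.Tensor, pad_id: int = 0) -> torch.Tensor:
    # Merge two right-padded batches entirely on-device: right-pad the
    # shorter one to the longer seq_len, then cat along the batch dim.
    max_len = max(a.size(1), b.size(1))
    a = F.pad(a, (0, max_len - a.size(1)), value=pad_id)
    b = F.pad(b, (0, max_len - b.size(1)), value=pad_id)
    return torch.cat([a, b], dim=0)

def split_collated(batch: torch.Tensor, rows: int) -> tuple[torch.Tensor, torch.Tensor]:
    # Split along example boundaries: a pure row-slice, no host round-trip.
    return batch[:rows], batch[rows:]

a = torch.tensor([[1, 2, 0]])     # bs=1, right-padded to len 3
b = torch.tensor([[3, 4, 5, 6]])  # bs=1, len 4
merged = merge_collated(a, b)
print(merged.shape)  # torch.Size([2, 4])
```

Since both ops run on tensors that are already device-resident, composing a bs=8 stream from base-4 batches costs a pad and a cat on the GPU rather than a trip through CPU-side collation.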
Unsal Gokdag
7fa30f5ee3 CORE eval: disk-cached tokenized batches, double-buffered GPU transfers, batch composition, benchmark improvements
the main idea: tokenization + collation for CORE eval only needs to happen once per tokenizer.
      collated batches at base batch_size=4 are saved to disk (core_token_cache/), keyed by SHA-256
      of the tokenizer file. any batch_size can be served from these base-4 batches: larger sizes merge
      consecutive batches (right-pad shorter ones, cat along dim=0), smaller sizes split along example
      boundaries (trim trailing padding). this means prepare_task_data is truly a one-time cost.

      core_eval.py:
      - double-buffered CPU->GPU transfers in both forward paths (_forward_batches and evaluate_task's
        pipelined path). while GPU runs forward_model on batch N, batch N+1 is pin_memory()'d and
        DMA-transferred via non_blocking=True. the DMA engine and GPU compute units are separate
        hardware so they overlap. previously GPU idled during every transfer.
      - compose_collated(): merge base batches for larger batch_size (cat after right-padding to
        max_len), or split for smaller batch_size (slice along row boundaries from batch_meta,
        trim trailing padding via vectorized non_pad.any(dim=0)). works because examples are sorted
        by seq_len, so consecutive base batches have monotonically increasing lengths.
      - evaluate_task and _forward_batches accept optional pbar for progress tracking.

      base_eval.py:
      - evaluate_model now has 3-tier caching: in-memory (_batch_cache, across calls within same
        process), disk load (core_token_cache/, on first call when in-memory is empty), disk save
        (after first run's prepare+collate+forward, writes collated batches so future training runs
        and the benchmark skip tokenization entirely). keyed by tokenizer file hash + max_per_task.

      bench_core_eval.py:
      - cached sweep no longer re-runs the full first-run sweep to build collated data (was 2x the
        work for no reason). instead loads/builds base-4 cache once, then compose_collated serves
        any target batch_size. cached sweep only varies batch_size (no queue_size — no collation thread).
      - --skip-first: skip the first-run sweep entirely if disk cache exists. if cache is missing,
        runs a single bs=4 eval in minimal time to create it, then proceeds to cached sweep.
      - tqdm progress bars everywhere: old sequential baseline (per-example with task name),
        first-run sweep (double bar: outer=combo progress, inner=per-example), cache building
        (per-task), cached sweep (double bar). task names left-padded to max label length so the
        bar doesn't shift.
      - tokenizer identity via file_checksum (SHA-256 of tokenizer.pkl/tokenizer.json on disk),
        not encode-output hashing. HF models fall back to hashing the repo name.
2026-02-12 22:34:23 +00:00
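The double-buffered transfer pattern in core_eval.py can be sketched like this (a simplified sketch with hypothetical names: the loop structure mirrors the commit's description, while the real code lives inside _forward_batches and evaluate_task):

```python
import torch

def to_device(t: torch.Tensor, device: torch.device) -> torch.Tensor:
    # Pinned host memory lets the DMA engine copy asynchronously
    # (non_blocking=True) while the GPU's compute units keep working.
    if device.type == "cuda":
        t = t.pin_memory()
    return t.to(device, non_blocking=True)

def forward_all(batches, model, device):
    # Double buffering: issue the transfer for batch N+1 *before* running
    # the forward for batch N, so copy and compute overlap instead of the
    # GPU idling during every transfer.
    results = []
    nxt = to_device(batches[0], device)
    for i in range(len(batches)):
        cur = nxt
        if i + 1 < len(batches):
            nxt = to_device(batches[i + 1], device)  # overlaps the forward below
        results.append(model(cur))
    return results

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
batches = [torch.ones(4, 8) * k for k in range(3)]
sums = forward_all(batches, lambda x: x.sum().item(), device)
print(sums)  # [0.0, 32.0, 64.0]
```

On a CPU-only machine the pinning step is skipped and the loop degrades gracefully to plain synchronous copies.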
unsalgokdag
8695280566 speed up CORE metric evaluation: batched GPU forward passes, threaded CPU prep, cross-call caching. first eval pipelines tokenization on a background thread while the GPU processes the previous batch. second+ evals skip tokenization and collation entirely; only GPU forward passes remain. Also adds a benchmark script to sweep the batch_size and queue_size hyperparameters. 2026-02-12 18:13:56 +01:00
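The producer/consumer shape of that first-eval pipeline can be sketched as below. A minimal sketch under assumed names (`pipelined_eval`, `prepare`, `forward` are illustrative, not the repo's API); the bounded queue plays the role of the queue_size hyperparameter the benchmark sweeps:

```python
import queue
import threading

def pipelined_eval(examples, prepare, forward, queue_size=4):
    # First-eval pipeline: a background thread tokenizes/collates the
    # next batch on CPU while the main thread runs the GPU forward on
    # the current one. queue_size bounds how far prep may run ahead.
    q = queue.Queue(maxsize=queue_size)
    _DONE = object()

    def producer():
        for ex in examples:
            q.put(prepare(ex))
        q.put(_DONE)

    threading.Thread(target=producer, daemon=True).start()
    results = []
    while (item := q.get()) is not _DONE:
        results.append(forward(item))
    return results

out = pipelined_eval(["ab", "cde"], prepare=len, forward=lambda n: n * 10)
print(out)  # [20, 30]
```

Caching the prepared batches across calls is what lets second+ evals bypass this pipeline entirely.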
Andrej Karpathy
8ebc14b348 small touchups to the eval script, re-order items etc, cosmetic 2026-02-03 21:03:42 +00:00
Andrej Karpathy
0307997f9b merge two files base_loss and base_eval into a single file, it's nicer this way, and unify the huggingface code associated with both 2026-02-01 02:36:43 +00:00
DU Wenjie
ea4229851b bugfix 2025-12-26 19:02:12 +08:00
DU Wenjie
7840049189 bugfix: keep the same args style in scripts/base_eval.py 2025-12-26 17:29:08 +08:00
duwenjie
92c6654b95 bugfix save and load ckpt from model_tag dir 2025-12-21 15:07:04 +08:00
svlandeg
c72b8b2309 add explicit UTF-8 encoding 2025-11-03 21:27:12 +01:00
Dipesh Babu
226953b841 fix: open JSONL and results CSV with UTF-8 encoding for portability 2025-11-03 01:20:56 -05:00
Andrej Karpathy
cf587acb1a move eval bundle download to be lazy and inside the python code so that we can substantially simplify the run bash scripts 2025-11-01 16:04:38 +00:00
Andrej Karpathy
7d2c4a3d95 delete pandas dep in base_eval use csv instead 2025-11-01 15:28:30 +00:00
svlandeg
8c9b004c99 typo fixes in scripts 2025-10-28 20:17:31 +01:00
karpathy
df600b6ed5 many small tweaks. base, eval, core work now i think 2025-10-16 15:46:18 -07:00
karpathy
786119d593 add autodetect of device and related stuff. getting weird warnings/errors still, so wip 2025-10-16 10:26:19 -07:00
karpathy
3a5e0bc50b initial commit 2025-10-13 06:49:24 -07:00