Switch cached eval path to batched=True (forwards full collated batches)
for ~5-7x speedup over sequential per-example evaluation. Add per-example
forwarding mode (batched=False) that trims collation padding to recover
exact per-example tensor shapes, guaranteeing identical results to the
old sequential path. Bench script uses batched=True for speed sweeps and
per-example mode (batched=False) for correctness verification against the
old sequential path.
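the padding-trim step of per-example mode can be sketched like this (a pure-python stand-in for the tensor slicing; `pad_id` and the function name are illustrative, not the real API):

```python
def split_batch_per_example(input_ids, pad_id):
    """recover exact per-example inputs from a collated, right-padded batch:
    strip each row's trailing pad tokens so the model sees the same shapes
    the old sequential per-example path produced."""
    out = []
    for row in input_ids:
        n = len(row)
        while n > 0 and row[n - 1] == pad_id:
            n -= 1
        out.append(row[:n])
    return out
```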
three independent improvements to the cached CORE evaluation path:
1. GPU-resident data: all base-4 collated batches (~144MB for full CORE eval)
are moved to GPU upfront via .to(device). eliminates all CPU→GPU transfers
from the forward loop. _forward_all_cached replaces double-buffered prefetch
with a simple upfront bulk transfer — .to() is a no-op when the caller has
already preloaded tensors to GPU (as bench_core_eval now does).
2. continuous cross-task pipeline: _forward_all_cached flattens all tasks'
batches into one stream. the last batch of task N flows directly into the
first batch of task N+1 with no pipeline restart. GPU-side composition via
merge (pad+cat for bs > base) and split (row-slice for bs < base) avoids
the CPU-side compose_collated bottleneck that made bs=8 slower than bs=4.
3. progress bars + per-task result printing: both cached and first-run paths
in evaluate_model now show a tqdm progress bar with the current task label.
on_task_done callback in _forward_all_cached prints each task's accuracy
as soon as its last batch is processed (single-GPU). DDP falls back to
printing after all_reduce. both paths print total elapsed time at the end.
bench_core_eval: preloads ALL base-4 batches to GPU once before the batch-size
sweep. all sweep iterations compose from GPU-resident tensors with zero
CPU→GPU transfers in the hot loop.
the main idea: tokenization + collation for CORE eval only needs to happen once per tokenizer.
collated batches at base batch_size=4 are saved to disk (core_token_cache/), keyed by SHA-256
of the tokenizer file. any batch_size can be served from these base-4 batches: larger sizes merge
consecutive batches (right-pad shorter ones, cat along dim=0), smaller sizes split along example
boundaries (trim trailing padding). this means prepare_task_data is truly a one-time cost.
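the merge/split serving idea, written out as a minimal pure-python sketch (rows stand in for tensor rows; function names and pad_id are assumptions, not the real compose_collated API):

```python
def merge_base_batches(a, b, pad_id=0):
    """serve a larger batch_size from two consecutive base-4 batches:
    right-pad every row to the joint max length, then concatenate along
    the batch dimension (the pad+cat path)."""
    width = max(len(r) for r in a + b)
    return [r + [pad_id] * (width - len(r)) for r in a + b]

def split_base_batch(batch, size):
    """serve a smaller batch_size by slicing a base batch along example
    boundaries (trailing-padding trim omitted for brevity)."""
    return [batch[i:i + size] for i in range(0, len(batch), size)]
```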
core_eval.py:
- double-buffered CPU->GPU transfers in both forward paths (_forward_batches and evaluate_task's
pipelined path). while GPU runs forward_model on batch N, batch N+1 is pin_memory()'d and
DMA-transferred via non_blocking=True. the DMA engine and GPU compute units are separate
hardware so they overlap. previously GPU idled during every transfer.
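the control flow of the double-buffered loop looks roughly like this one-batch-lookahead sketch — `transfer` stands in for pin_memory() + .to(device, non_blocking=True) and `compute` for forward_model; pure python is synchronous, so this shows the ordering pattern, not the actual DMA/compute overlap:

```python
def pipelined_forward(batches, transfer, compute):
    """kick off the transfer of batch N+1 before computing batch N, so on
    real hardware the copy engine overlaps with GPU compute."""
    out = []
    nxt = transfer(batches[0]) if batches else None
    for i in range(len(batches)):
        cur = nxt
        # start the next transfer before this batch's forward pass
        nxt = transfer(batches[i + 1]) if i + 1 < len(batches) else None
        out.append(compute(cur))
    return out
```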
- compose_collated(): merge base batches for larger batch_size (cat after right-padding to
max_len), or split for smaller batch_size (slice along row boundaries from batch_meta,
trim trailing padding via vectorized non_pad.any(dim=0)). works because examples are sorted
by seq_len, so consecutive base batches have monotonically increasing lengths.
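the trailing-padding trim in the split path can be sketched in pure python (the real code uses the vectorized non_pad.any(dim=0) on tensors; this stand-in loops over columns instead):

```python
def trim_trailing_padding(rows, pad_id):
    """drop columns that are padding in every row, from the right:
    after slicing a base batch along row boundaries, the remaining rows
    may all be shorter than the batch width, leaving all-pad columns."""
    width = len(rows[0])
    while width > 0 and all(r[width - 1] == pad_id for r in rows):
        width -= 1
    return [r[:width] for r in rows]
```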
- evaluate_task and _forward_batches accept optional pbar for progress tracking.
base_eval.py:
- evaluate_model now has 3-tier caching: in-memory (_batch_cache, across calls within same
process), disk load (core_token_cache/, on first call when in-memory is empty), disk save
(after first run's prepare+collate+forward, writes collated batches so future training runs
and the benchmark skip tokenization entirely). keyed by tokenizer file hash + max_per_task.
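the 3-tier lookup order (memory, then disk, then build+save) reduces to this structural sketch — parameter names are illustrative, not evaluate_model's real signature:

```python
def get_collated_batches(mem_cache, key, load_disk, build, save_disk):
    """tier 1: in-memory dict (shared across calls in one process).
    tier 2: disk cache (core_token_cache/ in the real code).
    tier 3: prepare+collate once, then persist for future runs."""
    if key in mem_cache:
        return mem_cache[key]
    data = load_disk(key)
    if data is None:
        data = build()
        save_disk(key, data)
    mem_cache[key] = data
    return data
```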
bench_core_eval.py:
- cached sweep no longer re-runs the full first-run sweep to build collated data (was 2x the
work for no reason). instead loads/builds base-4 cache once, then compose_collated serves
any target batch_size. cached sweep only varies batch_size (no queue_size — no collation thread).
- --skip-first: skip the first-run sweep entirely if disk cache exists. if cache is missing,
runs a single bs=4 eval in minimal time to create it, then proceeds to cached sweep.
- tqdm progress bars everywhere: old sequential baseline (per-example with task name),
first-run sweep (double bar: outer=combo progress, inner=per-example), cache building
(per-task), cached sweep (double bar). task names left-padded to max label length so the
bar doesn't shift.
- tokenizer identity via file_checksum (SHA-256 of tokenizer.pkl/tokenizer.json on disk),
not encode-output hashing. HF models fall back to hashing the repo name.
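a file_checksum implementation along these lines (a minimal sketch, assuming the real helper also streams the file rather than reading it whole):

```python
import hashlib

def file_checksum(path):
    """SHA-256 of the tokenizer file's bytes on disk: two runs with an
    identical tokenizer file land on the same cache key, with no need to
    hash encode() outputs."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()
```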