mirror of https://github.com/karpathy/nanochat.git, synced 2026-04-04 06:35:23 +00:00
three independent improvements to the cached CORE evaluation path:
1. GPU-resident data: all base-4 collated batches (i.e. batches collated at
the base batch size of 4; ~144MB for the full CORE eval) are moved to GPU
upfront via .to(device), eliminating all CPU→GPU transfers from the forward
loop. _forward_all_cached replaces the double-buffered prefetch with a simple
upfront bulk transfer — .to() is a no-op when the caller has already preloaded
tensors to GPU (as bench_core_eval now does).
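A minimal sketch of the upfront bulk transfer; `preload_batches` and the dict-of-tensors batch layout are hypothetical illustrations, not the actual nanochat code. The key property is that `.to(device)` returns the tensor itself when it already lives on that device, so a second preload (or a preloading caller) costs nothing:

```python
import torch

def preload_batches(batches, device):
    # move every collated batch to the device once, up front;
    # .to(device) is a no-op (returns self) for tensors already
    # on that device, so double-preloading is free
    return [{k: v.to(device) for k, v in b.items()} for b in batches]
```

With batches already resident, the forward loop then touches only device memory.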
2. continuous cross-task pipeline: _forward_all_cached flattens all tasks'
batches into one stream. the last batch of task N flows directly into the
first batch of task N+1 with no pipeline restart. GPU-side composition via
merge (pad+cat for bs > base) and split (row-slice for bs < base) avoids
the CPU-side compose_collated bottleneck that made bs=8 slower than bs=4.
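The merge/split composition can be sketched as below; these are illustrative stand-ins for the GPU-side operations described above (the real nanochat helpers may differ), assuming each base batch is a 2-D token tensor:

```python
import torch
import torch.nn.functional as F

def merge(batches, pad_id=0):
    # bs > base: right-pad each base batch to the longest sequence,
    # then cat along the batch dim -- all on-device, no CPU round trip
    max_len = max(b.size(1) for b in batches)
    padded = [F.pad(b, (0, max_len - b.size(1)), value=pad_id) for b in batches]
    return torch.cat(padded, dim=0)

def split(batch, bs):
    # bs < base: row-slice a base batch into smaller chunks (views, no copy)
    return [batch[i:i + bs] for i in range(0, batch.size(0), bs)]
```

Because both operations run on tensors that are already resident on the GPU, composing a bs=8 batch from two bs=4 batches no longer pays the CPU-side cost that made bs=8 slower than bs=4.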
3. progress bars + per-task result printing: both cached and first-run paths
in evaluate_model now show a tqdm progress bar with the current task label.
on_task_done callback in _forward_all_cached prints each task's accuracy
as soon as its last batch is processed (single-GPU). DDP falls back to
printing after all_reduce. both paths print total elapsed time at the end.
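The per-task callback mechanics can be sketched as a plain loop over the flattened batch stream (the tqdm wrapping is omitted here; `forward_all` and `run_batch` are hypothetical names, not the actual `_forward_all_cached` signature):

```python
def forward_all(flat_batches, run_batch, on_task_done=None):
    # flat_batches: list of (task_name, batch) -- consecutive tasks
    # flow through one continuous loop with no pipeline restart
    correct, total = {}, {}
    for i, (task, batch) in enumerate(flat_batches):
        c, n = run_batch(batch)  # (num correct, num examples) for this batch
        correct[task] = correct.get(task, 0) + c
        total[task] = total.get(task, 0) + n
        # fire the callback the moment a task's last batch is processed
        is_last = i + 1 == len(flat_batches) or flat_batches[i + 1][0] != task
        if is_last and on_task_done is not None:
            on_task_done(task, correct[task] / total[task])
    return {t: correct[t] / total[t] for t in correct}
```

Under DDP the per-task accuracies are only final after the all_reduce, which is why the early-print path is single-GPU only.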
bench_core_eval: preloads ALL base-4 batches to GPU once before the batch-size
sweep. all sweep iterations compose from GPU-resident tensors with zero
CPU→GPU transfers in the hot loop.
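The sweep structure might look like the following sketch (hypothetical names; assumes equal sequence lengths across base batches so plain `cat` suffices): the base batches are preloaded once, and every sweep iteration composes from those resident tensors, so the timed loop performs no host-to-device copies.

```python
import time
import torch

def bench_sweep(base_batches, base_bs, sizes, forward):
    # base_batches: already preloaded to the device, one list for the
    # whole sweep; each iteration composes batches of size bs from them
    for bs in sizes:
        if bs >= base_bs:
            g = bs // base_bs
            batches = [torch.cat(base_batches[i:i + g], dim=0)
                       for i in range(0, len(base_batches), g)]
        else:
            batches = [b[i:i + bs] for b in base_batches
                       for i in range(0, b.size(0), bs)]
        t0 = time.perf_counter()
        for b in batches:  # hot loop: device-resident tensors only
            forward(b)
        print(f"bs={bs}: {time.perf_counter() - t0:.3f}s")
```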