# Running lm-eval with nanochat checkpoints

This repo ships its own evals (CORE, ARC/GSM8K/MMLU/HumanEval/SpellingBee), but you can also run the HuggingFace-compatible lm-evaluation-harness. The steps below assume you've already run `bash setup.sh` (installs uv, submodules, deps, and the Rust tokenizer).
## 1) Activate env

```bash
source .venv/bin/activate
```
## 2) Export a trained checkpoint to HF format

`nanochat/to_hf.py` loads the latest checkpoint from `~/.cache/nanochat/<source>_checkpoints` and writes an HF folder.

- Choose source: `base|mid|sft|rl`.
```bash
# export latest base checkpoint to hf-export/base
uv run python -m nanochat.to_hf --source base --output hf-export/base

# export latest SFT checkpoint (chat model)
uv run python -m nanochat.to_hf --source sft --output hf-export/sft
```
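Before pointing the harness at the folder, it can be worth a quick load check. A minimal sketch, assuming the export is a standard transformers-loadable folder and that a tokenizer config is written alongside the weights (not confirmed here); the path matches the SFT export above:

```python
# Quick load check for the exported folder (a sketch; assumes the export
# is a standard transformers-loadable folder and, for the tokenizer line,
# that a tokenizer config was written alongside the weights).
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "hf-export/sft"  # the output path from the command above
model = AutoModelForCausalLM.from_pretrained(path)
tokenizer = AutoTokenizer.from_pretrained(path)
print(model.config.model_type, f"{model.num_parameters():,} params")
```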
## 3) Run lm-eval benchmarks on the exported model

Use the HF backend (`--model hf`). Pick tasks; nanochat's built-in evals cover these, so they're good starters in lm-eval too:

`arc_easy`, `arc_challenge`, `mmlu`, `gsm8k`, `humaneval`
Example runs:
```bash
# Single task (MMLU)
uv run lm-eval run --model hf \
    --model_args pretrained=hf-export/sft \
    --tasks mmlu \
    --batch_size 1

# A small suite similar to nanochat chat_eval coverage
uv run lm-eval run --model hf \
    --model_args pretrained=hf-export/sft \
    --tasks arc_easy,arc_challenge,gsm8k,mmlu,humaneval \
    --batch_size 1
```
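You can also drive the harness from Python instead of the CLI. A minimal sketch, assuming lm-evaluation-harness 0.4+ (where `lm_eval.simple_evaluate` is available) and the `hf-export/sft` path from step 2:

```python
# Programmatic alternative to the CLI above (a sketch; assumes
# lm-evaluation-harness 0.4+, where simple_evaluate is available).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                             # same backend as --model hf
    model_args="pretrained=hf-export/sft",  # or a remote HF repo name
    tasks=["arc_easy", "gsm8k"],
    batch_size=1,
)
print(results["results"])  # per-task metric dict
```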
Notes:

- If you exported to a different folder, change `pretrained=...` accordingly. You can also point to a remote HF repo name.
- `--batch_size auto` can help find the largest batch that fits GPU RAM. On CPU, keep it small.
- No KV cache is implemented in the HF wrapper; generation is standard `AutoModelForCausalLM` style.
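Without a KV cache, generation recomputes the full prefix at every step, so expect it to be slow for long outputs. A minimal sketch of that standard `generate()` path, assuming the exported model accepts `use_cache=False` (an assumption mirroring the note above, not a confirmed requirement):

```python
# Standard AutoModelForCausalLM-style generation (a sketch; use_cache=False
# mirrors the note above that the HF wrapper implements no KV cache).
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "hf-export/sft"  # or wherever you exported in step 2
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path)

inputs = tokenizer("The capital of France is", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=16, use_cache=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```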