humaneval done

Muheng 2025-12-23 16:21:51 +00:00
parent c026e6f63d
commit a1f836bbeb
5 changed files with 3064 additions and 8 deletions


@@ -33,28 +33,27 @@ uv run lm-eval run --model hf \
   --tasks mmlu \
   --batch_size 1
-# A small suite similar to nanochat chat_eval coverage
+# A small suite similar to nanochat chat_eval coverage (vanilla HF backend)
 # HumanEval requires both flags below to allow executing generated code.
 HF_ALLOW_CODE_EVAL=1 uv run lm-eval run --confirm_run_unsafe_code --model hf \
   --model_args pretrained=hf-export/sft,trust_remote_code=True \
-  --tasks arc_easy,arc_challenge,gsm8k,mmlu,humaneval \
   --apply_chat_template \
+  --tasks arc_easy,arc_challenge,mmlu \
   --batch_size 1 > log.log 2>&1
+# Nanochat-aligned tool-use backend (matches nanochat eval formatting)
 HF_ALLOW_CODE_EVAL=1 uv run lm-eval run \
   --include_path tools/lm-eval/lm_eval/tasks \
   --confirm_run_unsafe_code \
-  --model hf \
+  --model hf-nanochat-tool \
   --model_args pretrained=hf-export/sft,trust_remote_code=True,tokenizer=hf-export/sft \
   --tasks gsm8k_nanochat,humaneval_nanochat \
-  --apply_chat_template \
   --batch_size 1 \
   --log_samples \
-  --output_path lm_eval_sample_nanochat.json > log.log 2>&1
+  --output_path lm_eval_sample_nanochat > log.log 2>&1
 ```
 Notes:
 - If you exported to a different folder, change `pretrained=...` accordingly. You can also point to a remote HF repo name.
 - If you must stay offline, add `HF_DATASETS_OFFLINE=1 HF_HUB_OFFLINE=1 TRANSFORMERS_OFFLINE=1`, **but** ensure the datasets are already cached locally (e.g., `allenai/ai2_arc`, `openai_humaneval`, `gsm8k`, `cais/mmlu`). Otherwise, leave them unset so the harness can download once.
 - `--batch_size auto` can help find the largest batch that fits in GPU RAM. On CPU, keep it small.
-- No KV cache is implemented in the HF wrapper; generation is standard `AutoModelForCausalLM` style.
+- No KV cache is implemented in the HF wrapper; generation is standard `AutoModelForCausalLM` style. The `hf-nanochat-tool` wrapper runs a nanochat-style tool loop (greedy, batch size 1) and does not need `--apply_chat_template` because its prompts already contain the special tokens.
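The tool loop mentioned in the last note can be sketched as follows. This is a minimal, hypothetical illustration: the `<|python_start|>`/`<|python_end|>` and `<|output_start|>`/`<|output_end|>` token names follow nanochat's conventions, but the actual `hf-nanochat-tool` wrapper's token handling and sandboxing may differ.

```python
# Hypothetical sketch of a nanochat-style tool loop: when the model emits a
# <|python_start|>...<|python_end|> span, the harness executes the code and
# appends the result as <|output_start|>...<|output_end|>, then lets the
# model continue generating from the extended transcript.
PY_START, PY_END = "<|python_start|>", "<|python_end|>"
OUT_START, OUT_END = "<|output_start|>", "<|output_end|>"

def run_tool(code: str) -> str:
    """Evaluate a single expression; a real harness must sandbox this."""
    try:
        return repr(eval(code, {"__builtins__": {}}))
    except Exception as exc:
        return f"error: {exc}"

def tool_loop(generate, prompt: str, max_rounds: int = 4) -> str:
    """Greedy, batch-size-1 loop: generate, execute any tool call, repeat."""
    transcript = prompt
    for _ in range(max_rounds):
        completion = generate(transcript)
        transcript += completion
        if PY_START not in completion or PY_END not in completion:
            break  # no complete tool call -> treat as the final answer
        code = completion.split(PY_START, 1)[1].split(PY_END, 1)[0]
        transcript += OUT_START + run_tool(code) + OUT_END
    return transcript
```

Because the loop splices special tokens directly into the transcript, no chat template is applied on top — which is why the second command above drops `--apply_chat_template`.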

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

@@ -1 +1 @@
-Subproject commit 5628f98f0c387366f18964e3d34b614e5600f83b
+Subproject commit 32c4b74696a41586712a8a8b7906591833ba1a78
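For context on why the commands above set `HF_ALLOW_CODE_EVAL=1` and `--confirm_run_unsafe_code`: HumanEval scores a completion by actually executing the model-generated code against the problem's unit tests, which is unsafe on untrusted output. A self-contained sketch of that pass/fail check is below; the function name is hypothetical, and the real harness adds stronger sandboxing plus pass@k aggregation.

```python
import os
import subprocess
import sys
import tempfile

def passes_humaneval_style_check(completion: str, test_code: str,
                                 timeout: float = 5.0) -> bool:
    """Run a generated solution plus its unit tests in a subprocess.

    Returns True iff the program exits cleanly (all asserts pass) before
    the timeout. Hypothetical sketch only: a production harness would also
    restrict imports, filesystem access, and resource usage.
    """
    program = completion + "\n" + test_code
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path],
                              capture_output=True, timeout=timeout)
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False  # infinite loops in generated code count as failures
    finally:
        os.unlink(path)
```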