nanochat/docs/pre_gpu_runbook.md

Pre-GPU Runbook

This runbook is the minimum operational checklist before spending GPU time.

1. Local Prep

  1. Build the seed tool datasets:

     python -m scripts.build_tool_datasets

  2. Import the starting checkpoint from Hugging Face into native nanochat format:

     python -m scripts.import_hf_checkpoint \
       --repo-id ManmohanSharma/nanochat-d24 \
       --model-tag d24_hf_import

  3. Validate tool tokenization and mock tool execution with local tests:

     python -m pytest tests/test_engine.py tests/test_tools.py -v

  4. Dry-run tool evaluation on CPU:

     python -m scripts.chat_eval \
       -i sft \
       -a ToolJSON \
       --tool-jsonl seed_data/tool_eval_seed.jsonl \
       --device-type cpu \
       -x 3
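
Before the CPU dry-run, it can help to sanity-check that the seed JSONL is structurally valid, since a single malformed line will derail the eval. A minimal sketch, assuming each line of the seed file is one JSON object (the field names in the sample are hypothetical):

```python
import json

def validate_tool_jsonl(lines):
    """Check that every non-blank line parses as a JSON object; return (ok_count, errors)."""
    ok, errors = 0, []
    for i, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        try:
            obj = json.loads(line)
            if isinstance(obj, dict):
                ok += 1
            else:
                errors.append(f"line {i}: not a JSON object")
        except json.JSONDecodeError as e:
            errors.append(f"line {i}: {e}")
    return ok, errors

# In-memory sample instead of reading seed_data/tool_eval_seed.jsonl;
# the "tool"/"input" keys are illustrative, not the real schema.
sample = ['{"tool": "calculator", "input": "2+2"}', 'not json']
ok, errors = validate_tool_jsonl(sample)
print(ok, len(errors))  # -> 1 1
```

In practice you would iterate over `open("seed_data/tool_eval_seed.jsonl")` instead of the in-memory sample.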

2. 48-Hour GPU Schedule

  1. Pilot CPT

    • Run a short continuation test from the imported base checkpoint.
    • Confirm the loss is decreasing, checkpoint saving works, and HF sync works.
  2. Full CPT

    • Run the main continuation stage on the ClimbMix backbone.
    • Save staged checkpoints at planned intervals.
  3. SFT

    • Include the local tool SFT JSONL via --extra-train-jsonl.
    • Validate that calculator/web_search traces render correctly.
  4. RL / tool tuning

    • Keep this stage narrow and short.
    • Focus on tool-choice correctness and grounded answers.
  5. Eval

    • Run ARC, MMLU, GSM8K, HumanEval, and ToolJSON checks.
    • Do not ship if tool behavior regresses or citations are missing.
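
The eval gate in step 5 can be reduced to a simple regression check against the previous checkpoint's scores. A sketch under stated assumptions: the benchmark names come from the list above, but the scores and the tolerance are hypothetical placeholders, not project thresholds:

```python
def eval_gate(current, baseline, tolerance=0.01):
    """Return the metrics that regressed by more than `tolerance` versus baseline."""
    return [name for name, score in current.items()
            if score < baseline.get(name, 0.0) - tolerance]

# Hypothetical scores; real values come from the eval runs above.
baseline = {"ARC": 0.52, "MMLU": 0.41, "GSM8K": 0.18, "HumanEval": 0.10, "ToolJSON": 0.90}
current  = {"ARC": 0.53, "MMLU": 0.41, "GSM8K": 0.20, "HumanEval": 0.11, "ToolJSON": 0.85}

regressed = eval_gate(current, baseline)
print(regressed)  # -> ['ToolJSON']  (dropped 0.05, so: do not ship)
```

Any non-empty result maps to the "do not ship" rule above.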

3. Checkpoint Upload Cadence

Upload at every stage boundary and at any explicit resume point:

python -m scripts.hf_sync_checkpoint \
  --repo-id ManmohanSharma/nanochat-d24 \
  --source base \
  --model-tag d24_hf_import \
  --step 0

If a whole checkpoint directory should be mirrored:

python -m scripts.hf_sync_checkpoint \
  --repo-id ManmohanSharma/nanochat-d24 \
  --source base \
  --model-tag d24_hf_import
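
The two sync invocations above differ only in whether --step is passed. A small helper can build the argv for each stage boundary so the cadence stays consistent across the schedule; the flags mirror the commands above, while the helper itself is just an illustrative sketch:

```python
def sync_command(repo_id, source, model_tag, step=None):
    """Build the hf_sync_checkpoint argv; omit `step` to mirror the whole directory."""
    cmd = ["python", "-m", "scripts.hf_sync_checkpoint",
           "--repo-id", repo_id,
           "--source", source,
           "--model-tag", model_tag]
    if step is not None:
        cmd += ["--step", str(step)]
    return cmd

# Single-step upload, matching the first command above
cmd = sync_command("ManmohanSharma/nanochat-d24", "base", "d24_hf_import", step=0)
print(" ".join(cmd))
```

Feeding the result to `subprocess.run(cmd, check=True)` at each stage boundary keeps upload failures from passing silently.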

4. Go / No-Go

Go only if:

  • HF import works.
  • HF sync works.
  • Mock tool execution works.
  • Tool seed datasets are generated.
  • Tool eval runs locally.
  • The search backend plan is explicit: search provider plus Cloudflare fetch/crawl.

No-Go if:

  • Any tokenizer mismatch appears during HF import.
  • Tool blocks fail to render.
  • web_search still has no backend plan beyond fetch-only Cloudflare Browser Rendering.
  • Local tool eval is missing or failing.
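
The checklist above can be collapsed into a single gate so the decision is mechanical rather than judgment-in-the-moment. A minimal sketch; the criterion keys are hypothetical names for the bullets above, not fields any script currently emits:

```python
# Hypothetical keys, one per Go criterion above.
GO_CRITERIA = [
    "hf_import_ok",
    "hf_sync_ok",
    "mock_tools_ok",
    "seed_datasets_built",
    "local_tool_eval_ok",
    "search_backend_plan_explicit",
]

def go_no_go(status):
    """Go only when every criterion is True; otherwise report what is missing."""
    missing = [c for c in GO_CRITERIA if not status.get(c, False)]
    return ("GO", []) if not missing else ("NO-GO", missing)

decision, missing = go_no_go({c: True for c in GO_CRITERIA})
print(decision)  # -> GO
```

A missing key counts as a failure, which matches the spirit of the list: anything unverified is a No-Go.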