# Pre-GPU Runbook
This runbook is the minimum operational checklist before spending GPU time.
## 1. Local Prep
1. Build the seed tool datasets:
```bash
python -m scripts.build_tool_datasets
```
2. Import the starting checkpoint from Hugging Face into native nanochat format:
```bash
python -m scripts.import_hf_checkpoint \
    --repo-id ManmohanSharma/nanochat-d24 \
    --model-tag d24_hf_import
```
3. Validate tool tokenization and mock tool execution with local tests:
```bash
python -m pytest tests/test_engine.py tests/test_tools.py -v
```
4. Dry-run tool evaluation on CPU:
```bash
python -m scripts.chat_eval \
    -i sft \
    -a ToolJSON \
    --tool-jsonl seed_data/tool_eval_seed.jsonl \
    --device-type cpu \
    -x 3
```
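Optionally, finish local prep with a quick CPU generation from the imported checkpoint. A minimal sketch, assuming this fork keeps upstream nanochat's `scripts.chat_cli` entry point; the flags here are assumptions, so check `--help` before running:
```bash
# Assumed entry point and flags; verify against this fork before running.
python -m scripts.chat_cli \
    -i base \
    --device-type cpu \
    -p "What is 2+2?"
```
A short, coherent completion is enough here; the goal is only to confirm the imported weights load and run end to end on CPU.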
## 2. 48-Hour GPU Schedule
1. Pilot CPT
- Run a short continuation test from the imported base checkpoint (a hedged command sketch for the full sequence follows this list).
- Confirm the loss is improving, checkpoints save correctly, and HF sync works.
2. Full CPT
- Run the main continuation stage on the ClimbMix backbone.
- Save staged checkpoints at planned intervals.
3. SFT
- Include the local tool SFT JSONL via `--extra-train-jsonl`.
- Validate that calculator/web_search traces render correctly.
4. RL / tool tuning
- Keep this stage narrow and short.
- Focus on tool-choice correctness and grounded answers.
5. Eval
- Run ARC, MMLU, GSM8K, HumanEval, and ToolJSON checks.
- Do not ship if tool behavior regresses or citations are missing.
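A hedged command sketch for the full sequence, referenced from the Pilot CPT item above. Only the `hf_sync_checkpoint` command, the `--extra-train-jsonl` flag, and the `chat_eval` invocation come from this runbook; every other script name, flag, and path is an assumption borrowed from upstream nanochat and must be checked against this fork before GPU time is booked:
```bash
# 1. Pilot CPT: short continuation from the imported checkpoint.
#    Script name and flags are assumptions (upstream uses scripts.base_train).
python -m scripts.base_train --model-tag d24_hf_import --num-iterations 100

# 2. Full CPT on the ClimbMix backbone, with staged checkpoint saves (flags assumed).
python -m scripts.base_train --model-tag d24_hf_import --dataset climbmix

# Sync each stage boundary as soon as it exists (command from section 3).
python -m scripts.hf_sync_checkpoint \
    --repo-id ManmohanSharma/nanochat-d24 \
    --source base \
    --model-tag d24_hf_import

# 3. SFT with the local tool traces; script name and JSONL path are assumptions.
python -m scripts.chat_sft --extra-train-jsonl seed_data/tool_sft_seed.jsonl

# 4. RL / tool tuning, kept narrow and short (assumed script name).
python -m scripts.chat_rl

# 5. Eval: run the standard suites, plus the ToolJSON check from section 1.
python -m scripts.chat_eval \
    -i sft \
    -a ToolJSON \
    --tool-jsonl seed_data/tool_eval_seed.jsonl
```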
## 3. Checkpoint Upload Cadence
Upload a checkpoint at every stage boundary and at any explicit resume point:
```bash
python -m scripts.hf_sync_checkpoint \
    --repo-id ManmohanSharma/nanochat-d24 \
    --source base \
    --model-tag d24_hf_import \
    --step 0
```
To mirror a whole checkpoint directory, omit `--step`:
```bash
python -m scripts.hf_sync_checkpoint \
    --repo-id ManmohanSharma/nanochat-d24 \
    --source base \
    --model-tag d24_hf_import
```
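After either command, confirm the files are actually visible on the Hub before tearing anything down. A minimal check using `huggingface_hub`; filtering on the model tag is an assumption about how `hf_sync_checkpoint` lays out files remotely, so adjust the filter if the layout differs:
```bash
python - <<'EOF'
# List repo files and confirm the synced tag shows up.
# Filtering on the model tag is an assumption about the remote layout.
from huggingface_hub import HfApi

files = HfApi().list_repo_files("ManmohanSharma/nanochat-d24")
matches = [f for f in files if "d24_hf_import" in f]
print(f"{len(matches)} file(s) mention d24_hf_import")
for path in matches[:10]:
    print("  ", path)
EOF
```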
## 4. Go / No-Go
Go only if:
- HF import works.
- HF sync works.
- Mock tool execution works.
- Tool seed datasets are generated.
- Tool eval runs locally.
- The search backend plan is explicit: search provider plus Cloudflare fetch/crawl.
No-Go if:
- Any tokenizer mismatch appears during HF import (a round-trip sketch follows this list).
- Tool blocks fail to render.
- `web_search` still has no backend plan beyond fetch-only Cloudflare Browser Rendering.
- Local tool eval is missing or failing.
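A concrete form of the tokenizer round-trip check referenced above: encode a tool block with the imported HF tokenizer and confirm the markers survive decoding. This sketch assumes the Hub repo ships an `AutoTokenizer`-compatible tokenizer, and the tag strings are placeholders for whatever markers the tool formatter actually emits:
```bash
python - <<'EOF'
# Round-trip a tool block through the imported tokenizer.
# The tag strings below are placeholders; substitute the real tool markers.
# Tags splitting into many ids, or decode() not matching the input,
# is the tokenizer-mismatch No-Go above.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("ManmohanSharma/nanochat-d24")
sample = "<|tool_start|>calculator<|tool_sep|>2+2<|tool_end|>"

ids = tok.encode(sample, add_special_tokens=False)
print("token ids:", ids)
print("round-trip ok:", tok.decode(ids) == sample)
EOF
```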