mirror of
https://github.com/karpathy/nanochat.git
synced 2026-06-15 10:39:08 +00:00
- base_train.py: CUDA profiler + PyTorch profiler hooks gated by NANOCHAT_PROFILE_* env vars - profile_step.py: standalone single-step profiler with NVTX ranges and phase selection - LOCAL_STATE.md: documents local branch/file state before machine teardown Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
36 lines
1.6 KiB
Markdown
36 lines
1.6 KiB
Markdown
# Local State — nanochat (karpathy fork)
|
|
|
|
Documented 2026-04-09 before machine teardown.
|
|
|
|
## Branch: fa3-flex-sdpa (current)
|
|
- Tracking: `fork/fa3-flex-sdpa` (ademeure/nanochat) — pushed and up to date
|
|
- 1 commit ahead of upstream master: `3d0dec5 FA3/FlexAttention/SDPA attention + PyTorch 2.11/CUDA 13.0`
|
|
|
|
## Branch: pytorch-2.11-cu130
|
|
- Tracking: `fork/pytorch-2.11-cu130` — pushed and up to date
|
|
- 2 commits ahead of master
|
|
|
|
## Branch: pytorch-2.11-cu128-test
|
|
- **Local-only, no upstream** — but 0 commits ahead of master, just a branch pointer. No unique content.
|
|
|
|
## Uncommitted changes (being committed now)
|
|
|
|
### scripts/base_train.py
|
|
- Added env-var-controlled profiling hooks (`NANOCHAT_PROFILE_START`, `NANOCHAT_PROFILE_STOP`, `NANOCHAT_PROFILE_EXIT`, `NANOCHAT_TORCH_PROFILE_DIR`)
|
|
- CUDA profiler start/stop integration around training steps
|
|
- PyTorch profiler with tensorboard trace output
|
|
- Early exit after profiling completes
|
|
- This is a work-in-progress profiling integration — functional but may need further tuning
|
|
|
|
### scripts/profile_step.py (new file)
|
|
- Standalone profiling script for a single training step (fwd/bwd/opt)
|
|
- Supports nsys and ncu profiling with NVTX ranges
|
|
- Usage: `nsys profile -o out python -m scripts.profile_step --depth 6`
|
|
- Supports `--phase {all,fwd,bwd,opt}` for targeted kernel analysis
|
|
|
|
### profiles/ (NOT committed — binary nsys artifacts)
|
|
- `nsys_d32_full.nsys-rep` (1.6M) — nsys trace, depth=32
|
|
- `nsys_d32_full.sqlite` (2.4M) — exported sqlite
|
|
- `nsys_d32_minimal.nsys-rep` (1.5M) — minimal nsys trace
|
|
- These are reproducible output artifacts, not committed to git
|