# Local State — nanochat (karpathy fork)

Documented 2026-04-09 before machine teardown.

## Branch: fa3-flex-sdpa (current)
- Tracking: `fork/fa3-flex-sdpa` (ademeure/nanochat) — pushed and up to date
- 1 commit ahead of upstream master: `3d0dec5 FA3/FlexAttention/SDPA attention + PyTorch 2.11/CUDA 13.0`

## Branch: pytorch-2.11-cu130
- Tracking: `fork/pytorch-2.11-cu130` — pushed and up to date
- 2 commits ahead of master

## Branch: pytorch-2.11-cu128-test
- **Local-only, no upstream** — but 0 commits ahead of master, just a branch pointer. No unique content.

## Uncommitted changes (being committed now)

### scripts/base_train.py
- Added env-var-controlled profiling hooks (`NANOCHAT_PROFILE_START`, `NANOCHAT_PROFILE_STOP`, `NANOCHAT_PROFILE_EXIT`, `NANOCHAT_TORCH_PROFILE_DIR`)
- CUDA profiler start/stop integration around training steps
- PyTorch profiler with tensorboard trace output
- Early exit after profiling completes
- This is a work-in-progress profiling integration — functional but may need further tuning

### scripts/profile_step.py (new file)
- Standalone profiling script for a single training step (fwd/bwd/opt)
- Supports nsys and ncu profiling with NVTX ranges
- Usage: `nsys profile -o out python -m scripts.profile_step --depth 6`
- Supports `--phase {all,fwd,bwd,opt}` for targeted kernel analysis

### profiles/ (NOT committed — binary nsys artifacts)
- `nsys_d32_full.nsys-rep` (1.6M) — nsys trace, depth=32
- `nsys_d32_full.sqlite` (2.4M) — exported sqlite
- `nsys_d32_minimal.nsys-rep` (1.5M) — minimal nsys trace
- These are reproducible output artifacts, not committed to git