Capture PyTorch execution traces and CUDA memory snapshots. Traces
display detailed CPU and CUDA activity, including individual CUDA kernel
calls.
CUDA memory snapshots record every memory allocation so it can be
visualized, which helps diagnose CUDA out-of-memory errors, investigate
memory leaks, or simply learn how GPU memory is used.
Enable profiling with the --enable_profiling=True flag in speedrun.sh.
See PROFILING.md for documentation and example visualizations.
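
For readers who want to see what such a capture involves, here is a minimal sketch using the public torch.profiler API and PyTorch's memory-history hooks. It is not necessarily how speedrun.sh implements --enable_profiling. The names capture_profile, step_fn, out_dir, and the output file names are hypothetical. The sketch assumes PyTorch 2.1 or later, and the underscore-prefixed memory functions are private APIs that may change between releases.

```python
import os


def capture_profile(step_fn, out_dir):
    """Run step_fn once under the profiler and write the results to out_dir.

    Returns (trace_path, snapshot_path); snapshot_path is None without CUDA.
    """
    # Imported lazily so this module still loads when profiling is unused.
    import torch
    from torch.profiler import ProfilerActivity, profile

    os.makedirs(out_dir, exist_ok=True)
    use_cuda = torch.cuda.is_available()

    activities = [ProfilerActivity.CPU]
    if use_cuda:
        activities.append(ProfilerActivity.CUDA)
        # Private API: start recording allocation history for the snapshot.
        torch.cuda.memory._record_memory_history(max_entries=100_000)

    with profile(activities=activities, record_shapes=True) as prof:
        step_fn()
        if use_cuda:
            # Wait for queued kernels so they appear in the trace.
            torch.cuda.synchronize()

    # Execution trace in Chrome trace format (hypothetical file name).
    trace_path = os.path.join(out_dir, "trace.json")
    prof.export_chrome_trace(trace_path)

    snapshot_path = None
    if use_cuda:
        # Memory snapshot as a pickle file (hypothetical file name).
        snapshot_path = os.path.join(out_dir, "memory_snapshot.pickle")
        torch.cuda.memory._dump_snapshot(snapshot_path)
        # Stop recording allocation history.
        torch.cuda.memory._record_memory_history(enabled=None)

    return trace_path, snapshot_path
```

Calling capture_profile(train_step, "profiles") around a single training step would produce a trace that can be opened in chrome://tracing or Perfetto, and, on a CUDA machine, a snapshot that can be loaded at pytorch.org/memory_viz. On a CPU-only machine the sketch records only CPU activity and skips the snapshot.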