Commit Graph

15 Commits

Author SHA1 Message Date
google-labs-jules[bot]
962deeefb6 Fix HIP invalid device ordinal error on multi-GPU setup
The `speedrun.sh` script was hardcoding `NPROC_PER_NODE=8` if any GPU capability was detected, causing crashes on systems with fewer than 8 GPUs. Additionally, `nanochat/common.py` was autodetecting "cuda" even if `torch.cuda.device_count()` was 0 on some ROCm builds, leading to "invalid device ordinal" errors.

Changes:
- `speedrun.sh`: Dynamically set `NPROC_PER_NODE` using `torch.cuda.device_count()`.
- `nanochat/common.py`: Ensure `autodetect_device_type` only returns "cuda" if devices are actually present.
2025-11-23 05:34:20 +00:00
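A minimal sketch of the dynamic GPU-count detection this commit describes, in the style of a `speedrun.sh` step; the fallback value and the echoed variable are illustrative, not necessarily what the script actually does:

```bash
# Ask torch how many devices are visible instead of assuming 8 (illustrative sketch).
NUM_GPUS=$(python -c "import torch; print(torch.cuda.device_count())")

if [ "$NUM_GPUS" -gt 0 ]; then
  NPROC_PER_NODE="$NUM_GPUS"
else
  # Some ROCm builds report GPU capability yet expose zero devices; launching
  # 8 ranks there requests invalid device ordinals, so fall back to one CPU rank.
  NPROC_PER_NODE=1
fi

echo "Launching with NPROC_PER_NODE=$NPROC_PER_NODE"
# The script would then pass this value to torchrun via --nproc_per_node.
```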
google-labs-jules[bot]
b92647c580 Fix AMD Triton runtime error by reinstalling pytorch-triton-rocm
Uninstalling the conflicting upstream `triton` package on AMD systems often removes the `triton` directory it shares with `pytorch-triton-rocm`, breaking the latter. This caused `ImportError: cannot import name 'Config' from 'triton'`.

This change adds a step that force-reinstalls `pytorch-triton-rocm` immediately after uninstalling `triton`, ensuring the correct package is present and intact at runtime.
2025-11-23 05:22:21 +00:00
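A hedged sketch of the reinstall step, assuming `uv` manages the environment as in the rest of `speedrun.sh` (the exact invocation in the script may differ):

```bash
# Removing the upstream `triton` wheel can delete files it shares with
# `pytorch-triton-rocm`, so reinstall the ROCm build right afterwards.
uv pip uninstall triton
uv pip install --reinstall pytorch-triton-rocm
```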
google-labs-jules[bot]
d291a62ad8 Fix AMD Triton re-installation issue in speedrun.sh
In AMD ROCm environments, `uv run` detected that the manually uninstalled `triton` package was missing (since it is a transitive dependency of `torch`) and reinstalled it during the tokenizer build step. This caused `ImportError: cannot import name 'Config' from 'triton'` due to a conflict with `pytorch-triton-rocm`.

This change adds `--no-sync` to the `uv run` command for building the tokenizer, preventing `uv` from undoing the manual uninstallation of `triton`.
2025-11-23 04:26:32 +00:00
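The flag in isolation, as a rough sketch; the actual tokenizer build command is elided and replaced with a placeholder here:

```bash
# --no-sync keeps `uv run` from re-syncing the lockfile environment, which
# would otherwise reinstall the upstream `triton` that was just removed.
uv run --no-sync python -c "print('tokenizer build step would run here')"
```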
google-labs-jules[bot]
994491b28d Move triton uninstall after uv sync in speedrun.sh for AMD 2025-11-23 03:50:46 +00:00
google-labs-jules[bot]
8881ea84bf Fix AMD Triton conflict in speedrun.sh
Explicitly uninstall `triton` when an AMD GPU is detected.
The standard `triton` package (often pulled in by NVIDIA dependencies or by accident)
conflicts with `pytorch-triton-rocm` on AMD systems, causing
`ImportError: cannot import name 'Config' from 'triton'`.
This change ensures a clean ROCm environment by removing the conflicting package.
Also retains the `uv run --extra $EXTRAS` fix from the previous step.
2025-11-23 03:38:56 +00:00
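A hypothetical sketch of the detect-and-uninstall step; checking `torch.version.hip` is one way to spot a ROCm build, though the script's actual detection may differ:

```bash
# torch.version.hip is non-empty only on ROCm (AMD) builds of PyTorch.
IS_ROCM=$(python -c "import torch; print(1 if torch.version.hip else 0)")

if [ "$IS_ROCM" -eq 1 ]; then
  # Drop the upstream `triton` wheel so it cannot shadow `pytorch-triton-rocm`.
  uv pip uninstall triton
fi
```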
google-labs-jules[bot]
83bb650b49 Fix AMD ROCm install regression in speedrun.sh
Explicitly pass `--extra $EXTRAS` to `uv run` when building the tokenizer.
This prevents `uv` from reverting to the default (NVIDIA) dependency set
during the `maturin` build step, ensuring the correct PyTorch version
(ROCm) is preserved on AMD hardware.
2025-11-23 02:33:07 +00:00
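A sketch of what pinning the extra looks like; the extra name "rocm" is assumed for illustration and may not match the project's actual `pyproject.toml` groups:

```bash
# Keep `uv run` on the hardware-specific dependency set instead of letting it
# resolve back to the default (NVIDIA) extras during the build step.
EXTRAS=rocm  # assumed extra name
uv run --extra "$EXTRAS" python -c "import torch; print(torch.__version__)"
```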
google-labs-jules[bot]
083de95913 Fix hardware detection for AMD ROCm and single-process CPU crashes 2025-11-22 23:50:50 +00:00
google-labs-jules[bot]
48e632245e Fix ROCm/APU detection and CPU DDP OOM crash 2025-11-22 09:18:40 +00:00
google-labs-jules[bot]
a35621e726 Fix CPU DDP crashes: Init Gloo backend, prevent OOM by reducing NPROC, add script safety 2025-11-22 05:31:47 +00:00
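A self-contained sketch of the Gloo idea, not the repo's actual training code: pick Gloo when no CUDA devices are available so DDP ranks can still rendezvous on a CPU-only host.

```bash
cat > /tmp/ddp_cpu_check.py <<'PY'
import torch
import torch.distributed as dist

# NCCL needs GPUs; Gloo works on plain CPU hosts.
backend = "nccl" if torch.cuda.is_available() else "gloo"
dist.init_process_group(backend=backend)  # torchrun provides rank/world-size env vars
print(f"rank {dist.get_rank()}/{dist.get_world_size()} using backend={backend}")
dist.destroy_process_group()
PY

torchrun --standalone --nproc_per_node=2 /tmp/ddp_cpu_check.py
```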
svlandeg
f1683c5b16 set nproc_per_node as var in speedrun and run1000 scripts 2025-11-04 21:36:10 +01:00
Andrej Karpathy
cf587acb1a move eval bundle download to be lazy and inside the python code so that we can substantially simplify the run bash scripts 2025-11-01 16:04:38 +00:00
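A generic sketch of the lazy-download pattern this commit moves into the Python side; the URL and cache path are placeholders, not the real eval bundle location:

```bash
python - <<'PY'
import os
import urllib.request

BUNDLE_URL = "https://example.com/eval_bundle.zip"                      # placeholder URL
BUNDLE_PATH = os.path.expanduser("~/.cache/nanochat/eval_bundle.zip")   # placeholder path

def get_eval_bundle() -> str:
    """Fetch the bundle only on first use, so launch scripts need no download step."""
    if not os.path.exists(BUNDLE_PATH):
        os.makedirs(os.path.dirname(BUNDLE_PATH), exist_ok=True)
        urllib.request.urlretrieve(BUNDLE_URL, BUNDLE_PATH)
    return BUNDLE_PATH

print(get_eval_bundle())
PY
```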
Luke Stanley
901b075605 Fix GPU-less CPU use on Linux with specific Torch indexes 2025-10-21 23:14:16 +00:00
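A sketch of the kind of index pinning that commit refers to, assuming the PyTorch CPU wheel index; the project's actual configuration (e.g. in `pyproject.toml`) may express this differently:

```bash
# Install a CPU-only torch build from the dedicated CPU wheel index so a
# GPU-less Linux host does not pull CUDA wheels.
uv pip install torch --index-url https://download.pytorch.org/whl/cpu
```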
Andrej Karpathy
fe5aed940b add personality to nanochat. breaks previous code on git pull and requires download of a new file from s3, but there is a helpful error message so hopefully it's ok 2025-10-21 15:04:58 +00:00
Zach Mueller
f0855cbcc7 Update speedrun.sh 2025-10-14 14:12:01 -04:00
karpathy
3a5e0bc50b initial commit 2025-10-13 06:49:24 -07:00