nanochat

mirror of https://github.com/karpathy/nanochat.git synced 2025-12-06 04:12:13 +00:00

Author	SHA1	Message	Date
google-labs-jules[bot]	962deeefb6	Fix HIP invalid device ordinal error on multi-GPU setup The `speedrun.sh` script was hardcoding `NPROC_PER_NODE=8` if any GPU capability was detected, causing crashes on systems with fewer than 8 GPUs. Additionally, `nanochat/common.py` was autodetecting "cuda" even if `torch.cuda.device_count()` was 0 on some ROCm builds, leading to "invalid device ordinal" errors. Changes: - `speedrun.sh`: Dynamically set `NPROC_PER_NODE` using `torch.cuda.device_count()`. - `nanochat/common.py`: Ensure `autodetect_device_type` only returns "cuda" if devices are actually present.	2025-11-23 05:34:20 +00:00
google-labs-jules[bot]	b92647c580	Fix AMD Triton runtime error by reinstalling pytorch-triton-rocm Uninstalling the conflicting `triton` package (upstream) on AMD systems often removes the `triton` directory shared with `pytorch-triton-rocm`, breaking the latter. This caused `ImportError: cannot import name 'Config' from 'triton'`. This change adds a step to force reinstall `pytorch-triton-rocm` immediately after uninstalling `triton`, ensuring the correct package is present and intact for the runtime.	2025-11-23 05:22:21 +00:00
google-labs-jules[bot]	d291a62ad8	Fix AMD Triton re-installation issue in speedrun.sh On AMD ROCm environments, `uv run` was detecting that the manually uninstalled `triton` package was missing (since it's a transitive dependency of `torch`) and reinstalling it during the tokenizer build step. This caused `ImportError: cannot import name 'Config' from 'triton'` due to conflict with `pytorch-triton-rocm`. This change adds `--no-sync` to the `uv run` command for building the tokenizer, preventing `uv` from undoing the manual uninstallation of `triton`.	2025-11-23 04:26:32 +00:00
google-labs-jules[bot]	994491b28d	Move triton uninstall after uv sync in speedrun.sh for AMD	2025-11-23 03:50:46 +00:00
google-labs-jules[bot]	8881ea84bf	Fix AMD Triton conflict in speedrun.sh Explicitly uninstall `triton` when an AMD GPU is detected. The standard `triton` package (often pulled in by NVIDIA dependencies or by accident) conflicts with `pytorch-triton-rocm` on AMD systems, causing `ImportError: cannot import name 'Config' from 'triton'`. This change ensures a clean ROCm environment by removing the conflicting package. Also retains the `uv run --extra $EXTRAS` fix from the previous step.	2025-11-23 03:38:56 +00:00
google-labs-jules[bot]	83bb650b49	Fix AMD ROCm install regression in speedrun.sh Explicitly pass `--extra $EXTRAS` to `uv run` when building the tokenizer. This prevents `uv` from reverting to the default (NVIDIA) dependency set during the `maturin` build step, ensuring the correct PyTorch version (ROCm) is preserved on AMD hardware.	2025-11-23 02:33:07 +00:00
google-labs-jules[bot]	083de95913	Fix hardware detection for AMD ROCm and single-process CPU crashes	2025-11-22 23:50:50 +00:00
google-labs-jules[bot]	48e632245e	Fix ROCm/APU detection and CPU DDP OOM crash	2025-11-22 09:18:40 +00:00
google-labs-jules[bot]	a35621e726	Fix CPU DDP crashes: Init Gloo backend, prevent OOM by reducing NPROC, add script safety	2025-11-22 05:31:47 +00:00
svlandeg	f1683c5b16	set nproc_per_node as var in speedrun and run1000 scripts	2025-11-04 21:36:10 +01:00
Andrej Karpathy	cf587acb1a	move eval bundle download to be lazy and inside the python code so that we can substantially simplify the run bash scripts	2025-11-01 16:04:38 +00:00
Luke Stanley	901b075605	Fix GPU-less CPU use on Linux with specific Torch indexes	2025-10-21 23:14:16 +00:00
Andrej Karpathy	fe5aed940b	add personality to nanochat. breaks previous code on git pull and requires download of a new file from s3, but there is a helpful error message so hopefully it's ok	2025-10-21 15:04:58 +00:00
Zach Mueller	f0855cbcc7	Update speedrun.sh	2025-10-14 14:12:01 -04:00
karpathy	3a5e0bc50b	initial commit	2025-10-13 06:49:24 -07:00

15 Commits
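
The device-detection fixes described in commits 962deeefb6 and a35621e726 can be illustrated with a minimal sketch. This is not the code from `nanochat/common.py` or `speedrun.sh`. The function names `pick_device_type`, `pick_nproc_per_node`, and `probe_torch` are hypothetical, and the choice of a single process on CPU is an assumption based on the "reducing NPROC" wording in the log. The idea is to report "cuda" only when at least one device is actually visible, and to derive the process count from the device count instead of hardcoding 8.

```python
def pick_device_type(cuda_available: bool, cuda_device_count: int) -> str:
    """Return "cuda" only if CUDA/ROCm is available AND at least one device is visible.

    Hypothetical helper: some ROCm builds may report availability while
    exposing zero devices, which would otherwise lead to
    "invalid device ordinal" errors.
    """
    if cuda_available and cuda_device_count > 0:
        return "cuda"
    return "cpu"


def pick_nproc_per_node(device_type: str, cuda_device_count: int) -> int:
    """Choose a process count for a torchrun-style launcher (hypothetical helper).

    One process per visible GPU; a single process on CPU (an assumption,
    made here to avoid the out-of-memory crashes mentioned in the log).
    """
    if device_type == "cuda":
        return max(1, cuda_device_count)
    return 1


def probe_torch() -> tuple:
    """Query PyTorch if it is installed; fall back to "no GPU" otherwise."""
    try:
        import torch
    except ImportError:
        return False, 0
    return torch.cuda.is_available(), torch.cuda.device_count()


if __name__ == "__main__":
    available, count = probe_torch()
    device_type = pick_device_type(available, count)
    print(device_type, pick_nproc_per_node(device_type, count))
```

Keeping the decision logic separate from the PyTorch probe means the rules can be checked on a machine without any GPU or without PyTorch installed.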