Commit Graph

172 Commits

Author SHA1 Message Date
Lawrence R Kincheloe III
c1fc4400b0
Merge pull request #20 from LokiMetaSmith/fix-amd-triton-reinstall
Fix HIP invalid device ordinal error on multi-GPU setup
2025-11-22 23:34:51 -06:00
google-labs-jules[bot]
962deeefb6 Fix HIP invalid device ordinal error on multi-GPU setup
The `speedrun.sh` script hardcoded `NPROC_PER_NODE=8` whenever any GPU capability was detected, causing crashes on systems with fewer than 8 GPUs. Additionally, `nanochat/common.py` autodetected "cuda" even when `torch.cuda.device_count()` returned 0 on some ROCm builds, leading to "invalid device ordinal" errors.

Changes:
- `speedrun.sh`: Dynamically set `NPROC_PER_NODE` using `torch.cuda.device_count()`.
- `nanochat/common.py`: Ensure `autodetect_device_type` only returns "cuda" if devices are actually present.
2025-11-23 05:34:20 +00:00
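The guard described above can be sketched as follows. This is a minimal illustration with injected callables so it runs without a GPU; the hypothetical signature differs from nanochat's actual `autodetect_device_type`, which queries `torch.cuda` directly.

```python
# Sketch of a device-type autodetect that only reports "cuda" when
# device ordinals are actually visible. Some ROCm builds report
# is_available() == True while exposing zero devices, which later
# fails with "invalid device ordinal".
def autodetect_device_type(cuda_available, cuda_device_count):
    if cuda_available() and cuda_device_count() > 0:
        return "cuda"
    return "cpu"
```

The same count drives `NPROC_PER_NODE` in `speedrun.sh`, so machines with fewer than 8 GPUs no longer launch 8 ranks.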
Lawrence R Kincheloe III
23695f817d
Merge pull request #19 from LokiMetaSmith/fix-amd-triton-reinstall
Fix AMD Triton runtime error by reinstalling pytorch-triton-rocm
2025-11-22 23:22:57 -06:00
google-labs-jules[bot]
b92647c580 Fix AMD Triton runtime error by reinstalling pytorch-triton-rocm
Uninstalling the conflicting `triton` package (upstream) on AMD systems often removes the `triton` directory shared with `pytorch-triton-rocm`, breaking the latter. This caused `ImportError: cannot import name 'Config' from 'triton'`.

This change adds a step to force-reinstall `pytorch-triton-rocm` immediately after uninstalling `triton`, ensuring the correct package is present and intact at runtime.
2025-11-23 05:22:21 +00:00
Lawrence R Kincheloe III
44476a3512
Merge pull request #18 from LokiMetaSmith/fix-amd-triton-reinstall
Fix AMD Triton re-installation issue in speedrun.sh
2025-11-22 22:26:58 -06:00
google-labs-jules[bot]
d291a62ad8 Fix AMD Triton re-installation issue in speedrun.sh
In AMD ROCm environments, `uv run` was detecting that the manually uninstalled `triton` package was missing (since it's a transitive dependency of `torch`) and reinstalling it during the tokenizer build step. This caused `ImportError: cannot import name 'Config' from 'triton'` due to a conflict with `pytorch-triton-rocm`.

This change adds `--no-sync` to the `uv run` command for building the tokenizer, preventing `uv` from undoing the manual uninstallation of `triton`.
2025-11-23 04:26:32 +00:00
Lawrence R Kincheloe III
054394c708
Merge pull request #17 from LokiMetaSmith/amd-triton-fix
Move triton uninstall after uv sync in speedrun.sh for AMD
2025-11-22 21:51:45 -06:00
google-labs-jules[bot]
994491b28d Move triton uninstall after uv sync in speedrun.sh for AMD 2025-11-23 03:50:46 +00:00
Lawrence R Kincheloe III
33b6b800fa
Merge pull request #16 from LokiMetaSmith/fix-amd-install
Fix AMD Triton conflict in speedrun.sh
2025-11-22 21:40:05 -06:00
google-labs-jules[bot]
8881ea84bf Fix AMD Triton conflict in speedrun.sh
Explicitly uninstall `triton` when an AMD GPU is detected.
The standard `triton` package (often pulled in by NVIDIA dependencies or by accident)
conflicts with `pytorch-triton-rocm` on AMD systems, causing
`ImportError: cannot import name 'Config' from 'triton'`.
This change ensures a clean ROCm environment by removing the conflicting package.
It also retains the `uv run --extra $EXTRAS` fix from the previous step.
2025-11-23 03:38:56 +00:00
Lawrence R Kincheloe III
d46e9a72d4
Merge pull request #15 from LokiMetaSmith/fix-amd-install
Fix AMD ROCm install regression in speedrun.sh
2025-11-22 20:33:32 -06:00
google-labs-jules[bot]
83bb650b49 Fix AMD ROCm install regression in speedrun.sh
Explicitly pass `--extra $EXTRAS` to `uv run` when building the tokenizer.
This prevents `uv` from reverting to the default (NVIDIA) dependency set
during the `maturin` build step, ensuring the correct PyTorch version
(ROCm) is preserved on AMD hardware.
2025-11-23 02:33:07 +00:00
Lawrence R Kincheloe III
dd37f29fe4
Update Python version and torch dependencies
Updated Python version requirement and adjusted torch dependencies for CPU, GPU, and AMD support.
2025-11-22 20:02:59 -06:00
Lawrence R Kincheloe III
1af926205d
Update Python version from 3.10 to 3.12 2025-11-22 20:00:57 -06:00
Lawrence R Kincheloe III
ddc51d34df
Merge pull request #14 from LokiMetaSmith/fix-cpu-ddp-init
Fix hardware detection for AMD ROCm and single-process CPU crashes
2025-11-22 17:52:07 -06:00
google-labs-jules[bot]
083de95913 Fix hardware detection for AMD ROCm and single-process CPU crashes 2025-11-22 23:50:50 +00:00
Lawrence R Kincheloe III
b23494d2e2
Update .gitignore
Add `output.txt` to `.gitignore`
2025-11-22 17:04:27 -06:00
Lawrence R Kincheloe III
3b3113c8d2
Merge pull request #13 from LokiMetaSmith/fix-cpu-ddp-init
Fix CPU DDP crashes, enable ROCm detection, and prevent single-proces…
2025-11-22 12:07:29 -06:00
google-labs-jules[bot]
28ef4c528e Fix CPU DDP crashes, enable ROCm detection, and prevent single-process distributed optimizer errors 2025-11-22 18:06:58 +00:00
Lawrence R Kincheloe III
36df08a5a9
Merge pull request #12 from LokiMetaSmith/fix-cpu-ddp-init
Fix ROCm/APU detection and CPU DDP OOM crash
2025-11-22 03:19:05 -06:00
google-labs-jules[bot]
48e632245e Fix ROCm/APU detection and CPU DDP OOM crash 2025-11-22 09:18:40 +00:00
Lawrence R Kincheloe III
8009354739
Merge pull request #11 from LokiMetaSmith/fix-cpu-ddp-init
Fix CPU DDP crashes: Init Gloo backend, prevent OOM by reducing NPROC…
2025-11-22 01:35:48 -06:00
google-labs-jules[bot]
a35621e726 Fix CPU DDP crashes: Init Gloo backend, prevent OOM by reducing NPROC, add script safety 2025-11-22 05:31:47 +00:00
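The backend choice implied by "Init Gloo backend" can be illustrated with a small helper (the function name is hypothetical; the real script would pass the result to `torch.distributed.init_process_group`):

```python
# NCCL only supports CUDA/ROCm devices, so CPU-only runs must fall
# back to Gloo or process-group initialization fails outright.
def ddp_backend(device_type):
    return "nccl" if device_type == "cuda" else "gloo"
```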
Lawrence R Kincheloe III
b5fd54ac1c
Merge pull request #10 from LokiMetaSmith/fix-cpu-ddp-init
Fix process group initialization for CPU DDP and improve cleanup safety
2025-11-21 17:42:06 -06:00
google-labs-jules[bot]
9235fe4000 Fix process group initialization for CPU DDP and improve cleanup safety 2025-11-21 23:41:34 +00:00
Lawrence R Kincheloe III
104308cf78
Merge pull request #9 from LokiMetaSmith/fix-dataloader-typeerror
Fix TypeError in tokenizing_distributed_data_loader and robustness in…
2025-11-20 23:12:47 -06:00
google-labs-jules[bot]
f97e55eb93 Fix TypeError in tokenizing_distributed_data_loader and robustness in configurator.py
- Explicitly add `device` argument to `tokenizing_distributed_data_loader` in `nanochat/dataloader.py` to prevent `TypeError: unexpected keyword argument 'device'` when called from `scripts/base_train.py`.
- Update `nanochat/configurator.py` to ignore command-line flags starting with `--` (e.g. `--help`) instead of raising `AssertionError`, improving robustness when running with various launchers or flags.
2025-11-21 05:12:07 +00:00
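The configurator change can be sketched as below. Names and parsing details are illustrative, not nanochat's exact code: `--key=value` overrides are applied, while bare flags such as `--help` injected by a launcher are skipped instead of tripping an assertion.

```python
# Apply --key=value overrides to a config dict, tolerating bare
# launcher flags (e.g. --help) rather than raising AssertionError.
def apply_overrides(config, argv):
    for arg in argv:
        if arg.startswith("--") and "=" not in arg:
            continue  # ignore flags with no value instead of asserting
        key, _, value = arg.lstrip("-").partition("=")
        if key in config:
            # cast to the type of the existing default
            config[key] = type(config[key])(value)
    return config
```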
Lawrence R Kincheloe III
6a1bfa919f
Merge pull request #8 from karpathy/master
update the fork from upstream karpathy/master
2025-11-20 20:57:15 -06:00
Andrej
4a87a0d19f
Merge pull request #299 from samjabrahams/rotary_embedding_head_dim_comment_cleanup
Fix comment: rotary embeddings final dimension size
2025-11-17 13:29:21 -08:00
Sam Abrahams
11e68bf442 Fix comment: rotary embeddings final dimension size 2025-11-17 11:32:56 -05:00
Andrej Karpathy
bc1fca39f3 mqa -> gqa to reduce confusion 2025-11-15 15:43:37 +00:00
Andrej
f66a780f68
Fix torch.dtype mismatch when running the engine inline test. 2025-11-14 07:28:29 -08:00
Andrej
4763ce612a
Small fixes to typos 2025-11-14 07:25:59 -08:00
Sofie Van Landeghem
c6f5bd67db
revert change of base to sft for quick inline test 2025-11-14 12:20:03 +01:00
svlandeg
a2fb3c83a6 fix typos 2025-11-14 11:20:25 +01:00
svlandeg
e5efb4b471 add test_engine.py to file structure 2025-11-14 11:13:42 +01:00
Andrej Karpathy
9a71d13688 typo oops 2025-11-13 16:08:30 +00:00
Andrej Karpathy
7b7fd0fe71 thank you Sofie for your help with nanochat 2025-11-13 16:07:54 +00:00
Andrej Karpathy
c6abcdfe3a big change: add pretraining resumption logic so that checkpoints can now be approximately resumed and training can continue. this is useful for very long runs where you don't want the anxiety of your run crashing, and it doubles as a way to recover training after a loss spike. this should have been there in v0, but it's here now. the resumption is approximate to control complexity and bloat, though that may change in the future. to use it, set --save_every to the step interval at which checkpoints are written, then pass --resume_from_step to resume optimization from a given step. only base model training (pretraining) supports this at the moment, which is acceptable because midtraining is comparably quite a bit faster. 2025-11-13 15:34:40 +00:00
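The save/resume mechanics described in this commit can be sketched as step-interval checkpointing. The file layout, names, and JSON format below are made up for this illustration, not nanochat's actual checkpoint format.

```python
import json
import os

# Write a checkpoint every `save_every` steps; resume by loading the
# one written at `resume_from_step` and continuing the loop from there.
def ckpt_path(dirpath, step):
    return os.path.join(dirpath, f"ckpt_{step:06d}.json")

def save_checkpoint(dirpath, step, state):
    os.makedirs(dirpath, exist_ok=True)
    with open(ckpt_path(dirpath, step), "w") as f:
        json.dump({"step": step, "state": state}, f)

def load_checkpoint(dirpath, step):
    with open(ckpt_path(dirpath, step)) as f:
        return json.load(f)

def train(dirpath, total_steps, save_every, resume_from_step=None):
    step, state = 0, {"loss": None}
    if resume_from_step is not None:
        ckpt = load_checkpoint(dirpath, resume_from_step)
        step, state = ckpt["step"], ckpt["state"]
    while step < total_steps:
        step += 1
        state["loss"] = 1.0 / step  # stand-in for one optimization step
        if step % save_every == 0:
            save_checkpoint(dirpath, step, state)
    return state
```

Resumption is "approximate" in the same sense as the commit: optimizer and dataloader state beyond what is serialized here would be reconstructed rather than restored bit-for-bit.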
Andrej Karpathy
91f09ccd0d minor fix comment in engine 2025-11-13 15:28:18 +00:00
Andrej Karpathy
adb5d4a16c uv.lock has to change since we removed numpy in the other commit 2025-11-13 15:16:27 +00:00
howardgao@outlook.com
b399e43168 fix engine test bug 2025-11-06 08:56:45 +08:00
Andrej Karpathy
c6b7ab7440 grad clip logging and printing and cosmetics 2025-11-05 21:08:30 +00:00
Andrej
885a4f25e7
Replace fcntl with filelock for Windows compatibility 2025-11-04 16:35:39 -08:00
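The motivation for this swap: `fcntl` is a Unix-only module, so any `import fcntl` fails on Windows, while the `filelock` package wraps the platform-appropriate locking call. As a rough stdlib-only illustration of the same idea (this is not how `filelock` is implemented), an `O_CREAT|O_EXCL` sentinel file gives cross-platform mutual exclusion:

```python
import os
from contextlib import contextmanager

@contextmanager
def exclusive_lock(path):
    # O_EXCL makes creation atomic: a second opener gets FileExistsError
    # instead of the lock, on Windows and Unix alike.
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    try:
        yield
    finally:
        os.close(fd)
        os.remove(path)
```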
Andrej
3a2ae631c4
Merge branch 'master' into master 2025-11-04 16:35:02 -08:00
Andrej
12d995f58c
Add NPROC_PER_NODE var to speedrun.sh and run1000.sh 2025-11-04 16:26:33 -08:00
svlandeg
f1683c5b16 set nproc_per_node as var in speedrun and run1000 scripts 2025-11-04 21:36:10 +01:00
Andrej
d1558c7873
handle bf16 on MPS by casting to fp32 during checkpoint load 2025-11-04 09:42:50 -08:00
Andrej
df25293087
Add explicit UTF-8 encoding on open 2025-11-04 09:38:18 -08:00
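The rationale for explicit encoding: without it, Python's `open()` uses a locale-dependent default (e.g. cp1252 on many Windows setups), so non-ASCII text written on Linux can fail to read back elsewhere. A minimal sketch with hypothetical helper names:

```python
def write_text(path, text):
    # Explicit encoding makes file contents identical on every platform.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

def read_text(path):
    with open(path, encoding="utf-8") as f:
        return f.read()
```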
Yasser Makram
1e89af9862 Replace fcntl with filelock for Windows compatibility 2025-11-04 07:22:34 +00:00