Commit Graph

180 Commits

Author SHA1 Message Date
Lawrence R Kincheloe III
cbdca27e27
Merge pull request #24 from LokiMetaSmith/fix-amd-triton-reinstall
Reduce base_train batch size and set PYTORCH_HIP_ALLOC_CONF
2025-11-23 10:03:31 -06:00
google-labs-jules[bot]
bbc816dc77 Reduce base_train batch size and set PYTORCH_HIP_ALLOC_CONF
To address "HIP out of memory" errors on some AMD ROCm configurations (potentially due to memory fragmentation or limited per-device VRAM), this change:
1. Reduces the default `device_batch_size` from 32 to 16.
2. Explicitly sets `PYTORCH_HIP_ALLOC_CONF=expandable_segments:True` when ROCm is detected, which helps the allocator manage fragmented memory better than the default behavior.
2025-11-23 16:03:02 +00:00
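The allocator part of the change above can be sketched in Python. `configure_hip_allocator` is a hypothetical helper name (the real change lives in the training setup), and the variable must be set before the first HIP allocation to take effect:

```python
import os

def configure_hip_allocator(is_rocm: bool, env=None):
    """On ROCm, opt into expandable segments so the caching allocator can
    grow segments instead of fragmenting VRAM. (Hypothetical helper name;
    sketch of the change described in the commit.)"""
    env = os.environ if env is None else env
    if is_rocm:
        # setdefault keeps any value the user already exported in their shell
        env.setdefault("PYTORCH_HIP_ALLOC_CONF", "expandable_segments:True")
    return env

# Simulated environment so the sketch runs without a GPU
fake_env = {}
configure_hip_allocator(is_rocm=True, env=fake_env)
print(fake_env["PYTORCH_HIP_ALLOC_CONF"])  # expandable_segments:True
```

Using `setdefault` rather than a plain assignment means a user-supplied `PYTORCH_HIP_ALLOC_CONF` wins over the script's default.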
Lawrence R Kincheloe III
e14d7ba6bf
Merge pull request #23 from LokiMetaSmith/fix-amd-triton-reinstall
Explicitly enable allow_tf32 in nanochat/common.py
2025-11-23 02:28:00 -06:00
google-labs-jules[bot]
41ba458c3b Explicitly enable allow_tf32 in nanochat/common.py
Even when calling `torch.set_float32_matmul_precision('high')`, the `torch.compile` (Inductor) backend on some ROCm versions may still warn that TensorFloat32 is available but not enabled. This change explicitly sets `torch.backends.cuda.matmul.allow_tf32 = True` to ensure the setting is active and to silence the warning.
2025-11-23 08:27:13 +00:00
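A minimal sketch of the flag change (not the repo's exact code), written against a passed-in module; note that ROCm builds of torch reuse the `torch.backends.cuda` namespace, so the same flag applies on HIP. `enable_tf32` is a hypothetical name:

```python
def enable_tf32(torch_module):
    """Request TF32 matmuls and also set the backend flag explicitly, which
    silences the Inductor warning on some ROCm versions. (Sketch only.)"""
    torch_module.set_float32_matmul_precision("high")
    torch_module.backends.cuda.matmul.allow_tf32 = True
    return torch_module.backends.cuda.matmul.allow_tf32

# Applied to the real module at import time:
#   import torch
#   enable_tf32(torch)
```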
Lawrence R Kincheloe III
40ef6e81a9
Merge pull request #22 from LokiMetaSmith/fix-amd-triton-reinstall
Export TRITON_HIP_LLD_PATH in speedrun.sh for AMD ROCm
2025-11-23 02:19:33 -06:00
google-labs-jules[bot]
68148b1bf3 Export TRITON_HIP_LLD_PATH in speedrun.sh for AMD ROCm
When running on AMD ROCm using `uv`-installed packages (`rocm-sdk-core`), the `ld.lld` linker is not in the default `/opt/rocm/llvm/bin/` location expected by `pytorch-triton-rocm`. This causes `InductorError` during `torch.compile`.

This change updates `speedrun.sh` to dynamically find the `ld.lld` binary within the active Python environment's site-packages (`_rocm_sdk_core`) and export the `TRITON_HIP_LLD_PATH` environment variable, allowing Triton to locate the linker correctly.
2025-11-23 08:19:07 +00:00
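The discovery logic can be sketched in Python (the real change is shell in `speedrun.sh`). `find_rocm_lld` and `export_lld_path` are hypothetical names, and the `_rocm_sdk_core` directory layout is taken from the commit message, not a stable API:

```python
import glob
import os

def find_rocm_lld(site_packages: str):
    """Search a uv-installed ROCm SDK for the bundled ld.lld linker.
    Returns the first match, or None if the SDK is not present."""
    pattern = os.path.join(site_packages, "_rocm_sdk_core", "**", "ld.lld")
    matches = glob.glob(pattern, recursive=True)
    return matches[0] if matches else None

def export_lld_path(site_packages: str, env: dict) -> dict:
    """Export TRITON_HIP_LLD_PATH when the linker is found; Triton reads
    this variable when compiling HIP kernels."""
    lld = find_rocm_lld(site_packages)
    if lld is not None:
        env["TRITON_HIP_LLD_PATH"] = lld
    return env
```

In the real script, `site_packages` would come from the active environment (e.g. via `sysconfig.get_paths()["purelib"]`).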
Lawrence R Kincheloe III
da035bf408
Merge pull request #21 from LokiMetaSmith/fix-amd-triton-reinstall
Use gloo backend for DDP on AMD ROCm to avoid NCCL crashes
2025-11-23 00:49:55 -06:00
google-labs-jules[bot]
1f9b734358 Use gloo backend for DDP on AMD ROCm to avoid NCCL crashes
On consumer AMD hardware (like APUs or gaming GPUs) running ROCm, the default `nccl` backend (which wraps RCCL) often fails with `invalid device function` due to architecture mismatches or kernel issues.

This change detects the presence of `torch.version.hip` and forces the `gloo` backend for `torch.distributed.init_process_group`. While `gloo` is slower for data transfer, it is CPU-based and significantly more robust for these setups, ensuring the training script can run without crashing.
2025-11-23 06:49:07 +00:00
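The detection described above can be sketched as a tiny selector (`pick_ddp_backend` is a hypothetical name): `torch.version.hip` is a version string on ROCm builds and `None` on CUDA/CPU builds, which makes it a convenient discriminator:

```python
def pick_ddp_backend(hip_version):
    """Choose the torch.distributed backend: gloo on ROCm to sidestep the
    RCCL 'invalid device function' crashes on consumer AMD hardware,
    nccl otherwise. (Sketch of the commit's logic.)"""
    return "gloo" if hip_version is not None else "nccl"

# In the real setup this would feed init_process_group, roughly:
#   dist.init_process_group(backend=pick_ddp_backend(torch.version.hip), ...)
print(pick_ddp_backend("6.2.41133"))  # gloo
print(pick_ddp_backend(None))         # nccl
```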
Lawrence R Kincheloe III
c1fc4400b0
Merge pull request #20 from LokiMetaSmith/fix-amd-triton-reinstall
Fix HIP invalid device ordinal error on multi-GPU setup
2025-11-22 23:34:51 -06:00
google-labs-jules[bot]
962deeefb6 Fix HIP invalid device ordinal error on multi-GPU setup
The `speedrun.sh` script was hardcoding `NPROC_PER_NODE=8` if any GPU capability was detected, causing crashes on systems with fewer than 8 GPUs. Additionally, `nanochat/common.py` was autodetecting "cuda" even if `torch.cuda.device_count()` was 0 on some ROCm builds, leading to "invalid device ordinal" errors.

Changes:
- `speedrun.sh`: Dynamically set `NPROC_PER_NODE` using `torch.cuda.device_count()`.
- `nanochat/common.py`: Ensure `autodetect_device_type` only returns "cuda" if devices are actually present.
2025-11-23 05:34:20 +00:00
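Both fixes reduce to small pure functions, sketched here with hypothetical names (the real changes live in `speedrun.sh` and `nanochat/common.py`):

```python
def autodetect_device_type(cuda_available: bool, device_count: int) -> str:
    """Only report "cuda" when devices can actually be enumerated; some
    ROCm builds claim availability while device_count() is 0, which then
    triggers 'invalid device ordinal' downstream."""
    return "cuda" if cuda_available and device_count > 0 else "cpu"

def nproc_per_node(device_count: int) -> int:
    """Replace the hardcoded 8 with the real GPU count, falling back to a
    single process on CPU-only machines."""
    return max(1, device_count)

print(autodetect_device_type(True, 0))  # cpu
print(nproc_per_node(2))                # 2
```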
Lawrence R Kincheloe III
23695f817d
Merge pull request #19 from LokiMetaSmith/fix-amd-triton-reinstall
Fix AMD Triton runtime error by reinstalling pytorch-triton-rocm
2025-11-22 23:22:57 -06:00
google-labs-jules[bot]
b92647c580 Fix AMD Triton runtime error by reinstalling pytorch-triton-rocm
Uninstalling the conflicting `triton` package (upstream) on AMD systems often removes the `triton` directory shared with `pytorch-triton-rocm`, breaking the latter. This caused `ImportError: cannot import name 'Config' from 'triton'`.

This change adds a step to force reinstall `pytorch-triton-rocm` immediately after uninstalling `triton`, ensuring the correct package is present and intact for the runtime.
2025-11-23 05:22:21 +00:00
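The two-step repair can be sketched as a command list (illustrative only; the real step is shell in `speedrun.sh`, and `--reinstall` is assumed here as uv's flag for laying the package files down again even when the package is already recorded as installed):

```python
def triton_repair_cmds():
    """Uninstalling upstream `triton` can delete files it shares with
    `pytorch-triton-rocm`, so the latter is reinstalled immediately after.
    (Illustrative command builder, not the repo's shell code.)"""
    return [
        ["uv", "pip", "uninstall", "triton"],
        ["uv", "pip", "install", "--reinstall", "pytorch-triton-rocm"],
    ]

# e.g.:
#   import subprocess
#   for cmd in triton_repair_cmds():
#       subprocess.run(cmd, check=True)
```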
Lawrence R Kincheloe III
44476a3512
Merge pull request #18 from LokiMetaSmith/fix-amd-triton-reinstall
Fix AMD Triton re-installation issue in speedrun.sh
2025-11-22 22:26:58 -06:00
google-labs-jules[bot]
d291a62ad8 Fix AMD Triton re-installation issue in speedrun.sh
On AMD ROCm environments, `uv run` was detecting that the manually uninstalled `triton` package was missing (since it is a transitive dependency of `torch`) and reinstalling it during the tokenizer build step. This caused `ImportError: cannot import name 'Config' from 'triton'` due to a conflict with `pytorch-triton-rocm`.

This change adds `--no-sync` to the `uv run` command for building the tokenizer, preventing `uv` from undoing the manual uninstallation of `triton`.
2025-11-23 04:26:32 +00:00
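The mechanism can be sketched as a command builder (illustrative; the script name passed in is a made-up example, not the repo's actual entry point): with `--no-sync`, `uv run` skips reconciling the environment against the lockfile, so a hand-removed transitive dependency like `triton` stays removed.

```python
def uv_run(no_sync: bool, *args):
    """Build a `uv run` invocation; --no-sync prevents uv from restoring
    lockfile packages that were deliberately uninstalled by hand."""
    cmd = ["uv", "run"]
    if no_sync:
        cmd.append("--no-sync")
    return cmd + list(args)

print(uv_run(True, "python", "-m", "some_build_step"))
```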
Lawrence R Kincheloe III
054394c708
Merge pull request #17 from LokiMetaSmith/amd-triton-fix
Move triton uninstall after uv sync in speedrun.sh for AMD
2025-11-22 21:51:45 -06:00
google-labs-jules[bot]
994491b28d Move triton uninstall after uv sync in speedrun.sh for AMD 2025-11-23 03:50:46 +00:00
Lawrence R Kincheloe III
33b6b800fa
Merge pull request #16 from LokiMetaSmith/fix-amd-install
Fix AMD Triton conflict in speedrun.sh
2025-11-22 21:40:05 -06:00
google-labs-jules[bot]
8881ea84bf Fix AMD Triton conflict in speedrun.sh
Explicitly uninstall `triton` when an AMD GPU is detected. The standard `triton` package (often pulled in by NVIDIA dependencies or by accident) conflicts with `pytorch-triton-rocm` on AMD systems, causing `ImportError: cannot import name 'Config' from 'triton'`. This change ensures a clean ROCm environment by removing the conflicting package. It also retains the `uv run --extra $EXTRAS` fix from the previous step.
2025-11-23 03:38:56 +00:00
Lawrence R Kincheloe III
d46e9a72d4
Merge pull request #15 from LokiMetaSmith/fix-amd-install
Fix AMD ROCm install regression in speedrun.sh
2025-11-22 20:33:32 -06:00
google-labs-jules[bot]
83bb650b49 Fix AMD ROCm install regression in speedrun.sh
Explicitly pass `--extra $EXTRAS` to `uv run` when building the tokenizer. This prevents `uv` from reverting to the default (NVIDIA) dependency set during the `maturin` build step, ensuring the correct PyTorch version (ROCm) is preserved on AMD hardware.
2025-11-23 02:33:07 +00:00
Lawrence R Kincheloe III
dd37f29fe4
Update Python version and torch dependencies
Updated Python version requirement and adjusted torch dependencies for CPU, GPU, and AMD support.
2025-11-22 20:02:59 -06:00
Lawrence R Kincheloe III
1af926205d
Update Python version from 3.10 to 3.12 2025-11-22 20:00:57 -06:00
Lawrence R Kincheloe III
ddc51d34df
Merge pull request #14 from LokiMetaSmith/fix-cpu-ddp-init
Fix hardware detection for AMD ROCm and single-process CPU crashes
2025-11-22 17:52:07 -06:00
google-labs-jules[bot]
083de95913 Fix hardware detection for AMD ROCm and single-process CPU crashes 2025-11-22 23:50:50 +00:00
Lawrence R Kincheloe III
b23494d2e2
Update .gitignore
adding output.txt to git ignore
2025-11-22 17:04:27 -06:00
Lawrence R Kincheloe III
3b3113c8d2
Merge pull request #13 from LokiMetaSmith/fix-cpu-ddp-init
Fix CPU DDP crashes, enable ROCm detection, and prevent single-proces…
2025-11-22 12:07:29 -06:00
google-labs-jules[bot]
28ef4c528e Fix CPU DDP crashes, enable ROCm detection, and prevent single-process distributed optimizer errors 2025-11-22 18:06:58 +00:00
Lawrence R Kincheloe III
36df08a5a9
Merge pull request #12 from LokiMetaSmith/fix-cpu-ddp-init
Fix ROCm/APU detection and CPU DDP OOM crash
2025-11-22 03:19:05 -06:00
google-labs-jules[bot]
48e632245e Fix ROCm/APU detection and CPU DDP OOM crash 2025-11-22 09:18:40 +00:00
Lawrence R Kincheloe III
8009354739
Merge pull request #11 from LokiMetaSmith/fix-cpu-ddp-init
Fix CPU DDP crashes: Init Gloo backend, prevent OOM by reducing NPROC…
2025-11-22 01:35:48 -06:00
google-labs-jules[bot]
a35621e726 Fix CPU DDP crashes: Init Gloo backend, prevent OOM by reducing NPROC, add script safety 2025-11-22 05:31:47 +00:00
Lawrence R Kincheloe III
b5fd54ac1c
Merge pull request #10 from LokiMetaSmith/fix-cpu-ddp-init
Fix process group initialization for CPU DDP and improve cleanup safety
2025-11-21 17:42:06 -06:00
google-labs-jules[bot]
9235fe4000 Fix process group initialization for CPU DDP and improve cleanup safety 2025-11-21 23:41:34 +00:00
Lawrence R Kincheloe III
104308cf78
Merge pull request #9 from LokiMetaSmith/fix-dataloader-typeerror
Fix TypeError in tokenizing_distributed_data_loader and robustness in…
2025-11-20 23:12:47 -06:00
google-labs-jules[bot]
f97e55eb93 Fix TypeError in tokenizing_distributed_data_loader and robustness in configurator.py
- Explicitly add `device` argument to `tokenizing_distributed_data_loader` in `nanochat/dataloader.py` to prevent `TypeError: unexpected keyword argument 'device'` when called from `scripts/base_train.py`.
- Update `nanochat/configurator.py` to ignore command-line flags starting with `--` (e.g. `--help`) instead of raising `AssertionError`, improving robustness when running with various launchers or flags.
2025-11-21 05:12:07 +00:00
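The configurator half of the change can be sketched as a small parser (`parse_overrides` is a hypothetical name; nanochat's actual configurator executes overrides differently): accept `key=value` pairs and silently skip `--` flags injected by launchers instead of asserting.

```python
def parse_overrides(argv):
    """Collect key=value config overrides, ignoring --flags (e.g. --help,
    --local-rank=0) that various launchers add to argv. (Sketch of the
    behavior described in the commit.)"""
    overrides = {}
    for arg in argv:
        if arg.startswith("--"):
            continue  # launcher flag, not a config override
        key, sep, value = arg.partition("=")
        if sep:
            overrides[key] = value
    return overrides

print(parse_overrides(["device_batch_size=16", "--help", "lr=3e-4"]))
# {'device_batch_size': '16', 'lr': '3e-4'}
```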
Lawrence R Kincheloe III
6a1bfa919f
Merge pull request #8 from karpathy/master
updating repo
2025-11-20 20:57:15 -06:00
Andrej
4a87a0d19f
Merge pull request #299 from samjabrahams/rotary_embedding_head_dim_comment_cleanup
Fix comment: rotary embeddings final dimension size
2025-11-17 13:29:21 -08:00
Sam Abrahams
11e68bf442 Fix comment: rotary embeddings final dimension size 2025-11-17 11:32:56 -05:00
Andrej Karpathy
bc1fca39f3 mqa -> gqa to reduce confusion 2025-11-15 15:43:37 +00:00
Andrej
f66a780f68
Fix torch.dtype mismatch when running the engine inline test. 2025-11-14 07:28:29 -08:00
Andrej
4763ce612a
Small fixes to typos 2025-11-14 07:25:59 -08:00
Sofie Van Landeghem
c6f5bd67db
revert change of base to sft for quick inline test 2025-11-14 12:20:03 +01:00
svlandeg
a2fb3c83a6 fix typos 2025-11-14 11:20:25 +01:00
svlandeg
e5efb4b471 add test_engine.py to file structure 2025-11-14 11:13:42 +01:00
Andrej Karpathy
9a71d13688 typo oops 2025-11-13 16:08:30 +00:00
Andrej Karpathy
7b7fd0fe71 thank you Sophie for your help with nanochat 2025-11-13 16:07:54 +00:00
Andrej Karpathy
c6abcdfe3a big change: add pretraining resumption logic so that checkpoints can now be approximately resumed and training can continue. This is useful for very long runs where you don't want the anxiety of your run crashing for some reason; alternatively, it's a way to recover training in the event of loss spikes. This should have been there in v0, but it's ok. The resumption is approximate to control complexity and bloat, though it's possible we'll want to change that in the future. To use it, set --save_every to a step interval at which to write checkpoints, then pass --resume_from_step to resume optimization from a given step. Only base model training (pretraining) supports this at the moment, but that's ok because midtraining is comparatively quite a bit faster. 2025-11-13 15:34:40 +00:00
Andrej Karpathy
91f09ccd0d minor fix comment in engine 2025-11-13 15:28:18 +00:00
Andrej Karpathy
adb5d4a16c uv lock has to change since we removed numpy in the other commit 2025-11-13 15:16:27 +00:00
howardgao@outlook.com
b399e43168 fix engine test bug 2025-11-06 08:56:45 +08:00