updating nproc to 8

Sachin Agrawal 2025-11-03 13:06:38 +01:00
parent e42ac0f428
commit cf5e213613
2 changed files with 4 additions and 6 deletions

@@ -95,25 +95,23 @@ And a bit more about computing environments that will run nanochat:
### Adjusting for different GPU counts
When working with a different number of GPUs (fewer or more), you need to adjust the `NPROC_PER_NODE` variable in the training scripts. This variable controls the number of processes spawned for distributed training (one per GPU). For example, in [speedrun.sh](speedrun.sh):
When working with a different number of GPUs (fewer or more), you need to adjust the `NPROC_PER_NODE` variable in speedrun.sh. This variable controls the world size, i.e. the number of processes spawned for distributed training (one per GPU). The default value is 8.
```bash
# Set this to match your number of GPUs
NPROC_PER_NODE=4 # change to 2 for 2 GPUs, 8 for 8 GPUs, etc.
```
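If you prefer not to hard-code the value, a minimal sketch along these lines can derive it at runtime (this helper is not part of the repository; it assumes NVIDIA GPUs with `nvidia-smi` on the PATH):
```bash
# Hypothetical helper: count the visible NVIDIA GPUs and use that count
# as the number of processes to spawn (one per GPU).
NPROC_PER_NODE=$(nvidia-smi --list-gpus | wc -l)
echo "Launching with $NPROC_PER_NODE processes"
```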
Or when running `torchrun` directly:
If running `torchrun` directly:
```bash
# For 4 GPUs:
torchrun --standalone --nproc_per_node=4 -m scripts.base_train
# For 2 GPUs:
torchrun --standalone --nproc_per_node=2 -m scripts.base_train
# For 8 GPUs:
torchrun --standalone --nproc_per_node=8 -m scripts.base_train
```
**Important**: The total batch size must be divisible by the number of GPUs. The training scripts calculate the effective batch size as `device_batch_size × number_of_gpus`. If you change the GPU count and encounter batch size errors, you may need to adjust `--device_batch_size` to ensure divisibility. For example, if using a total batch size configuration that expects 8 GPUs but you only have 4, you might need to double the `device_batch_size` to maintain the same effective total batch size (assuming you have enough VRAM).
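To make the arithmetic concrete, here is an illustrative sketch (the batch size values are hypothetical, not the repository defaults, and the exact flag placement may differ):
```bash
# Hypothetical numbers: suppose the 8-GPU default runs with device_batch_size=16,
# for an effective batch size of 16 × 8 = 128 sequences per step.
# On 4 GPUs, doubling device_batch_size keeps the effective batch size at 128
# (at the cost of roughly twice the VRAM per GPU):
torchrun --standalone --nproc_per_node=4 -m scripts.base_train -- --device_batch_size=32
```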
## Running on CPU / MPS

@@ -16,7 +16,7 @@ export NANOCHAT_BASE_DIR="$HOME/.cache/nanochat"
mkdir -p $NANOCHAT_BASE_DIR
# Number of processes per node for distributed training
NPROC_PER_NODE=4
NPROC_PER_NODE=8
# -----------------------------------------------------------------------------
# Python venv setup with uv