Introduces automatic memory detection and batch size optimization for Apple Silicon Macs in runcpu.sh and runmac_overnight.sh scripts. Adds a comprehensive README_MACOS.md with usage instructions, performance profiles, environment variable overrides, troubleshooting, and expected training times. Updates scripts to allow manual overrides and improve usability for various Mac configurations. Also switched python to arm64 for 2-3x improvement
5.2 KiB
macOS / MPS Training Guide
This guide explains how to train nanochat on Apple Silicon Macs with automatic memory optimization.
Memory-Optimized Scripts
All scripts now auto-detect your system memory and optimize batch sizes accordingly:
Performance Profiles
| Memory | device_batch_size | total_batch_size | Speed Boost | Recommended For |
|---|---|---|---|---|
| 128GB+ | 16 | 4096 | 16× | M3 Max/Ultra, Mac Studio Ultra |
| 64GB | 8 | 2048 | 8× | M2/M3 Max, Mac Studio Max |
| 32GB | 4 | 1024 | 4× | M2/M3 Pro, MacBook Pro |
| <32GB | 1 | 512 | 1× | Base M1/M2/M3 |
Quick Start Scripts
1. runcpu.sh - Quick Test (30 minutes)
Fast validation that everything works:
bash dev/runcpu.sh
What it does:
- Trains depth=4 model (37M params)
- 50 base iterations + 100 mid + 100 SFT
- Good for testing, not production quality
Your 128GB Mac: ~15-30 minutes (16× faster!)
2. runmac_overnight.sh - Production Quality (2-8 hours)
Full training for better results:
bash dev/runmac_overnight.sh
What it does:
- Trains depth=6 model (82M params)
- 500 base iterations + 150 mid + 150 SFT
- Downloads 50 data shards
- Production-quality chatbot
Your 128GB Mac: ~2-3 hours (vs 8-12 hours at batch_size=1)
Manual Configuration
Override memory detection:
# Pretend you have 64GB (more conservative)
MEMORY_SIZE=64 bash dev/runcpu.sh
# Set specific batch sizes
DEVICE_BATCH_SIZE=8 TOTAL_BATCH_SIZE=2048 bash dev/runmac_overnight.sh
# Combine overrides
DEPTH=8 MEMORY_SIZE=128 BASE_ITERATIONS=1000 bash dev/runmac_overnight.sh
Environment Variables
All scripts support these overrides:
| Variable | Default | Description |
|---|---|---|
MEMORY_SIZE |
auto-detect | System memory in GB |
DEVICE_BATCH_SIZE |
auto-calc | Sequences per device |
TOTAL_BATCH_SIZE |
auto-calc | Total batch size in tokens |
EVAL_TOKENS |
auto-calc | Tokens for evaluation |
SPLIT_TOKENS |
auto-calc | Tokens for loss eval |
DEPTH |
6 (overnight), 4 (cpu) | Model depth (layers) |
BASE_ITERATIONS |
500 (overnight), 50 (cpu) | Base training steps |
MID_ITERATIONS |
150 (overnight), 100 (cpu) | Midtraining steps |
SFT_ITERATIONS |
150 (overnight), 100 (cpu) | SFT steps |
DATA_SHARDS |
50 (overnight), 4 (cpu) | Training data shards |
Expected Training Times (128GB Mac)
Quick Test (runcpu.sh)
- Data download: 1-2 min
- Tokenizer: 1-2 min
- Base training (50 iter): 3-5 min
- Midtraining (100 iter): 6-10 min
- SFT (100 iter): 6-10 min
- Total: 15-30 minutes
Overnight (runmac_overnight.sh)
- Data download: 5-10 min
- Tokenizer: 1-2 min
- Base training (500 iter): 40-60 min
- Midtraining (150 iter): 20-30 min
- SFT (150 iter): 20-30 min
- Total: 2-3 hours
Model Quality Expectations
After runcpu.sh (quick)
- Forms basic sentences
- Limited coherence
- Frequent hallucinations
- Good for testing setup
After runmac_overnight.sh (production)
- Complete sentences
- Better coherence
- Follows conversation structure
- Still makes mistakes (it's small!)
- Good for demos/learning
For GPT-2 Quality
Would need depth=20-32, billions of tokens, and 8×H100 GPUs ($800-1000)
Memory Usage Tips
Monitor memory:
# Real-time memory usage
sudo powermetrics --samplers smc -i 5000 -n 1 | grep -i memory
# Or use Activity Monitor
open -a "Activity Monitor"
If you get OOM errors:
# Reduce batch size manually
DEVICE_BATCH_SIZE=4 bash dev/runmac_overnight.sh
# Or reduce model size
DEPTH=4 bash dev/runmac_overnight.sh
Optimal setup for your 128GB Mac:
# Maximum performance (recommended)
bash dev/runmac_overnight.sh
# Or go even bigger if you want
DEPTH=8 BASE_ITERATIONS=1000 bash dev/runmac_overnight.sh
Troubleshooting
Script fails with memory errors:
- Reduce
MEMORY_SIZE=64orDEVICE_BATCH_SIZE=8 - Reduce
DEPTH=4
Training is slow:
- Check memory profile is correct:
sysctl hw.memsize - Ensure MPS is being used: Check logs for "Autodetected device type: mps"
- Close other applications
Chat responses are still poor:
- Increase iterations:
BASE_ITERATIONS=1000 MID_ITERATIONS=300 SFT_ITERATIONS=300 - Download more data:
DATA_SHARDS=100 - Increase model size:
DEPTH=8(warning: needs more memory)
Running in Background
Screen (recommended):
screen -S nanochat bash dev/runmac_overnight.sh
# Detach: Ctrl+A, D
# Reattach: screen -r nanochat
nohup:
nohup bash dev/runmac_overnight.sh > training.log 2>&1 &
tail -f training.log
After Training
Chat via CLI:
python -m scripts.chat_cli -i sft
Chat via Web UI:
python -m scripts.chat_web -i sft
# Visit http://localhost:8000
Check your report:
cat report_overnight.md
# or
cat ~/.cache/nanochat/report/report.md
Notes
- All MPS compatibility fixes are applied automatically
- torch.compile is disabled on MPS (not supported yet)
- BFloat16 is replaced with float32 on MPS
- Pinned memory optimizations disabled on MPS
- Training is slower than CUDA but much faster than CPU
Enjoy your locally-trained LLM! 🚀