# macOS / MPS Training Guide

This guide explains how to train nanochat on Apple Silicon Macs with automatic memory optimization.

## Memory-Optimized Scripts

All scripts auto-detect your system memory and choose batch sizes accordingly:

### Performance Profiles

| Memory | `device_batch_size` | `total_batch_size` | Speedup vs. batch size 1 | Recommended For |
|--------|---------------------|--------------------|--------------------------|-----------------|
| **128GB+** | 16 | 4096 | 16× | M3 Max/Ultra, Mac Studio Ultra |
| **64GB** | 8 | 2048 | 8× | M2/M3 Max, Mac Studio Max |
| **32GB** | 4 | 1024 | 4× | M2/M3 Pro, MacBook Pro |
| **<32GB** | 1 | 512 | 1× | Base M1/M2/M3 |
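Under the hood, the profile selection is just a read of physical memory followed by a threshold lookup against the table above. Here is a minimal sketch of that logic in shell, assuming the `MEMORY_SIZE`, `DEVICE_BATCH_SIZE`, and `TOTAL_BATCH_SIZE` variables documented under "Environment Variables" below (the actual scripts in `dev/` are authoritative and may differ in detail):

```bash
# Illustrative sketch only -- see the dev/ scripts for the real logic.
mem_bytes=$(sysctl -n hw.memsize)                         # physical RAM in bytes (macOS)
MEMORY_SIZE=${MEMORY_SIZE:-$((mem_bytes / 1073741824))}   # convert to GB unless overridden

if   [ "$MEMORY_SIZE" -ge 128 ]; then DEVICE_BATCH_SIZE=16; TOTAL_BATCH_SIZE=4096
elif [ "$MEMORY_SIZE" -ge 64 ];  then DEVICE_BATCH_SIZE=8;  TOTAL_BATCH_SIZE=2048
elif [ "$MEMORY_SIZE" -ge 32 ];  then DEVICE_BATCH_SIZE=4;  TOTAL_BATCH_SIZE=1024
else                                  DEVICE_BATCH_SIZE=1;  TOTAL_BATCH_SIZE=512
fi

echo "Memory: ${MEMORY_SIZE}GB -> device_batch_size=${DEVICE_BATCH_SIZE}, total_batch_size=${TOTAL_BATCH_SIZE}"
```

Setting `MEMORY_SIZE` manually short-circuits the detection, which is exactly what the overrides in "Manual Configuration" below rely on.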
## Quick Start Scripts

### 1. `runcpu.sh` - Quick Test (30 minutes)

Fast validation that everything works:

```bash
bash dev/runcpu.sh
```

**What it does:**
- Trains a depth=4 model (37M params)
- 50 base iterations + 100 mid + 100 SFT
- Good for testing, not production quality

**On a 128GB Mac:** ~15-30 minutes (16× faster than at batch_size=1)

### 2. `runmac_overnight.sh` - Production Quality (2-8 hours)

Full training for better results:

```bash
bash dev/runmac_overnight.sh
```

**What it does:**
- Trains a depth=6 model (82M params)
- 500 base iterations + 150 mid + 150 SFT
- Downloads 50 data shards
- Produces a production-quality chatbot

**On a 128GB Mac:** ~2-3 hours (vs. 8-12 hours at batch_size=1)

## Manual Configuration

Override memory detection:

```bash
# Pretend you have 64GB (more conservative)
MEMORY_SIZE=64 bash dev/runcpu.sh

# Set specific batch sizes
DEVICE_BATCH_SIZE=8 TOTAL_BATCH_SIZE=2048 bash dev/runmac_overnight.sh

# Combine overrides
DEPTH=8 MEMORY_SIZE=128 BASE_ITERATIONS=1000 bash dev/runmac_overnight.sh
```

## Environment Variables

All scripts support these overrides:

| Variable | Default | Description |
|----------|---------|-------------|
| `MEMORY_SIZE` | auto-detect | System memory in GB |
| `DEVICE_BATCH_SIZE` | auto-calc | Sequences per device |
| `TOTAL_BATCH_SIZE` | auto-calc | Total batch size in tokens |
| `EVAL_TOKENS` | auto-calc | Tokens for evaluation |
| `SPLIT_TOKENS` | auto-calc | Tokens for loss eval |
| `DEPTH` | 6 (overnight), 4 (cpu) | Model depth (layers) |
| `BASE_ITERATIONS` | 500 (overnight), 50 (cpu) | Base training steps |
| `MID_ITERATIONS` | 150 (overnight), 100 (cpu) | Midtraining steps |
| `SFT_ITERATIONS` | 150 (overnight), 100 (cpu) | SFT steps |
| `DATA_SHARDS` | 50 (overnight), 4 (cpu) | Training data shards |

## Expected Training Times (128GB Mac)

### Quick Test (`runcpu.sh`)
- Data download: 1-2 min
- Tokenizer: 1-2 min
- Base training (50 iter): 3-5 min
- Midtraining (100 iter): 6-10 min
- SFT (100 iter): 6-10 min
- **Total: 15-30 minutes**

### Overnight (`runmac_overnight.sh`)
- Data download: 5-10 min
- Tokenizer: 1-2 min
- Base training (500 iter): 40-60 min
- Midtraining (150 iter): 20-30 min
- SFT (150 iter): 20-30 min
- **Total: 2-3 hours**

## Model Quality Expectations

### After `runcpu.sh` (quick)
- Forms basic sentences
- Limited coherence
- Frequent hallucinations
- Good for testing your setup

### After `runmac_overnight.sh` (production)
- Complete sentences
- Better coherence
- Follows conversation structure
- Still makes mistakes (it's small!)
- Good for demos and learning

### For GPT-2 Quality

You would need depth=20-32, billions of training tokens, and 8×H100 GPUs (roughly $800-1,000 of GPU time).

## Memory Usage Tips

**Monitor memory:**

```bash
# Real-time memory usage
sudo powermetrics --samplers smc -i 5000 -n 1 | grep -i memory

# Or use Activity Monitor
open -a "Activity Monitor"
```

**If you get OOM errors:**

```bash
# Reduce batch size manually
DEVICE_BATCH_SIZE=4 bash dev/runmac_overnight.sh

# Or reduce model size
DEPTH=4 bash dev/runmac_overnight.sh
```

**Optimal setup for a 128GB Mac:**

```bash
# Maximum performance (recommended)
bash dev/runmac_overnight.sh

# Or go even bigger if you want
DEPTH=8 BASE_ITERATIONS=1000 bash dev/runmac_overnight.sh
```

## Troubleshooting

**Script fails with memory errors:**
- Drop to a smaller profile with `MEMORY_SIZE=64` or set `DEVICE_BATCH_SIZE=8`
- Reduce the model size with `DEPTH=4`

**Training is slow:**
- Check that the detected memory profile is correct: `sysctl hw.memsize` reports physical memory in bytes
- Ensure MPS is being used: check the logs for "Autodetected device type: mps"
- Close other applications

**Chat responses are still poor:**
- Increase iterations: `BASE_ITERATIONS=1000 MID_ITERATIONS=300 SFT_ITERATIONS=300`
- Download more data: `DATA_SHARDS=100`
- Increase model size: `DEPTH=8` (warning: needs more memory)

## Running in Background

**screen (recommended):**

```bash
screen -S nanochat bash dev/runmac_overnight.sh
# Detach: Ctrl+A, D
# Reattach: screen -r nanochat
```

**nohup:**

```bash
nohup bash dev/runmac_overnight.sh > training.log 2>&1 &
tail -f training.log
```

## After Training

**Chat via CLI:**

```bash
python -m scripts.chat_cli -i sft
```

**Chat via Web UI:**

```bash
python -m scripts.chat_web -i sft
# Visit http://localhost:8000
```

**Check your report:**

```bash
cat report_overnight.md
# or
cat ~/.cache/nanochat/report/report.md
```

## Notes

- All MPS compatibility fixes are applied automatically
- torch.compile is disabled on MPS (not yet supported there)
- BFloat16 is replaced with float32 on MPS
- Pinned-memory optimizations are disabled on MPS
- Training is slower than on CUDA but much faster than on CPU
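Before committing to a multi-hour run, it can be worth confirming that your PyTorch build actually exposes the MPS backend. A quick sanity check using PyTorch's standard `torch.backends.mps` API, run from the same environment the training scripts use:

```bash
# Prints True on Apple Silicon with an MPS-enabled PyTorch build
python -c "import torch; print('MPS available:', torch.backends.mps.is_available())"
```

Enjoy your locally-trained LLM! 🚀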