Add macOS memory-optimized training and documentation

Introduces automatic memory detection and batch size optimization for Apple Silicon Macs in the runcpu.sh and runmac_overnight.sh scripts. Adds a comprehensive README_MACOS.md with usage instructions, performance profiles, environment variable overrides, troubleshooting, and expected training times. Updates scripts to allow manual overrides and improve usability for various Mac configurations. Also switches Python to the arm64 build for a 2-3x improvement.
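
The tier selection described above boils down to a threshold lookup on the detected memory size. A minimal sketch, assuming a hypothetical helper `pick_batch_profile` (not part of either script) that prints the device batch size and total batch size for a memory size in GB, using the thresholds from this commit and `max_seq_len=1024`:

```shell
#!/bin/sh
# Hypothetical helper, for illustration only: map memory (GB) to
# "device_batch_size total_batch_size", mirroring the tiers in the scripts.
pick_batch_profile() {
    mem_gb=$1
    if [ "$mem_gb" -ge 128 ]; then
        dbs=16
    elif [ "$mem_gb" -ge 64 ]; then
        dbs=8
    elif [ "$mem_gb" -ge 32 ]; then
        dbs=4
    else
        dbs=1
    fi
    # total_batch_size is one micro-batch: device_batch_size * max_seq_len (1024)
    echo "$dbs $((dbs * 1024))"
}

pick_batch_profile 128   # high-memory tier
pick_batch_profile 48    # between thresholds, so it falls into the 32GB tier
```

Any memory size between two thresholds is rounded down to the lower tier, which keeps the defaults conservative on unified-memory machines.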
2025-12-06 12:22:18 +00:00 · 2025-10-22 07:35:26 +01:00 · 2025-10-22 07:35:26 +01:00 · 1225ddf00e
commit 1225ddf00e
parent 5a3d8b6b5e
3 changed files with 330 additions and 31 deletions
--- a/dev/README_MACOS.md
+++ b/dev/README_MACOS.md
@@ -0,0 +1,202 @@
+# macOS / MPS Training Guide
+
+This guide explains how to train nanochat on Apple Silicon Macs with automatic memory optimization.
+
+## Memory-Optimized Scripts
+
+All scripts now auto-detect your system memory and optimize batch sizes accordingly:
+
+### Performance Profiles
+
+| Memory | device_batch_size | total_batch_size (tokens) | Tokens per step (relative) | Recommended For |
+|--------|-------------------|---------------------------|----------------------------|-----------------|
+| **128GB+** | 16 | 16384 | 16× | M3 Max/Ultra, Mac Studio Ultra |
+| **64GB** | 8 | 8192 | 8× | M2/M3 Max, Mac Studio Max |
+| **32GB** | 4 | 4096 | 4× | M2/M3 Pro, MacBook Pro |
+| **<32GB** | 1 | 1024 | 1× | Base M1/M2/M3 |
+
+## Quick Start Scripts
+
+### 1. `runcpu.sh` - Quick Test (30 minutes)
+Fast validation that everything works:
+```bash
+bash dev/runcpu.sh
+```
+
+**What it does:**
+- Trains depth=4 model (37M params)
+- 50 base iterations + 100 mid + 100 SFT
+- Good for testing, not production quality
+
+**On a 128GB Mac:** ~15-30 minutes, processing 16× more tokens per step than at batch size 1
+
+### 2. `runmac_overnight.sh` - Production Quality (2-8 hours)
+Full training for better results:
+```bash
+bash dev/runmac_overnight.sh
+```
+
+**What it does:**
+- Trains depth=6 model (82M params)
+- 500 base iterations + 150 mid + 150 SFT
+- Downloads 50 data shards
+- Production-quality chatbot
+
+**On a 128GB Mac:** ~2-3 hours (vs 8-12 hours at batch_size=1)
+
+## Manual Configuration
+
+Override memory detection:
+```bash
+# Pretend you have 64GB (more conservative)
+MEMORY_SIZE=64 bash dev/runcpu.sh
+
+# Set specific batch sizes (total must be a multiple of device_batch_size * 1024)
+DEVICE_BATCH_SIZE=8 TOTAL_BATCH_SIZE=8192 bash dev/runmac_overnight.sh
+
+# Combine overrides
+DEPTH=8 MEMORY_SIZE=128 BASE_ITERATIONS=1000 bash dev/runmac_overnight.sh
+```
+
+## Environment Variables
+
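+Overrides are read from the environment. `DEPTH`, the `*_ITERATIONS` variables, and `DATA_SHARDS` apply only to `runmac_overnight.sh`; `runcpu.sh` hard-codes those values: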
+
+| Variable | Default | Description |
+|----------|---------|-------------|
+| `MEMORY_SIZE` | auto-detect | System memory in GB |
+| `DEVICE_BATCH_SIZE` | auto-calc | Sequences per device |
+| `TOTAL_BATCH_SIZE` | auto-calc | Total batch size in tokens |
+| `EVAL_TOKENS` | auto-calc | Tokens for evaluation |
+| `SPLIT_TOKENS` | auto-calc | Tokens for loss eval |
+| `DEPTH` | 6 (overnight), 4 (cpu) | Model depth (layers) |
+| `BASE_ITERATIONS` | 500 (overnight), 50 (cpu) | Base training steps |
+| `MID_ITERATIONS` | 150 (overnight), 100 (cpu) | Midtraining steps |
+| `SFT_ITERATIONS` | 150 (overnight), 100 (cpu) | SFT steps |
+| `DATA_SHARDS` | 50 (overnight), 4 (cpu) | Training data shards |
+
+## Expected Training Times (128GB Mac)
+
+### Quick Test (`runcpu.sh`)
+- Data download: 1-2 min
+- Tokenizer: 1-2 min
+- Base training (50 iter): 3-5 min
+- Midtraining (100 iter): 6-10 min
+- SFT (100 iter): 6-10 min
+- **Total: 15-30 minutes**
+
+### Overnight (`runmac_overnight.sh`)
+- Data download: 5-10 min
+- Tokenizer: 1-2 min
+- Base training (500 iter): 40-60 min
+- Midtraining (150 iter): 20-30 min
+- SFT (150 iter): 20-30 min
+- **Total: 2-3 hours**
+
+## Model Quality Expectations
+
+### After `runcpu.sh` (quick)
+- Forms basic sentences
+- Limited coherence
+- Frequent hallucinations
+- Good for testing setup
+
+### After `runmac_overnight.sh` (production)
+- Complete sentences
+- Better coherence
+- Follows conversation structure
+- Still makes mistakes (it's small!)
+- Good for demos/learning
+
+### For GPT-2 Quality
+Reaching GPT-2 quality would need depth=20-32, billions of tokens, and 8×H100 GPUs (roughly $800-1000).
+
+## Memory Usage Tips
+
+**Monitor memory:**
+```bash
+# One-shot snapshot of physical memory usage
+top -l 1 | grep PhysMem
+
+# Or use Activity Monitor
+open -a "Activity Monitor"
+```
+
+**If you get OOM errors:**
+```bash
+# Reduce batch size manually
+DEVICE_BATCH_SIZE=4 bash dev/runmac_overnight.sh
+
+# Or reduce model size
+DEPTH=4 bash dev/runmac_overnight.sh
+```
+
+**Suggested setup for a 128GB Mac:**
+```bash
+# Maximum performance (recommended)
+bash dev/runmac_overnight.sh
+
+# Or go even bigger if you want
+DEPTH=8 BASE_ITERATIONS=1000 bash dev/runmac_overnight.sh
+```
+
+## Troubleshooting
+
+**Script fails with memory errors:**
+- Select a smaller profile or batch size, e.g. `MEMORY_SIZE=64` or `DEVICE_BATCH_SIZE=8`
+- Shrink the model, e.g. `DEPTH=4`
+
+**Training is slow:**
+- Check memory profile is correct: `sysctl hw.memsize`
+- Ensure MPS is being used: Check logs for "Autodetected device type: mps"
+- Close other applications
+
+**Chat responses are still poor:**
+- Increase iterations: `BASE_ITERATIONS=1000 MID_ITERATIONS=300 SFT_ITERATIONS=300`
+- Download more data: `DATA_SHARDS=100`
+- Increase model size: `DEPTH=8` (warning: needs more memory)
+
+## Running in Background
+
+**Screen (recommended):**
+```bash
+screen -S nanochat bash dev/runmac_overnight.sh
+# Detach: Ctrl+A, D
+# Reattach: screen -r nanochat
+```
+
+**nohup:**
+```bash
+nohup bash dev/runmac_overnight.sh > training.log 2>&1 &
+tail -f training.log
+```
+
+## After Training
+
+**Chat via CLI:**
+```bash
+python -m scripts.chat_cli -i sft
+```
+
+**Chat via Web UI:**
+```bash
+python -m scripts.chat_web -i sft
+# Visit http://localhost:8000
+```
+
+**Check your report:**
+```bash
+cat report_overnight.md
+# or
+cat ~/.cache/nanochat/report/report.md
+```
+
+## Notes
+
+- All MPS compatibility fixes are applied automatically
+- torch.compile is disabled on MPS (not supported yet)
+- BFloat16 is replaced with float32 on MPS
+- Pinned memory optimizations disabled on MPS
+- Training is slower than CUDA but much faster than CPU
+
+Enjoy your locally-trained LLM! 🚀
--- a/dev/runcpu.sh
+++ b/dev/runcpu.sh
@@ -2,7 +2,7 @@

 # Showing an example run for exercising some of the code paths on the CPU (or MPS on Macbooks)
 # Run as:
-# bash dev/cpu_demo_run.sh
+# bash dev/runcpu.sh

 # NOTE: Training LLMs requires GPU compute and $$$. You will not get far on your Macbook.
 # Think of this run as educational/fun demo, not something you should expect to work well.
@@ -12,6 +12,51 @@
 export OMP_NUM_THREADS=1
 NANOCHAT_BASE_DIR="$HOME/.cache/nanochat"
 mkdir -p $NANOCHAT_BASE_DIR
+
+# Memory-based configuration for macOS
+# Detect system memory (in GB) or allow manual override
+if [ -z "$MEMORY_SIZE" ]; then
+    if [[ "$OSTYPE" == "darwin"* ]]; then
+        MEMORY_SIZE=$(sysctl hw.memsize | awk '{print int($2/1024/1024/1024)}')
+        echo "Auto-detected macOS memory: ${MEMORY_SIZE}GB"
+    else
+        # Linux fallback - assume conservative
+        MEMORY_SIZE=16
+        echo "Non-macOS system, using conservative: ${MEMORY_SIZE}GB"
+    fi
+fi
+
+# Pick default batch sizes from the memory tier; values already set in the environment take precedence
+# Note: total_batch_size must be divisible by (device_batch_size * max_seq_len)
+# With max_seq_len=1024: device_batch_size * 1024 must divide total_batch_size
+if [ "$MEMORY_SIZE" -ge 128 ]; then
+    DEVICE_BATCH_SIZE=${DEVICE_BATCH_SIZE:-16}
+    TOTAL_BATCH_SIZE=${TOTAL_BATCH_SIZE:-16384}    # 16 * 1024 = 16384
+    EVAL_TOKENS=${EVAL_TOKENS:-16384}
+    SPLIT_TOKENS=${SPLIT_TOKENS:-16384}
+    echo "Memory profile: 128GB+ (High performance)"
+elif [ "$MEMORY_SIZE" -ge 64 ]; then
+    DEVICE_BATCH_SIZE=${DEVICE_BATCH_SIZE:-8}
+    TOTAL_BATCH_SIZE=${TOTAL_BATCH_SIZE:-8192}     # 8 * 1024 = 8192
+    EVAL_TOKENS=${EVAL_TOKENS:-8192}
+    SPLIT_TOKENS=${SPLIT_TOKENS:-8192}
+    echo "Memory profile: 64GB (Good performance)"
+elif [ "$MEMORY_SIZE" -ge 32 ]; then
+    DEVICE_BATCH_SIZE=${DEVICE_BATCH_SIZE:-4}
+    TOTAL_BATCH_SIZE=${TOTAL_BATCH_SIZE:-4096}     # 4 * 1024 = 4096
+    EVAL_TOKENS=${EVAL_TOKENS:-4096}
+    SPLIT_TOKENS=${SPLIT_TOKENS:-4096}
+    echo "Memory profile: 32GB (Moderate performance)"
+else
+    DEVICE_BATCH_SIZE=${DEVICE_BATCH_SIZE:-1}
+    TOTAL_BATCH_SIZE=${TOTAL_BATCH_SIZE:-1024}     # 1 * 1024 = 1024
+    EVAL_TOKENS=${EVAL_TOKENS:-2048}
+    SPLIT_TOKENS=${SPLIT_TOKENS:-2048}
+    echo "Memory profile: <32GB (Conservative)"
+fi
+
+echo "Using: device_batch_size=$DEVICE_BATCH_SIZE, total_batch_size=$TOTAL_BATCH_SIZE"
+echo ""
 command -v uv &> /dev/null || curl -LsSf https://astral.sh/uv/install.sh | sh
 [ -d ".venv" ] || uv venv
 uv sync
@@ -38,39 +83,39 @@ python -m nanochat.dataset -n 4
 python -m scripts.tok_train --max_chars=1000000000
 python -m scripts.tok_eval

-# train a very small 4 layer model on the CPU
-# each optimization step processes a single sequence of 1024 tokens
+# train a very small 4 layer model on the CPU/MPS
+# batch sizes are now optimized based on available memory
 # we only run 50 steps of optimization (bump this to get better results)
 python -m scripts.base_train \
    --depth=4 \
    --max_seq_len=1024 \
-    --device_batch_size=1 \
-    --total_batch_size=1024 \
+    --device_batch_size=$DEVICE_BATCH_SIZE \
+    --total_batch_size=$TOTAL_BATCH_SIZE \
    --eval_every=50 \
-    --eval_tokens=4096 \
+    --eval_tokens=$EVAL_TOKENS \
    --core_metric_every=50 \
    --core_metric_max_per_task=12 \
    --sample_every=50 \
    --num_iterations=50
-python -m scripts.base_loss --device_batch_size=1 --split_tokens=4096
+python -m scripts.base_loss --device_batch_size=$DEVICE_BATCH_SIZE --split_tokens=$SPLIT_TOKENS
 python -m scripts.base_eval --max-per-task=5

 # midtraining
 python -m scripts.mid_train \
    --max_seq_len=1024 \
-    --device_batch_size=1 \
+    --device_batch_size=$DEVICE_BATCH_SIZE \
    --eval_every=50 \
-    --eval_tokens=4096 \
-    --total_batch_size=1024 \
+    --eval_tokens=$EVAL_TOKENS \
+    --total_batch_size=$TOTAL_BATCH_SIZE \
    --num_iterations=100
 # eval results will be terrible, this is just to execute the code paths.
 # note that we lower the execution memory limit to 1MB to avoid warnings on smaller systems
-python -m scripts.chat_eval --source=mid --max-new-tokens=128 --max-problems=20
+python -m scripts.chat_eval -i mid --max-new-tokens=128 --max-problems=20

 # SFT
 python -m scripts.chat_sft \
-    --device_batch_size=1 \
-    --target_examples_per_step=4 \
+    --device_batch_size=$DEVICE_BATCH_SIZE \
+    --target_examples_per_step=$((DEVICE_BATCH_SIZE * 2)) \
    --num_iterations=100 \
    --eval_steps=4 \
    --eval_metrics_max_problems=16
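
The non-macOS branch of `runcpu.sh` falls back to a fixed 16GB. If real detection were wanted there, one option is to read `MemTotal` from `/proc/meminfo`. A minimal sketch, assuming hypothetical helpers `bytes_to_gb`, `meminfo_kb_to_gb`, and `detect_memory_gb` that are not part of the script; the byte conversion mirrors the awk expression used for `sysctl hw.memsize`:

```shell
#!/bin/sh
# Hypothetical helpers, for illustration only.

# Convert a byte count (as printed by `sysctl -n hw.memsize`) to whole GB.
bytes_to_gb() {
    echo "$1" | awk '{print int($1/1024/1024/1024)}'
}

# Read /proc/meminfo-style text on stdin and print MemTotal in whole GB.
# MemTotal is reported in kB.
meminfo_kb_to_gb() {
    awk '/^MemTotal:/ {print int($2/1024/1024)}'
}

detect_memory_gb() {
    case "$(uname -s)" in
        Darwin) bytes_to_gb "$(sysctl -n hw.memsize)" ;;
        Linux)  meminfo_kb_to_gb < /proc/meminfo ;;
        *)      echo 16 ;;   # same conservative default as the script
    esac
}

echo "Detected memory: $(detect_memory_gb)GB"
```

One caveat with this approach: Linux reports `MemTotal` slightly below the installed RAM, so a nominal 32GB machine may come out as 31GB and land in the lower tier. Setting `MEMORY_SIZE` by hand sidesteps that.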
--- a/dev/runmac_overnight.sh
+++ b/dev/runmac_overnight.sh
@@ -14,25 +14,73 @@ echo ""
 # Activate virtual environment
 source .venv/bin/activate

-# Configuration
-DEPTH=6                    # Bigger model (6 layers vs 4)
-BASE_ITERATIONS=500        # More base training
-MID_ITERATIONS=150         # More midtraining
-SFT_ITERATIONS=150         # More SFT
-DATA_SHARDS=50             # More training data
+# Memory-based configuration
+# Detect system memory (in GB) or allow manual override
+if [ -z "$MEMORY_SIZE" ]; then
+    MEMORY_SIZE=$(sysctl hw.memsize | awk '{print int($2/1024/1024/1024)}')
+    echo "Auto-detected memory: ${MEMORY_SIZE}GB"
+else
+    echo "Using specified memory: ${MEMORY_SIZE}GB"
+fi

+# Pick default batch sizes from the memory tier; values already set in the environment take precedence
+# Conservative estimates for MPS (unified memory shared with system)
+# Note: total_batch_size must be divisible by (device_batch_size * max_seq_len)
+# With max_seq_len=1024: device_batch_size * 1024 must divide total_batch_size
+if [ "$MEMORY_SIZE" -ge 128 ]; then
+    DEVICE_BATCH_SIZE=${DEVICE_BATCH_SIZE:-16}
+    TOTAL_BATCH_SIZE=${TOTAL_BATCH_SIZE:-16384}    # 16 * 1024 = 16384
+    EVAL_TOKENS=${EVAL_TOKENS:-16384}
+    SPLIT_TOKENS=${SPLIT_TOKENS:-16384}
+    echo "Memory profile: 128GB+ (High performance)"
+elif [ "$MEMORY_SIZE" -ge 64 ]; then
+    DEVICE_BATCH_SIZE=${DEVICE_BATCH_SIZE:-8}
+    TOTAL_BATCH_SIZE=${TOTAL_BATCH_SIZE:-8192}     # 8 * 1024 = 8192
+    EVAL_TOKENS=${EVAL_TOKENS:-8192}
+    SPLIT_TOKENS=${SPLIT_TOKENS:-8192}
+    echo "Memory profile: 64GB (Good performance)"
+elif [ "$MEMORY_SIZE" -ge 32 ]; then
+    DEVICE_BATCH_SIZE=${DEVICE_BATCH_SIZE:-4}
+    TOTAL_BATCH_SIZE=${TOTAL_BATCH_SIZE:-4096}     # 4 * 1024 = 4096
+    EVAL_TOKENS=${EVAL_TOKENS:-4096}
+    SPLIT_TOKENS=${SPLIT_TOKENS:-4096}
+    echo "Memory profile: 32GB (Moderate performance)"
+else
+    DEVICE_BATCH_SIZE=${DEVICE_BATCH_SIZE:-1}
+    TOTAL_BATCH_SIZE=${TOTAL_BATCH_SIZE:-1024}     # 1 * 1024 = 1024
+    EVAL_TOKENS=${EVAL_TOKENS:-2048}
+    SPLIT_TOKENS=${SPLIT_TOKENS:-2048}
+    echo "Memory profile: <32GB (Conservative)"
+fi
+
+# Allow manual overrides
+DEPTH=${DEPTH:-6}                          # Bigger model (6 layers vs 4)
+BASE_ITERATIONS=${BASE_ITERATIONS:-500}    # More base training
+MID_ITERATIONS=${MID_ITERATIONS:-150}      # More midtraining
+SFT_ITERATIONS=${SFT_ITERATIONS:-150}      # More SFT
+DATA_SHARDS=${DATA_SHARDS:-50}             # More training data
+
+echo ""
 echo "Configuration:"
-echo "  Model depth: $DEPTH (36.7M → 82M params)"
+echo "  System Memory: ${MEMORY_SIZE}GB"
+echo "  Model depth: $DEPTH (~82M params for d6)"
+echo "  Device batch size: $DEVICE_BATCH_SIZE"
+echo "  Total batch size: $TOTAL_BATCH_SIZE"
+echo "  Eval tokens: $EVAL_TOKENS"
 echo "  Base iterations: $BASE_ITERATIONS"
 echo "  Mid iterations: $MID_ITERATIONS"
 echo "  SFT iterations: $SFT_ITERATIONS"
 echo "  Data shards: $DATA_SHARDS"
 echo ""
+echo "To override, set environment variables:"
+echo "  MEMORY_SIZE=64 bash dev/runmac_overnight.sh"
+echo "  DEVICE_BATCH_SIZE=8 bash dev/runmac_overnight.sh"
+echo ""

 # Clean up old run
 echo "Cleaning up previous training..."
 rm -f report.md
-python -m scripts.report --reset
+python -m nanochat.report reset

 # Download training data
 echo ""
@@ -57,15 +105,16 @@ python -m nanochat.tokenizer
 # Base model training
 echo ""
 echo "Step 4/6: Training base model ($BASE_ITERATIONS iterations)..."
+echo "  Device batch size: $DEVICE_BATCH_SIZE, Total batch size: $TOTAL_BATCH_SIZE"
 echo "  This will take ~2-4 hours..."
 python -m scripts.base_train \
  --depth=$DEPTH \
  --max_seq_len=1024 \
-  --device_batch_size=1 \
-  --total_batch_size=1024 \
+  --device_batch_size=$DEVICE_BATCH_SIZE \
+  --total_batch_size=$TOTAL_BATCH_SIZE \
  --num_iterations=$BASE_ITERATIONS \
  --eval_every=100 \
-  --eval_tokens=8192 \
+  --eval_tokens=$EVAL_TOKENS \
  --core_metric_every=250 \
  --core_metric_max_per_task=20 \
  --sample_every=100
@@ -73,28 +122,31 @@ python -m scripts.base_train \
 # Evaluate base model
 echo ""
 echo "Evaluating base model..."
-python -m scripts.base_loss
+python -m scripts.base_loss --device_batch_size=$DEVICE_BATCH_SIZE --split_tokens=$SPLIT_TOKENS
 python -m scripts.base_eval

 # Midtraining
 echo ""
 echo "Step 5/6: Midtraining ($MID_ITERATIONS iterations)..."
+echo "  Device batch size: $DEVICE_BATCH_SIZE, Total batch size: $TOTAL_BATCH_SIZE"
 echo "  This will take ~2-3 hours..."
 python -m scripts.mid_train \
  --num_iterations=$MID_ITERATIONS \
-  --device_batch_size=1 \
+  --device_batch_size=$DEVICE_BATCH_SIZE \
  --max_seq_len=1024 \
-  --total_batch_size=1024 \
-  --eval_every=50
+  --total_batch_size=$TOTAL_BATCH_SIZE \
+  --eval_every=50 \
+  --eval_tokens=$EVAL_TOKENS

 # SFT training
 echo ""
 echo "Step 6/6: Chat fine-tuning (SFT) ($SFT_ITERATIONS iterations)..."
+echo "  Device batch size: $DEVICE_BATCH_SIZE"
 echo "  This will take ~2-3 hours..."
 python -m scripts.chat_sft \
  --num_iterations=$SFT_ITERATIONS \
-  --device_batch_size=1 \
-  --target_examples_per_step=8 \
+  --device_batch_size=$DEVICE_BATCH_SIZE \
+  --target_examples_per_step=$((DEVICE_BATCH_SIZE * 2)) \
  --eval_steps=10

 # Final evaluation
@@ -105,7 +157,7 @@ python -m scripts.chat_eval -i sft || echo "Chat eval had issues, skipping..."
 # Generate report
 echo ""
 echo "Generating final report..."
-python -m scripts.report
+python -m nanochat.report generate

 # Copy report to current directory
 cp ~/.cache/nanochat/report/report.md ./report_overnight.md
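
Both scripts note that `total_batch_size` must be divisible by `device_batch_size * max_seq_len`. Hand-picked overrides can easily violate that, so it may be worth checking the combination before starting a long run. A minimal sketch, assuming a hypothetical `check_batch_sizes` helper that is not part of either script, and assuming the quotient corresponds to the number of micro-batches accumulated per optimization step:

```shell
#!/bin/sh
# Hypothetical helper, for illustration only.
# Usage: check_batch_sizes DEVICE_BATCH_SIZE TOTAL_BATCH_SIZE [MAX_SEQ_LEN]
# Prints the number of micro-batches per step, or returns 1 if the sizes do not divide.
check_batch_sizes() {
    dbs=$1
    total=$2
    seq_len=${3:-1024}
    tokens_per_microbatch=$((dbs * seq_len))
    if [ $((total % tokens_per_microbatch)) -ne 0 ]; then
        echo "total_batch_size=$total is not a multiple of $tokens_per_microbatch" >&2
        return 1
    fi
    echo $((total / tokens_per_microbatch))
}

check_batch_sizes 4 16384                                # valid: prints 4
check_batch_sizes 8 2048 || echo "invalid combination"   # 8 * 1024 does not divide 2048
```

Lowering only `DEVICE_BATCH_SIZE` on a large profile stays valid because the tier totals are powers of two; raising it above the tier default is where the check tends to fail.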