# Auto-Discovery Testing Suite

Comprehensive tests for the auto-discovery batch size functionality in NanoChat.

## Overview

This testing suite validates the auto-discovery system across different scenarios:

- **Unit Tests**: Isolated testing of core algorithms (exponential search, binary search, caching)
- **Integration Tests**: End-to-end testing with actual training scripts
- **Stability Tests**: Long-running tests to detect memory leaks and OOM issues
- **Performance Tests**: Throughput comparisons between manual and auto-discovered batch sizes

## Quick Start

### Run All Tests

```bash
# Run unit tests only (fast, ~10 seconds)
bash tests/run_unit_tests.sh

# Run integration tests (requires GPU, 10-30 minutes)
bash tests/run_integration_tests.sh

# Run integration tests including long stability tests (1+ hours)
RUN_LONG_TESTS=1 bash tests/run_integration_tests.sh
```

### Run Individual Tests

```bash
# Unit tests
pytest tests/test_auto_batch_size.py -v

# Specific integration tests
bash tests/integration/test_single_gpu_discovery.sh
bash tests/integration/test_ddp_discovery.sh
bash tests/integration/test_throughput_comparison.sh
```

## Test Categories

### Unit Tests (`test_auto_batch_size.py`)

Tests the core discovery algorithms in isolation using mocks:

- **Test 1**: Exponential search finds upper bound (1, 2, 4, 8, 16, 32, 64)
- **Test 2**: Binary search refines to exact boundary
- **Test 3**: Safety margin application (0.85, 0.90, 0.95)
- **Test 4**: Cache hit/miss behavior
- **Test 5**: DDP broadcast simulation

**Run with:**

```bash
pytest tests/test_auto_batch_size.py -v --tb=short
```

### Integration Tests

#### Single GPU Tests

- **Test 6**: Basic discovery run (`test_single_gpu_discovery.sh`)
  - Verifies discovery completes in < 30 seconds
  - Checks for proper log messages
  - Validates no OOM errors
- **Test 7**: Manual vs Auto comparison (`test_manual_vs_auto.sh`)
  - Compares manual batch_size=8 with auto-discovery
  - Validates auto batch size ≥ manual
  - Ensures both runs complete successfully

#### Multi-GPU Tests

- **Test 8**: 2-GPU DDP discovery (`test_ddp_discovery.sh`)
  - Verifies rank 0 performs discovery
  - Checks broadcast to rank 1
  - Validates synchronization
- **Test 9**: 4-GPU DDP discovery (if available)
  - Same as Test 8 with 4 GPUs
  - Skipped if fewer than 4 GPUs available

#### Throughput Tests

- **Test 10**: Throughput comparison (`test_throughput_comparison.sh`)
  - Measures iterations/second for manual vs auto
  - Calculates speedup ratio
  - Target: ≥ 1.3x speedup (allows for discovery overhead)
  - Saves results to `tests/results/throughput_comparison.json`

#### Stability Tests

Long-running tests (1000 iterations each):

- **Test 11**: Depth=12 (`test_stability_depth12.sh`)
- **Test 12**: Depth=20 (`test_stability_depth20.sh`)
- **Test 13**: Depth=26 (`test_stability_depth26.sh`)
- **Test 14**: Depth=32 (`test_stability_depth32.sh`)
  - Verifies larger models use smaller batch sizes
  - Monitors for memory leaks
  - Ensures no OOM during long runs

**Run with:**

```bash
RUN_LONG_TESTS=1 bash tests/run_integration_tests.sh
```

#### Override Tests

- **Test 15**: Manual override (`test_overrides.sh`)
  - Verifies `--device_batch_size=16` skips auto-discovery
  - Checks for manual batch size usage message
- **Test 16**: Disable auto-discovery
  - Tests with auto-discovery disabled
  - Verifies fallback to default batch_size=8
- **Test 17**: Custom safety margin
  - Tests `--batch_size_margin=0.85` vs `0.90`
  - Verifies higher margin gives larger batch size

#### Cache Tests

- **Test 18**: Cache hit (`test_cache_mechanism.sh`)
  - First run: discovery + cache save
  - Second run: cache hit (< 5 seconds)
  - Verifies cache file creation
- **Test 19**: Cache key validation
  - Different depth → different cache key
  - Different max_seq_len → different cache key
  - Verifies multiple cache files created
- **Test 20**: Cache invalidation
  - Corrupts cache file
  - Verifies graceful fallback to re-discovery
  - Tests cache deletion and re-run

#### Failure Handling Tests

- **Test 21**: Artificial memory constraint (`test_failure_handling.sh`)
  - Tests with a very large model (depth=40)
  - Verifies fallback to defaults
  - Checks for warning messages
- **Test 22**: Mid-training override warning
  - Tests mid_train.py with a larger batch size than pretrain
  - Verifies "FOOTGUN WARNING" appears
  - Ensures training continues despite the warning

## Test Results

Results are saved to `tests/results/`:

```
tests/results/
├── test_single_gpu_discovery.log
├── test_manual_baseline.log
├── test_auto_discovery.log
├── throughput_comparison.json
├── stability_depth12.log
├── stability_depth20.log
├── cache_run1.log
├── cache_run2.log
└── ...
```

### Throughput Results Format

`tests/results/throughput_comparison.json`:

```json
{
  "timestamp": "2024-01-15T10:30:00Z",
  "depth": 12,
  "max_iterations": 100,
  "manual": {
    "batch_size": 8,
    "duration_seconds": 120,
    "throughput_iter_per_sec": 0.833
  },
  "auto": {
    "batch_size": 32,
    "duration_seconds": 60,
    "throughput_iter_per_sec": 1.667
  },
  "speedup_ratio": 2.0
}
```

## Requirements

### Unit Tests

- Python 3.8+
- PyTorch
- pytest
- No GPU required (runs on CPU)

### Integration Tests

- CUDA-capable GPU (≥ 24GB VRAM recommended)
- Multiple GPUs for DDP tests (optional)
- Environment variables:
  - `NANOCHAT_BASE_DIR`: Base directory for checkpoints/cache (optional)
  - `RUN_LONG_TESTS=1`: Enable 1000-iteration stability tests (optional)

## CI/CD Integration

For automated testing in CI:

```bash
# Quick validation (unit tests + fast integration tests)
bash tests/run_unit_tests.sh
bash tests/run_integration_tests.sh  # ~15 minutes

# Full validation (includes long tests)
RUN_LONG_TESTS=1 bash tests/run_integration_tests.sh  # 1+ hours
```

### GitHub Actions Example

```yaml
name: Auto-Discovery Tests

on: [push, pull_request]

jobs:
  test:
    runs-on: [self-hosted, gpu]
    steps:
      - uses: actions/checkout@v2
      - name: Run unit tests
        run: bash tests/run_unit_tests.sh
      - name: Run integration tests
        run: bash tests/run_integration_tests.sh
      - name: Upload results
        uses: actions/upload-artifact@v2
        with:
          name: test-results
          path: tests/results/
```

## Troubleshooting

### Common Issues

1. **"SKIP: Need at least 2 GPUs for DDP tests"**
   - Expected if you have only 1 GPU
   - DDP tests will be skipped automatically
2. **"Cache directory is empty or doesn't exist"**
   - Caching may be disabled, or the cache path may be wrong
   - Check the `NANOCHAT_BASE_DIR` environment variable
3. **"Discovery takes longer than 30 seconds"**
   - May indicate a large model or a slow GPU
   - Increase the timeout in the test script if needed
4. **"Speedup ratio below threshold"**
   - Discovery overhead may be high for short runs
   - Try longer runs (increase `MAX_ITERATIONS`)

### Debug Mode

Run tests with verbose output:

```bash
# Unit tests with full traceback
pytest tests/test_auto_batch_size.py -vv --tb=long

# Integration tests with shell tracing (bash -x)
bash -x tests/integration/test_single_gpu_discovery.sh
```

## Success Criteria

### Unit Tests

- ✓ All 5 unit tests pass
- ✓ Tests complete in < 10 seconds
- ✓ Code coverage ≥ 80% for `nanochat/auto_batch_size.py`

### Integration Tests

- ✓ Single GPU discovery completes in < 30 seconds
- ✓ No OOM errors during 1000+ iteration stability tests
- ✓ Throughput improvement ≥ 1.3x compared to manual baseline
- ✓ DDP tests show identical batch size across all ranks
- ✓ Override tests correctly skip discovery or use manual values
- ✓ Cache tests show < 5 second cache hit time vs 15-30 second discovery

### Failure Handling

- ✓ Artificial memory constraints trigger fallback to defaults
- ✓ Warning messages appear in logs for fallback scenarios
- ✓ No crashes or exceptions, only graceful degradation

## Contributing

When adding new tests:

1. Add unit tests to `tests/test_auto_batch_size.py`
2. Add integration tests as new `.sh` scripts in `tests/integration/`
3. Update `tests/run_integration_tests.sh` to include new tests
4. Update this README with test descriptions
5. Ensure tests clean up after themselves (delete temp files, clear cache)

## License

Same as the NanoChat project.
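
## Appendix: Discovery Algorithm Sketch

For readers new to the suite, the behavior that Tests 1-3 and Test 17 exercise (exponential search for an upper bound, binary search for the exact boundary, then a safety margin) can be illustrated with a minimal sketch. This is not the code in `nanochat/auto_batch_size.py`: the function name `discover_batch_size`, the probe callback `fits`, the cap of 64, and rounding down after applying the margin are all assumptions made for illustration. The fallback to default values described under Failure Handling is omitted.

```python
from typing import Callable, Optional


def discover_batch_size(
    fits: Callable[[int], bool],
    max_batch_size: int = 64,
    safety_margin: float = 0.85,
) -> int:
    """Return the largest batch size that fits, scaled by `safety_margin`.

    `fits(n)` is a hypothetical probe that returns True when one training
    step at batch size `n` completes without running out of memory.
    """
    if not fits(1):
        raise RuntimeError("batch size 1 does not fit")

    # Exponential phase: double until a size fails or the cap is reached.
    low = 1                        # largest size known to fit
    high: Optional[int] = None     # smallest size known to fail
    while low < max_batch_size:
        candidate = min(low * 2, max_batch_size)
        if fits(candidate):
            low = candidate
        else:
            high = candidate
            break

    # Binary phase: shrink the (low, high) gap until the two are adjacent.
    if high is not None:
        while high - low > 1:
            mid = (low + high) // 2
            if fits(mid):
                low = mid
            else:
                high = mid

    # Apply the safety margin (assumed to round down), never below 1.
    return max(1, int(low * safety_margin))


if __name__ == "__main__":
    # Simulated device on which anything above 23 samples would OOM.
    def fits_up_to_23(n: int) -> bool:
        return n <= 23

    print(discover_batch_size(fits_up_to_23, safety_margin=1.0))   # 23
    print(discover_batch_size(fits_up_to_23, safety_margin=0.85))  # 19
```

With the simulated limit of 23, the exponential phase probes 2, 4, 8, 16, and 32, and the binary phase narrows the range between 16 and 32 down to 23. A margin of 0.90 would return 20 instead of 19, which matches the relationship Test 17 checks: a higher margin gives a larger batch size.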