# Auto-Discovery Test Plan ## Test Coverage Matrix | Test # | Name | Type | Duration | GPU Required | Status | |--------|------|------|----------|--------------|--------| | 1 | Exponential Search Logic | Unit | < 1s | No | ✓ Implemented | | 2 | Binary Search Refinement | Unit | < 1s | No | ✓ Implemented | | 3 | Safety Margin Application | Unit | < 1s | No | ✓ Implemented | | 4 | Cache Mechanism | Unit | < 1s | No | ✓ Implemented | | 5 | DDP Broadcast Simulation | Unit | < 1s | No | ✓ Implemented | | 6 | Basic Discovery Run | Integration | 30s | 1 GPU | ✓ Implemented | | 7 | Manual vs Auto Comparison | Integration | 2-3 min | 1 GPU | ✓ Implemented | | 8 | DDP Discovery (2 GPUs) | Integration | 1-2 min | 2 GPUs | ✓ Implemented | | 9 | DDP Discovery (4 GPUs) | Integration | 1-2 min | 4 GPUs | ✓ Implemented | | 10 | Throughput Comparison | Integration | 5-10 min | 1 GPU | ✓ Implemented | | 11 | Stability (depth=12) | Integration | 10-15 min | 1 GPU | ✓ Implemented | | 12 | Stability (depth=20) | Integration | 15-20 min | 1 GPU | ✓ Implemented | | 13 | Stability (depth=26) | Integration | 20-25 min | 1 GPU | ✓ Implemented | | 14 | Stability (depth=32) | Integration | 25-30 min | 1 GPU | ✓ Implemented | | 15 | Manual Override | Integration | 1-2 min | 1 GPU | ✓ Implemented | | 16 | Disable Auto-Discovery | Integration | 1-2 min | 1 GPU | ✓ Implemented | | 17 | Custom Safety Margin | Integration | 2-3 min | 1 GPU | ✓ Implemented | | 18 | Cache Hit | Integration | 2-3 min | 1 GPU | ✓ Implemented | | 19 | Cache Key Validation | Integration | 3-4 min | 1 GPU | ✓ Implemented | | 20 | Cache Invalidation | Integration | 2-3 min | 1 GPU | ✓ Implemented | | 21 | Artificial Memory Constraint | Integration | 2-3 min | 1 GPU | ✓ Implemented | | 22 | Mid-Training Override Warning | Integration | 2-3 min | 1 GPU | ✓ Implemented | ## Test Execution Time Estimates ### Fast Suite (Unit Tests Only) - **Duration**: ~10 seconds - **GPU**: Not required - **Command**: `bash tests/run_unit_tests.sh` ### Standard Suite (Unit + Short Integration) - **Duration**: ~15-30 minutes - **GPU**: 1 GPU required - **Command**: `bash tests/run_integration_tests.sh` ### Full Suite (Including Long Stability Tests) - **Duration**: ~1-2 hours - **GPU**: 1 GPU required - **Command**: `RUN_LONG_TESTS=1 bash tests/run_integration_tests.sh` ### Multi-GPU Suite - **Duration**: ~20-40 minutes - **GPU**: 2-4 GPUs required - **Command**: `bash tests/run_integration_tests.sh` (auto-detects GPUs) ## Success Criteria ### Unit Tests - [ ] All 5 unit tests pass - [ ] Tests complete in < 10 seconds total - [ ] Code coverage ≥ 80% for `nanochat/auto_batch_size.py` ### Integration Tests - Basic - [ ] Single GPU discovery completes in < 30 seconds - [ ] Auto-discovered batch size ≥ manual baseline (8) - [ ] No OOM errors in any test - [ ] All logs contain expected messages ### Integration Tests - DDP - [ ] Rank 0 performs discovery, other ranks receive broadcast - [ ] All ranks use identical batch size - [ ] No deadlocks or synchronization errors - [ ] Tests complete successfully on 2 and 4 GPUs ### Integration Tests - Performance - [ ] Throughput improvement ≥ 1.3x compared to manual baseline - [ ] Speedup ratio calculated and logged - [ ] Results saved to JSON for analysis ### Integration Tests - Stability - [ ] All 1000 iterations complete without errors - [ ] No OOM errors during long runs - [ ] No memory leaks detected - [ ] Larger models (depth=32) use smaller batch sizes than smaller models (depth=12) ### Integration Tests - Overrides - [ ] Manual `--device_batch_size` skips auto-discovery - [ ] Custom safety margins produce expected batch sizes - [ ] Disabled auto-discovery uses default values ### Integration Tests - Cache - [ ] Cache hit reduces startup time from 15-30s to < 5s - [ ] Different configurations create different cache keys - [ ] Corrupted cache handled gracefully (fallback to re-discovery) - [ ] Cache files created in correct directory ### Integration Tests - Failure Handling - [ ] Artificial memory constraints trigger fallback - [ ] Warning messages logged appropriately - [ ] Mid-training override warning appears - [ ] No crashes or exceptions, only graceful degradation ## Known Limitations 1. **Cache Tests**: Require write access to cache directory (usually `~/.nanochat/auto_batch_cache/`) 2. **DDP Tests**: Automatically skipped if fewer than 2 GPUs available 3. **Long Tests**: Disabled by default, require `RUN_LONG_TESTS=1` environment variable 4. **Memory Constraint Tests**: Difficult to reliably simulate on all systems 5. **Mid-Training Tests**: Require existing checkpoint from base_train ## Test Maintenance ### Adding New Tests 1. **Unit Tests**: Add to `tests/test_auto_batch_size.py` ```python def test_new_feature(): # Test implementation assert result == expected ``` 2. **Integration Tests**: Create new script in `tests/integration/` ```bash #!/bin/bash # tests/integration/test_new_feature.sh set -e # Test implementation ``` 3. Update `tests/run_integration_tests.sh` to include new test 4. Update this test plan document ### Debugging Failed Tests 1. **Check logs**: All test output saved to `tests/results/*.log` 2. **Run individually**: Execute specific test script in isolation 3. **Increase verbosity**: Use `-x` flag for bash scripts, `-vv` for pytest 4. **Check GPU state**: Run `nvidia-smi` before and after tests 5. **Clear cache**: Remove `~/.nanochat/auto_batch_cache/` if cache issues suspected ## CI/CD Integration ### Recommended CI Pipeline ```yaml stages: - test-unit - test-integration-fast - test-integration-full test-unit: script: - bash tests/run_unit_tests.sh duration: 1 minute test-integration-fast: script: - bash tests/run_integration_tests.sh duration: 30 minutes requires: [test-unit] test-integration-full: script: - RUN_LONG_TESTS=1 bash tests/run_integration_tests.sh duration: 2 hours requires: [test-integration-fast] when: manual # Only run on-demand ``` ### Pre-commit Hooks ```bash #!/bin/bash # .git/hooks/pre-commit bash tests/run_unit_tests.sh ``` ## Test Data ### Expected Batch Sizes (A100 80GB GPU) - depth=12: ~64-96 - depth=20: ~32-48 - depth=26: ~16-32 - depth=32: ~8-16 **Note**: Actual values depend on GPU memory, safety margin, and max_seq_len. ### Expected Speedups - Baseline: device_batch_size=8 - Auto-discovered: device_batch_size=32-64 - Expected speedup: 1.5-3.0x (target: ≥1.3x after overhead) ## Appendix: Test File Structure ``` tests/ ├── README.md # User-facing documentation ├── TEST_PLAN.md # This file ├── test_auto_batch_size.py # Unit tests ├── run_unit_tests.sh # Unit test runner ├── run_integration_tests.sh # Integration test runner ├── make_executable.sh # Helper to chmod +x scripts ├── integration/ # Integration test scripts │ ├── test_single_gpu_discovery.sh │ ├── test_manual_vs_auto.sh │ ├── test_ddp_discovery.sh │ ├── test_throughput_comparison.sh │ ├── test_stability_depth12.sh │ ├── test_stability_depth20.sh │ ├── test_stability_depth26.sh │ ├── test_stability_depth32.sh │ ├── test_overrides.sh │ ├── test_cache_mechanism.sh │ └── test_failure_handling.sh └── results/ # Test output (gitignored) ├── .gitkeep ├── *.log └── throughput_comparison.json ``` ## Version History - **v1.0** (2024-01): Initial test suite implementation - 5 unit tests - 17 integration tests (Tests 6-22) - Unit and integration test runners - Comprehensive documentation