# Auto-Discovery Test Plan
## Test Coverage Matrix
| Test # | Name | Type | Duration | GPU Required | Status |
|--------|------|------|----------|--------------|--------|
| 1 | Exponential Search Logic | Unit | < 1s | No | Implemented |
| 2 | Binary Search Refinement | Unit | < 1s | No | Implemented |
| 3 | Safety Margin Application | Unit | < 1s | No | Implemented |
| 4 | Cache Mechanism | Unit | < 1s | No | Implemented |
| 5 | DDP Broadcast Simulation | Unit | < 1s | No | Implemented |
| 6 | Basic Discovery Run | Integration | 30s | 1 GPU | Implemented |
| 7 | Manual vs Auto Comparison | Integration | 2-3 min | 1 GPU | Implemented |
| 8 | DDP Discovery (2 GPUs) | Integration | 1-2 min | 2 GPUs | Implemented |
| 9 | DDP Discovery (4 GPUs) | Integration | 1-2 min | 4 GPUs | Implemented |
| 10 | Throughput Comparison | Integration | 5-10 min | 1 GPU | Implemented |
| 11 | Stability (depth=12) | Integration | 10-15 min | 1 GPU | Implemented |
| 12 | Stability (depth=20) | Integration | 15-20 min | 1 GPU | Implemented |
| 13 | Stability (depth=26) | Integration | 20-25 min | 1 GPU | Implemented |
| 14 | Stability (depth=32) | Integration | 25-30 min | 1 GPU | Implemented |
| 15 | Manual Override | Integration | 1-2 min | 1 GPU | Implemented |
| 16 | Disable Auto-Discovery | Integration | 1-2 min | 1 GPU | Implemented |
| 17 | Custom Safety Margin | Integration | 2-3 min | 1 GPU | Implemented |
| 18 | Cache Hit | Integration | 2-3 min | 1 GPU | Implemented |
| 19 | Cache Key Validation | Integration | 3-4 min | 1 GPU | Implemented |
| 20 | Cache Invalidation | Integration | 2-3 min | 1 GPU | Implemented |
| 21 | Artificial Memory Constraint | Integration | 2-3 min | 1 GPU | Implemented |
| 22 | Mid-Training Override Warning | Integration | 2-3 min | 1 GPU | Implemented |
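
Tests 1-3 cover the three stages of batch size discovery: exponential search, binary refinement, and the safety margin. The sketch below shows one way those stages could fit together. It is not the code in `nanochat/auto_batch_size.py`; the `fits` predicate, the parameter names, and the defaults are assumptions made for illustration.

```python
from typing import Callable


def discover_batch_size(
    fits: Callable[[int], bool],
    start: int = 1,
    max_batch_size: int = 1024,
    safety_margin: float = 0.85,
) -> int:
    """Return the largest batch size that fits, scaled by `safety_margin`.

    `fits` is a hypothetical predicate: True if one training step at the
    given batch size completes without running out of memory.
    """
    if not fits(start):
        raise RuntimeError(f"batch size {start} does not fit")

    # Exponential phase: double until a size fails or the cap is reached.
    low = start
    high = None
    while low * 2 <= max_batch_size:
        if fits(low * 2):
            low *= 2
        else:
            high = low * 2
            break

    # Binary phase: narrow (low, high) down to the largest size that fits.
    if high is not None:
        while high - low > 1:
            mid = (low + high) // 2
            if fits(mid):
                low = mid
            else:
                high = mid

    # Safety margin: leave headroom, but never return less than 1.
    return max(1, int(low * safety_margin))
```

Because `fits` is injected, a unit test can replace it with a pure function such as `lambda b: b <= 100` and run without a GPU, which is consistent with the "No" entries in the GPU Required column for Tests 1-3.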
## Test Execution Time Estimates
### Fast Suite (Unit Tests Only)
- **Duration**: ~10 seconds
- **GPU**: Not required
- **Command**: `bash tests/run_unit_tests.sh`
### Standard Suite (Unit + Short Integration)
- **Duration**: ~15-30 minutes
- **GPU**: 1 GPU required
- **Command**: `bash tests/run_integration_tests.sh`
### Full Suite (Including Long Stability Tests)
- **Duration**: ~1-2 hours
- **GPU**: 1 GPU required
- **Command**: `RUN_LONG_TESTS=1 bash tests/run_integration_tests.sh`
### Multi-GPU Suite
- **Duration**: ~20-40 minutes
- **GPU**: 2-4 GPUs required
- **Command**: `bash tests/run_integration_tests.sh` (auto-detects GPUs)
## Success Criteria
### Unit Tests
- [ ] All 5 unit tests pass
- [ ] Tests complete in < 10 seconds total
- [ ] Code coverage ≥ 80% for `nanochat/auto_batch_size.py`
### Integration Tests - Basic
- [ ] Single GPU discovery completes in < 30 seconds
- [ ] Auto-discovered batch size ≥ manual baseline (8)
- [ ] No OOM errors in any test
- [ ] All logs contain expected messages
### Integration Tests - DDP
- [ ] Rank 0 performs discovery, other ranks receive broadcast
- [ ] All ranks use identical batch size
- [ ] No deadlocks or synchronization errors
- [ ] Tests complete successfully on 2 and 4 GPUs
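
The first two DDP criteria can be checked without GPUs by simulating the broadcast, in the spirit of Test 5. The sketch below is a stand-in only: `simulate_ddp_discovery` and `check_ddp_agreement` are hypothetical helpers, and a real run would use a collective such as `torch.distributed.broadcast` with `src=0` instead of a Python list.

```python
from typing import Callable, List


def simulate_ddp_discovery(world_size: int, discover: Callable[[], int]) -> List[int]:
    """Simulate rank-0 discovery followed by a broadcast to every rank."""
    # Rank 0 runs the (expensive) search exactly once.
    batch_size = discover()
    # Stand-in for the broadcast: every rank ends up with rank 0's value.
    return [batch_size for _rank in range(world_size)]


def check_ddp_agreement(world_size: int = 4) -> bool:
    """Return True if discovery ran once and all ranks share one value."""
    calls = []

    def fake_discover() -> int:
        calls.append(1)  # record each invocation of the search
        return 32

    per_rank = simulate_ddp_discovery(world_size, fake_discover)
    return (
        len(calls) == 1
        and len(per_rank) == world_size
        and len(set(per_rank)) == 1
    )
```

A simulation like this cannot reveal deadlocks or synchronization errors, so those criteria still need the multi-GPU integration tests.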
### Integration Tests - Performance
- [ ] Throughput improvement ≥ 1.3x compared to manual baseline
- [ ] Speedup ratio calculated and logged
- [ ] Results saved to JSON for analysis
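
As an illustration of the performance criteria, the sketch below computes a speedup ratio, logs it, and writes it to JSON. The function name, the JSON field names, and the throughput unit (tokens per second) are assumptions; the actual format of `throughput_comparison.json` is not specified in this plan.

```python
import json
from pathlib import Path


def record_speedup(
    baseline_tok_per_s: float,
    auto_tok_per_s: float,
    out_path: Path,
    min_speedup: float = 1.3,
) -> dict:
    """Compute auto/baseline throughput ratio, log it, and save it as JSON."""
    if baseline_tok_per_s <= 0:
        raise ValueError("baseline throughput must be positive")

    speedup = auto_tok_per_s / baseline_tok_per_s
    result = {
        "baseline_tokens_per_sec": baseline_tok_per_s,
        "auto_tokens_per_sec": auto_tok_per_s,
        "speedup": speedup,
        "passed": speedup >= min_speedup,
    }

    # Create the results directory if needed, then persist the record.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(result, indent=2))
    print(f"speedup: {speedup:.2f}x (target: {min_speedup}x)")
    return result
```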
### Integration Tests - Stability
- [ ] All 1000 iterations complete without errors
- [ ] No OOM errors during long runs
- [ ] No memory leaks detected
- [ ] Larger models (depth=32) use smaller batch sizes than smaller models (depth=12)
### Integration Tests - Overrides
- [ ] Manual `--device_batch_size` skips auto-discovery
- [ ] Custom safety margins produce expected batch sizes
- [ ] Disabled auto-discovery uses default values
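
The precedence these override checks imply can be sketched as a small resolver. This is a hypothetical helper, not the project's code; the default of 8 is an assumption borrowed from the manual baseline used elsewhere in this plan.

```python
from typing import Callable, Optional

# Assumed default, taken from the manual baseline (device_batch_size=8).
DEFAULT_DEVICE_BATCH_SIZE = 8


def resolve_device_batch_size(
    manual: Optional[int],
    auto_discovery_enabled: bool,
    discover: Callable[[], int],
    default: int = DEFAULT_DEVICE_BATCH_SIZE,
) -> int:
    """Pick the batch size: manual value, then discovery, then the default."""
    if manual is not None:
        # An explicit --device_batch_size wins and discovery is skipped.
        return manual
    if not auto_discovery_enabled:
        # Discovery disabled: use the default without searching.
        return default
    return discover()
```

Passing a `discover` callable that raises is a simple way to assert that the first two branches never start a search.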
### Integration Tests - Cache
- [ ] Cache hit reduces startup time from 15-30s to < 5s
- [ ] Different configurations create different cache keys
- [ ] Corrupted cache handled gracefully (fallback to re-discovery)
- [ ] Cache files created in correct directory
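
The cache criteria above can be illustrated with a minimal file-based cache. The sketch assumes one JSON file per configuration, keyed by a hash of the configuration; the key derivation, file layout, and function names are hypothetical and may differ from the real implementation.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable, Optional


def cache_key(config: dict) -> str:
    """Derive a stable key so different configurations never collide by order."""
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def load_cached(cache_dir: Path, config: dict) -> Optional[int]:
    """Return the cached batch size, or None if missing or corrupted."""
    path = cache_dir / f"{cache_key(config)}.json"
    try:
        value = json.loads(path.read_text())["batch_size"]
    except (OSError, ValueError, KeyError, TypeError):
        return None  # unreadable, invalid JSON, or unexpected structure
    return value if isinstance(value, int) and value > 0 else None


def get_batch_size(cache_dir: Path, config: dict, discover: Callable[[], int]) -> int:
    """Use the cache when possible; otherwise re-discover and store the result."""
    cached = load_cached(cache_dir, config)
    if cached is not None:
        return cached
    batch_size = discover()
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{cache_key(config)}.json"
    path.write_text(json.dumps({"batch_size": batch_size}))
    return batch_size
```

Treating every read failure as a cache miss keeps the corrupted-cache case on the same code path as a cold start, which is what the re-discovery fallback criterion asks for.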
### Integration Tests - Failure Handling
- [ ] Artificial memory constraints trigger fallback
- [ ] Warning messages logged appropriately
- [ ] Mid-training override warning appears
- [ ] No crashes or exceptions, only graceful degradation
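
Graceful degradation can be sketched as a wrapper that catches discovery failures, logs a warning, and returns a fallback value. The wrapper name, the logger name, and the fallback of 8 are assumptions. Catching `RuntimeError` is also an assumption about how out-of-memory failures surface; in PyTorch, CUDA out-of-memory errors are raised as `RuntimeError` subclasses.

```python
import logging
from typing import Callable

logger = logging.getLogger("auto_batch_size_sketch")


def discover_or_fallback(discover: Callable[[], int], fallback: int = 8) -> int:
    """Run discovery; on failure, warn and return `fallback` instead of crashing."""
    try:
        return discover()
    except (MemoryError, RuntimeError) as exc:
        logger.warning(
            "auto-discovery failed (%s); falling back to batch size %d",
            exc,
            fallback,
        )
        return fallback
```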
## Known Limitations
1. **Cache Tests**: Require write access to cache directory (usually `~/.nanochat/auto_batch_cache/`)
2. **DDP Tests**: Automatically skipped if fewer than 2 GPUs available
3. **Long Tests**: Disabled by default, require `RUN_LONG_TESTS=1` environment variable
4. **Memory Constraint Tests**: Difficult to reliably simulate on all systems
5. **Mid-Training Tests**: Require existing checkpoint from base_train
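
The integration tests in this plan are bash scripts, but the same gating on GPU count and `RUN_LONG_TESTS` could be expressed in Python with `unittest` skip decorators. The sketch below uses hypothetical helper and test names and treats a missing PyTorch install as zero GPUs.

```python
import os
import unittest


def gpu_count() -> int:
    """Number of visible CUDA devices, or 0 when PyTorch is not installed."""
    try:
        import torch
    except ImportError:
        return 0
    return torch.cuda.device_count()


def long_tests_enabled(env=os.environ) -> bool:
    """True only when RUN_LONG_TESTS is set to exactly '1'."""
    return env.get("RUN_LONG_TESTS") == "1"


class ExampleGatedTests(unittest.TestCase):
    @unittest.skipUnless(gpu_count() >= 2, "needs at least 2 GPUs")
    def test_ddp_discovery(self):
        pass  # placeholder for a multi-GPU check

    @unittest.skipUnless(long_tests_enabled(), "set RUN_LONG_TESTS=1 to enable")
    def test_stability_long(self):
        pass  # placeholder for a long stability run
```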
## Test Maintenance
### Adding New Tests
1. **Unit Tests**: Add to `tests/test_auto_batch_size.py`
   ```python
   def test_new_feature():
       # Replace the placeholders below with the real call and expected value.
       expected = 4
       result = 2 + 2
       assert result == expected
   ```
2. **Integration Tests**: Create new script in `tests/integration/`
   ```bash
   #!/bin/bash
   # tests/integration/test_new_feature.sh
   set -e
   # Test implementation
   ```
3. Update `tests/run_integration_tests.sh` to include the new test
4. Update this test plan document
### Debugging Failed Tests
1. **Check logs**: All test output saved to `tests/results/*.log`
2. **Run individually**: Execute specific test script in isolation
3. **Increase verbosity**: Use `-x` flag for bash scripts, `-vv` for pytest
4. **Check GPU state**: Run `nvidia-smi` before and after tests
5. **Clear cache**: Remove `~/.nanochat/auto_batch_cache/` if cache issues suspected
## CI/CD Integration
### Recommended CI Pipeline
```yaml
# Schematic pipeline: key names such as `duration` and `requires` are
# illustrative and need to be mapped to your CI system's syntax.
stages:
  - test-unit
  - test-integration-fast
  - test-integration-full

test-unit:
  script:
    - bash tests/run_unit_tests.sh
  duration: 1 minute

test-integration-fast:
  script:
    - bash tests/run_integration_tests.sh
  duration: 30 minutes
  requires: [test-unit]

test-integration-full:
  script:
    - RUN_LONG_TESTS=1 bash tests/run_integration_tests.sh
  duration: 2 hours
  requires: [test-integration-fast]
  when: manual  # Only run on-demand
```
### Pre-commit Hooks
```bash
#!/bin/bash
# .git/hooks/pre-commit
bash tests/run_unit_tests.sh
```
## Test Data
### Expected Batch Sizes (A100 80GB GPU)
- depth=12: ~64-96
- depth=20: ~32-48
- depth=26: ~16-32
- depth=32: ~8-16
**Note**: Actual values depend on GPU memory, safety margin, and max_seq_len.
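
The ranges above can be turned into a soft sanity check on discovered values. The sketch below copies the ranges from this section and also checks that batch size does not grow with depth. Since the ranges are approximate and hardware-dependent, the helper (a hypothetical name) returns a list of findings instead of raising.

```python
from typing import Dict, List

# Approximate ranges from this section (A100 80GB); treat them as soft bounds.
EXPECTED_RANGES = {12: (64, 96), 20: (32, 48), 26: (16, 32), 32: (8, 16)}


def check_batch_sizes(observed: Dict[int, int]) -> List[str]:
    """Return human-readable findings for discovered batch sizes by depth."""
    problems = []

    # Range check for every depth that has a documented expectation.
    for depth, (low, high) in EXPECTED_RANGES.items():
        if depth in observed and not low <= observed[depth] <= high:
            problems.append(
                f"depth={depth}: {observed[depth]} outside ~{low}-{high}"
            )

    # Deeper models should not end up with larger batch sizes.
    depths = sorted(observed)
    for shallow, deep in zip(depths, depths[1:]):
        if observed[deep] > observed[shallow]:
            problems.append(
                f"depth={deep} ({observed[deep]}) exceeds "
                f"depth={shallow} ({observed[shallow]})"
            )
    return problems
```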
### Expected Speedups
- Baseline: device_batch_size=8
- Auto-discovered: device_batch_size=32-64
- Expected speedup: 1.5-3.0x (target: ≥ 1.3x after overhead)
## Appendix: Test File Structure
```
tests/
├── README.md                    # User-facing documentation
├── TEST_PLAN.md                 # This file
├── test_auto_batch_size.py      # Unit tests
├── run_unit_tests.sh            # Unit test runner
├── run_integration_tests.sh     # Integration test runner
├── make_executable.sh           # Helper to chmod +x scripts
├── integration/                 # Integration test scripts
│   ├── test_single_gpu_discovery.sh
│   ├── test_manual_vs_auto.sh
│   ├── test_ddp_discovery.sh
│   ├── test_throughput_comparison.sh
│   ├── test_stability_depth12.sh
│   ├── test_stability_depth20.sh
│   ├── test_stability_depth26.sh
│   ├── test_stability_depth32.sh
│   ├── test_overrides.sh
│   ├── test_cache_mechanism.sh
│   └── test_failure_handling.sh
└── results/                     # Test output (gitignored)
    ├── .gitkeep
    ├── *.log
    └── throughput_comparison.json
```
## Version History
- **v1.0** (2024-01): Initial test suite implementation
  - 5 unit tests
  - 17 integration tests (Tests 6-22)
  - Unit and integration test runners
  - Comprehensive documentation