mirror of
https://github.com/karpathy/nanochat.git
synced 2026-03-31 09:05:14 +00:00
# Auto-Discovery Testing Suite

Comprehensive tests for the auto-discovery batch size functionality in NanoChat.

## Overview

This testing suite validates the auto-discovery system across different scenarios:

- **Unit Tests**: Isolated testing of core algorithms (exponential search, binary search, caching)
- **Integration Tests**: End-to-end testing with actual training scripts
- **Stability Tests**: Long-running tests to detect memory leaks and OOM issues
- **Performance Tests**: Throughput comparisons between manual and auto-discovered batch sizes

## Quick Start

### Run All Tests

```bash
# Run unit tests only (fast, ~10 seconds)
bash tests/run_unit_tests.sh

# Run integration tests (requires GPU, 10-30 minutes)
bash tests/run_integration_tests.sh

# Run integration tests including long stability tests (1+ hours)
RUN_LONG_TESTS=1 bash tests/run_integration_tests.sh
```

### Run Individual Tests

```bash
# Unit tests
pytest tests/test_auto_batch_size.py -v

# Specific integration tests
bash tests/integration/test_single_gpu_discovery.sh
bash tests/integration/test_ddp_discovery.sh
bash tests/integration/test_throughput_comparison.sh
```

## Test Categories

### Unit Tests (`test_auto_batch_size.py`)

Tests the core discovery algorithms in isolation using mocks:

- **Test 1**: Exponential search finds upper bound (1, 2, 4, 8, 16, 32, 64)
- **Test 2**: Binary search refines to exact boundary
- **Test 3**: Safety margin application (0.85, 0.90, 0.95)
- **Test 4**: Cache hit/miss behavior
- **Test 5**: DDP broadcast simulation

**Run with:**

```bash
pytest tests/test_auto_batch_size.py -v --tb=short
```

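
The algorithm exercised by Tests 1-3 can be sketched roughly as follows. This is an illustrative reconstruction, not the project's actual implementation: `fits` stands in for a real trial training step and is mocked here, exactly as the unit tests mock it.

```python
# Illustrative sketch of the discovery loop: exponential search for an
# upper bound, binary search to the exact boundary, then a safety margin.
# `fits(bs)` is a mock for "a training step at batch size bs does not OOM".

def discover_batch_size(fits, start=1, max_bs=4096, margin=0.90):
    # Exponential phase: double until a batch size fails (or the cap is hit).
    bs = start
    while bs <= max_bs and fits(bs):
        bs *= 2
    lo, hi = bs // 2, bs  # largest known-good, smallest known-bad
    # Binary phase: narrow down to the exact boundary.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fits(mid):
            lo = mid
        else:
            hi = mid
    # Back off from the boundary by the safety margin.
    return max(1, int(lo * margin))

# Mock: pretend anything up to batch size 23 fits in memory.
print(discover_batch_size(lambda b: b <= 23))  # boundary 23, margin 0.90 -> 20
```

With a margin of 0.90 versus 0.85, the same boundary yields a larger final batch size, which is the relationship Test 17 checks.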

### Integration Tests

#### Single GPU Tests

- **Test 6**: Basic discovery run (`test_single_gpu_discovery.sh`)
  - Verifies discovery completes in < 30 seconds
  - Checks for proper log messages
  - Validates no OOM errors

- **Test 7**: Manual vs Auto comparison (`test_manual_vs_auto.sh`)
  - Compares manual batch_size=8 with auto-discovery
  - Validates auto batch size ≥ manual
  - Ensures both runs complete successfully

#### Multi-GPU Tests

- **Test 8**: 2-GPU DDP discovery (`test_ddp_discovery.sh`)
  - Verifies rank 0 performs discovery
  - Checks broadcast to rank 1
  - Validates synchronization

- **Test 9**: 4-GPU DDP discovery (if available)
  - Same as Test 8 with 4 GPUs
  - Skipped if fewer than 4 GPUs available

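
The property these tests assert — rank 0 discovers, every other rank receives the same value — can be simulated without GPUs, much as the Test 5 unit test does. A minimal pure-Python sketch (the real code would use `torch.distributed.broadcast`; the function name here is illustrative):

```python
# Illustrative simulation of rank-0 discovery plus broadcast: only rank 0
# pays the discovery cost, and its result is copied to all other ranks,
# standing in for a real torch.distributed broadcast.

def simulate_ddp_discovery(world_size, discover):
    results = [None] * world_size
    results[0] = discover()          # rank 0 runs the (expensive) discovery
    for rank in range(1, world_size):
        results[rank] = results[0]   # stand-in for dist.broadcast(...)
    return results

sizes = simulate_ddp_discovery(world_size=4, discover=lambda: 32)
print(sizes)                          # [32, 32, 32, 32]
assert len(set(sizes)) == 1           # all ranks agree, as Test 8 requires
```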
#### Throughput Tests

- **Test 10**: Throughput comparison (`test_throughput_comparison.sh`)
  - Measures iterations/second for manual vs auto
  - Calculates speedup ratio
  - Target: ≥ 1.3x speedup (allows for discovery overhead)
  - Saves results to `tests/results/throughput_comparison.json`

#### Stability Tests

Long-running tests (1000 iterations each):

- **Test 11**: Depth=12 (`test_stability_depth12.sh`)
- **Test 12**: Depth=20 (`test_stability_depth20.sh`)
- **Test 13**: Depth=26 (`test_stability_depth26.sh`)
- **Test 14**: Depth=32 (`test_stability_depth32.sh`)

Across these runs, the suite:

- Verifies larger models use smaller batch sizes
- Monitors for memory leaks
- Ensures no OOM during long runs

**Run with:**

```bash
RUN_LONG_TESTS=1 bash tests/run_integration_tests.sh
```

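
The leak monitoring can be as simple as comparing memory samples from early and late iterations. A hedged sketch of one possible check (the real scripts would sample something like `torch.cuda.memory_allocated()`; the tolerance is illustrative):

```python
# Illustrative leak check: given per-iteration memory samples (bytes),
# flag a leak if the late-run average exceeds the early-run average by
# more than a tolerance. Real tests would sample GPU memory each iteration.

def looks_like_leak(samples, tolerance=0.05):
    n = len(samples) // 4
    early = sum(samples[:n]) / n          # average of the first quarter
    late = sum(samples[-n:]) / n          # average of the last quarter
    return late > early * (1 + tolerance)

steady = [1000, 1010, 995, 1005] * 250     # flat usage: no leak
leaking = [1000 + i for i in range(1000)]  # steadily growing usage
print(looks_like_leak(steady))             # False
print(looks_like_leak(leaking))            # True
```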
#### Override Tests

- **Test 15**: Manual override (`test_overrides.sh`)
  - Verifies `--device_batch_size=16` skips auto-discovery
  - Checks for manual batch size usage message

- **Test 16**: Disable auto-discovery
  - Tests with auto-discovery disabled
  - Verifies fallback to default batch_size=8

- **Test 17**: Custom safety margin
  - Tests `--batch_size_margin=0.85` vs `0.90`
  - Verifies higher margin gives larger batch size

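
Tests 15 and 16 together pin down a precedence order: an explicit manual value wins outright, otherwise auto-discovery runs if enabled, otherwise the default applies. A hypothetical sketch of that resolution (function and default names are illustrative, not the project's actual API):

```python
# Hypothetical batch size resolution order exercised by Tests 15-16:
# explicit manual flag > auto-discovery (if enabled) > default.

DEFAULT_BATCH_SIZE = 8  # assumed default, per Test 16

def resolve_batch_size(manual=None, auto_enabled=True, discover=None):
    if manual is not None:
        return manual                    # Test 15: override skips discovery
    if auto_enabled and discover is not None:
        return discover()                # normal auto-discovery path
    return DEFAULT_BATCH_SIZE            # Test 16: discovery disabled

print(resolve_batch_size(manual=16))            # 16
print(resolve_batch_size(auto_enabled=False))   # 8
print(resolve_batch_size(discover=lambda: 32))  # 32
```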
#### Cache Tests

- **Test 18**: Cache hit (`test_cache_mechanism.sh`)
  - First run: discovery + cache save
  - Second run: cache hit (< 5 seconds)
  - Verifies cache file creation

- **Test 19**: Cache key validation
  - Different depth → different cache key
  - Different max_seq_len → different cache key
  - Verifies multiple cache files created

- **Test 20**: Cache invalidation
  - Corrupts cache file
  - Verifies graceful fallback to re-discovery
  - Tests cache deletion and re-run

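
The property Test 19 checks — any configuration change that affects memory use yields a distinct cache entry — can be illustrated with a hypothetical key scheme (the real key format in `nanochat/auto_batch_size.py` may differ):

```python
# Hypothetical cache key: fold every config field that affects memory use
# into a hash, so changing depth or max_seq_len misses the cache.
import hashlib
import json

def cache_key(depth, max_seq_len, dtype="bfloat16"):
    config = json.dumps(
        {"depth": depth, "max_seq_len": max_seq_len, "dtype": dtype},
        sort_keys=True,  # stable ordering so equal configs hash equally
    )
    return hashlib.sha256(config.encode()).hexdigest()[:16]

assert cache_key(12, 2048) == cache_key(12, 2048)  # same config: cache hit
assert cache_key(12, 2048) != cache_key(20, 2048)  # different depth
assert cache_key(12, 2048) != cache_key(12, 1024)  # different max_seq_len
```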
#### Failure Handling Tests

- **Test 21**: Artificial memory constraint (`test_failure_handling.sh`)
  - Tests with very large model (depth=40)
  - Verifies fallback to defaults
  - Checks for warning messages

- **Test 22**: Mid-training override warning
  - Tests mid_train.py with larger batch size than pretrain
  - Verifies "FOOTGUN WARNING" appears
  - Ensures training continues despite warning

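
The graceful-degradation contract behind Test 21 — a discovery failure must warn and fall back to a default rather than crash — can be sketched as follows. `DEFAULT_BATCH_SIZE` and the warning text are illustrative assumptions, not the project's actual values:

```python
# Illustrative fallback wrapper: if discovery itself fails for any reason
# (e.g. OOM even at batch size 1), log a warning and fall back to a
# conservative default instead of crashing the run.
import logging

DEFAULT_BATCH_SIZE = 8  # assumed default; see the project config

def discover_or_default(discover):
    try:
        return discover()
    except Exception as e:
        logging.warning("auto-discovery failed (%s); falling back to %d",
                        e, DEFAULT_BATCH_SIZE)
        return DEFAULT_BATCH_SIZE

print(discover_or_default(lambda: 32))   # 32
print(discover_or_default(lambda: 1/0))  # 8, with a warning logged
```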
## Test Results

Results are saved to `tests/results/`:

```
tests/results/
├── test_single_gpu_discovery.log
├── test_manual_baseline.log
├── test_auto_discovery.log
├── throughput_comparison.json
├── stability_depth12.log
├── stability_depth20.log
├── cache_run1.log
├── cache_run2.log
└── ...
```

### Throughput Results Format

`tests/results/throughput_comparison.json`:

```json
{
  "timestamp": "2024-01-15T10:30:00Z",
  "depth": 12,
  "max_iterations": 100,
  "manual": {
    "batch_size": 8,
    "duration_seconds": 120,
    "throughput_iter_per_sec": 0.833
  },
  "auto": {
    "batch_size": 32,
    "duration_seconds": 60,
    "throughput_iter_per_sec": 1.667
  },
  "speedup_ratio": 2.0
}
```

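
A CI step could consume this file and gate on the 1.3x target. A minimal sketch, assuming the field names shown in the example above (a real check would `json.load()` the results file rather than inline the record):

```python
# Sketch: compute the speedup from a throughput record and check it
# against the >= 1.3x target used by Test 10. The record is inlined here
# for self-containment; field names follow the example JSON.
import json

record = json.loads("""
{"manual": {"throughput_iter_per_sec": 0.833},
 "auto":   {"throughput_iter_per_sec": 1.667},
 "speedup_ratio": 2.0}
""")

speedup = (record["auto"]["throughput_iter_per_sec"]
           / record["manual"]["throughput_iter_per_sec"])
print(round(speedup, 2))
assert speedup >= 1.3, "below the 1.3x speedup target"
```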

## Requirements

### Unit Tests

- Python 3.8+
- PyTorch
- pytest
- No GPU required (runs on CPU)

### Integration Tests

- CUDA-capable GPU (≥ 24GB VRAM recommended)
- Multiple GPUs for DDP tests (optional)
- Environment variables:
  - `NANOCHAT_BASE_DIR`: Base directory for checkpoints/cache (optional)
  - `RUN_LONG_TESTS=1`: Enable 1000-iteration stability tests (optional)

## CI/CD Integration

For automated testing in CI:

```bash
# Quick validation (unit tests + fast integration tests)
bash tests/run_unit_tests.sh
bash tests/run_integration_tests.sh  # ~15 minutes

# Full validation (includes long tests)
RUN_LONG_TESTS=1 bash tests/run_integration_tests.sh  # ~1 hour
```

### GitHub Actions Example

```yaml
name: Auto-Discovery Tests

on: [push, pull_request]

jobs:
  test:
    runs-on: [self-hosted, gpu]
    steps:
      - uses: actions/checkout@v4
      - name: Run unit tests
        run: bash tests/run_unit_tests.sh
      - name: Run integration tests
        run: bash tests/run_integration_tests.sh
      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: test-results
          path: tests/results/
```


## Troubleshooting

### Common Issues

1. **"SKIP: Need at least 2 GPUs for DDP tests"**
   - Expected if you have only 1 GPU
   - DDP tests will be skipped automatically

2. **"Cache directory is empty or doesn't exist"**
   - Cache may be disabled, or there may be a path issue
   - Check the `NANOCHAT_BASE_DIR` environment variable

3. **"Discovery takes longer than 30 seconds"**
   - May indicate a large model or slow GPU
   - Increase the timeout in the test script if needed

4. **"Speedup ratio below threshold"**
   - Discovery overhead may be high for short runs
   - Try longer runs (increase `MAX_ITERATIONS`)

### Debug Mode

Run tests with verbose output:

```bash
# Unit tests with full traceback
pytest tests/test_auto_batch_size.py -vv --tb=long

# Integration tests with set -x
bash -x tests/integration/test_single_gpu_discovery.sh
```

## Success Criteria

### Unit Tests

- ✓ All 5 unit tests pass
- ✓ Tests complete in < 10 seconds
- ✓ Code coverage ≥ 80% for `nanochat/auto_batch_size.py`

### Integration Tests

- ✓ Single GPU discovery completes in < 30 seconds
- ✓ No OOM errors during 1000+ iteration stability tests
- ✓ Throughput improvement ≥ 1.3x compared to manual baseline
- ✓ DDP tests show identical batch size across all ranks
- ✓ Override tests correctly skip discovery or use manual values
- ✓ Cache tests show < 5 second cache hit time vs 15-30 second discovery

### Failure Handling

- ✓ Artificial memory constraints trigger fallback to defaults
- ✓ Warning messages appear in logs for fallback scenarios
- ✓ No crashes or exceptions, only graceful degradation

## Contributing

When adding new tests:

1. Add unit tests to `tests/test_auto_batch_size.py`
2. Add integration tests as new `.sh` scripts in `tests/integration/`
3. Update `tests/run_integration_tests.sh` to include new tests
4. Update this README with test descriptions
5. Ensure tests clean up after themselves (delete temp files, clear cache)

## License

Same as NanoChat project.