# Auto-Discovery Testing Suite
Comprehensive tests for the auto-discovery batch size functionality in NanoChat.
## Overview
This testing suite validates the auto-discovery system across different scenarios:
- **Unit Tests**: Isolated testing of the core algorithms (exponential search, binary search, caching)
- **Integration Tests**: End-to-end testing with the actual training scripts
- **Stability Tests**: Long-running tests to detect memory leaks and OOM issues
- **Performance Tests**: Throughput comparisons between manual and auto-discovered batch sizes
## Quick Start

### Run All Tests

```bash
# Run unit tests only (fast, ~10 seconds)
bash tests/run_unit_tests.sh

# Run integration tests (requires GPU, 10-30 minutes)
bash tests/run_integration_tests.sh

# Run integration tests including long stability tests (1+ hours)
RUN_LONG_TESTS=1 bash tests/run_integration_tests.sh
```
### Run Individual Tests

```bash
# Unit tests
pytest tests/test_auto_batch_size.py -v

# Specific integration tests
bash tests/integration/test_single_gpu_discovery.sh
bash tests/integration/test_ddp_discovery.sh
bash tests/integration/test_throughput_comparison.sh
```
## Test Categories

### Unit Tests (`test_auto_batch_size.py`)

Tests the core discovery algorithms in isolation using mocks:

- **Test 1**: Exponential search finds the upper bound (1, 2, 4, 8, 16, 32, 64)
- **Test 2**: Binary search refines to the exact boundary
- **Test 3**: Safety margin application (0.85, 0.90, 0.95)
- **Test 4**: Cache hit/miss behavior
- **Test 5**: DDP broadcast simulation
Run with:

```bash
pytest tests/test_auto_batch_size.py -v --tb=short
```
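To illustrate what Tests 1-3 exercise, here is a minimal sketch of an exponential-then-binary discovery loop with a safety margin. It is not the actual `nanochat/auto_batch_size.py` implementation; `trial_step` is an assumed callable that runs one forward/backward pass at a given batch size and returns `False` on OOM.

```python
# Hypothetical sketch of the discovery loop the unit tests describe.
def discover_batch_size(trial_step, margin=0.90, start=1):
    # Exponential phase (Test 1): double until the first failure (1, 2, 4, 8, ...).
    bs = start
    while trial_step(bs):
        bs *= 2
    lo, hi = bs // 2, bs  # largest known-good, smallest known-bad
    # Binary phase (Test 2): refine to the exact boundary between lo and hi.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if trial_step(mid):
            lo = mid
        else:
            hi = mid
    # Safety margin (Test 3): leave headroom below the hard OOM boundary.
    return max(1, int(lo * margin))
```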
### Integration Tests

#### Single GPU Tests

- **Test 6: Basic discovery run** (`test_single_gpu_discovery.sh`)
  - Verifies discovery completes in < 30 seconds
  - Checks for the expected log messages
  - Validates no OOM errors
- **Test 7: Manual vs. auto comparison** (`test_manual_vs_auto.sh`)
  - Compares manual `batch_size=8` with auto-discovery
  - Validates auto batch size ≥ manual
  - Ensures both runs complete successfully
#### Multi-GPU Tests

- **Test 8: 2-GPU DDP discovery** (`test_ddp_discovery.sh`)
  - Verifies rank 0 performs discovery
  - Checks the broadcast to rank 1 (see the sketch after this list)
  - Validates synchronization
- **Test 9: 4-GPU DDP discovery** (if available)
  - Same as Test 8, but with 4 GPUs
  - Skipped if fewer than 4 GPUs are available
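The rank-0-discovers-then-broadcasts pattern these tests check might look like the following sketch. This is an illustration under assumed names (`discover_fn` is hypothetical), not the actual NanoChat code; it relies only on the standard `torch.distributed.broadcast` API.

```python
import torch
import torch.distributed as dist

def synced_batch_size(discover_fn, device):
    # Assumes dist.init_process_group(...) has already been called.
    # Only rank 0 pays the discovery cost; the other ranks wait.
    if dist.get_rank() == 0:
        bs = torch.tensor([discover_fn()], dtype=torch.int64, device=device)
    else:
        bs = torch.zeros(1, dtype=torch.int64, device=device)
    dist.broadcast(bs, src=0)  # every rank now holds rank 0's result
    return int(bs.item())
```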
#### Throughput Tests

- **Test 10: Throughput comparison** (`test_throughput_comparison.sh`)
  - Measures iterations/second for manual vs. auto
  - Calculates the speedup ratio
  - Target: ≥ 1.3x speedup (allows for discovery overhead)
  - Saves results to `tests/results/throughput_comparison.json`
#### Stability Tests

Long-running tests (1000 iterations each):

- **Test 11: Depth=12** (`test_stability_depth12.sh`)
- **Test 12: Depth=20** (`test_stability_depth20.sh`)
- **Test 13: Depth=26** (`test_stability_depth26.sh`)
- **Test 14: Depth=32** (`test_stability_depth32.sh`)

Across these runs, the suite:

- Verifies that larger models use smaller batch sizes
- Monitors for memory leaks
- Ensures no OOM occurs during the long runs
Run with:

```bash
RUN_LONG_TESTS=1 bash tests/run_integration_tests.sh
```
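A leak check of the kind these stability tests perform can be as simple as tracking CUDA's peak-memory high-water mark: if it keeps climbing after warmup, something is leaking. A minimal sketch, assuming a CUDA device is available (illustrative, not the suite's actual monitoring code):

```python
import torch

def log_peak_memory(iteration, log_every=100):
    # A steadily rising peak across windows after warmup suggests a leak.
    if iteration % log_every == 0:
        peak_mib = torch.cuda.max_memory_allocated() / 2**20
        print(f"iter {iteration}: peak CUDA memory {peak_mib:.0f} MiB")
        torch.cuda.reset_peak_memory_stats()
```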
#### Override Tests

- **Test 15: Manual override** (`test_overrides.sh`)
  - Verifies that `--device_batch_size=16` skips auto-discovery
  - Checks for the manual batch size usage message
- **Test 16: Disable auto-discovery**
  - Tests with auto-discovery disabled
  - Verifies fallback to the default `batch_size=8`
- **Test 17: Custom safety margin**
  - Tests `--batch_size_margin=0.85` vs. `0.90`
  - Verifies that the higher margin gives a larger batch size
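Together with the cache tests below, these override tests imply a resolution order: an explicit manual flag wins, then a cache hit, then fresh discovery. A minimal sketch of that precedence, with assumed (hypothetical) parameter names:

```python
def resolve_batch_size(manual, cached, discover_fn):
    if manual is not None:
        return manual        # Test 15: --device_batch_size skips discovery
    if cached is not None:
        return cached        # Test 18: a cache hit avoids re-discovery
    return discover_fn()     # otherwise, run the full discovery loop
```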
#### Cache Tests

- **Test 18: Cache hit** (`test_cache_mechanism.sh`)
  - First run: discovery + cache save
  - Second run: cache hit (< 5 seconds)
  - Verifies cache file creation
- **Test 19: Cache key validation**
  - Different depth → different cache key
  - Different max_seq_len → different cache key
  - Verifies multiple cache files are created
- **Test 20: Cache invalidation**
  - Corrupts the cache file
  - Verifies graceful fallback to re-discovery
  - Tests cache deletion and re-run
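The caching behavior these tests pin down might be implemented along these lines. The key fields and file layout here are assumptions for illustration, not the actual NanoChat scheme; the important properties are exactly the ones tested above: distinct configs yield distinct keys, and unreadable files fall back to re-discovery.

```python
import hashlib
import json
from pathlib import Path

def cache_path(base_dir, depth, max_seq_len, world_size):
    # Test 19: different depth or max_seq_len => different cache file.
    key = f"d{depth}-seq{max_seq_len}-ws{world_size}"
    digest = hashlib.sha256(key.encode()).hexdigest()[:16]
    return Path(base_dir) / "batch_size_cache" / f"{digest}.json"

def load_cached_batch_size(path):
    # Test 20: a corrupt or missing file triggers graceful re-discovery.
    try:
        return json.loads(path.read_text())["batch_size"]
    except (OSError, ValueError, KeyError):
        return None
```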
#### Failure Handling Tests

- **Test 21: Artificial memory constraint** (`test_failure_handling.sh`)
  - Tests with a very large model (depth=40)
  - Verifies fallback to defaults
  - Checks for warning messages
- **Test 22: Mid-training override warning**
  - Runs `mid_train.py` with a larger batch size than pretraining used
  - Verifies that the "FOOTGUN WARNING" appears
  - Ensures training continues despite the warning
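The graceful-degradation contract Test 21 checks can be summarized in a few lines: discovery failures never crash training, they log a warning and drop to a safe default. A sketch under assumed names (`DEFAULT_BATCH_SIZE=8` matches the fallback described in Test 16):

```python
import logging

DEFAULT_BATCH_SIZE = 8  # assumed fallback default, per Test 16 above

def safe_discover(discover_fn, logger=logging.getLogger(__name__)):
    try:
        return discover_fn()
    except RuntimeError as err:  # e.g. CUDA out of memory during trials
        logger.warning("auto-discovery failed (%s); falling back to "
                       "batch_size=%d", err, DEFAULT_BATCH_SIZE)
        return DEFAULT_BATCH_SIZE
```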
## Test Results

Results are saved to `tests/results/`:
```
tests/results/
├── test_single_gpu_discovery.log
├── test_manual_baseline.log
├── test_auto_discovery.log
├── throughput_comparison.json
├── stability_depth12.log
├── stability_depth20.log
├── cache_run1.log
├── cache_run2.log
└── ...
```
### Throughput Results Format

`tests/results/throughput_comparison.json`:

```json
{
  "timestamp": "2024-01-15T10:30:00Z",
  "depth": 12,
  "max_iterations": 100,
  "manual": {
    "batch_size": 8,
    "duration_seconds": 120,
    "throughput_iter_per_sec": 0.833
  },
  "auto": {
    "batch_size": 32,
    "duration_seconds": 60,
    "throughput_iter_per_sec": 1.667
  },
  "speedup_ratio": 2.0
}
```
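The `speedup_ratio` field is simply the ratio of the two throughputs (here 1.667 / 0.833 ≈ 2.0). A small consumer of this file, matching only the schema shown above:

```python
import json

with open("tests/results/throughput_comparison.json") as f:
    results = json.load(f)

speedup = (results["auto"]["throughput_iter_per_sec"]
           / results["manual"]["throughput_iter_per_sec"])
print(f"speedup: {speedup:.2f}x")
assert speedup >= 1.3, "below the 1.3x target (see Test 10)"
```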
## Requirements

### Unit Tests
- Python 3.8+
- PyTorch
- pytest
- No GPU required (runs on CPU)
### Integration Tests

- CUDA-capable GPU (≥ 24GB VRAM recommended)
- Multiple GPUs for DDP tests (optional)
- Environment variables:
  - `NANOCHAT_BASE_DIR`: base directory for checkpoints/cache (optional)
  - `RUN_LONG_TESTS=1`: enable the 1000-iteration stability tests (optional)
## CI/CD Integration

For automated testing in CI:

```bash
# Quick validation (unit tests + fast integration tests)
bash tests/run_unit_tests.sh
bash tests/run_integration_tests.sh  # ~15 minutes

# Full validation (includes long tests)
RUN_LONG_TESTS=1 bash tests/run_integration_tests.sh  # ~1 hour
```
### GitHub Actions Example

```yaml
name: Auto-Discovery Tests
on: [push, pull_request]

jobs:
  test:
    runs-on: [self-hosted, gpu]
    steps:
      - uses: actions/checkout@v4
      - name: Run unit tests
        run: bash tests/run_unit_tests.sh
      - name: Run integration tests
        run: bash tests/run_integration_tests.sh
      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: test-results
          path: tests/results/
```
## Troubleshooting

### Common Issues

- **"SKIP: Need at least 2 GPUs for DDP tests"**
  - Expected if you have only 1 GPU
  - DDP tests are skipped automatically
- **"Cache directory is empty or doesn't exist"**
  - The cache may be disabled, or the path may be wrong
  - Check the `NANOCHAT_BASE_DIR` environment variable
- **"Discovery takes longer than 30 seconds"**
  - May indicate a large model or a slow GPU
  - Increase the timeout in the test script if needed
- **"Speedup ratio below threshold"**
  - Discovery overhead may be high for short runs
  - Try longer runs (increase `MAX_ITERATIONS`)
### Debug Mode

Run tests with verbose output:

```bash
# Unit tests with full traceback
pytest tests/test_auto_batch_size.py -vv --tb=long

# Integration tests with `set -x` tracing
bash -x tests/integration/test_single_gpu_discovery.sh
```
## Success Criteria

### Unit Tests

- ✓ All 5 unit tests pass
- ✓ Tests complete in < 10 seconds
- ✓ Code coverage ≥ 80% for `nanochat/auto_batch_size.py`
### Integration Tests
- ✓ Single GPU discovery completes in < 30 seconds
- ✓ No OOM errors during 1000+ iteration stability tests
- ✓ Throughput improvement ≥ 1.3x compared to manual baseline
- ✓ DDP tests show identical batch size across all ranks
- ✓ Override tests correctly skip discovery or use manual values
- ✓ Cache tests show < 5 second cache hit time vs 15-30 second discovery
### Failure Handling
- ✓ Artificial memory constraints trigger fallback to defaults
- ✓ Warning messages appear in logs for fallback scenarios
- ✓ No crashes or exceptions, only graceful degradation
## Contributing

When adding new tests:

- Add unit tests to `tests/test_auto_batch_size.py`
- Add integration tests as new `.sh` scripts in `tests/integration/`
- Update `tests/run_integration_tests.sh` to include the new tests
- Update this README with test descriptions
- Ensure tests clean up after themselves (delete temp files, clear cache)
## License
Same as NanoChat project.