mirror of
https://github.com/karpathy/nanochat.git
synced 2025-12-16 01:02:18 +00:00
7.7 KiB
7.7 KiB
Auto-Discovery Test Plan
Test Coverage Matrix
| Test # | Name | Type | Duration | GPU Required | Status |
|---|---|---|---|---|---|
| 1 | Exponential Search Logic | Unit | < 1s | No | ✓ Implemented |
| 2 | Binary Search Refinement | Unit | < 1s | No | ✓ Implemented |
| 3 | Safety Margin Application | Unit | < 1s | No | ✓ Implemented |
| 4 | Cache Mechanism | Unit | < 1s | No | ✓ Implemented |
| 5 | DDP Broadcast Simulation | Unit | < 1s | No | ✓ Implemented |
| 6 | Basic Discovery Run | Integration | 30s | 1 GPU | ✓ Implemented |
| 7 | Manual vs Auto Comparison | Integration | 2-3 min | 1 GPU | ✓ Implemented |
| 8 | DDP Discovery (2 GPUs) | Integration | 1-2 min | 2 GPUs | ✓ Implemented |
| 9 | DDP Discovery (4 GPUs) | Integration | 1-2 min | 4 GPUs | ✓ Implemented |
| 10 | Throughput Comparison | Integration | 5-10 min | 1 GPU | ✓ Implemented |
| 11 | Stability (depth=12) | Integration | 10-15 min | 1 GPU | ✓ Implemented |
| 12 | Stability (depth=20) | Integration | 15-20 min | 1 GPU | ✓ Implemented |
| 13 | Stability (depth=26) | Integration | 20-25 min | 1 GPU | ✓ Implemented |
| 14 | Stability (depth=32) | Integration | 25-30 min | 1 GPU | ✓ Implemented |
| 15 | Manual Override | Integration | 1-2 min | 1 GPU | ✓ Implemented |
| 16 | Disable Auto-Discovery | Integration | 1-2 min | 1 GPU | ✓ Implemented |
| 17 | Custom Safety Margin | Integration | 2-3 min | 1 GPU | ✓ Implemented |
| 18 | Cache Hit | Integration | 2-3 min | 1 GPU | ✓ Implemented |
| 19 | Cache Key Validation | Integration | 3-4 min | 1 GPU | ✓ Implemented |
| 20 | Cache Invalidation | Integration | 2-3 min | 1 GPU | ✓ Implemented |
| 21 | Artificial Memory Constraint | Integration | 2-3 min | 1 GPU | ✓ Implemented |
| 22 | Mid-Training Override Warning | Integration | 2-3 min | 1 GPU | ✓ Implemented |
Test Execution Time Estimates
Fast Suite (Unit Tests Only)
- Duration: ~10 seconds
- GPU: Not required
- Command:
bash tests/run_unit_tests.sh
Standard Suite (Unit + Short Integration)
- Duration: ~15-30 minutes
- GPU: 1 GPU required
- Command:
bash tests/run_integration_tests.sh
Full Suite (Including Long Stability Tests)
- Duration: ~1-2 hours
- GPU: 1 GPU required
- Command:
RUN_LONG_TESTS=1 bash tests/run_integration_tests.sh
Multi-GPU Suite
- Duration: ~20-40 minutes
- GPU: 2-4 GPUs required
- Command:
bash tests/run_integration_tests.sh(auto-detects GPUs)
Success Criteria
Unit Tests
- All 5 unit tests pass
- Tests complete in < 10 seconds total
- Code coverage ≥ 80% for
nanochat/auto_batch_size.py
Integration Tests - Basic
- Single GPU discovery completes in < 30 seconds
- Auto-discovered batch size ≥ manual baseline (8)
- No OOM errors in any test
- All logs contain expected messages
Integration Tests - DDP
- Rank 0 performs discovery, other ranks receive broadcast
- All ranks use identical batch size
- No deadlocks or synchronization errors
- Tests complete successfully on 2 and 4 GPUs
Integration Tests - Performance
- Throughput improvement ≥ 1.3x compared to manual baseline
- Speedup ratio calculated and logged
- Results saved to JSON for analysis
Integration Tests - Stability
- All 1000 iterations complete without errors
- No OOM errors during long runs
- No memory leaks detected
- Larger models (depth=32) use smaller batch sizes than smaller models (depth=12)
Integration Tests - Overrides
- Manual
--device_batch_sizeskips auto-discovery - Custom safety margins produce expected batch sizes
- Disabled auto-discovery uses default values
Integration Tests - Cache
- Cache hit reduces startup time from 15-30s to < 5s
- Different configurations create different cache keys
- Corrupted cache handled gracefully (fallback to re-discovery)
- Cache files created in correct directory
Integration Tests - Failure Handling
- Artificial memory constraints trigger fallback
- Warning messages logged appropriately
- Mid-training override warning appears
- No crashes or exceptions, only graceful degradation
Known Limitations
- Cache Tests: Require write access to cache directory (usually
~/.nanochat/auto_batch_cache/) - DDP Tests: Automatically skipped if fewer than 2 GPUs available
- Long Tests: Disabled by default, require
RUN_LONG_TESTS=1environment variable - Memory Constraint Tests: Difficult to reliably simulate on all systems
- Mid-Training Tests: Require existing checkpoint from base_train
Test Maintenance
Adding New Tests
-
Unit Tests: Add to
tests/test_auto_batch_size.pydef test_new_feature(): # Test implementation assert result == expected -
Integration Tests: Create new script in
tests/integration/#!/bin/bash # tests/integration/test_new_feature.sh set -e # Test implementation -
Update
tests/run_integration_tests.shto include new test -
Update this test plan document
Debugging Failed Tests
- Check logs: All test output saved to
tests/results/*.log - Run individually: Execute specific test script in isolation
- Increase verbosity: Use
-xflag for bash scripts,-vvfor pytest - Check GPU state: Run
nvidia-smibefore and after tests - Clear cache: Remove
~/.nanochat/auto_batch_cache/if cache issues suspected
CI/CD Integration
Recommended CI Pipeline
stages:
- test-unit
- test-integration-fast
- test-integration-full
test-unit:
script:
- bash tests/run_unit_tests.sh
duration: 1 minute
test-integration-fast:
script:
- bash tests/run_integration_tests.sh
duration: 30 minutes
requires: [test-unit]
test-integration-full:
script:
- RUN_LONG_TESTS=1 bash tests/run_integration_tests.sh
duration: 2 hours
requires: [test-integration-fast]
when: manual # Only run on-demand
Pre-commit Hooks
#!/bin/bash
# .git/hooks/pre-commit
bash tests/run_unit_tests.sh
Test Data
Expected Batch Sizes (A100 80GB GPU)
- depth=12: ~64-96
- depth=20: ~32-48
- depth=26: ~16-32
- depth=32: ~8-16
Note: Actual values depend on GPU memory, safety margin, and max_seq_len.
Expected Speedups
- Baseline: device_batch_size=8
- Auto-discovered: device_batch_size=32-64
- Expected speedup: 1.5-3.0x (target: ≥1.3x after overhead)
Appendix: Test File Structure
tests/
├── README.md # User-facing documentation
├── TEST_PLAN.md # This file
├── test_auto_batch_size.py # Unit tests
├── run_unit_tests.sh # Unit test runner
├── run_integration_tests.sh # Integration test runner
├── make_executable.sh # Helper to chmod +x scripts
├── integration/ # Integration test scripts
│ ├── test_single_gpu_discovery.sh
│ ├── test_manual_vs_auto.sh
│ ├── test_ddp_discovery.sh
│ ├── test_throughput_comparison.sh
│ ├── test_stability_depth12.sh
│ ├── test_stability_depth20.sh
│ ├── test_stability_depth26.sh
│ ├── test_stability_depth32.sh
│ ├── test_overrides.sh
│ ├── test_cache_mechanism.sh
│ └── test_failure_handling.sh
└── results/ # Test output (gitignored)
├── .gitkeep
├── *.log
└── throughput_comparison.json
Version History
- v1.0 (2024-01): Initial test suite implementation
- 5 unit tests
- 17 integration tests (Tests 6-22)
- Unit and integration test runners
- Comprehensive documentation