# Auto-Discovery Testing Suite
Comprehensive tests for the automatic batch-size discovery functionality in NanoChat.
## Overview
This testing suite validates the auto-discovery system across different scenarios:
- **Unit Tests**: Isolated testing of core algorithms (exponential search, binary search, caching)
- **Integration Tests**: End-to-end testing with actual training scripts
- **Stability Tests**: Long-running tests to detect memory leaks and OOM issues
- **Performance Tests**: Throughput comparisons between manual and auto-discovered batch sizes
## Quick Start
### Run All Tests
```bash
# Run unit tests only (fast, ~10 seconds)
bash tests/run_unit_tests.sh
# Run integration tests (requires GPU, 10-30 minutes)
bash tests/run_integration_tests.sh
# Run integration tests including long stability tests (1+ hours)
RUN_LONG_TESTS=1 bash tests/run_integration_tests.sh
```
### Run Individual Tests
```bash
# Unit tests
pytest tests/test_auto_batch_size.py -v
# Specific integration test
bash tests/integration/test_single_gpu_discovery.sh
bash tests/integration/test_ddp_discovery.sh
bash tests/integration/test_throughput_comparison.sh
```
## Test Categories
### Unit Tests (`test_auto_batch_size.py`)
Tests the core discovery algorithms in isolation using mocks:
- **Test 1**: Exponential search finds upper bound (1, 2, 4, 8, 16, 32, 64)
- **Test 2**: Binary search refines to exact boundary
- **Test 3**: Safety margin application (0.85, 0.90, 0.95)
- **Test 4**: Cache hit/miss behavior
- **Test 5**: DDP broadcast simulation
**Run with:**
```bash
pytest tests/test_auto_batch_size.py -v --tb=short
```
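The search strategy exercised by Tests 1–3 can be sketched as follows. This is a minimal illustration, not nanochat's actual implementation: `fits_in_memory` stands in for a trial forward/backward pass at a candidate batch size, and all names are hypothetical.

```python
def discover_batch_size(fits_in_memory, margin=0.90, start=1, cap=1024):
    """Exponential search for an upper bound, binary search to the exact
    boundary, then a safety margin. `fits_in_memory(bs)` returns True if
    a trial step at batch size `bs` succeeds without OOM."""
    # Phase 1: exponential search (1, 2, 4, 8, ...) until the first failure
    bs = start
    while bs <= cap and fits_in_memory(bs):
        bs *= 2
    if bs == start:  # even the smallest batch size OOMs
        raise RuntimeError("no feasible batch size found")
    lo, hi = bs // 2, min(bs, cap)  # lo fits; hi fails (or is the cap)
    # Phase 2: binary search between the last success and the first failure
    while lo + 1 < hi:
        mid = (lo + hi) // 2
        if fits_in_memory(mid):
            lo = mid
        else:
            hi = mid
    # Phase 3: apply the safety margin so real training has headroom
    return max(1, int(lo * margin))

# Example: pretend anything up to 37 samples fits in memory
print(discover_batch_size(lambda bs: bs <= 37))  # -> 33 (int(37 * 0.90))
```

The margin in phase 3 is what Tests 3 and 17 vary: a larger margin (e.g. 0.95) keeps more of the discovered maximum and so yields a larger final batch size.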
### Integration Tests
#### Single GPU Tests
- **Test 6**: Basic discovery run (`test_single_gpu_discovery.sh`)
- Verifies discovery completes in < 30 seconds
- Checks for proper log messages
- Validates no OOM errors
- **Test 7**: Manual vs Auto comparison (`test_manual_vs_auto.sh`)
- Compares manual batch_size=8 with auto-discovery
  - Validates that the auto-discovered batch size is ≥ the manual one
- Ensures both runs complete successfully
#### Multi-GPU Tests
- **Test 8**: 2-GPU DDP discovery (`test_ddp_discovery.sh`)
- Verifies rank 0 performs discovery
- Checks broadcast to rank 1
- Validates synchronization
- **Test 9**: 4-GPU DDP discovery (if available)
- Same as Test 8 with 4 GPUs
- Skipped if fewer than 4 GPUs available
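The rank-0-discovers-then-broadcasts contract that Tests 5, 8, and 9 check can be simulated without GPUs. A toy sketch (real code would call `torch.distributed.broadcast` with `src=0`; the function name here is illustrative):

```python
def sync_batch_size(local_discoveries):
    """Given each rank's (potentially divergent) local estimate, enforce
    the DDP contract: rank 0's value wins, exactly as a broadcast from
    src=0 would make it."""
    src_value = local_discoveries[0]          # rank 0's discovery
    return [src_value for _ in local_discoveries]

# Ranks could disagree if each searched independently (memory
# fragmentation, other processes, ...) -- the broadcast prevents that:
print(sync_batch_size([32, 30, 32, 31]))  # -> [32, 32, 32, 32]
```

Having only rank 0 search also avoids paying the discovery cost once per GPU.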
#### Throughput Tests
- **Test 10**: Throughput comparison (`test_throughput_comparison.sh`)
- Measures iterations/second for manual vs auto
- Calculates speedup ratio
  - Target: ≥ 1.3x speedup (allows for discovery overhead)
- Saves results to `tests/results/throughput_comparison.json`
#### Stability Tests
Long-running tests (1000 iterations each):
- **Test 11**: Depth=12 (`test_stability_depth12.sh`)
- **Test 12**: Depth=20 (`test_stability_depth20.sh`)
- **Test 13**: Depth=26 (`test_stability_depth26.sh`)
- **Test 14**: Depth=32 (`test_stability_depth32.sh`)
- Verifies larger models use smaller batch sizes
- Monitors for memory leaks
- Ensures no OOM during long runs
**Run with:**
```bash
RUN_LONG_TESTS=1 bash tests/run_integration_tests.sh
```
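One way a stability test can flag a leak, sketched below as plain Python. This is an assumption about the approach, not the scripts' actual logic; a real test would feed in samples of `torch.cuda.memory_allocated()` taken every N iterations.

```python
def looks_like_leak(samples, tolerance=0.02):
    """Flag a leak if memory keeps growing after warmup: compare the mean
    of the last quarter of samples against the mean of the second quarter
    (the first quarter is treated as warmup/allocator churn)."""
    n = len(samples)
    early = samples[n // 4: n // 2]
    late = samples[3 * n // 4:]
    mean = lambda xs: sum(xs) / len(xs)
    return mean(late) > mean(early) * (1 + tolerance)

steady = [100 + (i % 3) for i in range(1000)]     # noise, no growth
leaking = [100 + 0.05 * i for i in range(1000)]   # slow upward drift
print(looks_like_leak(steady), looks_like_leak(leaking))  # -> False True
```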
#### Override Tests
- **Test 15**: Manual override (`test_overrides.sh`)
- Verifies `--device_batch_size=16` skips auto-discovery
- Checks for manual batch size usage message
- **Test 16**: Disable auto-discovery
- Tests with auto-discovery disabled
- Verifies fallback to default batch_size=8
- **Test 17**: Custom safety margin
- Tests `--batch_size_margin=0.85` vs `0.90`
- Verifies higher margin gives larger batch size
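The precedence that Tests 15–17 (together with the cache and failure-handling tests) pin down can be summarized in one resolution function. A hedged sketch, with illustrative names rather than nanochat's real API:

```python
def resolve_batch_size(cli_value=None, auto_enabled=True,
                       cache_lookup=lambda: None,
                       discover=lambda: None, default=8):
    """Resolution order the override tests exercise: an explicit
    --device_batch_size wins, then the cache, then discovery, then the
    default as a last resort."""
    if cli_value is not None:
        return cli_value        # Test 15: manual override skips discovery
    if not auto_enabled:
        return default          # Test 16: auto-discovery disabled
    cached = cache_lookup()
    if cached is not None:
        return cached           # Test 18: cache hit, no search needed
    discovered = discover()
    if discovered is not None:
        return discovered       # normal discovery path
    return default              # Test 21: discovery failed -> defaults

print(resolve_batch_size(cli_value=16))        # -> 16
print(resolve_batch_size(auto_enabled=False))  # -> 8
print(resolve_batch_size(discover=lambda: 32)) # -> 32
```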
#### Cache Tests
- **Test 18**: Cache hit (`test_cache_mechanism.sh`)
- First run: discovery + cache save
- Second run: cache hit (< 5 seconds)
- Verifies cache file creation
- **Test 19**: Cache key validation
  - Different depth → different cache key
  - Different max_seq_len → different cache key
- Verifies multiple cache files created
- **Test 20**: Cache invalidation
- Corrupts cache file
- Verifies graceful fallback to re-discovery
- Tests cache deletion and re-run
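The two cache properties that Tests 19 and 20 rely on, keying by configuration and graceful handling of corruption, can be sketched like this (assumed behavior, with hypothetical names; the real cache key may include more fields, such as the GPU model or world size):

```python
import hashlib
import json
import os
import tempfile

def cache_key(depth, max_seq_len, device_name):
    """Different depth or max_seq_len -> different key (Test 19)."""
    raw = f"{depth}:{max_seq_len}:{device_name}"
    return hashlib.sha256(raw.encode()).hexdigest()[:16]

def load_cached(path):
    """A corrupt or missing cache file yields None, i.e. fall back to
    re-discovery (Test 20), instead of crashing."""
    try:
        with open(path) as f:
            return json.load(f)["batch_size"]
    except (OSError, json.JSONDecodeError, KeyError):
        return None

# Demo: distinct keys per configuration, graceful handling of corruption
assert cache_key(12, 2048, "A100") != cache_key(20, 2048, "A100")
path = os.path.join(tempfile.mkdtemp(), "bs.json")
with open(path, "w") as f:
    f.write("{not valid json")
print(load_cached(path))  # -> None (falls back to re-discovery)
```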
#### Failure Handling Tests
- **Test 21**: Artificial memory constraint (`test_failure_handling.sh`)
- Tests with very large model (depth=40)
- Verifies fallback to defaults
- Checks for warning messages
- **Test 22**: Mid-training override warning
- Tests mid_train.py with larger batch size than pretrain
- Verifies "FOOTGUN WARNING" appears
- Ensures training continues despite warning
## Test Results
Results are saved to `tests/results/`:
```
tests/results/
├── test_single_gpu_discovery.log
├── test_manual_baseline.log
├── test_auto_discovery.log
├── throughput_comparison.json
├── stability_depth12.log
├── stability_depth20.log
├── cache_run1.log
├── cache_run2.log
└── ...
```
### Throughput Results Format
`tests/results/throughput_comparison.json`:
```json
{
  "timestamp": "2024-01-15T10:30:00Z",
  "depth": 12,
  "max_iterations": 100,
  "manual": {
    "batch_size": 8,
    "duration_seconds": 120,
    "throughput_iter_per_sec": 0.833
  },
  "auto": {
    "batch_size": 32,
    "duration_seconds": 60,
    "throughput_iter_per_sec": 1.667
  },
  "speedup_ratio": 2.0
}
```
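Records in this format can be checked against the ≥ 1.3x target with a few lines of Python (a sketch of what the comparison script presumably computes; only the fields shown above are assumed):

```python
import json

record = json.loads("""{
  "manual": {"batch_size": 8,  "duration_seconds": 120,
             "throughput_iter_per_sec": 0.833},
  "auto":   {"batch_size": 32, "duration_seconds": 60,
             "throughput_iter_per_sec": 1.667}
}""")

# speedup_ratio = auto throughput / manual throughput
speedup = (record["auto"]["throughput_iter_per_sec"]
           / record["manual"]["throughput_iter_per_sec"])
print(round(speedup, 2))  # -> 2.0
print(speedup >= 1.3)     # meets the >= 1.3x target -> True
```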
## Requirements
### Unit Tests
- Python 3.8+
- PyTorch
- pytest
- No GPU required (runs on CPU)
### Integration Tests
- CUDA-capable GPU (≥ 24GB VRAM recommended)
- Multiple GPUs for DDP tests (optional)
- Environment variables:
- `NANOCHAT_BASE_DIR`: Base directory for checkpoints/cache (optional)
- `RUN_LONG_TESTS=1`: Enable 1000-iteration stability tests (optional)
## CI/CD Integration
For automated testing in CI:
```bash
# Quick validation (unit tests + fast integration tests)
bash tests/run_unit_tests.sh
bash tests/run_integration_tests.sh # ~15 minutes
# Full validation (includes long tests)
RUN_LONG_TESTS=1 bash tests/run_integration_tests.sh # ~1 hour
```
### GitHub Actions Example
```yaml
name: Auto-Discovery Tests
on: [push, pull_request]
jobs:
  test:
    runs-on: [self-hosted, gpu]
    steps:
      - uses: actions/checkout@v2
      - name: Run unit tests
        run: bash tests/run_unit_tests.sh
      - name: Run integration tests
        run: bash tests/run_integration_tests.sh
      - name: Upload results
        uses: actions/upload-artifact@v2
        with:
          name: test-results
          path: tests/results/
```
## Troubleshooting
### Common Issues
1. **"SKIP: Need at least 2 GPUs for DDP tests"**
- Expected if you have only 1 GPU
- DDP tests will be skipped automatically
2. **"Cache directory is empty or doesn't exist"**
- Cache may be disabled or path issue
- Check `NANOCHAT_BASE_DIR` environment variable
3. **"Discovery takes longer than 30 seconds"**
- May indicate large model or slow GPU
- Increase timeout in test script if needed
4. **"Speedup ratio below threshold"**
- Discovery overhead may be high for short runs
- Try longer runs (increase `MAX_ITERATIONS`)
### Debug Mode
Run tests with verbose output:
```bash
# Unit tests with full traceback
pytest tests/test_auto_batch_size.py -vv --tb=long
# Integration tests with set -x
bash -x tests/integration/test_single_gpu_discovery.sh
```
## Success Criteria
### Unit Tests
- All 5 unit tests pass
- Tests complete in < 10 seconds
- Code coverage ≥ 80% for `nanochat/auto_batch_size.py`
### Integration Tests
- Single GPU discovery completes in < 30 seconds
- No OOM errors during 1000+ iteration stability tests
- Throughput improvement ≥ 1.3x compared to manual baseline
- DDP tests show identical batch size across all ranks
- Override tests correctly skip discovery or use manual values
- Cache tests show < 5 second cache hit time vs 15-30 second discovery
### Failure Handling
- Artificial memory constraints trigger fallback to defaults
- Warning messages appear in logs for fallback scenarios
- No crashes or exceptions, only graceful degradation
## Contributing
When adding new tests:
1. Add unit tests to `tests/test_auto_batch_size.py`
2. Add integration tests as new `.sh` scripts in `tests/integration/`
3. Update `tests/run_integration_tests.sh` to include new tests
4. Update this README with test descriptions
5. Ensure tests clean up after themselves (delete temp files, clear cache)
## License
Same as NanoChat project.