# Auto-Discovery Testing Suite
Comprehensive tests for the auto-discovery batch size functionality in NanoChat.
## Overview
This testing suite validates the auto-discovery system across different scenarios:
- **Unit Tests**: Isolated testing of the core algorithms (exponential search, binary search, caching)
- **Integration Tests**: End-to-end testing with the actual training scripts
- **Stability Tests**: Long-running tests to detect memory leaks and OOM issues
- **Performance Tests**: Throughput comparisons between manual and auto-discovered batch sizes
## Quick Start

### Run All Tests

```bash
# Run unit tests only (fast, ~10 seconds)
bash tests/run_unit_tests.sh

# Run integration tests (requires GPU, 10-30 minutes)
bash tests/run_integration_tests.sh

# Run integration tests including long stability tests (1+ hours)
RUN_LONG_TESTS=1 bash tests/run_integration_tests.sh
```
### Run Individual Tests

```bash
# Unit tests
pytest tests/test_auto_batch_size.py -v

# Specific integration tests
bash tests/integration/test_single_gpu_discovery.sh
bash tests/integration/test_ddp_discovery.sh
bash tests/integration/test_throughput_comparison.sh
```
## Test Categories

### Unit Tests (`test_auto_batch_size.py`)

Tests the core discovery algorithms in isolation using mocks:

- **Test 1**: Exponential search finds the upper bound (1, 2, 4, 8, 16, 32, 64)
- **Test 2**: Binary search refines to the exact boundary
- **Test 3**: Safety margin application (0.85, 0.90, 0.95)
- **Test 4**: Cache hit/miss behavior
- **Test 5**: DDP broadcast simulation
Run with:

```bash
pytest tests/test_auto_batch_size.py -v --tb=short
```
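To illustrate what Tests 1-3 exercise, here is a minimal sketch of an exponential-then-binary discovery loop with a safety margin. It is not the actual `nanochat/auto_batch_size.py` implementation; `trial_step` is an assumed callable that runs one forward/backward pass at a given batch size and returns `False` on OOM.

```python
# Hypothetical sketch of the discovery loop the unit tests describe.
def discover_batch_size(trial_step, margin=0.90, start=1):
    # Exponential phase (Test 1): double until the first failure (1, 2, 4, 8, ...).
    bs = start
    while trial_step(bs):
        bs *= 2
    lo, hi = bs // 2, bs  # largest known-good, smallest known-bad
    # Binary phase (Test 2): refine to the exact boundary between lo and hi.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if trial_step(mid):
            lo = mid
        else:
            hi = mid
    # Safety margin (Test 3): leave headroom below the hard OOM boundary.
    return max(1, int(lo * margin))
```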
### Integration Tests

#### Single GPU Tests

- **Test 6: Basic discovery run** (`test_single_gpu_discovery.sh`)
  - Verifies discovery completes in < 30 seconds
  - Checks for the expected log messages
  - Validates no OOM errors
- **Test 7: Manual vs. auto comparison** (`test_manual_vs_auto.sh`)
  - Compares manual `batch_size=8` with auto-discovery
  - Validates auto batch size ≥ manual
  - Ensures both runs complete successfully
#### Multi-GPU Tests

- **Test 8: 2-GPU DDP discovery** (`test_ddp_discovery.sh`)
  - Verifies rank 0 performs discovery
  - Checks the broadcast to rank 1 (see the sketch after this list)
  - Validates synchronization
- **Test 9: 4-GPU DDP discovery** (if available)
  - Same as Test 8, but with 4 GPUs
  - Skipped if fewer than 4 GPUs are available
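The rank-0-discovers-then-broadcasts pattern these tests check might look like the following sketch. This is an illustration under assumed names (`discover_fn` is hypothetical), not the actual NanoChat code; it relies only on the standard `torch.distributed.broadcast` API.

```python
import torch
import torch.distributed as dist

def synced_batch_size(discover_fn, device):
    # Assumes dist.init_process_group(...) has already been called.
    # Only rank 0 pays the discovery cost; the other ranks wait.
    if dist.get_rank() == 0:
        bs = torch.tensor([discover_fn()], dtype=torch.int64, device=device)
    else:
        bs = torch.zeros(1, dtype=torch.int64, device=device)
    dist.broadcast(bs, src=0)  # every rank now holds rank 0's result
    return int(bs.item())
```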
#### Throughput Tests

- **Test 10: Throughput comparison** (`test_throughput_comparison.sh`)
  - Measures iterations/second for manual vs. auto
  - Calculates the speedup ratio
  - Target: ≥ 1.3x speedup (allows for discovery overhead)
  - Saves results to `tests/results/throughput_comparison.json`
#### Stability Tests

Long-running tests (1000 iterations each):

- **Test 11: Depth=12** (`test_stability_depth12.sh`)
- **Test 12: Depth=20** (`test_stability_depth20.sh`)
- **Test 13: Depth=26** (`test_stability_depth26.sh`)
- **Test 14: Depth=32** (`test_stability_depth32.sh`)

Across these runs, the suite:

- Verifies that larger models use smaller batch sizes
- Monitors for memory leaks
- Ensures no OOM occurs during the long runs
Run with:

```bash
RUN_LONG_TESTS=1 bash tests/run_integration_tests.sh
```
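A leak check of the kind these stability tests perform can be as simple as tracking CUDA's peak-memory high-water mark: if it keeps climbing after warmup, something is leaking. A minimal sketch, assuming a CUDA device is available (illustrative, not the suite's actual monitoring code):

```python
import torch

def log_peak_memory(iteration, log_every=100):
    # A steadily rising peak across windows after warmup suggests a leak.
    if iteration % log_every == 0:
        peak_mib = torch.cuda.max_memory_allocated() / 2**20
        print(f"iter {iteration}: peak CUDA memory {peak_mib:.0f} MiB")
        torch.cuda.reset_peak_memory_stats()
```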
#### Override Tests

- **Test 15: Manual override** (`test_overrides.sh`)
  - Verifies that `--device_batch_size=16` skips auto-discovery
  - Checks for the manual batch size usage message
- **Test 16: Disable auto-discovery**
  - Tests with auto-discovery disabled
  - Verifies fallback to the default `batch_size=8`
- **Test 17: Custom safety margin**
  - Tests `--batch_size_margin=0.85` vs. `0.90`
  - Verifies that the higher margin gives a larger batch size
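Together with the cache tests below, these override tests imply a resolution order: an explicit manual flag wins, then a cache hit, then fresh discovery. A minimal sketch of that precedence, with assumed (hypothetical) parameter names:

```python
def resolve_batch_size(manual, cached, discover_fn):
    if manual is not None:
        return manual        # Test 15: --device_batch_size skips discovery
    if cached is not None:
        return cached        # Test 18: a cache hit avoids re-discovery
    return discover_fn()     # otherwise, run the full discovery loop
```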
#### Cache Tests

- **Test 18: Cache hit** (`test_cache_mechanism.sh`)
  - First run: discovery + cache save
  - Second run: cache hit (< 5 seconds)
  - Verifies cache file creation
- **Test 19: Cache key validation**
  - Different depth → different cache key
  - Different max_seq_len → different cache key
  - Verifies multiple cache files are created
- **Test 20: Cache invalidation**
  - Corrupts the cache file
  - Verifies graceful fallback to re-discovery
  - Tests cache deletion and re-run
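The caching behavior these tests pin down might be implemented along these lines. The key fields and file layout here are assumptions for illustration, not the actual NanoChat scheme; the important properties are exactly the ones tested above: distinct configs yield distinct keys, and unreadable files fall back to re-discovery.

```python
import hashlib
import json
from pathlib import Path

def cache_path(base_dir, depth, max_seq_len, world_size):
    # Test 19: different depth or max_seq_len => different cache file.
    key = f"d{depth}-seq{max_seq_len}-ws{world_size}"
    digest = hashlib.sha256(key.encode()).hexdigest()[:16]
    return Path(base_dir) / "batch_size_cache" / f"{digest}.json"

def load_cached_batch_size(path):
    # Test 20: a corrupt or missing file triggers graceful re-discovery.
    try:
        return json.loads(path.read_text())["batch_size"]
    except (OSError, ValueError, KeyError):
        return None
```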
#### Failure Handling Tests

- **Test 21: Artificial memory constraint** (`test_failure_handling.sh`)
  - Tests with a very large model (depth=40)
  - Verifies fallback to defaults
  - Checks for warning messages
- **Test 22: Mid-training override warning**
  - Runs `mid_train.py` with a larger batch size than pretraining used
  - Verifies that the "FOOTGUN WARNING" appears
  - Ensures training continues despite the warning
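The graceful-degradation contract Test 21 checks can be summarized in a few lines: discovery failures never crash training, they log a warning and drop to a safe default. A sketch under assumed names (`DEFAULT_BATCH_SIZE=8` matches the fallback described in Test 16):

```python
import logging

DEFAULT_BATCH_SIZE = 8  # assumed fallback default, per Test 16 above

def safe_discover(discover_fn, logger=logging.getLogger(__name__)):
    try:
        return discover_fn()
    except RuntimeError as err:  # e.g. CUDA out of memory during trials
        logger.warning("auto-discovery failed (%s); falling back to "
                       "batch_size=%d", err, DEFAULT_BATCH_SIZE)
        return DEFAULT_BATCH_SIZE
```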
## Test Results

Results are saved to `tests/results/`:
```
tests/results/
├── test_single_gpu_discovery.log
├── test_manual_baseline.log
├── test_auto_discovery.log
├── throughput_comparison.json
├── stability_depth12.log
├── stability_depth20.log
├── cache_run1.log
├── cache_run2.log
└── ...
```
### Throughput Results Format

`tests/results/throughput_comparison.json`:

```json
{
  "timestamp": "2024-01-15T10:30:00Z",
  "depth": 12,
  "max_iterations": 100,
  "manual": {
    "batch_size": 8,
    "duration_seconds": 120,
    "throughput_iter_per_sec": 0.833
  },
  "auto": {
    "batch_size": 32,
    "duration_seconds": 60,
    "throughput_iter_per_sec": 1.667
  },
  "speedup_ratio": 2.0
}
```
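The `speedup_ratio` field is simply the ratio of the two throughputs (here 1.667 / 0.833 ≈ 2.0). A small consumer of this file, matching only the schema shown above:

```python
import json

with open("tests/results/throughput_comparison.json") as f:
    results = json.load(f)

speedup = (results["auto"]["throughput_iter_per_sec"]
           / results["manual"]["throughput_iter_per_sec"])
print(f"speedup: {speedup:.2f}x")
assert speedup >= 1.3, "below the 1.3x target (see Test 10)"
```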
## Requirements

### Unit Tests
- Python 3.8+
- PyTorch
- pytest
- No GPU required (runs on CPU)
### Integration Tests

- CUDA-capable GPU (≥ 24GB VRAM recommended)
- Multiple GPUs for DDP tests (optional)
- Environment variables:
  - `NANOCHAT_BASE_DIR`: base directory for checkpoints/cache (optional)
  - `RUN_LONG_TESTS=1`: enable the 1000-iteration stability tests (optional)
## CI/CD Integration

For automated testing in CI:

```bash
# Quick validation (unit tests + fast integration tests)
bash tests/run_unit_tests.sh
bash tests/run_integration_tests.sh  # ~15 minutes

# Full validation (includes long tests)
RUN_LONG_TESTS=1 bash tests/run_integration_tests.sh  # ~1 hour
```
### GitHub Actions Example

```yaml
name: Auto-Discovery Tests
on: [push, pull_request]

jobs:
  test:
    runs-on: [self-hosted, gpu]
    steps:
      - uses: actions/checkout@v4
      - name: Run unit tests
        run: bash tests/run_unit_tests.sh
      - name: Run integration tests
        run: bash tests/run_integration_tests.sh
      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: test-results
          path: tests/results/
```
## Troubleshooting

### Common Issues

- **"SKIP: Need at least 2 GPUs for DDP tests"**
  - Expected if you have only 1 GPU
  - DDP tests are skipped automatically
- **"Cache directory is empty or doesn't exist"**
  - The cache may be disabled, or the path may be wrong
  - Check the `NANOCHAT_BASE_DIR` environment variable
- **"Discovery takes longer than 30 seconds"**
  - May indicate a large model or a slow GPU
  - Increase the timeout in the test script if needed
- **"Speedup ratio below threshold"**
  - Discovery overhead may be high for short runs
  - Try longer runs (increase `MAX_ITERATIONS`)
### Debug Mode

Run tests with verbose output:

```bash
# Unit tests with full traceback
pytest tests/test_auto_batch_size.py -vv --tb=long

# Integration tests with `set -x` tracing
bash -x tests/integration/test_single_gpu_discovery.sh
```
## Success Criteria

### Unit Tests

- ✓ All 5 unit tests pass
- ✓ Tests complete in < 10 seconds
- ✓ Code coverage ≥ 80% for `nanochat/auto_batch_size.py`
### Integration Tests
- ✓ Single GPU discovery completes in < 30 seconds
- ✓ No OOM errors during 1000+ iteration stability tests
- ✓ Throughput improvement ≥ 1.3x compared to manual baseline
- ✓ DDP tests show identical batch size across all ranks
- ✓ Override tests correctly skip discovery or use manual values
- ✓ Cache tests show < 5 second cache hit time vs 15-30 second discovery
### Failure Handling
- ✓ Artificial memory constraints trigger fallback to defaults
- ✓ Warning messages appear in logs for fallback scenarios
- ✓ No crashes or exceptions, only graceful degradation
## Contributing

When adding new tests:

- Add unit tests to `tests/test_auto_batch_size.py`
- Add integration tests as new `.sh` scripts in `tests/integration/`
- Update `tests/run_integration_tests.sh` to include the new tests
- Update this README with test descriptions
- Ensure tests clean up after themselves (delete temp files, clear cache)
## License
Same as NanoChat project.