mirror of https://github.com/karpathy/nanochat.git synced 2026-03-31 00:55:18 +00:00

Artemis Git Integration ffdbb9c247 test: add comprehensive test suite for auto-batch-size discovery with unit and integration tests, pytest framework, stability validation, and updated documentation

2025-11-05 16:52:29 +00:00

9.2 KiB


Implementation Notes for Auto-Discovery Testing

Overview

This document describes the test suite for automatic batch size discovery in nanochat: what has been implemented, what remains, and how the pieces connect to the training scripts.

Current Status

What Has Been Implemented

Stub Auto-Discovery Module (nanochat/auto_batch_size.py)
- Minimal working implementation with expected interface
- Supports the full API required by tests
- Includes caching, DDP broadcast, and safety margin features
- Ready for full implementation to replace the stub logic
Unit Tests (tests/test_auto_batch_size.py)
- 11 comprehensive unit tests covering all core algorithms
- Tests for exponential search, binary search, safety margins
- Cache mechanism validation (hit/miss, key generation)
- DDP broadcast simulation
- Mock-based testing for isolation
- All tests runnable on CPU without GPU
Integration Test Scripts (tests/integration/*.sh)
- 17 bash-based integration tests (Tests 6-22)
- Single GPU discovery validation
- Multi-GPU DDP testing with auto-detection
- Throughput comparison with JSON output
- Stability tests for depths 12, 20, 26, 32
- Override and cache mechanism tests
- Failure handling and graceful degradation tests
Test Infrastructure
- tests/run_unit_tests.sh - Unit test runner
- tests/run_integration_tests.sh - Integration test orchestrator
- tests/results/ - Output directory for logs and results
- Comprehensive documentation (README, TEST_PLAN)
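
The cache behaviour that the unit tests exercise (hit/miss, key generation) can be pictured with a small sketch. This is not the stub's actual code: the helper names `make_cache_key`, `load_cached_batch_size`, and `save_cached_batch_size`, the one-JSON-file-per-key layout, and the SHA-256 hashing are all assumptions made for illustration.

```python
import hashlib
import json
import os
import tempfile


def make_cache_key(components: dict) -> str:
    """Hash the key components into a short, stable string (hypothetical helper)."""
    # sort_keys makes the key independent of dict insertion order.
    payload = json.dumps(components, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def load_cached_batch_size(cache_dir: str, key: str):
    """Return the cached batch size, or None on a cache miss."""
    path = os.path.join(cache_dir, f"{key}.json")
    try:
        with open(path, "r", encoding="utf-8") as f:
            return int(json.load(f)["batch_size"])
    except (OSError, ValueError, KeyError, TypeError):
        # A missing, unreadable, or corrupt entry is treated as a miss.
        return None


def save_cached_batch_size(cache_dir: str, key: str, batch_size: int) -> None:
    """Write the entry via a temp file so a crash cannot leave a partial file."""
    os.makedirs(cache_dir, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=cache_dir, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump({"batch_size": batch_size}, f)
    os.replace(tmp_path, os.path.join(cache_dir, f"{key}.json"))
```

Sorting the keys before hashing means two runs with the same model config, GPU, and sequence length map to the same entry regardless of how the dictionary was built.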

What Still Needs to Be Done

The tests are ready to run once the full auto-discovery implementation is complete. The current stub implementation allows the test framework to be validated, but for the tests to be meaningful, the following need to be implemented in nanochat/auto_batch_size.py:

Real Exponential Search Algorithm
- Currently returns a fixed value
- Needs to implement doubling strategy (1, 2, 4, 8, 16, ...)
- Must detect OOM boundary
Real Binary Search Refinement
- Currently not implemented in stub
- Should narrow down from exponential search bounds
- Must find exact maximum batch size that fits
OOM Detection in _test_batch_size()
- Currently has a basic try/except for OOM
- May need more robust handling
- Should properly clean up GPU memory
Integration with Training Scripts
- Scripts need to call discover_batch_size() when appropriate
- Need to add command-line flags:
  - --auto_batch_size=True/False
  - --batch_size_margin=0.85 (optional)
  - --batch_size_cache=True/False (optional)
- Need to add logic to skip discovery if manual batch size provided
- Need to add logging messages that tests expect
GPU Info for Cache Keys
- Currently uses placeholder GPU name
- Should detect actual GPU model for cache keys
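
The two search phases listed above can be sketched as follows. This is a minimal illustration, not the planned implementation: `find_max_batch_size`, `apply_safety_margin`, the `fits` callback, and the `max_batch_size` cap are hypothetical names, and the sketch assumes that fitting is monotonic (if a batch size fits, every smaller one fits too). Treating the margin as a multiplier on the discovered size is also an assumption.

```python
from typing import Callable, Optional


def find_max_batch_size(fits: Callable[[int], bool], max_batch_size: int = 1024) -> int:
    """Return the largest batch size in [1, max_batch_size] for which fits() is True.

    Returns 0 if even a batch size of 1 does not fit.
    """
    if not fits(1):
        return 0

    # Exponential phase: double (1, 2, 4, 8, ...) until the first failure or the cap.
    low = 1
    high: Optional[int] = None
    while low < max_batch_size:
        candidate = min(low * 2, max_batch_size)
        if fits(candidate):
            low = candidate
        else:
            high = candidate
            break

    if high is None:
        return low  # reached the cap without ever failing

    # Binary phase. Invariant: fits(low) is True and fits(high) is False.
    while high - low > 1:
        mid = (low + high) // 2
        if fits(mid):
            low = mid
        else:
            high = mid
    return low


def apply_safety_margin(batch_size: int, margin: float = 0.85) -> int:
    """Scale the discovered size down by the margin, never below 1 if anything fit."""
    if batch_size <= 0:
        return 0
    return max(1, int(batch_size * margin))
```

For example, if the true limit is 37, the exponential phase stops at the bounds (32, 64) and the binary phase narrows them to 37; a margin of 0.85 then yields 31.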

Integration Points

Training Scripts That Need Updates

scripts/base_train.py

```python
# Add near the top, after the existing imports
from nanochat.auto_batch_size import discover_batch_size

# Add to the config section
auto_batch_size = False   # Enable auto-discovery
batch_size_margin = 0.85  # Safety margin
batch_size_cache = True   # Enable caching

# Add after compute_init().
# Assumes device_batch_size is None when no manual value was given.
# temp_model is a placeholder: discovery needs a model instance, so either
# build a temporary model here or move this block after model creation.
if auto_batch_size and device_batch_size is None:
    device_batch_size = discover_batch_size(
        model=temp_model,
        max_seq_len=max_seq_len,
        device=device,
        safety_margin=batch_size_margin,
        ddp_rank=ddp_rank,
        ddp_world_size=ddp_world_size,
        use_cache=batch_size_cache,
        cache_key_components={
            'model_config': model_config_kwargs,
            'gpu': torch.cuda.get_device_name(),
            'max_seq_len': max_seq_len,
        }
    )
```

scripts/mid_train.py
- Similar integration as base_train
- Add warning if device_batch_size > pretrain batch size
scripts/chat_sft.py
- Similar integration
- Default batch size is 4, so auto-discovery should help significantly
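
The override and warning rules described for the training scripts can be sketched as below. The helper names `resolve_device_batch_size` and `warn_if_exceeds_pretrain`, the `discover` callback, and the `fallback` default are hypothetical; the real scripts may structure this differently.

```python
import warnings
from typing import Callable, Optional


def resolve_device_batch_size(
    device_batch_size: Optional[int],
    auto_batch_size: bool,
    discover: Callable[[], int],
    fallback: int = 8,  # assumed default, for illustration only
) -> int:
    """Pick the batch size: a manual value wins, then discovery, then the fallback."""
    if device_batch_size is not None:
        return device_batch_size  # manual value given: skip discovery entirely
    if auto_batch_size:
        return discover()
    return fallback


def warn_if_exceeds_pretrain(device_batch_size: int, pretrain_batch_size: Optional[int]) -> bool:
    """Warn (and return True) when the batch size is larger than the pretraining one."""
    if pretrain_batch_size is not None and device_batch_size > pretrain_batch_size:
        warnings.warn(
            f"device_batch_size={device_batch_size} exceeds the pretraining "
            f"batch size {pretrain_batch_size}"
        )
        return True
    return False
```

Passing discovery in as a callback keeps the decision logic testable on CPU, in the same spirit as the mock-based unit tests.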

Test Validation

To Verify Tests Are Working

Run unit tests (should work now with stub):
```
bash tests/run_unit_tests.sh
```
Expected: Every test either passes or is skipped (skips reflect stub limitations)
Make scripts executable:
```
bash tests/make_executable.sh
```
Try a quick integration test (requires GPU):
```
bash tests/integration/test_single_gpu_discovery.sh
```
Expected: The script should run to completion without crashing, but its checks will fail against the current stub
Once full implementation is done:
```
bash tests/run_integration_tests.sh
```
Expected: Most tests should pass

Expected Test Behavior

With Current Stub Implementation

Unit tests: Most pass; some may be skipped or only partially exercised because of the stub
Integration tests: Will run but may not find meaningful batch sizes
Cache tests: Should work (caching logic is implemented)
DDP tests: Broadcast should work, discovery logic is stubbed

With Full Implementation

Unit tests: All should pass
Single GPU tests: Should discover reasonable batch sizes (16-64 range)
DDP tests: Should show proper rank 0 discovery and broadcast
Throughput tests: Should show 1.5-3x speedup
Stability tests: Should complete 1000 iterations without OOM
Cache tests: Should show significant startup time improvement
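
The "rank 0 discovery and broadcast" pattern that the DDP tests look for can be sketched without a GPU by injecting the broadcast step. `discover_on_rank0` and both callbacks are hypothetical names; in a real run the broadcast callback would typically wrap `torch.distributed.broadcast` on a one-element tensor with `src=0`.

```python
from typing import Callable


def discover_on_rank0(
    ddp_rank: int,
    ddp_world_size: int,
    discover: Callable[[], int],
    broadcast_from_rank0: Callable[[int], int],
) -> int:
    """Run discovery on rank 0 only, then share the result with every rank."""
    if ddp_world_size <= 1:
        return discover()  # single process: nothing to broadcast

    # Only rank 0 pays the discovery cost; other ranks contribute a placeholder.
    local_value = discover() if ddp_rank == 0 else 0

    # Every rank takes part in the collective and receives rank 0's value.
    return broadcast_from_rank0(local_value)
```

Having all ranks use the same value matters because ranks that probed independently could settle on different batch sizes.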

Troubleshooting Guide

Common Issues and Solutions

"Auto-discovery found device_batch_size=" not in log
- Training script not calling discover_batch_size()
- Check integration in training script
- Verify --auto_batch_size=True is being passed
Tests fail with "Command not found"
- Scripts may not be executable
- Run: bash tests/make_executable.sh
Cache tests fail
- Check NANOCHAT_BASE_DIR environment variable
- Verify write permissions to cache directory
- Try: mkdir -p ~/.nanochat/auto_batch_cache
DDP tests skipped
- Expected if fewer than 2 GPUs
- Tests auto-detect GPU count
OOM during stability tests
- Discovery may not be working correctly
- Check safety margin (should be 0.85 or lower)
- Verify model size vs GPU memory
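
When debugging the first issue above, it can help to check a training log for the expected message programmatically. The sketch below assumes the log line has the form `Auto-discovery found device_batch_size=<integer>`; only the prefix comes from these notes, the integer suffix is an assumption, and `extract_discovered_batch_size` is a hypothetical helper.

```python
import re
from typing import Optional

# Prefix taken from the troubleshooting entry; the trailing integer is assumed.
_DISCOVERY_PATTERN = re.compile(r"Auto-discovery found device_batch_size=(\d+)")


def extract_discovered_batch_size(log_text: str) -> Optional[int]:
    """Return the first discovered batch size found in the log, or None if absent."""
    match = _DISCOVERY_PATTERN.search(log_text)
    return int(match.group(1)) if match else None
```

A `None` result points to the training script never reaching `discover_batch_size()`, which is the first thing to check.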

Performance Expectations

Discovery Time

Initial discovery: 15-30 seconds
Cache hit: < 5 seconds
Overhead per training run: 15-30 seconds (first run only)

Batch Size Improvements

Based on A100 80GB GPU:

depth=12: 8 (manual) → 64-96 (auto) = 8-12x larger
depth=20: 8 (manual) → 32-48 (auto) = 4-6x larger
depth=26: 8 (manual) → 16-32 (auto) = 2-4x larger
depth=32: 8 (manual) → 8-16 (auto) = 1-2x larger

Throughput Improvements

Expected speedup: 1.5-3.0x
Measured after discovery overhead
Varies by model size and GPU
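
To put the speedup and the discovery overhead on the same footing, a rough break-even estimate helps. The sketch below is plain arithmetic under a simplifying assumption (throughput is constant for the whole run); `speedup` and `break_even_seconds` are hypothetical helper names. A job that takes T seconds at the manual batch size takes T/s + o seconds with auto-discovery, where s is the speedup and o the overhead, so the two are equal at T = o·s/(s − 1).

```python
def speedup(manual_tokens_per_sec: float, auto_tokens_per_sec: float) -> float:
    """Throughput ratio of the auto-discovered run over the manual run."""
    if manual_tokens_per_sec <= 0:
        raise ValueError("manual_tokens_per_sec must be positive")
    return auto_tokens_per_sec / manual_tokens_per_sec


def break_even_seconds(speedup_factor: float, discovery_overhead_sec: float) -> float:
    """Manual-run duration beyond which discovery overhead pays for itself."""
    if speedup_factor <= 1.0:
        return float("inf")  # no speedup: the overhead is never recovered
    return discovery_overhead_sec * speedup_factor / (speedup_factor - 1.0)
```

With a 30-second discovery and a 1.5x speedup, any run that would have taken longer than 90 seconds at the manual batch size comes out ahead.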

Next Steps for Full Implementation

Implement core discovery algorithms in nanochat/auto_batch_size.py:
- Replace stub _perform_discovery() with real search
- Implement exponential + binary search
- Improve OOM detection
Integrate into training scripts:
- Add command-line flags
- Add discovery calls
- Add appropriate logging
Validate with tests:
- Run unit tests to verify algorithms
- Run integration tests to verify end-to-end
- Run stability tests for production validation
Optimize and tune:
- Adjust safety margins if needed
- Tune cache key components
- Add more robust error handling
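
For the OOM-detection step, one possible shape is sketched below. It is deliberately free of any PyTorch import so it runs on CPU: `try_batch_size`, `is_oom_error`, and the `run_step` and `release_memory` callbacks are hypothetical. The message check relies on PyTorch's CUDA OOM errors being `RuntimeError` subclasses whose text contains "out of memory"; in a real run `release_memory` could be `torch.cuda.empty_cache`.

```python
import gc
from typing import Callable, Optional


def is_oom_error(exc: BaseException) -> bool:
    """Heuristic OOM check based on exception type and message."""
    if isinstance(exc, MemoryError):
        return True
    return isinstance(exc, RuntimeError) and "out of memory" in str(exc).lower()


def try_batch_size(
    run_step: Callable[[int], None],
    batch_size: int,
    release_memory: Optional[Callable[[], None]] = None,
) -> bool:
    """Return True if one step at this batch size completes, False on OOM.

    Errors that are not OOM are re-raised so real bugs are not hidden.
    """
    try:
        run_step(batch_size)
        return True
    except (RuntimeError, MemoryError) as exc:
        if not is_oom_error(exc):
            raise
        return False
    finally:
        # Clean up after every probe, successful or not, before the next one.
        gc.collect()
        if release_memory is not None:
            release_memory()
```

Re-raising non-OOM errors keeps a shape mismatch or a bad config from being misread as "batch size too large".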

Files Created

Core Implementation

nanochat/auto_batch_size.py (stub with full interface)

Tests

tests/test_auto_batch_size.py (unit tests)
tests/integration/test_single_gpu_discovery.sh
tests/integration/test_manual_vs_auto.sh
tests/integration/test_ddp_discovery.sh
tests/integration/test_throughput_comparison.sh
tests/integration/test_stability_depth{12,20,26,32}.sh
tests/integration/test_overrides.sh
tests/integration/test_cache_mechanism.sh
tests/integration/test_failure_handling.sh

Infrastructure

tests/run_unit_tests.sh
tests/run_integration_tests.sh
tests/make_executable.sh

Documentation

tests/README.md (user guide)
tests/TEST_PLAN.md (test specifications)
tests/IMPLEMENTATION_NOTES.md (this file)

Results Directory

tests/results/.gitkeep
Updated .gitignore to exclude test logs

Conclusion

The testing infrastructure is complete and ready to use. The stub implementation allows the test framework to be validated and demonstrates the expected interface. Once the full auto-discovery implementation is complete, these tests will provide comprehensive validation of correctness, performance, and stability.

The tests are designed to be:

Comprehensive: Cover all major functionality and edge cases
Maintainable: Clear structure, good documentation
CI-ready: Can run unattended with clear pass/fail
Fast: Unit tests in seconds, full suite in ~30 minutes
Reliable: Auto-skip tests when requirements not met (e.g., multiple GPUs)

For questions or issues, refer to:

tests/README.md for usage instructions
tests/TEST_PLAN.md for test specifications
Test logs in tests/results/ for debugging
