mirror of https://github.com/karpathy/nanochat.git synced 2025-12-16 01:02:18 +00:00

Artemis Git Integration ffdbb9c247 test: add comprehensive test suite for auto-batch-size discovery with unit and integration tests, pytest framework, stability validation, and updated documentation

2025-11-05 16:52:29 +00:00

7.7 KiB

Raw Blame History

Auto-Discovery Test Plan

Test Coverage Matrix

Test #	Name	Type	Duration	GPU Required	Status
1	Exponential Search Logic	Unit	< 1s	No	✓ Implemented
2	Binary Search Refinement	Unit	< 1s	No	✓ Implemented
3	Safety Margin Application	Unit	< 1s	No	✓ Implemented
4	Cache Mechanism	Unit	< 1s	No	✓ Implemented
5	DDP Broadcast Simulation	Unit	< 1s	No	✓ Implemented
6	Basic Discovery Run	Integration	30s	1 GPU	✓ Implemented
7	Manual vs Auto Comparison	Integration	2-3 min	1 GPU	✓ Implemented
8	DDP Discovery (2 GPUs)	Integration	1-2 min	2 GPUs	✓ Implemented
9	DDP Discovery (4 GPUs)	Integration	1-2 min	4 GPUs	✓ Implemented
10	Throughput Comparison	Integration	5-10 min	1 GPU	✓ Implemented
11	Stability (depth=12)	Integration	10-15 min	1 GPU	✓ Implemented
12	Stability (depth=20)	Integration	15-20 min	1 GPU	✓ Implemented
13	Stability (depth=26)	Integration	20-25 min	1 GPU	✓ Implemented
14	Stability (depth=32)	Integration	25-30 min	1 GPU	✓ Implemented
15	Manual Override	Integration	1-2 min	1 GPU	✓ Implemented
16	Disable Auto-Discovery	Integration	1-2 min	1 GPU	✓ Implemented
17	Custom Safety Margin	Integration	2-3 min	1 GPU	✓ Implemented
18	Cache Hit	Integration	2-3 min	1 GPU	✓ Implemented
19	Cache Key Validation	Integration	3-4 min	1 GPU	✓ Implemented
20	Cache Invalidation	Integration	2-3 min	1 GPU	✓ Implemented
21	Artificial Memory Constraint	Integration	2-3 min	1 GPU	✓ Implemented
22	Mid-Training Override Warning	Integration	2-3 min	1 GPU	✓ Implemented

Test Execution Time Estimates

Fast Suite (Unit Tests Only)

Duration: ~10 seconds
GPU: Not required
Command: bash tests/run_unit_tests.sh

Standard Suite (Unit + Short Integration)

Duration: ~15-30 minutes
GPU: 1 GPU required
Command: bash tests/run_integration_tests.sh

Full Suite (Including Long Stability Tests)

Duration: ~1-2 hours
GPU: 1 GPU required
Command: RUN_LONG_TESTS=1 bash tests/run_integration_tests.sh

Multi-GPU Suite

Duration: ~20-40 minutes
GPU: 2-4 GPUs required
Command: bash tests/run_integration_tests.sh (auto-detects GPUs)

Success Criteria

Unit Tests

All 5 unit tests pass
Tests complete in < 10 seconds total
Code coverage ≥ 80% for nanochat/auto_batch_size.py

Integration Tests - Basic

Single GPU discovery completes in < 30 seconds
Auto-discovered batch size ≥ manual baseline (8)
No OOM errors in any test
All logs contain expected messages

Integration Tests - DDP

Rank 0 performs discovery, other ranks receive broadcast
All ranks use identical batch size
No deadlocks or synchronization errors
Tests complete successfully on 2 and 4 GPUs

Integration Tests - Performance

Throughput improvement ≥ 1.3x compared to manual baseline
Speedup ratio calculated and logged
Results saved to JSON for analysis

Integration Tests - Stability

All 1000 iterations complete without errors
No OOM errors during long runs
No memory leaks detected
Larger models (depth=32) use smaller batch sizes than smaller models (depth=12)

Integration Tests - Overrides

Manual --device_batch_size skips auto-discovery
Custom safety margins produce expected batch sizes
Disabled auto-discovery uses default values

Integration Tests - Cache

Cache hit reduces startup time from 15-30s to < 5s
Different configurations create different cache keys
Corrupted cache handled gracefully (fallback to re-discovery)
Cache files created in correct directory

Integration Tests - Failure Handling

Artificial memory constraints trigger fallback
Warning messages logged appropriately
Mid-training override warning appears
No crashes or exceptions, only graceful degradation

Known Limitations

Cache Tests: Require write access to cache directory (usually ~/.nanochat/auto_batch_cache/)
DDP Tests: Automatically skipped if fewer than 2 GPUs available
Long Tests: Disabled by default, require RUN_LONG_TESTS=1 environment variable
Memory Constraint Tests: Difficult to reliably simulate on all systems
Mid-Training Tests: Require existing checkpoint from base_train

Test Maintenance

Adding New Tests

Unit Tests: Add to tests/test_auto_batch_size.py

def test_new_feature():
    # Test implementation
    assert result == expected

Integration Tests: Create new script in tests/integration/

#!/bin/bash
# tests/integration/test_new_feature.sh
set -e
# Test implementation

Update tests/run_integration_tests.sh to include new test
Update this test plan document

Debugging Failed Tests

Check logs: All test output saved to tests/results/*.log
Run individually: Execute specific test script in isolation
Increase verbosity: Use -x flag for bash scripts, -vv for pytest
Check GPU state: Run nvidia-smi before and after tests
Clear cache: Remove ~/.nanochat/auto_batch_cache/ if cache issues suspected

CI/CD Integration

Recommended CI Pipeline

stages:
  - test-unit
  - test-integration-fast
  - test-integration-full

test-unit:
  script:
    - bash tests/run_unit_tests.sh
  duration: 1 minute

test-integration-fast:
  script:
    - bash tests/run_integration_tests.sh
  duration: 30 minutes
  requires: [test-unit]

test-integration-full:
  script:
    - RUN_LONG_TESTS=1 bash tests/run_integration_tests.sh
  duration: 2 hours
  requires: [test-integration-fast]
  when: manual  # Only run on-demand

Pre-commit Hooks

#!/bin/bash
# .git/hooks/pre-commit
bash tests/run_unit_tests.sh

Test Data

Expected Batch Sizes (A100 80GB GPU)

depth=12: ~64-96
depth=20: ~32-48
depth=26: ~16-32
depth=32: ~8-16

Note: Actual values depend on GPU memory, safety margin, and max_seq_len.

Expected Speedups

Baseline: device_batch_size=8
Auto-discovered: device_batch_size=32-64
Expected speedup: 1.5-3.0x (target: ≥1.3x after overhead)

Appendix: Test File Structure

tests/
├── README.md                          # User-facing documentation
├── TEST_PLAN.md                       # This file
├── test_auto_batch_size.py           # Unit tests
├── run_unit_tests.sh                 # Unit test runner
├── run_integration_tests.sh          # Integration test runner
├── make_executable.sh                # Helper to chmod +x scripts
├── integration/                      # Integration test scripts
│   ├── test_single_gpu_discovery.sh
│   ├── test_manual_vs_auto.sh
│   ├── test_ddp_discovery.sh
│   ├── test_throughput_comparison.sh
│   ├── test_stability_depth12.sh
│   ├── test_stability_depth20.sh
│   ├── test_stability_depth26.sh
│   ├── test_stability_depth32.sh
│   ├── test_overrides.sh
│   ├── test_cache_mechanism.sh
│   └── test_failure_handling.sh
└── results/                          # Test output (gitignored)
    ├── .gitkeep
    ├── *.log
    └── throughput_comparison.json

Version History

v1.0 (2024-01): Initial test suite implementation
- 5 unit tests
- 17 integration tests (Tests 6-22)
- Unit and integration test runners
- Comprehensive documentation

7.7 KiB Raw Blame History