Author: Kaiyue Wen
Commit: ee04406ebb
Merge muonh-dev and master: FP8 training, optimizer tuning, and scaling improvements
Major changes:
- Add custom FP8 training module (replaces torchao dependency; see FP8 sketch below)
- Implement auto-calculated optimal batch sizes (1M for d26)
- Add hyperball data scaling
- Restore and tune momentum schedule (settled on 0.95)
- Add matrix warmup ratio and norm_lr parameters (see optimizer sketch below)
- Improve weight decay scaling (Tepoch-based theory)
- Update d26 configuration and scaling laws
- Clarify MFU labeling as bf16_mfu
- Update leaderboard and documentation
Co-Authored-By: Claude Sonnet 4.5 (1M context) <noreply@anthropic.com>
Date: 2026-02-12 16:15:15 -08:00
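The custom FP8 training module replaces the torchao dependency. As a rough, hypothetical sketch of the underlying idea (not the repository's actual implementation), per-tensor FP8 (e4m3) quantization in plain PyTorch could look like the following; `quantize_fp8`, `dequantize_fp8`, and the scaling scheme are illustrative assumptions.

```python
import torch

# Minimal sketch of per-tensor FP8 (e4m3) quantize/dequantize in plain PyTorch,
# illustrating the kind of casting a custom FP8 module can do without torchao.
# All names and the scaling scheme here are hypothetical.

FP8_E4M3_MAX = 448.0  # largest representable magnitude in float8_e4m3fn


def quantize_fp8(x: torch.Tensor):
    """Scale a bf16/fp32 tensor into the e4m3 range and cast it to FP8."""
    amax = x.abs().max().clamp(min=1e-12)
    scale = FP8_E4M3_MAX / amax
    x_fp8 = (x * scale).to(torch.float8_e4m3fn)
    return x_fp8, scale


def dequantize_fp8(x_fp8: torch.Tensor, scale: torch.Tensor, dtype=torch.bfloat16):
    """Cast back to a higher-precision dtype and undo the scaling."""
    return x_fp8.to(dtype) / scale


if __name__ == "__main__":
    w = torch.randn(1024, 1024, dtype=torch.bfloat16)
    w_fp8, s = quantize_fp8(w)
    w_rt = dequantize_fp8(w_fp8, s)
    print("max abs round-trip error:", (w - w_rt).abs().max().item())
```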
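The optimizer-tuning items (momentum settled at 0.95, a matrix warmup ratio, a separate norm_lr) point at per-parameter-group hyperparameters. The sketch below shows one plausible way such knobs could be wired into PyTorch param groups; only the 0.95 momentum and the names norm_lr / matrix warmup ratio come from the commit message. The branch's actual Muon-style optimizer is not reproduced here, and all other names and values are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical wiring of the tuned hyperparameters into PyTorch param groups:
# matrices get the base lr with a warmup, 1-D params (norms, biases) get norm_lr.
# Values other than momentum=0.95 are placeholders, not the repo's settings.

def build_param_groups(model: nn.Module, base_lr: float = 3e-4, norm_lr: float = 1e-3):
    matrix_params, norm_params = [], []
    for _, p in model.named_parameters():
        if p.ndim >= 2:      # weight matrices (linear / embedding)
            matrix_params.append(p)
        else:                # norms, biases, other 1-D parameters
            norm_params.append(p)
    return [
        {"params": matrix_params, "lr": base_lr},
        {"params": norm_params, "lr": norm_lr},
    ]


def lr_scale(step: int, total_steps: int, matrix_warmup_ratio: float = 0.05) -> float:
    """Linear warmup over a fraction of training, then constant (a sketch)."""
    warmup_steps = max(1, int(matrix_warmup_ratio * total_steps))
    return min(1.0, step / warmup_steps)


model = nn.Sequential(nn.Linear(64, 64), nn.LayerNorm(64))
opt = torch.optim.SGD(build_param_groups(model), lr=3e-4, momentum=0.95)
# per step, e.g.: opt.param_groups[0]["lr"] = 3e-4 * lr_scale(step, total_steps)
```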