diff --git a/dev/LOG.md b/dev/LOG.md
index 71cb18d3..b344b238 100644
--- a/dev/LOG.md
+++ b/dev/LOG.md
@@ -4,6 +4,19 @@ A running summary documenting some experiments and findings. Started ~Jan 7 2026
 
 ---
 
+## 2026-02-05: SwiGLU Activation (Negative Result)
+
+Replaced the ReLU² MLP activation with SwiGLU (inspired by [twitter](https://x.com/_xjdr/status/2019141521690567058)). The implementation uses three projections (w1, w2, w3) with hidden_dim scaled to 8/3×n_embd, preserving both parameter count and FLOPs exactly (1.00x match on both).
+
+```python
+# Old: x = c_proj(relu(c_fc(x)).square())
+# New: x = w3(silu(w1(x)) * w2(x))
+```
+
+Tested at both d12 and d24 (GPT-2 scale). Worse on all measures — step efficiency, wall clock time, and FLOPs. ReLU² remains superior for nanochat. **Not adopted.**
+
+---
+
 ## 2026-02-03: Flip Muon MLP LR Multiplier (PR #492)
 
 Tested flipping the shape-based LR heuristic in Muon from boosting tall matrices (input projections like `c_fc`) to boosting wide matrices (output projections like `c_proj`). The original code applies `max(1, rows/cols)^0.5`, giving ~2x LR to `c_fc`. The flipped version gives ~2x LR to `c_proj` instead, which aligns with classical fan-in/fan-out scaling conventions. This was proposed in [PR #492](https://github.com/karpathy/nanochat/pull/492) and showed improvements in modded-nanogpt.
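The 8/3×n_embd scaling in the SwiGLU entry can be sanity-checked with a short calculation. This is a sketch, not nanochat code: `mlp_params` is a hypothetical helper assuming bias-free linear layers and the standard 4×n_embd hidden width for the ReLU² MLP.

```python
def mlp_params(n_embd: int) -> tuple[int, int]:
    """Compare parameter counts of the two MLP variants (bias-free Linears).

    Hypothetical helper for illustration; the actual nanochat code differs.
    """
    # ReLU^2 MLP: c_fc (n_embd -> 4*n_embd) then c_proj (4*n_embd -> n_embd)
    relu2 = n_embd * (4 * n_embd) + (4 * n_embd) * n_embd
    # SwiGLU: w1 and w2 (n_embd -> h), w3 (h -> n_embd), with h = 8/3 * n_embd,
    # so total params = 3 * n_embd * h = 8 * n_embd^2, same as above
    h = (8 * n_embd) // 3  # exact when n_embd is divisible by 3
    swiglu = 2 * n_embd * h + h * n_embd
    return relu2, swiglu

print(mlp_params(768))  # GPT-2 d12 width: both variants give 4,718,592 params
```

Since per-token matmul FLOPs in a linear layer are proportional to its parameter count, matching parameters this way also matches FLOPs, which is presumably what the "1.00x match on both" refers to.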
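The Muon LR heuristic and its flip from PR #492 can be sketched as follows. `muon_lr_mult` is a hypothetical standalone helper (the real Muon code applies this inline), and the `(out_features, in_features)` weight-shape convention, as in PyTorch, is an assumption.

```python
def muon_lr_mult(rows: int, cols: int, flipped: bool = False) -> float:
    """Shape-based LR multiplier for a weight matrix of shape (rows, cols).

    Original heuristic: max(1, rows/cols)^0.5 boosts tall matrices.
    Flipped (PR #492):  max(1, cols/rows)^0.5 boosts wide matrices.
    """
    ratio = cols / rows if flipped else rows / cols
    return max(1.0, ratio) ** 0.5

n = 768
# With PyTorch's (out, in) convention: c_fc is (4n, n) tall, c_proj is (n, 4n) wide
assert muon_lr_mult(4 * n, n) == 2.0                 # original boosts c_fc ~2x
assert muon_lr_mult(n, 4 * n) == 1.0                 # original leaves c_proj at 1x
assert muon_lr_mult(n, 4 * n, flipped=True) == 2.0   # flipped boosts c_proj ~2x
```

Square matrices (e.g. attention projections) get a multiplier of 1.0 under either variant, so the flip only redistributes the boost between the MLP's input and output projections.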