try and fail relu^2 -> swiglu

This commit is contained in:
Andrej Karpathy 2026-02-05 02:41:46 +00:00
parent 718e5e9d67
commit d63b7ab9ac


@@ -4,6 +4,19 @@ A running summary documenting some experiments and findings. Started ~Jan 7 2026
---
## 2026-02-05: SwiGLU Activation (Negative Result)
Replaced ReLU² MLP activation with SwiGLU (inspired by [twitter](https://x.com/_xjdr/status/2019141521690567058)). Implementation uses three projections (w1, w2, w3) with hidden_dim scaled to 8/3×n_embd to preserve both parameter count and FLOPs exactly (1.00x match on both).
```python
# Old: x = c_proj(relu(c_fc(x)).square())
# New: x = w3(silu(w1(x)) * w2(x))
```
Tested at both d12 and d24 (GPT-2 scale). Worse on all measures: step efficiency, wall-clock time, and FLOPs. ReLU² remains superior for nanochat. **Not adopted.**
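The swap above can be sketched as a module; this is a hypothetical reconstruction, assuming bias-free `nn.Linear` layers and the w1/w2/w3 naming from the log (the actual nanochat implementation may differ). Note the parameter-count match: SwiGLU has three `n_embd × hidden` matrices with `hidden = (8/3)·n_embd`, i.e. `3 · (8/3) · n_embd² = 8·n_embd²`, the same as the ReLU² MLP's two matrices at the usual 4× expansion (`2 · 4 · n_embd²`).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUMLP(nn.Module):
    """Sketch of the SwiGLU MLP described above (hypothetical reconstruction).

    hidden is set to (8/3)*n_embd so total parameters equal the
    ReLU^2 baseline with a 4x expansion: 3*(8/3)*n_embd^2 == 2*4*n_embd^2.
    """
    def __init__(self, n_embd: int):
        super().__init__()
        hidden = int(8 * n_embd / 3)
        self.w1 = nn.Linear(n_embd, hidden, bias=False)  # gate projection
        self.w2 = nn.Linear(n_embd, hidden, bias=False)  # value projection
        self.w3 = nn.Linear(hidden, n_embd, bias=False)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x = w3(silu(w1(x)) * w2(x)), per the diff above
        return self.w3(F.silu(self.w1(x)) * self.w2(x))
```

With `n_embd` divisible by 3 the 8/3 scaling is exact and the parameter count lands on `8·n_embd²` precisely, which is the "1.00x match" the log reports.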
---
## 2026-02-03: Flip Muon MLP LR Multiplier (PR #492)
Tested flipping the shape-based LR heuristic in Muon from boosting tall matrices (input projections like `c_fc`) to boosting wide matrices (output projections like `c_proj`). The original code applies `max(1, rows/cols)^0.5`, giving ~2x LR to `c_fc`. The flipped version gives ~2x LR to `c_proj` instead, which aligns with classical fan-in/fan-out scaling conventions. This was proposed in [PR #492](https://github.com/karpathy/nanochat/pull/492) and showed improvements in modded-nanogpt.
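The heuristic and its flip can be sketched as a small helper; this is an illustrative function, not nanochat's actual Muon code. For a 4× MLP, `c_fc` has shape `(4·n_embd, n_embd)` so `max(1, rows/cols)^0.5 = √4 = 2`, while `c_proj` at `(n_embd, 4·n_embd)` gets 1; the flipped version swaps which side receives the ~2x boost.

```python
def muon_lr_mult(shape, flipped=False):
    """Shape-based LR multiplier (illustrative sketch of the heuristic above).

    Original: max(1, rows/cols)**0.5 boosts tall matrices (e.g. c_fc).
    Flipped:  max(1, cols/rows)**0.5 boosts wide matrices (e.g. c_proj),
    matching classical fan-in/fan-out scaling conventions.
    """
    rows, cols = shape
    ratio = cols / rows if flipped else rows / cols
    return max(1.0, ratio) ** 0.5

n = 768  # e.g. GPT-2 small n_embd; any n gives the same multipliers
c_fc_shape = (4 * n, n)    # tall: input projection
c_proj_shape = (n, 4 * n)  # wide: output projection
```

Usage: `muon_lr_mult(c_fc_shape)` returns 2.0 under the original rule, while `muon_lr_mult(c_proj_shape, flipped=True)` returns 2.0 under the PR #492 variant; the non-boosted matrix gets 1.0 in each case.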