Mirror of https://github.com/karpathy/nanochat.git (synced 2026-05-10 18:00:17 +00:00)
Commit d63b7ab9ac (parent 718e5e9d67): try and fail relu^2 -> swiglu
13 dev/LOG.md

@@ -4,6 +4,19 @@ A running summary documenting some experiments and findings. Started ~Jan 7 2026

---

## 2026-02-05: SwiGLU Activation (Negative Result)
Replaced ReLU² MLP activation with SwiGLU (inspired by [twitter](https://x.com/_xjdr/status/2019141521690567058)). Implementation uses three projections (w1, w2, w3) with hidden_dim scaled to 8/3×n_embd to preserve both parameter count and FLOPs exactly (1.00x match on both).

```python
# Old: x = c_proj(relu(c_fc(x)).square())
# New: x = w3(silu(w1(x)) * w2(x))
```

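For reference, the two activations elementwise (a minimal sketch; the gating product and the linear projections are omitted):

```python
import math

def relu2(x: float) -> float:
    """ReLU squared: max(x, 0)^2."""
    return max(x, 0.0) ** 2

def silu(x: float) -> float:
    """SiLU (a.k.a. swish): x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

print(relu2(2.0))  # 4.0
print(silu(0.0))   # 0.0
```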
Tested at both d12 and d24 (GPT-2 scale). Worse on all measures — step efficiency, wall clock time, and FLOPs. ReLU² remains superior for nanochat. **Not adopted.**
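The exact param/FLOP match claimed above can be checked with quick arithmetic (hypothetical helper; biasless linear layers assumed, matmul FLOPs are proportional to params here so the same equality covers both):

```python
def mlp_params(n_embd: int) -> tuple[int, int]:
    """Parameter counts for the two MLP variants (no biases assumed)."""
    # ReLU^2 MLP: c_fc (n_embd -> 4*n_embd) then c_proj (4*n_embd -> n_embd)
    relu2 = n_embd * (4 * n_embd) + (4 * n_embd) * n_embd
    # SwiGLU MLP: w1, w2 (n_embd -> h) and w3 (h -> n_embd), with h = 8/3 * n_embd
    h = 8 * n_embd // 3
    swiglu = 2 * n_embd * h + h * n_embd
    return relu2, swiglu

r, s = mlp_params(768)  # GPT-2 small width; divisible by 3, so 8/3 is exact
print(r, s)  # 4718592 4718592 -> 1.00x match
```

Both variants come to 8·n_embd² parameters, which is why the match is exact whenever n_embd is divisible by 3.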

---

## 2026-02-03: Flip Muon MLP LR Multiplier (PR #492)
Tested flipping the shape-based LR heuristic in Muon from boosting tall matrices (input projections like `c_fc`) to boosting wide matrices (output projections like `c_proj`). The original code applies `max(1, rows/cols)^0.5`, giving ~2x LR to `c_fc`. The flipped version gives ~2x LR to `c_proj` instead, which aligns with classical fan-in/fan-out scaling conventions. This was proposed in [PR #492](https://github.com/karpathy/nanochat/pull/492) and showed improvements in modded-nanogpt.
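The heuristic and its flip can be illustrated with a tiny helper (hypothetical sketch; assumes weight shapes stored as `(out_features, in_features)` and a 4×-hidden MLP at n_embd=768):

```python
def muon_lr_mult(rows: int, cols: int, flipped: bool = False) -> float:
    """Shape-based LR multiplier: max(1, ratio)^0.5.

    Original heuristic boosts tall matrices (rows > cols, e.g. c_fc);
    the flipped variant boosts wide ones (cols > rows, e.g. c_proj).
    """
    ratio = cols / rows if flipped else rows / cols
    return max(1.0, ratio) ** 0.5

n_embd = 768
c_fc_shape = (4 * n_embd, n_embd)    # tall: input projection
c_proj_shape = (n_embd, 4 * n_embd)  # wide: output projection

print(muon_lr_mult(*c_fc_shape))                  # 2.0 (original boosts c_fc)
print(muon_lr_mult(*c_proj_shape, flipped=True))  # 2.0 (flip boosts c_proj)
```

With a 4× hidden dim the aspect ratio is 4, so the boosted side gets exactly sqrt(4) = 2× LR, matching the ~2x figure above.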