Mirror of https://github.com/karpathy/nanochat.git, synced 2026-03-26 22:55:16 +00:00
bunch of ideas tried from openai/parameter-golf, all negative results for nanochat
This commit is contained in: parent 1cd94d768f, commit 4e1694cc95

dev/LOG.md | 53
A running summary documenting some experiments and findings. Started ~Jan 7 2026.

---
## 2026-03-24: Parameter-Golf Ideas Sweep (Negative)
Reviewed `openai/parameter-golf` for small/simple ideas that might transfer to nanochat pretraining without bloating the codebase. Cached notes are in `knowledge/parameter_golf.md`.
### Rationale
The parameter-golf leaderboard is a useful source of:

- tiny architecture tweaks
- short-run optimizer/schedule tricks
- Muon-related systems ideas

But much of that repo is optimized for a very different objective:
- fit in a 16MB artifact
- train in under 10 minutes on 8xH100
- evaluate on compression (bits per byte, bpb)
So only a small subset of ideas looked worth trying in nanochat.
### Ideas Tried
**1. LeakyReLU(0.5)^2**

- Replaced `relu^2` in the MLP with `leaky_relu(x, 0.5)^2`
- **Result:** Slightly better per-step quality, but slightly slower. Net worse on wall clock.
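
Concretely, the swap can be sketched as below, assuming a nanochat-style MLP that squares its activation (the function names are illustrative, not nanochat's actual code):

```python
import torch
import torch.nn.functional as F

def relu2(x: torch.Tensor) -> torch.Tensor:
    # baseline activation: relu(x)^2, zero for all negative inputs
    return F.relu(x).square()

def leaky_relu2(x: torch.Tensor) -> torch.Tensor:
    # variant tried here: leaky_relu(x, 0.5)^2 -- a negative input x
    # now contributes (0.5 * x)^2 instead of being zeroed out
    return F.leaky_relu(x, negative_slope=0.5).square()

x = torch.tensor([-2.0, -1.0, 0.0, 1.0, 2.0])
print(relu2(x))        # tensor([0., 0., 0., 1., 4.])
print(leaky_relu2(x))  # tensor([1.0000, 0.2500, 0.0000, 1.0000, 4.0000])
```

The variant keeps gradient signal on the negative side, which lines up with the slightly better per-step quality, at the cost of a bit more work in the activation.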
**2. Partial RoPE**

- Applied rotary embeddings to only the first quarter of each head dimension
- **Result:** Slightly worse.
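
A minimal sketch of the idea, assuming the usual rotate-half RoPE formulation; the `partial_rope` helper and the shapes below are hypothetical, not nanochat's actual code:

```python
import torch

def partial_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor,
                 rot_frac: float = 0.25) -> torch.Tensor:
    # x: (batch, heads, seq, head_dim). Rotate only the first rot_frac of
    # head_dim; the remaining channels pass through position-independent.
    d_rot = int(x.shape[-1] * rot_frac)
    x_rot, x_pass = x[..., :d_rot], x[..., d_rot:]
    x1, x2 = x_rot.chunk(2, dim=-1)
    # standard rotate-half RoPE applied to the rotated slice only
    out = torch.cat((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return torch.cat((out, x_pass), dim=-1)

T, D = 8, 16                    # seq len, head dim; rotates first 4 channels
half = int(D * 0.25) // 2       # frequency pairs in the rotated slice
inv_freq = 1.0 / (10000 ** (torch.arange(half) / half))
ang = torch.outer(torch.arange(T).float(), inv_freq)  # (T, half)
q = torch.randn(1, 2, T, D)
q_rot = partial_rope(q, ang.cos(), ang.sin())
assert q_rot.shape == q.shape
assert torch.equal(q_rot[..., 4:], q[..., 4:])  # last 3/4 untouched
```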
**3. LN Scale**

- Multiplied each block's normalized input by `1/sqrt(layer_idx+1)` before attention and MLP
- **Result:** Did not help.
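
A sketch of the variant, assuming a gain-free RMSNorm as the block norm (the helper names are illustrative):

```python
import math
import torch

def rms_norm(x: torch.Tensor) -> torch.Tensor:
    # plain RMSNorm without a learned gain
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + 1e-6)

def scaled_norm(x: torch.Tensor, layer_idx: int) -> torch.Tensor:
    # the variant tried: damp the normalized input by 1/sqrt(layer_idx + 1),
    # so deeper blocks receive progressively smaller residual-branch inputs
    return rms_norm(x) / math.sqrt(layer_idx + 1)

x = torch.full((4,), 2.0)
print(scaled_norm(x, 0))  # ~unit RMS at layer 0
print(scaled_norm(x, 3))  # halved at layer 3: 1/sqrt(4) = 0.5
```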
**4. Orthogonal init**

- Switched the non-zero transformer matrices to orthogonal init while preserving zero-init output projections
- **Result:** Did not help.
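
A sketch of that init scheme (the layer names and sizes are illustrative, not nanochat's):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
w_up = nn.Linear(64, 256, bias=False)    # "non-zero" matrix: orthogonal init
w_proj = nn.Linear(256, 64, bias=False)  # output projection: stays zero-init
nn.init.orthogonal_(w_up.weight)
nn.init.zeros_(w_proj.weight)

# for a tall (256, 64) weight, orthogonal_ gives orthonormal columns,
# so W.T @ W is the 64x64 identity
gram = w_up.weight.T @ w_up.weight
assert torch.allclose(gram, torch.eye(64), atol=1e-5)
assert w_proj.weight.abs().sum() == 0
```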
**5. XSA (Exclusive Self Attention)**

- Implemented XSA on the deepest 3 non-VE layers only, so it projected against the plain `v` path rather than `v + VE`
- **Result:** Slightly better step quality but not wall clock. Not worth the extra compute in the hot attention path.
### Notes

- EMA/SWA had already been tried earlier (I skipped recording it) and did not help.
- Bigram hash embeddings had been explored much earlier and did help somewhat, but the added parameters / VRAM / complexity were not justified at larger scale. See the Jan 27-28 entries above.
### Conclusion

This pass did not find any cheap parameter-golf transfer that clearly improves nanochat on the metric that matters: wall-clock time to capability.
---
## 2026-03-04: Remove autocast, explicit dtype management, fp16 GradScaler

Replaced `torch.amp.autocast` throughout the codebase with explicit dtype management via a single `COMPUTE_DTYPE` global. Also added fp16 training support with GradScaler.