diff --git a/dev/LOG.md b/dev/LOG.md
index fd5c3c7..dddfcb0 100644
--- a/dev/LOG.md
+++ b/dev/LOG.md
@@ -4,6 +4,59 @@ A running summary documenting some experiments and findings. Started ~Jan 7 2026
 
 ---
 
+## 2026-03-24: Parameter-Golf Ideas Sweep (Negative)
+
+Reviewed `openai/parameter-golf` for small/simple ideas that might transfer to nanochat pretraining without bloating the codebase. Cached notes are in `knowledge/parameter_golf.md`.
+
+### Rationale
+
+The parameter-golf leaderboard is a useful source of:
+
+- tiny architecture tweaks
+- short-run optimizer/schedule tricks
+- Muon-related systems ideas
+
+But much of that repo is optimized for a very different objective:
+
+- fit in a 16MB artifact
+- train in under 10 minutes on 8xH100
+- evaluate on compression / bpb
+
+So only a small subset of ideas looked worth trying in nanochat.
+
+### Ideas Tried
+
+**1. LeakyReLU(0.5)^2**
+- Replaced `relu^2` in the MLP with `leaky_relu(x, 0.5)^2`
+- **Result:** Slightly better per-step quality, but slightly slower. Net worse on wall clock.
+
+**2. Partial RoPE**
+- Applied rotary embeddings to only the first quarter of each head dimension
+- **Result:** Slightly worse.
+
+**3. LN Scale**
+- Multiplied each block's normalized input by `1/sqrt(layer_idx+1)` before attention and MLP
+- **Result:** Did not help.
+
+**4. Orthogonal init**
+- Switched the non-zero transformer matrices to orthogonal init while preserving zero-init output projections
+- **Result:** Did not help.
+
+**5. XSA (Exclusive Self Attention)**
+- Implemented XSA on the deepest 3 non-VE layers only, so it projected against the plain `v` path rather than `v + VE`
+- **Result:** Slightly better step quality but not wall clock. Not worth the extra compute in the hot attention path.
+
+### Notes
+
+- EMA/SWA had already been tried earlier (I skipped recording it) and did not help.
+- Bigram hash embeddings had already been explored much earlier and did help somewhat, but the added parameters / VRAM / complexity were not justified at larger scale. See the Jan 27-28 entries above.
+
+### Conclusion
+
+This pass did not find any cheap parameter-golf transfer that clearly improves nanochat on the metric that matters: wall clock time to capability.
+
+---
+
 ## 2026-03-04: Remove autocast, explicit dtype management, fp16 GradScaler
 
 Replaced `torch.amp.autocast` throughout the codebase with explicit dtype management via a single `COMPUTE_DTYPE` global. Also added fp16 training support with GradScaler.
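For reference, idea 1 in the log swaps the MLP activation. A minimal pure-Python sketch of the two activations, using scalar helpers for illustration only (nanochat applies the activation elementwise to tensors; these helper names are hypothetical, not nanochat code):

```python
# relu^2 (the baseline) vs. leaky_relu(x, 0.5)^2 (the tried variant).
# Squaring the LeakyReLU means negative pre-activations still contribute
# a (positive) output and carry gradient, unlike plain relu^2.

def relu_squared(x: float) -> float:
    return max(x, 0.0) ** 2

def leaky_relu_squared(x: float, slope: float = 0.5) -> float:
    y = x if x >= 0.0 else slope * x
    return y * y

# Positive inputs match; negative inputs differ:
print(relu_squared(2.0))         # 4.0
print(leaky_relu_squared(2.0))   # 4.0
print(relu_squared(-2.0))        # 0.0
print(leaky_relu_squared(-2.0))  # 1.0  (0.5 * -2 = -1, squared)
```

Note that squaring makes the variant non-monotonic on the negative branch, which is consistent with the log's finding that it changes per-step quality rather than being a pure speedup.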
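Idea 2 (partial RoPE) rotates only a leading slice of each head's dimensions. A pure-Python sketch on a single head vector, assuming the common pairwise-rotation formulation of RoPE (function name and exact frequency schedule are illustrative, not nanochat's implementation):

```python
import math

def partial_rope(x, pos, base=10000.0, fraction=0.25):
    """Apply rotary embedding to the first `fraction` of dims; pass the rest through."""
    d = len(x)
    rot = int(d * fraction)          # number of dims that get rotated (assumed even)
    out = list(x)
    for i in range(0, rot, 2):       # rotate consecutive (even, odd) pairs
        theta = pos / (base ** (i / rot))
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + 1]
        out[i] = a * c - b * s
        out[i + 1] = a * s + b * c
    return out

head = [1.0] * 8
rotated = partial_rope(head, pos=3)
# Only the first 2 of 8 dims change (8 * 0.25 = 2); the tail is untouched.
print(rotated[2:] == head[2:])  # True
```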
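Idea 3 (LN scale) is just a depth-dependent scalar on each block's normalized input. A tiny sketch of the schedule (helper name is illustrative):

```python
import math

# Each block's normalized input is multiplied by 1/sqrt(layer_idx + 1)
# before attention and MLP, damping the contribution of deeper layers.

def ln_scale(layer_idx: int) -> float:
    return 1.0 / math.sqrt(layer_idx + 1)

print([round(ln_scale(i), 3) for i in range(4)])  # [1.0, 0.707, 0.577, 0.5]
```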