From 64b48d0e5c502f56d9bfd9af8a5c2a5e901bf1ba Mon Sep 17 00:00:00 2001
From: Andrej Karpathy
Date: Tue, 13 Jan 2026 17:45:06 +0000
Subject: [PATCH] validated that \p{N}{1,2} is the correct number of digits to
 group up to in the regex pattern of the GPT-4 tokenizer (2 down from 3),
 leading to the best val_bpb for 32K vocabs

---
 dev/LOG.md | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/dev/LOG.md b/dev/LOG.md
index 7944526..4708199 100644
--- a/dev/LOG.md
+++ b/dev/LOG.md
@@ -4,6 +4,21 @@ A running summary documenting some experiments and findings. Started ~Jan 7 2026
 
 ---
 
+## 2026-01-13: Number Token Split Pattern
+
+Validated the `\p{N}{1,2}` pattern in `SPLIT_PATTERN` (tokenizer.py line 30), which I had only guessed earlier and left a TODO to validate. GPT-4 uses `\p{N}{1,3}` to group number sequences of up to 3 digits into tokens, but we suspected that smaller vocab sizes benefit from grouping fewer digits per token.
+
+**Results (d12, vocab=32K):**
+| Pattern | val_bpb |
+|---------|---------|
+| `\p{N}{1,1}` | 0.969 |
+| `\p{N}{1,2}` | **0.965** |
+| `\p{N}{1,3}` | 0.972 |
+
+**Conclusion:** `{1,2}` is optimal for vocab size 32K. Grouping 3 digits wastes tokens on rare 3-digit combinations; grouping 1 digit is too fine-grained and bloats token sequences. Keeping `{1,2}` as the default.
+
+---
+
 ## 2026-01-13: FP8 Training for lm_head
 
 Attempted to use FP8 (8-bit floating point) for the lm_head layer to speed up the large vocab projection matmul. H100 GPUs have FP8 tensor cores that can theoretically provide ~2x speedup over BF16.
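
As a quick illustration of what the quantifier change does at pre-tokenization time, the sketch below isolates just the digit branch of a GPT-4-style split regex. `chunk_digits` and the simplified two-branch pattern are illustrative stand-ins, not the actual `SPLIT_PATTERN` from tokenizer.py (the real pattern has additional alternatives for letters, punctuation, and whitespace), and it assumes the third-party `regex` package, which supports `\p{N}`.

```python
# Minimal sketch: how \p{N}{1,k} chunks digit runs before BPE merges are applied.
# The two-branch pattern is a simplified stand-in for the full split pattern;
# \P{N}+ lumps together everything the other (omitted) branches would handle.
import regex as re  # third-party `regex` module, needed for \p{N} / \P{N}

def chunk_digits(text: str, max_digits: int) -> list[str]:
    """Split `text` so that digit runs come out in pieces of at most `max_digits` characters."""
    pattern = re.compile(r"\p{N}{1,%d}|\P{N}+" % max_digits)
    return pattern.findall(text)

print(chunk_digits("pi is 31415926", 3))  # ['pi is ', '314', '159', '26']
print(chunk_digits("pi is 31415926", 2))  # ['pi is ', '31', '41', '59', '26']
print(chunk_digits("pi is 31415926", 1))  # ['pi is ', '3', '1', '4', '1', '5', '9', '2', '6']
```

This is the trade-off the table above measures: `{1,3}` exposes up to 1,110 distinct ASCII-digit chunks (10 + 100 + 1,000) for the 32K vocab to cover, while `{1,2}` caps that at 110, at the cost of longer token sequences for long numbers.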