From 64b48d0e5c502f56d9bfd9af8a5c2a5e901bf1ba Mon Sep 17 00:00:00 2001
From: Andrej Karpathy
Date: Tue, 13 Jan 2026 17:45:06 +0000
Subject: [PATCH] validated that \p{N}{1,2} is the correct number of digits to
 group up to in the regex pattern of the GPT-4 tokenizer (2 down from 3),
 leading to the best val_bpb for 32K vocabs

---
 dev/LOG.md | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/dev/LOG.md b/dev/LOG.md
index 7944526..4708199 100644
--- a/dev/LOG.md
+++ b/dev/LOG.md
@@ -4,6 +4,21 @@ A running summary documenting some experiments and findings. Started ~Jan 7 2026
 
 ---
 
+## 2026-01-13: Number Token Split Pattern
+
+Validated the `\p{N}{1,2}` pattern in `SPLIT_PATTERN` (tokenizer.py line 30), which I had only guessed earlier and left a TODO to validate. GPT-4 uses `\p{N}{1,3}` to group number sequences of up to 3 digits into tokens, but we suspected that smaller vocab sizes benefit from grouping fewer digits per token.
+
+**Results (d12, vocab=32K):**
+| Pattern | val_bpb |
+|---------|---------|
+| `\p{N}{1,1}` | 0.969 |
+| `\p{N}{1,2}` | **0.965** |
+| `\p{N}{1,3}` | 0.972 |
+
+**Conclusion:** `{1,2}` is optimal for vocab size 32K. Grouping 3 digits wastes tokens on rare 3-digit combinations; grouping 1 digit is too fine-grained and bloats token sequences. Keeping `{1,2}` as the default.
+
+---
+
 ## 2026-01-13: FP8 Training for lm_head
 
 Attempted to use FP8 (8-bit floating point) for the lm_head layer to speed up the large vocab projection matmul. H100 GPUs have FP8 tensor cores that can theoretically provide ~2x speedup over BF16.
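
As a quick illustration of what the quantifier change does at pre-tokenization time, the sketch below isolates just the digit branch of a GPT-4-style split regex. `chunk_digits` and the simplified two-branch pattern are illustrative stand-ins, not the actual `SPLIT_PATTERN` from tokenizer.py (the real pattern has additional alternatives for letters, punctuation, and whitespace), and it assumes the third-party `regex` package, which supports `\p{N}`.

```python
# Minimal sketch: how \p{N}{1,k} chunks digit runs before BPE merges are applied.
# The two-branch pattern is a simplified stand-in for the full split pattern;
# \P{N}+ lumps together everything the other (omitted) branches would handle.
import regex as re  # third-party `regex` module, needed for \p{N} / \P{N}

def chunk_digits(text: str, max_digits: int) -> list[str]:
    """Split `text` so that digit runs come out in pieces of at most `max_digits` characters."""
    pattern = re.compile(r"\p{N}{1,%d}|\P{N}+" % max_digits)
    return pattern.findall(text)

print(chunk_digits("pi is 31415926", 3))  # ['pi is ', '314', '159', '26']
print(chunk_digits("pi is 31415926", 2))  # ['pi is ', '31', '41', '59', '26']
print(chunk_digits("pi is 31415926", 1))  # ['pi is ', '3', '1', '4', '1', '5', '9', '2', '6']
```

This is the trade-off the table above measures: `{1,3}` exposes up to 1,110 distinct ASCII-digit chunks (10 + 100 + 1,000) for the 32K vocab to cover, while `{1,2}` caps that at 110, at the cost of longer token sequences for long numbers.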