mirror of
https://github.com/karpathy/nanochat.git
synced 2026-01-26 21:34:25 +00:00
adjust the comment on the regex pattern per recent experiment, see dev/LOG.md
parent 64b48d0e5c
commit 23985413aa
@@ -26,7 +26,7 @@ SPECIAL_TOKENS = [
 
 # NOTE: this split pattern deviates from GPT-4 in that we use \p{N}{1,2} instead of \p{N}{1,3}
 # I did this because I didn't want to "waste" too many tokens on numbers for smaller vocab sizes.
-# I haven't validated that this is actually a good idea, TODO.
+# I verified that 2 is the sweet spot for vocab size of 32K. 1 is a bit worse, 3 was worse still.
 SPLIT_PATTERN = r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,2}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""
 
 # -----------------------------------------------------------------------------
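To make the trade-off in the comment concrete, here is a minimal sketch of how the digit quantifier changes number splitting. It is an illustration under stated assumptions, not code from the repository. It isolates only the numeric branch of the pattern and uses \d from the standard-library re module as a stand-in for \p{N} (re has no \p{...} classes, and \d covers decimal digits only). The helper names chunk_digits and max_digit_chunks are hypothetical.

```python
import re


def chunk_digits(text: str, max_len: int) -> list[str]:
    """Split digit runs into left-to-right chunks of at most max_len digits.

    Assumption: \\d stands in for \\p{N} from the real split pattern.
    """
    pattern = re.compile(r"\d{1,%d}" % max_len)
    return pattern.findall(text)


def max_digit_chunks(max_len: int) -> int:
    """Upper bound on distinct ASCII digit-only chunks of length 1..max_len."""
    return sum(10 ** k for k in range(1, max_len + 1))


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, chunk_digits("1234567", n), max_digit_chunks(n))
    # 1 ['1', '2', '3', '4', '5', '6', '7'] 10
    # 2 ['12', '34', '56', '7'] 110
    # 3 ['123', '456', '7'] 1110
```

The last column is only an upper bound on the distinct digit-only pieces the pre-tokenizer can produce (ASCII digits assumed). It is not the number of vocabulary entries a BPE trainer would actually allocate. It does show why {1,3} can claim a larger share of a small vocabulary than {1,2}, which is the concern the comment describes.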