nanochat/log/report/tokenizer-training.md
2026-02-02 08:18:14 -08:00

Tokenizer training

timestamp: 2026-02-01 14:40:20

  • max_chars: 2,000,000,000
  • doc_cap: 10,000
  • vocab_size: 32,768
  • train_time: 87.9820 (seconds)
  • num_special_tokens: 9
  • token_bytes_min: 1
  • token_bytes_max: 19
  • token_bytes_mean: 6.6029
  • token_bytes_std: 2.8250
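
For readers who want to see how the four token_bytes_* figures relate to a vocabulary, here is a minimal sketch. It assumes the statistics are taken over the UTF-8 byte length of each non-special token. The function name `token_byte_stats`, the toy vocabulary, and the choice of population standard deviation are illustrative assumptions, not taken from the nanochat code, and the report itself does not state which standard deviation variant it uses.

```python
import statistics


def token_byte_stats(vocab, special_tokens=frozenset()):
    """Summarize token lengths in bytes (hypothetical helper, not nanochat's API).

    vocab: iterable of tokens as `bytes` objects.
    special_tokens: tokens to exclude from the statistics (assumption: special
    tokens are not counted, since they have no meaningful byte length).
    """
    lengths = [len(tok) for tok in vocab if tok not in special_tokens]
    if not lengths:
        raise ValueError("no non-special tokens to summarize")
    return {
        "token_bytes_min": min(lengths),
        "token_bytes_max": max(lengths),
        "token_bytes_mean": statistics.fmean(lengths),
        # Population standard deviation; a sample estimate would use
        # statistics.stdev instead. Which one the report uses is an assumption.
        "token_bytes_std": statistics.pstdev(lengths),
    }


# Toy vocabulary for illustration only: three ordinary tokens, one special token.
toy_vocab = [b"a", b"the", b" hello", b"<|bos|>"]
toy_stats = token_byte_stats(toy_vocab, special_tokens={b"<|bos|>"})
for name, value in toy_stats.items():
    print(f"{name}: {value}")
```

Read this way, a token_bytes_mean of about 6.6 with a token_bytes_max of 19 says that a typical learned token covers several bytes of text, while single-byte tokens (token_bytes_min of 1) remain in the vocabulary as a fallback.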