# rustbpe

The missing tiktoken training code

A very lightweight Rust library for training a GPT tokenizer. The issue is that the inference library tiktoken is great, but it only does inference. Separately, the huggingface tokenizers library does training, but it is rather bloated and hard to navigate because it has to support all the historical baggage of how people have dealt with tokenizers over the years. More recently, I also wrote the minbpe library, which does both training and inference, but only in inefficient Python. Basically, what I really want is non-fancy, super simple, but still relatively efficient training code for a GPT tokenizer (more efficient than minbpe, much cleaner/simpler than tokenizers), and then to export the trained vocab for inference with tiktoken. Does that make sense? So here we are. There are more opportunities for optimization here; I just stopped a bit early because, unlike minbpe before it, rustbpe is now simple and fast enough that it is not a significant bottleneck for nanochat.
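To make the workflow concrete, here is a minimal sketch of training with rustbpe's Python bindings and handing the result to tiktoken for inference. The method names (`train_from_iterator`, `get_pattern`, `get_mergeable_ranks`), their argument shapes, and the toy corpus are assumptions based on how nanochat drives this library; treat this as illustrative, not as the definitive API.

```python
# Sketch of the intended workflow: train in Rust, run inference with tiktoken.
# ASSUMPTION: rustbpe's Python bindings expose Tokenizer.train_from_iterator,
# get_pattern, and get_mergeable_ranks as nanochat uses them.
import rustbpe
import tiktoken

# Any iterable of training text works; a toy list stands in for a real corpus.
corpus = ["hello world", "hello tokenizer"] * 1000

tokenizer = rustbpe.Tokenizer()
# vocab_size must cover at least the 256 raw bytes; 512 leaves room for merges.
tokenizer.train_from_iterator(iter(corpus), 512)

# Export the trained merges in the dict[bytes, int] format tiktoken expects.
mergeable_ranks = {bytes(token): rank
                   for token, rank in tokenizer.get_mergeable_ranks()}
enc = tiktoken.Encoding(
    name="rustbpe_demo",  # hypothetical name; anything unique works
    pat_str=tokenizer.get_pattern(),
    mergeable_ranks=mergeable_ranks,
    special_tokens={},
)

# From here on it is plain tiktoken: encode/decode round-trips as usual.
ids = enc.encode("hello world")
assert enc.decode(ids) == "hello world"
```

Keeping inference in tiktoken means rustbpe never needs a fast encode path of its own; the only contract is the (split pattern, mergeable ranks) pair that `tiktoken.Encoding` consumes.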