rustbpe

The missing tiktoken training code

A very lightweight Rust library for training a GPT tokenizer. The problem it solves: the tiktoken library is great for inference, but it only does inference. Separately, the huggingface tokenizers library does training, but it is rather bloated and hard to navigate because it has to support all the historical baggage of how people have dealt with tokenizers over the years. More recently, I also wrote the minbpe library, which does both training and inference, but only in inefficient Python.

Basically, what I really want is non-fancy, super simple, but still relatively efficient training code for a GPT tokenizer (more efficient than minbpe, much cleaner and simpler than tokenizers), which then exports the trained vocab for inference with tiktoken. Does that make sense? So here we are.

There are more opportunities for optimization here; I just stopped a bit early because, unlike minbpe before it, rustbpe is now simple and fast enough that it is not a significant bottleneck for nanochat.
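The page does not show rustbpe's actual API, so as a sketch of the underlying algorithm (not the library's interface), here is a minimal byte-level BPE training loop in plain Rust. The names `train_bpe`, `pair_counts`, and `apply_merge` are hypothetical, chosen for illustration: token ids 0..=255 are raw bytes, and each merge step fuses the most frequent adjacent pair into a new id starting at 256.

```rust
use std::cmp::Reverse;
use std::collections::HashMap;

/// Count frequencies of adjacent token pairs across all sequences.
fn pair_counts(seqs: &[Vec<u32>]) -> HashMap<(u32, u32), usize> {
    let mut counts = HashMap::new();
    for seq in seqs {
        for w in seq.windows(2) {
            *counts.entry((w[0], w[1])).or_insert(0usize) += 1;
        }
    }
    counts
}

/// Replace every non-overlapping occurrence of `pair` with `new_id`.
fn apply_merge(seq: &[u32], pair: (u32, u32), new_id: u32) -> Vec<u32> {
    let mut out = Vec::with_capacity(seq.len());
    let mut i = 0;
    while i < seq.len() {
        if i + 1 < seq.len() && (seq[i], seq[i + 1]) == pair {
            out.push(new_id);
            i += 2;
        } else {
            out.push(seq[i]);
            i += 1;
        }
    }
    out
}

/// Learn `num_merges` BPE merges from a toy corpus of words.
/// Token ids 0..=255 are raw bytes; merged tokens start at 256.
fn train_bpe(words: &[&str], num_merges: usize) -> Vec<((u32, u32), u32)> {
    let mut seqs: Vec<Vec<u32>> = words
        .iter()
        .map(|w| w.bytes().map(u32::from).collect())
        .collect();
    let mut merges = Vec::new();
    for step in 0..num_merges {
        let counts = pair_counts(&seqs);
        // Most frequent pair; ties go to the numerically smallest pair
        // so the result is deterministic.
        let best = counts
            .iter()
            .max_by_key(|&(&pair, &count)| (count, Reverse(pair)))
            .map(|(&pair, _)| pair);
        let Some(pair) = best else { break };
        let new_id = 256 + step as u32;
        merges.push((pair, new_id));
        seqs = seqs.iter().map(|s| apply_merge(s, pair, new_id)).collect();
    }
    merges
}

fn main() {
    // On "low"/"lower"/"lowest", the first merge is ('l','o'),
    // then the result merges with 'w', and so on.
    let merges = train_bpe(&["low", "lower", "lowest"], 3);
    for (pair, id) in &merges {
        println!("merge {:?} -> {}", pair, id);
    }
}
```

The learned `(pair, new_id)` list is exactly the kind of merge table that can be exported as mergeable ranks for inference with tiktoken; rustbpe's real implementation additionally pre-splits text with a GPT-style regex and is far more optimized than this sketch.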