rustbpe

The missing tiktoken training code

A very lightweight Rust library for training a GPT tokenizer. The problem it solves: the tiktoken library is great for inference, but it only does inference. Separately, the huggingface tokenizers library does training, but it is rather bloated and hard to navigate because it has to support all the historical baggage of how people have dealt with tokenizers over the years. More recently, I also wrote the minbpe library, which does both training and inference, but only in inefficient Python.

Basically, what I really want is non-fancy, super simple, but still relatively efficient training code for a GPT tokenizer (more efficient than minbpe, much cleaner and simpler than tokenizers), which then exports the trained vocab for inference with tiktoken. Does that make sense? So here we are.

There are more opportunities for optimization here; I just stopped a bit early because, unlike minbpe before it, rustbpe is now simple and fast enough that it is not a significant bottleneck for nanochat.
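The page does not show rustbpe's actual API, so as a sketch of the underlying algorithm (not the library's interface), here is a minimal byte-level BPE training loop in plain Rust. The names `train_bpe`, `pair_counts`, and `apply_merge` are hypothetical, chosen for illustration: token ids 0..=255 are raw bytes, and each merge step fuses the most frequent adjacent pair into a new id starting at 256.

```rust
use std::cmp::Reverse;
use std::collections::HashMap;

/// Count frequencies of adjacent token pairs across all sequences.
fn pair_counts(seqs: &[Vec<u32>]) -> HashMap<(u32, u32), usize> {
    let mut counts = HashMap::new();
    for seq in seqs {
        for w in seq.windows(2) {
            *counts.entry((w[0], w[1])).or_insert(0usize) += 1;
        }
    }
    counts
}

/// Replace every non-overlapping occurrence of `pair` with `new_id`.
fn apply_merge(seq: &[u32], pair: (u32, u32), new_id: u32) -> Vec<u32> {
    let mut out = Vec::with_capacity(seq.len());
    let mut i = 0;
    while i < seq.len() {
        if i + 1 < seq.len() && (seq[i], seq[i + 1]) == pair {
            out.push(new_id);
            i += 2;
        } else {
            out.push(seq[i]);
            i += 1;
        }
    }
    out
}

/// Learn `num_merges` BPE merges from a toy corpus of words.
/// Token ids 0..=255 are raw bytes; merged tokens start at 256.
fn train_bpe(words: &[&str], num_merges: usize) -> Vec<((u32, u32), u32)> {
    let mut seqs: Vec<Vec<u32>> = words
        .iter()
        .map(|w| w.bytes().map(u32::from).collect())
        .collect();
    let mut merges = Vec::new();
    for step in 0..num_merges {
        let counts = pair_counts(&seqs);
        // Most frequent pair; ties go to the numerically smallest pair
        // so the result is deterministic.
        let best = counts
            .iter()
            .max_by_key(|&(&pair, &count)| (count, Reverse(pair)))
            .map(|(&pair, _)| pair);
        let Some(pair) = best else { break };
        let new_id = 256 + step as u32;
        merges.push((pair, new_id));
        seqs = seqs.iter().map(|s| apply_merge(s, pair, new_id)).collect();
    }
    merges
}

fn main() {
    // On "low"/"lower"/"lowest", the first merge is ('l','o'),
    // then the result merges with 'w', and so on.
    let merges = train_bpe(&["low", "lower", "lowest"], 3);
    for (pair, id) in &merges {
        println!("merge {:?} -> {}", pair, id);
    }
}
```

The learned `(pair, new_id)` list is exactly the kind of merge table that can be exported as mergeable ranks for inference with tiktoken; rustbpe's real implementation additionally pre-splits text with a GPT-style regex and is far more optimized than this sketch.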