nanochat C++ Inference Examples

This directory contains C++ examples for running inference with nanochat models exported to TorchScript and ONNX formats.

Prerequisites

For LibTorch (TorchScript) Example

  1. Download LibTorch from https://pytorch.org/get-started/locally/ (pick the build matching your CUDA version, or the CPU-only build)

  2. Set CMAKE_PREFIX_PATH

    export CMAKE_PREFIX_PATH=/path/to/libtorch
    

For ONNX Runtime Example

  1. Download ONNX Runtime from https://github.com/microsoft/onnxruntime/releases

  2. Set ONNXRUNTIME_DIR

    export ONNXRUNTIME_DIR=/path/to/onnxruntime
    

Building

Linux/macOS

# Create build directory
mkdir build && cd build

# Configure (LibTorch only)
cmake -DCMAKE_PREFIX_PATH=/path/to/libtorch ..

# Configure (ONNX Runtime only)
cmake -DONNXRUNTIME_DIR=/path/to/onnxruntime -DBUILD_LIBTORCH_EXAMPLE=OFF ..

# Configure (both)
cmake -DCMAKE_PREFIX_PATH=/path/to/libtorch -DONNXRUNTIME_DIR=/path/to/onnxruntime ..

# Build
cmake --build . --config Release

# Or use make directly
make -j$(nproc)

Windows

# Create build directory
mkdir build
cd build

# Configure
cmake -DCMAKE_PREFIX_PATH=C:\libtorch -DONNXRUNTIME_DIR=C:\onnxruntime ..

# Build
cmake --build . --config Release

Exporting Models

Before running the C++ examples, you need to export your trained nanochat model:

Export to TorchScript

# Export SFT model to TorchScript
python -m scripts.export_model --source sft --format torchscript --output model.pt

# Export with specific model tag
python -m scripts.export_model --source mid --model-tag d20 --format torchscript --output model_d20.pt

Export to ONNX

# Export SFT model to ONNX
python -m scripts.export_model --source sft --format onnx --output model.onnx

# Export both formats at once
python -m scripts.export_model --source sft --format both

Running

LibTorch Example

# CPU inference
./libtorch_inference /path/to/model.pt

# CUDA inference (if available)
./libtorch_inference /path/to/model.pt 1

ONNX Runtime Example

# CPU inference
./onnx_inference /path/to/model.onnx

# CUDA inference (if ONNX Runtime with CUDA is installed)
./onnx_inference /path/to/model.onnx 1

Example Output

Loading model from: model.pt
✓ Model loaded successfully

Prompt token IDs: 1 464 11742 15150 315 3090 374

--- Single Forward Pass ---
Output shape: [1, 7, 50304]
Next token (greedy): 473

--- Autoregressive Generation ---
Generating 20 tokens...
  Generated 10/20 tokens
  Generated 20/20 tokens

Generated token IDs: 1 464 11742 15150 315 3090 374 473 ...

✓ Inference completed successfully!

Note: To decode tokens to text, you need to implement
      a tokenizer in C++ or use the Python tokenizer.

Tokenization

The C++ examples work with token IDs directly. To convert text to tokens and back:

Option 1: Use Python for Tokenization

Create a simple Python script to tokenize your input:

from nanochat.checkpoint_manager import load_model
from nanochat.common import compute_init

device_type = "cpu"
ddp, ddp_rank, ddp_local_rank, ddp_world_size, device = compute_init(device_type)
model, tokenizer, meta = load_model("sft", device, phase="eval")

# Tokenize
text = "The chemical formula of water is"
bos = tokenizer.get_bos_token_id()
tokens = tokenizer.encode(text, prepend=bos)
print("Token IDs:", tokens)

# Detokenize
generated_tokens = [1, 464, 11742, 15150, 315, 3090, 374, 473]
text = tokenizer.decode(generated_tokens)
print("Text:", text)

Option 2: Implement Tokenizer in C++

You can implement a BPE tokenizer in C++ using the vocabulary file from the trained model; the nanochat tokenizer is compatible with the tiktoken format.
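
As a starting point, here is a decode-only sketch. It assumes the vocabulary has been dumped to a tiktoken-style text file where each line is a base64-encoded token followed by its integer rank; the file name vocab.tiktoken and the dump step are illustrative assumptions, not files shipped with the repository, and special tokens are not handled.

// Decode-only sketch: map token IDs back to text using a tiktoken-style
// vocabulary file. Each line is assumed to be "<base64 token bytes> <rank>".
// The file name "vocab.tiktoken" is illustrative; special tokens such as
// <|python_start|> are not handled here.
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

static std::string base64_decode(const std::string& in) {
    static const std::string chars =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::string out;
    int val = 0, bits = -8;
    for (char c : in) {
        if (c == '=') break;
        size_t pos = chars.find(c);
        if (pos == std::string::npos) continue;
        val = (val << 6) + static_cast<int>(pos);
        bits += 6;
        if (bits >= 0) {
            out.push_back(static_cast<char>((val >> bits) & 0xFF));
            bits -= 8;
        }
    }
    return out;
}

int main() {
    // Load rank -> token bytes.
    std::unordered_map<int, std::string> vocab;
    std::ifstream file("vocab.tiktoken");
    std::string line;
    while (std::getline(file, line)) {
        std::istringstream iss(line);
        std::string b64;
        int rank;
        if (iss >> b64 >> rank) vocab[rank] = base64_decode(b64);
    }

    // Decode the generated IDs from the example output above.
    std::vector<int> tokens = {464, 11742, 15150, 315, 3090, 374, 473};
    std::string text;
    for (int id : tokens) {
        auto it = vocab.find(id);
        text += (it != vocab.end()) ? it->second : "<?>";
    }
    std::cout << text << std::endl;
    return 0;
}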

Performance Tips

  1. Use CUDA: If you have a GPU, pass the CUDA device argument (see Running above) for much faster inference
  2. Batch Processing: Modify the examples to process multiple sequences in parallel (see the batched forward-pass sketch after this list)
  3. KV Cache: For production use, implement KV caching to avoid recomputing attention over past tokens at every generation step
  4. Quantization: Consider quantizing the model for faster inference and lower memory usage
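
For tip 2, here is a minimal LibTorch sketch of a batched forward pass. It assumes the exported TorchScript module accepts a [batch, seq_len] int64 tensor of token IDs and returns [batch, seq_len, vocab_size] logits, as in the single-sequence example; the prompts must share the same length because padding and attention masking are not handled.

// Batched greedy next-token prediction with LibTorch (illustrative sketch,
// not the shipped libtorch_inference.cpp). Assumes model.pt takes int64
// token IDs shaped [batch, seq_len] and returns [batch, seq_len, vocab] logits.
#include <torch/script.h>
#include <iostream>
#include <vector>

int main() {
    torch::jit::script::Module module = torch::jit::load("model.pt");
    module.eval();

    // Two equal-length prompts (IDs taken from the example output above).
    std::vector<std::vector<int64_t>> prompts = {
        {1, 464, 11742, 15150, 315, 3090, 374},
        {1, 464, 11742, 15150, 315, 3090, 374},
    };
    const int64_t B = prompts.size(), T = prompts[0].size();

    // Flatten into one contiguous buffer and wrap it as a [B, T] tensor.
    std::vector<int64_t> flat;
    for (const auto& p : prompts) flat.insert(flat.end(), p.begin(), p.end());
    torch::Tensor batch = torch::from_blob(flat.data(), {B, T}, torch::kInt64).clone();

    torch::NoGradGuard no_grad;
    torch::Tensor logits = module.forward({batch}).toTensor();  // [B, T, vocab]
    torch::Tensor next = logits.select(1, T - 1).argmax(-1);    // [B] greedy next tokens
    std::cout << "Next tokens: " << next << std::endl;
    return 0;
}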

Limitations

The exported models have some limitations compared to the Python version:

  1. No Tool Use: Calculator and other tool features are not included in the exported model
  2. No Special Token Handling: Special tokens like <|python_start|> are not automatically handled
  3. Simplified Generation: The examples use basic sampling; you may want to implement more sophisticated decoding strategies (see the top-k sampling sketch below)
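
For limitation 3, the sketch below shows temperature plus top-k sampling from the last position's logits with LibTorch. The logits tensor is assumed to be shaped [1, seq_len, vocab_size] as in the example output above, the default temperature and k are arbitrary illustrative values, and the function would be called in place of the greedy argmax step in a generation loop.

// Temperature + top-k sampling from the final position's logits (sketch).
// `logits` is assumed to be shaped [1, seq_len, vocab_size]; temperature
// and k are illustrative defaults, not tuned settings.
#include <torch/torch.h>

int64_t sample_top_k(const torch::Tensor& logits, double temperature = 0.8, int64_t k = 50) {
    // Take the logits of the last position and scale by temperature.
    torch::Tensor last = logits.index({0, -1}) / temperature;           // [vocab_size]

    // Keep only the k highest-scoring tokens.
    auto [top_values, top_indices] = last.topk(k);                      // values [k], indices [k]

    // Convert the retained scores to probabilities and draw one sample.
    torch::Tensor probs = torch::softmax(top_values, -1);               // [k]
    torch::Tensor pick = torch::multinomial(probs, /*num_samples=*/1);  // [1]

    // Map the sampled position back to the original vocabulary ID.
    return top_indices[pick.item<int64_t>()].item<int64_t>();
}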

Troubleshooting

LibTorch Issues

  • Error: "libtorch not found": Make sure CMAKE_PREFIX_PATH points to the LibTorch directory
  • Runtime errors: Ensure the LibTorch version matches the PyTorch version used for export
  • CUDA errors: Verify CUDA versions match between LibTorch and your system

ONNX Runtime Issues

  • Error: "onnxruntime not found": Set ONNXRUNTIME_DIR environment variable
  • Model loading fails: Ensure the ONNX model was exported successfully
  • Numerical differences: Small differences (<1e-3) are normal due to floating-point precision

General Issues

  • Out of memory: Reduce batch size or sequence length
  • Slow inference: Use GPU acceleration or consider model quantization
  • Wrong outputs: Verify the exported model produces correct outputs in Python first

Further Reading

  • LibTorch (PyTorch C++ API) documentation: https://pytorch.org/cppdocs/
  • ONNX Runtime documentation: https://onnxruntime.ai/docs/

License

MIT License - see the main repository LICENSE file.