# nanochat C++ Inference Examples

This directory contains C++ examples for running inference with nanochat models exported to TorchScript and ONNX formats.

## Prerequisites

### For LibTorch (TorchScript) Example

1. **Download LibTorch**
   - Visit: https://pytorch.org/get-started/locally/
   - Select your platform and download the C++ distribution (LibTorch)
   - Extract to a location, e.g., `/opt/libtorch` or `C:\libtorch`

2. **Set CMAKE_PREFIX_PATH**
   ```bash
   export CMAKE_PREFIX_PATH=/path/to/libtorch
   ```

### For ONNX Runtime Example

1. **Download ONNX Runtime**
   - Visit: https://github.com/microsoft/onnxruntime/releases
   - Download the appropriate package for your platform
   - Extract to a location, e.g., `/opt/onnxruntime` or `C:\onnxruntime`

2. **Set ONNXRUNTIME_DIR**
   ```bash
   export ONNXRUNTIME_DIR=/path/to/onnxruntime
   ```

## Building

### Linux/macOS

```bash
# Create build directory
mkdir build && cd build

# Configure (LibTorch only)
cmake -DCMAKE_PREFIX_PATH=/path/to/libtorch ..

# Configure (ONNX Runtime only)
cmake -DONNXRUNTIME_DIR=/path/to/onnxruntime -DBUILD_LIBTORCH_EXAMPLE=OFF ..

# Configure (both)
cmake -DCMAKE_PREFIX_PATH=/path/to/libtorch -DONNXRUNTIME_DIR=/path/to/onnxruntime ..

# Build
cmake --build . --config Release

# Or use make directly
make -j$(nproc)
```

### Windows

```bash
# Create build directory
mkdir build
cd build

# Configure
cmake -DCMAKE_PREFIX_PATH=C:\libtorch -DONNXRUNTIME_DIR=C:\onnxruntime ..

# Build
cmake --build . --config Release
```

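If configuration succeeds but linking or runtime errors appear later, it can help to rule out a broken LibTorch install with a standalone smoke test before debugging the examples themselves. The sketch below is illustrative and not part of this directory; build it against the same LibTorch you pass via `CMAKE_PREFIX_PATH`.

```cpp
// sanity_check.cpp -- minimal LibTorch smoke test (illustrative, not part of this repo).
#include <torch/torch.h>
#include <iostream>

int main() {
    // Exercise the library with a trivial tensor op.
    torch::Tensor a = torch::rand({2, 3});
    torch::Tensor b = torch::rand({3, 2});
    std::cout << "matmul result sizes: " << a.matmul(b).sizes() << std::endl;

    // Report whether this LibTorch build can see a CUDA device.
    std::cout << "CUDA available: " << (torch::cuda::is_available() ? "yes" : "no") << std::endl;
    return 0;
}
```

If this compiles and runs, remaining build problems are more likely CMake configuration issues than a broken LibTorch installation.
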
## Exporting Models

Before running the C++ examples, you need to export your trained nanochat model:

### Export to TorchScript

```bash
# Export SFT model to TorchScript
python -m scripts.export_model --source sft --format torchscript --output model.pt

# Export with specific model tag
python -m scripts.export_model --source mid --model-tag d20 --format torchscript --output model_d20.pt
```

### Export to ONNX

```bash
# Export SFT model to ONNX
python -m scripts.export_model --source sft --format onnx --output model.onnx

# Export both formats at once
python -m scripts.export_model --source sft --format both
```

## Running

### LibTorch Example

```bash
# CPU inference
./libtorch_inference /path/to/model.pt

# CUDA inference (if available)
./libtorch_inference /path/to/model.pt 1
```

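For orientation, the core of a TorchScript forward pass in C++ looks roughly like the sketch below. This is not the code of `libtorch_inference` itself; it assumes the exported module takes a single `[1, T]` int64 tensor of token IDs and returns `[1, T, vocab]` logits, which matches the example output shown further down, but check the example source in this directory for the authoritative version.

```cpp
// Sketch of a single TorchScript forward pass (illustrative, not the shipped example).
// Assumes the export takes a [1, T] int64 tensor and returns [1, T, vocab_size] logits.
#include <torch/script.h>
#include <torch/torch.h>
#include <iostream>
#include <vector>

int main(int argc, char* argv[]) {
    if (argc < 2) { std::cerr << "usage: sketch <model.pt>\n"; return 1; }

    // Load the exported TorchScript module.
    torch::jit::script::Module module = torch::jit::load(argv[1]);
    module.eval();

    // Prompt token IDs (the specific IDs here are just an example).
    std::vector<int64_t> prompt = {1, 464, 11742, 15150, 315, 3090, 374};
    torch::Tensor input = torch::tensor(prompt, torch::kInt64).unsqueeze(0);  // [1, T]

    // Forward pass without gradient tracking.
    torch::NoGradGuard no_grad;
    torch::Tensor logits = module.forward({input}).toTensor();  // [1, T, vocab]

    // Greedy next token: argmax over the logits at the last position.
    int64_t next = logits.index({0, -1}).argmax().item<int64_t>();
    std::cout << "Next token (greedy): " << next << std::endl;
    return 0;
}
```

Autoregressive generation, as in the example output below, repeats this step: append the chosen token to the input and run the model again (or add KV caching, see Performance Tips). Moving the module and inputs to the GPU with `module.to(torch::kCUDA)` and `input.to(torch::kCUDA)` enables CUDA inference.
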
### ONNX Runtime Example

```bash
# CPU inference
./onnx_inference /path/to/model.onnx

# CUDA inference (if ONNX Runtime with CUDA is installed)
./onnx_inference /path/to/model.onnx 1
```

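The equivalent pattern with the ONNX Runtime C++ API is sketched below. Again this is illustrative rather than the shipped example; in particular the tensor names `"input_ids"` and `"logits"` and the `[1, T]` int64 input are assumptions about the export, so query the session for the real names (e.g. with `GetInputNameAllocated`) or read the example source.

```cpp
// Sketch of a single ONNX Runtime forward pass (illustrative, not the shipped example).
// The tensor names "input_ids"/"logits" are assumptions about the exported graph.
#include <onnxruntime_cxx_api.h>
#include <algorithm>
#include <array>
#include <iostream>
#include <vector>

int main(int argc, char* argv[]) {
    if (argc < 2) { std::cerr << "usage: sketch <model.onnx>\n"; return 1; }

    Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "nanochat");
    Ort::SessionOptions options;
    // Note: on Windows the model path must be a wide string.
    Ort::Session session(env, argv[1], options);

    // Prompt token IDs as an int64 [1, T] tensor.
    std::vector<int64_t> prompt = {1, 464, 11742, 15150, 315, 3090, 374};
    std::array<int64_t, 2> shape = {1, static_cast<int64_t>(prompt.size())};
    Ort::MemoryInfo mem = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
    Ort::Value input = Ort::Value::CreateTensor<int64_t>(
        mem, prompt.data(), prompt.size(), shape.data(), shape.size());

    // Run the graph (input/output names assumed, not guaranteed by the export).
    const char* input_names[] = {"input_ids"};
    const char* output_names[] = {"logits"};
    auto outputs = session.Run(Ort::RunOptions{nullptr},
                               input_names, &input, 1, output_names, 1);

    // Greedy next token: argmax over the last position of the [1, T, vocab] logits.
    std::vector<int64_t> out_shape = outputs[0].GetTensorTypeAndShapeInfo().GetShape();
    const float* logits = outputs[0].GetTensorData<float>();
    const int64_t T = out_shape[1], vocab = out_shape[2];
    const float* last = logits + (T - 1) * vocab;
    std::cout << "Next token (greedy): "
              << (std::max_element(last, last + vocab) - last) << std::endl;
    return 0;
}
```
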
## Example Output

```
Loading model from: model.pt
✓ Model loaded successfully

Prompt token IDs: 1 464 11742 15150 315 3090 374

--- Single Forward Pass ---
Output shape: [1, 7, 50304]
Next token (greedy): 473

--- Autoregressive Generation ---
Generating 20 tokens...
Generated 10/20 tokens
Generated 20/20 tokens

Generated token IDs: 1 464 11742 15150 315 3090 374 473 ...

✓ Inference completed successfully!

Note: To decode tokens to text, you need to implement
a tokenizer in C++ or use the Python tokenizer.
```

## Tokenization

The C++ examples work with token IDs directly. To convert text to tokens and back:

### Option 1: Use Python for Tokenization

Create a simple Python script to tokenize your input:

```python
from nanochat.checkpoint_manager import load_model
from nanochat.common import compute_init

device_type = "cpu"
ddp, ddp_rank, ddp_local_rank, ddp_world_size, device = compute_init(device_type)
model, tokenizer, meta = load_model("sft", device, phase="eval")

# Tokenize
text = "The chemical formula of water is"
bos = tokenizer.get_bos_token_id()
tokens = tokenizer.encode(text, prepend=bos)
print("Token IDs:", tokens)

# Detokenize
generated_tokens = [1, 464, 11742, 15150, 315, 3090, 374, 473]
text = tokenizer.decode(generated_tokens)
print("Text:", text)
```

### Option 2: Implement Tokenizer in C++

You can implement a BPE tokenizer in C++ using the vocabulary file from the trained model. The nanochat tokenizer is compatible with the tiktoken format.

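Decoding is the easier half and is enough to turn generated token IDs back into text: a tiktoken-style vocabulary file stores one base64-encoded byte sequence and its integer rank per line, so a decoder just loads that mapping and concatenates bytes. The sketch below assumes such a file is available (the `tokenizer.tiktoken` path is a placeholder, and special tokens are ignored); full encoding would additionally require the BPE merge loop and the regex pre-tokenizer.

```cpp
// Sketch: decode token IDs using a tiktoken-style vocab file (illustrative).
// Each line is "<base64-encoded bytes> <rank>"; special tokens are not handled,
// and the file path below is a placeholder for wherever your tokenizer was saved.
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Decode standard base64 (ignores '=' padding).
static std::string base64_decode(const std::string& in) {
    static const std::string chars =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::string out;
    int val = 0, bits = -8;
    for (char c : in) {
        if (c == '=') break;
        val = (val << 6) + static_cast<int>(chars.find(c));
        bits += 6;
        if (bits >= 0) { out.push_back(static_cast<char>((val >> bits) & 0xFF)); bits -= 8; }
    }
    return out;
}

// Load the rank -> bytes mapping from a tiktoken-style file.
std::unordered_map<int, std::string> load_vocab(const std::string& path) {
    std::unordered_map<int, std::string> vocab;
    std::ifstream file(path);
    std::string line;
    while (std::getline(file, line)) {
        std::istringstream fields(line);
        std::string b64;
        int rank;
        if (fields >> b64 >> rank) vocab[rank] = base64_decode(b64);
    }
    return vocab;
}

int main() {
    auto vocab = load_vocab("tokenizer.tiktoken");  // placeholder path
    std::vector<int> tokens = {464, 11742, 15150, 315, 3090, 374};
    std::string text;
    for (int t : tokens) text += vocab.count(t) ? vocab[t] : std::string();
    std::cout << text << std::endl;
    return 0;
}
```
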
## Performance Tips

1. **Use CUDA**: If you have a GPU, use CUDA for much faster inference
2. **Batch Processing**: Modify the examples to process multiple sequences in parallel
3. **KV Cache**: For production use, implement KV caching to avoid recomputing past tokens
4. **Quantization**: Consider quantizing the model for faster inference and lower memory usage

## Limitations

The exported models have some limitations compared to the Python version:

1. **No Tool Use**: Calculator and other tool features are not included in the exported model
2. **No Special Token Handling**: Special tokens like `<|python_start|>` are not automatically handled
3. **Simplified Generation**: The examples use basic sampling; you may want to implement more sophisticated decoding strategies (see the sketch below)

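As a starting point beyond greedy argmax, a self-contained temperature plus top-k sampler over one position's logits can look like this (plain C++, no framework dependency; the function name and signature are just for illustration):

```cpp
// Temperature + top-k sampling over a single position's logits (illustrative sketch).
#include <algorithm>
#include <cmath>
#include <numeric>
#include <random>
#include <vector>

// Returns a token ID drawn from the top-k logits after temperature scaling.
int sample_top_k(const std::vector<float>& logits, float temperature, int k, std::mt19937& rng) {
    // Rank token IDs by logit (descending) and keep only the top k.
    std::vector<int> ids(logits.size());
    std::iota(ids.begin(), ids.end(), 0);
    k = std::min<int>(k, static_cast<int>(ids.size()));
    std::partial_sort(ids.begin(), ids.begin() + k, ids.end(),
                      [&](int a, int b) { return logits[a] > logits[b]; });

    // Softmax over the kept logits; dividing by temperature before exp is
    // equivalent to softmax(logits / temperature) restricted to the top k.
    std::vector<double> weights(k);
    const double max_logit = logits[ids[0]];
    for (int i = 0; i < k; ++i)
        weights[i] = std::exp((logits[ids[i]] - max_logit) / temperature);

    // discrete_distribution normalizes the weights for us.
    std::discrete_distribution<int> dist(weights.begin(), weights.end());
    return ids[dist(rng)];
}
```

At each generation step, call it on the last position's logits, e.g. `int next = sample_top_k(last_logits, 0.8f, 50, rng);` with a `std::mt19937 rng(std::random_device{}());` created once.
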
## Troubleshooting

### LibTorch Issues

- **Error: "libtorch not found"**: Make sure `CMAKE_PREFIX_PATH` points to the LibTorch directory
- **Runtime errors**: Ensure the LibTorch version matches the PyTorch version used for export
- **CUDA errors**: Verify CUDA versions match between LibTorch and your system

### ONNX Runtime Issues

- **Error: "onnxruntime not found"**: Set the `ONNXRUNTIME_DIR` environment variable
- **Model loading fails**: Ensure the ONNX model was exported successfully
- **Numerical differences**: Small differences (<1e-3) are normal due to floating-point precision

### General Issues

- **Out of memory**: Reduce batch size or sequence length
- **Slow inference**: Use GPU acceleration or consider model quantization
- **Wrong outputs**: Verify the exported model produces correct outputs in Python first

## Further Reading

- [LibTorch Documentation](https://pytorch.org/cppdocs/)
- [ONNX Runtime Documentation](https://onnxruntime.ai/docs/)
- [nanochat Export Documentation](../../README.md#model-export)

## License

MIT License - see the main repository LICENSE file.