Quick Start Guide: C++ Inference with nanochat
This guide will get you up and running with C++ inference in under 10 minutes.
Prerequisites
Choose one of the following:
Option A: LibTorch (TorchScript)
- Download LibTorch from https://pytorch.org/get-started/locally/
- Extract to a location (e.g., /opt/libtorch)
- Set environment variable:
export CMAKE_PREFIX_PATH=/opt/libtorch
Option B: ONNX Runtime
- Download from https://github.com/microsoft/onnxruntime/releases
- Extract to a location (e.g., /opt/onnxruntime)
- Set environment variable:
export ONNXRUNTIME_DIR=/opt/onnxruntime
Step 1: Export Your Model
From the nanochat root directory:
# For LibTorch
python -m scripts.export_model --source sft --format torchscript --output model.pt
# For ONNX Runtime
python -m scripts.export_model --source sft --format onnx --output model.onnx
This will create model.pt or model.onnx in the current directory.
Step 2: Build the C++ Example
cd examples/cpp_inference
mkdir build && cd build
# For LibTorch only
cmake -DCMAKE_PREFIX_PATH=/opt/libtorch -DBUILD_ONNX_EXAMPLE=OFF ..
# For ONNX Runtime only
cmake -DONNXRUNTIME_DIR=/opt/onnxruntime -DBUILD_LIBTORCH_EXAMPLE=OFF ..
# For both
cmake -DCMAKE_PREFIX_PATH=/opt/libtorch -DONNXRUNTIME_DIR=/opt/onnxruntime ..
# Build
make -j$(nproc)
Step 3: Run Inference
# LibTorch (CPU)
./libtorch_inference ../../../model.pt
# LibTorch (CUDA)
./libtorch_inference ../../../model.pt 1
# ONNX Runtime (CPU)
./onnx_inference ../../../model.onnx
# ONNX Runtime (CUDA)
./onnx_inference ../../../model.onnx 1
Expected Output
Loading model from: model.pt
✓ Model loaded successfully
Prompt token IDs: 1 464 11742 15150 315 3090 374
--- Single Forward Pass ---
Output shape: [1, 7, 50304]
Next token (greedy): 473
--- Autoregressive Generation ---
Generating 20 tokens...
Generated 10/20 tokens
Generated 20/20 tokens
Generated token IDs: 1 464 11742 15150 315 3090 374 473 ...
✓ Inference completed successfully!
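For orientation, here is a minimal LibTorch sketch of the single forward pass and greedy pick shown above. It assumes the exported TorchScript module takes a [1, T] tensor of token IDs and returns [1, T, vocab_size] logits, as the output above suggests; see the full example source for the real implementation.
#include <torch/torch.h>
#include <torch/script.h>
#include <cstdint>
#include <iostream>
#include <vector>

int main(int argc, char** argv) {
    // Load the TorchScript model exported in Step 1 (path passed on the command line).
    torch::jit::script::Module model = torch::jit::load(argv[1]);
    model.eval();

    // Prompt token IDs from the expected output above.
    std::vector<int64_t> ids = {1, 464, 11742, 15150, 315, 3090, 374};
    torch::Tensor input = torch::tensor(ids, torch::kLong).unsqueeze(0);  // shape [1, T]

    // Single forward pass without autograd bookkeeping.
    torch::NoGradGuard no_grad;
    torch::Tensor logits = model.forward({input}).toTensor();  // assumed shape [1, T, vocab_size]

    // Greedy decoding: take the argmax of the last position's logits.
    int64_t next = logits[0][logits.size(1) - 1].argmax(-1).item<int64_t>();
    std::cout << "Next token (greedy): " << next << std::endl;
    return 0;
}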
Next Steps
1. Tokenization
The examples use hardcoded token IDs. To use real text:
Option A: Python Tokenization
from nanochat.checkpoint_manager import load_model
from nanochat.common import compute_init
device_type = "cpu"
ddp, ddp_rank, ddp_local_rank, ddp_world_size, device = compute_init(device_type)
model, tokenizer, meta = load_model("sft", device, phase="eval")
# Encode
text = "Hello, how are you?"
bos = tokenizer.get_bos_token_id()
tokens = tokenizer.encode(text, prepend=bos)
print(tokens) # Use these in C++
# Decode
generated_tokens = [1, 464, 11742, ...]
text = tokenizer.decode(generated_tokens)
print(text)
Option B: C++ Tokenization
Implement a BPE tokenizer in C++ using the vocabulary file. The nanochat tokenizer is tiktoken-compatible.
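Until a C++ tokenizer is in place, one simple stopgap is to encode prompts with the Python tokenizer (Option A), write the IDs to a plain text file, and read that file from C++. A minimal sketch, with a hypothetical file name prompt_tokens.txt:
#include <cstdint>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

// Read whitespace-separated token IDs, e.g. written from Python with
// `print(*tokens, file=open("prompt_tokens.txt", "w"))`.
std::vector<int64_t> read_token_ids(const std::string& path) {
    std::ifstream in(path);
    std::vector<int64_t> ids;
    int64_t id;
    while (in >> id) ids.push_back(id);
    return ids;
}

int main() {
    std::vector<int64_t> tokens = read_token_ids("prompt_tokens.txt");
    std::cout << "Read " << tokens.size() << " token IDs" << std::endl;
    return 0;
}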
2. Customize Generation
Modify the C++ code to adjust:
- temperature: Controls randomness (0.0 = greedy, 1.0 = default, 2.0 = very random)
- top_k: Limits sampling to top-k tokens (50 is a good default)
- max_tokens: Maximum number of tokens to generate
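As an illustration, here is one way temperature and top-k are commonly combined when sampling from the last position's logits, using LibTorch tensor ops. The function name sample_token is just for this sketch; adapt it to however the example code organizes its generation loop.
#include <torch/torch.h>
#include <cstdint>
#include <limits>

// Sample one token ID from a 1-D logits tensor of shape [vocab_size].
int64_t sample_token(torch::Tensor logits, double temperature = 1.0, int64_t top_k = 50) {
    if (temperature <= 0.0) {
        // temperature 0.0 = greedy decoding
        return logits.argmax(-1).item<int64_t>();
    }
    logits = logits / temperature;
    if (top_k > 0 && top_k < logits.size(0)) {
        // Keep only the top-k logits; everything else gets probability zero.
        torch::Tensor kth_value = std::get<0>(logits.topk(top_k))[top_k - 1];
        logits = torch::where(logits < kth_value,
                              torch::full_like(logits, -std::numeric_limits<float>::infinity()),
                              logits);
    }
    torch::Tensor probs = torch::softmax(logits, /*dim=*/-1);
    return torch::multinomial(probs, /*num_samples=*/1).item<int64_t>();
}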
3. Production Deployment
For production use:
- Implement KV Caching: Use ExportableGPTWithCache for faster generation
- Batch Processing: Modify code to process multiple sequences in parallel
- Error Handling: Add robust error handling and logging
- Model Quantization: Consider INT8/FP16 quantization for faster inference
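For batch processing, the core change is feeding a [batch, T] tensor instead of [1, T]. A minimal sketch for equal-length prompts (ragged batches would additionally need padding, and an attention mask if the exported interface supports one):
#include <torch/torch.h>
#include <cstdint>
#include <vector>

// Stack several equal-length token ID sequences into one [batch, T] tensor
// so a single forward pass scores all of them together.
torch::Tensor make_batch(const std::vector<std::vector<int64_t>>& prompts) {
    std::vector<torch::Tensor> rows;
    for (const auto& p : prompts) {
        rows.push_back(torch::tensor(p, torch::kLong));
    }
    return torch::stack(rows, /*dim=*/0);
}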
Troubleshooting
"libtorch not found"
Make sure CMAKE_PREFIX_PATH points to the LibTorch directory:
export CMAKE_PREFIX_PATH=/path/to/libtorch
"onnxruntime not found"
Make sure ONNXRUNTIME_DIR is set:
export ONNXRUNTIME_DIR=/path/to/onnxruntime
"Model loading failed"
Verify the model was exported successfully:
python -m scripts.export_model --source sft --format torchscript --output test.pt
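If the export succeeds but the C++ side still fails, wrapping the load in a try/catch and printing the error message usually narrows it down (LibTorch shown):
#include <torch/script.h>
#include <iostream>

int main(int argc, char** argv) {
    try {
        torch::jit::script::Module model = torch::jit::load(argv[1]);
        std::cout << "Model loaded OK" << std::endl;
    } catch (const c10::Error& e) {
        std::cerr << "Failed to load model: " << e.what() << std::endl;
        return 1;
    }
    return 0;
}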
"Out of memory"
Reduce batch size or use CPU instead of GPU:
./libtorch_inference model.pt 0 # Use CPU
Performance Tips
- Use CUDA: GPU inference is 10-100x faster than CPU
- Optimize Batch Size: Process multiple sequences together
- Use KV Cache: Avoid recomputing past tokens
- Quantize Models: INT8 quantization can provide 2-4x speedup
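As one concrete example of combining the GPU and reduced-precision tips (FP16 is mentioned under Production Deployment above), the LibTorch module can be moved to GPU and cast to half precision in one call. This is a sketch only; whether the exported graph stays numerically stable in FP16 should be checked against the FP32 outputs, and input tensors must be moved to the same device.
#include <torch/script.h>

// Sketch: run the TorchScript module on GPU in half precision.
// Token ID inputs must also be moved to the GPU, e.g. input.to(torch::kCUDA).
void to_fp16_cuda(torch::jit::script::Module& model) {
    model.to(torch::kCUDA, torch::kHalf);
    model.eval();
}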
Getting Help
- See README.md for detailed documentation
- Check EXPORT_IMPLEMENTATION.md for implementation details
- Open an issue on GitHub for bugs or questions
Example: Complete Workflow
# 1. Train a model (or use existing)
cd /path/to/nanochat
bash speedrun.sh
# 2. Export the model
python -m scripts.export_model --source sft --format torchscript --output model.pt
# 3. Build C++ example
cd examples/cpp_inference
mkdir build && cd build
cmake -DCMAKE_PREFIX_PATH=/opt/libtorch ..
make
# 4. Run inference
./libtorch_inference ../../../model.pt 1
# 5. Integrate into your application
# Copy the inference code into your project and customize as needed
That's it! You now have a working C++ inference setup for nanochat models.