# Quick Start Guide: C++ Inference with nanochat

This guide will get you up and running with C++ inference in under 10 minutes.

## Prerequisites

Choose one of the following:

### Option A: LibTorch (TorchScript)

1. Download LibTorch from https://pytorch.org/get-started/locally/
2. Extract it to a location (e.g., `/opt/libtorch`)
3. Set the environment variable:

```bash
export CMAKE_PREFIX_PATH=/opt/libtorch
```

### Option B: ONNX Runtime

1. Download a release from https://github.com/microsoft/onnxruntime/releases
2. Extract it to a location (e.g., `/opt/onnxruntime`)
3. Set the environment variable:

```bash
export ONNXRUNTIME_DIR=/opt/onnxruntime
```

## Step 1: Export Your Model

From the nanochat root directory:

```bash
# For LibTorch
python -m scripts.export_model --source sft --format torchscript --output model.pt

# For ONNX Runtime
python -m scripts.export_model --source sft --format onnx --output model.onnx
```

This will create `model.pt` or `model.onnx` in the current directory.

## Step 2: Build the C++ Example

```bash
cd examples/cpp_inference
mkdir build && cd build

# For LibTorch only
cmake -DCMAKE_PREFIX_PATH=/opt/libtorch -DBUILD_ONNX_EXAMPLE=OFF ..

# For ONNX Runtime only
cmake -DONNXRUNTIME_DIR=/opt/onnxruntime -DBUILD_LIBTORCH_EXAMPLE=OFF ..

# For both
cmake -DCMAKE_PREFIX_PATH=/opt/libtorch -DONNXRUNTIME_DIR=/opt/onnxruntime ..

# Build
make -j$(nproc)
```

## Step 3: Run Inference

```bash
# LibTorch (CPU)
./libtorch_inference ../../../model.pt

# LibTorch (CUDA)
./libtorch_inference ../../../model.pt 1

# ONNX Runtime (CPU)
./onnx_inference ../../../model.onnx

# ONNX Runtime (CUDA)
./onnx_inference ../../../model.onnx 1
```

## Expected Output

```
Loading model from: model.pt
✓ Model loaded successfully

Prompt token IDs: 1 464 11742 15150 315 3090 374

--- Single Forward Pass ---
Output shape: [1, 7, 50304]
Next token (greedy): 473

--- Autoregressive Generation ---
Generating 20 tokens...
Generated 10/20 tokens
Generated 20/20 tokens

Generated token IDs: 1 464 11742 15150 315 3090 374 473 ...

✓ Inference completed successfully!
```

## Next Steps

### 1. Tokenization

The examples use hardcoded token IDs. To use real text:

**Option A: Python Tokenization**

```python
from nanochat.checkpoint_manager import load_model
from nanochat.common import compute_init

device_type = "cpu"
ddp, ddp_rank, ddp_local_rank, ddp_world_size, device = compute_init(device_type)
model, tokenizer, meta = load_model("sft", device, phase="eval")

# Encode
text = "Hello, how are you?"
bos = tokenizer.get_bos_token_id()
tokens = tokenizer.encode(text, prepend=bos)
print(tokens)  # Use these in C++

# Decode
generated_tokens = [1, 464, 11742, ...]
text = tokenizer.decode(generated_tokens)
print(text)
```

**Option B: C++ Tokenization**

Implement a BPE tokenizer in C++ using the vocabulary file. The nanochat tokenizer is tiktoken-compatible.

### 2. Customize Generation

Modify the C++ code to adjust:

- `temperature`: Controls randomness (0.0 = greedy, 1.0 = default, 2.0 = very random)
- `top_k`: Limits sampling to the top-k tokens (50 is a good default)
- `max_tokens`: Maximum number of tokens to generate
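If you want to wire these parameters up yourself, the sketch below shows one way to apply temperature scaling and top-k filtering to the logits returned by the exported model. This is a minimal sketch assuming a LibTorch build; the `[1, T, vocab_size]` logits layout matches the expected output above, but the helper name `sample_next_token` and its defaults are illustrative, not part of the nanochat examples.

```cpp
#include <torch/torch.h>
#include <limits>

// Hypothetical helper: pick the next token id from logits of shape
// [1, T, vocab_size] using temperature scaling and top-k filtering.
// Not part of the nanochat examples; adapt names and defaults as needed.
int64_t sample_next_token(const torch::Tensor& logits,
                          double temperature = 1.0,
                          int64_t top_k = 50) {
    // Logits for the last position: shape [vocab_size]
    torch::Tensor last = logits.index({0, -1}).to(torch::kFloat32);

    // temperature == 0.0 means greedy decoding
    if (temperature <= 0.0) {
        return last.argmax().item<int64_t>();
    }
    last = last / temperature;

    if (top_k > 0 && top_k < last.size(0)) {
        // Mask everything below the k-th largest logit to -inf
        torch::Tensor kth = std::get<0>(last.topk(top_k)).min();
        last = torch::where(
            last < kth,
            torch::full_like(last, -std::numeric_limits<float>::infinity()),
            last);
    }

    // Sample from the filtered distribution
    torch::Tensor probs = torch::softmax(last, /*dim=*/0);
    return torch::multinomial(probs, /*num_samples=*/1).item<int64_t>();
}
```

In an autoregressive loop you would append the sampled id to the input tokens and run the model again; with a KV cache (see the next section) only the new token needs to be fed in.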
### 3. Production Deployment

For production use:

1. **Implement KV Caching**: Use `ExportableGPTWithCache` for faster generation
2. **Batch Processing**: Modify the code to process multiple sequences in parallel
3. **Error Handling**: Add robust error handling and logging
4. **Model Quantization**: Consider INT8/FP16 quantization for faster inference

## Troubleshooting

### "libtorch not found"

Make sure `CMAKE_PREFIX_PATH` points to the LibTorch directory:

```bash
export CMAKE_PREFIX_PATH=/path/to/libtorch
```

### "onnxruntime not found"

Make sure `ONNXRUNTIME_DIR` is set:

```bash
export ONNXRUNTIME_DIR=/path/to/onnxruntime
```

### "Model loading failed"

Verify that the model was exported successfully:

```bash
python -m scripts.export_model --source sft --format torchscript --output test.pt
```

### "Out of memory"

Reduce the batch size or use the CPU instead of the GPU:

```bash
./libtorch_inference model.pt 0  # Use CPU
```

## Performance Tips

1. **Use CUDA**: GPU inference is typically 10-100x faster than CPU
2. **Optimize Batch Size**: Process multiple sequences together
3. **Use KV Cache**: Avoid recomputing past tokens
4. **Quantize Models**: INT8 quantization can provide a 2-4x speedup

## Getting Help

- See [README.md](README.md) for detailed documentation
- Check [EXPORT_IMPLEMENTATION.md](../../EXPORT_IMPLEMENTATION.md) for implementation details
- Open an issue on GitHub for bugs or questions

## Example: Complete Workflow

```bash
# 1. Train a model (or use an existing one)
cd /path/to/nanochat
bash speedrun.sh

# 2. Export the model
python -m scripts.export_model --source sft --format torchscript --output model.pt

# 3. Build the C++ example
cd examples/cpp_inference
mkdir build && cd build
cmake -DCMAKE_PREFIX_PATH=/opt/libtorch ..
make

# 4. Run inference
./libtorch_inference ../../../model.pt 1

# 5. Integrate into your application
# Copy the inference code into your project and customize as needed
```

That's it! You now have a working C++ inference setup for nanochat models.
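If you want a single-file starting point for step 5 above, here is a minimal, hypothetical `main.cpp` skeleton for the LibTorch path. It assumes the exported TorchScript model takes a single `[1, T]` tensor of token ids and returns logits of shape `[1, T, vocab_size]`, as in the expected output above; adjust the inputs and outputs to match your export, and build it against the same LibTorch used for the example (e.g., by adding another target to the example's CMake build).

```cpp
// main.cpp -- hypothetical integration skeleton for the LibTorch path.
// The hardcoded prompt token ids are placeholders; replace them with real
// tokenizer output (see the Tokenization section above).
#include <torch/torch.h>
#include <torch/script.h>
#include <iostream>
#include <vector>

int main(int argc, char* argv[]) {
    if (argc < 2) {
        std::cerr << "usage: " << argv[0] << " <model.pt>\n";
        return 1;
    }

    // Load the TorchScript model exported by scripts.export_model
    torch::jit::Module model = torch::jit::load(argv[1]);
    model.eval();

    // Example prompt token ids, shape [1, T]
    std::vector<int64_t> prompt = {1, 464, 11742, 15150, 315, 3090, 374};
    torch::Tensor tokens = torch::from_blob(
        prompt.data(), {1, static_cast<int64_t>(prompt.size())}, torch::kInt64).clone();

    // Single forward pass: logits of shape [1, T, vocab_size]
    torch::NoGradGuard no_grad;
    torch::Tensor logits = model.forward({tokens}).toTensor();

    // Greedy choice for the next token
    int64_t next = logits.index({0, -1}).argmax().item<int64_t>();
    std::cout << "Next token (greedy): " << next << std::endl;
    return 0;
}
```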