# nanochat C++ Inference Examples
This directory contains C++ examples for running inference with nanochat models exported to TorchScript and ONNX formats.
## Prerequisites
### For LibTorch (TorchScript) Example
1. **Download LibTorch**
- Visit: https://pytorch.org/get-started/locally/
- Select your platform and download the C++ distribution (LibTorch)
- Extract to a location, e.g., `/opt/libtorch` or `C:\libtorch`
2. **Set CMAKE_PREFIX_PATH**
```bash
export CMAKE_PREFIX_PATH=/path/to/libtorch
```
### For ONNX Runtime Example
1. **Download ONNX Runtime**
- Visit: https://github.com/microsoft/onnxruntime/releases
- Download the appropriate package for your platform
- Extract to a location, e.g., `/opt/onnxruntime` or `C:\onnxruntime`
2. **Set ONNXRUNTIME_DIR**
```bash
export ONNXRUNTIME_DIR=/path/to/onnxruntime
```
## Building
### Linux/macOS
```bash
# Create build directory
mkdir build && cd build
# Configure (LibTorch only)
cmake -DCMAKE_PREFIX_PATH=/path/to/libtorch ..
# Configure (ONNX Runtime only)
cmake -DONNXRUNTIME_DIR=/path/to/onnxruntime -DBUILD_LIBTORCH_EXAMPLE=OFF ..
# Configure (both)
cmake -DCMAKE_PREFIX_PATH=/path/to/libtorch -DONNXRUNTIME_DIR=/path/to/onnxruntime ..
# Build
cmake --build . --config Release
# Or use make directly
make -j$(nproc)
```
### Windows
```bash
# Create build directory
mkdir build
cd build
# Configure
cmake -DCMAKE_PREFIX_PATH=C:\libtorch -DONNXRUNTIME_DIR=C:\onnxruntime ..
# Build
cmake --build . --config Release
```
## Exporting Models
Before running the C++ examples, you need to export your trained nanochat model:
### Export to TorchScript
```bash
# Export SFT model to TorchScript
python -m scripts.export_model --source sft --format torchscript --output model.pt
# Export with specific model tag
python -m scripts.export_model --source mid --model-tag d20 --format torchscript --output model_d20.pt
```
### Export to ONNX
```bash
# Export SFT model to ONNX
python -m scripts.export_model --source sft --format onnx --output model.onnx
# Export both formats at once
python -m scripts.export_model --source sft --format both
```
## Running
### LibTorch Example
```bash
# CPU inference
./libtorch_inference /path/to/model.pt
# CUDA inference (if available)
./libtorch_inference /path/to/model.pt 1
```
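For orientation, here is a rough, hypothetical sketch of what a LibTorch inference program looks like. It assumes the exported model accepts a `[1, T]` int64 tensor of token IDs and returns a single logits tensor of shape `[1, T, vocab_size]`; check the actual example source and your export for the exact signature.
```cpp
// Hypothetical minimal TorchScript inference on CPU. The real example adds
// argument parsing, CUDA support, and an autoregressive generation loop.
#include <torch/script.h>
#include <torch/torch.h>
#include <iostream>
#include <vector>

int main(int argc, char* argv[]) {
    if (argc < 2) { std::cerr << "usage: " << argv[0] << " <model.pt>\n"; return 1; }

    torch::jit::script::Module module = torch::jit::load(argv[1]);
    module.eval();

    // Prompt token IDs, e.g. produced by the Python tokenizer (see below).
    std::vector<int64_t> tokens = {1, 464, 11742, 15150, 315, 3090, 374};
    torch::Tensor input =
        torch::from_blob(tokens.data(), {1, (int64_t)tokens.size()}, torch::kInt64).clone();

    torch::NoGradGuard no_grad;
    // Assumption: the export returns a single logits tensor of shape [1, T, V].
    torch::Tensor logits = module.forward({input}).toTensor();

    // Greedy next token from the last position's logits.
    int64_t next_token = logits[0][(int64_t)tokens.size() - 1].argmax().item<int64_t>();
    std::cout << "Next token (greedy): " << next_token << "\n";
    return 0;
}
```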
### ONNX Runtime Example
```bash
# CPU inference
./onnx_inference /path/to/model.onnx
# CUDA inference (if ONNX Runtime with CUDA is installed)
./onnx_inference /path/to/model.onnx 1
```
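For reference, a comparable hypothetical sketch with the ONNX Runtime C++ API is shown below. It makes the same assumptions about the model interface (one int64 input of shape `[1, T]`, one float logits output of shape `[1, T, vocab_size]`) and queries the input/output names from the session using the allocated-name API available in recent ONNX Runtime releases, rather than hard-coding them.
```cpp
// Hypothetical minimal ONNX Runtime inference on CPU (Linux/macOS; on Windows
// the model path must be passed as a wide string).
#include <onnxruntime_cxx_api.h>
#include <algorithm>
#include <iostream>
#include <vector>

int main(int argc, char* argv[]) {
    if (argc < 2) { std::cerr << "usage: " << argv[0] << " <model.onnx>\n"; return 1; }

    Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "nanochat");
    Ort::SessionOptions opts;
    Ort::Session session(env, argv[1], opts);

    // Query input/output names instead of assuming them.
    Ort::AllocatorWithDefaultOptions alloc;
    Ort::AllocatedStringPtr in_name = session.GetInputNameAllocated(0, alloc);
    Ort::AllocatedStringPtr out_name = session.GetOutputNameAllocated(0, alloc);

    std::vector<int64_t> tokens = {1, 464, 11742, 15150, 315, 3090, 374};
    std::vector<int64_t> shape = {1, (int64_t)tokens.size()};
    Ort::MemoryInfo mem = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
    Ort::Value input = Ort::Value::CreateTensor<int64_t>(
        mem, tokens.data(), tokens.size(), shape.data(), shape.size());

    const char* in_names[] = {in_name.get()};
    const char* out_names[] = {out_name.get()};
    std::vector<Ort::Value> outputs = session.Run(
        Ort::RunOptions{nullptr}, in_names, &input, 1, out_names, 1);

    // Greedy next token from the last position's logits, assuming shape [1, T, V].
    std::vector<int64_t> out_shape = outputs[0].GetTensorTypeAndShapeInfo().GetShape();
    const int64_t T = out_shape[1], V = out_shape[2];
    float* logits = outputs[0].GetTensorMutableData<float>();
    float* last = logits + (T - 1) * V;
    int64_t next_token = std::max_element(last, last + V) - last;
    std::cout << "Next token (greedy): " << next_token << "\n";
    return 0;
}
```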
## Example Output
```
Loading model from: model.pt
✓ Model loaded successfully
Prompt token IDs: 1 464 11742 15150 315 3090 374
--- Single Forward Pass ---
Output shape: [1, 7, 50304]
Next token (greedy): 473
--- Autoregressive Generation ---
Generating 20 tokens...
Generated 10/20 tokens
Generated 20/20 tokens
Generated token IDs: 1 464 11742 15150 315 3090 374 473 ...
✓ Inference completed successfully!
Note: To decode tokens to text, you need to implement
a tokenizer in C++ or use the Python tokenizer.
```
## Tokenization
The C++ examples work with token IDs directly. To convert text to tokens and back:
### Option 1: Use Python for Tokenization
Create a simple Python script to tokenize your input:
```python
from nanochat.checkpoint_manager import load_model
from nanochat.common import compute_init
device_type = "cpu"
ddp, ddp_rank, ddp_local_rank, ddp_world_size, device = compute_init(device_type)
model, tokenizer, meta = load_model("sft", device, phase="eval")
# Tokenize
text = "The chemical formula of water is"
bos = tokenizer.get_bos_token_id()
tokens = tokenizer.encode(text, prepend=bos)
print("Token IDs:", tokens)
# Detokenize
generated_tokens = [1, 464, 11742, 15150, 315, 3090, 374, 473]
text = tokenizer.decode(generated_tokens)
print("Text:", text)
```
### Option 2: Implement Tokenizer in C++
You can implement a BPE tokenizer in C++ using the vocabulary file from the trained model. The nanochat tokenizer is compatible with the tiktoken format; a minimal decode-only sketch follows.
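Decoding is the easier half: each token ID maps to a fixed byte sequence. The sketch below is hypothetical and assumes the vocabulary is available as a tiktoken-style text file with one `<base64 token bytes> <rank>` entry per line; special tokens (such as the BOS token) are not in that file and would need separate handling, and encoding (applying BPE merges) requires more work.
```cpp
// Hypothetical decode-only BPE detokenizer for a tiktoken-style vocab file.
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Decode standard base64 (stops at '=' padding).
static std::string b64decode(const std::string& in) {
    static const std::string chars =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::string out;
    int val = 0, bits = -8;
    for (char c : in) {
        if (c == '=') break;
        val = (val << 6) + (int)chars.find(c);
        bits += 6;
        if (bits >= 0) { out.push_back((char)((val >> bits) & 0xFF)); bits -= 8; }
    }
    return out;
}

int main(int argc, char* argv[]) {
    if (argc < 2) { std::cerr << "usage: " << argv[0] << " <vocab.tiktoken>\n"; return 1; }

    // rank -> raw UTF-8 bytes of the token
    std::unordered_map<int, std::string> vocab;
    std::ifstream f(argv[1]);
    std::string line;
    while (std::getline(f, line)) {
        std::istringstream ss(line);
        std::string b64; int rank;
        if (ss >> b64 >> rank) vocab[rank] = b64decode(b64);
    }

    // Example prompt IDs from above, minus the BOS special token.
    std::vector<int> tokens = {464, 11742, 15150, 315, 3090, 374};
    std::string text;
    for (int t : tokens) {
        auto it = vocab.find(t);
        text += (it != vocab.end()) ? it->second : "<?>";
    }
    std::cout << text << "\n";
    return 0;
}
```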
## Performance Tips
1. **Use CUDA**: If you have a GPU, use CUDA for much faster inference (a small timing harness for comparing configurations is sketched after this list)
2. **Batch Processing**: Modify the examples to process multiple sequences in parallel
3. **KV Cache**: For production use, implement KV caching to avoid recomputing past tokens
4. **Quantization**: Consider quantizing the model for faster inference and lower memory usage
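To check whether any of these changes actually pays off, measure per-forward latency. A minimal sketch using only the standard library; `run_forward` is a placeholder for whatever inference call your example makes:
```cpp
// Measure average per-forward latency to compare CPU vs. CUDA,
// or before/after quantization.
#include <chrono>
#include <functional>
#include <iostream>

double avg_latency_ms(const std::function<void()>& run_forward, int iters = 20) {
    run_forward();  // warm-up: the first call often includes lazy initialization
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) run_forward();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count() / iters;
}

int main() {
    double ms = avg_latency_ms([] { /* call module.forward(...) here */ });
    std::cout << "avg forward latency: " << ms << " ms\n";
}
```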
## Limitations
The exported models have some limitations compared to the Python version:
1. **No Tool Use**: Calculator and other tool features are not included in the exported model
2. **No Special Token Handling**: Special tokens like `<|python_start|>` are not automatically handled
3. **Simplified Generation**: The examples use basic sampling; see the sketch after this list for a simple temperature + top-k alternative
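As a starting point for item 3, here is a small, self-contained sketch of temperature plus top-k sampling over a logits vector. It uses only the standard library and can replace the greedy argmax step in either example.
```cpp
// Temperature + top-k sampling over a single position's logits.
#include <algorithm>
#include <cmath>
#include <iostream>
#include <numeric>
#include <random>
#include <vector>

int sample_top_k(const std::vector<float>& logits, int k, float temperature,
                 std::mt19937& rng) {
    // Indices of the k largest logits.
    std::vector<int> idx(logits.size());
    std::iota(idx.begin(), idx.end(), 0);
    k = std::min<int>(k, (int)logits.size());
    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                      [&](int a, int b) { return logits[a] > logits[b]; });

    // Softmax over the top-k logits at the given temperature.
    std::vector<double> probs(k);
    double max_logit = logits[idx[0]], sum = 0.0;
    for (int i = 0; i < k; ++i) {
        probs[i] = std::exp((logits[idx[i]] - max_logit) / temperature);
        sum += probs[i];
    }
    for (double& p : probs) p /= sum;

    // Draw one of the top-k tokens according to those probabilities.
    std::discrete_distribution<int> dist(probs.begin(), probs.end());
    return idx[dist(rng)];
}

int main() {
    std::mt19937 rng{std::random_device{}()};
    std::vector<float> logits = {2.0f, 1.0f, 0.5f, -1.0f, 3.0f};  // toy example
    std::cout << "sampled index: " << sample_top_k(logits, 3, 0.8f, rng) << "\n";
}
```
In the real examples you would pass the last position's logits in place of the toy vector and reuse a single `rng` across the whole generation loop.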
## Troubleshooting
### LibTorch Issues
- **Error: "libtorch not found"**: Make sure `CMAKE_PREFIX_PATH` points to the LibTorch directory
- **Runtime errors**: Ensure the LibTorch version matches the PyTorch version used for export
- **CUDA errors**: Verify CUDA versions match between LibTorch and your system
### ONNX Runtime Issues
- **Error: "onnxruntime not found"**: Set `ONNXRUNTIME_DIR` environment variable
- **Model loading fails**: Ensure the ONNX model was exported successfully
- **Numerical differences**: Small differences (<1e-3) are normal due to floating-point precision
### General Issues
- **Out of memory**: Reduce batch size or sequence length
- **Slow inference**: Use GPU acceleration or consider model quantization
- **Wrong outputs**: Verify the exported model produces correct outputs in Python first
## Further Reading
- [LibTorch Documentation](https://pytorch.org/cppdocs/)
- [ONNX Runtime Documentation](https://onnxruntime.ai/docs/)
- [nanochat Export Documentation](../../README.md#model-export)
## License
MIT License - see the main repository LICENSE file.