
Deployment Guide

Once you've trained a model with nanochat, you'll want to deploy it. This guide covers three tiers: local testing, desktop/server production serving, and microcontroller-class edge devices.


🚀 Quick Start (Desktop)

For local testing and development, you can use the built-in web UI:

# Start the chat web server
python -m scripts.chat_web

Then open your browser to the URL shown (usually http://localhost:8000).
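
The web UI is backed by a small HTTP API, so you can also script against it directly. The exact route and payload shape live in scripts/chat_web.py; the call below assumes an OpenAI-style chat-completions endpoint, so adjust it to match the script:

# Talk to the local chat server from the command line
# (endpoint path assumed OpenAI-style; check scripts/chat_web.py for
#  the actual route and payload)
curl http://localhost:8000/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "Hello!"}]}'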


🌐 Desktop / Server Deployment

For production deployment on servers or desktop machines:

Option 1: Ollama (Simple Setup)

# Create the Ollama Modelfile
# Note: Ollama loads GGUF or safetensors weights, so the raw PyTorch
# checkpoint.pth must be converted first (see the export step in the
# workflow below). Architecture details (26 layers, 16 heads, 1024-dim
# embeddings) travel with the converted weights; PARAMETER lines set
# runtime options instead.
cat > Modelfile <<EOF
FROM ./checkpoint.pth
PARAMETER num_ctx 2048
LICENSE MIT
EOF

# Create model in Ollama
ollama create nanochat-model -f Modelfile

# Run inference
ollama run nanochat-model
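
ollama run gives you an interactive prompt; for programmatic access, Ollama also serves a REST API on port 11434 by default (the same API that Q-Lite fronts later in this guide):

# Query Ollama's REST API (listens on localhost:11434 by default)
curl http://localhost:11434/api/generate -d '{
  "model": "nanochat-model",
  "prompt": "Why is the sky blue?",
  "stream": false
}'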

Option 2: vLLM (High Performance)

pip install vllm

# vLLM serves Hugging Face-format model directories, not raw .pth
# checkpoints, so export the trained nanochat model to HF format first
# (./nanochat-hf below is a placeholder for that exported directory)
python -m vllm.entrypoints.openai.api_server \
    --model ./nanochat-hf \
    --host 0.0.0.0 \
    --port 8000
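
The OpenAI-compatible server exposes the standard /v1 routes, so any OpenAI client library works against it; the raw HTTP call looks like this:

# Query the vLLM server's OpenAI-compatible completions route
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "./nanochat-hf", "prompt": "Hello", "max_tokens": 64}'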

🔌 Edge Deployment (Microcontrollers)

For deploying to resource-constrained devices like ESP32, STM32, or Raspberry Pi Pico:

Q-Lite Gateway

Q-Lite is an ultra-lightweight LLM gateway designed specifically for edge devices.

Why Q-Lite?

  • <1MB RAM - Runs on ESP32, Raspberry Pi Zero
  • 69KB binary - Orders of magnitude smaller than the models it fronts
  • Ollama-compatible - Drop-in replacement for the Ollama API (example below)
  • Pure C - Zero dependencies, runs everywhere
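
Because the gateway speaks the Ollama wire protocol, existing Ollama clients can be pointed at it unchanged, for example via the CLI's OLLAMA_HOST variable (the IP below is a placeholder for your gateway's address):

# Point the stock ollama CLI at a Q-Lite gateway instead of a local server
OLLAMA_HOST=http://192.168.1.50:11434 ollama run nanochat-model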

Architecture:

Edge Device (Q-Lite, <1MB RAM)
    ⇅ HTTP request / response
Desktop (Ollama, 128GB RAM)

Example Workflow

# 1. Train your model with nanochat
bash runs/speedrun.sh

# 2. Convert checkpoint to Ollama format (TODO: add export script)
# 3. Serve the model with Ollama on the desktop
#    (Ollama listens on localhost only by default; set OLLAMA_HOST so
#     devices on your LAN can reach it)
OLLAMA_HOST=0.0.0.0 ollama serve

# 4. Deploy gateway to ESP32
cd q-lite/platforms/esp32
idf.py build
idf.py flash

# 5. Start Q-Lite gateway on ESP32
# (It connects to your desktop's Ollama via WiFi)
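
# 6. Smoke-test the gateway from any machine on the same network
# (Q-Lite is Ollama-compatible, so the same /api/generate call works;
#  the IP is a placeholder for your ESP32's address, and the port may
#  differ depending on your Q-Lite configuration)
curl http://192.168.1.50:11434/api/generate -d '{
  "model": "nanochat-model",
  "prompt": "Hello from the edge!",
  "stream": false
}'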

Hardware Examples:

| Device            | RAM    | Flash | Network        | Q-Lite Binary |
|-------------------|--------|-------|----------------|---------------|
| ESP32-S3          | 520KB  | 4MB   | WiFi           | ~100KB        |
| STM32F4           | 128KB  | 512KB | Ethernet       | ~80KB         |
| Raspberry Pi Pico | 264KB  | 2MB   | WiFi (ESP8266) | ~60KB         |
| Desktop           | Plenty | N/A   | Ethernet/WiFi  | 69KB          |

Q-Lite Quick Start

# Clone Q-Lite
git clone https://github.com/RalphBigBear/q-lite.git
cd q-lite

# One-click demo
./examples/quickstart.sh

NanoChat → Q-Lite Integration:

  1. Train with NanoChat (on desktop/HPC)
  2. Serve with Ollama (on desktop/server)
  3. Deploy with Q-Lite (to edge devices)

Use Cases:

  • Home automation (ESP32 gateway + Ollama on NAS)
  • IoT devices (Pico gateway + cloud LLM)
  • Offline inference (Pico + local LLM)
  • Teaching embedded AI (minimal hardware)

📊 Performance Comparison

| Deployment      | Latency | Cost                      | Hardware    |
|-----------------|---------|---------------------------|-------------|
| Ollama Desktop  | ~50ms   | High ($1000 GPU)          | Desktop     |
| vLLM Server     | ~20ms   | Very High ($10K GPU)      | Server      |
| Q-Lite + Ollama | ~100ms  | Low ($10 ESP32 + desktop) | Distributed |
| Q-Lite + Cloud  | ~500ms  | Low (data costs)          | Edge        |

Trade-offs:

  • Desktop: Lowest latency, highest cost
  • Edge: Higher latency, lowest cost, offline capable
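
These figures are ballpark numbers; real latency depends on model size, hardware, and network. A quick way to measure end-to-end latency yourself (point the URL at whichever deployment you are testing):

# Rough end-to-end latency check for any Ollama-compatible endpoint
# (substitute the host/port of the deployment under test)
time curl -s http://localhost:11434/api/generate -d '{
  "model": "nanochat-model",
  "prompt": "ping",
  "stream": false
}' > /dev/null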

🔧 Configuration

Model Size Selection

Nanochat's --depth parameter controls model size:

| Depth      | Parameters | RAM (inference) | Use Case                   |
|------------|------------|-----------------|----------------------------|
| 12         | ~300M      | ~1GB            | Raspberry Pi, weak laptops |
| 20         | ~1B        | ~3GB            | Desktop, gaming PC         |
| 26 (GPT-2) | ~1.6B      | ~5GB            | Server, powerful desktop   |
| 30+        | ~3B+       | ~10GB+          | HPC, cloud                 |

For edge deployment:

  • Use smaller models (d12-d16) for edge devices
  • Run larger models (d20-d26) on desktop/server
  • Q-Lite acts as gateway between edge and model
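
For example, to train an edge-friendly d12 model (the invocation mirrors runs/speedrun.sh; adjust --nproc_per_node to your GPU count):

# Train a small (d12) base model suited to edge-backed serving
torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- --depth=12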

📖 References


Inspired by: OpenClaw Discussion #14132 (https://github.com/openclaw/openclaw/discussions/14132)

Special thanks to @karpathy for proving that minimalism beats feature bloat.