# Deployment Guide

**Once you've trained a model with nanochat, you'll want to deploy it.**

---

## 🚀 Quick Start (Desktop)

For local testing and development, you can use the built-in web UI:

```bash
# Start the chat web server
python -m scripts.chat_web
```

Then open your browser to the URL shown (usually `http://localhost:8000`).
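
If you'd rather script against the server than click through the UI, you can POST to it directly. A minimal sketch, assuming the server exposes an OpenAI-style `/chat/completions` route on the same port (check `scripts/chat_web.py` for the actual route and payload shape):

```bash
# Hypothetical request: the route and JSON shape are assumptions,
# not a confirmed nanochat API; verify against scripts/chat_web.py
curl -s http://localhost:8000/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}]}'
```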

---

## 🌐 Desktop / Server Deployment

For production deployment on servers or desktop machines:

### Option 1: Ollama (Recommended)

Ollama cannot load a raw PyTorch `.pth` checkpoint; the model first has to be exported to GGUF (step 2 of the workflow below). Architecture details such as layer count travel inside the GGUF file itself, while Modelfile `PARAMETER` lines set runtime options:

```bash
# Create Ollama Modelfile
# (./nanochat.gguf is a placeholder for the exported checkpoint;
# nanochat does not ship a GGUF exporter yet)
cat > Modelfile <<EOF
FROM ./nanochat.gguf
PARAMETER temperature 0.8
PARAMETER num_ctx 2048
LICENSE MIT
EOF

# Create model in Ollama
ollama create nanochat-model -f Modelfile

# Run inference
ollama run nanochat-model
```
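
Once the model is created, anything that speaks HTTP (including the edge gateways below) can query it through Ollama's API, which listens on port 11434 by default:

```bash
# Query the model over Ollama's HTTP API
curl http://localhost:11434/api/generate -d '{
  "model": "nanochat-model",
  "prompt": "Why is the sky blue?"
}'
```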

### Option 2: vLLM (High Performance)

vLLM loads Hugging Face-format model directories rather than raw `.pth` checkpoints, so export the nanochat model to HF format first:

```bash
pip install vllm

# Serve the exported model with vLLM's OpenAI-compatible server
# (./nanochat-hf is a placeholder for the exported model directory)
python -m vllm.entrypoints.openai.api_server \
    --model ./nanochat-hf \
    --host 0.0.0.0 \
    --port 8000
```
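
The server then accepts standard OpenAI-style requests:

```bash
# OpenAI-style chat completion against the vLLM server;
# "model" must match the --model value used above
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "./nanochat-hf",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```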

---

## 🔌 Edge Deployment (Microcontrollers)

For deploying to resource-constrained devices like ESP32, STM32, or Raspberry Pi Pico:

### Q-Lite Gateway

**[Q-Lite](https://github.com/RalphBigBear/q-lite)** is an ultra-lightweight LLM gateway designed specifically for edge devices.

**Why Q-Lite?**
- **<1MB RAM** - Runs on ESP32, Raspberry Pi Zero
- **69KB binary** - Smaller than most LLM models
- **Ollama-compatible** - Drop-in replacement for Ollama API
- **Pure C** - Zero dependencies, runs everywhere

**Architecture**:
```
Edge Device (Q-Lite, <1MB RAM)
    ↓ HTTP
Desktop (Ollama, 128GB RAM)
    ↓
Response
```
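
The hop between the edge device and the desktop is plain HTTP against Ollama's API, so you can sanity-check the desktop side from any shell before involving a microcontroller (`<desktop-ip>` is a placeholder for your Ollama machine's LAN address):

```bash
# The same request an edge gateway would forward;
# replace <desktop-ip> with the Ollama machine's LAN address
curl http://<desktop-ip>:11434/api/generate -d '{
  "model": "nanochat-model",
  "prompt": "Hello from the edge!"
}'
```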

### Example Workflow

```bash
# 1. Train your model with nanochat
bash runs/speedrun.sh

# 2. Convert checkpoint to Ollama format (TODO: add export script)

# 3. Serve model with Ollama on desktop, listening on the LAN
#    (the default bind is 127.0.0.1, which edge devices can't reach)
OLLAMA_HOST=0.0.0.0 ollama serve

# 4. Build and flash the gateway to the ESP32 (requires ESP-IDF)
cd q-lite/platforms/esp32
idf.py build
idf.py flash

# 5. Start Q-Lite gateway on ESP32
# (It connects to your desktop's Ollama via WiFi)
```
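
Before flashing, confirm that Ollama is actually reachable from another machine on the LAN; `/api/tags` lists the installed models and makes a cheap connectivity probe:

```bash
# Run from any other machine on the network: a JSON model list
# confirms that the OLLAMA_HOST=0.0.0.0 bind took effect
curl http://<desktop-ip>:11434/api/tags
```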

**Hardware Examples**:

| Device | RAM | Flash | Network | Q-Lite Binary |
|--------|-----|-------|---------|---------------|
| ESP32-S3 | 520KB | 4MB | WiFi | ~100KB |
| STM32F4 | 128KB | 512KB | Ethernet | ~80KB |
| Raspberry Pi Pico | 264KB | 2MB | WiFi (ESP8266) | ~60KB |
| Desktop | Plenty | N/A | Ethernet/WiFi | 69KB |

### Q-Lite Quick Start

```bash
# Clone Q-Lite
git clone https://github.com/RalphBigBear/q-lite.git
cd q-lite

# One-click demo
./examples/quickstart.sh
```

**NanoChat → Q-Lite Integration**:

1. **Train** with NanoChat (on desktop/HPC)
2. **Serve** with Ollama (on desktop/server)
3. **Deploy** with Q-Lite (to edge devices)

**Use Cases**:
- Home automation (ESP32 gateway + Ollama on NAS)
- IoT devices (Pico gateway + cloud LLM)
- Offline inference (Pico + local LLM)
- Teaching embedded AI (minimal hardware)

---

## 📊 Performance Comparison

| Deployment | Latency | Cost | Hardware |
|------------|---------|------|----------|
| Ollama Desktop | ~50ms | High ($1000 GPU) | Desktop |
| vLLM Server | ~20ms | Very High ($10K GPU) | Server |
| Q-Lite + Ollama | ~100ms | Low ($10 ESP32 + desktop) | Distributed |
| Q-Lite + Cloud | ~500ms | Low (data costs) | Edge |

**Trade-offs**:
- **Desktop**: Lowest latency, highest cost
- **Edge**: Higher latency, lowest cost, offline capable
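
The numbers above are rough, setup-dependent estimates; it is worth timing your own stack. A quick sketch using `curl`'s built-in timing against the Ollama endpoint from earlier (swap in the vLLM URL to compare deployments):

```bash
# Measure end-to-end request latency against Ollama
curl -s -o /dev/null -w "total: %{time_total}s\n" \
  http://localhost:11434/api/generate -d '{
    "model": "nanochat-model",
    "prompt": "ping",
    "stream": false
  }'
```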

---

## 🔧 Configuration

### Model Size Selection

Nanochat's `--depth` parameter controls model size:

| Depth | Parameters | RAM (inference) | Use Case |
|-------|-----------|-----------------|----------|
| 12 | ~300M | ~1GB | Raspberry Pi, weak laptops |
| 20 | ~1B | ~3GB | Desktop, gaming PC |
| 26 (GPT-2) | ~1.6B | ~5GB | Server, powerful desktop |
| 30+ | ~3B+ | ~10GB+ | HPC, cloud |
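
Depth is fixed at training time. A minimal sketch of choosing it, assuming the training entrypoint and flags match what `runs/speedrun.sh` invokes (check that script for the exact command and GPU count):

```bash
# Train a smaller d12 model suited to edge-backed serving
# (entrypoint and flags are assumptions; mirror runs/speedrun.sh)
torchrun --standalone --nproc_per_node=8 \
  -m scripts.base_train -- --depth=12
```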

**For edge deployment**:
- Use smaller models (d12-d16) when serving edge devices
- Run larger models (d20-d26) on desktop/server
- Q-Lite acts as the gateway between edge devices and the model

---

## 📖 References

- **Q-Lite GitHub**: https://github.com/RalphBigBear/q-lite
- **NanoChat Training**: https://github.com/karpathy/nanochat
- **Ollama Docs**: https://ollama.com/docs
- **vLLM Docs**: https://docs.vllm.ai

---

**Inspired by**: [OpenClaw Discussion #14132](https://github.com/openclaw/openclaw/discussions/14132)

**Special thanks** to @karpathy for proving that minimalism beats feature bloat.