Introduction to nanochat: Building a ChatGPT from Scratch
What is nanochat?
nanochat is a complete, minimal implementation of a Large Language Model (LLM) similar to ChatGPT. Unlike most LLM projects that rely on heavy external frameworks, nanochat is built from scratch with minimal dependencies, making it perfect for learning how modern LLMs actually work.
Key Philosophy:
- From Scratch: Implement core algorithms yourself rather than using black-box libraries
- Minimal Dependencies: Only essential libraries (PyTorch, tokenizers, etc.)
- Educational: Clean, readable code that you can understand completely
- Full Stack: Everything from tokenization to web serving
- Practical: Actually trains a working model for ~$100
What You'll Learn
By studying this repository, you will understand:
- Tokenization: How text is converted to numbers using Byte Pair Encoding (BPE)
- Model Architecture: The Transformer architecture with modern improvements
- Training Pipeline:
- Pretraining: Learning language patterns from raw text
- Midtraining: Specialized training on curated data
- Supervised Fine-Tuning (SFT): Teaching the model to chat
- Reinforcement Learning (RL): Optimizing for quality
- Optimization: Advanced optimizers like Muon and AdamW
- Evaluation: Measuring model performance
- Inference: Running the trained model efficiently
- Deployment: Serving the model via a web interface
Repository Structure
nanochat/
├── nanochat/ # Core library
│ ├── gpt.py # GPT model architecture
│ ├── tokenizer.py # BPE tokenizer wrapper
│ ├── dataloader.py # Data loading and tokenization
│ ├── engine.py # Inference engine
│ ├── adamw.py # AdamW optimizer
│ ├── muon.py # Muon optimizer
│ └── ...
├── rustbpe/ # High-performance Rust tokenizer
│ └── src/lib.rs # BPE implementation in Rust
├── scripts/ # Training and evaluation scripts
│ ├── base_train.py # Pretraining script
│ ├── mid_train.py # Midtraining script
│ ├── chat_sft.py # Supervised fine-tuning
│ ├── chat_rl.py # Reinforcement learning
│ └── chat_web.py # Web interface
├── tasks/ # Evaluation benchmarks
├── tests/ # Unit tests
└── speedrun.sh # Complete pipeline script
The Training Pipeline
nanochat implements the complete modern LLM training pipeline:
1. Tokenization (tok_train.py)
First, we need to convert text into numbers. We train a Byte Pair Encoding (BPE) tokenizer on a corpus of text. This produces a vocabulary of 65,536 (2^16) tokens that efficiently represents common words and subwords.
Time: ~10 minutes on CPU
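The merge loop at the heart of BPE can be sketched in a few lines of plain Python. This is an illustrative toy, not the optimized Rust implementation nanochat actually uses:

```python
# Toy BPE training step: find the most frequent adjacent pair of token
# ids and merge every occurrence into a single new token id.
from collections import Counter

def most_frequent_pair(ids):
    """Count adjacent pairs and return the most frequent one."""
    pairs = Counter(zip(ids, ids[1:]))
    return max(pairs, key=pairs.get)

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` with the new token id."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

# Start from raw UTF-8 bytes; byte values occupy ids 0..255, so the
# first merged token gets id 256. Training repeats this loop until the
# vocabulary reaches its target size.
ids = list("aaabdaaabac".encode("utf-8"))
pair = most_frequent_pair(ids)   # (97, 97), i.e. the byte pair "aa"
ids = merge(ids, pair, 256)
```

Real tokenizers also pre-split text with a regex pattern before counting pairs, which this sketch omits.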
2. Base Pretraining (base_train.py)
The model learns to predict the next token in sequences of text. This is where most of the model's "knowledge" is acquired: language patterns, facts, reasoning abilities, and so on.
Data: ~10 billion tokens from FineWeb (high-quality web text)
Objective: Next-token prediction
Time: ~2-4 hours on 8×H100 GPUs
Cost: ~$100
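The next-token objective is just cross-entropy: at each position, minimize the negative log-probability the model assigns to the token that actually comes next. A tiny numeric sketch (the distribution below is a made-up model output, not real data):

```python
# Next-token prediction loss at a single position.
# Targets are the input sequence shifted left by one, so the model at
# position i is scored on how much probability it gave to token i+1.
import math

def cross_entropy(probs, target):
    """Negative log-likelihood of the target token."""
    return -math.log(probs[target])

# Hypothetical model output: a distribution over a 4-token vocabulary.
probs = [0.1, 0.7, 0.1, 0.1]
loss = cross_entropy(probs, target=1)   # low loss: model favored token 1
```

In training, this quantity is averaged over every position in every sequence of the batch; PyTorch's `F.cross_entropy` computes it directly from logits.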
3. Midtraining (mid_train.py)
Continue pretraining on a smaller, more curated dataset to improve quality and reduce the amount of instruction-following data needed later.
Data: ~1 billion high-quality tokens
Time: ~30 minutes
Cost: ~$12
4. Supervised Fine-Tuning (chat_sft.py)
Teach the model to follow instructions and chat like ChatGPT. We train on conversation examples.
Data: ~80,000 conversations from SmolTalk
Objective: Predict assistant responses given user prompts
Time: ~15 minutes
Cost: ~$6
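One detail worth knowing about SFT: the loss is typically computed only on the assistant's tokens, with user/prompt tokens masked out so the model isn't trained to imitate the user. A pure-Python sketch of that masking, assuming a per-token role annotation (the IGNORE value here is illustrative; PyTorch's `cross_entropy` uses `ignore_index=-100` by default):

```python
# Build shifted next-token targets for a conversation, masking every
# position whose *target* token was not written by the assistant.
IGNORE = -1  # illustrative sentinel; skipped by the loss function

def make_targets(tokens, roles):
    """Targets are tokens shifted left by one, masked outside
    assistant spans."""
    targets = []
    for next_tok, next_role in zip(tokens[1:], roles[1:]):
        targets.append(next_tok if next_role == "assistant" else IGNORE)
    return targets

tokens = [10, 11, 12, 13, 14]
roles  = ["user", "user", "assistant", "assistant", "assistant"]
targets = make_targets(tokens, roles)   # [IGNORE, 12, 13, 14]
```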
5. Reinforcement Learning (chat_rl.py)
Further optimize the model using reinforcement learning to improve response quality.
Technique: Self-improvement via sampling and filtering
Time: ~10 minutes
Cost: ~$4
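The filtering half of "sampling and filtering" is conceptually simple: generate several candidate answers, keep only those a reward or correctness check accepts, and reinforce the survivors. A toy sketch with hypothetical names (the actual reward logic lives in chat_rl.py):

```python
# Keep only sampled completions that pass a task-specific check; the
# survivors become (or are up-weighted as) training signal.
def filter_samples(candidates, is_correct):
    """Return the candidates accepted by the reward/check function."""
    return [c for c in candidates if is_correct(c)]

# E.g. for a toy arithmetic task, accept samples whose final answer
# is the known correct one.
kept = filter_samples(["2+2=4", "2+2=5", "2+2=4"],
                      lambda s: s.endswith("=4"))
```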
Key Technical Features
Modern Architecture Choices
The GPT model in nanochat includes modern improvements over the original GPT-2:
- Rotary Position Embeddings (RoPE): Better position encoding
- RMSNorm: Simpler, more efficient normalization
- Multi-Query Attention (MQA): Faster inference
- QK Normalization: Stability improvement
- ReLU² Activation: Better than GELU for small models
- Untied Embeddings: Separate input/output embeddings
- Logit Softcapping: Prevents extreme logits
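Two of these choices are small enough to sketch directly. Pure-Python versions for clarity (nanochat uses the PyTorch tensor equivalents):

```python
# RMSNorm and ReLU-squared, operating on a plain list of floats.
import math

def rms_norm(x, eps=1e-6):
    """RMSNorm: divide by the root-mean-square of the vector.
    Unlike LayerNorm there is no mean subtraction and no bias."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def relu_squared(x):
    """ReLU² activation: max(0, v)^2 elementwise."""
    return [max(0.0, v) ** 2 for v in x]

normed = rms_norm([3.0, 4.0])        # unit RMS after normalization
acts   = relu_squared([-2.0, 3.0])   # negatives zeroed, positives squared
```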
Efficient Implementation
- Mixed Precision: BF16 for most operations
- Gradient Accumulation: Larger effective batch sizes
- Distributed Training: Multi-GPU support with DDP
- Compiled Models: PyTorch compilation for speed
- Streaming Data: Memory-efficient data loading
- Rust Tokenizer: Fast tokenization with parallel processing
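Gradient accumulation, for instance, is nothing more than averaging gradients across several micro-batches before taking one optimizer step, which simulates a larger batch without extra memory. A minimal sketch (per-parameter gradients as plain lists):

```python
# Average gradients from several micro-batches into one effective
# gradient, as if the micro-batches had been a single large batch.
def accumulate(micro_grads):
    """micro_grads: list of per-parameter gradient lists."""
    n = len(micro_grads)
    return [sum(g) / n for g in zip(*micro_grads)]

# Three micro-batches of gradients for a 2-parameter model
# yield one update's worth of gradient.
grads = accumulate([[1.0, 2.0], [3.0, 4.0], [2.0, 0.0]])
```

In PyTorch this usually appears as calling `loss.backward()` several times (with the loss scaled by the number of micro-batches) before a single `optimizer.step()`.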
Mathematical Notation
Throughout this guide, we'll use the following notation:
- $d_{model}$: Model dimension (embedding size)
- $n_{layers}$: Number of Transformer layers
- $n_{heads}$: Number of attention heads
- $d_{head}$: Dimension per attention head ($d_{model} / n_{heads}$)
- $V$: Vocabulary size
- $T$ or $L$: Sequence length
- $B$: Batch size
- $\theta$: Model parameters
- $\mathcal{L}$: Loss function
- $p(x)$: Probability distribution
Prerequisites
To fully understand this material, you should have:
Essential:
- Python programming
- Basic linear algebra (matrices, vectors, dot products)
- Basic calculus (derivatives, chain rule)
- Basic probability (distributions, expectation)
Helpful but not required:
- PyTorch basics
- Deep learning fundamentals
- Transformer architecture awareness
Don't worry if you're not an expert! We'll explain everything step by step.
How to Use This Guide
The educational materials are organized as follows:
- 01_introduction.md (this file): Overview and context
- 02_mathematical_foundations.md: Math concepts you need
- 03_tokenization.md: BPE algorithm and implementation
- 04_transformer_architecture.md: The GPT model structure
- 05_attention_mechanism.md: Self-attention in detail
- 06_training_process.md: How training works
- 07_optimization.md: Advanced optimizers (Muon, AdamW)
- 08_implementation_details.md: Code walkthrough
- 09_evaluation.md: Measuring model performance
- 10_rust_implementation.md: High-performance Rust tokenizer
Each section builds on previous ones, so it's best to read them in order.
Running the Code
To get started with nanochat:
# Clone the repository
git clone https://github.com/karpathy/nanochat.git
cd nanochat
# Install dependencies (requires Python 3.10+)
pip install uv
uv sync
# Run the complete pipeline (requires 8×H100 GPUs)
bash speedrun.sh
For learning purposes, you can also:
# Run tests
python -m pytest tests/ -v
# Train tokenizer only
python -m scripts.tok_train
# Train small model on 1 GPU
python -m scripts.base_train --depth=6
Next Steps
In the next section, we'll cover the Mathematical Foundations - all the math concepts you need to understand how LLMs work, explained from first principles.
Let's begin! 🚀