mirror of https://github.com/karpathy/nanochat.git synced 2026-04-17 06:18:34 +00:00

Matt Suiche c5ef68cea2 Add comprehensive educational guide for nanochat

Created a complete educational resource covering the implementation of
nanochat from scratch, including:

- Mathematical foundations (linear algebra, optimization, attention)
- Tokenization with detailed BPE algorithm explanation
- Transformer architecture and GPT model implementation
- Self-attention mechanism with RoPE and Multi-Query Attention
- Training process, data loading, and distributed training
- Advanced optimization techniques (Muon + AdamW)
- Practical implementation guide with debugging tips
- Automated PDF compilation script

The guide includes deep code walkthroughs with line-by-line explanations
of key components, making it accessible for beginners while covering
advanced techniques used in modern LLMs.

Total content: ~4,300 lines across 8 chapters plus README and tooling.
PDF compilation available via compile_to_pdf.py script.

2025-10-21 18:36:26 +04:00

7.1 KiB

Raw Blame History

Introduction to nanochat: Building a ChatGPT from Scratch

What is nanochat?

nanochat is a complete, minimal implementation of a Large Language Model (LLM) similar to ChatGPT. Unlike most LLM projects that rely on heavy external frameworks, nanochat is built from scratch with minimal dependencies, making it perfect for learning how modern LLMs actually work.

Key Philosophy:

From Scratch: Implement core algorithms yourself rather than using black-box libraries
Minimal Dependencies: Only essential libraries (PyTorch, tokenizers, etc.)
Educational: Clean, readable code that you can understand completely
Full Stack: Everything from tokenization to web serving
Practical: Actually trains a working model for ~$100

What You'll Learn

By studying this repository, you will understand:

Tokenization: How text is converted to numbers using Byte Pair Encoding (BPE)
Model Architecture: The Transformer architecture with modern improvements
Training Pipeline:
- Pretraining: Learning language patterns from raw text
- Midtraining: Specialized training on curated data
- Supervised Fine-Tuning (SFT): Teaching the model to chat
- Reinforcement Learning (RL): Optimizing for quality
Optimization: Advanced optimizers like Muon and AdamW
Evaluation: Measuring model performance
Inference: Running the trained model efficiently
Deployment: Serving the model via a web interface

Repository Structure

nanochat/
├── nanochat/              # Core library
│   ├── gpt.py            # GPT model architecture
│   ├── tokenizer.py      # BPE tokenizer wrapper
│   ├── dataloader.py     # Data loading and tokenization
│   ├── engine.py         # Inference engine
│   ├── adamw.py          # AdamW optimizer
│   ├── muon.py           # Muon optimizer
│   └── ...
├── rustbpe/              # High-performance Rust tokenizer
│   └── src/lib.rs        # BPE implementation in Rust
├── scripts/              # Training and evaluation scripts
│   ├── base_train.py     # Pretraining script
│   ├── mid_train.py      # Midtraining script
│   ├── chat_sft.py       # Supervised fine-tuning
│   ├── chat_rl.py        # Reinforcement learning
│   └── chat_web.py       # Web interface
├── tasks/                # Evaluation benchmarks
├── tests/                # Unit tests
└── speedrun.sh           # Complete pipeline script

The Training Pipeline

nanochat implements the complete modern LLM training pipeline:

1. Tokenization (tok_train.py)

First, we need to convert text into numbers. We train a Byte Pair Encoding (BPE) tokenizer on a corpus of text. This creates a vocabulary of ~32,000 tokens that efficiently represent common words and subwords.

Time: ~10 minutes on CPU

2. Base Pretraining (base_train.py)

The model learns to predict the next token in sequences of text. This is where most of the "knowledge" is learned - language patterns, facts, reasoning abilities, etc.

Data: ~10 billion tokens from FineWeb (high-quality web text) Objective: Next-token prediction Time: ~2-4 hours on 8×H100 GPUs Cost: ~$100

3. Midtraining (mid_train.py)

Continue pretraining on a smaller, more curated dataset to improve quality and reduce the need for instruction following data.

Data: ~1 billion high-quality tokens Time: ~30 minutes Cost: ~$12

4. Supervised Fine-Tuning (chat_sft.py)

Teach the model to follow instructions and chat like ChatGPT. We train on conversation examples.

Data: ~80,000 conversations from SmolTalk Objective: Predict assistant responses given user prompts Time: ~15 minutes Cost: ~$6

5. Reinforcement Learning (chat_rl.py)

Further optimize the model using reinforcement learning to improve response quality.

Technique: Self-improvement via sampling and filtering Time: ~10 minutes Cost: ~$4

Key Technical Features

Modern Architecture Choices

The GPT model in nanochat includes modern improvements over the original GPT-2:

Rotary Position Embeddings (RoPE): Better position encoding
RMSNorm: Simpler, more efficient normalization
Multi-Query Attention (MQA): Faster inference
QK Normalization: Stability improvement
ReLU² Activation: Better than GELU for small models
Untied Embeddings: Separate input/output embeddings
Logit Softcapping: Prevents extreme logits

Efficient Implementation

Mixed Precision: BF16 for most operations
Gradient Accumulation: Larger effective batch sizes
Distributed Training: Multi-GPU support with DDP
Compiled Models: PyTorch compilation for speed
Streaming Data: Memory-efficient data loading
Rust Tokenizer: Fast tokenization with parallel processing

Mathematical Notation

Throughout this guide, we'll use the following notation:

d_{model}: Model dimension (embedding size)
n_{layers}: Number of Transformer layers
n_{heads}: Number of attention heads
d_{head}: Dimension per attention head ($d_{model} / n_{heads}$)
V: Vocabulary size
T or L: Sequence length
B: Batch size
\theta: Model parameters
\mathcal{L}: Loss function
p(x): Probability distribution

Prerequisites

To fully understand this material, you should have:

Essential:

Python programming
Basic linear algebra (matrices, vectors, dot products)
Basic calculus (derivatives, chain rule)
Basic probability (distributions, expectation)

Helpful but not required:

PyTorch basics
Deep learning fundamentals
Transformer architecture awareness

Don't worry if you're not an expert! We'll explain everything step by step.

How to Use This Guide

The educational materials are organized as follows:

01_introduction.md (this file): Overview and context
02_mathematical_foundations.md: Math concepts you need
03_tokenization.md: BPE algorithm and implementation
04_transformer_architecture.md: The GPT model structure
05_attention_mechanism.md: Self-attention in detail
06_training_process.md: How training works
07_optimization.md: Advanced optimizers (Muon, AdamW)
08_implementation_details.md: Code walkthrough
09_evaluation.md: Measuring model performance
10_rust_implementation.md: High-performance Rust tokenizer

Each section builds on previous ones, so it's best to read them in order.

Running the Code

To get started with nanochat:

# Clone the repository
git clone https://github.com/karpathy/nanochat.git
cd nanochat

# Install dependencies (requires Python 3.10+)
pip install uv
uv sync

# Run the complete pipeline (requires 8×H100 GPUs)
bash speedrun.sh

For learning purposes, you can also:

# Run tests
python -m pytest tests/ -v

# Train tokenizer only
python -m scripts.tok_train

# Train small model on 1 GPU
python -m scripts.base_train --depth=6

Next Steps

In the next section, we'll cover the Mathematical Foundations - all the math concepts you need to understand how LLMs work, explained from first principles.

Let's begin! 🚀

7.1 KiB Raw Blame History Unescape Escape