diff --git a/README.md b/README.md
index bc01055..5072191 100644
--- a/README.md
+++ b/README.md
@@ -109,6 +109,254 @@ I haven't invested too much here but some tests exist, especially for the tokeni
 python -m pytest tests/test_rustbpe.py -v -s
 ```
+## For Students
+
+nanochat is designed as an educational full-stack LLM implementation. If you're learning how modern language models work, from tokenization to deployment, this section will guide you through the codebase systematically.
+
+### Learning Path
+
+The best way to understand nanochat is to follow the same order as the training pipeline. Here's the recommended reading sequence:
+
+#### **Phase 1: Foundations (Start Here)**
+
+1. **`nanochat/common.py`** - Common utilities, distributed setup, logging
+   - *What to learn*: How distributed training is initialized, basic helper functions
+   - *Key concepts*: DDP (Distributed Data Parallel), device management, logging patterns
+
+2. **`nanochat/tokenizer.py`** - Text tokenization and the BPE algorithm
+   - *What to learn*: How text becomes numbers that neural networks can process
+   - *Key concepts*: Byte Pair Encoding (BPE), vocabulary, special tokens
+   - *Related*: `rustbpe/src/lib.rs` (Rust implementation for speed)
+
+3. **`scripts/tok_train.py`** - Tokenizer training script
+   - *What to learn*: How to train a tokenizer from scratch on your dataset
+   - *Try it*: Run `python -m scripts.tok_train --max_chars=2000000000` (after downloading data)
+
+#### **Phase 2: Model Architecture**
+
+4. **`nanochat/gpt.py`** ⭐ **CORE FILE**
+   - *What to learn*: The Transformer architecture with modern improvements
+   - *Key concepts*:
+     - Rotary embeddings (RoPE) for positional encoding
+     - QK normalization for training stability
+     - Multi-Query Attention (MQA) for efficient inference
+     - ReLU² activation function
+     - RMSNorm (no learnable parameters)
+   - *Architecture highlights*:
+     - `CausalSelfAttention`: The attention mechanism
+     - `MLP`: Feed-forward network with ReLU² activation
+     - `Block`: One transformer layer (attention + MLP)
+     - `GPT`: The full model putting it all together
+
+5. **`nanochat/muon.py`** and **`nanochat/adamw.py`** - Optimizers
+   - *What to learn*: How different parameters need different optimization strategies
+   - *Key insight*: Muon optimizer for matrix parameters, AdamW for embeddings
+   - *Why dual optimizers?*: Different parameter types benefit from different update rules
+
+#### **Phase 3: Data & Training**
+
+6. **`nanochat/dataset.py`** - Dataset downloading and preparation
+   - *What to learn*: How to download and manage large training datasets (FineWeb)
+   - *Key concepts*: Data sharding, streaming, efficient storage
+
+7. **`nanochat/dataloader.py`** - Data loading during training
+   - *What to learn*: How to efficiently feed data to the model during training
+   - *Key concepts*: Tokenization on-the-fly, distributed data loading, batching
+
+8. **`scripts/base_train.py`** ⭐ **CORE FILE**
+   - *What to learn*: The complete pretraining loop
+   - *Key concepts*:
+     - Gradient accumulation for large batch sizes
+     - Mixed precision training (bfloat16)
+     - Learning rate schedules
+     - Checkpointing
+     - Distributed training coordination
+   - *Try it*: Read through the main training loop starting from `for step in range(num_iterations + 1):` (a toy version of this loop is sketched just below)
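+
+Before diving into `scripts/base_train.py`, it helps to have the shape of a pretraining loop in your head. Below is a minimal, self-contained sketch of the pattern described above (gradient accumulation, linear warmup followed by cosine decay, bfloat16 autocast). The toy model, data, and hyperparameters are stand-ins chosen so the snippet runs anywhere; the real loop in nanochat differs in its details.
+
+```python
+import math
+import torch
+import torch.nn.functional as F
+
+# Toy stand-ins so the sketch is runnable on its own (not nanochat's actual model/dataloader).
+model = torch.nn.Sequential(torch.nn.Embedding(256, 64), torch.nn.Linear(64, 256))
+optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
+
+def get_batch(batch_size=8, seq_len=32):
+    x = torch.randint(0, 256, (batch_size, seq_len))  # input token ids
+    y = torch.randint(0, 256, (batch_size, seq_len))  # next-token targets
+    return x, y
+
+num_iterations = 100
+warmup_steps = 10
+grad_accum_steps = 4   # simulate a 4x larger batch than fits in memory at once
+base_lr = 1e-3
+
+def lr_multiplier(step):
+    # linear warmup, then cosine decay down to zero
+    if step < warmup_steps:
+        return (step + 1) / warmup_steps
+    progress = (step - warmup_steps) / max(1, num_iterations - warmup_steps)
+    return 0.5 * (1.0 + math.cos(math.pi * progress))
+
+for step in range(num_iterations):
+    # set the learning rate for this step according to the schedule
+    for group in optimizer.param_groups:
+        group["lr"] = base_lr * lr_multiplier(step)
+    # accumulate gradients over several micro-batches before taking one optimizer step
+    for micro_step in range(grad_accum_steps):
+        x, y = get_batch()
+        with torch.autocast(device_type="cpu", dtype=torch.bfloat16):  # "cuda" on a GPU
+            logits = model(x)
+            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
+        (loss / grad_accum_steps).backward()  # scale so the accumulated gradient is an average
+    optimizer.step()
+    optimizer.zero_grad(set_to_none=True)
+```
+
+This is only the skeleton: the real script adds checkpointing, evaluation, and multi-GPU coordination on top, but the loop structure is the same.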
+
+#### **Phase 4: Evaluation**
+
+9. **`nanochat/loss_eval.py`** - Training/validation loss evaluation
+   - *What to learn*: How to measure model perplexity on held-out data
+   - *Key concepts*: Bits per byte (BPB), perplexity
+
+10. **`nanochat/core_eval.py`** - CORE benchmark evaluation
+    - *What to learn*: How to evaluate language modeling capability
+    - *Key concepts*: Next-token prediction accuracy as a metric
+
+11. **`tasks/*.py`** - Task-specific evaluations
+    - `tasks/arc.py` - Reasoning benchmark
+    - `tasks/gsm8k.py` - Math word problems
+    - `tasks/humaneval.py` - Code generation
+    - `tasks/mmlu.py` - General knowledge
+    - `tasks/smoltalk.py` - Conversational ability
+    - *What to learn*: How to evaluate LLMs on different capabilities
+
+#### **Phase 5: Inference & Serving**
+
+12. **`nanochat/engine.py`** ⭐ **CORE FILE**
+    - *What to learn*: Efficient text generation with KV caching
+    - *Key concepts*:
+      - KV cache for fast autoregressive generation
+      - Sampling strategies (temperature, top-k)
+      - Tool use (calculator integration)
+      - Batch generation
+    - *Cool feature*: The calculator tool demonstrates how LLMs can use tools during generation
+
+13. **`scripts/chat_cli.py`** - Command-line chat interface
+    - *What to learn*: How to build a simple chat interface
+    - *Try it*: `python -m scripts.chat_cli -p "Why is the sky blue?"`
+
+14. **`scripts/chat_web.py`** - Web-based chat interface
+    - *What to learn*: How to serve an LLM over HTTP
+    - *Try it*: `python -m scripts.chat_web` (after training)
+
+#### **Phase 6: Advanced Training**
+
+15. **`scripts/mid_train.py`** - Midtraining
+    - *What to learn*: Teaching the model special tokens and conversational format
+    - *Key insight*: Bridge between pretraining and task-specific finetuning
+
+16. **`scripts/chat_sft.py`** - Supervised Fine-Tuning
+    - *What to learn*: Adapting the model to follow instructions
+    - *Key concepts*: Instruction tuning, chat templates
+
+17. **`scripts/chat_rl.py`** - Reinforcement Learning
+    - *What to learn*: Using RL to improve specific capabilities (math)
+    - *Key concepts*: Reward signals, policy optimization
+
+#### **Phase 7: Infrastructure**
+
+18. **`nanochat/checkpoint_manager.py`** - Model checkpointing
+    - *What to learn*: How to save and load model weights efficiently
+
+19. **`nanochat/report.py`** - Automated reporting
+    - *What to learn*: How to track experiments and generate reports
+
+20. **`nanochat/configurator.py`** - Configuration management
+    - *What to learn*: Command-line argument parsing for ML experiments
+
+### Key Architectural Decisions & Why
+
+1. **Rotary Embeddings instead of learned positional embeddings**
+   - *Why?*: Better length generalization, no extra parameters
+   - *Where?*: `gpt.py` - see the `apply_rotary_emb()` function
+
+2. **Untied embeddings** (separate input and output embedding matrices)
+   - *Why?*: More expressive, worth the extra parameters
+   - *Where?*: `gpt.py` - the `GPT` class has separate `wte` and `lm_head` parameters
+
+3. **QK Normalization**
+   - *Why?*: Training stability, prevents attention logits from exploding
+   - *Where?*: `gpt.py` - in `CausalSelfAttention.forward()` after rotary embeddings
+
+4. **Multi-Query Attention (MQA)**
+   - *Why?*: Faster inference with minimal quality loss
+   - *Where?*: `gpt.py` - `GPTConfig` has separate `n_head` and `n_kv_head`, see the `repeat_kv()` function
+
+5. **ReLU² activation**
+   - *Why?*: Better than GELU for smaller models, simple and effective
+   - *Where?*: `gpt.py` - `MLP.forward()` uses `F.relu(x).square()`
+
+6. **Dual optimizer strategy** (Muon + AdamW)
+   - *Why?*: Matrix parameters and embeddings benefit from different optimization
+   - *Where?*: `gpt.py` - see the `GPT.setup_optimizers()` method
+
+7. **Logit soft-capping**
+   - *Why?*: Prevents extreme logit values, improves training stability
+   - *Where?*: `gpt.py` - in `GPT.forward()`, search for "softcap" (a generic version is sketched after this list)
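+
+A few of these decisions are easiest to appreciate in isolation. The sketch below shows generic, self-contained versions of ReLU², a parameter-free RMSNorm, and tanh-based logit soft-capping. The function names and the cap value are illustrative (a common choice is shown), not copied from nanochat; check `gpt.py` for the exact formulations and constants it uses.
+
+```python
+import torch
+import torch.nn.functional as F
+
+def relu_squared(x: torch.Tensor) -> torch.Tensor:
+    # ReLU²: zero out negatives, then square (the MLP nonlinearity)
+    return F.relu(x).square()
+
+def rms_norm(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
+    # Parameter-free RMSNorm: rescale by the root-mean-square over the channel dimension,
+    # with no learnable gain or bias
+    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
+
+def softcap_logits(logits: torch.Tensor, cap: float = 15.0) -> torch.Tensor:
+    # Tanh soft-capping: smoothly squashes logits into (-cap, cap) so they cannot explode
+    return cap * torch.tanh(logits / cap)
+
+x = torch.randn(2, 8, 64)                          # (batch, sequence, channels)
+h = relu_squared(x)                                # what the MLP applies after its first projection
+q = rms_norm(torch.randn(2, 8, 64))                # QK-norm applies the same idea to queries/keys
+logits = softcap_logits(torch.randn(2, 8, 50304))  # pre-softmax logits stay bounded
+print(h.shape, q.shape, float(logits.abs().max()))
+```
+
+Note that this RMSNorm carries no learnable weight, matching the "no learnable parameters" point in Phase 2; most frameworks' RMSNorm layers do include a gain parameter.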
+
+### The Complete Pipeline Visualized
+
+```
+1. Data Preparation
+   ├─ Download FineWeb shards (dataset.py)
+   ├─ Train BPE tokenizer (tok_train.py)
+   └─ Tokenize data on-the-fly (dataloader.py)
+
+2. Pretraining
+   ├─ Initialize model (gpt.py)
+   ├─ Setup optimizers (muon.py, adamw.py)
+   ├─ Train on tokens (base_train.py)
+   └─ Evaluate on CORE (base_eval.py)
+
+3. Midtraining
+   ├─ Load base checkpoint
+   ├─ Train on formatted data (mid_train.py)
+   └─ Evaluate on chat tasks (chat_eval.py)
+
+4. Fine-tuning
+   ├─ Supervised learning (chat_sft.py)
+   ├─ [Optional] RL training (chat_rl.py)
+   └─ Final evaluation (chat_eval.py)
+
+5. Deployment
+   ├─ Load best checkpoint
+   ├─ Serve via CLI (chat_cli.py)
+   └─ Serve via Web (chat_web.py)
+```
+
+### Concepts to Master
+
+As you read through the code, make sure you understand these fundamental concepts:
+
+**Tokenization:**
+- Why we need tokenization
+- How BPE works (greedy merge of most frequent pairs)
+- Special tokens and their purpose
+
+**Model Architecture:**
+- Self-attention mechanism (Q, K, V matrices)
+- Causal masking (can only attend to past tokens)
+- Residual connections (x + attention(x))
+- Layer normalization (RMSNorm variant)
+- Why we stack many layers
+
+**Training:**
+- Gradient descent and backpropagation
+- Loss function (cross-entropy for next token prediction)
+- Learning rate schedules (warmup + cosine decay)
+- Gradient accumulation (simulating larger batches)
+- Mixed precision training (bfloat16 for speed)
+
+**Distributed Training:**
+- Data parallelism (same model, different data shards)
+- Gradient synchronization across GPUs
+- All-reduce operations
+
+**Inference:**
+- Autoregressive generation (one token at a time)
+- KV caching (reuse past computations)
+- Sampling strategies (temperature, top-k)
+
+### Recommended Experiments
+
+Once you've read through the code, try these experiments to deepen your understanding:
+
+1. **Modify the tokenizer vocabulary size** - See how it affects compression and training
+2. **Change model depth** - Train a smaller/larger model, observe parameter count vs. performance
+3. **Experiment with batch sizes** - Understand the speed/memory tradeoff
+4. **Try different sampling temperatures** - See how it affects generation creativity (a toy sampling loop is sketched at the end of this section)
+5. **Implement a simple evaluation task** - Add your own benchmark in `tasks/`
+6. **Add a new tool** - Extend the calculator to support more operations
+
+### Quick Start for Learning
+
+If you just want to understand the core without running anything:
+
+1. Read `gpt.py` - Understand the Transformer architecture
+2. Read `engine.py` - Understand how generation works
+3. Read `base_train.py` - Understand the training loop
+
+These three files (~1000 lines total) contain the essence of how modern LLMs work.
+
+### Resources for Deeper Learning
+
+- **Attention paper**: "Attention Is All You Need" (Vaswani et al.)
+- **GPT-2 paper**: "Language Models are Unsupervised Multitask Learners"
+- **Rotary embeddings**: "RoFormer: Enhanced Transformer with Rotary Position Embedding"
+- **Andrej's videos**: Neural Networks: Zero to Hero series on YouTube
+- **LLM101n course**: The course this project was built for (when released)
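+
+To make the sampling concepts above concrete (and to support experiment 4 in the list above), here is a tiny, self-contained decode loop with temperature and top-k sampling. The "model" is just random logits so the snippet runs anywhere; the function names are illustrative, and the real generation path, including the KV cache and tool use, lives in `nanochat/engine.py`.
+
+```python
+import torch
+
+vocab_size = 256
+
+def toy_logits(tokens: torch.Tensor) -> torch.Tensor:
+    # Stand-in for a real model: random next-token logits.
+    # In nanochat, the GPT model (with its KV cache) produces these instead.
+    return torch.randn(vocab_size)
+
+def sample_next(logits: torch.Tensor, temperature: float = 1.0, top_k: int = 50) -> int:
+    if temperature == 0.0:
+        return int(torch.argmax(logits))           # temperature 0 degenerates to greedy decoding
+    logits = logits / temperature                  # low T -> sharper, high T -> more random
+    if top_k is not None:
+        v, _ = torch.topk(logits, min(top_k, logits.numel()))
+        logits[logits < v[-1]] = -float("inf")     # keep only the k most likely tokens
+    probs = torch.softmax(logits, dim=-1)
+    return int(torch.multinomial(probs, num_samples=1))
+
+tokens = torch.tensor([0])                         # pretend start token
+for _ in range(16):                                # autoregressive: one token at a time
+    next_token = sample_next(toy_logits(tokens), temperature=0.8, top_k=40)
+    tokens = torch.cat([tokens, torch.tensor([next_token])])
+print(tokens.tolist())
+```
+
+Raising the temperature flattens the distribution and makes generations more diverse; lowering it (or using top-k with a small k) makes them more deterministic, which is exactly the behavior experiment 4 asks you to observe.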
+
+## Contributing
+
+nanochat is nowhere finished. The goal is to improve the state of the art in micro models that are accessible to work with end to end on budgets of < $1000. Accessibility is about overall cost but also about cognitive complexity - nanochat is not an exhaustively configurable LLM "framework"; there will be no giant configuration objects, model factories, or if-then-else monsters in the code base. It is a single, cohesive, minimal, readable, hackable, maximally-forkable "strong baseline" codebase designed to run start to end and produce a concrete ChatGPT clone and its report card.