Add 'For Students' section with structured learning path through the codebase

This commit is contained in:
Rimom Costa 2025-10-13 18:50:35 +01:00
parent 5fd0b13886
commit b230ab8a0b

README.md

@@ -109,6 +109,254 @@ I haven't invested too much here but some tests exist, especially for the tokeni
python -m pytest tests/test_rustbpe.py -v -s
```
## For Students
nanochat is designed as an educational full-stack LLM implementation. If you're learning how modern language models work, from tokenization to deployment, this section guides you through the codebase systematically.
### Learning Path
The best way to understand nanochat is to follow the same order as the training pipeline. Here's the recommended reading sequence:
#### **Phase 1: Foundations (Start Here)**
1. **`nanochat/common.py`** - Common utilities, distributed setup, logging
- *What to learn*: How distributed training is initialized, basic helper functions
   - *Key concepts*: DDP (Distributed Data Parallel), device management, logging patterns (a minimal setup sketch follows this list)
2. **`nanochat/tokenizer.py`** - Text tokenization and the BPE algorithm
- *What to learn*: How text becomes numbers that neural networks can process
- *Key concepts*: Byte Pair Encoding (BPE), vocabulary, special tokens
- *Related*: `rustbpe/src/lib.rs` (Rust implementation for speed)
3. **`scripts/tok_train.py`** - Tokenizer training script
- *What to learn*: How to train a tokenizer from scratch on your dataset
- *Try it*: Run `python -m scripts.tok_train --max_chars=2000000000` (after downloading data)
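To make the distributed-setup idea in `common.py` concrete, here is a minimal sketch of how a PyTorch script typically initializes DDP from the environment variables that `torchrun` sets. The function name and the single-GPU fallback are illustrative, not nanochat's exact code:

```python
import os
import torch
import torch.distributed as dist

def init_distributed():
    # torchrun sets RANK/WORLD_SIZE/LOCAL_RANK for every process it launches.
    ddp = int(os.environ.get("RANK", -1)) != -1
    if not ddp:
        # Plain `python script.py`: behave as rank 0 of a world of size 1.
        return 0, 1, torch.device("cuda" if torch.cuda.is_available() else "cpu")
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    local_rank = int(os.environ["LOCAL_RANK"])
    device = torch.device(f"cuda:{local_rank}")
    torch.cuda.set_device(device)
    return rank, world_size, device
```

Launched with `torchrun --nproc_per_node=8 ...`, each process gets its own rank and GPU; launched with plain `python`, the fallback branch runs.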
#### **Phase 2: Model Architecture**
4. **`nanochat/gpt.py`** ⭐ **CORE FILE**
- *What to learn*: The Transformer architecture with modern improvements
- *Key concepts*:
- Rotary embeddings (RoPE) for positional encoding
- QK normalization for training stability
- Multi-Query Attention (MQA) for efficient inference
- ReLU² activation function
- RMSNorm (no learnable parameters)
- *Architecture highlights*:
- `CausalSelfAttention`: The attention mechanism
     - `MLP`: Feed-forward network with ReLU² activation (sketched, together with RMSNorm, right after this list)
- `Block`: One transformer layer (attention + MLP)
- `GPT`: The full model putting it all together
5. **`nanochat/muon.py`** and **`nanochat/adamw.py`** - Optimizers
- *What to learn*: How different parameters need different optimization strategies
- *Key insight*: Muon optimizer for matrix parameters, AdamW for embeddings
- *Why dual optimizers?*: Different parameter types benefit from different update rules
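Before opening `gpt.py`, it can help to see two of the pieces above in miniature. The sketch below shows a parameter-free RMSNorm and an MLP with the ReLU² activation; the layer names are chosen for the example and may not match nanochat's exactly:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def rms_norm(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # RMSNorm with no learnable parameters: rescale by the root-mean-square.
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

class MLP(nn.Module):
    # Feed-forward block with the ReLU^2 activation: relu(x) squared.
    def __init__(self, n_embd: int):
        super().__init__()
        self.c_fc = nn.Linear(n_embd, 4 * n_embd, bias=False)
        self.c_proj = nn.Linear(4 * n_embd, n_embd, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.c_proj(F.relu(self.c_fc(x)).square())
```

A transformer `Block` then roughly composes these as `x = x + attn(norm(x))` followed by `x = x + mlp(norm(x))`.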
#### **Phase 3: Data & Training**
6. **`nanochat/dataset.py`** - Dataset downloading and preparation
- *What to learn*: How to download and manage large training datasets (FineWeb)
- *Key concepts*: Data sharding, streaming, efficient storage
7. **`nanochat/dataloader.py`** - Data loading during training
- *What to learn*: How to efficiently feed data to the model during training
- *Key concepts*: Tokenization on-the-fly, distributed data loading, batching
8. **`scripts/base_train.py`** ⭐ **CORE FILE**
- *What to learn*: The complete pretraining loop
- *Key concepts*:
- Gradient accumulation for large batch sizes
- Mixed precision training (bfloat16)
- Learning rate schedules
- Checkpointing
- Distributed training coordination
- *Try it*: Read through the main training loop starting from `for step in range(num_iterations + 1):`
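The core of that loop has a simple shape. Below is a simplified sketch of one optimization step with gradient accumulation and bfloat16 autocast; the real loop in `base_train.py` also handles DDP gradient sync, the learning rate schedule, logging and checkpointing, and the `model(x, y)`-returns-loss interface is an assumption for the example:

```python
import torch

grad_accum_steps = 8  # simulate a batch 8x larger than fits in GPU memory

def train_step(model, optimizer, get_batch, device):
    # `get_batch` is assumed to return input token ids and shifted targets.
    optimizer.zero_grad(set_to_none=True)
    for _ in range(grad_accum_steps):
        x, y = get_batch(device)
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
            loss = model(x, y)                    # cross-entropy over next tokens
        (loss / grad_accum_steps).backward()      # scale so gradients average out
    optimizer.step()
```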
#### **Phase 4: Evaluation**
9. **`nanochat/loss_eval.py`** - Training/validation loss evaluation
- *What to learn*: How to measure model perplexity on held-out data
    - *Key concepts*: Bits per byte (BPB), perplexity (see the short BPB sketch after this list)
10. **`nanochat/core_eval.py`** - CORE benchmark evaluation
- *What to learn*: How to evaluate language modeling capability
- *Key concepts*: Next-token prediction accuracy as a metric
11. **`tasks/*.py`** - Task-specific evaluations
- `tasks/arc.py` - Reasoning benchmark
- `tasks/gsm8k.py` - Math word problems
- `tasks/humaneval.py` - Code generation
- `tasks/mmlu.py` - General knowledge
- `tasks/smoltalk.py` - Conversational ability
- *What to learn*: How to evaluate LLMs on different capabilities
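One metric from this phase worth pinning down is bits per byte: convert the mean next-token cross-entropy from nats to bits, then normalize by the average number of bytes each token covers, so that models with different tokenizers remain comparable. A back-of-the-envelope sketch; the exact bookkeeping in `loss_eval.py` may differ:

```python
import math

def bits_per_byte(mean_nll_nats: float, total_tokens: int, total_bytes: int) -> float:
    # Cross-entropy is usually reported in nats/token; convert to bits/token,
    # then divide by bytes/token to get bits/byte.
    bits_per_token = mean_nll_nats / math.log(2)
    bytes_per_token = total_bytes / total_tokens
    return bits_per_token / bytes_per_token

# e.g. loss of 3.0 nats/token with a tokenizer averaging 4.5 bytes/token -> ~0.96 bits/byte
print(bits_per_byte(3.0, total_tokens=1_000, total_bytes=4_500))
```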
#### **Phase 5: Inference & Serving**
12. **`nanochat/engine.py`** ⭐ **CORE FILE**
- *What to learn*: Efficient text generation with KV caching
- *Key concepts*:
- KV cache for fast autoregressive generation
     - Sampling strategies (temperature, top-k) (see the sampling sketch after this list)
- Tool use (calculator integration)
- Batch generation
- *Cool feature*: The calculator tool demonstrates how LLMs can use tools during generation
13. **`scripts/chat_cli.py`** - Command-line chat interface
- *What to learn*: How to build a simple chat interface
- *Try it*: `python -m scripts.chat_cli -p "Why is the sky blue?"`
14. **`scripts/chat_web.py`** - Web-based chat interface
- *What to learn*: How to serve an LLM over HTTP
- *Try it*: `python -m scripts.chat_web` (after training)
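To make the sampling bullet under `engine.py` concrete, here is one common way temperature and top-k are applied to next-token logits. This is a generic sketch, not necessarily nanochat's exact implementation:

```python
import torch
import torch.nn.functional as F

def sample_next_token(logits, temperature=1.0, top_k=50):
    # logits: (vocab_size,) raw scores for the next token.
    if temperature == 0.0:
        return logits.argmax()                       # greedy decoding
    logits = logits / temperature                    # <1 sharpens, >1 flattens the distribution
    if top_k is not None:
        v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
        logits[logits < v[-1]] = -float("inf")       # mask everything outside the top k
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)   # sample one token id
```

With temperature 0 the function degenerates to greedy decoding; raising the temperature or loosening top-k makes generations more diverse and less deterministic.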
#### **Phase 6: Advanced Training**
15. **`scripts/mid_train.py`** - Midtraining
- *What to learn*: Teaching the model special tokens and conversational format
- *Key insight*: Bridge between pretraining and task-specific finetuning
16. **`scripts/chat_sft.py`** - Supervised Fine-Tuning
- *What to learn*: Adapting the model to follow instructions
    - *Key concepts*: Instruction tuning, chat templates (a toy chat-format sketch follows this list)
17. **`scripts/chat_rl.py`** - Reinforcement Learning
- *What to learn*: Using RL to improve specific capabilities (math)
- *Key concepts*: Reward models, policy optimization
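To see why midtraining and SFT revolve around special tokens and chat templates, it helps to look at how a conversation is flattened into a single token stream. The `<|...|>` token names below are made up for the example; nanochat's real special tokens are defined in `tokenizer.py`:

```python
# Purely illustrative rendering of a chat into one training string.
conversation = [
    {"role": "user", "content": "Why is the sky blue?"},
    {"role": "assistant", "content": "Because of Rayleigh scattering."},
]

def render(conv):
    parts = []
    for msg in conv:
        # Hypothetical delimiters; the real tokenizer defines its own special tokens.
        parts.append(f"<|{msg['role']}|>{msg['content']}<|end|>")
    return "".join(parts)

print(render(conversation))
```

During SFT the loss is typically computed only on the assistant's tokens, so the model learns to produce answers rather than to imitate the user.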
#### **Phase 7: Infrastructure**
18. **`nanochat/checkpoint_manager.py`** - Model checkpointing
    - *What to learn*: How to save and load model weights efficiently (a bare-bones sketch follows this list)
19. **`nanochat/report.py`** - Automated reporting
- *What to learn*: How to track experiments and generate reports
20. **`nanochat/configurator.py`** - Configuration management
- *What to learn*: Command-line argument parsing for ML experiments
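At its simplest, a checkpoint is just the model and optimizer `state_dict`s plus whatever you need to resume. A bare-bones sketch; nanochat's `checkpoint_manager.py` adds metadata and handles distributed ranks:

```python
import torch

def save_checkpoint(path, model, optimizer, step):
    torch.save({
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "step": step,
    }, path)

def load_checkpoint(path, model, optimizer):
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"]
```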
### Key Architectural Decisions & Why
1. **Rotary Embeddings instead of learned positional embeddings**
- *Why?*: Better length generalization, no extra parameters
- *Where?*: `gpt.py` - see the `apply_rotary_emb()` function
2. **Untied embeddings** (separate input and output embedding matrices)
- *Why?*: More expressive, worth the extra parameters
- *Where?*: `gpt.py` - `GPT` class has separate `wte` and `lm_head` parameters
3. **QK Normalization**
- *Why?*: Training stability, prevents attention logits from exploding
- *Where?*: `gpt.py` - in `CausalSelfAttention.forward()` after rotary embeddings
4. **Multi-Query Attention (MQA)**
- *Why?*: Faster inference with minimal quality loss
   - *Where?*: `gpt.py` - `GPTConfig` has separate `n_head` and `n_kv_head`, see `repeat_kv()` function (a toy version appears after this list)
5. **ReLU² activation**
- *Why?*: Better than GELU for smaller models, simple and effective
- *Where?*: `gpt.py` - `MLP.forward()` uses `F.relu(x).square()`
6. **Dual optimizer strategy** (Muon + AdamW)
- *Why?*: Matrix parameters and embeddings benefit from different optimization
- *Where?*: `gpt.py` - see `GPT.setup_optimizers()` method
7. **Logit soft-capping**
- *Why?*: Prevents extreme logit values, improves training stability
- *Where?*: `gpt.py` - in `GPT.forward()`, search for "softcap"
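Two of these decisions (4 and 7) are small enough to show in miniature: with MQA/GQA the few key/value heads are simply repeated to match the number of query heads, and soft-capping squashes logits through a scaled tanh. A rough sketch with an illustrative cap value; the versions in `gpt.py` may differ in detail:

```python
import torch

def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
    # x: (batch, n_kv_head, seq, head_dim) -> (batch, n_kv_head * n_rep, seq, head_dim)
    if n_rep == 1:
        return x
    return x.repeat_interleave(n_rep, dim=1)

def softcap(logits: torch.Tensor, cap: float = 15.0) -> torch.Tensor:
    # Smoothly squash logits into (-cap, cap) instead of letting them grow unbounded.
    return cap * torch.tanh(logits / cap)
```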
### The Complete Pipeline Visualized
```
1. Data Preparation
├─ Download FineWeb shards (dataset.py)
├─ Train BPE tokenizer (tok_train.py)
└─ Tokenize data on-the-fly (dataloader.py)
2. Pretraining
├─ Initialize model (gpt.py)
├─ Setup optimizers (muon.py, adamw.py)
├─ Train on tokens (base_train.py)
└─ Evaluate on CORE (base_eval.py)
3. Midtraining
├─ Load base checkpoint
├─ Train on formatted data (mid_train.py)
└─ Evaluate on chat tasks (chat_eval.py)
4. Fine-tuning
├─ Supervised learning (chat_sft.py)
├─ [Optional] RL training (chat_rl.py)
└─ Final evaluation (chat_eval.py)
5. Deployment
├─ Load best checkpoint
├─ Serve via CLI (chat_cli.py)
└─ Serve via Web (chat_web.py)
```
### Concepts to Master
As you read through the code, make sure you understand these fundamental concepts:
**Tokenization:**
- Why we need tokenization
- How BPE works (greedy merge of most frequent pairs)
- Special tokens and their purpose
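A toy version of BPE training, greedily merging the most frequent adjacent pair, fits in a few lines. Real tokenizers such as `rustbpe` operate on bytes and are far more efficient, but the idea is the same:

```python
from collections import Counter

def bpe_train(text: str, num_merges: int):
    tokens = list(text)  # start from characters here; real BPE starts from bytes
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]   # most frequent adjacent pair
        merges.append((a, b))
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                merged.append(a + b)               # replace the pair with one new token
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, merges

print(bpe_train("low lower lowest", 3))
```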
**Model Architecture:**
- Self-attention mechanism (Q, K, V matrices)
- Causal masking (can only attend to past tokens)
- Residual connections (x + attention(x))
- Layer normalization (RMSNorm variant)
- Why we stack many layers
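The attention and masking bullets above reduce to a few lines of tensor code. A minimal single-head, scaled dot-product attention with a causal mask:

```python
import math
import torch
import torch.nn.functional as F

def causal_attention(q, k, v):
    # q, k, v: (seq_len, head_dim). Each position may attend only to itself and the past.
    T = q.size(0)
    scores = q @ k.transpose(0, 1) / math.sqrt(q.size(-1))        # (T, T) similarities
    mask = torch.triu(torch.ones(T, T), diagonal=1).bool()        # strictly upper triangle
    scores = scores.masked_fill(mask, float("-inf"))              # hide the future
    return F.softmax(scores, dim=-1) @ v                          # weighted sum of values

x = torch.randn(5, 16)
out = causal_attention(x, x, x)  # self-attention: queries, keys, values from the same x
```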
**Training:**
- Gradient descent and backpropagation
- Loss function (cross-entropy for next token prediction)
- Learning rate schedules (warmup + cosine decay)
- Gradient accumulation (simulating larger batches)
- Mixed precision training (bfloat16 for speed)
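The warmup-plus-cosine schedule is just a small function of the step number. A generic sketch, since nanochat's exact schedule and constants may differ:

```python
import math

def get_lr(step, max_lr, warmup_steps, total_steps, min_lr=0.0):
    if step < warmup_steps:
        # Linear warmup from ~0 up to max_lr.
        return max_lr * (step + 1) / warmup_steps
    # Cosine decay from max_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```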
**Distributed Training:**
- Data parallelism (same model, different data shards)
- Gradient synchronization across GPUs
- All-reduce operations
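Gradient synchronization boils down to an all-reduce: after the backward pass, every rank averages its gradients with all the others so that every replica takes the identical optimizer step. DDP does this for you, but written out by hand it would look roughly like:

```python
import torch.distributed as dist

def all_reduce_gradients(model, world_size):
    # Average gradients across ranks; assumes init_process_group() was called earlier.
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size
```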
**Inference:**
- Autoregressive generation (one token at a time)
- KV caching (reuse past computations)
- Sampling strategies (temperature, top-k)
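Finally, the KV-cache idea in one loop: without a cache, each new token reruns attention over the whole prefix; with one, past keys and values are stored and only the newest token's are computed. The `model(ids, kv_cache=...)` interface below is an assumption for the sketch, not nanochat's actual API:

```python
import torch

@torch.no_grad()
def generate(model, tokens, max_new_tokens, temperature=0.8):
    # tokens: (batch, seq) of token ids; model returns (logits, kv_cache) by assumption.
    kv_cache = None
    for _ in range(max_new_tokens):
        # With a cache, only the most recent token needs to be fed after the first step.
        ids = tokens if kv_cache is None else tokens[:, -1:]
        logits, kv_cache = model(ids, kv_cache=kv_cache)
        probs = torch.softmax(logits[:, -1, :] / temperature, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        tokens = torch.cat([tokens, next_token], dim=1)
    return tokens
```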
### Recommended Experiments
Once you've read through the code, try these experiments to deepen understanding:
1. **Modify the tokenizer vocabulary size** - See how it affects compression and training
2. **Change model depth** - Train a smaller/larger model, observe parameter count vs. performance
3. **Experiment with batch sizes** - Understand the speed/memory tradeoff
4. **Try different sampling temperatures** - See how it affects generation creativity
5. **Implement a simple evaluation task** - Add your own benchmark in `tasks/`
6. **Add a new tool** - Extend the calculator to support more operations
### Quick Start for Learning
If you just want to understand the core without running anything:
1. Read `gpt.py` - Understand the Transformer architecture
2. Read `engine.py` - Understand how generation works
3. Read `base_train.py` - Understand the training loop
These three files (~1000 lines total) contain the essence of how modern LLMs work.
### Resources for Deeper Learning
- **Attention paper**: "Attention Is All You Need" (Vaswani et al.)
- **GPT-2 paper**: "Language Models are Unsupervised Multitask Learners"
- **Rotary embeddings**: "RoFormer: Enhanced Transformer with Rotary Position Embedding"
- **Andrej's videos**: Neural Networks: Zero to Hero series on YouTube
- **LLM101n course**: The course this project was built for (when released)
## Contributing
nanochat is nowhere finished. The goal is to improve the state of the art in micro models that are accessible to work with end to end on budgets of < $1000 dollars. Accessibility is about overall cost but also about cognitive complexity - nanochat is not an exhaustively configurable LLM "framework"; there will be no giant configuration objects, model factories, or if-then-else monsters in the code base. It is a single, cohesive, minimal, readable, hackable, maximally-forkable "strong baseline" codebase designed to run start to end and produce a concrete ChatGPT clone and its report card.