Add 'For Students' section with structured learning path through the codebase

Rimom Costa 2025-10-13 18:50:35 +01:00
parent 5fd0b13886
commit b230ab8a0b

README.md

@@ -109,6 +109,254 @@ I haven't invested too much here but some tests exist, especially for the tokeni
python -m pytest tests/test_rustbpe.py -v -s
```
## For Students
nanochat is designed as an educational full-stack LLM implementation. If you're learning how modern language models work, from tokenization to deployment, this section will guide you through the codebase systematically.
### Learning Path
The best way to understand nanochat is to follow the same order as the training pipeline. Here's the recommended reading sequence:
#### **Phase 1: Foundations (Start Here)**
1. **`nanochat/common.py`** - Common utilities, distributed setup, logging
- *What to learn*: How distributed training is initialized, basic helper functions
- *Key concepts*: DDP (Distributed Data Parallel), device management, logging patterns
2. **`nanochat/tokenizer.py`** - Text tokenization and the BPE algorithm
- *What to learn*: How text becomes numbers that neural networks can process
- *Key concepts*: Byte Pair Encoding (BPE), vocabulary, special tokens (a toy merge sketch follows this list)
- *Related*: `rustbpe/src/lib.rs` (Rust implementation for speed)
3. **`scripts/tok_train.py`** - Tokenizer training script
- *What to learn*: How to train a tokenizer from scratch on your dataset
- *Try it*: Run `python -m scripts.tok_train --max_chars=2000000000` (after downloading data)
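The heart of BPE is simple enough to fit in a few lines. Below is a toy, unoptimized sketch of a single training step (count adjacent pairs, merge the most frequent one into a new id). The function name and the id-assignment scheme are made up for illustration; the real implementation lives in `nanochat/tokenizer.py` and `rustbpe/src/lib.rs`, where it also handles byte-level input, regex splitting, and special tokens.
```python
from collections import Counter

def bpe_merge_step(ids: list[int]) -> tuple[list[int], tuple[int, int]]:
    """One toy BPE step: find the most frequent adjacent pair and replace every
    occurrence with a new token id. Real training repeats this until the
    target vocabulary size is reached."""
    pairs = Counter(zip(ids, ids[1:]))
    top_pair = max(pairs, key=pairs.get)      # most frequent adjacent pair
    new_id = max(ids) + 1                     # toy id scheme, not the real one
    merged, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == top_pair:
            merged.append(new_id)
            i += 2
        else:
            merged.append(ids[i])
            i += 1
    return merged, top_pair

ids = list(b"aaabdaaabac")                    # start from raw bytes
merged, pair = bpe_merge_step(ids)
print(pair, "->", merged)                     # the byte pair for 'aa' becomes one id
```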
#### **Phase 2: Model Architecture**
4. **`nanochat/gpt.py`** ⭐ **CORE FILE**
- *What to learn*: The Transformer architecture with modern improvements
- *Key concepts*:
- Rotary embeddings (RoPE) for positional encoding
- QK normalization for training stability
- Multi-Query Attention (MQA) for efficient inference
- ReLU² activation function
- RMSNorm (no learnable parameters)
- *Architecture highlights* (a simplified block sketch follows this list):
- `CausalSelfAttention`: The attention mechanism
- `MLP`: Feed-forward network with ReLU² activation
- `Block`: One transformer layer (attention + MLP)
- `GPT`: The full model putting it all together
5. **`nanochat/muon.py`** and **`nanochat/adamw.py`** - Optimizers
- *What to learn*: How different parameters need different optimization strategies
- *Key insight*: Muon optimizer for matrix parameters, AdamW for embeddings
- *Why dual optimizers?*: Different parameter types benefit from different update rules (see the parameter-split snippet below)
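To anchor items 4 and 5, below is a deliberately simplified PyTorch sketch of one pre-norm transformer block (parameter-free RMSNorm, causal self-attention, ReLU² MLP), followed by a tiny parameter-split snippet in the spirit of the dual-optimizer idea. Names, dimensions, and the grouping rule are illustrative assumptions; the real code is in `nanochat/gpt.py`, which additionally applies rotary embeddings, QK normalization, and multi-query attention.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def rmsnorm(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # RMSNorm with no learnable parameters, as noted in item 4
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps)

class ToyBlock(nn.Module):
    """One simplified transformer layer: x + attn(norm(x)), then x + mlp(norm(x))."""
    def __init__(self, dim: int, n_head: int):
        super().__init__()
        self.n_head = n_head
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)
        self.fc1 = nn.Linear(dim, 4 * dim, bias=False)
        self.fc2 = nn.Linear(4 * dim, dim, bias=False)

    def attn(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=2)
        # (B, T, C) -> (B, n_head, T, head_dim); RoPE, QK-norm, and MQA omitted here
        q, k, v = (t.view(B, T, self.n_head, -1).transpose(1, 2) for t in (q, k, v))
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # causal attention
        return self.proj(y.transpose(1, 2).reshape(B, T, C))

    def mlp(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(F.relu(self.fc1(x)).square())  # the ReLU^2 activation

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.attn(rmsnorm(x))
        x = x + self.mlp(rmsnorm(x))
        return x

# Dual-optimizer idea from item 5, with a purely hypothetical grouping rule
# (the real one lives in GPT.setup_optimizers()): 2D matrices to Muon,
# everything else to AdamW.
model = nn.Sequential(*[ToyBlock(64, n_head=4) for _ in range(2)])
matrix_params = [p for p in model.parameters() if p.ndim == 2]
other_params = [p for p in model.parameters() if p.ndim != 2]
print(f"{len(matrix_params)} matrix tensors -> Muon, {len(other_params)} others -> AdamW")
```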
#### **Phase 3: Data & Training**
6. **`nanochat/dataset.py`** - Dataset downloading and preparation
- *What to learn*: How to download and manage large training datasets (FineWeb)
- *Key concepts*: Data sharding, streaming, efficient storage
7. **`nanochat/dataloader.py`** - Data loading during training
- *What to learn*: How to efficiently feed data to the model during training
- *Key concepts*: Tokenization on-the-fly, distributed data loading, batching
8. **`scripts/base_train.py`** ⭐ **CORE FILE**
- *What to learn*: The complete pretraining loop
- *Key concepts*:
- Gradient accumulation for large batch sizes
- Mixed precision training (bfloat16)
- Learning rate schedules
- Checkpointing
- Distributed training coordination
- *Try it*: Read through the main training loop starting from `for step in range(num_iterations + 1):` (a stripped-down version of such a loop is sketched below)
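As a companion to item 8, here is a stripped-down, hypothetical training loop showing gradient accumulation, a warmup-plus-cosine learning-rate schedule, and bfloat16 autocast. `model` and `get_batch` are placeholders (the model is assumed to return the mean cross-entropy loss); the real loop in `scripts/base_train.py` additionally handles distributed coordination, evaluation, and checkpointing.
```python
import math
import torch

def lr_at(step: int, max_steps: int, base_lr: float = 1e-3, warmup: int = 100) -> float:
    # linear warmup, then cosine decay toward zero
    if step < warmup:
        return base_lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, max_steps - warmup)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

def train(model, optimizer, get_batch, max_steps=1000, grad_accum_steps=8, device="cuda"):
    for step in range(max_steps):
        for group in optimizer.param_groups:
            group["lr"] = lr_at(step, max_steps)          # schedule the learning rate
        for _ in range(grad_accum_steps):                 # simulate a larger batch
            x, y = get_batch()                            # x: (B, T) inputs, y: (B, T) targets
            with torch.autocast(device_type=device, dtype=torch.bfloat16):
                loss = model(x, y)                        # assumed to return mean cross-entropy
            (loss / grad_accum_steps).backward()          # scale so accumulated grads average
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
```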
#### **Phase 4: Evaluation**
9. **`nanochat/loss_eval.py`** - Training/validation loss evaluation
- *What to learn*: How to measure model perplexity on held-out data
- *Key concepts*: Bits per byte (BPB), perplexity (see the unit-conversion sketch after this list)
10. **`nanochat/core_eval.py`** - CORE benchmark evaluation
- *What to learn*: How to evaluate language modeling capability
- *Key concepts*: Next-token prediction accuracy as a metric
11. **`tasks/*.py`** - Task-specific evaluations
- `tasks/arc.py` - Reasoning benchmark
- `tasks/gsm8k.py` - Math word problems
- `tasks/humaneval.py` - Code generation
- `tasks/mmlu.py` - General knowledge
- `tasks/smoltalk.py` - Conversational ability
- *What to learn*: How to evaluate LLMs on different capabilities
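Bits per byte (item 9) is just a unit conversion on the cross-entropy loss, sketched below under the assumption that you know how many bytes a token covers on average; the exact bookkeeping in `nanochat/loss_eval.py` may differ.
```python
import math

def bits_per_byte(mean_loss_nats: float, bytes_per_token: float) -> float:
    # PyTorch cross-entropy is in nats/token; divide by ln(2) for bits/token,
    # then by the average bytes a token covers to get bits/byte
    bits_per_token = mean_loss_nats / math.log(2)
    return bits_per_token / bytes_per_token

# example: a loss of 3.0 nats/token at ~4.8 bytes/token gives ~0.90 bits/byte;
# perplexity, by contrast, is simply math.exp(mean_loss_nats)
print(bits_per_byte(3.0, 4.8))
```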
#### **Phase 5: Inference & Serving**
12. **`nanochat/engine.py`** ⭐ **CORE FILE**
- *What to learn*: Efficient text generation with KV caching
- *Key concepts*:
- KV cache for fast autoregressive generation
- Sampling strategies (temperature, top-k; sketched after this list)
- Tool use (calculator integration)
- Batch generation
- *Cool feature*: The calculator tool demonstrates how LLMs can use tools during generation
13. **`scripts/chat_cli.py`** - Command-line chat interface
- *What to learn*: How to build a simple chat interface
- *Try it*: `python -m scripts.chat_cli -p "Why is the sky blue?"`
14. **`scripts/chat_web.py`** - Web-based chat interface
- *What to learn*: How to serve an LLM over HTTP
- *Try it*: `python -m scripts.chat_web` (after training)
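To complement item 12, here is a minimal autoregressive sampling loop with temperature and top-k. Note that it re-runs the model on the full prefix at every step; avoiding exactly that recomputation is what the KV cache in `nanochat/engine.py` is for. The function and argument names are illustrative, and the model is assumed to return `(batch, time, vocab)` logits.
```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sample(model, ids: torch.Tensor, max_new_tokens=50, temperature=0.8, top_k=50):
    # ids: (1, T) prompt token ids
    for _ in range(max_new_tokens):
        logits = model(ids)[:, -1, :] / max(temperature, 1e-8)  # only the last position matters
        if top_k is not None:
            v, _ = torch.topk(logits, top_k)
            logits[logits < v[:, [-1]]] = float("-inf")          # keep only the top-k logits
        probs = F.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)        # sample one token
        ids = torch.cat([ids, next_id], dim=1)                   # no KV cache: full recompute next step
    return ids
```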
#### **Phase 6: Advanced Training**
15. **`scripts/mid_train.py`** - Midtraining
- *What to learn*: Teaching the model special tokens and conversational format
- *Key insight*: Bridge between pretraining and task-specific finetuning
16. **`scripts/chat_sft.py`** - Supervised Fine-Tuning
- *What to learn*: Adapting the model to follow instructions
- *Key concepts*: Instruction tuning, chat templates (a toy template render follows this list)
17. **`scripts/chat_rl.py`** - Reinforcement Learning
- *What to learn*: Using RL to improve specific capabilities (math)
- *Key concepts*: Reward signals (e.g., answer correctness on math problems), policy optimization
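To see what "special tokens and conversational format" (items 15-16) means in practice, here is a toy renderer that flattens a conversation into one training string using made-up delimiter tokens; nanochat's actual special tokens and loss-masking rules are defined by its tokenizer and the SFT script.
```python
# Hypothetical special tokens; the real ones come from nanochat's tokenizer.
BOS, USER, ASSISTANT, END = "<|bos|>", "<|user|>", "<|assistant|>", "<|end|>"

def render_conversation(messages: list[dict]) -> str:
    """Flatten [{'role': ..., 'content': ...}, ...] into one training string.
    During SFT, the loss is typically computed only on assistant tokens."""
    out = [BOS]
    for m in messages:
        tag = USER if m["role"] == "user" else ASSISTANT
        out.append(f"{tag}{m['content']}{END}")
    return "".join(out)

print(render_conversation([
    {"role": "user", "content": "Why is the sky blue?"},
    {"role": "assistant", "content": "Rayleigh scattering."},
]))
```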
#### **Phase 7: Infrastructure**
18. **`nanochat/checkpoint_manager.py`** - Model checkpointing
- *What to learn*: How to save and load model weights efficiently (a bare-bones sketch follows this list)
19. **`nanochat/report.py`** - Automated reporting
- *What to learn*: How to track experiments and generate reports
20. **`nanochat/configurator.py`** - Configuration management
- *What to learn*: Command-line argument parsing for ML experiments
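Item 18 boils down to persisting and restoring weights plus optimizer state. Below is a bare-bones sketch using plain `torch.save`/`torch.load`; the real `nanochat/checkpoint_manager.py` adds its own file layout and metadata.
```python
import torch

def save_checkpoint(path, model, optimizer, step):
    torch.save({
        "model": model.state_dict(),           # weights
        "optimizer": optimizer.state_dict(),   # momentum / Adam moments, etc.
        "step": step,
    }, path)

def load_checkpoint(path, model, optimizer):
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"]                        # resume from this step
```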
### Key Architectural Decisions & Why
1. **Rotary Embeddings instead of learned positional embeddings**
- *Why?*: Better length generalization, no extra parameters
- *Where?*: `gpt.py` - see the `apply_rotary_emb()` function
2. **Untied embeddings** (separate input and output embedding matrices)
- *Why?*: More expressive, worth the extra parameters
- *Where?*: `gpt.py` - `GPT` class has separate `wte` and `lm_head` parameters
3. **QK Normalization**
- *Why?*: Training stability, prevents attention logits from exploding
- *Where?*: `gpt.py` - in `CausalSelfAttention.forward()` after rotary embeddings
4. **Multi-Query Attention (MQA)**
- *Why?*: Faster inference with minimal quality loss
- *Where?*: `gpt.py` - `GPTConfig` has separate `n_head` and `n_kv_head`, see `repeat_kv()` function
5. **ReLU² activation**
- *Why?*: Better than GELU for smaller models, simple and effective
- *Where?*: `gpt.py` - `MLP.forward()` uses `F.relu(x).square()`
6. **Dual optimizer strategy** (Muon + AdamW)
- *Why?*: Matrix parameters and embeddings benefit from different optimization
- *Where?*: `gpt.py` - see `GPT.setup_optimizers()` method
7. **Logit soft-capping**
- *Why?*: Prevents extreme logit values, improves training stability
- *Where?*: `gpt.py` - in `GPT.forward()`, search for "softcap" (the idea is sketched just after this list)
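Two of the decisions above, QK normalization and logit soft-capping, reduce to near one-liners. The sketch below shows the general idea only; the exact normalization and cap value used in `gpt.py` may differ.
```python
import torch
import torch.nn.functional as F

def qk_norm(q: torch.Tensor, k: torch.Tensor):
    # normalize queries and keys so their dot products (attention logits) stay bounded
    return F.normalize(q, dim=-1), F.normalize(k, dim=-1)

def softcap(logits: torch.Tensor, cap: float = 15.0) -> torch.Tensor:
    # smoothly squash logits into (-cap, cap); the cap value here is an assumption
    return cap * torch.tanh(logits / cap)
```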
### The Complete Pipeline Visualized
```
1. Data Preparation
├─ Download FineWeb shards (dataset.py)
├─ Train BPE tokenizer (tok_train.py)
└─ Tokenize data on-the-fly (dataloader.py)
2. Pretraining
├─ Initialize model (gpt.py)
├─ Setup optimizers (muon.py, adamw.py)
├─ Train on tokens (base_train.py)
└─ Evaluate on CORE (base_eval.py)
3. Midtraining
├─ Load base checkpoint
├─ Train on formatted data (mid_train.py)
└─ Evaluate on chat tasks (chat_eval.py)
4. Fine-tuning
├─ Supervised learning (chat_sft.py)
├─ [Optional] RL training (chat_rl.py)
└─ Final evaluation (chat_eval.py)
5. Deployment
├─ Load best checkpoint
├─ Serve via CLI (chat_cli.py)
└─ Serve via Web (chat_web.py)
```
### Concepts to Master
As you read through the code, make sure you understand these fundamental concepts:
**Tokenization:**
- Why we need tokenization
- How BPE works (greedy merge of most frequent pairs)
- Special tokens and their purpose
**Model Architecture:**
- Self-attention mechanism (Q, K, V matrices)
- Causal masking (can only attend to past tokens; see the small example after this list)
- Residual connections (x + attention(x))
- Layer normalization (RMSNorm variant)
- Why we stack many layers
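If the attention bullets above feel abstract, these few lines compute single-head causal attention explicitly, including the lower-triangular mask that blocks attention to future positions (production code uses the fused `scaled_dot_product_attention` kernel instead):
```python
import torch
import torch.nn.functional as F

B, T, D = 1, 5, 8                              # batch, sequence length, head dim
q, k, v = (torch.randn(B, T, D) for _ in range(3))

scores = q @ k.transpose(-2, -1) / D ** 0.5    # (B, T, T) scaled dot products
mask = torch.tril(torch.ones(T, T, dtype=torch.bool))
scores = scores.masked_fill(~mask, float("-inf"))  # position t only sees positions <= t
weights = F.softmax(scores, dim=-1)            # each row sums to 1 over allowed positions
out = weights @ v                              # (B, T, D) weighted sum of values
```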
**Training:**
- Gradient descent and backpropagation
- Loss function (cross-entropy for next token prediction; a worked example follows this list)
- Learning rate schedules (warmup + cosine decay)
- Gradient accumulation (simulating larger batches)
- Mixed precision training (bfloat16 for speed)
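"Cross-entropy for next token prediction" concretely means shifting the targets by one position: the logits at position t are scored against the token at position t+1. A small worked example, with random tensors standing in for real model outputs:
```python
import torch
import torch.nn.functional as F

B, T, V = 2, 6, 100
tokens = torch.randint(0, V, (B, T))           # one training sequence per row
logits = torch.randn(B, T, V)                  # would come from the model

pred_logits = logits[:, :-1, :]                # predictions at positions 0..T-2
targets = tokens[:, 1:]                        # the "next token" at each of those positions
loss = F.cross_entropy(
    pred_logits.reshape(-1, V),                # flatten to (B*(T-1), V)
    targets.reshape(-1),                       # flatten to (B*(T-1),)
)
print(loss)                                    # average nats per predicted token
```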
**Distributed Training:**
- Data parallelism (same model, different data shards)
- Gradient synchronization across GPUs
- All-reduce operations (sketched just below)
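Gradient synchronization is conceptually simple: after every backward pass, each GPU's gradients are summed across ranks and divided by the number of ranks, so every replica steps with the same averaged gradient. DDP does this automatically and in overlapped buckets; the explicit loop below is only a conceptual sketch for an already-initialized process group.
```python
import torch.distributed as dist

def average_gradients(model):
    # what DDP effectively does after backward(): all-reduce (sum) each gradient
    # across ranks, then divide by the number of ranks to get the average
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size
```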
**Inference:**
- Autoregressive generation (one token at a time)
- KV caching (reuse past computations)
- Sampling strategies (temperature, top-k)
### Recommended Experiments
Once you've read through the code, try these experiments to deepen understanding:
1. **Modify the tokenizer vocabulary size** - See how it affects compression and training
2. **Change model depth** - Train a smaller/larger model, observe parameter count vs. performance
3. **Experiment with batch sizes** - Understand the speed/memory tradeoff
4. **Try different sampling temperatures** - See how it affects generation creativity
5. **Implement a simple evaluation task** - Add your own benchmark in `tasks/`
6. **Add a new tool** - Extend the calculator to support more operations
### Quick Start for Learning
If you just want to understand the core without running anything:
1. Read `gpt.py` - Understand the Transformer architecture
2. Read `engine.py` - Understand how generation works
3. Read `base_train.py` - Understand the training loop
These three files (~1000 lines total) contain the essence of how modern LLMs work.
### Resources for Deeper Learning
- **Attention paper**: "Attention Is All You Need" (Vaswani et al.)
- **GPT-2 paper**: "Language Models are Unsupervised Multitask Learners"
- **Rotary embeddings**: "RoFormer: Enhanced Transformer with Rotary Position Embedding"
- **Andrej's videos**: Neural Networks: Zero to Hero series on YouTube
- **LLM101n course**: The course this project was built for (when released)
## Contributing
nanochat is nowhere near finished. The goal is to improve the state of the art in micro models that are accessible to work with end to end on budgets of < $1000. Accessibility is not only about overall cost but also about cognitive complexity - nanochat is not an exhaustively configurable LLM "framework"; there will be no giant configuration objects, model factories, or if-then-else monsters in the code base. It is a single, cohesive, minimal, readable, hackable, maximally-forkable "strong baseline" codebase designed to run start to end and produce a concrete ChatGPT clone and its report card.