Add 'For Students' section with structured learning path through the codebase

Rimom Costa 2025-10-13 18:50:35 +01:00
parent 5fd0b13886
commit b230ab8a0b

README.md

@@ -109,6 +109,254 @@ I haven't invested too much here but some tests exist, especially for the tokeni
python -m pytest tests/test_rustbpe.py -v -s
```
## For Students
nanochat is designed as an educational full-stack LLM implementation. If you're learning how modern language models work, from tokenization to deployment, this section will guide you through the codebase systematically.
### Learning Path
The best way to understand nanochat is to follow the same order as the training pipeline. Here's the recommended reading sequence:
#### **Phase 1: Foundations (Start Here)**
1. **`nanochat/common.py`** - Common utilities, distributed setup, logging
- *What to learn*: How distributed training is initialized, basic helper functions
- *Key concepts*: DDP (Distributed Data Parallel), device management, logging patterns
2. **`nanochat/tokenizer.py`** - Text tokenization and the BPE algorithm
- *What to learn*: How text becomes numbers that neural networks can process
- *Key concepts*: Byte Pair Encoding (BPE), vocabulary, special tokens (a toy merge sketch follows this list)
- *Related*: `rustbpe/src/lib.rs` (Rust implementation for speed)
3. **`scripts/tok_train.py`** - Tokenizer training script
- *What to learn*: How to train a tokenizer from scratch on your dataset
- *Try it*: Run `python -m scripts.tok_train --max_chars=2000000000` (after downloading data)
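The heart of BPE is simple enough to fit in a few lines. Below is a toy, unoptimized sketch of a single training step (count adjacent pairs, merge the most frequent one into a new id). The function name and the id-assignment scheme are made up for illustration; the real implementation lives in `nanochat/tokenizer.py` and `rustbpe/src/lib.rs`, where it also handles byte-level input, regex splitting, and special tokens.
```python
from collections import Counter

def bpe_merge_step(ids: list[int]) -> tuple[list[int], tuple[int, int]]:
    """One toy BPE step: find the most frequent adjacent pair and replace every
    occurrence with a new token id. Real training repeats this until the
    target vocabulary size is reached."""
    pairs = Counter(zip(ids, ids[1:]))
    top_pair = max(pairs, key=pairs.get)      # most frequent adjacent pair
    new_id = max(ids) + 1                     # toy id scheme, not the real one
    merged, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == top_pair:
            merged.append(new_id)
            i += 2
        else:
            merged.append(ids[i])
            i += 1
    return merged, top_pair

ids = list(b"aaabdaaabac")                    # start from raw bytes
merged, pair = bpe_merge_step(ids)
print(pair, "->", merged)                     # the byte pair for 'aa' becomes one id
```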
#### **Phase 2: Model Architecture**
4. **`nanochat/gpt.py`** ⭐ **CORE FILE**
- *What to learn*: The Transformer architecture with modern improvements
- *Key concepts*:
- Rotary embeddings (RoPE) for positional encoding
- QK normalization for training stability
- Multi-Query Attention (MQA) for efficient inference
- ReLU² activation function
- RMSNorm (no learnable parameters)
- *Architecture highlights* (a simplified block sketch follows this list):
- `CausalSelfAttention`: The attention mechanism
- `MLP`: Feed-forward network with ReLU² activation
- `Block`: One transformer layer (attention + MLP)
- `GPT`: The full model putting it all together
5. **`nanochat/muon.py`** and **`nanochat/adamw.py`** - Optimizers
- *What to learn*: How different parameters need different optimization strategies
- *Key insight*: Muon optimizer for matrix parameters, AdamW for embeddings
- *Why dual optimizers?*: Different parameter types benefit from different update rules (see the parameter-split snippet below)
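To anchor items 4 and 5, below is a deliberately simplified PyTorch sketch of one pre-norm transformer block (parameter-free RMSNorm, causal self-attention, ReLU² MLP), followed by a tiny parameter-split snippet in the spirit of the dual-optimizer idea. Names, dimensions, and the grouping rule are illustrative assumptions; the real code is in `nanochat/gpt.py`, which additionally applies rotary embeddings, QK normalization, and multi-query attention.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def rmsnorm(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # RMSNorm with no learnable parameters, as noted in item 4
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps)

class ToyBlock(nn.Module):
    """One simplified transformer layer: x + attn(norm(x)), then x + mlp(norm(x))."""
    def __init__(self, dim: int, n_head: int):
        super().__init__()
        self.n_head = n_head
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)
        self.fc1 = nn.Linear(dim, 4 * dim, bias=False)
        self.fc2 = nn.Linear(4 * dim, dim, bias=False)

    def attn(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=2)
        # (B, T, C) -> (B, n_head, T, head_dim); RoPE, QK-norm, and MQA omitted here
        q, k, v = (t.view(B, T, self.n_head, -1).transpose(1, 2) for t in (q, k, v))
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # causal attention
        return self.proj(y.transpose(1, 2).reshape(B, T, C))

    def mlp(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(F.relu(self.fc1(x)).square())  # the ReLU^2 activation

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.attn(rmsnorm(x))
        x = x + self.mlp(rmsnorm(x))
        return x

# Dual-optimizer idea from item 5, with a purely hypothetical grouping rule
# (the real one lives in GPT.setup_optimizers()): 2D matrices to Muon,
# everything else to AdamW.
model = nn.Sequential(*[ToyBlock(64, n_head=4) for _ in range(2)])
matrix_params = [p for p in model.parameters() if p.ndim == 2]
other_params = [p for p in model.parameters() if p.ndim != 2]
print(f"{len(matrix_params)} matrix tensors -> Muon, {len(other_params)} others -> AdamW")
```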
#### **Phase 3: Data & Training**
6. **`nanochat/dataset.py`** - Dataset downloading and preparation
- *What to learn*: How to download and manage large training datasets (FineWeb)
- *Key concepts*: Data sharding, streaming, efficient storage
7. **`nanochat/dataloader.py`** - Data loading during training
- *What to learn*: How to efficiently feed data to the model during training
- *Key concepts*: Tokenization on-the-fly, distributed data loading, batching
8. **`scripts/base_train.py`** ⭐ **CORE FILE**
- *What to learn*: The complete pretraining loop
- *Key concepts*:
- Gradient accumulation for large batch sizes
- Mixed precision training (bfloat16)
- Learning rate schedules
- Checkpointing
- Distributed training coordination
- *Try it*: Read through the main training loop starting from `for step in range(num_iterations + 1):` (a stripped-down version of such a loop is sketched below)
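As a companion to item 8, here is a stripped-down, hypothetical training loop showing gradient accumulation, a warmup-plus-cosine learning-rate schedule, and bfloat16 autocast. `model` and `get_batch` are placeholders (the model is assumed to return the mean cross-entropy loss); the real loop in `scripts/base_train.py` additionally handles distributed coordination, evaluation, and checkpointing.
```python
import math
import torch

def lr_at(step: int, max_steps: int, base_lr: float = 1e-3, warmup: int = 100) -> float:
    # linear warmup, then cosine decay toward zero
    if step < warmup:
        return base_lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, max_steps - warmup)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

def train(model, optimizer, get_batch, max_steps=1000, grad_accum_steps=8, device="cuda"):
    for step in range(max_steps):
        for group in optimizer.param_groups:
            group["lr"] = lr_at(step, max_steps)          # schedule the learning rate
        for _ in range(grad_accum_steps):                 # simulate a larger batch
            x, y = get_batch()                            # x: (B, T) inputs, y: (B, T) targets
            with torch.autocast(device_type=device, dtype=torch.bfloat16):
                loss = model(x, y)                        # assumed to return mean cross-entropy
            (loss / grad_accum_steps).backward()          # scale so accumulated grads average
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
```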
#### **Phase 4: Evaluation**
9. **`nanochat/loss_eval.py`** - Training/validation loss evaluation
- *What to learn*: How to measure model perplexity on held-out data
- *Key concepts*: Bits per byte (BPB), perplexity (see the unit-conversion sketch after this list)
10. **`nanochat/core_eval.py`** - CORE benchmark evaluation
- *What to learn*: How to evaluate language modeling capability
- *Key concepts*: Next-token prediction accuracy as a metric
11. **`tasks/*.py`** - Task-specific evaluations
- `tasks/arc.py` - Reasoning benchmark
- `tasks/gsm8k.py` - Math word problems
- `tasks/humaneval.py` - Code generation
- `tasks/mmlu.py` - General knowledge
- `tasks/smoltalk.py` - Conversational ability
- *What to learn*: How to evaluate LLMs on different capabilities
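Bits per byte (item 9) is just a unit conversion on the cross-entropy loss, sketched below under the assumption that you know how many bytes a token covers on average; the exact bookkeeping in `nanochat/loss_eval.py` may differ.
```python
import math

def bits_per_byte(mean_loss_nats: float, bytes_per_token: float) -> float:
    # PyTorch cross-entropy is in nats/token; divide by ln(2) for bits/token,
    # then by the average bytes a token covers to get bits/byte
    bits_per_token = mean_loss_nats / math.log(2)
    return bits_per_token / bytes_per_token

# example: a loss of 3.0 nats/token at ~4.8 bytes/token gives ~0.90 bits/byte;
# perplexity, by contrast, is simply math.exp(mean_loss_nats)
print(bits_per_byte(3.0, 4.8))
```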
#### **Phase 5: Inference & Serving**
12. **`nanochat/engine.py`** ⭐ **CORE FILE**
- *What to learn*: Efficient text generation with KV caching
- *Key concepts*:
- KV cache for fast autoregressive generation
- Sampling strategies (temperature, top-k; sketched after this list)
- Tool use (calculator integration)
- Batch generation
- *Cool feature*: The calculator tool demonstrates how LLMs can use tools during generation
13. **`scripts/chat_cli.py`** - Command-line chat interface
- *What to learn*: How to build a simple chat interface
- *Try it*: `python -m scripts.chat_cli -p "Why is the sky blue?"`
14. **`scripts/chat_web.py`** - Web-based chat interface
- *What to learn*: How to serve an LLM over HTTP
- *Try it*: `python -m scripts.chat_web` (after training)
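To complement item 12, here is a minimal autoregressive sampling loop with temperature and top-k. Note that it re-runs the model on the full prefix at every step; avoiding exactly that recomputation is what the KV cache in `nanochat/engine.py` is for. The function and argument names are illustrative, and the model is assumed to return `(batch, time, vocab)` logits.
```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sample(model, ids: torch.Tensor, max_new_tokens=50, temperature=0.8, top_k=50):
    # ids: (1, T) prompt token ids
    for _ in range(max_new_tokens):
        logits = model(ids)[:, -1, :] / max(temperature, 1e-8)  # only the last position matters
        if top_k is not None:
            v, _ = torch.topk(logits, top_k)
            logits[logits < v[:, [-1]]] = float("-inf")          # keep only the top-k logits
        probs = F.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)        # sample one token
        ids = torch.cat([ids, next_id], dim=1)                   # no KV cache: full recompute next step
    return ids
```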
#### **Phase 6: Advanced Training**
15. **`scripts/mid_train.py`** - Midtraining
- *What to learn*: Teaching the model special tokens and conversational format
- *Key insight*: Bridge between pretraining and task-specific finetuning
16. **`scripts/chat_sft.py`** - Supervised Fine-Tuning
- *What to learn*: Adapting the model to follow instructions
- *Key concepts*: Instruction tuning, chat templates (a toy template render follows this list)
17. **`scripts/chat_rl.py`** - Reinforcement Learning
- *What to learn*: Using RL to improve specific capabilities (math)
- *Key concepts*: Reward signals (e.g., answer correctness on math problems), policy optimization
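To see what "special tokens and conversational format" (items 15-16) means in practice, here is a toy renderer that flattens a conversation into one training string using made-up delimiter tokens; nanochat's actual special tokens and loss-masking rules are defined by its tokenizer and the SFT script.
```python
# Hypothetical special tokens; the real ones come from nanochat's tokenizer.
BOS, USER, ASSISTANT, END = "<|bos|>", "<|user|>", "<|assistant|>", "<|end|>"

def render_conversation(messages: list[dict]) -> str:
    """Flatten [{'role': ..., 'content': ...}, ...] into one training string.
    During SFT, the loss is typically computed only on assistant tokens."""
    out = [BOS]
    for m in messages:
        tag = USER if m["role"] == "user" else ASSISTANT
        out.append(f"{tag}{m['content']}{END}")
    return "".join(out)

print(render_conversation([
    {"role": "user", "content": "Why is the sky blue?"},
    {"role": "assistant", "content": "Rayleigh scattering."},
]))
```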
#### **Phase 7: Infrastructure**
18. **`nanochat/checkpoint_manager.py`** - Model checkpointing
- *What to learn*: How to save and load model weights efficiently (a bare-bones sketch follows this list)
19. **`nanochat/report.py`** - Automated reporting
- *What to learn*: How to track experiments and generate reports
20. **`nanochat/configurator.py`** - Configuration management
- *What to learn*: Command-line argument parsing for ML experiments
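Item 18 boils down to persisting and restoring weights plus optimizer state. Below is a bare-bones sketch using plain `torch.save`/`torch.load`; the real `nanochat/checkpoint_manager.py` adds its own file layout and metadata.
```python
import torch

def save_checkpoint(path, model, optimizer, step):
    torch.save({
        "model": model.state_dict(),           # weights
        "optimizer": optimizer.state_dict(),   # momentum / Adam moments, etc.
        "step": step,
    }, path)

def load_checkpoint(path, model, optimizer):
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"]                        # resume from this step
```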
### Key Architectural Decisions & Why
1. **Rotary Embeddings instead of learned positional embeddings**
- *Why?*: Better length generalization, no extra parameters
- *Where?*: `gpt.py` - see the `apply_rotary_emb()` function
2. **Untied embeddings** (separate input and output embedding matrices)
- *Why?*: More expressive, worth the extra parameters
- *Where?*: `gpt.py` - `GPT` class has separate `wte` and `lm_head` parameters
3. **QK Normalization**
- *Why?*: Training stability, prevents attention logits from exploding
- *Where?*: `gpt.py` - in `CausalSelfAttention.forward()` after rotary embeddings
4. **Multi-Query Attention (MQA)**
- *Why?*: Faster inference with minimal quality loss
- *Where?*: `gpt.py` - `GPTConfig` has separate `n_head` and `n_kv_head`, see `repeat_kv()` function
5. **ReLU² activation**
- *Why?*: Better than GELU for smaller models, simple and effective
- *Where?*: `gpt.py` - `MLP.forward()` uses `F.relu(x).square()`
6. **Dual optimizer strategy** (Muon + AdamW)
- *Why?*: Matrix parameters and embeddings benefit from different optimization
- *Where?*: `gpt.py` - see `GPT.setup_optimizers()` method
7. **Logit soft-capping**
- *Why?*: Prevents extreme logit values, improves training stability
- *Where?*: `gpt.py` - in `GPT.forward()`, search for "softcap" (the idea is sketched just after this list)
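Two of the decisions above, QK normalization and logit soft-capping, reduce to near one-liners. The sketch below shows the general idea only; the exact normalization and cap value used in `gpt.py` may differ.
```python
import torch
import torch.nn.functional as F

def qk_norm(q: torch.Tensor, k: torch.Tensor):
    # normalize queries and keys so their dot products (attention logits) stay bounded
    return F.normalize(q, dim=-1), F.normalize(k, dim=-1)

def softcap(logits: torch.Tensor, cap: float = 15.0) -> torch.Tensor:
    # smoothly squash logits into (-cap, cap); the cap value here is an assumption
    return cap * torch.tanh(logits / cap)
```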
### The Complete Pipeline Visualized
```
1. Data Preparation
├─ Download FineWeb shards (dataset.py)
├─ Train BPE tokenizer (tok_train.py)
└─ Tokenize data on-the-fly (dataloader.py)
2. Pretraining
├─ Initialize model (gpt.py)
├─ Setup optimizers (muon.py, adamw.py)
├─ Train on tokens (base_train.py)
└─ Evaluate on CORE (base_eval.py)
3. Midtraining
├─ Load base checkpoint
├─ Train on formatted data (mid_train.py)
└─ Evaluate on chat tasks (chat_eval.py)
4. Fine-tuning
├─ Supervised learning (chat_sft.py)
├─ [Optional] RL training (chat_rl.py)
└─ Final evaluation (chat_eval.py)
5. Deployment
├─ Load best checkpoint
├─ Serve via CLI (chat_cli.py)
└─ Serve via Web (chat_web.py)
```
### Concepts to Master
As you read through the code, make sure you understand these fundamental concepts:
**Tokenization:**
- Why we need tokenization
- How BPE works (greedy merge of most frequent pairs)
- Special tokens and their purpose
**Model Architecture:**
- Self-attention mechanism (Q, K, V matrices)
- Causal masking (can only attend to past tokens; see the small example after this list)
- Residual connections (x + attention(x))
- Layer normalization (RMSNorm variant)
- Why we stack many layers
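If the attention bullets above feel abstract, these few lines compute single-head causal attention explicitly, including the lower-triangular mask that blocks attention to future positions (production code uses the fused `scaled_dot_product_attention` kernel instead):
```python
import torch
import torch.nn.functional as F

B, T, D = 1, 5, 8                              # batch, sequence length, head dim
q, k, v = (torch.randn(B, T, D) for _ in range(3))

scores = q @ k.transpose(-2, -1) / D ** 0.5    # (B, T, T) scaled dot products
mask = torch.tril(torch.ones(T, T, dtype=torch.bool))
scores = scores.masked_fill(~mask, float("-inf"))  # position t only sees positions <= t
weights = F.softmax(scores, dim=-1)            # each row sums to 1 over allowed positions
out = weights @ v                              # (B, T, D) weighted sum of values
```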
**Training:**
- Gradient descent and backpropagation
- Loss function (cross-entropy for next token prediction; a worked example follows this list)
- Learning rate schedules (warmup + cosine decay)
- Gradient accumulation (simulating larger batches)
- Mixed precision training (bfloat16 for speed)
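"Cross-entropy for next token prediction" concretely means shifting the targets by one position: the logits at position t are scored against the token at position t+1. A small worked example, with random tensors standing in for real model outputs:
```python
import torch
import torch.nn.functional as F

B, T, V = 2, 6, 100
tokens = torch.randint(0, V, (B, T))           # one training sequence per row
logits = torch.randn(B, T, V)                  # would come from the model

pred_logits = logits[:, :-1, :]                # predictions at positions 0..T-2
targets = tokens[:, 1:]                        # the "next token" at each of those positions
loss = F.cross_entropy(
    pred_logits.reshape(-1, V),                # flatten to (B*(T-1), V)
    targets.reshape(-1),                       # flatten to (B*(T-1),)
)
print(loss)                                    # average nats per predicted token
```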
**Distributed Training:**
- Data parallelism (same model, different data shards)
- Gradient synchronization across GPUs
- All-reduce operations (sketched just below)
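Gradient synchronization is conceptually simple: after every backward pass, each GPU's gradients are summed across ranks and divided by the number of ranks, so every replica steps with the same averaged gradient. DDP does this automatically and in overlapped buckets; the explicit loop below is only a conceptual sketch for an already-initialized process group.
```python
import torch.distributed as dist

def average_gradients(model):
    # what DDP effectively does after backward(): all-reduce (sum) each gradient
    # across ranks, then divide by the number of ranks to get the average
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size
```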
**Inference:**
- Autoregressive generation (one token at a time)
- KV caching (reuse past computations)
- Sampling strategies (temperature, top-k)
### Recommended Experiments
Once you've read through the code, try these experiments to deepen understanding:
1. **Modify the tokenizer vocabulary size** - See how it affects compression and training
2. **Change model depth** - Train a smaller/larger model, observe parameter count vs. performance
3. **Experiment with batch sizes** - Understand the speed/memory tradeoff
4. **Try different sampling temperatures** - See how it affects generation creativity
5. **Implement a simple evaluation task** - Add your own benchmark in `tasks/`
6. **Add a new tool** - Extend the calculator to support more operations
### Quick Start for Learning
If you just want to understand the core without running anything:
1. Read `gpt.py` - Understand the Transformer architecture
2. Read `engine.py` - Understand how generation works
3. Read `base_train.py` - Understand the training loop
These three files (~1000 lines total) contain the essence of how modern LLMs work.
### Resources for Deeper Learning
- **Attention paper**: "Attention Is All You Need" (Vaswani et al.)
- **GPT-2 paper**: "Language Models are Unsupervised Multitask Learners"
- **Rotary embeddings**: "RoFormer: Enhanced Transformer with Rotary Position Embedding"
- **Andrej's videos**: Neural Networks: Zero to Hero series on YouTube
- **LLM101n course**: The course this project was built for (when released)
## Contributing
nanochat is nowhere near finished. The goal is to improve the state of the art in micro models that are accessible to work with end to end on budgets of < $1000. Accessibility is not only about overall cost but also about cognitive complexity - nanochat is not an exhaustively configurable LLM "framework"; there will be no giant configuration objects, model factories, or if-then-else monsters in the code base. It is a single, cohesive, minimal, readable, hackable, maximally-forkable "strong baseline" codebase designed to run start to end and produce a concrete ChatGPT clone and its report card.