Add 'For Students' section with structured learning path through the codebase

This commit is contained in:
Rimom Costa 2025-10-13 18:50:35 +01:00
parent 5fd0b13886
commit b230ab8a0b

README.md

@@ -109,6 +109,254 @@ I haven't invested too much here but some tests exist, especially for the tokeni
python -m pytest tests/test_rustbpe.py -v -s
```
## For Students
nanochat is designed as an educational full-stack LLM implementation. If you're learning how modern language models work, from tokenization to deployment, this section guides you through the codebase systematically.
### Learning Path
The best way to understand nanochat is to follow the same order as the training pipeline. Here's the recommended reading sequence:
#### **Phase 1: Foundations (Start Here)**
1. **`nanochat/common.py`** - Common utilities, distributed setup, logging
- *What to learn*: How distributed training is initialized, basic helper functions
   - *Key concepts*: DDP (Distributed Data Parallel), device management, logging patterns (a minimal setup sketch follows this list)
2. **`nanochat/tokenizer.py`** - Text tokenization and the BPE algorithm
- *What to learn*: How text becomes numbers that neural networks can process
- *Key concepts*: Byte Pair Encoding (BPE), vocabulary, special tokens
- *Related*: `rustbpe/src/lib.rs` (Rust implementation for speed)
3. **`scripts/tok_train.py`** - Tokenizer training script
- *What to learn*: How to train a tokenizer from scratch on your dataset
- *Try it*: Run `python -m scripts.tok_train --max_chars=2000000000` (after downloading data)
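To make the distributed-setup idea in `common.py` concrete, here is a minimal sketch of how a PyTorch script typically initializes DDP from the environment variables that `torchrun` sets. The function name and the single-GPU fallback are illustrative, not nanochat's exact code:

```python
import os
import torch
import torch.distributed as dist

def init_distributed():
    # torchrun sets RANK/WORLD_SIZE/LOCAL_RANK for every process it launches.
    ddp = int(os.environ.get("RANK", -1)) != -1
    if not ddp:
        # Plain `python script.py`: behave as rank 0 of a world of size 1.
        return 0, 1, torch.device("cuda" if torch.cuda.is_available() else "cpu")
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    local_rank = int(os.environ["LOCAL_RANK"])
    device = torch.device(f"cuda:{local_rank}")
    torch.cuda.set_device(device)
    return rank, world_size, device
```

Launched with `torchrun --nproc_per_node=8 ...`, each process gets its own rank and GPU; launched with plain `python`, the fallback branch runs.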
#### **Phase 2: Model Architecture**
4. **`nanochat/gpt.py`** ⭐ **CORE FILE**
- *What to learn*: The Transformer architecture with modern improvements
- *Key concepts*:
- Rotary embeddings (RoPE) for positional encoding
- QK normalization for training stability
- Multi-Query Attention (MQA) for efficient inference
- ReLU² activation function
- RMSNorm (no learnable parameters)
- *Architecture highlights*:
- `CausalSelfAttention`: The attention mechanism
     - `MLP`: Feed-forward network with ReLU² activation (sketched, together with RMSNorm, right after this list)
- `Block`: One transformer layer (attention + MLP)
- `GPT`: The full model putting it all together
5. **`nanochat/muon.py`** and **`nanochat/adamw.py`** - Optimizers
- *What to learn*: How different parameters need different optimization strategies
- *Key insight*: Muon optimizer for matrix parameters, AdamW for embeddings
- *Why dual optimizers?*: Different parameter types benefit from different update rules
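Before opening `gpt.py`, it can help to see two of the pieces above in miniature. The sketch below shows a parameter-free RMSNorm and an MLP with the ReLU² activation; the layer names are chosen for the example and may not match nanochat's exactly:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def rms_norm(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # RMSNorm with no learnable parameters: rescale by the root-mean-square.
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

class MLP(nn.Module):
    # Feed-forward block with the ReLU^2 activation: relu(x) squared.
    def __init__(self, n_embd: int):
        super().__init__()
        self.c_fc = nn.Linear(n_embd, 4 * n_embd, bias=False)
        self.c_proj = nn.Linear(4 * n_embd, n_embd, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.c_proj(F.relu(self.c_fc(x)).square())
```

A transformer `Block` then roughly composes these as `x = x + attn(norm(x))` followed by `x = x + mlp(norm(x))`.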
#### **Phase 3: Data & Training**
6. **`nanochat/dataset.py`** - Dataset downloading and preparation
- *What to learn*: How to download and manage large training datasets (FineWeb)
- *Key concepts*: Data sharding, streaming, efficient storage
7. **`nanochat/dataloader.py`** - Data loading during training
- *What to learn*: How to efficiently feed data to the model during training
- *Key concepts*: Tokenization on-the-fly, distributed data loading, batching
8. **`scripts/base_train.py`** ⭐ **CORE FILE**
- *What to learn*: The complete pretraining loop
- *Key concepts*:
- Gradient accumulation for large batch sizes
- Mixed precision training (bfloat16)
- Learning rate schedules
- Checkpointing
- Distributed training coordination
- *Try it*: Read through the main training loop starting from `for step in range(num_iterations + 1):`
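The core of that loop has a simple shape. Below is a simplified sketch of one optimization step with gradient accumulation and bfloat16 autocast; the real loop in `base_train.py` also handles DDP gradient sync, the learning rate schedule, logging and checkpointing, and the `model(x, y)`-returns-loss interface is an assumption for the example:

```python
import torch

grad_accum_steps = 8  # simulate a batch 8x larger than fits in GPU memory

def train_step(model, optimizer, get_batch, device):
    # `get_batch` is assumed to return input token ids and shifted targets.
    optimizer.zero_grad(set_to_none=True)
    for _ in range(grad_accum_steps):
        x, y = get_batch(device)
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
            loss = model(x, y)                    # cross-entropy over next tokens
        (loss / grad_accum_steps).backward()      # scale so gradients average out
    optimizer.step()
```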
#### **Phase 4: Evaluation**
9. **`nanochat/loss_eval.py`** - Training/validation loss evaluation
- *What to learn*: How to measure model perplexity on held-out data
    - *Key concepts*: Bits per byte (BPB), perplexity (see the short BPB sketch after this list)
10. **`nanochat/core_eval.py`** - CORE benchmark evaluation
- *What to learn*: How to evaluate language modeling capability
- *Key concepts*: Next-token prediction accuracy as a metric
11. **`tasks/*.py`** - Task-specific evaluations
- `tasks/arc.py` - Reasoning benchmark
- `tasks/gsm8k.py` - Math word problems
- `tasks/humaneval.py` - Code generation
- `tasks/mmlu.py` - General knowledge
- `tasks/smoltalk.py` - Conversational ability
- *What to learn*: How to evaluate LLMs on different capabilities
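One metric from this phase worth pinning down is bits per byte: convert the mean next-token cross-entropy from nats to bits, then normalize by the average number of bytes each token covers, so that models with different tokenizers remain comparable. A back-of-the-envelope sketch; the exact bookkeeping in `loss_eval.py` may differ:

```python
import math

def bits_per_byte(mean_nll_nats: float, total_tokens: int, total_bytes: int) -> float:
    # Cross-entropy is usually reported in nats/token; convert to bits/token,
    # then divide by bytes/token to get bits/byte.
    bits_per_token = mean_nll_nats / math.log(2)
    bytes_per_token = total_bytes / total_tokens
    return bits_per_token / bytes_per_token

# e.g. loss of 3.0 nats/token with a tokenizer averaging 4.5 bytes/token -> ~0.96 bits/byte
print(bits_per_byte(3.0, total_tokens=1_000, total_bytes=4_500))
```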
#### **Phase 5: Inference & Serving**
12. **`nanochat/engine.py`** ⭐ **CORE FILE**
- *What to learn*: Efficient text generation with KV caching
- *Key concepts*:
- KV cache for fast autoregressive generation
     - Sampling strategies (temperature, top-k) (see the sampling sketch after this list)
- Tool use (calculator integration)
- Batch generation
- *Cool feature*: The calculator tool demonstrates how LLMs can use tools during generation
13. **`scripts/chat_cli.py`** - Command-line chat interface
- *What to learn*: How to build a simple chat interface
- *Try it*: `python -m scripts.chat_cli -p "Why is the sky blue?"`
14. **`scripts/chat_web.py`** - Web-based chat interface
- *What to learn*: How to serve an LLM over HTTP
- *Try it*: `python -m scripts.chat_web` (after training)
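To make the sampling bullet under `engine.py` concrete, here is one common way temperature and top-k are applied to next-token logits. This is a generic sketch, not necessarily nanochat's exact implementation:

```python
import torch
import torch.nn.functional as F

def sample_next_token(logits, temperature=1.0, top_k=50):
    # logits: (vocab_size,) raw scores for the next token.
    if temperature == 0.0:
        return logits.argmax()                       # greedy decoding
    logits = logits / temperature                    # <1 sharpens, >1 flattens the distribution
    if top_k is not None:
        v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
        logits[logits < v[-1]] = -float("inf")       # mask everything outside the top k
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)   # sample one token id
```

With temperature 0 the function degenerates to greedy decoding; raising the temperature or loosening top-k makes generations more diverse and less deterministic.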
#### **Phase 6: Advanced Training**
15. **`scripts/mid_train.py`** - Midtraining
- *What to learn*: Teaching the model special tokens and conversational format
- *Key insight*: Bridge between pretraining and task-specific finetuning
16. **`scripts/chat_sft.py`** - Supervised Fine-Tuning
- *What to learn*: Adapting the model to follow instructions
    - *Key concepts*: Instruction tuning, chat templates (a toy chat-format sketch follows this list)
17. **`scripts/chat_rl.py`** - Reinforcement Learning
- *What to learn*: Using RL to improve specific capabilities (math)
- *Key concepts*: Reward models, policy optimization
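To see why midtraining and SFT revolve around special tokens and chat templates, it helps to look at how a conversation is flattened into a single token stream. The `<|...|>` token names below are made up for the example; nanochat's real special tokens are defined in `tokenizer.py`:

```python
# Purely illustrative rendering of a chat into one training string.
conversation = [
    {"role": "user", "content": "Why is the sky blue?"},
    {"role": "assistant", "content": "Because of Rayleigh scattering."},
]

def render(conv):
    parts = []
    for msg in conv:
        # Hypothetical delimiters; the real tokenizer defines its own special tokens.
        parts.append(f"<|{msg['role']}|>{msg['content']}<|end|>")
    return "".join(parts)

print(render(conversation))
```

During SFT the loss is typically computed only on the assistant's tokens, so the model learns to produce answers rather than to imitate the user.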
#### **Phase 7: Infrastructure**
18. **`nanochat/checkpoint_manager.py`** - Model checkpointing
    - *What to learn*: How to save and load model weights efficiently (a bare-bones sketch follows this list)
19. **`nanochat/report.py`** - Automated reporting
- *What to learn*: How to track experiments and generate reports
20. **`nanochat/configurator.py`** - Configuration management
- *What to learn*: Command-line argument parsing for ML experiments
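At its simplest, a checkpoint is just the model and optimizer `state_dict`s plus whatever you need to resume. A bare-bones sketch; nanochat's `checkpoint_manager.py` adds metadata and handles distributed ranks:

```python
import torch

def save_checkpoint(path, model, optimizer, step):
    torch.save({
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "step": step,
    }, path)

def load_checkpoint(path, model, optimizer):
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"]
```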
### Key Architectural Decisions & Why
1. **Rotary Embeddings instead of learned positional embeddings**
- *Why?*: Better length generalization, no extra parameters
- *Where?*: `gpt.py` - see the `apply_rotary_emb()` function
2. **Untied embeddings** (separate input and output embedding matrices)
- *Why?*: More expressive, worth the extra parameters
- *Where?*: `gpt.py` - `GPT` class has separate `wte` and `lm_head` parameters
3. **QK Normalization**
- *Why?*: Training stability, prevents attention logits from exploding
- *Where?*: `gpt.py` - in `CausalSelfAttention.forward()` after rotary embeddings
4. **Multi-Query Attention (MQA)**
- *Why?*: Faster inference with minimal quality loss
   - *Where?*: `gpt.py` - `GPTConfig` has separate `n_head` and `n_kv_head`, see `repeat_kv()` function (a toy version appears after this list)
5. **ReLU² activation**
- *Why?*: Better than GELU for smaller models, simple and effective
- *Where?*: `gpt.py` - `MLP.forward()` uses `F.relu(x).square()`
6. **Dual optimizer strategy** (Muon + AdamW)
- *Why?*: Matrix parameters and embeddings benefit from different optimization
- *Where?*: `gpt.py` - see `GPT.setup_optimizers()` method
7. **Logit soft-capping**
- *Why?*: Prevents extreme logit values, improves training stability
- *Where?*: `gpt.py` - in `GPT.forward()`, search for "softcap"
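Two of these decisions (4 and 7) are small enough to show in miniature: with MQA/GQA the few key/value heads are simply repeated to match the number of query heads, and soft-capping squashes logits through a scaled tanh. A rough sketch with an illustrative cap value; the versions in `gpt.py` may differ in detail:

```python
import torch

def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
    # x: (batch, n_kv_head, seq, head_dim) -> (batch, n_kv_head * n_rep, seq, head_dim)
    if n_rep == 1:
        return x
    return x.repeat_interleave(n_rep, dim=1)

def softcap(logits: torch.Tensor, cap: float = 15.0) -> torch.Tensor:
    # Smoothly squash logits into (-cap, cap) instead of letting them grow unbounded.
    return cap * torch.tanh(logits / cap)
```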
### The Complete Pipeline Visualized
```
1. Data Preparation
├─ Download FineWeb shards (dataset.py)
├─ Train BPE tokenizer (tok_train.py)
└─ Tokenize data on-the-fly (dataloader.py)
2. Pretraining
├─ Initialize model (gpt.py)
├─ Setup optimizers (muon.py, adamw.py)
├─ Train on tokens (base_train.py)
└─ Evaluate on CORE (base_eval.py)
3. Midtraining
├─ Load base checkpoint
├─ Train on formatted data (mid_train.py)
└─ Evaluate on chat tasks (chat_eval.py)
4. Fine-tuning
├─ Supervised learning (chat_sft.py)
├─ [Optional] RL training (chat_rl.py)
└─ Final evaluation (chat_eval.py)
5. Deployment
├─ Load best checkpoint
├─ Serve via CLI (chat_cli.py)
└─ Serve via Web (chat_web.py)
```
### Concepts to Master
As you read through the code, make sure you understand these fundamental concepts:
**Tokenization:**
- Why we need tokenization
- How BPE works (greedy merge of most frequent pairs)
- Special tokens and their purpose
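A toy version of BPE training, greedily merging the most frequent adjacent pair, fits in a few lines. Real tokenizers such as `rustbpe` operate on bytes and are far more efficient, but the idea is the same:

```python
from collections import Counter

def bpe_train(text: str, num_merges: int):
    tokens = list(text)  # start from characters here; real BPE starts from bytes
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]   # most frequent adjacent pair
        merges.append((a, b))
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                merged.append(a + b)               # replace the pair with one new token
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, merges

print(bpe_train("low lower lowest", 3))
```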
**Model Architecture:**
- Self-attention mechanism (Q, K, V matrices)
- Causal masking (can only attend to past tokens)
- Residual connections (x + attention(x))
- Layer normalization (RMSNorm variant)
- Why we stack many layers
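The attention and masking bullets above reduce to a few lines of tensor code. A minimal single-head, scaled dot-product attention with a causal mask:

```python
import math
import torch
import torch.nn.functional as F

def causal_attention(q, k, v):
    # q, k, v: (seq_len, head_dim). Each position may attend only to itself and the past.
    T = q.size(0)
    scores = q @ k.transpose(0, 1) / math.sqrt(q.size(-1))        # (T, T) similarities
    mask = torch.triu(torch.ones(T, T), diagonal=1).bool()        # strictly upper triangle
    scores = scores.masked_fill(mask, float("-inf"))              # hide the future
    return F.softmax(scores, dim=-1) @ v                          # weighted sum of values

x = torch.randn(5, 16)
out = causal_attention(x, x, x)  # self-attention: queries, keys, values from the same x
```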
**Training:**
- Gradient descent and backpropagation
- Loss function (cross-entropy for next token prediction)
- Learning rate schedules (warmup + cosine decay)
- Gradient accumulation (simulating larger batches)
- Mixed precision training (bfloat16 for speed)
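The warmup-plus-cosine schedule is just a small function of the step number. A generic sketch, since nanochat's exact schedule and constants may differ:

```python
import math

def get_lr(step, max_lr, warmup_steps, total_steps, min_lr=0.0):
    if step < warmup_steps:
        # Linear warmup from ~0 up to max_lr.
        return max_lr * (step + 1) / warmup_steps
    # Cosine decay from max_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```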
**Distributed Training:**
- Data parallelism (same model, different data shards)
- Gradient synchronization across GPUs
- All-reduce operations
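Gradient synchronization boils down to an all-reduce: after the backward pass, every rank averages its gradients with all the others so that every replica takes the identical optimizer step. DDP does this for you, but written out by hand it would look roughly like:

```python
import torch.distributed as dist

def all_reduce_gradients(model, world_size):
    # Average gradients across ranks; assumes init_process_group() was called earlier.
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size
```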
**Inference:**
- Autoregressive generation (one token at a time)
- KV caching (reuse past computations)
- Sampling strategies (temperature, top-k)
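Finally, the KV-cache idea in one loop: without a cache, each new token reruns attention over the whole prefix; with one, past keys and values are stored and only the newest token's are computed. The `model(ids, kv_cache=...)` interface below is an assumption for the sketch, not nanochat's actual API:

```python
import torch

@torch.no_grad()
def generate(model, tokens, max_new_tokens, temperature=0.8):
    # tokens: (batch, seq) of token ids; model returns (logits, kv_cache) by assumption.
    kv_cache = None
    for _ in range(max_new_tokens):
        # With a cache, only the most recent token needs to be fed after the first step.
        ids = tokens if kv_cache is None else tokens[:, -1:]
        logits, kv_cache = model(ids, kv_cache=kv_cache)
        probs = torch.softmax(logits[:, -1, :] / temperature, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        tokens = torch.cat([tokens, next_token], dim=1)
    return tokens
```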
### Recommended Experiments
Once you've read through the code, try these experiments to deepen understanding:
1. **Modify the tokenizer vocabulary size** - See how it affects compression and training
2. **Change model depth** - Train a smaller/larger model, observe parameter count vs. performance
3. **Experiment with batch sizes** - Understand the speed/memory tradeoff
4. **Try different sampling temperatures** - See how it affects generation creativity
5. **Implement a simple evaluation task** - Add your own benchmark in `tasks/`
6. **Add a new tool** - Extend the calculator to support more operations
### Quick Start for Learning
If you just want to understand the core without running anything:
1. Read `gpt.py` - Understand the Transformer architecture
2. Read `engine.py` - Understand how generation works
3. Read `base_train.py` - Understand the training loop
These three files (~1000 lines total) contain the essence of how modern LLMs work.
### Resources for Deeper Learning
- **Attention paper**: "Attention Is All You Need" (Vaswani et al.)
- **GPT-2 paper**: "Language Models are Unsupervised Multitask Learners"
- **Rotary embeddings**: "RoFormer: Enhanced Transformer with Rotary Position Embedding"
- **Andrej's videos**: Neural Networks: Zero to Hero series on YouTube
- **LLM101n course**: The course this project was built for (when released)
## Contributing
nanochat is nowhere finished. The goal is to improve the state of the art in micro models that are accessible to work with end to end on budgets of < $1000 dollars. Accessibility is about overall cost but also about cognitive complexity - nanochat is not an exhaustively configurable LLM "framework"; there will be no giant configuration objects, model factories, or if-then-else monsters in the code base. It is a single, cohesive, minimal, readable, hackable, maximally-forkable "strong baseline" codebase designed to run start to end and produce a concrete ChatGPT clone and its report card.