Add 'For Students' section with structured learning path through the codebase

parent 5fd0b13886 · commit b230ab8a0b · README.md (+248 lines)
I haven't invested too much here but some tests exist, especially for the tokenizer:

```
python -m pytest tests/test_rustbpe.py -v -s
```

## For Students

nanochat is designed as an educational, full-stack LLM implementation. If you're learning how modern language models work, from tokenization to deployment, this section will guide you through the codebase systematically.

### Learning Path

The best way to understand nanochat is to follow the same order as the training pipeline. Here's the recommended reading sequence:

#### **Phase 1: Foundations (Start Here)**

1. **`nanochat/common.py`** - Common utilities, distributed setup, logging
   - *What to learn*: How distributed training is initialized, basic helper functions
   - *Key concepts*: DDP (Distributed Data Parallel), device management, logging patterns

2. **`nanochat/tokenizer.py`** - Text tokenization and the BPE algorithm
   - *What to learn*: How text becomes numbers that neural networks can process
   - *Key concepts*: Byte Pair Encoding (BPE), vocabulary, special tokens (see the BPE sketch after this list)
   - *Related*: `rustbpe/src/lib.rs` (Rust implementation for speed)

3. **`scripts/tok_train.py`** - Tokenizer training script
   - *What to learn*: How to train a tokenizer from scratch on your dataset
   - *Try it*: Run `python -m scripts.tok_train --max_chars=2000000000` (after downloading data)

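To make the BPE idea concrete, here is a minimal, illustrative training loop: repeatedly find the most frequent adjacent pair of tokens and merge it into a new token. This is a toy sketch of the algorithm only; the real implementation lives in `nanochat/tokenizer.py` and `rustbpe/src/lib.rs`.

```python
from collections import Counter

def train_bpe(text: str, num_merges: int) -> dict:
    """Toy BPE trainer: greedily merge the most frequent adjacent pair."""
    ids = list(text.encode("utf-8"))        # start from raw bytes (0..255)
    merges = {}                             # (id, id) -> new token id
    next_id = 256
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))  # count adjacent token pairs
        if not pairs:
            break
        pair = max(pairs, key=pairs.get)    # most frequent pair wins
        merges[pair] = next_id
        out, i = [], 0
        while i < len(ids):                 # rewrite the stream with the merge
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
                out.append(next_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids, next_id = out, next_id + 1
    return merges

print(train_bpe("aaabdaaabac", num_merges=3))
```
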
#### **Phase 2: Model Architecture**

4. **`nanochat/gpt.py`** ⭐ **CORE FILE**
   - *What to learn*: The Transformer architecture with modern improvements
   - *Key concepts*:
     - Rotary embeddings (RoPE) for positional encoding
     - QK normalization for training stability
     - Multi-Query Attention (MQA) for efficient inference
     - ReLU² activation function
     - RMSNorm (no learnable parameters)
   - *Architecture highlights* (see the sketch after this item):
     - `CausalSelfAttention`: The attention mechanism
     - `MLP`: Feed-forward network with ReLU² activation
     - `Block`: One transformer layer (attention + MLP)
     - `GPT`: The full model putting it all together

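As a taste of how compact these pieces are, here is a sketch of an `MLP` block with the ReLU² activation and a parameter-free RMSNorm. Layer names and sizes here are illustrative, not copied from `nanochat/gpt.py`:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def rms_norm(x: torch.Tensor) -> torch.Tensor:
    # RMSNorm with no learnable parameters: rescale by the root mean square
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + 1e-6)

class MLP(nn.Module):
    """Feed-forward block with the ReLU² activation: relu(x) squared."""
    def __init__(self, n_embd: int):
        super().__init__()
        self.c_fc = nn.Linear(n_embd, 4 * n_embd, bias=False)
        self.c_proj = nn.Linear(4 * n_embd, n_embd, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.c_proj(F.relu(self.c_fc(x)).square())

x = torch.randn(2, 8, 64)            # (batch, sequence, embedding)
print(MLP(64)(rms_norm(x)).shape)    # torch.Size([2, 8, 64])
```
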

5. **`nanochat/muon.py`** and **`nanochat/adamw.py`** - Optimizers
   - *What to learn*: How different parameters need different optimization strategies
   - *Key insight*: Muon optimizer for matrix parameters, AdamW for embeddings
   - *Why dual optimizers?*: Different parameter types benefit from different update rules (see the sketch below)

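Mechanically, the dual-optimizer setup is just two optimizers over disjoint parameter groups. A toy sketch, using SGD with momentum as a stand-in for Muon (which is not part of `torch.optim`):

```python
import torch
import torch.nn as nn

class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.wte = nn.Embedding(100, 32)          # embedding table
        self.fc = nn.Linear(32, 100, bias=False)  # matrix parameter

model = Toy()
# One optimizer per parameter family, each with its own hyperparameters.
opt_matrix = torch.optim.SGD([model.fc.weight], lr=0.02, momentum=0.95)
opt_embed = torch.optim.AdamW([model.wte.weight], lr=3e-4)

loss = model.fc(model.wte(torch.randint(100, (4, 8)))).mean()
loss.backward()
for opt in (opt_matrix, opt_embed):               # step and reset both
    opt.step()
    opt.zero_grad(set_to_none=True)
```
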
#### **Phase 3: Data & Training**

6. **`nanochat/dataset.py`** - Dataset downloading and preparation
   - *What to learn*: How to download and manage large training datasets (FineWeb)
   - *Key concepts*: Data sharding, streaming, efficient storage

7. **`nanochat/dataloader.py`** - Data loading during training
   - *What to learn*: How to efficiently feed data to the model during training
   - *Key concepts*: Tokenization on-the-fly, distributed data loading, batching (see the batching sketch after this list)

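The essential contract of any LLM dataloader is to turn a long token stream into `(inputs, targets)` batches where the targets are the inputs shifted one position. A toy random-offset version (nanochat's actual loader streams shards and is distributed-aware):

```python
import torch

def get_batch(tokens: torch.Tensor, B: int, T: int):
    # Sample B random windows of length T; targets are shifted by one.
    ix = torch.randint(len(tokens) - T - 1, (B,))
    x = torch.stack([tokens[i : i + T] for i in ix])          # inputs
    y = torch.stack([tokens[i + 1 : i + T + 1] for i in ix])  # next tokens
    return x, y

tokens = torch.arange(1000)    # stand-in for a tokenized corpus
x, y = get_batch(tokens, B=4, T=8)
print(x.shape, y.shape)        # torch.Size([4, 8]) torch.Size([4, 8])
```
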

8. **`scripts/base_train.py`** ⭐ **CORE FILE**
   - *What to learn*: The complete pretraining loop
   - *Key concepts*:
     - Gradient accumulation for large batch sizes
     - Mixed precision training (bfloat16)
     - Learning rate schedules
     - Checkpointing
     - Distributed training coordination
   - *Try it*: Read through the main training loop starting from `for step in range(num_iterations + 1):` (a toy version of the loop follows below)

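To see how those pieces fit together, here is a toy loop showing gradient accumulation, a warmup-plus-cosine learning rate schedule, and bfloat16 autocast. All shapes and hyperparameters are made up; the real loop in `scripts/base_train.py` adds distributed coordination, checkpointing, and evaluation:

```python
import math
import torch

model = torch.nn.Linear(16, 16)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
num_iterations, warmup, accum_steps, base_lr = 100, 10, 4, 1e-3

def lr_at(step: int) -> float:
    if step < warmup:                               # linear warmup
        return base_lr * (step + 1) / warmup
    t = (step - warmup) / max(1, num_iterations - warmup)
    return 0.5 * base_lr * (1 + math.cos(math.pi * t))  # cosine decay

for step in range(num_iterations + 1):
    for group in opt.param_groups:                  # apply the schedule
        group["lr"] = lr_at(step)
    for _ in range(accum_steps):                    # accumulate micro-batch grads
        x = torch.randn(8, 16)
        with torch.autocast("cpu", dtype=torch.bfloat16):
            loss = model(x).square().mean() / accum_steps
        loss.backward()
    opt.step()
    opt.zero_grad(set_to_none=True)
```
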
#### **Phase 4: Evaluation**

9. **`nanochat/loss_eval.py`** - Training/validation loss evaluation
   - *What to learn*: How to measure model perplexity on held-out data
   - *Key concepts*: Bits per byte (BPB), perplexity (see the BPB example after this list)

10. **`nanochat/core_eval.py`** - CORE benchmark evaluation
    - *What to learn*: How to evaluate language modeling capability
    - *Key concepts*: Next-token prediction accuracy as a metric

11. **`tasks/*.py`** - Task-specific evaluations
    - `tasks/arc.py` - Reasoning benchmark
    - `tasks/gsm8k.py` - Math word problems
    - `tasks/humaneval.py` - Code generation
    - `tasks/mmlu.py` - General knowledge
    - `tasks/smoltalk.py` - Conversational ability
    - *What to learn*: How to evaluate LLMs on different capabilities

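Bits per byte normalizes cross-entropy by UTF-8 bytes instead of tokens, which makes models with different tokenizers comparable. The numbers below are made up for illustration:

```python
import math

sum_loss_nats = 2.8 * 1000  # summed next-token cross-entropy over 1000 tokens
num_bytes = 4200            # UTF-8 byte length of the same evaluation text

# Convert nats -> bits (divide by ln 2), then normalize per byte.
bpb = sum_loss_nats / (math.log(2) * num_bytes)
print(f"{bpb:.3f} bits per byte")
```
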
#### **Phase 5: Inference & Serving**

12. **`nanochat/engine.py`** ⭐ **CORE FILE**
    - *What to learn*: Efficient text generation with KV caching
    - *Key concepts*:
      - KV cache for fast autoregressive generation
      - Sampling strategies (temperature, top-k) - see the sampling sketch below
      - Tool use (calculator integration)
      - Batch generation
    - *Cool feature*: The calculator tool demonstrates how LLMs can use tools during generation

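Temperature and top-k sampling fit in a few lines. A sketch over a raw logits vector, not taken from `engine.py`:

```python
import torch

def sample_next(logits: torch.Tensor, temperature: float = 0.8, top_k: int = 50):
    # Higher temperature flattens the distribution; top-k masks out
    # everything below the k-th largest logit before sampling.
    logits = logits / max(temperature, 1e-8)
    kth = torch.topk(logits, top_k).values[..., -1, None]
    logits = logits.masked_fill(logits < kth, -float("inf"))
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)

vocab_logits = torch.randn(1, 65536)    # (batch, vocab_size)
print(sample_next(vocab_logits).shape)  # torch.Size([1, 1])
```
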

13. **`scripts/chat_cli.py`** - Command-line chat interface
    - *What to learn*: How to build a simple chat interface
    - *Try it*: `python -m scripts.chat_cli -p "Why is the sky blue?"`

14. **`scripts/chat_web.py`** - Web-based chat interface
    - *What to learn*: How to serve an LLM over HTTP
    - *Try it*: `python -m scripts.chat_web` (after training)

#### **Phase 6: Advanced Training**

15. **`scripts/mid_train.py`** - Midtraining
    - *What to learn*: Teaching the model special tokens and conversational format
    - *Key insight*: A bridge between pretraining and task-specific finetuning

16. **`scripts/chat_sft.py`** - Supervised Fine-Tuning
    - *What to learn*: Adapting the model to follow instructions
    - *Key concepts*: Instruction tuning, chat templates (see the template sketch after this list)

17. **`scripts/chat_rl.py`** - Reinforcement Learning
    - *What to learn*: Using RL to improve specific capabilities (math)
    - *Key concepts*: Reward models, policy optimization

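A chat template simply renders a conversation into one token stream with special delimiters. The delimiter names below are placeholders for illustration; the real special tokens are defined in `nanochat/tokenizer.py`:

```python
conversation = [
    {"role": "user", "content": "Why is the sky blue?"},
    {"role": "assistant", "content": "Mostly Rayleigh scattering."},
]

def render(conversation: list) -> str:
    # Wrap each message in role-specific delimiter tokens, then concatenate.
    parts = [
        f"<|{m['role']}_start|>{m['content']}<|{m['role']}_end|>"
        for m in conversation
    ]
    return "".join(parts)

print(render(conversation))
```
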
#### **Phase 7: Infrastructure**

18. **`nanochat/checkpoint_manager.py`** - Model checkpointing
    - *What to learn*: How to save and load model weights efficiently (see the sketch after this list)

19. **`nanochat/report.py`** - Automated reporting
    - *What to learn*: How to track experiments and generate reports

20. **`nanochat/configurator.py`** - Configuration management
    - *What to learn*: Command-line argument parsing for ML experiments

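At its core, checkpointing means saving and restoring `state_dict`s for the model and optimizer plus a little metadata. A bare-bones sketch (the real manager adds file layout and distributed-safety concerns on top of this idea):

```python
import torch

model = torch.nn.Linear(4, 4)
opt = torch.optim.AdamW(model.parameters())

# Save everything needed to resume training exactly where it stopped.
torch.save(
    {"model": model.state_dict(), "opt": opt.state_dict(), "step": 100},
    "ckpt.pt",
)

ckpt = torch.load("ckpt.pt")
model.load_state_dict(ckpt["model"])
opt.load_state_dict(ckpt["opt"])
print("resuming at step", ckpt["step"])
```
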
### Key Architectural Decisions & Why

1. **Rotary Embeddings instead of learned positional embeddings**
   - *Why?*: Better length generalization, no extra parameters
   - *Where?*: `gpt.py` - see the `apply_rotary_emb()` function (and the RoPE sketch after this list)

2. **Untied embeddings** (separate input and output embedding matrices)
   - *Why?*: More expressive, worth the extra parameters
   - *Where?*: `gpt.py` - the `GPT` class has separate `wte` and `lm_head` parameters

3. **QK Normalization**
   - *Why?*: Training stability; prevents attention logits from exploding
   - *Where?*: `gpt.py` - in `CausalSelfAttention.forward()`, after rotary embeddings

4. **Multi-Query Attention (MQA)**
   - *Why?*: Faster inference with minimal quality loss
   - *Where?*: `gpt.py` - `GPTConfig` has separate `n_head` and `n_kv_head`; see the `repeat_kv()` function

5. **ReLU² activation**
   - *Why?*: Better than GELU for smaller models; simple and effective
   - *Where?*: `gpt.py` - `MLP.forward()` uses `F.relu(x).square()`

6. **Dual optimizer strategy** (Muon + AdamW)
   - *Why?*: Matrix parameters and embeddings benefit from different optimization
   - *Where?*: `gpt.py` - see the `GPT.setup_optimizers()` method

7. **Logit soft-capping**
   - *Why?*: Prevents extreme logit values, improves training stability
   - *Where?*: `gpt.py` - in `GPT.forward()`, search for "softcap"

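To make decision 1 concrete, here is one common way RoPE is implemented: rotate pairs of channels by position-dependent angles. This is a generic sketch of the idea, not nanochat's exact `apply_rotary_emb()`:

```python
import torch

def rope_cos_sin(seq_len: int, head_dim: int, base: float = 10000.0):
    # Each channel pair gets its own rotation frequency.
    inv_freq = 1.0 / base ** (torch.arange(0, head_dim, 2) / head_dim)
    angles = torch.outer(torch.arange(seq_len), inv_freq)  # (T, head_dim/2)
    return angles.cos(), angles.sin()

def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor):
    # x: (batch, seq, heads, head_dim); rotate the two channel halves.
    x1, x2 = x.chunk(2, dim=-1)
    cos, sin = cos[None, :, None, :], sin[None, :, None, :]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(1, 16, 4, 32)
cos, sin = rope_cos_sin(seq_len=16, head_dim=32)
print(apply_rope(q, cos, sin).shape)  # torch.Size([1, 16, 4, 32])
```
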
### The Complete Pipeline Visualized

```
1. Data Preparation
   ├─ Download FineWeb shards (dataset.py)
   ├─ Train BPE tokenizer (tok_train.py)
   └─ Tokenize data on-the-fly (dataloader.py)

2. Pretraining
   ├─ Initialize model (gpt.py)
   ├─ Setup optimizers (muon.py, adamw.py)
   ├─ Train on tokens (base_train.py)
   └─ Evaluate on CORE (base_eval.py)

3. Midtraining
   ├─ Load base checkpoint
   ├─ Train on formatted data (mid_train.py)
   └─ Evaluate on chat tasks (chat_eval.py)

4. Fine-tuning
   ├─ Supervised learning (chat_sft.py)
   ├─ [Optional] RL training (chat_rl.py)
   └─ Final evaluation (chat_eval.py)

5. Deployment
   ├─ Load best checkpoint
   ├─ Serve via CLI (chat_cli.py)
   └─ Serve via Web (chat_web.py)
```

### Concepts to Master

As you read through the code, make sure you understand these fundamental concepts:

**Tokenization:**
- Why we need tokenization
- How BPE works (greedy merge of most frequent pairs)
- Special tokens and their purpose

**Model Architecture:**
- Self-attention mechanism (Q, K, V matrices) - see the sketch at the end of this section
- Causal masking (can only attend to past tokens)
- Residual connections (x + attention(x))
- Layer normalization (RMSNorm variant)
- Why we stack many layers

**Training:**
- Gradient descent and backpropagation
- Loss function (cross-entropy for next-token prediction)
- Learning rate schedules (warmup + cosine decay)
- Gradient accumulation (simulating larger batches)
- Mixed precision training (bfloat16 for speed)

**Distributed Training:**
- Data parallelism (same model, different data shards)
- Gradient synchronization across GPUs
- All-reduce operations

**Inference:**
- Autoregressive generation (one token at a time)
- KV caching (reuse past computations)
- Sampling strategies (temperature, top-k)

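Causal self-attention and its masking fit in a few lines of PyTorch. Toy shapes, for intuition only; the production version is `CausalSelfAttention` in `nanochat/gpt.py`:

```python
import torch
import torch.nn.functional as F

B, H, T, D = 1, 4, 8, 16                    # batch, heads, sequence, head dim
q, k, v = (torch.randn(B, H, T, D) for _ in range(3))

att = (q @ k.transpose(-2, -1)) / D**0.5    # (B, H, T, T) attention scores
mask = torch.tril(torch.ones(T, T, dtype=torch.bool))
att = att.masked_fill(~mask, -float("inf")) # causal: hide future tokens
y = att.softmax(dim=-1) @ v                 # weighted sum of values

# The fused library call computes the same thing:
y_fused = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(torch.allclose(y, y_fused, atol=1e-5))  # True
```
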
### Recommended Experiments

Once you've read through the code, try these experiments to deepen your understanding:

1. **Modify the tokenizer vocabulary size** - See how it affects compression and training
2. **Change model depth** - Train a smaller/larger model and observe parameter count vs. performance
3. **Experiment with batch sizes** - Understand the speed/memory tradeoff
4. **Try different sampling temperatures** - See how it affects generation creativity
5. **Implement a simple evaluation task** - Add your own benchmark in `tasks/`
6. **Add a new tool** - Extend the calculator to support more operations

### Quick Start for Learning

If you just want to understand the core without running anything:

1. Read `gpt.py` - Understand the Transformer architecture
2. Read `engine.py` - Understand how generation works
3. Read `base_train.py` - Understand the training loop

These three files (~1000 lines total) contain the essence of how modern LLMs work.

### Resources for Deeper Learning

- **Attention paper**: "Attention Is All You Need" (Vaswani et al.)
- **GPT-2 paper**: "Language Models are Unsupervised Multitask Learners"
- **Rotary embeddings**: "RoFormer: Enhanced Transformer with Rotary Position Embedding"
- **Andrej's videos**: The "Neural Networks: Zero to Hero" series on YouTube
- **LLM101n course**: The course this project was built for (when released)

## Contributing

nanochat is nowhere near finished. The goal is to improve the state of the art in micro models that are accessible to work with end to end on budgets under $1,000. Accessibility is about overall cost, but also about cognitive complexity - nanochat is not an exhaustively configurable LLM "framework"; there will be no giant configuration objects, model factories, or if-then-else monsters in the code base. It is a single, cohesive, minimal, readable, hackable, maximally-forkable "strong baseline" codebase designed to run start to end and produce a concrete ChatGPT clone and its report card.