# Educational Guide to nanochat

This folder contains a comprehensive educational guide to understanding and building your own Large Language Model (LLM) from scratch, using nanochat as a reference implementation.

## What's Included

This guide covers everything from mathematical foundations to practical implementation:

### 📚 Core Materials

- `01_introduction.md` - Overview of nanochat and the LLM training pipeline
- `02_mathematical_foundations.md` - All the math you need (linear algebra, probability, optimization)
- `03_tokenization.md` - Byte Pair Encoding (BPE) algorithm with detailed code walkthrough (a toy sketch of one merge step follows this list)
- `04_transformer_architecture.md` - GPT model architecture and components
- `05_attention_mechanism.md` - Deep dive into self-attention with implementation details
- `06_training_process.md` - Complete training pipeline from data loading to checkpointing
- `07_optimization.md` - Advanced optimizers (Muon + AdamW) with detailed explanations
- `08_putting_it_together.md` - Practical implementation guide and debugging tips
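
To give a flavor of what the tokenization chapter covers, here is a toy sketch of a single BPE merge step: count adjacent pairs, then replace the most frequent pair with a new token id. This is an illustrative simplification, not nanochat's actual tokenizer, which the chapter walks through in detail.

```python
# Toy sketch of one BPE merge step (illustrative only -- not nanochat's tokenizer).
from collections import Counter

def most_frequent_pair(ids):
    """Count adjacent token-id pairs and return the most common one."""
    pairs = Counter(zip(ids, ids[1:]))
    return max(pairs, key=pairs.get)

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

ids = list("aaabdaaabac".encode("utf-8"))  # start from raw UTF-8 bytes
pair = most_frequent_pair(ids)             # the most common adjacent pair
ids = merge(ids, pair, 256)                # 256 = first id beyond the byte range
print(pair, ids)
```

Training a real tokenizer just repeats this merge step until the target vocabulary size is reached; chapter 03 covers the full implementation.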

### 🎯 Who This Is For

- **Beginners**: Start from first principles with clear explanations
- **Intermediate**: Deep dive into implementation details and code
- **Advanced**: Learn cutting-edge techniques (RoPE, Muon, MQA)

## How to Use This Guide

### Sequential Reading (Recommended for Beginners)

Read in order from 01 to 08. Each section builds on the previous ones:

Introduction → Math → Tokenization → Architecture → Attention → Training → Optimization → Implementation

### Topic-Based Reading (For Experienced Practitioners)

Jump directly to topics of interest:

- Want to understand tokenization? → Read `03_tokenization.md`
- Need to implement attention? → Read `05_attention_mechanism.md`
- Optimizing training? → Read `07_optimization.md`

### Code Walkthrough (Best for Implementation)

Read alongside the nanochat codebase:

- Read a section (e.g., "Transformer Architecture")
- Open the corresponding file (`nanochat/gpt.py`)
- Follow along with the code examples
- Modify and experiment

## Compiling to PDF

To create a single PDF document from all sections:

    cd educational
    python compile_to_pdf.py

This will generate `nanochat_educational_guide.pdf`.

**Requirements:**

- Python 3.7+
- pandoc
- LaTeX distribution (e.g., TeX Live, MiKTeX)

**Install dependencies:**

    # macOS
    brew install pandoc
    brew install basictex  # or MacTeX for full distribution

    # Ubuntu/Debian
    sudo apt-get install pandoc texlive-full

    # Python packages
    pip install pandoc
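
For a rough idea of what the compilation step does, the sketch below concatenates the chapter files and shells out to pandoc. It is only a sketch, assuming pandoc and a LaTeX engine are on your PATH; the actual `compile_to_pdf.py` may order chapters, set options, and handle a title page differently.

```python
# Minimal sketch of the PDF build (the real compile_to_pdf.py may differ).
import subprocess
from pathlib import Path

# Collect the numbered chapter files in order (01_... through 08_...).
chapters = sorted(Path(".").glob("0*_*.md"))
combined = "\n\n".join(p.read_text(encoding="utf-8") for p in chapters)
Path("_combined.md").write_text(combined, encoding="utf-8")

# Requires pandoc and a LaTeX engine (xelatex here) to be installed.
subprocess.run(
    ["pandoc", "_combined.md",
     "-o", "nanochat_educational_guide.pdf",
     "--pdf-engine=xelatex", "--toc"],
    check=True,
)
```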

## Key Features of This Guide

### 🎓 Educational Approach

- **From first principles**: Assumes only basic Python and math knowledge
- **Progressive complexity**: Start simple, build up gradually
- **Concrete examples**: Real code from nanochat, not pseudocode

### 💻 Code-Focused

- **Deep code explanations**: Every important function is explained line-by-line
- **Implementation patterns**: Learn best practices and design patterns
- **Debugging tips**: Common pitfalls and how to avoid them

### 🔬 Comprehensive

- **Mathematical foundations**: Understand the "why" behind every technique
- **Modern techniques**: RoPE, MQA, Muon optimizer, softcapping (a small softcapping sketch follows this list)
- **Full pipeline**: From raw text to deployed chatbot
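
As a taste of the techniques listed above, logit softcapping is simply a smooth clamp applied to logits so they cannot grow without bound. The cap value below (15.0) is an illustrative choice; the value nanochat uses and where the cap is applied are covered in the relevant chapters.

```python
# Sketch of logit softcapping: squashes values smoothly into (-cap, cap).
# The cap of 15.0 is illustrative, not necessarily nanochat's setting.
import torch

def softcap(logits: torch.Tensor, cap: float = 15.0) -> torch.Tensor:
    return cap * torch.tanh(logits / cap)

x = torch.tensor([-100.0, -1.0, 0.0, 1.0, 100.0])
print(softcap(x))  # large magnitudes saturate near +/-15; small values pass ~unchanged
```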

### 🚀 Practical

- **Runnable examples**: All code can be tested immediately
- **Optimization tips**: Make training fast and efficient
- **Scaling guidance**: From toy models to production systems

## What You'll Learn

By the end of this guide, you'll understand:

- ✅ How tokenization works (BPE algorithm)
- ✅ Transformer architecture in detail
- ✅ Self-attention mechanism (with RoPE, MQA)
- ✅ Training loop and data pipeline
- ✅ Advanced optimization (Muon + AdamW)
- ✅ Mixed precision training (BF16)
- ✅ Distributed training (DDP)
- ✅ Evaluation and metrics
- ✅ How to implement your own LLM
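
As a preview of the self-attention item above, the core computation fits in a few lines of PyTorch. This is a generic single-head causal attention sketch, not nanochat's implementation; RoPE, MQA, and the multi-head, batched version are what the attention chapter builds on top of this.

```python
# Generic causal self-attention for one head -- a sketch, not nanochat's code.
import torch
import torch.nn.functional as F

T, D = 8, 16                 # sequence length, head dimension
q = torch.randn(T, D)        # queries
k = torch.randn(T, D)        # keys
v = torch.randn(T, D)        # values

scores = (q @ k.T) / D**0.5                        # (T, T) scaled dot products
mask = torch.triu(torch.ones(T, T), diagonal=1).bool()
scores = scores.masked_fill(mask, float("-inf"))   # causal: no peeking at future tokens
weights = F.softmax(scores, dim=-1)                # each row sums to 1
out = weights @ v                                  # weighted mix of values, shape (T, D)
print(out.shape)
```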

## Prerequisites

**Essential:**

- Python programming
- Basic linear algebra (matrices, vectors, dot products)
- Basic calculus (derivatives, chain rule)
- Basic probability (distributions)

**Helpful but not required:**

- PyTorch basics
- Deep learning fundamentals
- Familiarity with Transformers

## Additional Resources

### Papers

- *Attention Is All You Need* - Original Transformer
- *Language Models are Few-Shot Learners* - GPT-3
- *Training Compute-Optimal Large Language Models* - Chinchilla scaling laws

## Contributing

Found an error or want to improve the guide?

- Open an issue on the main nanochat repository
- Suggest improvements or clarifications
- Share what topics you'd like to see covered

## License

This educational material follows the same MIT license as nanochat.

## Acknowledgments

This guide is based on the nanochat implementation by Andrej Karpathy. The code examples in the chapters are taken from the nanochat repository.

Special thanks to the open-source community for making LLM education accessible!

Happy learning! 🚀

If you find this guide helpful, please star the nanochat repository!