readme tweak, link to new discussion and add file structure

2025-12-06 12:22:18 +00:00 · 2025-10-25 19:39:16 +00:00 · 2025-10-25 19:39:16 +00:00 · c75fe54aa7
commit c75fe54aa7
parent 05a051dbe9
1 changed files with 67 additions and 0 deletions
--- a/README.md
+++ b/README.md
@ -101,6 +101,8 @@ nanochat cn be run on CPU or on MPS (if you're on Macbook), and will automatical

 To customize your nanochat, see [Guide: infusing identity to your nanochat](https://github.com/karpathy/nanochat/discussions/139) in Discussions, which describes how you can tune your nanochat's personality through synthetic data generation and mixing that data into midtraining and SFT stages.

+Additionally, to add new abilities to nanochat, see [Guide: counting r in strawberry (and how to add abilities generally)](https://github.com/karpathy/nanochat/discussions/164).
+
 ## Questions

 nanochat is designed to be short and sweet. One big advantage of this is that we can package up all of the files together and copy paste them to your favorite LLM to ask arbitrary questions. As an example, I like to package up the repo using the [files-to-prompt](https://github.com/simonw/files-to-prompt) utility like so:
@ -121,6 +123,71 @@ I haven't invested too much here but some tests exist, especially for the tokeni
 python -m pytest tests/test_rustbpe.py -v -s
 ```

+## File structure
+
+```
+.
+├── LICENSE
+├── README.md
+├── dev
+│   ├── gen_synthetic_data.py       # Example synthetic data for identity
+│   ├── generate_logo.html
+│   ├── nanochat.png
+│   ├── repackage_data_reference.py # Pretraining data shard generation
+│   └── runcpu.sh                   # Small example of how to run on CPU/MPS
+├── nanochat
+│   ├── __init__.py                 # empty
+│   ├── adamw.py                    # Distributed AdamW optimizer
+│   ├── checkpoint_manager.py       # Save/Load model checkpoints
+│   ├── common.py                   # Misc small utilities, quality of life
+│   ├── configurator.py             # A superior alternative to argparse
+│   ├── core_eval.py                # Evaluates base model CORE score (DCLM paper)
+│   ├── dataloader.py               # Tokenizing Distributed Data Loader
+│   ├── dataset.py                  # Download/read utils for pretraining data
+│   ├── engine.py                   # Efficient model inference with KV Cache
+│   ├── execution.py                # Allows the LLM to execute Python code as tool
+│   ├── gpt.py                      # The GPT nn.Module Transformer
+│   ├── logo.svg
+│   ├── loss_eval.py                # Evaluate bits per byte (instead of loss)
+│   ├── muon.py                     # Distributed Muon optimizer
+│   ├── report.py                   # Utilities for writing the nanochat Report
+│   ├── tokenizer.py                # BPE Tokenizer wrapper in style of GPT-4
+│   └── ui.html                     # HTML/CSS/JS for nanochat frontend
+├── pyproject.toml
+├── run1000.sh                      # Train the ~$800 nanochat d32
+├── rustbpe                         # Custom Rust BPE tokenizer trainer
+│   ├── Cargo.lock
+│   ├── Cargo.toml
+│   ├── README.md                   # see for why this even exists
+│   └── src
+│       └── lib.rs
+├── scripts
+│   ├── base_eval.py                # Base model: calculate CORE score
+│   ├── base_loss.py                # Base model: calculate bits per byte, sample
+│   ├── base_train.py               # Base model: train
+│   ├── chat_cli.py                 # Chat model (SFT/Mid): talk to over CLI
+│   ├── chat_eval.py                # Chat model (SFT/Mid): eval tasks
+│   ├── chat_rl.py                  # Chat model (SFT/Mid): reinforcement learning
+│   ├── chat_sft.py                 # Chat model: train SFT
+│   ├── chat_web.py                 # Chat model (SFT/Mid): talk to over WebUI
+│   ├── mid_train.py                # Chat model: midtraining
+│   ├── tok_eval.py                 # Tokenizer: evaluate compression rate
+│   └── tok_train.py                # Tokenizer: train it
+├── speedrun.sh                     # Train the ~$100 nanochat d20
+├── tasks
+│   ├── arc.py                      # Multiple choice science questions
+│   ├── common.py                   # TaskMixture | TaskSequence
+│   ├── customjson.py               # Make Task from arbitrary jsonl convos
+│   ├── gsm8k.py                    # 8K Grade School Math questions
+│   ├── humaneval.py                # Misnomer; Simple Python coding task
+│   ├── mmlu.py                     # Multiple choice questions, broad topics
+│   ├── smoltalk.py                 # Conglomarate dataset of SmolTalk from HF
+│   └── spellingbee.py              # Task teaching model to spell/count letters
+├── tests
+│   └── test_rustbpe.py
+└── uv.lock
+```
+
 ## Contributing

 nanochat is nowhere finished. The goal is to improve the state of the art in micro models that are accessible to work with end to end on budgets of < $1000 dollars. Accessibility is about overall cost but also about cognitive complexity - nanochat is not an exhaustively configurable LLM "framework"; there will be no giant configuration objects, model factories, or if-then-else monsters in the code base. It is a single, cohesive, minimal, readable, hackable, maximally-forkable "strong baseline" codebase designed to run start to end and produce a concrete ChatGPT clone and its report card.