fixing all the typos to make the pull requests stop

Batch of typo fixes
Andrej 2025-10-28 13:36:07 -07:00 committed by GitHub
commit ee00f523d0
6 changed files with 10 additions and 10 deletions

@@ -8,7 +8,7 @@ This repo is a full-stack implementation of an LLM like ChatGPT in a single, cle
 ## Talk to it
-To get a sense of the endpoint of this repo, you can currently find [nanochat d32](https://github.com/karpathy/nanochat/discussions/8) hosted on [nanochat.karpathy.ai](https://nanochat.karpathy.ai/). "d32" means that this model has 32 layers in the Transformer neural network. This model has 1.9 billion parameters, it was trained on 38 billion tokens by simply running the single script [run1000.sh](run1000.sh), and the total cost of training was ~$800 (about 33 hours training time on 8XH100 GPU node). While today this is enough to outperform GPT-2 of 2019, it falls dramatically short of moden Large Language Models like GPT-5. When talking to these micro models, you'll see that they make a lot of mistakes, they are a little bit naive and silly and they hallucinate a ton, a bit like children. It's kind of amusing. But what makes nanochat unique is that it is fully yours - fully configurable, tweakable, hackable, and trained by you from start to end. To train and talk to your own, we turn to...
+To get a sense of the endpoint of this repo, you can currently find [nanochat d32](https://github.com/karpathy/nanochat/discussions/8) hosted on [nanochat.karpathy.ai](https://nanochat.karpathy.ai/). "d32" means that this model has 32 layers in the Transformer neural network. This model has 1.9 billion parameters, it was trained on 38 billion tokens by simply running the single script [run1000.sh](run1000.sh), and the total cost of training was ~$800 (about 33 hours training time on 8XH100 GPU node). While today this is enough to outperform GPT-2 of 2019, it falls dramatically short of modern Large Language Models like GPT-5. When talking to these micro models, you'll see that they make a lot of mistakes, they are a little bit naive and silly and they hallucinate a ton, a bit like children. It's kind of amusing. But what makes nanochat unique is that it is fully yours - fully configurable, tweakable, hackable, and trained by you from start to end. To train and talk to your own, we turn to...
 ## Quick start
@@ -84,7 +84,7 @@ torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- --depth=26 --d
 torchrun --standalone --nproc_per_node=8 -m scripts.mid_train -- --device_batch_size=16
 ```
-That's it! The biggest thing to pay attention to is making sure you have enough data shards to train on (the code will loop and do more epochs over the same training set otherwise, decreasing learning speed a bit), and managing your memory/VRAM, primarily by decreasing the `device_batch_size` until things fit (the scripts automatically compensates by increasing the number of gradient accumulation loops, simply turning parallel compute to sequential compute).
+That's it! The biggest thing to pay attention to is making sure you have enough data shards to train on (the code will loop and do more epochs over the same training set otherwise, decreasing learning speed a bit), and managing your memory/VRAM, primarily by decreasing the `device_batch_size` until things fit (the scripts automatically compensate by increasing the number of gradient accumulation loops, simply turning parallel compute to sequential compute).
 And a bit more about computing environments that will run nanochat:
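The line fixed above also names the mechanism worth spelling out: shrinking `device_batch_size` is compensated by running more gradient accumulation loops, so the effective batch size stays the same. A minimal sketch of that arithmetic; the variable names and numbers below are illustrative assumptions, not the repo's actual configuration:

```python
# Illustrative sketch: keep the effective batch size fixed while shrinking the per-device batch.
total_batch_size = 524288   # assumed target tokens per optimizer step
device_batch_size = 16      # sequences that fit in VRAM on one GPU
world_size = 8              # number of GPUs (ranks)
seq_len = 2048              # tokens per sequence

tokens_per_fwdbwd = device_batch_size * seq_len * world_size
assert total_batch_size % tokens_per_fwdbwd == 0
grad_accum_steps = total_batch_size // tokens_per_fwdbwd  # halving device_batch_size doubles this
print(f"grad_accum_steps = {grad_accum_steps}")           # parallel compute traded for sequential
```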
@@ -95,7 +95,7 @@ And a bit more about computing environments that will run nanochat:
 ## Running on CPU / MPS
-nanochat cn be run on CPU or on MPS (if you're on Macbook), and will automatically try to detect what device is best to run on. You're not going to get too far without GPUs, but at least you'll be able to run the code paths and maybe train a tiny LLM with some patience. For an example of how to make all the run commands much smaller (feel free to tune!), you can refer to [dev/runcpu.sh](dev/runcpu.sh) file. You'll see that I'm essentially restricting all scripts to train smaller models, to run for shorter number of iterations, etc. This functionality is new, slightly gnarly (touched a lot of code), and was merged in this [CPU|MPS PR](https://github.com/karpathy/nanochat/pull/88) on Oct 21, 2025.
+nanochat can be run on CPU or on MPS (if you're on Macbook), and will automatically try to detect what device is best to run on. You're not going to get too far without GPUs, but at least you'll be able to run the code paths and maybe train a tiny LLM with some patience. For an example of how to make all the run commands much smaller (feel free to tune!), you can refer to [dev/runcpu.sh](dev/runcpu.sh) file. You'll see that I'm essentially restricting all scripts to train smaller models, to run for shorter number of iterations, etc. This functionality is new, slightly gnarly (touched a lot of code), and was merged in this [CPU|MPS PR](https://github.com/karpathy/nanochat/pull/88) on Oct 21, 2025.
 ## Customization
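For the CPU/MPS hunk above, automatic device detection in PyTorch typically follows the pattern below; this is a generic sketch, not necessarily the exact logic nanochat uses:

```python
import torch

def autodetect_device() -> str:
    # Prefer CUDA, then Apple's MPS backend (Macbooks), then plain CPU.
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"

print(f"running on {autodetect_device()}")
```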
@@ -181,7 +181,7 @@ python -m pytest tests/test_rustbpe.py -v -s
 │ ├── gsm8k.py # 8K Grade School Math questions
 │ ├── humaneval.py # Misnomer; Simple Python coding task
 │ ├── mmlu.py # Multiple choice questions, broad topics
-│ ├── smoltalk.py # Conglomarate dataset of SmolTalk from HF
+│ ├── smoltalk.py # Conglomerate dataset of SmolTalk from HF
 │ └── spellingbee.py # Task teaching model to spell/count letters
 ├── tests
 │ └── test_rustbpe.py
@@ -190,7 +190,7 @@ python -m pytest tests/test_rustbpe.py -v -s
 ## Contributing
-nanochat is nowhere finished. The goal is to improve the state of the art in micro models that are accessible to work with end to end on budgets of < $1000 dollars. Accessibility is about overall cost but also about cognitive complexity - nanochat is not an exhaustively configurable LLM "framework"; there will be no giant configuration objects, model factories, or if-then-else monsters in the code base. It is a single, cohesive, minimal, readable, hackable, maximally-forkable "strong baseline" codebase designed to run start to end and produce a concrete ChatGPT clone and its report card.
+nanochat is nowhere near finished. The goal is to improve the state of the art in micro models that are accessible to work with end to end on budgets of < $1000 dollars. Accessibility is about overall cost but also about cognitive complexity - nanochat is not an exhaustively configurable LLM "framework"; there will be no giant configuration objects, model factories, or if-then-else monsters in the code base. It is a single, cohesive, minimal, readable, hackable, maximally-forkable "strong baseline" codebase designed to run start to end and produce a concrete ChatGPT clone and its report card.
 Current LLM policy: disclosure. When submitting a PR, please declare any parts that had substantial LLM contribution and that you have not written or that you do not fully understand.

@@ -17,7 +17,7 @@ prompt:
 2. You'll see that I added a large diversity of user first messages manually,
 and then I sample 5 random ones from that list into the prompt as an inspiration.
 This is really important to do because DIVERSITY CONTROL is key. If you don't
-manually inject diversity, the LLM might generate extrremely similar and repeptitive
+manually inject diversity, the LLM might generate extremely similar and repetitive
 conversations and things won't work well. Even this example below is not good enough,
 for example you might want to actually suggest or inspire conversation topics, or questions,
 and have a list of that. Basically, this is the KEY creative part to get right. Make sure you
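The diversity-control idea in the hunk above (sample 5 random hand-written user openers into the generation prompt) can be sketched as follows; the seed messages and prompt template here are made up for illustration:

```python
import random

# Hypothetical seed list; the real script keeps a much larger hand-written set.
seed_first_messages = [
    "Can you explain how photosynthesis works?",
    "Write a haiku about the ocean.",
    "What's a good way to start learning the guitar?",
    "Help me debug an IndexError in my Python script.",
    "Tell me a fun fact about space.",
    "How do I make a simple tomato soup?",
]

rng = random.Random()  # seed per sample if reproducibility is desired
inspiration = rng.sample(seed_first_messages, 5)

prompt = (
    "Generate a realistic multi-turn conversation between a user and an assistant.\n"
    "Some example user openers, for inspiration only (do not copy verbatim):\n"
    + "\n".join(f"- {m}" for m in inspiration)
)
print(prompt)
```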

@@ -65,7 +65,7 @@ def evaluate_model(model, tokenizer, device, max_per_task=-1):
 data = [json.loads(line.strip()) for line in f]
 # shuffle the data because in many cases it appears ordered but we want
-# the abillity to only run a subset of the data for debugging purposes etc.
+# the ability to only run a subset of the data for debugging purposes etc.
 shuffle_rng = random.Random(1337)
 shuffle_rng.shuffle(data)
 if max_per_task > 0:

@@ -271,7 +271,7 @@ for step in range(num_iterations + 1):
 loss = loss / grad_accum_steps # each .backward() is a grad sum => normalize loss here
 loss.backward()
 x, y = next(train_loader) # prefetch the next batch while the GPU is busy with forward/backward
-# gradient clipping (TODO possibly expertiment with)
+# gradient clipping (TODO possibly experiment with)
 if grad_clip > 0.0:
 torch.nn.utils.clip_grad_norm_(orig_model.parameters(), grad_clip)
 # step the optimizers
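The context around this fix shows why the loss is divided by `grad_accum_steps`: each `.backward()` call sums into the same gradient buffers. A self-contained toy version of that accumulate-then-clip step; the model, data, and hyperparameters are stand-ins, not nanochat's:

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
grad_accum_steps, grad_clip = 4, 1.0

def next_batch():  # toy stand-in for the training data loader
    x = torch.randn(16, 8)
    return x, x.sum(dim=1, keepdim=True)

x, y = next_batch()                                   # first micro-batch fetched up front
for micro_step in range(grad_accum_steps):
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss = loss / grad_accum_steps                    # .backward() sums grads, so pre-scale the loss
    loss.backward()
    x, y = next_batch()                               # "prefetch" the next micro-batch
if grad_clip > 0.0:
    torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
optimizer.step()
optimizer.zero_grad(set_to_none=True)
```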

@@ -117,7 +117,7 @@ def run_categorical_eval(task_object, tokenizer, model, batch_size, max_problems
 logits = model(prompt_ids) # (B, T, V)
 # Focus on the available answer on just the letters corresponding to choices
-# Note that this helps the evaluation a lot because it specifically narrows the focus to only the avilable letters
+# Note that this helps the evaluation a lot because it specifically narrows the focus to only the available letters
 # The much harder alternative would be to just generate from the Assistant and check if it responded with the correct
 # letter (e.g. A, B, C, D), but evaluations typically make the task easier in this way.
 for idx, conversation in enumerate(conversations):
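The comments fixed above describe restricting the prediction to the letters of the available choices rather than generating freely. A toy sketch of that narrowing step; the fake vocabulary and the single-token-letter assumption are illustrative only:

```python
import torch

vocab = {ch: i for i, ch in enumerate("ABCDEFGH")}   # pretend each letter is one token
letter_ids = [vocab[ch] for ch in "ABCD"]            # only the available answer letters

logits = torch.randn(len(vocab))                     # stand-in for logits at the final position
choice_logits = logits[torch.tensor(letter_ids)]     # narrow the focus to A..D only
pred = "ABCD"[int(choice_logits.argmax())]           # argmax over choices, not the full vocab
print("predicted letter:", pred)
```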

@@ -206,7 +206,7 @@ def get_lr_multiplier(it):
 lrm = 1.0 - it / num_steps
 return lrm
-# Calculate the number of examples each rank handles to achive the desired examples_per_step
+# Calculate the number of examples each rank handles to achieve the desired examples_per_step
 print0(f"Total sequences per step: {examples_per_step * num_samples}") # total batch size in sequences/step
 assert examples_per_step % ddp_world_size == 0, "Desired examples per step must be divisible by the number of ranks"
 examples_per_rank = examples_per_step // ddp_world_size # per GPU
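The assert in this hunk exists because the desired global `examples_per_step` is split evenly across DDP ranks. A concrete arithmetic example with made-up numbers:

```python
# Made-up numbers to illustrate the split of examples across DDP ranks.
examples_per_step = 32   # desired global batch, in examples
ddp_world_size = 8       # number of GPUs / ranks
assert examples_per_step % ddp_world_size == 0, "must divide evenly across ranks"
examples_per_rank = examples_per_step // ddp_world_size
print(examples_per_rank)  # -> 4 examples handled by each rank per step
```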