update: speedrundiy.sh pipeline runs end to end

This commit is contained in:
Liu Jiang 2026-03-05 12:36:07 +08:00
parent b07604ebaa
commit c4d3727ba8
15 changed files with 1941 additions and 2800 deletions

6
.env Normal file
View File

@ -0,0 +1,6 @@
NANOCHAT_BASE_DIR="/media/data/liujiang/data/datasets/nanochat_base_dir"
HF_ENDPOINT=https://hf-mirror.com
HF_HUB_ENABLE_HF_HUB_CACHE=true
HF_HUB_CACHE=$NANOCHAT_BASE_DIR/hf_hub_cache
HF_DATASETS_CACHE=$NANOCHAT_BASE_DIR/hf_datasets_cache

2
.gitignore vendored
View File

@ -6,7 +6,7 @@ report.md
eval_bundle/
# Secrets
.env
# .env
# Local setup
CLAUDE.md

View File

@ -1 +1 @@
3.10
3.12

193
README.md
View File

@ -1,182 +1,11 @@
# nanochat
![nanochat logo](dev/nanochat.png)
![scaling laws](dev/scaling_laws_jan26.png)
nanochat is the simplest experimental harness for training LLMs. It is designed to run on a single GPU node, the code is minimal/hackable, and it covers all major LLM stages including tokenization, pretraining, finetuning, evaluation, inference, and a chat UI. For example, you can train your own GPT-2 capability LLM (which cost ~$43,000 to train in 2019) for only $72 (~3 hours of 8XH100 GPU node) and then talk to it in a familiar ChatGPT-like web UI. On a spot instance, the total cost can be closer to ~$20. More generally, nanochat is configured out of the box to train an entire miniseries of compute-optimal models by setting one single complexity dial: `--depth`, the number of layers in the GPT transformer model (GPT-2 capability happens to be approximately depth 26). All other hyperparameters (the width of the transformer, number of heads, learning rate adjustments, training horizons, weight decays, ...) are calculated automatically in an optimal way.
For questions about the repo, I recommend either using [DeepWiki](https://deepwiki.com/karpathy/nanochat) from Devin/Cognition to ask questions about the repo, or use the [Discussions tab](https://github.com/karpathy/nanochat/discussions), or come by the [#nanochat](https://discord.com/channels/1020383067459821711/1427295580895314031) channel on Discord.
## Time-to-GPT-2 Leaderboard
Presently, the main focus of development is on tuning the pretraining stage, which consumes the most compute. Inspired by the modded-nanogpt repo, and to incentivise progress and community collaboration, nanochat maintains a leaderboard for a "GPT-2 speedrun": the wall-clock time required to train a nanochat model to GPT-2 grade capability, as measured by the DCLM CORE score. The [runs/speedrun.sh](runs/speedrun.sh) script always reflects the reference way to train a GPT-2 grade model and talk to it. The current leaderboard looks as follows:
| # | time | val_bpb | CORE | Description | Date | Commit | Contributors |
|---|-------------|---------|------|-------------|------|--------|--------------|
| 0 | 168 hours | - | 0.2565 | Original OpenAI GPT-2 checkpoint | 2019 | - | OpenAI |
| 1 | 3.04 | 0.74833 | 0.2585 | d24 baseline, slightly overtrained | Jan 29 2026 | 348fbb3 | @karpathy |
| 2 | 2.91 | 0.74504 | 0.2578 | d26 slightly undertrained **+fp8** | Feb 2 2026 | a67eba3 | @karpathy |
| 3 | 2.76 | 0.74645 | 0.2602 | bump total batch size to 1M tokens | Feb 5 2026 | 2c062aa | @karpathy |
The primary metric we care about is "time to GPT-2": the wall-clock time needed to outperform the GPT-2 (1.6B) CORE metric on an 8XH100 GPU node. The GPT-2 CORE score is 0.256525. In 2019, training GPT-2 cost approximately $43,000, so it is remarkable that, thanks to many advances across the stack over 7 years, we can now do it much faster and for well below $100 (e.g. at the current ~$3/GPU/hr, an 8XH100 node is ~$24/hr, so 3 hours is ~$72).
See [dev/LEADERBOARD.md](dev/LEADERBOARD.md) for more docs on how to interpret and contribute to the leaderboard.
## Getting started
### Reproduce and talk to GPT-2
The most fun you can have is to train your own GPT-2 and talk to it. The entire pipeline to do so is contained in the single file [runs/speedrun.sh](runs/speedrun.sh), which is designed to be run on an 8XH100 GPU node. Boot up a new 8XH100 GPU box from your favorite provider (e.g. I use and like [Lambda](https://lambda.ai/service/gpu-cloud)), and kick off the training script:
```bash
bash runs/speedrun.sh
```
You may wish to do so in a screen session as this will take ~3 hours to run. Once it's done, you can talk to it via the ChatGPT-like web UI. Make sure again that your local uv virtual environment is active (run `source .venv/bin/activate`), and serve it:
```bash
python -m scripts.chat_web
```
And then visit the URL shown. Make sure to access it correctly, e.g. on Lambda use the public IP of the node you're on, followed by the port, so for example [http://209.20.xxx.xxx:8000/](http://209.20.xxx.xxx:8000/), etc. Then talk to your LLM as you'd normally talk to ChatGPT! Get it to write stories or poems. Ask it to tell you who you are to see a hallucination. Ask it why the sky is blue. Or why it's green. The speedrun is a 4e19 FLOPs capability model so it's a bit like talking to a kindergartener :).
---
<img width="2672" height="1520" alt="image" src="https://github.com/user-attachments/assets/ed39ddf8-2370-437a-bedc-0f39781e76b5" />
---
A few more notes:
- The code will run just fine on an Ampere 8XA100 GPU node as well, just a bit slower.
- All code will run just fine on even a single GPU by omitting `torchrun`, and will produce ~identical results (the code will automatically switch to gradient accumulation), but you'll have to wait 8 times longer.
- If your GPU(s) have less than 80GB, you'll have to tune some of the hyperparameters or you will OOM / run out of VRAM. Look for `--device_batch_size` in the scripts and reduce it until things fit, e.g. from 32 (default) to 16, 8, 4, 2, or even 1. Below that, you'll need to know a bit more about what you're doing and get more creative.
- Most of the code is fairly vanilla PyTorch so it should run on anything that supports it - xpu, mps, etc. - but I haven't personally exercised all of these code paths so there might be sharp edges.
## Research
If you are a researcher and wish to help improve nanochat, two scripts of interest are [runs/scaling_laws.sh](runs/scaling_laws.sh) and [runs/miniseries.sh](runs/miniseries.sh). See [Jan 7 miniseries v1](https://github.com/karpathy/nanochat/discussions/420) for related documentation. For quick experimentation (~5 min pretraining runs) my favorite scale is to train a 12-layer model (GPT-1 sized), e.g. like this:
```
OMP_NUM_THREADS=1 torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- \
--depth=12 \
--run="d12" \
--model-tag="d12" \
--core-metric-every=999999 \
--sample-every=-1 \
--save-every=-1 \
```
This uses wandb (run name "d12"), only runs the CORE metric on the last step, and doesn't sample or save intermediate checkpoints. I like to change something in the code, re-run a d12 (or a d16, etc.) and see if it helped, in an iteration loop. To see if a run helps, I like to monitor the wandb plots for:
1. `val_bpb` (validation loss in vocab-size-invariant units of bits per byte) as a function of `step`, `total_training_time` and `total_training_flops`.
2. `core_metric` (the DCLM CORE score)
3. VRAM utilization, `train/mfu` (Model FLOPS utilization), `train/tok_per_sec` (training throughput)
See an example [here](https://github.com/karpathy/nanochat/pull/498#issuecomment-3850720044).
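For intuition, `val_bpb` can be recovered from a mean cross-entropy loss; a minimal sketch (the function name and exact normalization here are assumptions for illustration, not nanochat's API):

```python
import math

def nats_to_bits_per_byte(mean_loss_nats, num_tokens, num_bytes):
    # total loss over the split in nats -> bits, normalized by the raw
    # byte count of the underlying text rather than by token count
    total_bits = mean_loss_nats * num_tokens / math.log(2)
    return total_bits / num_bytes
```

Because the unit is per byte of raw text rather than per token, the metric stays comparable across tokenizers with different vocabulary sizes.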
The important thing to note is that nanochat is written and configured around one single dial of complexity - the depth of the transformer. This single integer automatically determines all other hyperparameters (the width of the transformer, number of heads, learning rate adjustments, training horizons, weight decays, ...) so that the trained model comes out compute optimal. The idea is that the user doesn't have to think about or set any of this, they are simply asking for a smaller or bigger model using `--depth`, and everything "just works". By sweeping out the depth, you achieve the nanochat miniseries of compute optimal models at various sizes. GPT-2 capability model (which is of most interest at the moment) happens to be somewhere around d24-d26 range with the current code. But any candidate changes to the repo have to be principled enough that they work for all settings of depth.
## Running on CPU / MPS
The script [runs/runcpu.sh](runs/runcpu.sh) shows a very simple example of running on CPU or Apple Silicon. It dramatically shrinks the LLM being trained so that training fits into a reasonable window of a few tens of minutes. You will not get strong results this way.
## Guides
I've published a number of guides that might contain helpful information, most recent to least recent:
- [Feb 1 2026: Beating GPT-2 for <<$100: the nanochat journey](https://github.com/karpathy/nanochat/discussions/481)
- [Jan 7 miniseries v1](https://github.com/karpathy/nanochat/discussions/420) documents the first nanochat miniseries of models.
- To add new abilities to nanochat, see [Guide: counting r in strawberry (and how to add abilities generally)](https://github.com/karpathy/nanochat/discussions/164).
- To customize your nanochat, see [Guide: infusing identity to your nanochat](https://github.com/karpathy/nanochat/discussions/139) in Discussions, which describes how you can tune your nanochat's personality through synthetic data generation and mixing that data into the SFT stage.
- [Oct 13 2025: original nanochat post](https://github.com/karpathy/nanochat/discussions/1) introducing nanochat, though now it contains some deprecated information and the model is a lot older (with worse results) than current master.
## File structure
```
.
├── LICENSE
├── README.md
├── dev
│ ├── gen_synthetic_data.py # Example synthetic data for identity
│ ├── generate_logo.html
│ ├── nanochat.png
│ └── repackage_data_reference.py # Pretraining data shard generation
├── nanochat
│ ├── __init__.py # empty
│ ├── checkpoint_manager.py # Save/Load model checkpoints
│ ├── common.py # Misc small utilities, quality of life
│ ├── core_eval.py # Evaluates base model CORE score (DCLM paper)
│ ├── dataloader.py # Tokenizing Distributed Data Loader
│ ├── dataset.py # Download/read utils for pretraining data
│ ├── engine.py # Efficient model inference with KV Cache
│ ├── execution.py # Allows the LLM to execute Python code as tool
│ ├── gpt.py # The GPT nn.Module Transformer
│ ├── logo.svg
│ ├── loss_eval.py # Evaluate bits per byte (instead of loss)
│ ├── optim.py # AdamW + Muon optimizer, 1GPU and distributed
│ ├── report.py # Utilities for writing the nanochat Report
│ ├── tokenizer.py # BPE Tokenizer wrapper in style of GPT-4
│ └── ui.html # HTML/CSS/JS for nanochat frontend
├── pyproject.toml
├── runs
│ ├── miniseries.sh # Miniseries training script
│ ├── runcpu.sh # Small example of how to run on CPU/MPS
│ ├── scaling_laws.sh # Scaling laws experiments
│ └── speedrun.sh # Train the ~$100 nanochat d20
├── scripts
│ ├── base_eval.py # Base model: CORE score, bits per byte, samples
│ ├── base_train.py # Base model: train
│ ├── chat_cli.py # Chat model: talk to over CLI
│ ├── chat_eval.py # Chat model: eval tasks
│ ├── chat_rl.py # Chat model: reinforcement learning
│ ├── chat_sft.py # Chat model: train SFT
│ ├── chat_web.py # Chat model: talk to over WebUI
│ ├── tok_eval.py # Tokenizer: evaluate compression rate
│ └── tok_train.py # Tokenizer: train it
├── tasks
│ ├── arc.py # Multiple choice science questions
│ ├── common.py # TaskMixture | TaskSequence
│ ├── customjson.py # Make Task from arbitrary jsonl convos
│ ├── gsm8k.py # 8K Grade School Math questions
│ ├── humaneval.py # Misnomer; Simple Python coding task
│ ├── mmlu.py # Multiple choice questions, broad topics
│ ├── smoltalk.py # Conglomerate dataset of SmolTalk from HF
│ └── spellingbee.py # Task teaching model to spell/count letters
├── tests
│ └── test_engine.py
└── uv.lock
```
## Contributing
The goal of nanochat is to improve the state of the art in micro models that are accessible to work with end to end on budgets of less than $1000. Accessibility is about overall cost but also about cognitive complexity - nanochat is not an exhaustively configurable LLM "framework"; there are no giant configuration objects, model factories, or if-then-else monsters in the code base. It is a single, cohesive, minimal, readable, hackable, maximally-forkable "strong baseline" codebase designed to run start to end and produce a ChatGPT model you can talk to. Currently, the most interesting part to me personally is driving down the time to GPT-2 (i.e. getting a CORE score above 0.256525). This currently takes ~3 hours, but by improving the pretraining stage we can push it down further.
Current AI policy: disclosure. When submitting a PR, please declare any parts that had substantial LLM contribution and that you have not written or that you do not fully understand.
## Acknowledgements
- The name (nanochat) derives from my earlier project [nanoGPT](https://github.com/karpathy/nanoGPT), which only covered pretraining.
- nanochat is also inspired by [modded-nanoGPT](https://github.com/KellerJordan/modded-nanogpt), which gamified the nanoGPT repo with clear metrics and a leaderboard, and borrows a lot of its ideas and some implementation for pretraining.
- Thank you to [HuggingFace](https://huggingface.co/) for fineweb and smoltalk.
- Thank you [Lambda](https://lambda.ai/service/gpu-cloud) for the compute used in developing this project.
- Thank you to chief LLM whisperer 🧙‍♂️ Alec Radford for advice/guidance.
- Thank you to the repo czar Sofie [@svlandeg](https://github.com/svlandeg) for help with managing issues, pull requests and discussions of nanochat.
## Cite
If you find nanochat helpful in your research, cite it simply as:
```bibtex
@misc{nanochat,
author = {Andrej Karpathy},
title = {nanochat: The best ChatGPT that \$100 can buy},
year = {2025},
publisher = {GitHub},
url = {https://github.com/karpathy/nanochat}
}
```
## License
MIT
### DIY pipeline (single GPU, sh)
- `python -m nanochat.report reset`
- `python -m scripts.tok_train --max_chars=2000000000`
- `python -m scripts.tok_eval`
- `torchrun --standalone --nproc_per_node=1 -m scripts.base_train -- --depth=18 --device-batch-size=1`
- `torchrun --standalone --nproc_per_node=1 -m scripts.base_eval -- --device-batch-size=1`
- `torchrun --standalone --nproc_per_node=1 -m scripts.chat_sft -- --device-batch-size=1`
- `torchrun --standalone --nproc_per_node=1 -m scripts.chat_eval -- -i sft`
- `python -m scripts.chat_cli -p "Why is the sky blue?"`
- `python -m scripts.chat_web`
- `python -m nanochat.report generate`

View File

@ -6,9 +6,12 @@ import os
import re
import logging
import urllib.request
from dotenv import load_dotenv
from filelock import FileLock
import torch
import torch.distributed as dist
from filelock import FileLock
class ColoredFormatter(logging.Formatter):
"""Custom formatter that adds colors to log messages."""
@ -47,15 +50,24 @@ def setup_default_logging():
setup_default_logging()
logger = logging.getLogger(__name__)
def get_base_dir():
# co-locate nanochat intermediates with other cached data in ~/.cache (by default)
load_dotenv() # load variables from .env into the environment
def get_base_dir() -> str:
"""Base directory for nanochat's data storage.
Returns:
str: path to the base directory
"""
if os.environ.get("NANOCHAT_BASE_DIR"):
nanochat_dir = os.environ.get("NANOCHAT_BASE_DIR")
nanochat_dir = os.environ.get("NANOCHAT_BASE_DIR", "")
assert os.path.isdir(nanochat_dir), f"NANOCHAT_BASE_DIR is set to {nanochat_dir} but that is not a directory"
else:
home_dir = os.path.expanduser("~")
cache_dir = os.path.join(home_dir, ".cache")
nanochat_dir = os.path.join(cache_dir, "nanochat")
os.makedirs(nanochat_dir, exist_ok=True)
os.makedirs(nanochat_dir, exist_ok=True)
return nanochat_dir
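The resolution order above can be sketched standalone; `resolve_base_dir` is a hypothetical mirror of `get_base_dir()` (without the directory-exists assertion), not the repo's function:

```python
import os

def resolve_base_dir():
    # prefer an explicit NANOCHAT_BASE_DIR, else fall back to ~/.cache/nanochat
    env_dir = os.environ.get("NANOCHAT_BASE_DIR")
    if env_dir:
        return env_dir
    return os.path.join(os.path.expanduser("~"), ".cache", "nanochat")
```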
def download_file_with_lock(url, filename, postprocess_fn=None):

View File

@ -101,8 +101,16 @@ def find_common_length(token_sequences, direction='left'):
return min_len
def stack_sequences(tokens, pad_token_id):
"""Stack up a list of token sequences, pad to longest on the right"""
def stack_sequences(tokens, pad_token_id) -> torch.Tensor:
"""Pad each token sequence to the length of the longest and stack them into one batch.
Args:
tokens (List[List[int]]): list of token id sequences
pad_token_id (int): token id used for padding
Returns:
input_ids (torch.LongTensor): padded token id tensor of shape (batch_size, max_seq_len)
"""
bsz, seq_len = len(tokens), max(len(x) for x in tokens)
input_ids = torch.full((bsz, seq_len), pad_token_id, dtype=torch.long)
for i, x in enumerate(tokens):
@ -143,36 +151,56 @@ def batch_sequences_lm(tokenizer, prompts):
@torch.no_grad()
def forward_model(model, input_ids):
"""
Take BxT tensor of token ids, return BxT tensor of losses and argmax predictions.
The last column of losses is set to nan because we don't have autoregressive targets there.
"""Forward the input token ids through the model to get losses and predictions.
Args:
- model: the language model under evaluation
- input_ids: input token ids of shape (batch_size, seq_len)
Returns:
- losses: losses of shape (batch_size, seq_len)
- predictions: predictions of shape (batch_size, seq_len)
"""
batch_size, seq_len = input_ids.size()
outputs = model(input_ids)
# Roll the tensor to the left by one position to get the (autoregressive) target ids
# Build the target token ids by rolling the input sequence left by one position
target_ids = torch.roll(input_ids, shifts=-1, dims=1)
# Calculate cross entropy at all positions
# Compute the cross-entropy loss at every position
losses = torch.nn.functional.cross_entropy(
outputs.view(batch_size * seq_len, -1),
target_ids.view(batch_size * seq_len),
reduction='none'
).view(batch_size, seq_len)
# Set the last column to be nan because there is no autoregressive loss there
# Ignore the loss at the last position, which has no corresponding target token
losses[:, -1] = float('nan')
# Get the argmax predictions at each position
# Take the argmax of the output logits to get the predicted token id at each position
predictions = outputs.argmax(dim=-1)
return losses, predictions
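The roll-by-one target construction can be illustrated without torch; `roll_left` is a hypothetical helper mimicking `torch.roll(x, shifts=-1, dims=1)` on one sequence:

```python
def roll_left(seq):
    # torch.roll(x, shifts=-1) semantics: element i+1 moves to position i,
    # and the first element wraps around to the end of the sequence
    return seq[1:] + seq[:1]
```

The wrapped-around first element is exactly why the loss at the last column is set to nan: its "target" is not a real autoregressive continuation.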
@torch.no_grad()
def evaluate_example(idx, model, tokenizer, data, device, task_meta):
"""Evaluate a single example, return True if correct, False otherwise"""
def evaluate_example(idx, model, tokenizer, data, device, task_meta) -> bool:
"""Evaluate one example and return True/False; supports multiple-choice, schema, and language-modeling tasks.
Args:
idx: index of the example in the data
model: the language model under evaluation
tokenizer: the tokenizer used for evaluation
data: list of evaluation examples, each a dict representing one question
device: the evaluation device (CPU/GPU)
task_meta: task metadata dict with the fields:
- task_type: task type string, one of 'multiple_choice', 'schema', 'language_modeling'
- num_fewshot: number of few-shot examples to sample per question, int
- continuation_delimiter: delimiter between context and continuation in the prompt, str
Returns:
is_correct: whether the prediction for this example is correct, bool
"""
item = data[idx]
task_type = task_meta['task_type']
num_fewshot = task_meta['num_fewshot']
continuation_delimiter = task_meta['continuation_delimiter']
# Sample few-shot examples (excluding current item)
# Sample few-shot demonstration examples
fewshot_examples = []
if num_fewshot > 0:
rng = random.Random(1234 + idx)
@ -180,7 +208,7 @@ def evaluate_example(idx, model, tokenizer, data, device, task_meta):
fewshot_indices = rng.sample(available_indices, num_fewshot)
fewshot_examples = [data[i] for i in fewshot_indices]
# Render prompts and batch sequences based on task type
# Render the question and few-shot demonstrations into full prompts and tokenize them into model input token sequences
if task_type == 'multiple_choice':
prompts = render_prompts_mc(item, continuation_delimiter, fewshot_examples)
tokens, start_idxs, end_idxs = batch_sequences_mc(tokenizer, prompts)
@ -193,8 +221,7 @@ def evaluate_example(idx, model, tokenizer, data, device, task_meta):
else:
raise ValueError(f"Unsupported task type: {task_type}")
# Some models can't forward sequences beyond a certain length (e.g. GPT-2)
# In these cases, we have to truncate sequences to max length and adjust the indices
# Truncate token sequences from the left to the model's max input length and adjust the start/end indices accordingly
if hasattr(model, 'max_seq_len') and model.max_seq_len is not None:
max_tokens = model.max_seq_len
new_tokens, new_start_idxs, new_end_idxs = [], [], []
@ -212,25 +239,25 @@ def evaluate_example(idx, model, tokenizer, data, device, task_meta):
new_end_idxs.append(e)
tokens, start_idxs, end_idxs = new_tokens, new_start_idxs, new_end_idxs
# Stack up all the sequences into a batch
# Stack the token sequences into one batch and move it to the evaluation device
pad_token_id = tokenizer.get_bos_token_id() # use BOS as pad token is ok
input_ids = stack_sequences(tokens, pad_token_id)
input_ids = input_ids.to(device)
# Forward the model, get the autoregressive loss and argmax prediction at each token
# Forward pass to obtain the losses and predictions
losses, predictions = forward_model(model, input_ids)
# See if the losses/predictions come out correctly
# Language-modeling task: inputs has shape (1, seq_len), predictions has shape (1, seq_len), start_idxs and end_idxs are lists of length 1
if task_type == 'language_modeling':
# language modeling task is currently always batch size 1
si = start_idxs[0]
ei = end_idxs[0]
# predictions[i] predict input_ids[i+1] autoregressively
# Slice out the model's predictions over the answer span and compare them against the ground truth
predicted_tokens = predictions[0, si-1:ei-1]
actual_tokens = input_ids[0, si:ei]
is_correct = torch.all(predicted_tokens == actual_tokens).item()
# Multiple-choice/schema task: inputs has shape (num_options, seq_len), predictions has shape (num_options, seq_len), start_idxs and end_idxs are lists of length num_options
elif task_type in ['multiple_choice', 'schema']:
# For MC/schema: find the option with lowest average loss
# Compute the mean loss on each option, take the option with the lowest loss as the model's prediction, and compare it against the ground truth
mean_losses = [losses[i, si-1:ei-1].mean().item()
for i, (si, ei) in enumerate(zip(start_idxs, end_idxs))]
pred_idx = mean_losses.index(min(mean_losses))
@ -238,25 +265,37 @@ def evaluate_example(idx, model, tokenizer, data, device, task_meta):
else:
raise ValueError(f"Unsupported task type: {task_type}")
return is_correct
return bool(is_correct)
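The multiple-choice/schema decision rule reduces to an argmin over per-option mean losses; a hypothetical standalone sketch:

```python
def pick_option(mean_losses):
    # the option whose continuation the model finds least surprising
    # (lowest mean loss) is taken as the model's answer
    return min(range(len(mean_losses)), key=lambda i: mean_losses[i])
```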
def evaluate_task(model, tokenizer, data, device, task_meta):
"""
This function is responsible for evaluating one task across many examples.
It also handles dispatch to all processes if the script is run with torchrun.
def evaluate_task(model, tokenizer, data, device, task_meta) -> float:
"""Evaluate one task (many examples) and return its CORE score; supports multiple-choice, schema, and language-modeling tasks.
Args:
model: the language model under evaluation
tokenizer: the tokenizer used for evaluation
data: list of evaluation examples, each a dict representing one question
device: the evaluation device (CPU/GPU)
task_meta: task metadata dict with the fields:
- task_type: task type string, one of 'multiple_choice', 'schema', 'language_modeling'
- num_fewshot: number of few-shot examples to sample per question, int
- continuation_delimiter: delimiter between context and continuation in the prompt, str
Returns:
mean_correct: the task's CORE score, a float in [0.0, 1.0], the fraction of questions the model predicts correctly
"""
# Get the rank of the current process and the total number of processes
rank = dist.get_rank() if dist.is_initialized() else 0
world_size = dist.get_world_size() if dist.is_initialized() else 1
correct = torch.zeros(len(data), dtype=torch.float32, device=device)
# stride the examples to each rank
# Stride the examples across ranks so each process evaluates a different subset in parallel
for idx in range(rank, len(data), world_size):
is_correct = evaluate_example(idx, model, tokenizer, data, device, task_meta)
correct[idx] = float(is_correct)
# sync results across all the processes if running distributed
# Sum the correct tensor across all processes to get the total number of correct examples
if world_size > 1:
dist.barrier()
dist.all_reduce(correct, op=dist.ReduceOp.SUM)
# compute the mean
# The fraction of correct examples is the CORE score
mean_correct = correct.mean().item()
return mean_correct
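The rank-striding used above can be sketched on its own; `shard_indices` is a hypothetical helper, not a nanochat function:

```python
def shard_indices(num_examples, rank, world_size):
    # rank r evaluates examples r, r + world_size, r + 2*world_size, ...
    # so the ranks cover disjoint subsets whose union is all examples
    return list(range(rank, num_examples, world_size))
```

After each rank fills in its slots of the `correct` tensor, an all-reduce sum reassembles the full per-example results before taking the mean.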

View File

@ -16,26 +16,33 @@ Fallback to the original if you have very limited data AND long documents:
https://github.com/karpathy/nanochat/blob/3c3a3d7/nanochat/dataloader.py#L78-L117
"""
from typing import Iterator, Tuple
import torch
import pyarrow.parquet as pq
from nanochat.common import get_dist_info
from nanochat.dataset import list_parquet_files
def _document_batches(split, resume_state_dict, tokenizer_batch_size):
"""
Infinite iterator over document batches (list of text strings) from parquet files.
def _document_batches(split, resume_state_dict, tokenizer_batch_size, nums_parallel_files=250) -> Iterator[Tuple[list, Tuple[int, int, int]]]:
"""Generator over document batches that supports resuming from an interruption; used for pretraining train/val.
Handles DDP sharding and approximate resume. Each yield is (text_batch, (pq_idx, rg_idx, epoch))
where text_batch is a list of document strings, indices track position for resumption,
and epoch counts how many times we've cycled through the dataset (starts at 1).
Args:
split: "train" or "val"
resume_state_dict: state dict with "pq_idx", "rg_idx", "epoch" for resuming from an interruption
tokenizer_batch_size: number of documents per yielded batch
nums_parallel_files: number of parquet files used across the whole pretraining run
Yields:
tuple: (batch, (pq_idx, rg_idx, epoch)), the current batch of texts together with the corresponding parquet file index, row group index, and epoch
"""
ddp, ddp_rank, ddp_local_rank, ddp_world_size = get_dist_info()
parquet_paths = list_parquet_files()
parquet_paths = parquet_paths[:min(len(parquet_paths), nums_parallel_files)]
assert len(parquet_paths) != 0, "No dataset parquet files found, did you run dataset.py?"
# Training uses all files except the last; validation uses only the last file
parquet_paths = parquet_paths[:-1] if split == "train" else parquet_paths[-1:]
# Restore the position from the resume state dict
resume_pq_idx = resume_state_dict["pq_idx"] if resume_state_dict is not None else 0
resume_rg_idx = resume_state_dict["rg_idx"] if resume_state_dict is not None else None
resume_epoch = resume_state_dict.get("epoch", 1) if resume_state_dict is not None else 1
@ -43,29 +50,35 @@ def _document_batches(split, resume_state_dict, tokenizer_batch_size):
pq_idx = resume_pq_idx
epoch = resume_epoch
while True: # iterate infinitely (multi-epoch)
while True:
pq_idx = resume_pq_idx if first_pass else 0
while pq_idx < len(parquet_paths):
filepath = parquet_paths[pq_idx]
# pf = [
# [text1, text2, ...],
# [textN+1, textN+2, ...],
# ...]
# Each parquet file contains multiple row groups; each row group holds a batch of texts
pf = pq.ParquetFile(filepath)
# Start from resume point if resuming on same file, otherwise from DDP rank
# Compute which row group each DDP process should start from
if first_pass and (resume_rg_idx is not None) and (pq_idx == resume_pq_idx):
base_idx = resume_rg_idx // ddp_world_size
base_idx += 1 # advance by 1 so we don't repeat data after resuming
base_idx += 1 # start from the row group after the interruption point to avoid repeating data
rg_idx = base_idx * ddp_world_size + ddp_rank
if rg_idx >= pf.num_row_groups:
if rg_idx >= pf.num_row_groups: # out of range; move on to the next parquet file
pq_idx += 1
continue
resume_rg_idx = None # only do this once
else:
rg_idx = ddp_rank
# Iterate over the row groups of the current parquet file; each DDP process handles a different subset
while rg_idx < pf.num_row_groups:
rg = pf.read_row_group(rg_idx)
batch = rg.column('text').to_pylist()
for i in range(0, len(batch), tokenizer_batch_size):
rg = pf.read_row_group(rg_idx) # read one row group into a table object rg
batch = rg.column('text').to_pylist() # extract the text column as a list of texts
for i in range(0, len(batch), tokenizer_batch_size): # split the text list into smaller batches of tokenizer_batch_size texts each
yield batch[i:i+tokenizer_batch_size], (pq_idx, rg_idx, epoch)
rg_idx += ddp_world_size
pq_idx += 1
rg_idx += ddp_world_size # skip the row groups handled by other DDP processes
pq_idx += 1 # move on to the next parquet file
first_pass = False
epoch += 1
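The resume arithmetic above (`base_idx = resume_rg_idx // ddp_world_size; base_idx += 1; rg_idx = base_idx * ddp_world_size + ddp_rank`) can be checked in isolation; `resume_row_group` is a hypothetical extraction of that logic:

```python
def resume_row_group(resume_rg_idx, ddp_rank, ddp_world_size):
    # advance one full stride past the checkpointed row group so the
    # resumed run does not repeat data, then re-apply this rank's offset
    base_idx = resume_rg_idx // ddp_world_size + 1
    return base_idx * ddp_world_size + ddp_rank
```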
@ -76,24 +89,27 @@ def tokenizing_distributed_data_loader_with_state_bos_bestfit(
device="cuda", resume_state_dict=None,
buffer_size=1000
):
"""
BOS-aligned dataloader with Best-Fit Cropping.
"""Distributed data loader that packs text into fixed-length input/target sequences with a BOS-aligned best-fit algorithm; used for pretraining train/val.
Reduces token waste compared to simple greedy cropping by searching a buffer
for documents that fit well, while maintaining 100% utilization (no padding).
Algorithm for each row:
1. From buffered docs, pick the LARGEST doc that fits entirely
2. Repeat until no doc fits
3. When nothing fits, crop a doc to fill remaining space exactly
Key properties:
- Every row starts with BOS
- 100% utilization (no padding, every token is trained on)
- Approximately 35% of all tokens are discarded due to cropping
Args:
tokenizer: tokenizer object used to encode text into tokens
B: number of rows per batch (batch size) of the yielded inputs and targets
T: number of tokens per row (sequence length) of the yielded inputs and targets
split: "train" or "val", selects the dataset split
tokenizer_threads: number of threads used for encoding text
tokenizer_batch_size: number of texts per encoding batch
device: "cuda" or "cpu", the device of the loader's outputs
resume_state_dict: state dict with "pq_idx", "rg_idx", "epoch" for resuming from an interruption
buffer_size: size of the document buffer, i.e. how many documents to pull from the generator before filling rows
Yields:
Iterator[Tuple[torch.Tensor, torch.Tensor, dict]]: (inputs, targets, state_dict) tuples, where inputs and targets are tensors of shape (B, T) and state_dict holds the current "pq_idx", "rg_idx", "epoch" for resumption
"""
assert split in ["train", "val"], "split must be 'train' or 'val'"
# row = [1, 2, 3, 4..., T, T + 1]
# input = [1, 2, 3, 4..., T]
# target = [2, 3, 4, 5..., T + 1]
row_capacity = T + 1
batches = _document_batches(split, resume_state_dict, tokenizer_batch_size)
bos_token = tokenizer.get_bos_token_id()
@ -101,19 +117,22 @@ def tokenizing_distributed_data_loader_with_state_bos_bestfit(
pq_idx, rg_idx, epoch = 0, 0, 1
def refill_buffer():
"""Refill the document buffer up to the given buffer_size.
Each call pulls the next batch of texts from the batches generator, encodes them into token lists, and appends them to doc_buffer.
"""
nonlocal pq_idx, rg_idx, epoch
doc_batch, (pq_idx, rg_idx, epoch) = next(batches)
# Encode a whole list of texts at once, prepending bos_token to each document, using multiple threads to speed up encoding
token_lists = tokenizer.encode(doc_batch, prepend=bos_token, num_threads=tokenizer_threads)
for tokens in token_lists:
doc_buffer.append(tokens)
# Pre-allocate buffers once: layout is [inputs (B*T) | targets (B*T)]
# This gives us contiguous views and a single HtoD transfer
# Pre-allocate the CPU and GPU token buffers
use_cuda = device == "cuda"
row_buffer = torch.empty((B, row_capacity), dtype=torch.long) # for building rows without creating Python lists
cpu_buffer = torch.empty(2 * B * T, dtype=torch.long, pin_memory=use_cuda) # staging area (CPU)
gpu_buffer = torch.empty(2 * B * T, dtype=torch.long, device=device) # on-device buffer
cpu_inputs = cpu_buffer[:B * T].view(B, T) # a few views into these buffers just for convenience
row_buffer = torch.empty((B, row_capacity), dtype=torch.long)
cpu_buffer = torch.empty(2 * B * T, dtype=torch.long, pin_memory=use_cuda)
gpu_buffer = torch.empty(2 * B * T, dtype=torch.long, device=device)
cpu_inputs = cpu_buffer[:B * T].view(B, T)
cpu_targets = cpu_buffer[B * T:].view(B, T)
inputs = gpu_buffer[:B * T].view(B, T)
targets = gpu_buffer[B * T:].view(B, T)
@ -122,40 +141,47 @@ def tokenizing_distributed_data_loader_with_state_bos_bestfit(
for row_idx in range(B):
pos = 0
while pos < row_capacity:
# Ensure buffer has documents
# Refill the document buffer up to buffer_size so there are enough documents to choose from
while len(doc_buffer) < buffer_size:
refill_buffer()
# Remaining capacity of the current row, i.e. how many tokens still fit
remaining = row_capacity - pos
# Find largest doc that fits entirely
# Find the largest document that fits entirely in the remaining space
best_idx = -1
best_len = 0
# Also track the shortest document, to crop from when no document fits entirely
shortest_idx = 0
shortest_len = len(doc_buffer[shortest_idx]) if doc_buffer else float('inf')
for i, doc in enumerate(doc_buffer):
doc_len = len(doc)
if doc_len <= remaining and doc_len > best_len:
best_idx = i
best_len = doc_len
if doc_len < shortest_len:
shortest_idx = i
shortest_len = doc_len
# If a document fits entirely, place it in the current row and remove it from the buffer
if best_idx >= 0:
doc = doc_buffer.pop(best_idx)
doc_len = len(doc)
row_buffer[row_idx, pos:pos + doc_len] = torch.tensor(doc, dtype=torch.long)
pos += doc_len
# If every buffered document is longer than the remaining space, crop the shortest one to exactly fill the row and remove it from the buffer
else:
# No doc fits - crop shortest in buffer to fill remaining and minimize waste
shortest_idx = min(range(len(doc_buffer)), key=lambda i: len(doc_buffer[i]))
doc = doc_buffer.pop(shortest_idx)
row_buffer[row_idx, pos:pos + remaining] = torch.tensor(doc[:remaining], dtype=torch.long)
pos += remaining
# Copy to pinned CPU buffer, then single HtoD transfer
# Split the filled row buffer into inputs (the first T tokens) and targets (the last T tokens)
cpu_inputs.copy_(row_buffer[:, :-1])
cpu_targets.copy_(row_buffer[:, 1:])
state_dict = {"pq_idx": pq_idx, "rg_idx": rg_idx, "epoch": epoch}
# Single HtoD copy into persistent GPU buffer and yield
# Copy the CPU buffer into the GPU buffer in a single transfer for training
gpu_buffer.copy_(cpu_buffer, non_blocking=use_cuda)
yield inputs, targets, state_dict
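The best-fit row-filling loop above can be sketched with plain lists; `pack_row` is a hypothetical simplification (no tensors, no refilling) of the same algorithm:

```python
def pack_row(capacity, doc_buffer):
    # Fill one row of `capacity` tokens: repeatedly place the largest
    # buffered document that fits entirely; when none fits, crop the
    # shortest document to exactly fill the remaining space.
    row = []
    while len(row) < capacity:
        remaining = capacity - len(row)
        fitting = [i for i, d in enumerate(doc_buffer) if len(d) <= remaining]
        if fitting:
            i = max(fitting, key=lambda i: len(doc_buffer[i]))  # largest doc that fits
            row.extend(doc_buffer.pop(i))
        else:
            i = min(range(len(doc_buffer)), key=lambda i: len(doc_buffer[i]))  # crop shortest
            row.extend(doc_buffer.pop(i)[:remaining])
    return row
```

Every row comes out exactly full (no padding), which is the "100% utilization" property, at the cost of discarding the cropped tails.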

View File

@ -10,6 +10,7 @@ For details of how the dataset was prepared, see `repackage_data_reference.py`.
import os
import argparse
import time
from typing import Iterator
import requests
import pyarrow.parquet as pq
from multiprocessing import Pool
@ -24,36 +25,54 @@ BASE_URL = "https://huggingface.co/datasets/karpathy/fineweb-edu-100b-shuffle/re
MAX_SHARD = 1822 # the last datashard is shard_01822.parquet
index_to_filename = lambda index: f"shard_{index:05d}.parquet" # format of the filenames
base_dir = get_base_dir()
DATA_DIR = os.path.join(base_dir, "base_data")
DATA_DIR = os.path.join(base_dir, "pretrain_data")
os.makedirs(DATA_DIR, exist_ok=True)
# -----------------------------------------------------------------------------
# These functions are useful utilities to other modules, can/should be imported
def list_parquet_files(data_dir=None):
""" Looks into a data dir and returns full paths to all parquet files. """
def list_parquet_files(data_dir=None) -> list[str]:
"""获取指定目录下的所有 parquet 文件路径,默认使用 DATA_DIR
Args:
data_dir (str, optional): 要搜索的目录路径默认为 None表示使用 DATA_DIR
Returns:
list[str]: 包含所有 parquet 文件路径的列表
"""
data_dir = DATA_DIR if data_dir is None else data_dir
parquet_files = sorted([
f for f in os.listdir(data_dir)
if f.endswith('.parquet') and not f.endswith('.tmp')
])
parquet_paths = [os.path.join(data_dir, f) for f in parquet_files]
assert len(parquet_paths) != 0, f"No dataset parquet files found in {data_dir}, did you run dataset.py?"
return parquet_paths
def parquets_iter_batched(split, start=0, step=1):
"""
Iterate through the dataset, in batches of underlying row_groups for efficiency.
- split can be "train" or "val". the last parquet file will be val.
- start/step are useful for skipping rows in DDP. e.g. start=rank, step=world_size
def parquets_iter_batched(split, start=0, step=1) -> Iterator[list[str]]:
"""按 groups 迭代 parquet 文件中的文本数据,在分词器训练/测试中使用
Args:
split (str): 数据集划分必须是 "train" "val"
start (int, optional): 起始行组索引默认为 0
step (int, optional): 行组步长默认为 1, 表示每个行组都迭代, 用于不同进程迭代不同的行组
Yields:
Iterator[list[str]]: 每次迭代返回一个文本列表[text1, text2, ...]包含当前行组中的所有文本数据
"""
assert split in ["train", "val"], "split must be 'train' or 'val'"
parquet_paths = list_parquet_files()
# "train" 迭代除了最后一个文件以外的所有文件,"val" 只迭代最后一个文件
parquet_paths = parquet_paths[:-1] if split == "train" else parquet_paths[-1:]
for filepath in parquet_paths:
# pf = [
# [text1, text2, ...],
# [textN+1, textN+2, ...],
# ...]
# each parquet file contains multiple row groups; each row group holds a batch of texts
pf = pq.ParquetFile(filepath)
for rg_idx in range(start, pf.num_row_groups, step):
rg = pf.read_row_group(rg_idx)
texts = rg.column('text').to_pylist()
# texts = [textI, textI+1, ...]: the list of texts in the current row group
yield texts
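The `start`/`step` parameters implement a simple strided partition: with `start=rank` and `step=world_size`, each DDP rank reads a disjoint set of row groups, and together the ranks cover all of them. A minimal sketch (the helper `rank_row_groups` is hypothetical, added only to illustrate the striding):

```python
def rank_row_groups(num_row_groups, rank, world_size):
    """Row-group indices that a given rank reads under start=rank, step=world_size."""
    return list(range(rank, num_row_groups, world_size))

# 4 ranks striding over 10 row groups
assert rank_row_groups(10, 0, 4) == [0, 4, 8]
assert rank_row_groups(10, 3, 4) == [3, 7]
# every row group is read by exactly one rank
covered = sorted(i for r in range(4) for i in rank_row_groups(10, r, 4))
assert covered == list(range(10))
```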
# -----------------------------------------------------------------------------


@ -3,7 +3,7 @@ name = "nanochat"
version = "0.1.0"
description = "the minimal full-stack ChatGPT clone"
readme = "README.md"
requires-python = ">=3.10"
requires-python = ">=3.12"
dependencies = [
"datasets>=4.0.0",
"fastapi>=0.117.1",
@ -47,14 +47,23 @@ torch = [
{ index = "pytorch-cu128", extra = "gpu" },
]
[[tool.uv.index]]
name = "tuna"
url = "https://pypi.tuna.tsinghua.edu.cn/simple"
default = true # mark as the default index
[[tool.uv.index]]
name = "pytorch-cpu"
url = "https://download.pytorch.org/whl/cpu"
url = "https://mirrors.nju.edu.cn/pytorch/whl/cpu"
# url = "https://pypi.tuna.tsinghua.edu.cn/simple"
# url = "https://download.pytorch.org/whl/cpu"
explicit = true
[[tool.uv.index]]
name = "pytorch-cu128"
url = "https://download.pytorch.org/whl/cu128"
url = "https://mirrors.nju.edu.cn/pytorch/whl/cu128"
# url = "https://pypi.tuna.tsinghua.edu.cn/simple"
# url = "https://download.pytorch.org/whl/cu128"
explicit = true
[project.optional-dependencies]

runs/speedrundiy.sh Executable file

@ -0,0 +1,80 @@
#!/bin/bash
# -----------------------------------------------------------------------------
# Variable setup
NPROC_PER_NODE=1
NANOCHAT_BASE_DIR="/media/data/liujiang/data/datasets/nanochat_base_dir"
export OMP_NUM_THREADS=1
export NANOCHAT_BASE_DIR="${NANOCHAT_BASE_DIR}"
export HF_ENDPOINT=https://hf-mirror.com
export HF_HUB_ENABLE_HF_HUB_CACHE=true
export HF_HUB_CACHE="${NANOCHAT_BASE_DIR}/hf_hub_cache"
export HF_DATASETS_CACHE="${NANOCHAT_BASE_DIR}/hf_datasets_cache"
# -----------------------------------------------------------------------------
# Environment setup
# install uv (if not already installed)
command -v uv &> /dev/null || curl -LsSf https://astral.sh/uv/install.sh | sh
# create a .venv local virtual environment (if it doesn't exist)
[ -d ".venv" ] || uv venv
# install the repo dependencies
uv sync --extra gpu
# activate venv so that `python` uses the project's venv instead of system python
source .venv/bin/activate
# -----------------------------------------------------------------------------
# Logging setup
# To log with wandb:
# 1) set `WANDB_API_KEY` in the environment
# 2) run `wandb login` locally first
# 3) set the WANDB_RUN environment variable when running the script, e.g.: `WANDB_RUN=d26 bash speedrun.sh`
if [ -z "$WANDB_RUN" ]; then
# 默认使用 "dummy":这是一个特殊情况,会跳过 wandb 日志记录
WANDB_RUN=dummy
else
# If WANDB_RUN is set and is not "dummy", run in online mode
if [ "$WANDB_RUN" != "dummy" ]; then
# Check that an API key was provided
if [ -z "$WANDB_API_KEY" ]; then
echo "Error: WANDB_RUN=$WANDB_RUN requests wandb online mode, but WANDB_API_KEY is not set"
exit 1
fi
fi
fi
# -----------------------------------------------------------------------------
# Reset the report
python -m nanochat.report reset
# -----------------------------------------------------------------------------
# Tokenizer training
python -m nanochat.dataset -n 8
python -m nanochat.dataset -n 370 &
DATASET_DOWNLOAD_PID=$!
python -m scripts.tok_train
python -m scripts.tok_eval
# -----------------------------------------------------------------------------
# Wait for the pretraining data download to finish, then run pretraining
echo "Waiting for dataset download to complete..."
wait $DATASET_DOWNLOAD_PID
torchrun --standalone --nproc_per_node=$NPROC_PER_NODE -m scripts.base_train -- --depth=18 --target-param-data-ratio=8.25 --device-batch-size=1 --run=$WANDB_RUN
torchrun --standalone --nproc_per_node=$NPROC_PER_NODE -m scripts.base_eval -- --device-batch-size=1
# -----------------------------------------------------------------------------
# Download the instruction-finetuning dataset, then run SFT and evaluation
curl -L -o $NANOCHAT_BASE_DIR/identity_conversations.jsonl https://karpathy-public.s3.us-west-2.amazonaws.com/identity_conversations.jsonl
torchrun --standalone --nproc_per_node=$NPROC_PER_NODE -m scripts.chat_sft -- --device-batch-size=1 --run=$WANDB_RUN
torchrun --standalone --nproc_per_node=$NPROC_PER_NODE -m scripts.chat_eval -- -i sft
# CLI chat test
python -m scripts.chat_cli -p "Why is the sky blue?"
# Web chat test
python -m scripts.chat_web
# -----------------------------------------------------------------------------
# Generate the report
python -m nanochat.report generate


@ -24,6 +24,8 @@ import csv
import time
import json
import yaml
import math
from typing import Optional, Literal
import shutil
import random
import zipfile
@ -95,7 +97,12 @@ EVAL_BUNDLE_URL = "https://karpathy-public.s3.us-west-2.amazonaws.com/eval_bundl
def place_eval_bundle(file_path):
"""Unzip eval_bundle.zip and place it in the base directory."""
"""将 eval_bundle.zip 解压到 base_dir/eval_bundle 目录下
Args:
file_path: eval_bundle.zip 文件路径
Returns:
None
"""
base_dir = get_base_dir()
eval_bundle_dir = os.path.join(base_dir, "eval_bundle")
with tempfile.TemporaryDirectory() as tmpdir:
@ -105,18 +112,33 @@ def place_eval_bundle(file_path):
shutil.move(extracted_bundle_dir, eval_bundle_dir)
print0(f"Placed eval_bundle directory at {eval_bundle_dir}")
scale_map = {
"small": 0.1,
"medium": 0.4,
"large": 0.7
}
def evaluate_core(model, tokenizer, device, scale: Optional[Literal["small", "medium", "large"]] = None, max_per_task=-1):
"""在 CORE 评测上评测模型, 返回每个任务的 accuracy 和 centered result, 以及最终的 CORE metric
Args:
- model: 评测模型
- tokenizer: 模型对应的 tokenizer
- device: 评测设备
- scale: 可选的评测规模, small/medium/large, 分别使用 10%/40%/70% 的任务进行评测 (默认: 100%)
- max_per_task: 每个任务的最大题目数量, -1 表示使用所有题目 (默认: -1)
def evaluate_core(model, tokenizer, device, max_per_task=-1):
"""
Evaluate a base model on the CORE benchmark.
Returns dict with results, centered_results, and core_metric.
Returns:
- dict with the following fields:
- "results": per-task accuracy, as {task_label: accuracy}
- "centered_results": per-task centered result, as {task_label: centered_result}
- "core_metric": the mean of the centered results over all tasks, i.e. the final CORE metric
"""
base_dir = get_base_dir()
eval_bundle_dir = os.path.join(base_dir, "eval_bundle")
# Download the eval bundle if needed
# Download the tasks if needed
if not os.path.exists(eval_bundle_dir):
download_file_with_lock(EVAL_BUNDLE_URL, "eval_bundle.zip", postprocess_fn=place_eval_bundle)
# Load config and data
config_path = os.path.join(eval_bundle_dir, "core.yaml")
data_base_path = os.path.join(eval_bundle_dir, "eval_data")
eval_meta_data = os.path.join(eval_bundle_dir, "eval_meta_data.csv")
@ -125,7 +147,7 @@ def evaluate_core(model, tokenizer, device, max_per_task=-1):
config = yaml.safe_load(f)
tasks = config['icl_tasks']
# Load random baseline values
# Load each task's random baseline, e.g. 25.0 for a 4-choice task, 50.0 for true/false
random_baselines = {}
with open(eval_meta_data, 'r', encoding='utf-8') as f:
reader = csv.DictReader(f)
@ -134,11 +156,22 @@ def evaluate_core(model, tokenizer, device, max_per_task=-1):
random_baseline = row['Random baseline']
random_baselines[task_name] = float(random_baseline)
# Evaluate each task
# Evaluate each task: compute accuracy and the centered result, then the final CORE metric
results = {}
centered_results = {}
# Optionally limit the number of tasks
if scale is not None and len(tasks) > 1:
if scale not in scale_map:
raise ValueError(f"Invalid scale: {scale}. Must be one of {list(scale_map.keys())}")
num_tasks = math.ceil(len(tasks) * scale_map[scale])
print0(f"Scaling CORE evaluation: {scale} -> using {num_tasks} tasks (out of {len(tasks)})")
tasks = tasks[:num_tasks]
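The trimming above maps each named scale to a fraction of the task list, rounding up with `math.ceil` so at least one task always survives. A standalone sketch of that arithmetic (the 22-task count is only an illustrative assumption, not a claim about `core.yaml`):

```python
import math

scale_map = {"small": 0.1, "medium": 0.4, "large": 0.7}

def num_scaled_tasks(total, scale):
    """Number of tasks kept for a given named scale (rounded up)."""
    return math.ceil(total * scale_map[scale])

# with a hypothetical 22-task suite:
assert num_scaled_tasks(22, "small") == 3   # ceil(2.2)
assert num_scaled_tasks(22, "medium") == 9  # ceil(8.8)
assert num_scaled_tasks(22, "large") == 16  # ceil(15.4)
```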
# Run the evaluation
for task in tasks:
start_time = time.time()
# Build the task metadata
label = task['label']
task_meta = {
'task_type': task['icl_task_type'],
@ -148,16 +181,18 @@ def evaluate_core(model, tokenizer, device, max_per_task=-1):
}
print0(f"Evaluating: {label} ({task_meta['num_fewshot']}-shot, type: {task_meta['task_type']})... ", end='')
# Load the task data
data_path = os.path.join(data_base_path, task_meta['dataset_uri'])
with open(data_path, 'r', encoding='utf-8') as f:
data = [json.loads(line.strip()) for line in f]
# Shuffle for consistent subsampling when using max_per_task
# Limit the number of examples per task, with a random but reproducible order
shuffle_rng = random.Random(1337)
shuffle_rng.shuffle(data)
if max_per_task > 0:
data = data[:max_per_task]
# Evaluate the task and record the result
accuracy = evaluate_task(model, tokenizer, data, device, task_meta)
results[label] = accuracy
random_baseline = random_baselines[label]
@ -166,6 +201,7 @@ def evaluate_core(model, tokenizer, device, max_per_task=-1):
elapsed = time.time() - start_time
print0(f"accuracy: {accuracy:.4f} | centered: {centered_result:.4f} | time: {elapsed:.2f}s")
# Compute the CORE metric: the mean of the centered results over all tasks
core_metric = sum(centered_results.values()) / len(centered_results)
out = {
"results": results,
@ -185,23 +221,23 @@ def main():
parser.add_argument('--step', type=int, default=None, help='Model step to load (default = last)')
parser.add_argument('--max-per-task', type=int, default=-1, help='Max examples per CORE task (-1 = all)')
parser.add_argument('--device-batch-size', type=int, default=32, help='Per-device batch size for BPB evaluation')
parser.add_argument('--split-tokens', type=int, default=40*524288, help='Number of tokens to evaluate per split for BPB')
parser.add_argument('--split-tokens', type=int, default=4*524288, help='Number of tokens to evaluate per split for BPB')
parser.add_argument('--device-type', type=str, default='', help='cuda|cpu|mps (empty = autodetect)')
args = parser.parse_args()
# Parse evaluation modes
# Validate the eval modes (core, bpb, sample)
eval_modes = set(mode.strip() for mode in args.eval.split(','))
valid_modes = {'core', 'bpb', 'sample'}
invalid = eval_modes - valid_modes
if invalid:
parser.error(f"Invalid eval modes: {invalid}. Valid: {valid_modes}")
# Distributed / precision setup
# Initialize the evaluation environment
device_type = autodetect_device_type() if args.device_type == '' else args.device_type
ddp, ddp_rank, ddp_local_rank, ddp_world_size, device = compute_init(device_type)
autocast_ctx = torch.amp.autocast(device_type=device_type, dtype=torch.bfloat16) if device_type == "cuda" else nullcontext()
# Load model and tokenizer
# Load the model and tokenizer
is_hf_model = args.hf_path is not None
if is_hf_model:
model, tokenizer = load_hf_model(args.hf_path, device)
@ -225,7 +261,7 @@ def main():
samples = []
unconditioned_samples = []
# --- Sampling ---
# Sampling evaluation
if 'sample' in eval_modes and not is_hf_model:
print0("\n" + "="*80)
print0("Model Samples")
@ -263,7 +299,7 @@ def main():
elif 'sample' in eval_modes and is_hf_model:
print0("\nSkipping sampling for HuggingFace models (not supported)")
# --- BPB evaluation ---
# BPB evaluation
if 'bpb' in eval_modes:
print0("\n" + "="*80)
print0("BPB Evaluation")
@ -282,13 +318,13 @@ def main():
bpb_results[split_name] = bpb
print0(f"{split_name} bpb: {bpb:.6f}")
# --- CORE evaluation ---
# CORE evaluation
if 'core' in eval_modes:
print0("\n" + "="*80)
print0("CORE Evaluation")
print0("="*80)
with autocast_ctx:
core_results = evaluate_core(model, tokenizer, device, max_per_task=args.max_per_task)
core_results = evaluate_core(model, tokenizer, device, scale="small", max_per_task=args.max_per_task)
# Write CSV output
if ddp_rank == 0:
@ -305,7 +341,7 @@ def main():
print0(f"\nResults written to: {output_csv_path}")
print0(f"CORE metric: {core_results['core_metric']:.4f}")
# --- Log to report ---
# Log to report
from nanochat.report import get_report
report_data = [{"model": model_name}]


@ -71,7 +71,7 @@ parser.add_argument("--final-lr-frac", type=float, default=0.0, help="final LR a
parser.add_argument("--resume-from-step", type=int, default=-1, help="resume training from this step (-1 = disable)")
# Evaluation
parser.add_argument("--eval-every", type=int, default=250, help="evaluate val bpb every N steps (-1 = disable)")
parser.add_argument("--eval-tokens", type=int, default=40*524288, help="number of tokens to evaluate val loss on")
parser.add_argument("--eval-tokens", type=int, default=4*524288, help="number of tokens to evaluate val loss on")
parser.add_argument("--core-metric-every", type=int, default=2000, help="evaluate CORE metric every N steps (-1 = disable)")
parser.add_argument("--core-metric-max-per-task", type=int, default=500, help="examples per task for CORE metric")
parser.add_argument("--sample-every", type=int, default=2000, help="sample from model every N steps (-1 = disable)")
@ -425,7 +425,7 @@ while True:
if args.core_metric_every > 0 and (last_step or (step > 0 and step % args.core_metric_every == 0)):
model.eval()
with disable_fp8(orig_model), autocast_ctx:
results = evaluate_core(orig_model, tokenizer, device, max_per_task=args.core_metric_max_per_task)
results = evaluate_core(orig_model, tokenizer, device, scale="small", max_per_task=args.core_metric_max_per_task)
print0(f"Step {step:05d} | CORE metric: {results['core_metric']:.4f}")
wandb_run.log({
"step": step,
@ -456,6 +456,7 @@ while True:
print0(tokenizer.decode(sample[0]))
model.train()
last_step = step == 10
# save checkpoint: at the end of the run, or every save_every steps, except at the first step or the resume step
if last_step or (step > 0 and step != args.resume_from_step and args.save_every > 0 and step % args.save_every == 0):
save_checkpoint(


@ -60,7 +60,7 @@ parser.add_argument("--warmdown-ratio", type=float, default=0.5, help="ratio of
parser.add_argument("--final-lr-frac", type=float, default=0.0, help="final LR as fraction of initial LR")
# Evaluation
parser.add_argument("--eval-every", type=int, default=200, help="evaluate val bpb every N steps (-1 = disable)")
parser.add_argument("--eval-tokens", type=int, default=40*524288, help="number of tokens to evaluate val loss on")
parser.add_argument("--eval-tokens", type=int, default=4*524288, help="number of tokens to evaluate val loss on")
parser.add_argument("--chatcore-every", type=int, default=200, help="evaluate ChatCORE metric every N steps (-1 = disable)")
parser.add_argument("--chatcore-max-cat", type=int, default=-1, help="max problems per categorical task for ChatCORE")
parser.add_argument("--chatcore-max-sample", type=int, default=24, help="max problems per generative task for ChatCORE")
@ -393,6 +393,7 @@ while True:
})
model.train()
last_step = step == 10
# save checkpoint at the end of the run (all ranks participate so each saves its optimizer shard)
if last_step:
output_dirname = args.model_tag if args.model_tag else f"d{depth}" # e.g. d12
@ -435,6 +436,13 @@ while True:
loss.backward()
x, y = next(train_loader) # prefetch the next batch while the GPU is busy with forward/backward
progress = max(progress, approx_progress) # only increase progress monotonically
# Gradient clipping; skip the optimizer step if the gradient norm is non-finite
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
if torch.isnan(total_norm) or torch.isinf(total_norm):
print0(f"WARNING: gradient norm is {total_norm}, skipping step")
model.zero_grad(set_to_none=True)
continue
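The clip-and-skip guard above can be illustrated without torch. This is a simplified global-norm clipping sketch; `clip_by_global_norm` and the toy gradients are hypothetical stand-ins for `torch.nn.utils.clip_grad_norm_`, which clips in place and returns the total norm:

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    """Scale a list of gradient vectors so their global L2 norm is at most max_norm.
    Returns (clipped_grads, total_norm); a non-finite norm signals the step should be skipped."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    if not math.isfinite(total_norm):
        # caller should zero the grads and skip the optimizer step
        return None, total_norm
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [[g * scale for g in vec] for vec in grads], total_norm

clipped, norm = clip_by_global_norm([[3.0], [4.0]], max_norm=1.0)
assert abs(norm - 5.0) < 1e-9
assert math.sqrt(sum(g * g for vec in clipped for g in vec)) <= 1.0
skipped, bad_norm = clip_by_global_norm([[float("nan")]])
assert skipped is None
```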
# step the optimizer
lrm = get_lr_multiplier(progress)
muon_momentum = get_muon_momentum(step)


@ -4,6 +4,7 @@ In the style of GPT-4 tokenizer.
"""
import os
import time
from typing import Iterator, Tuple
import argparse
import torch
from nanochat.tokenizer import RustBPETokenizer
@ -16,7 +17,7 @@ from nanochat.dataset import parquets_iter_batched
parser = argparse.ArgumentParser(description='Train a BPE tokenizer')
parser.add_argument('--max-chars', type=int, default=2_000_000_000, help='Maximum characters to train on (default: 2B)')
parser.add_argument('--doc-cap', type=int, default=10_000, help='Maximum characters per document (default: 10,000)')
parser.add_argument('--vocab-size', type=int, default=32768, help='Vocabulary size (default: 32768 = 2^15)')
parser.add_argument('--vocab-size', type=int, default=2**16, help='Vocabulary size (default: 65536 = 2^16)')
args = parser.parse_args()
print(f"max_chars: {args.max_chars:,}")
print(f"doc_cap: {args.doc_cap:,}")
@ -25,32 +26,50 @@ print(f"vocab_size: {args.vocab_size:,}")
# -----------------------------------------------------------------------------
# Text iterator
def text_iterator():
"""
1) Flatten the batches into a single iterator
2) Crop every document to args.doc_cap characters
3) Break when we've seen args.max_chars characters
def text_iterator() -> Iterator[str]:
"""文档文本迭代器
Args:
None
Yields:
str: 文档文本
"""
nchars = 0
for batch in parquets_iter_batched(split="train"):
for doc in batch:
doc_text = doc
# Crop the document if it exceeds the configured cap
if len(doc_text) > args.doc_cap:
doc_text = doc_text[:args.doc_cap]
nchars += len(doc_text)
yield doc_text
# Stop once the configured character budget has been consumed
if nchars > args.max_chars:
return
text_iter = text_iterator()
# -----------------------------------------------------------------------------
# Train the tokenizer
t0 = time.time()
start = time.time()
tokenizer = RustBPETokenizer.train_from_iterator(text_iter, args.vocab_size)
t1 = time.time()
train_time = t1 - t0
end = time.time()
train_time = end - start
print(f"Training time: {train_time:.2f}s")
def train(iterator: Iterator[str], vocab_size: int) -> Tuple[RustBPETokenizer, float]:
"""训练BPE分词器
Args:
iterator (Iterator[str]): 文本迭代器
vocab_size (int): 词表大小
Returns:
Tuple[RustBPETokenizer, float]: 训练好的分词器和训练时间
"""
start = time.time()
tokenizer = RustBPETokenizer.train_from_iterator(iterator, vocab_size)
end = time.time()
train_time = end - start
return tokenizer, train_time
# -----------------------------------------------------------------------------
# Save the tokenizer to disk
base_dir = get_base_dir()
@ -68,6 +87,23 @@ encoded = tokenizer.encode(test_text)
decoded = tokenizer.decode(encoded)
assert decoded == test_text
def sanity_check(tokenizer: RustBPETokenizer):
"""对分词器进行快速的内联检查,确保编码和解码的一致性
Args:
tokenizer (RustBPETokenizer): 需要检查的分词器
Raises:
AssertionError: 如果编码和解码不一致则抛出断言错误
"""
test_text = """Hello world! This is a test.
Numbers: 123, 4567, 89
Contractions: I'm, you're, it's
Special chars: @#$%^&*()
Unicode: 你好世界 🌍"""
encoded = tokenizer.encode(test_text)
decoded = tokenizer.decode(encoded)
assert decoded == test_text
# -----------------------------------------------------------------------------
# One more thing: we wish to cache a mapping from token id to number of bytes of that token
# for efficient evaluation of bits per byte. Unlike the typical mean loss, this
@ -90,6 +126,32 @@ with open(token_bytes_path, "wb") as f:
torch.save(token_bytes, f)
print(f"Saved token_bytes to {token_bytes_path}")
def generate_token_bytes(tokenizer: RustBPETokenizer, save_path: str) -> torch.Tensor:
"""生成一个张量表示每个token id对应的字节数并保存到磁盘
Args:
tokenizer (RustBPETokenizer): 已训练好的分词器
save_path (str): 保存token_bytes张量的路径
Returns:
torch.Tensor: 包含每个token id对应的字节数的张量
"""
vocab_size = tokenizer.get_vocab_size()
special_set = set(tokenizer.get_special_tokens())
token_strings = [tokenizer.decode([token_id]) for token_id in range(vocab_size)]
token_bytes = []
for token_id in range(vocab_size):
token_str = token_strings[token_id] # the Python string representation of this token
if token_str in special_set:
token_bytes.append(0) # special characters are not counted
else:
id_bytes = len(token_str.encode("utf-8")) # number of bytes that make up this token
token_bytes.append(id_bytes)
token_bytes = torch.tensor(token_bytes, dtype=torch.int32, device='cpu')
with open(save_path, "wb") as f:
torch.save(token_bytes, f)
print(f"Saved token_bytes to {save_path}")
return token_bytes
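The reason for caching per-token byte counts is bits-per-byte evaluation: summed per-token negative log-likelihoods (in nats) are converted to bits and divided by the number of bytes the tokens decode to, so the metric is independent of the tokenizer's vocabulary size. A simplified sketch of that conversion, excluding zero-byte special tokens; `bits_per_byte` and the toy numbers are hypothetical:

```python
import math

def bits_per_byte(nlls_nats, token_ids, token_bytes):
    """Convert summed per-token negative log-likelihoods (nats) into bits per byte.
    Tokens whose token_bytes entry is 0 (special tokens) are excluded entirely."""
    total_bytes = 0
    total_nats = 0.0
    for nll, t in zip(nlls_nats, token_ids):
        if token_bytes[t] == 0:
            continue  # special tokens carry no bytes; skip them
        total_bytes += token_bytes[t]
        total_nats += nll
    return total_nats / math.log(2) / total_bytes

# toy example: two tokens of 3 and 5 bytes, 2 nats of loss each
bpb = bits_per_byte([2.0, 2.0], [0, 1], [3, 5])
assert abs(bpb - 4.0 / math.log(2) / 8) < 1e-12
```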
# Log to report
from nanochat.report import get_report
token_bytes_nonzero = (token_bytes[token_bytes > 0]).to(dtype=torch.float32)
@ -104,3 +166,38 @@ get_report().log(section="Tokenizer training", data=[
"token_bytes_std": token_bytes_nonzero.std().item(),
}
])
def log_tokenizer_training(args: argparse.Namespace, train_time: float, tokenizer: RustBPETokenizer):
"""记录分词器训练的相关信息到报告中
Args:
args (argparse.Namespace): 命令行参数
train_time (float): 训练时间
tokenizer (RustBPETokenizer): 已训练好的分词器
"""
# 计算token_bytes统计信息
vocab_size = tokenizer.get_vocab_size()
special_set = set(tokenizer.get_special_tokens())
token_strings = [tokenizer.decode([token_id]) for token_id in range(vocab_size)]
token_bytes = []
for token_id in range(vocab_size):
token_str = token_strings[token_id] # the Python string representation of this token
if token_str in special_set:
token_bytes.append(0) # special characters are not counted
else:
id_bytes = len(token_str.encode("utf-8")) # number of bytes that make up this token
token_bytes.append(id_bytes)
token_bytes = torch.tensor(token_bytes, dtype=torch.int32, device='cpu')
token_bytes_nonzero = (token_bytes[token_bytes > 0]).to(dtype=torch.float32)
# Log to report
get_report().log(section="Tokenizer training", data=[
vars(args), # argparse command line arguments
{"train_time": train_time},
{"num_special_tokens": len(special_set)},
{
"token_bytes_min": int(token_bytes_nonzero.min().item()),
"token_bytes_max": int(token_bytes_nonzero.max().item()),
"token_bytes_mean": token_bytes_nonzero.mean().item(),
"token_bytes_std": token_bytes_nonzero.std().item(),
}
])

uv.lock

File diff suppressed because it is too large