mirror of https://github.com/karpathy/nanochat.git
synced 2026-03-27 07:05:15 +00:00

commit c4d3727ba8 (parent b07604ebaa)
update: speedrundiy.sh pipeline now runs end to end
.env (new file, +6)

@@ -0,0 +1,6 @@
+NANOCHAT_BASE_DIR="/media/data/liujiang/data/datasets/nanochat_base_dir"
+HF_ENDPOINT=https://hf-mirror.com
+HF_HUB_ENABLE_HF_HUB_CACHE=true
+HF_HUB_CACHE=$NANOCHAT_BASE_DIR/hf_hub_cache
+HF_DATASETS_CACHE=$NANOCHAT_BASE_DIR/hf_datasets_cache
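Note that the variables above reference each other (`HF_HUB_CACHE=$NANOCHAT_BASE_DIR/hf_hub_cache`), which only works if whatever loads the file interpolates `$VAR`. A minimal sketch of loading such a file into a shell session; the file path and values here are hypothetical, not the ones from this commit:

```shell
# Write a hypothetical .env with an interpolated variable, then source it.
cat > /tmp/demo.env <<'EOF'
NANOCHAT_BASE_DIR="/tmp/nanochat_base_dir"
HF_HUB_CACHE=$NANOCHAT_BASE_DIR/hf_hub_cache
EOF

set -a                 # export every variable assigned while sourcing
. /tmp/demo.env        # the shell expands $NANOCHAT_BASE_DIR at this point
set +a

echo "$HF_HUB_CACHE"   # prints /tmp/nanochat_base_dir/hf_hub_cache
```

python-dotenv (which this commit uses via `load_dotenv()`) documents POSIX-style `${VAR}` interpolation; whether a bare `$VAR` like the one in this `.env` expands depends on the loader.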
.gitignore (vendored, 2 changed lines)

@@ -6,7 +6,7 @@ report.md
 eval_bundle/

 # Secrets
-.env
+# .env

 # Local setup
 CLAUDE.md
(file name not shown in this view)

@@ -1 +1 @@
-3.10
+3.12
README.md (193 changed lines)

@@ -1,182 +1,11 @@
# nanochat

nanochat is the simplest experimental harness for training LLMs. It is designed to run on a single GPU node, the code is minimal and hackable, and it covers all major LLM stages: tokenization, pretraining, finetuning, evaluation, inference, and a chat UI. For example, you can train your own GPT-2 capability LLM (which cost ~$43,000 to train in 2019) for only $72 (~3 hours on an 8XH100 GPU node) and then talk to it in a familiar ChatGPT-like web UI. On a spot instance, the total cost can be closer to ~$20. More generally, nanochat is configured out of the box to train an entire miniseries of compute-optimal models by setting one single complexity dial: `--depth`, the number of layers in the GPT transformer model (GPT-2 capability happens to be approximately depth 26). All other hyperparameters (the width of the transformer, number of heads, learning rate adjustments, training horizons, weight decays, ...) are calculated automatically in an optimal way.

For questions about the repo, I recommend [DeepWiki](https://deepwiki.com/karpathy/nanochat) from Devin/Cognition to ask questions about the repo, the [Discussions tab](https://github.com/karpathy/nanochat/discussions), or the [#nanochat](https://discord.com/channels/1020383067459821711/1427295580895314031) channel on Discord.
## Time-to-GPT-2 Leaderboard

Presently, the main focus of development is on tuning the pretraining stage, which consumes the most compute. Inspired by the modded-nanogpt repo, and to incentivize progress and community collaboration, nanochat maintains a leaderboard for a "GPT-2 speedrun": the wall-clock time required to train a nanochat model to GPT-2 grade capability, as measured by the DCLM CORE score. The [runs/speedrun.sh](runs/speedrun.sh) script always reflects the reference way to train a GPT-2 grade model and talk to it. The current leaderboard looks as follows:

| # | time (hours) | val_bpb | CORE | Description | Date | Commit | Contributors |
|---|--------------|---------|------|-------------|------|--------|--------------|
| 0 | 168 | - | 0.2565 | Original OpenAI GPT-2 checkpoint | 2019 | - | OpenAI |
| 1 | 3.04 | 0.74833 | 0.2585 | d24 baseline, slightly overtrained | Jan 29 2026 | 348fbb3 | @karpathy |
| 2 | 2.91 | 0.74504 | 0.2578 | d26 slightly undertrained **+fp8** | Feb 2 2026 | a67eba3 | @karpathy |
| 3 | 2.76 | 0.74645 | 0.2602 | bump total batch size to 1M tokens | Feb 5 2026 | 2c062aa | @karpathy |

The primary metric we care about is "time to GPT-2": the wall-clock time needed to outperform the GPT-2 (1.6B) CORE metric on an 8XH100 GPU node. The GPT-2 CORE score is 0.256525. In 2019, training GPT-2 cost approximately $43,000, so it is incredible that, thanks to many advances across the stack over 7 years, we can now do the same so much faster and for well below $100 (e.g. at the current ~$3/GPU/hr, an 8XH100 node is ~$24/hr, so 3 hours is ~$72).

See [dev/LEADERBOARD.md](dev/LEADERBOARD.md) for more docs on how to interpret and contribute to the leaderboard.
## Getting started

### Reproduce and talk to GPT-2

The most fun you can have is to train your own GPT-2 and talk to it. The entire pipeline to do so is contained in the single file [runs/speedrun.sh](runs/speedrun.sh), which is designed to be run on an 8XH100 GPU node. Boot up a new 8XH100 GPU box from your favorite provider (e.g. I use and like [Lambda](https://lambda.ai/service/gpu-cloud)), and kick off the training script:

```bash
bash runs/speedrun.sh
```

You may wish to do so in a screen session, as this will take ~3 hours to run. Once it's done, you can talk to the model via the ChatGPT-like web UI. Make sure again that your local uv virtual environment is active (run `source .venv/bin/activate`), and serve it:

```bash
python -m scripts.chat_web
```

Then visit the URL shown. Make sure to access it correctly, e.g. on Lambda use the public IP of the node you're on, followed by the port, for example [http://209.20.xxx.xxx:8000/](http://209.20.xxx.xxx:8000/). Then talk to your LLM as you'd normally talk to ChatGPT! Get it to write stories or poems. Ask it to tell you who you are to see a hallucination. Ask it why the sky is blue. Or why it's green. The speedrun is a 4e19 FLOPs capability model, so it's a bit like talking to a kindergartener :).
---

<img width="2672" height="1520" alt="image" src="https://github.com/user-attachments/assets/ed39ddf8-2370-437a-bedc-0f39781e76b5" />

---

A few more notes:

- The code will run just fine on an Ampere 8XA100 GPU node as well, just a bit slower.
- All code will run just fine on even a single GPU by omitting `torchrun`, and will produce ~identical results (the code automatically switches to gradient accumulation), but you'll have to wait 8 times longer.
- If your GPU(s) have less than 80GB of VRAM, you'll have to tune some of the hyperparameters or you will OOM. Look for `--device_batch_size` in the scripts and reduce it until things fit, e.g. from 32 (default) to 16, 8, 4, 2, or even 1. Below that, you'll need to know a bit more about what you're doing and get more creative.
- Most of the code is fairly vanilla PyTorch, so it should run on anything that supports it (xpu, mps, etc.), but I haven't personally exercised all of these code paths, so there might be sharp edges.
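The gradient-accumulation equivalence mentioned in the notes above (one large batch vs. several micro-batches) can be checked without any framework. A toy sketch with made-up numbers, for a mean-squared-error loss on a one-parameter linear model:

```python
# Gradient accumulation sketch: for loss = mean((w*x - y)^2), the full-batch
# gradient equals the average of the micro-batch gradients.

def grad_mse(w, xs, ys):
    """d/dw of mean((w*x - y)^2) over the batch."""
    n = len(xs)
    return sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / n

w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

full = grad_mse(w, xs, ys)

# Same data split into two micro-batches; accumulate (average) their gradients.
g1 = grad_mse(w, xs[:2], ys[:2])
g2 = grad_mse(w, xs[2:], ys[2:])
accumulated = (g1 + g2) / 2

print(full, accumulated)  # identical up to floating point
```

This is why a single GPU with gradient accumulation reproduces the 8-GPU result, only 8x slower.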
## Research

If you are a researcher and wish to help improve nanochat, two scripts of interest are [runs/scaling_laws.sh](runs/scaling_laws.sh) and [runs/miniseries.sh](runs/miniseries.sh). See [Jan 7 miniseries v1](https://github.com/karpathy/nanochat/discussions/420) for related documentation. For quick experimentation (~5 min pretraining runs) my favorite scale is a 12-layer model (GPT-1 sized), trained e.g. like this:

```bash
OMP_NUM_THREADS=1 torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- \
    --depth=12 \
    --run="d12" \
    --model-tag="d12" \
    --core-metric-every=999999 \
    --sample-every=-1 \
    --save-every=-1
```

This uses wandb (run name "d12"), only runs the CORE metric on the last step, and doesn't sample or save intermediate checkpoints. I like to change something in the code, re-run a d12 (or a d16, etc.) and see if it helped, in an iteration loop. To see if a run helps, I like to monitor the wandb plots for:

1. `val_bpb` (validation loss in vocab-size-invariant units of bits per byte) as a function of `step`, `total_training_time` and `total_training_flops`.
2. `core_metric` (the DCLM CORE score)
3. VRAM utilization, `train/mfu` (model FLOPs utilization), `train/tok_per_sec` (training throughput)

See an example [here](https://github.com/karpathy/nanochat/pull/498#issuecomment-3850720044).

The important thing to note is that nanochat is written and configured around one single dial of complexity: the depth of the transformer. This single integer automatically determines all other hyperparameters (the width of the transformer, number of heads, learning rate adjustments, training horizons, weight decays, ...) so that the trained model comes out compute-optimal. The idea is that the user doesn't have to think about or set any of this; they simply ask for a smaller or bigger model using `--depth`, and everything "just works". By sweeping out the depth, you obtain the nanochat miniseries of compute-optimal models at various sizes. A GPT-2 capability model (which is of most interest at the moment) happens to land somewhere in the d24-d26 range with the current code. But any candidate changes to the repo have to be principled enough that they work for all settings of depth.
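As a concrete illustration of what a single depth dial looks like, hyperparameters can be derived from `depth` roughly as below. The specific ratios (64 channels per layer, head dimension 128, the 12·L·d² parameter estimate, a 20-tokens-per-parameter horizon) are illustrative assumptions, not nanochat's exact formulas:

```python
# Hypothetical derivation of model hyperparameters from a single `depth` dial.
# All ratios here are illustrative assumptions, not nanochat's actual code.

def config_from_depth(depth: int) -> dict:
    model_dim = depth * 64                   # width scales linearly with depth
    n_heads = max(1, model_dim // 128)       # fixed head dimension of 128
    n_params = 12 * depth * model_dim ** 2   # standard transformer param estimate
    train_tokens = 20 * n_params             # Chinchilla-style compute-optimal horizon
    return {
        "depth": depth,
        "model_dim": model_dim,
        "n_heads": n_heads,
        "n_params": n_params,
        "train_tokens": train_tokens,
    }

cfg = config_from_depth(12)
print(cfg)  # a d12 under these assumptions: dim 768, 6 heads, ~85M params
```

The point is the shape of the API: one integer in, a full consistent configuration out, so that every `--depth` lands on (approximately) the compute-optimal frontier.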
## Running on CPU / MPS

The script [runs/runcpu.sh](runs/runcpu.sh) shows a very simple example of running on CPU or Apple Silicon. It dramatically shrinks the LLM being trained so that things fit into a reasonable time interval of a few tens of minutes of training. You will not get strong results this way.

## Guides

I've published a number of guides that might contain helpful information, most recent first:

- [Feb 1 2026: Beating GPT-2 for <<$100: the nanochat journey](https://github.com/karpathy/nanochat/discussions/481)
- [Jan 7 miniseries v1](https://github.com/karpathy/nanochat/discussions/420) documents the first nanochat miniseries of models.
- To add new abilities to nanochat, see [Guide: counting r in strawberry (and how to add abilities generally)](https://github.com/karpathy/nanochat/discussions/164).
- To customize your nanochat, see [Guide: infusing identity to your nanochat](https://github.com/karpathy/nanochat/discussions/139) in Discussions, which describes how to tune your nanochat's personality through synthetic data generation and mixing that data into the SFT stage.
- [Oct 13 2025: original nanochat post](https://github.com/karpathy/nanochat/discussions/1) introducing nanochat, though it now contains some deprecated information and the model is a lot older (with worse results) than current master.
## File structure

```
.
├── LICENSE
├── README.md
├── dev
│   ├── gen_synthetic_data.py       # Example synthetic data for identity
│   ├── generate_logo.html
│   ├── nanochat.png
│   └── repackage_data_reference.py # Pretraining data shard generation
├── nanochat
│   ├── __init__.py                 # empty
│   ├── checkpoint_manager.py       # Save/load model checkpoints
│   ├── common.py                   # Misc small utilities, quality of life
│   ├── core_eval.py                # Evaluates base model CORE score (DCLM paper)
│   ├── dataloader.py               # Tokenizing distributed data loader
│   ├── dataset.py                  # Download/read utils for pretraining data
│   ├── engine.py                   # Efficient model inference with KV cache
│   ├── execution.py                # Allows the LLM to execute Python code as a tool
│   ├── gpt.py                      # The GPT nn.Module Transformer
│   ├── logo.svg
│   ├── loss_eval.py                # Evaluate bits per byte (instead of loss)
│   ├── optim.py                    # AdamW + Muon optimizer, 1 GPU and distributed
│   ├── report.py                   # Utilities for writing the nanochat report
│   ├── tokenizer.py                # BPE tokenizer wrapper in the style of GPT-4
│   └── ui.html                     # HTML/CSS/JS for the nanochat frontend
├── pyproject.toml
├── runs
│   ├── miniseries.sh               # Miniseries training script
│   ├── runcpu.sh                   # Small example of how to run on CPU/MPS
│   ├── scaling_laws.sh             # Scaling laws experiments
│   └── speedrun.sh                 # Train the ~$100 nanochat d20
├── scripts
│   ├── base_eval.py                # Base model: CORE score, bits per byte, samples
│   ├── base_train.py               # Base model: train
│   ├── chat_cli.py                 # Chat model: talk to over CLI
│   ├── chat_eval.py                # Chat model: eval tasks
│   ├── chat_rl.py                  # Chat model: reinforcement learning
│   ├── chat_sft.py                 # Chat model: train SFT
│   ├── chat_web.py                 # Chat model: talk to over web UI
│   ├── tok_eval.py                 # Tokenizer: evaluate compression rate
│   └── tok_train.py                # Tokenizer: train it
├── tasks
│   ├── arc.py                      # Multiple-choice science questions
│   ├── common.py                   # TaskMixture | TaskSequence
│   ├── customjson.py               # Make a Task from arbitrary jsonl convos
│   ├── gsm8k.py                    # 8K grade-school math questions
│   ├── humaneval.py                # Misnomer; simple Python coding task
│   ├── mmlu.py                     # Multiple-choice questions, broad topics
│   ├── smoltalk.py                 # Conglomerate dataset of SmolTalk from HF
│   └── spellingbee.py              # Task teaching the model to spell/count letters
├── tests
│   └── test_engine.py
└── uv.lock
```
## Contributing

The goal of nanochat is to improve the state of the art in micro models that are accessible to work with end to end on budgets of < $1000. Accessibility is about overall cost, but also about cognitive complexity: nanochat is not an exhaustively configurable LLM "framework"; there are no giant configuration objects, model factories, or if-then-else monsters in the code base. It is a single, cohesive, minimal, readable, hackable, maximally-forkable "strong baseline" codebase designed to run start to end and produce a ChatGPT model you can talk to. Personally, the most interesting part at the moment is reducing the latency to GPT-2 (i.e. getting a CORE score above 0.256525). Currently this takes ~3 hours, but by improving the pretraining stage we can push it lower.

Current AI policy: disclosure. When submitting a PR, please declare any parts that had substantial LLM contribution and that you have not written or do not fully understand.

## Acknowledgements

- The name (nanochat) derives from my earlier project [nanoGPT](https://github.com/karpathy/nanoGPT), which only covered pretraining.
- nanochat is also inspired by [modded-nanoGPT](https://github.com/KellerJordan/modded-nanogpt), which gamified the nanoGPT repo with clear metrics and a leaderboard, and borrows a lot of its ideas and some implementation for pretraining.
- Thank you to [HuggingFace](https://huggingface.co/) for fineweb and smoltalk.
- Thank you to [Lambda](https://lambda.ai/service/gpu-cloud) for the compute used in developing this project.
- Thank you to chief LLM whisperer 🧙♂️ Alec Radford for advice/guidance.
- Thank you to the repo czar Sofie [@svlandeg](https://github.com/svlandeg) for help with managing issues, pull requests and discussions of nanochat.

## Cite

If you find nanochat helpful in your research, cite it simply as:

```bibtex
@misc{nanochat,
  author = {Andrej Karpathy},
  title = {nanochat: The best ChatGPT that \$100 can buy},
  year = {2025},
  publisher = {GitHub},
  url = {https://github.com/karpathy/nanochat}
}
```

## License

MIT
### sh

- `python -m nanochat.report reset`
- `python -m scripts.tok_train --max_chars=2000000000`
- `python -m scripts.tok_eval`
- `torchrun --standalone --nproc_per_node=1 -m scripts.base_train -- --depth=18 --device-batch-size=1`
- `torchrun --standalone --nproc_per_node=1 -m scripts.base_eval -- --device-batch-size=1`
- `torchrun --standalone --nproc_per_node=1 -m scripts.chat_sft -- --device-batch-size=1`
- `torchrun --standalone --nproc_per_node=1 -m scripts.chat_eval -- -i sft`
- `python -m scripts.chat_cli -p "Why is the sky blue?"`
- `python -m scripts.chat_web`
- `python -m nanochat.report generate`
@@ -6,9 +6,12 @@ import os
 import re
 import logging
 import urllib.request
+from dotenv import load_dotenv
+from filelock import FileLock

 import torch
 import torch.distributed as dist
-from filelock import FileLock


 class ColoredFormatter(logging.Formatter):
     """Custom formatter that adds colors to log messages."""
@@ -47,15 +50,24 @@ def setup_default_logging():
 setup_default_logging()
 logger = logging.getLogger(__name__)

-def get_base_dir():
-    # co-locate nanochat intermediates with other cached data in ~/.cache (by default)
+load_dotenv()  # load .env into the environment variables
+def get_base_dir() -> str:
+    """Base directory for the project's data storage.
+
+    Args:
+        None
+
+    Returns:
+        str: path of the base directory
+    """
     if os.environ.get("NANOCHAT_BASE_DIR"):
-        nanochat_dir = os.environ.get("NANOCHAT_BASE_DIR")
+        nanochat_dir = os.environ.get("NANOCHAT_BASE_DIR", "")
+        assert os.path.isdir(nanochat_dir), f"NANOCHAT_BASE_DIR is set to {nanochat_dir} but that is not a directory"
     else:
         home_dir = os.path.expanduser("~")
         cache_dir = os.path.join(home_dir, ".cache")
         nanochat_dir = os.path.join(cache_dir, "nanochat")
-    os.makedirs(nanochat_dir, exist_ok=True)
+        os.makedirs(nanochat_dir, exist_ok=True)
     return nanochat_dir

 def download_file_with_lock(url, filename, postprocess_fn=None):
@@ -101,8 +101,16 @@ def find_common_length(token_sequences, direction='left'):
     return min_len


-def stack_sequences(tokens, pad_token_id):
-    """Stack up a list of token sequences, pad to longest on the right"""
+def stack_sequences(tokens, pad_token_id) -> torch.Tensor:
+    """Pad shorter sequences to the length of the longest token sequence and stack them into one batch.
+
+    Args:
+        tokens (List[List[int]]): list of token id sequences
+        pad_token_id (int): token id to use for padding
+
+    Returns:
+        input_ids (torch.LongTensor): padded tensor of token ids, of shape (batch_size, max_seq_len)
+    """
     bsz, seq_len = len(tokens), max(len(x) for x in tokens)
     input_ids = torch.full((bsz, seq_len), pad_token_id, dtype=torch.long)
     for i, x in enumerate(tokens):
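The right-padding logic that `stack_sequences` implements can be sketched without torch. This is an illustrative re-implementation on plain lists, not the function from the diff:

```python
# Torch-free sketch of right-padding a batch of token sequences to equal length.
def stack_sequences_lists(tokens, pad_token_id):
    seq_len = max(len(x) for x in tokens)  # pad everything to the longest sequence
    return [x + [pad_token_id] * (seq_len - len(x)) for x in tokens]

batch = stack_sequences_lists([[5, 6, 7], [8], [9, 10]], pad_token_id=0)
print(batch)  # [[5, 6, 7], [8, 0, 0], [9, 10, 0]]
```

The real function does the same thing in one `torch.full` allocation plus row-wise copies, which avoids building intermediate Python lists.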
@@ -143,36 +151,56 @@ def batch_sequences_lm(tokenizer, prompts):

 @torch.no_grad()
 def forward_model(model, input_ids):
-    """
-    Take BxT tensor of token ids, return BxT tensor of losses and argmax predictions.
-    The last column of losses is set to nan because we don't have autoregressive targets there.
+    """Forward the input token id sequences through the model to get losses and predictions.
+
+    Args:
+        - model: the language model under evaluation
+        - input_ids: input token id sequences, of shape (batch_size, seq_len)
+
+    Returns:
+        - losses: losses, of shape (batch_size, seq_len)
+        - predictions: predictions, of shape (batch_size, seq_len)
     """
     batch_size, seq_len = input_ids.size()
     outputs = model(input_ids)
-    # Roll the tensor to the left by one position to get the (autoregressive) target ids
+    # Construct the target token ids by rolling the input sequence one position to the left
     target_ids = torch.roll(input_ids, shifts=-1, dims=1)
-    # Calculate cross entropy at all positions
+    # Compute the cross-entropy loss at every position
     losses = torch.nn.functional.cross_entropy(
         outputs.view(batch_size * seq_len, -1),
         target_ids.view(batch_size * seq_len),
         reduction='none'
     ).view(batch_size, seq_len)
-    # Set the last column to be nan because there is no autoregressive loss there
+    # Ignore the loss at the last position, because it has no target token
     losses[:, -1] = float('nan')
-    # Get the argmax predictions at each position
+    # The predicted token id at each position is the argmax of the output logits
     predictions = outputs.argmax(dim=-1)
     return losses, predictions


 @torch.no_grad()
-def evaluate_example(idx, model, tokenizer, data, device, task_meta):
-    """Evaluate a single example, return True if correct, False otherwise"""
+def evaluate_example(idx, model, tokenizer, data, device, task_meta) -> bool:
+    """Check the correctness of one question, returning True/False; supports multiple-choice, schema, and language-modeling questions.
+
+    Args:
+        idx: index of the question in the data
+        model: the language model under evaluation
+        tokenizer: the tokenizer used for evaluation
+        data: list of evaluation data; each element is a dict representing one question
+        device: device used for evaluation (CPU/GPU)
+        task_meta: dict of task metadata, with the following fields:
+            - task_type: task type, a string: 'multiple_choice', 'schema', or 'language_modeling'
+            - num_fewshot: number of few-shot examples to sample per question, an integer
+            - continuation_delimiter: delimiter between context and continuation in the prompt, a string
+
+    Returns:
+        is_correct: whether the prediction for this question is correct, a boolean
+    """
     item = data[idx]
     task_type = task_meta['task_type']
     num_fewshot = task_meta['num_fewshot']
     continuation_delimiter = task_meta['continuation_delimiter']

-    # Sample few-shot examples (excluding current item)
+    # Generate few-shot examples (excluding the current item)
     fewshot_examples = []
     if num_fewshot > 0:
         rng = random.Random(1234 + idx)
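The left-roll target construction in `forward_model`, and the off-by-one slicing used later (`predictions[0, si-1:ei-1]` against `input_ids[0, si:ei]`), can be checked on plain lists. This toy sketch mirrors only the indexing, not the model:

```python
# Torch-free sketch of autoregressive target construction via a left shift.
def make_targets(row):
    # torch.roll(input_ids, shifts=-1, dims=1) moves element i+1 into slot i;
    # the last slot wraps around and is ignored (its loss is set to nan).
    return row[1:] + [row[0]]

row = [10, 11, 12, 13]
targets = make_targets(row)
print(targets)  # [11, 12, 13, 10] - position i is trained to predict row[i+1]
```

This is why a prediction made at position `i` is compared against the input token at position `i+1`, i.e. the answer span `[si, ei)` is read from predictions at `[si-1, ei-1)`.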
@@ -180,7 +208,7 @@ def evaluate_example(idx, model, tokenizer, data, device, task_meta):
     fewshot_indices = rng.sample(available_indices, num_fewshot)
     fewshot_examples = [data[i] for i in fewshot_indices]

-    # Render prompts and batch sequences based on task type
+    # Render the question and few-shot examples into full prompts, and tokenize them into model input token sequences
     if task_type == 'multiple_choice':
         prompts = render_prompts_mc(item, continuation_delimiter, fewshot_examples)
         tokens, start_idxs, end_idxs = batch_sequences_mc(tokenizer, prompts)
@@ -193,8 +221,7 @@ def evaluate_example(idx, model, tokenizer, data, device, task_meta):
     else:
         raise ValueError(f"Unsupported task type: {task_type}")

-    # Some models can't forward sequences beyond a certain length (e.g. GPT-2)
-    # In these cases, we have to truncate sequences to max length and adjust the indices
+    # Crop token sequences from the left to the model's maximum input length, and adjust the start/end indices accordingly
     if hasattr(model, 'max_seq_len') and model.max_seq_len is not None:
         max_tokens = model.max_seq_len
         new_tokens, new_start_idxs, new_end_idxs = [], [], []
@@ -212,25 +239,25 @@ def evaluate_example(idx, model, tokenizer, data, device, task_meta):
             new_end_idxs.append(e)
         tokens, start_idxs, end_idxs = new_tokens, new_start_idxs, new_end_idxs

-    # Stack up all the sequences into a batch
+    # Stack the token sequences into a batch and move them to the evaluation device
     pad_token_id = tokenizer.get_bos_token_id() # use BOS as pad token is ok
     input_ids = stack_sequences(tokens, pad_token_id)
     input_ids = input_ids.to(device)

-    # Forward the model, get the autoregressive loss and argmax prediction at each token
+    # Forward pass to get the loss and prediction at each token
     losses, predictions = forward_model(model, input_ids)

-    # See if the losses/predictions come out correctly
+    # LM questions: inputs has shape (1, seq_len), predictions has shape (1, seq_len), start_idxs and end_idxs are lists of length 1
     if task_type == 'language_modeling':
         # language modeling task is currently always batch size 1
         si = start_idxs[0]
         ei = end_idxs[0]
-        # predictions[i] predict input_ids[i+1] autoregressively
+        # Slice out the model's predictions for the answer span and compare them with the correct answer
         predicted_tokens = predictions[0, si-1:ei-1]
         actual_tokens = input_ids[0, si:ei]
         is_correct = torch.all(predicted_tokens == actual_tokens).item()
+    # Choice/schema questions: inputs has shape (num_options, seq_len), predictions has shape (num_options, seq_len), start_idxs and end_idxs are lists of length num_options
     elif task_type in ['multiple_choice', 'schema']:
-        # For MC/schema: find the option with lowest average loss
+        # Compute the model's mean loss on each option, take the option with the lowest loss as the prediction, and compare it with the correct answer
         mean_losses = [losses[i, si-1:ei-1].mean().item()
                        for i, (si, ei) in enumerate(zip(start_idxs, end_idxs))]
         pred_idx = mean_losses.index(min(mean_losses))
@@ -238,25 +265,37 @@ def evaluate_example(idx, model, tokenizer, data, device, task_meta):
     else:
         raise ValueError(f"Unsupported task type: {task_type}")

-    return is_correct
+    return bool(is_correct)


-def evaluate_task(model, tokenizer, data, device, task_meta):
-    """
-    This function is responsible for evaluating one task across many examples.
-    It also handles dispatch to all processes if the script is run with torchrun.
+def evaluate_task(model, tokenizer, data, device, task_meta) -> float:
+    """Evaluate the CORE score of one task (containing many questions); supports multiple-choice, schema, and language-modeling questions.
+
+    Args:
+        model: the language model under evaluation
+        tokenizer: the tokenizer used for evaluation
+        data: list of evaluation data; each element is a dict representing one question
+        device: device used for evaluation (CPU/GPU)
+        task_meta: dict of task metadata, with the following fields:
+            - task_type: task type, a string: 'multiple_choice', 'schema', or 'language_modeling'
+            - num_fewshot: number of few-shot examples to sample per question, an integer
+            - continuation_delimiter: delimiter between context and continuation in the prompt, a string
+
+    Returns:
+        mean_correct: the task's CORE score, a float in [0.0, 1.0]: the fraction of questions the model answered correctly
     """
+    # Get the rank of the current process and the total number of processes
     rank = dist.get_rank() if dist.is_initialized() else 0
     world_size = dist.get_world_size() if dist.is_initialized() else 1
     correct = torch.zeros(len(data), dtype=torch.float32, device=device)
-    # stride the examples to each rank
+    # Assign different questions to each process, to evaluate in parallel
     for idx in range(rank, len(data), world_size):
         is_correct = evaluate_example(idx, model, tokenizer, data, device, task_meta)
         correct[idx] = float(is_correct)
-    # sync results across all the processes if running distributed
+    # Sync the correct tensor across all processes and sum it to get the total number of correct questions
     if world_size > 1:
         dist.barrier()
         dist.all_reduce(correct, op=dist.ReduceOp.SUM)
-    # compute the mean
+    # The fraction of correct questions is the CORE score
     mean_correct = correct.mean().item()
     return mean_correct
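The strided work split in `evaluate_task` (rank r takes every `world_size`-th example starting at r, and the per-rank result tensors are combined by an elementwise sum) can be verified on plain lists:

```python
# Sketch of the strided sharding used in evaluate_task: each slot of the
# combined result is written by exactly one rank, so an all-reduce SUM
# reassembles the full correctness vector.
n_examples, world_size = 10, 4

shards = [list(range(rank, n_examples, world_size)) for rank in range(world_size)]
print(shards)  # [[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]

combined = sorted(i for shard in shards for i in shard)
print(combined)  # every example covered exactly once
```

Striding (rather than contiguous chunking) also balances the load when example difficulty drifts across the dataset.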
@@ -16,26 +16,33 @@ Fallback to the original if you have very limited data AND long documents:
 https://github.com/karpathy/nanochat/blob/3c3a3d7/nanochat/dataloader.py#L78-L117
 """

+from typing import Iterator, Tuple
 import torch
 import pyarrow.parquet as pq

 from nanochat.common import get_dist_info
 from nanochat.dataset import list_parquet_files

-def _document_batches(split, resume_state_dict, tokenizer_batch_size):
-    """
-    Infinite iterator over document batches (list of text strings) from parquet files.
-
-    Handles DDP sharding and approximate resume. Each yield is (text_batch, (pq_idx, rg_idx, epoch))
-    where text_batch is a list of document strings, indices track position for resumption,
-    and epoch counts how many times we've cycled through the dataset (starts at 1).
+def _document_batches(split, resume_state_dict, tokenizer_batch_size, nums_parallel_files=250) -> Iterator[Tuple[list, Tuple[int, int, int]]]:
+    """Generator that loads data in document batches and supports resuming from an interruption. Used in pretraining train/val.
+
+    Args:
+        split: "train" or "val"
+        resume_state_dict: state dict containing "pq_idx", "rg_idx", "epoch", used to resume from an interruption
+        tokenizer_batch_size: number of documents per batch
+        nums_parallel_files: number of parquet files used for the whole pretraining run
+
+    Yields:
+        tuple: a (batch, (pq_idx, rg_idx, epoch)) tuple containing the current batch's list of texts and the corresponding parquet file index, row group index, and epoch
     """
     ddp, ddp_rank, ddp_local_rank, ddp_world_size = get_dist_info()

     parquet_paths = list_parquet_files()
+    parquet_paths = parquet_paths[:min(len(parquet_paths), nums_parallel_files)]
     assert len(parquet_paths) != 0, "No dataset parquet files found, did you run dataset.py?"
+    # Train on all files except the last one; validate on the last file
     parquet_paths = parquet_paths[:-1] if split == "train" else parquet_paths[-1:]

+    # Restore the position from the state dict
     resume_pq_idx = resume_state_dict["pq_idx"] if resume_state_dict is not None else 0
     resume_rg_idx = resume_state_dict["rg_idx"] if resume_state_dict is not None else None
     resume_epoch = resume_state_dict.get("epoch", 1) if resume_state_dict is not None else 1
@@ -43,29 +50,35 @@ def _document_batches(split, resume_state_dict, tokenizer_batch_size):
     pq_idx = resume_pq_idx
     epoch = resume_epoch

-    while True: # iterate infinitely (multi-epoch)
+    while True:
         pq_idx = resume_pq_idx if first_pass else 0
         while pq_idx < len(parquet_paths):
             filepath = parquet_paths[pq_idx]
-            # pf = [
-            # [text1, text2, ...],
-            # [textN+1, textN+2, ...],
-            # ...]
+            # Each pf contains multiple row groups; each row group holds a batch of text data
             pf = pq.ParquetFile(filepath)
-            # Start from resume point if resuming on same file, otherwise from DDP rank
+            # Work out which row group each DDP process should start from
             if first_pass and (resume_rg_idx is not None) and (pq_idx == resume_pq_idx):
                 base_idx = resume_rg_idx // ddp_world_size
-                base_idx += 1 # advance by 1 so we don't repeat data after resuming
+                base_idx += 1 # start from the row group after the interruption point, to avoid repeating data
                 rg_idx = base_idx * ddp_world_size + ddp_rank
-                if rg_idx >= pf.num_row_groups:
+                if rg_idx >= pf.num_row_groups: # out of bounds, move on to the next parquet file
                     pq_idx += 1
                     continue
                 resume_rg_idx = None # only do this once
             else:
                 rg_idx = ddp_rank
+            # Iterate over the current parquet file's row groups; each DDP process handles different row groups
             while rg_idx < pf.num_row_groups:
-                rg = pf.read_row_group(rg_idx)
-                batch = rg.column('text').to_pylist()
-                for i in range(0, len(batch), tokenizer_batch_size):
+                rg = pf.read_row_group(rg_idx) # read one row group into a table object rg
+                batch = rg.column('text').to_pylist() # extract the text column from the table as a list of strings
+                for i in range(0, len(batch), tokenizer_batch_size): # split the text list into smaller batches of tokenizer_batch_size texts each
                     yield batch[i:i+tokenizer_batch_size], (pq_idx, rg_idx, epoch)
-                rg_idx += ddp_world_size
-            pq_idx += 1
+                rg_idx += ddp_world_size # skip the row groups handled by the other processes
+            pq_idx += 1 # move on to the next parquet file
         first_pass = False
         epoch += 1
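The resume arithmetic above (`base_idx = resume_rg_idx // ddp_world_size`, advance by one round, then re-expand per rank) can be checked on plain integers:

```python
# Sketch of the DDP row-group striding with resume, as in _document_batches:
# rank r owns row groups r, r + world_size, r + 2*world_size, ...
world_size = 4

def start_rg_idx(rank, resume_rg_idx):
    # Recover the "round" the interrupted run was in, advance one full round so
    # no data is repeated, then re-expand to this rank's own row group.
    base_idx = resume_rg_idx // world_size + 1
    return base_idx * world_size + rank

# Suppose the run was interrupted at row group 6 (round 6 // 4 == 1).
starts = [start_rg_idx(r, 6) for r in range(world_size)]
print(starts)  # [8, 9, 10, 11] - every rank resumes in round 2, in lockstep
```

Advancing a whole round keeps all ranks aligned, at the cost of skipping at most one round of data; the docstring's "approximate resume" refers to exactly this trade-off.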
@@ -76,24 +89,27 @@ def tokenizing_distributed_data_loader_with_state_bos_bestfit(
     device="cuda", resume_state_dict=None,
     buffer_size=1000
 ):
-    """
-    BOS-aligned dataloader with Best-Fit Cropping.
-
-    Reduces token waste compared to simple greedy cropping by searching a buffer
-    for documents that fit well, while maintaining 100% utilization (no padding).
-
-    Algorithm for each row:
-    1. From buffered docs, pick the LARGEST doc that fits entirely
-    2. Repeat until no doc fits
-    3. When nothing fits, crop a doc to fill remaining space exactly
-
-    Key properties:
-    - Every row starts with BOS
-    - 100% utilization (no padding, every token is trained on)
-    - Approximately 35% of all tokens are discarded due to cropping
+    """Distributed data loader that uses a BOS-aligned best-fit algorithm to pack text data into fixed-length input and target sequences. Used in pretraining train/val.
+
+    Args:
+        tokenizer: tokenizer object used to encode text into tokens
+        B: number of rows per batch (batch size), defining the batch dimension of inputs and targets
+        T: number of tokens per row (sequence length), defining the sequence dimension of inputs and targets
+        split: "train" or "val", specifying the dataset split
+        tokenizer_threads: number of threads used for encoding text
+        tokenizer_batch_size: number of texts per batch during encoding
+        device: "cuda" or "cpu", specifying the device the data loader outputs to
+        resume_state_dict: state dict containing "pq_idx", "rg_idx", "epoch", used to resume from an interruption
+        buffer_size: size of the document buffer, i.e. how many documents to load from the generator before filling row buffers
+
+    Returns:
+        Iterator[Tuple[torch.Tensor, torch.Tensor, dict]]: generator yielding (inputs, targets, state_dict) tuples, where inputs and targets are tensors of shape (B, T) and state_dict contains the current "pq_idx", "rg_idx", "epoch" for resumption
     """
     assert split in ["train", "val"], "split must be 'train' or 'val'"

     # row = [1, 2, 3, 4..., T, T + 1]
     # input = [1, 2, 3, 4..., T]
     # target = [2, 3, 4, 5..., T + 1]
     row_capacity = T + 1
     batches = _document_batches(split, resume_state_dict, tokenizer_batch_size)
     bos_token = tokenizer.get_bos_token_id()
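The best-fit packing that the original docstring describes (repeatedly place the largest buffered document that fits entirely; when nothing fits, crop a document to fill the row exactly) can be sketched on document lengths alone. This toy version ignores BOS tokens and the real buffer machinery, and its crop-selection choice is illustrative:

```python
# Toy sketch of best-fit row packing over document lengths (illustrative only).
def pack_row(doc_lens, capacity):
    """Fill one row of `capacity` token slots from a pool of document lengths."""
    row, remaining = [], capacity
    while remaining > 0:
        fitting = [d for d in doc_lens if d <= remaining]
        if fitting:
            pick = max(fitting)             # largest doc that fits entirely
            doc_lens.remove(pick)
            row.append(pick)
            remaining -= pick
        else:
            doc_lens.remove(max(doc_lens))  # nothing fits: crop one doc
            row.append(remaining)           # keep only the part that fills the row
            remaining = 0
    return row

pool = [9, 5, 4, 3, 7]
row = pack_row(pool, 12)
print(row, sum(row))  # rows are always filled exactly: sum(row) == capacity
```

Because every row is filled to exactly `capacity`, there is no padding (100% utilization); the cost is that cropped tails of documents are discarded.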
@@ -101,19 +117,22 @@ def tokenizing_distributed_data_loader_with_state_bos_bestfit(
     pq_idx, rg_idx, epoch = 0, 0, 1

     def refill_buffer():
+        """Fill the document buffer until it reaches the given buffer_size.
+        Each call pulls the next batch of texts from the batches generator, encodes it into token lists, and appends them to doc_buffer.
+        """
         nonlocal pq_idx, rg_idx, epoch
         doc_batch, (pq_idx, rg_idx, epoch) = next(batches)
+        # Encode the whole list of texts at once, prepending bos_token to each document, using multiple threads to speed up encoding
        token_lists = tokenizer.encode(doc_batch, prepend=bos_token, num_threads=tokenizer_threads)
         for tokens in token_lists:
             doc_buffer.append(tokens)

-    # Pre-allocate buffers once: layout is [inputs (B*T) | targets (B*T)]
-    # This gives us contiguous views and a single HtoD transfer
+    # Pre-allocate the CPU and GPU token buffers
     use_cuda = device == "cuda"
-    row_buffer = torch.empty((B, row_capacity), dtype=torch.long) # for building rows without creating Python lists
-    cpu_buffer = torch.empty(2 * B * T, dtype=torch.long, pin_memory=use_cuda) # staging area (CPU)
-    gpu_buffer = torch.empty(2 * B * T, dtype=torch.long, device=device) # on-device buffer
-    cpu_inputs = cpu_buffer[:B * T].view(B, T) # a few views into these buffers just for convenience
+    row_buffer = torch.empty((B, row_capacity), dtype=torch.long)
+    cpu_buffer = torch.empty(2 * B * T, dtype=torch.long, pin_memory=use_cuda)
+    gpu_buffer = torch.empty(2 * B * T, dtype=torch.long, device=device)
+    cpu_inputs = cpu_buffer[:B * T].view(B, T)
     cpu_targets = cpu_buffer[B * T:].view(B, T)
     inputs = gpu_buffer[:B * T].view(B, T)
     targets = gpu_buffer[B * T:].view(B, T)
@@ -122,40 +141,47 @@ def tokenizing_distributed_data_loader_with_state_bos_bestfit(
    for row_idx in range(B):
        pos = 0
        while pos < row_capacity:
            # Ensure the buffer holds enough documents to choose from
            while len(doc_buffer) < buffer_size:
                refill_buffer()

            # Remaining capacity of the current row
            remaining = row_capacity - pos

            # Find the largest doc that fits entirely in the remaining space;
            # also track the shortest doc in case nothing fits and we must crop
            best_idx = -1
            best_len = 0
            shortest_idx = 0
            shortest_len = len(doc_buffer[shortest_idx]) if doc_buffer else float('inf')
            for i, doc in enumerate(doc_buffer):
                doc_len = len(doc)
                if doc_len <= remaining and doc_len > best_len:
                    best_idx = i
                    best_len = doc_len
                if doc_len < shortest_len:
                    shortest_idx = i
                    shortest_len = doc_len

            if best_idx >= 0:
                # A doc fits entirely: place it in the row and remove it from the buffer
                doc = doc_buffer.pop(best_idx)
                doc_len = len(doc)
                row_buffer[row_idx, pos:pos + doc_len] = torch.tensor(doc, dtype=torch.long)
                pos += doc_len
            else:
                # No doc fits - crop the shortest doc in the buffer to fill the
                # remaining space exactly (minimizing waste) and remove it
                doc = doc_buffer.pop(shortest_idx)
                row_buffer[row_idx, pos:pos + remaining] = torch.tensor(doc[:remaining], dtype=torch.long)
                pos += remaining

    # Split each finished row into inputs (first T tokens) and targets (last T tokens),
    # copy into the pinned CPU buffer, then do a single HtoD transfer
    cpu_inputs.copy_(row_buffer[:, :-1])
    cpu_targets.copy_(row_buffer[:, 1:])

    state_dict = {"pq_idx": pq_idx, "rg_idx": rg_idx, "epoch": epoch}

    # Single HtoD copy into the persistent GPU buffer, then yield
    gpu_buffer.copy_(cpu_buffer, non_blocking=use_cuda)
    yield inputs, targets, state_dict

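The best-fit packing loop above, stripped of the tensor plumbing, fits in a few lines of plain Python. This is a minimal standalone sketch of the same idea over lists; `pack_row` is an illustrative name, not part of the repo, and it assumes the buffer always has at least one document (the real loader guarantees this via `refill_buffer`):

```python
def pack_row(doc_buffer, row_capacity):
    """Fill one row of row_capacity tokens from doc_buffer (a list of token lists).

    Repeatedly place the largest doc that fits entirely; once nothing fits,
    crop the shortest remaining doc to fill the row exactly (100% utilization).
    """
    row = []
    while len(row) < row_capacity:
        remaining = row_capacity - len(row)
        fitting = [i for i, d in enumerate(doc_buffer) if len(d) <= remaining]
        if fitting:
            # largest doc that fits entirely
            i = max(fitting, key=lambda i: len(doc_buffer[i]))
            row.extend(doc_buffer.pop(i))
        else:
            # crop the shortest doc to fill the remainder exactly
            i = min(range(len(doc_buffer)), key=lambda i: len(doc_buffer[i]))
            row.extend(doc_buffer.pop(i)[:remaining])
    return row

docs = [[1] * 5, [2] * 3, [3] * 9, [4] * 2]
row = pack_row(docs, 10)
assert row == [3] * 9 + [4]  # the 9-token doc fits, then the 2-token doc is cropped to 1
```

Note how only 1 token of the cropped document is discarded here, versus 5 wasted slots if the row had simply been padded after the 9-token document.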
@@ -10,6 +10,7 @@ For details of how the dataset was prepared, see `repackage_data_reference.py`.
import os
import argparse
import time
from typing import Iterator
import requests
import pyarrow.parquet as pq
from multiprocessing import Pool

@@ -24,36 +25,54 @@ BASE_URL = "https://huggingface.co/datasets/karpathy/fineweb-edu-100b-shuffle/re
MAX_SHARD = 1822 # the last datashard is shard_01822.parquet
index_to_filename = lambda index: f"shard_{index:05d}.parquet" # format of the filenames
base_dir = get_base_dir()
DATA_DIR = os.path.join(base_dir, "pretrain_data")
os.makedirs(DATA_DIR, exist_ok=True)

# -----------------------------------------------------------------------------
# These functions are useful utilities to other modules, can/should be imported

def list_parquet_files(data_dir=None) -> list[str]:
    """Look into a data dir and return full paths to all parquet files.

    Args:
        data_dir (str, optional): directory to search; None means use DATA_DIR

    Returns:
        list[str]: paths to all parquet files found
    """
    data_dir = DATA_DIR if data_dir is None else data_dir
    parquet_files = sorted([
        f for f in os.listdir(data_dir)
        if f.endswith('.parquet') and not f.endswith('.tmp')
    ])
    parquet_paths = [os.path.join(data_dir, f) for f in parquet_files]
    assert len(parquet_paths) != 0, f"No dataset parquet files found in {data_dir}, did you run dataset.py?"
    return parquet_paths

def parquets_iter_batched(split, start=0, step=1) -> Iterator[list[str]]:
    """Iterate through the dataset in batches of underlying row groups for
    efficiency; used in tokenizer training/evaluation.

    Args:
        split (str): dataset split, must be "train" or "val"; the last parquet file is val
        start (int, optional): starting row-group index, default 0
        step (int, optional): row-group stride, default 1 (visit every row group);
            useful so different processes iterate different row groups in DDP,
            e.g. start=rank, step=world_size

    Yields:
        list[str]: all texts of the current row group, [text1, text2, ...]
    """
    assert split in ["train", "val"], "split must be 'train' or 'val'"
    parquet_paths = list_parquet_files()
    # "train" iterates every file except the last one; "val" iterates only the last
    parquet_paths = parquet_paths[:-1] if split == "train" else parquet_paths[-1:]
    for filepath in parquet_paths:
        # each parquet file contains multiple row groups, each a batch of texts
        pf = pq.ParquetFile(filepath)
        for rg_idx in range(start, pf.num_row_groups, step):
            rg = pf.read_row_group(rg_idx)
            texts = rg.column('text').to_pylist()
            yield texts

# -----------------------------------------------------------------------------

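The `start`/`step` striding used for DDP above can be illustrated with plain ranges (the row-group count and world size here are hypothetical):

```python
num_row_groups = 10
world_size = 4

# each rank reads a disjoint, interleaved subset of row groups:
# rank r visits range(r, num_row_groups, world_size)
assignment = {rank: list(range(rank, num_row_groups, world_size)) for rank in range(world_size)}
# rank 0 -> [0, 4, 8], rank 1 -> [1, 5, 9], rank 2 -> [2, 6], rank 3 -> [3, 7]

covered = sorted(g for groups in assignment.values() for g in groups)
assert covered == list(range(num_row_groups))  # full coverage, no overlap
```

Interleaving (rather than contiguous chunks) keeps ranks balanced even when the total number of row groups is not divisible by the world size.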
@@ -3,7 +3,7 @@ name = "nanochat"
version = "0.1.0"
description = "the minimal full-stack ChatGPT clone"
readme = "README.md"
requires-python = ">=3.12"
dependencies = [
    "datasets>=4.0.0",
    "fastapi>=0.117.1",

@@ -47,14 +47,23 @@ torch = [
    { index = "pytorch-cu128", extra = "gpu" },
]

[[tool.uv.index]]
name = "tuna"
url = "https://pypi.tuna.tsinghua.edu.cn/simple"
default = true  # mark as the default index

[[tool.uv.index]]
name = "pytorch-cpu"
url = "https://mirrors.nju.edu.cn/pytorch/whl/cpu"
# url = "https://pypi.tuna.tsinghua.edu.cn/simple"
# url = "https://download.pytorch.org/whl/cpu"
explicit = true

[[tool.uv.index]]
name = "pytorch-cu128"
url = "https://mirrors.nju.edu.cn/pytorch/whl/cu128"
# url = "https://pypi.tuna.tsinghua.edu.cn/simple"
# url = "https://download.pytorch.org/whl/cu128"
explicit = true

[project.optional-dependencies]

runs/speedrundiy.sh (new executable file, 80 lines)

@@ -0,0 +1,80 @@
#!/bin/bash
# -----------------------------------------------------------------------------
# Variables
NPROC_PER_NODE=1
NANOCHAT_BASE_DIR="/media/data/liujiang/data/datasets/nanochat_base_dir"
export OMP_NUM_THREADS=1
export NANOCHAT_BASE_DIR="${NANOCHAT_BASE_DIR}"
export HF_ENDPOINT=https://hf-mirror.com
export HF_HUB_ENABLE_HF_HUB_CACHE=true
export HF_HUB_CACHE="${NANOCHAT_BASE_DIR}/hf_hub_cache"
export HF_DATASETS_CACHE="${NANOCHAT_BASE_DIR}/hf_datasets_cache"

# -----------------------------------------------------------------------------
# Environment setup

# install uv (if not already installed)
command -v uv &> /dev/null || curl -LsSf https://astral.sh/uv/install.sh | sh
# create a .venv local virtual environment (if it doesn't exist)
[ -d ".venv" ] || uv venv
# install the repo dependencies
uv sync --extra gpu
# activate venv so that `python` uses the project's venv instead of system python
source .venv/bin/activate

# -----------------------------------------------------------------------------
# Logging setup
# To log with wandb:
# 1) set `WANDB_API_KEY` in the environment
# 2) run `wandb login` locally first
# 3) set the WANDB_RUN environment variable when running, e.g.: `WANDB_RUN=d26 bash speedrun.sh`

if [ -z "$WANDB_RUN" ]; then
    # default to "dummy": a special value that skips wandb logging
    WANDB_RUN=dummy
else
    # if WANDB_RUN is set and not "dummy", run in online mode
    if [ "$WANDB_RUN" != "dummy" ]; then
        # make sure an API key was provided
        if [ -z "$WANDB_API_KEY" ]; then
            echo "Error: WANDB_RUN=$WANDB_RUN requests online mode, but WANDB_API_KEY is not set"
            exit 1
        fi
    fi
fi

# -----------------------------------------------------------------------------
# Reset the report/logging system
python -m nanochat.report reset

# -----------------------------------------------------------------------------
# Tokenizer training
python -m nanochat.dataset -n 8
python -m nanochat.dataset -n 370 &
DATASET_DOWNLOAD_PID=$!
python -m scripts.tok_train
python -m scripts.tok_eval

# -----------------------------------------------------------------------------
# Wait for the pretraining data download to finish, then pretrain
echo "Waiting for dataset download to complete..."
wait $DATASET_DOWNLOAD_PID

torchrun --standalone --nproc_per_node=$NPROC_PER_NODE -m scripts.base_train -- --depth=18 --target-param-data-ratio=8.25 --device-batch-size=1 --run=$WANDB_RUN
torchrun --standalone --nproc_per_node=$NPROC_PER_NODE -m scripts.base_eval -- --device-batch-size=1

# -----------------------------------------------------------------------------
# Download the instruction-tuning dataset, then fine-tune and evaluate
curl -L -o $NANOCHAT_BASE_DIR/identity_conversations.jsonl https://karpathy-public.s3.us-west-2.amazonaws.com/identity_conversations.jsonl
torchrun --standalone --nproc_per_node=$NPROC_PER_NODE -m scripts.chat_sft -- --device-batch-size=1 --run=$WANDB_RUN
torchrun --standalone --nproc_per_node=$NPROC_PER_NODE -m scripts.chat_eval -- -i sft

# CLI chat test
python -m scripts.chat_cli -p "Why is the sky blue?"

# web chat test
python -m scripts.chat_web

# -----------------------------------------------------------------------------
# Generate the report
python -m nanochat.report generate

@@ -24,6 +24,8 @@ import csv
import time
import json
import yaml
import math
from typing import Optional, Literal
import shutil
import random
import zipfile

@@ -95,7 +97,12 @@ EVAL_BUNDLE_URL = "https://karpathy-public.s3.us-west-2.amazonaws.com/eval_bundl

def place_eval_bundle(file_path):
    """Unzip eval_bundle.zip and place it at base_dir/eval_bundle.

    Args:
        file_path: path to the eval_bundle.zip file
    """
    base_dir = get_base_dir()
    eval_bundle_dir = os.path.join(base_dir, "eval_bundle")
    with tempfile.TemporaryDirectory() as tmpdir:

@@ -105,18 +112,33 @@ def place_eval_bundle(file_path):
        shutil.move(extracted_bundle_dir, eval_bundle_dir)
    print0(f"Placed eval_bundle directory at {eval_bundle_dir}")

scale_map = {
    "small": 0.1,
    "medium": 0.4,
    "large": 0.7
}

def evaluate_core(model, tokenizer, device, scale: Optional[Literal["small", "medium", "large"]] = None, max_per_task=-1):
    """Evaluate a base model on the CORE benchmark. Returns per-task accuracy
    and centered results, plus the final CORE metric.

    Args:
        model: the model to evaluate
        tokenizer: the model's tokenizer
        device: the device to evaluate on
        scale: optional evaluation scale, small/medium/large, which uses
            10%/40%/70% of the tasks respectively (default: 100%)
        max_per_task: max number of examples per task, -1 means use all (default: -1)

    Returns:
        dict with the following fields:
        - "results": per-task accuracy, as {task_label: accuracy}
        - "centered_results": per-task centered result, as {task_label: centered_result}
        - "core_metric": the mean of all centered results, i.e. the final CORE metric
    """
    base_dir = get_base_dir()
    eval_bundle_dir = os.path.join(base_dir, "eval_bundle")
    # Download the eval bundle if needed
    if not os.path.exists(eval_bundle_dir):
        download_file_with_lock(EVAL_BUNDLE_URL, "eval_bundle.zip", postprocess_fn=place_eval_bundle)

    # Load config and data
    config_path = os.path.join(eval_bundle_dir, "core.yaml")
    data_base_path = os.path.join(eval_bundle_dir, "eval_data")
    eval_meta_data = os.path.join(eval_bundle_dir, "eval_meta_data.csv")

@@ -125,7 +147,7 @@ def evaluate_core(model, tokenizer, device, max_per_task=-1):
        config = yaml.safe_load(f)
    tasks = config['icl_tasks']

    # Load random baseline values, e.g. 25.0 for a 4-way multiple-choice task, 50.0 for true/false
    random_baselines = {}
    with open(eval_meta_data, 'r', encoding='utf-8') as f:
        reader = csv.DictReader(f)

@@ -134,11 +156,22 @@ def evaluate_core(model, tokenizer, device, max_per_task=-1):
            random_baseline = row['Random baseline']
            random_baselines[task_name] = float(random_baseline)

    # Evaluate each task: compute accuracy and centered result, then the CORE metric
    results = {}
    centered_results = {}

    # Optionally limit the number of tasks
    if scale is not None and len(tasks) > 1:
        if scale not in scale_map:
            raise ValueError(f"Invalid scale: {scale}. Must be one of {list(scale_map.keys())}")
        num_tasks = math.ceil(len(tasks) * scale_map[scale])
        print0(f"Scaling CORE evaluation: {scale} -> using {num_tasks} tasks (out of {len(tasks)})")
        tasks = tasks[:num_tasks]

    # Run the evaluation
    for task in tasks:
        start_time = time.time()
        # Build the task metadata
        label = task['label']
        task_meta = {
            'task_type': task['icl_task_type'],

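The `scale` knob above simply truncates the task list to a ceiling-rounded fraction of the tasks. For example (the total task count here is hypothetical, for illustration only):

```python
import math

scale_map = {"small": 0.1, "medium": 0.4, "large": 0.7}
num_tasks_total = 22  # assumed number of CORE tasks, for illustration

# number of tasks each scale setting would evaluate
subset_sizes = {scale: math.ceil(num_tasks_total * frac) for scale, frac in scale_map.items()}
assert subset_sizes == {"small": 3, "medium": 9, "large": 16}
```

`math.ceil` guarantees at least one task is evaluated even for small fractions, and the `len(tasks) > 1` guard in the code above skips scaling entirely when there is only a single task.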
@@ -148,16 +181,18 @@ def evaluate_core(model, tokenizer, device, max_per_task=-1):
        }
        print0(f"Evaluating: {label} ({task_meta['num_fewshot']}-shot, type: {task_meta['task_type']})... ", end='')

        # Load the task data
        data_path = os.path.join(data_base_path, task_meta['dataset_uri'])
        with open(data_path, 'r', encoding='utf-8') as f:
            data = [json.loads(line.strip()) for line in f]

        # Shuffle for consistent subsampling when using max_per_task
        # (the order is random but reproducible)
        shuffle_rng = random.Random(1337)
        shuffle_rng.shuffle(data)
        if max_per_task > 0:
            data = data[:max_per_task]

        # Evaluate the task and record the results
        accuracy = evaluate_task(model, tokenizer, data, device, task_meta)
        results[label] = accuracy
        random_baseline = random_baselines[label]

@@ -166,6 +201,7 @@ def evaluate_core(model, tokenizer, device, max_per_task=-1):
        elapsed = time.time() - start_time
        print0(f"accuracy: {accuracy:.4f} | centered: {centered_result:.4f} | time: {elapsed:.2f}s")

    # CORE metric: the mean of all centered results
    core_metric = sum(centered_results.values()) / len(centered_results)
    out = {
        "results": results,

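The centering computation itself lies outside this hunk; a sketch of the usual baseline-centering formula (an assumption here, not quoted from the diff) is:

```python
def centered(accuracy: float, random_baseline_pct: float) -> float:
    """Rescale accuracy so that random guessing maps to 0.0 and a perfect score to 1.0."""
    base = random_baseline_pct / 100.0  # baselines in the CSV are percentages
    return (accuracy - base) / (1.0 - base)

assert centered(1.0, 25.0) == 1.0           # perfect score
assert abs(centered(0.25, 25.0)) < 1e-12    # random guessing, 4-way multiple choice
assert centered(0.5, 50.0) == 0.0           # random guessing, true/false
```

Centering is what makes averaging across tasks meaningful: without it, a true/false task would start at 50% "for free" while a 4-way task starts at 25%.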
@@ -185,23 +221,23 @@ def main():
    parser.add_argument('--step', type=int, default=None, help='Model step to load (default = last)')
    parser.add_argument('--max-per-task', type=int, default=-1, help='Max examples per CORE task (-1 = all)')
    parser.add_argument('--device-batch-size', type=int, default=32, help='Per-device batch size for BPB evaluation')
    parser.add_argument('--split-tokens', type=int, default=4*524288, help='Number of tokens to evaluate per split for BPB')
    parser.add_argument('--device-type', type=str, default='', help='cuda|cpu|mps (empty = autodetect)')
    args = parser.parse_args()

    # Parse and validate the evaluation modes (core, bpb, sample)
    eval_modes = set(mode.strip() for mode in args.eval.split(','))
    valid_modes = {'core', 'bpb', 'sample'}
    invalid = eval_modes - valid_modes
    if invalid:
        parser.error(f"Invalid eval modes: {invalid}. Valid: {valid_modes}")

    # Distributed / precision setup
    device_type = autodetect_device_type() if args.device_type == '' else args.device_type
    ddp, ddp_rank, ddp_local_rank, ddp_world_size, device = compute_init(device_type)
    autocast_ctx = torch.amp.autocast(device_type=device_type, dtype=torch.bfloat16) if device_type == "cuda" else nullcontext()

    # Load model and tokenizer
    is_hf_model = args.hf_path is not None
    if is_hf_model:
        model, tokenizer = load_hf_model(args.hf_path, device)

@@ -225,7 +261,7 @@ def main():
    samples = []
    unconditioned_samples = []

    # --- Sampling evaluation ---
    if 'sample' in eval_modes and not is_hf_model:
        print0("\n" + "="*80)
        print0("Model Samples")

@@ -263,7 +299,7 @@ def main():
    elif 'sample' in eval_modes and is_hf_model:
        print0("\nSkipping sampling for HuggingFace models (not supported)")

    # --- BPB evaluation ---
    if 'bpb' in eval_modes:
        print0("\n" + "="*80)
        print0("BPB Evaluation")

@@ -282,13 +318,13 @@ def main():
            bpb_results[split_name] = bpb
            print0(f"{split_name} bpb: {bpb:.6f}")

    # --- CORE evaluation ---
    if 'core' in eval_modes:
        print0("\n" + "="*80)
        print0("CORE Evaluation")
        print0("="*80)
        with autocast_ctx:
            core_results = evaluate_core(model, tokenizer, device, scale="small", max_per_task=args.max_per_task)

        # Write CSV output
        if ddp_rank == 0:

@@ -305,7 +341,7 @@ def main():
            print0(f"\nResults written to: {output_csv_path}")
            print0(f"CORE metric: {core_results['core_metric']:.4f}")

    # --- Log to report ---
    from nanochat.report import get_report
    report_data = [{"model": model_name}]

@@ -71,7 +71,7 @@ parser.add_argument("--final-lr-frac", type=float, default=0.0, help="final LR a
parser.add_argument("--resume-from-step", type=int, default=-1, help="resume training from this step (-1 = disable)")
# Evaluation
parser.add_argument("--eval-every", type=int, default=250, help="evaluate val bpb every N steps (-1 = disable)")
parser.add_argument("--eval-tokens", type=int, default=4*524288, help="number of tokens to evaluate val loss on")
parser.add_argument("--core-metric-every", type=int, default=2000, help="evaluate CORE metric every N steps (-1 = disable)")
parser.add_argument("--core-metric-max-per-task", type=int, default=500, help="examples per task for CORE metric")
parser.add_argument("--sample-every", type=int, default=2000, help="sample from model every N steps (-1 = disable)")

@@ -425,7 +425,7 @@ while True:
    if args.core_metric_every > 0 and (last_step or (step > 0 and step % args.core_metric_every == 0)):
        model.eval()
        with disable_fp8(orig_model), autocast_ctx:
            results = evaluate_core(orig_model, tokenizer, device, scale="small", max_per_task=args.core_metric_max_per_task)
        print0(f"Step {step:05d} | CORE metric: {results['core_metric']:.4f}")
        wandb_run.log({
            "step": step,

@@ -456,6 +456,7 @@ while True:
            print0(tokenizer.decode(sample[0]))
        model.train()

    last_step = step == 10  # debug override: stop after 10 steps to exercise the full pipeline quickly
    # save checkpoint: at the end of the run, or every save_every steps, except at the first step or the resume step
    if last_step or (step > 0 and step != args.resume_from_step and args.save_every > 0 and step % args.save_every == 0):
        save_checkpoint(

@@ -60,7 +60,7 @@ parser.add_argument("--warmdown-ratio", type=float, default=0.5, help="ratio of
parser.add_argument("--final-lr-frac", type=float, default=0.0, help="final LR as fraction of initial LR")
# Evaluation
parser.add_argument("--eval-every", type=int, default=200, help="evaluate val bpb every N steps (-1 = disable)")
parser.add_argument("--eval-tokens", type=int, default=4*524288, help="number of tokens to evaluate val loss on")
parser.add_argument("--chatcore-every", type=int, default=200, help="evaluate ChatCORE metric every N steps (-1 = disable)")
parser.add_argument("--chatcore-max-cat", type=int, default=-1, help="max problems per categorical task for ChatCORE")
parser.add_argument("--chatcore-max-sample", type=int, default=24, help="max problems per generative task for ChatCORE")

@@ -393,6 +393,7 @@ while True:
        })
        model.train()

    last_step = step == 10  # debug override: stop after 10 steps to exercise the full pipeline quickly
    # save checkpoint at the end of the run (all ranks participate so each saves its optimizer shard)
    if last_step:
        output_dirname = args.model_tag if args.model_tag else f"d{depth}"  # e.g. d12

@@ -435,6 +436,13 @@ while True:
    loss.backward()
    x, y = next(train_loader)  # prefetch the next batch while the GPU is busy with forward/backward
    progress = max(progress, approx_progress)  # only increase progress monotonically
    # Gradient clipping; skip the optimizer step if the gradient norm is non-finite
    total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    if torch.isnan(total_norm) or torch.isinf(total_norm):
        print0(f"WARNING: gradient norm is {total_norm}, skipping step")
        model.zero_grad(set_to_none=True)
        # step += 1
        continue
    # step the optimizer
    lrm = get_lr_multiplier(progress)
    muon_momentum = get_muon_momentum(step)

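The clip-then-skip pattern added above can be exercised in isolation without PyTorch. This is a minimal pure-Python sketch of the same logic (`clip_grad_norm` is an illustrative stand-in for `torch.nn.utils.clip_grad_norm_`, which operates on parameter tensors instead of floats):

```python
import math

def clip_grad_norm(grads, max_norm):
    """Scale grads in place so their global L2 norm is at most max_norm.

    Returns the norm *before* clipping, mirroring clip_grad_norm_'s contract."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if math.isfinite(total_norm) and total_norm > max_norm:
        scale = max_norm / total_norm
        for i in range(len(grads)):
            grads[i] *= scale
    return total_norm

grads = [3.0, 4.0]  # global norm 5.0
norm = clip_grad_norm(grads, max_norm=1.0)
assert norm == 5.0  # the pre-clip norm is what gets reported/logged
assert abs(math.sqrt(sum(g * g for g in grads)) - 1.0) < 1e-9  # post-clip norm is 1.0

# the training loop skips the optimizer step only when the reported norm is nan/inf
skip_step = not math.isfinite(norm)
assert skip_step is False
```

Because the reported norm is the pre-clip value, checking it for nan/inf catches exploding or corrupted gradients before they reach the optimizer, while ordinary large-but-finite gradients are simply scaled down.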
@@ -4,6 +4,7 @@ In the style of GPT-4 tokenizer.
"""
import os
import time
from typing import Iterator, Tuple
import argparse
import torch
from nanochat.tokenizer import RustBPETokenizer

@@ -16,7 +17,7 @@ from nanochat.dataset import parquets_iter_batched
parser = argparse.ArgumentParser(description='Train a BPE tokenizer')
parser.add_argument('--max-chars', type=int, default=2_000_000_000, help='Maximum characters to train on (default: 2B)')
parser.add_argument('--doc-cap', type=int, default=10_000, help='Maximum characters per document (default: 10,000)')
parser.add_argument('--vocab-size', type=int, default=2**16, help='Vocabulary size (default: 65536 = 2^16)')
args = parser.parse_args()
print(f"max_chars: {args.max_chars:,}")
print(f"doc_cap: {args.doc_cap:,}")

@@ -25,32 +26,50 @@ print(f"vocab_size: {args.vocab_size:,}")
# -----------------------------------------------------------------------------
# Text iterator

def text_iterator() -> Iterator[str]:
    """Iterate over document texts:
    1) Flatten the batches into a single iterator
    2) Crop every document to args.doc_cap characters
    3) Stop once args.max_chars characters have been seen

    Yields:
        str: one document text at a time
    """
    nchars = 0
    for batch in parquets_iter_batched(split="train"):
        for doc in batch:
            doc_text = doc
            # crop documents that exceed the configured cap
            if len(doc_text) > args.doc_cap:
                doc_text = doc_text[:args.doc_cap]
            nchars += len(doc_text)
            yield doc_text
            # stop once we have seen enough characters
            if nchars > args.max_chars:
                return
text_iter = text_iterator()

# -----------------------------------------------------------------------------
# Train the tokenizer

def train(iterator: Iterator[str], vocab_size: int) -> Tuple[RustBPETokenizer, float]:
    """Train the BPE tokenizer.

    Args:
        iterator (Iterator[str]): text iterator
        vocab_size (int): vocabulary size

    Returns:
        Tuple[RustBPETokenizer, float]: the trained tokenizer and the training time in seconds
    """
    start = time.time()
    tokenizer = RustBPETokenizer.train_from_iterator(iterator, vocab_size)
    end = time.time()
    train_time = end - start
    return tokenizer, train_time

tokenizer, train_time = train(text_iter, args.vocab_size)
print(f"Training time: {train_time:.2f}s")
# -----------------------------------------------------------------------------
# Save the tokenizer to disk
base_dir = get_base_dir()

@@ -68,6 +87,23 @@ encoded = tokenizer.encode(test_text)
decoded = tokenizer.decode(encoded)
assert decoded == test_text

def sanity_check(tokenizer: RustBPETokenizer):
    """Quick inline check that encoding and decoding round-trip exactly.

    Args:
        tokenizer (RustBPETokenizer): the tokenizer to check

    Raises:
        AssertionError: if decoding the encoding does not reproduce the text
    """
    test_text = """Hello world! This is a test.
Numbers: 123, 4567, 89
Contractions: I'm, you're, it's
Special chars: @#$%^&*()
Unicode: 你好世界 🌍"""
    encoded = tokenizer.encode(test_text)
    decoded = tokenizer.decode(encoded)
    assert decoded == test_text

# -----------------------------------------------------------------------------
# One more thing: we wish to cache a mapping from token id to number of bytes of that token
# for efficient evaluation of bits per byte. Unlike the typical mean loss, this

@@ -90,6 +126,32 @@ with open(token_bytes_path, "wb") as f:
    torch.save(token_bytes, f)
print(f"Saved token_bytes to {token_bytes_path}")

def generate_token_bytes(tokenizer: RustBPETokenizer, save_path: str) -> torch.Tensor:
    """Build a tensor mapping each token id to its byte length and save it to disk.

    Args:
        tokenizer (RustBPETokenizer): the trained tokenizer
        save_path (str): where to save the token_bytes tensor

    Returns:
        torch.Tensor: the byte length of each token id
    """
    vocab_size = tokenizer.get_vocab_size()
    special_set = set(tokenizer.get_special_tokens())
    token_strings = [tokenizer.decode([token_id]) for token_id in range(vocab_size)]
    token_bytes = []
    for token_id in range(vocab_size):
        token_str = token_strings[token_id]  # the Python string representation of this token
        if token_str in special_set:
            token_bytes.append(0)  # special characters are not counted
        else:
            id_bytes = len(token_str.encode("utf-8"))  # number of bytes that make up this token
            token_bytes.append(id_bytes)
    token_bytes = torch.tensor(token_bytes, dtype=torch.int32, device='cpu')
    with open(save_path, "wb") as f:
        torch.save(token_bytes, f)
    print(f"Saved token_bytes to {save_path}")
    return token_bytes

# Log to report
from nanochat.report import get_report
token_bytes_nonzero = (token_bytes[token_bytes > 0]).to(dtype=torch.float32)

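The core of the token-to-byte-length mapping used for bits-per-byte evaluation fits in a few lines of plain Python. A toy illustration (the vocabulary and special-token names here are hypothetical):

```python
tokens = {0: "<|bos|>", 1: "the", 2: " 你好", 3: "🌍"}  # toy vocab for illustration
special = {"<|bos|>"}

# byte length of each token's UTF-8 form; special tokens are not counted (0 bytes)
token_bytes = [0 if tokens[i] in special else len(tokens[i].encode("utf-8"))
               for i in sorted(tokens)]
assert token_bytes == [0, 3, 7, 4]  # "the" = 3, " 你好" = 1 + 3 + 3, "🌍" = 4 bytes
```

Dividing total loss (in bits) by total bytes, rather than by token count, makes the metric comparable across tokenizers with different vocabulary sizes.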
@@ -104,3 +166,38 @@ get_report().log(section="Tokenizer training", data=[
        "token_bytes_std": token_bytes_nonzero.std().item(),
    }
])

def log_tokenizer_training(args: argparse.Namespace, train_time: float, tokenizer: RustBPETokenizer):
    """Log tokenizer-training details to the report.

    Args:
        args (argparse.Namespace): command line arguments
        train_time (float): training time in seconds
        tokenizer (RustBPETokenizer): the trained tokenizer
    """
    # Compute the token_bytes statistics
    vocab_size = tokenizer.get_vocab_size()
    special_set = set(tokenizer.get_special_tokens())
    token_strings = [tokenizer.decode([token_id]) for token_id in range(vocab_size)]
    token_bytes = []
    for token_id in range(vocab_size):
        token_str = token_strings[token_id]  # the Python string representation of this token
        if token_str in special_set:
            token_bytes.append(0)  # special characters are not counted
        else:
            id_bytes = len(token_str.encode("utf-8"))  # number of bytes that make up this token
            token_bytes.append(id_bytes)
    token_bytes = torch.tensor(token_bytes, dtype=torch.int32, device='cpu')
    token_bytes_nonzero = (token_bytes[token_bytes > 0]).to(dtype=torch.float32)

    # Log to report
    get_report().log(section="Tokenizer training", data=[
        vars(args),  # argparse command line arguments
        {"train_time": train_time},
        {"num_special_tokens": len(special_set)},
        {
            "token_bytes_min": int(token_bytes_nonzero.min().item()),
            "token_bytes_max": int(token_bytes_nonzero.max().item()),
            "token_bytes_mean": token_bytes_nonzero.mean().item(),
            "token_bytes_std": token_bytes_nonzero.std().item(),
        }
    ])