From f8ff0439b9b9192399deb1ed8a09874152b4a407 Mon Sep 17 00:00:00 2001
From: svlandeg
Date: Fri, 6 Mar 2026 11:03:00 +0100
Subject: [PATCH 1/6] two more small typos

---
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 077fd9c..6be1109 100644
--- a/README.md
+++ b/README.md
@@ -71,7 +71,7 @@ OMP_NUM_THREADS=1 torchrun --standalone --nproc_per_node=8 -m scripts.base_train
 This uses wandb (run name "d12"), only runs the CORE metric on last step, and it doesn't sample and save intermediate checkpoints. I like to change something in the code, re-run a d12 (or a d16 etc) and see if it helped, in an iteration loop. To see if a run helps, I like to monitor the wandb plots for:
 
 1. `val_bpb` (validation loss in vocab-size-invariant units of bits per byte) as a function of `step`, `total_training_time` and `total_training_flops`.
-2. `core_metric` (the DCLM CORE socre)
+2. `core_metric` (the DCLM CORE score)
 3. VRAM utilization, `train/mfu` (Model FLOPS utilization), `train/tok_per_sec` (training throughput)
 
 See an example [here](https://github.com/karpathy/nanochat/pull/498#issuecomment-3850720044).
@@ -101,7 +101,7 @@ NANOCHAT_DTYPE=bfloat16 torchrun --nproc_per_node=8 -m scripts.base_train # for
 
 How it works: model weights are stored in fp32 (for optimizer precision), but our custom `Linear` layer casts them to `COMPUTE_DTYPE` during the forward pass. Embeddings are stored directly in `COMPUTE_DTYPE` to save memory. This gives us the same mixed-precision benefit as autocast but with full explicit control over what runs in which precision.
 
-Note: `float16` training automatically enables a `GradScaler` in `base_train.py` to prevent gradient underflow. SFT suppors this too but RL currently does not. Inference in fp16 works fine everywhere.
+Note: `float16` training automatically enables a `GradScaler` in `base_train.py` to prevent gradient underflow. SFT supports this too but RL currently does not. Inference in fp16 works fine everywhere.
 
 ## Guides

From d96558bcb0dc11b546bebff79bc0f56fa944c362 Mon Sep 17 00:00:00 2001
From: svlandeg
Date: Tue, 10 Mar 2026 09:57:30 +0100
Subject: [PATCH 2/6] fix heading, cf #622

---
 .claude/skills/read-arxiv-paper/SKILL.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/.claude/skills/read-arxiv-paper/SKILL.md b/.claude/skills/read-arxiv-paper/SKILL.md
index 6a9cda7..0a1b131 100644
--- a/.claude/skills/read-arxiv-paper/SKILL.md
+++ b/.claude/skills/read-arxiv-paper/SKILL.md
@@ -33,7 +33,7 @@ Every latex source usually has an entrypoint, such as `main.tex` or something li
 
 Once you've found the entrypoint, Read the contents and then recurse through all other relevant source files to read the paper.
 
-#### Part 6: Report
+### Part 6: Report
 
 Once you've read the paper, produce a summary of the paper into a markdown file at `./knowledge/summary_{tag}.md`. Notice that 1) use the local knowledge directory here (it's easier for me to open and reference here), not in `~/.cache`, and 2) generate some reasonable `tag` like e.g. `conditional_memory` or whatever seems appropriate given the paper. Probably make sure that the tag doesn't exist yet so you're not overwriting files.

From 1052d25d454847a4bbf2cb85cbee250471535814 Mon Sep 17 00:00:00 2001
From: svlandeg
Date: Fri, 13 Mar 2026 13:46:16 +0100
Subject: [PATCH 3/6] we only need to wait 2h now!

---
 dev/LEADERBOARD.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/dev/LEADERBOARD.md b/dev/LEADERBOARD.md
index 556ec3c..6fdeaa3 100644
--- a/dev/LEADERBOARD.md
+++ b/dev/LEADERBOARD.md
@@ -36,7 +36,7 @@ Note that:
 - `target-param-data-ratio=8.25` controls the training horizon, which is determined in the script by taking the number of non-embedding model parameters and simply multiplying by this number. The current optimal Tokens:Params ratio can be seen in the defaults of the `base_train.py` script (it is 10.5). 10.5 would produce the *compute optimal* model given the currently measured scaling laws. However, GPT-2 capability is currently somewhere in between a d24 and d26. So to reach it exactly, we want to either overtrain d24 or undertrain d26. In this particular example, I am choosing to slightly undertrain a d26. Note that odd depths (e.g. d25) are not super recommended to use because the math around the transformer sizing and its head dimensions doesn't come out neatly.
 - `--fp8` turns on fp8 training. If your GPU does not support fp8, you can leave this out and the code will simply train in bf16. bf16 is higher precision than fp8, so you can actually expect that you might be able to do fewer steps (lower the `target-param-data-ratio`) to achieve the same capability.
 
-Once you kick off the run, you wait ~3 hours and then at the end you'll see something like:
+Once you kick off the run, you wait ~2 hours and then at the end you'll see something like:
 
 ```
 wandb: Run summary:

From bd6e9c8d5fb1d02f43bb4bb0c837736183662b39 Mon Sep 17 00:00:00 2001
From: svlandeg
Date: Sun, 15 Mar 2026 22:18:18 +0100
Subject: [PATCH 4/6] fix numbering

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 9c09cc3..fa0cd23 100644
--- a/README.md
+++ b/README.md
@@ -19,7 +19,7 @@ Presently, the main focus of development is on tuning the pretraining stage, whi
 | 3 | 2.76 | 0.74645 | 0.2602 | bump total batch size to 1M tokens | Feb 5 2026 | 2c062aa | @karpathy |
 | 4 | 2.02 | 0.71854 | 0.2571 | change dataset to NVIDIA ClimbMix | Mar 4 2026 | 324e69c | @ddudek @karpathy |
 | 5 | 1.80 | 0.71808 | 0.2690 | autoresearch [round 1](https://x.com/karpathy/status/2031135152349524125) | Mar 9 2026 | 6ed7d1d | @karpathy |
-| 5 | 1.65 | 0.71800 | 0.2626 | autoresearch round 2 | Mar 14 2026 | a825e63 | @karpathy |
+| 6 | 1.65 | 0.71800 | 0.2626 | autoresearch round 2 | Mar 14 2026 | a825e63 | @karpathy |
 
 The primary metric we care about is "time to GPT-2" - the wall clock time needed to outperform the GPT-2 (1.6B) CORE metric on an 8XH100 GPU node. The GPT-2 CORE score is 0.256525. In 2019, the training of GPT-2 cost approximately $43,000 so it is incredible that due to many advances over 7 years across the stack, we can now do so much faster and for well below $100 (e.g. at the current ~$3/GPU/hr, an 8XH100 node is ~$24/hr, so 2 hours is ~$48).

From 1f9e42a85588c34be86e4cb30db5488b0f01f4c2 Mon Sep 17 00:00:00 2001
From: svlandeg
Date: Sun, 15 Mar 2026 22:27:18 +0100
Subject: [PATCH 5/6] two more typos, from PR 645

---
 .claude/skills/read-arxiv-paper/SKILL.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/.claude/skills/read-arxiv-paper/SKILL.md b/.claude/skills/read-arxiv-paper/SKILL.md
index 0a1b131..cebee1b 100644
--- a/.claude/skills/read-arxiv-paper/SKILL.md
+++ b/.claude/skills/read-arxiv-paper/SKILL.md
@@ -1,6 +1,6 @@
 ---
 name: read-arxiv-paper
-description: Use this skill when when asked to read an arxiv paper given an arxiv URL
+description: Use this skill when asked to read an arxiv paper given an arxiv URL
 ---
 
 You will be given a URL of an arxiv paper, for example:
@@ -37,4 +37,4 @@ Once you've found the entrypoint, Read the contents and then recurse through all
 
 Once you've read the paper, produce a summary of the paper into a markdown file at `./knowledge/summary_{tag}.md`. Notice that 1) use the local knowledge directory here (it's easier for me to open and reference here), not in `~/.cache`, and 2) generate some reasonable `tag` like e.g. `conditional_memory` or whatever seems appropriate given the paper. Probably make sure that the tag doesn't exist yet so you're not overwriting files.
 
-As for the summary itself, remember that you're processing this paper within the context of the nanochat repository, so most often we we will be interested in how to apply the paper and its lessons to the nanochat project. Therefore, you should feel free to "remind yourself" of the related nanochat code by reading the relevant parts, and then explicitly make the connection of how this paper might relate to nanochat or what are things we might be inspired about or try.
+As for the summary itself, remember that you're processing this paper within the context of the nanochat repository, so most often we will be interested in how to apply the paper and its lessons to the nanochat project. Therefore, you should feel free to "remind yourself" of the related nanochat code by reading the relevant parts, and then explicitly make the connection of how this paper might relate to nanochat or what are things we might be inspired about or try.

From 51f42a4406ccd5223f945edbbd6deefba14e3f97 Mon Sep 17 00:00:00 2001
From: svlandeg
Date: Sun, 15 Mar 2026 22:29:27 +0100
Subject: [PATCH 6/6] ~1.5h :-)

---
 dev/LEADERBOARD.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/dev/LEADERBOARD.md b/dev/LEADERBOARD.md
index 097c394..c3fa8cd 100644
--- a/dev/LEADERBOARD.md
+++ b/dev/LEADERBOARD.md
@@ -36,7 +36,7 @@ Note that:
 - `target-param-data-ratio=8.25` controls the training horizon, which is determined in the script by taking the number of non-embedding model parameters and simply multiplying by this number. The current optimal Tokens:Params ratio can be seen in the defaults of the `base_train.py` script (it is 10.5). 10.5 would produce the *compute optimal* model given the currently measured scaling laws. However, GPT-2 capability is currently somewhere in between a d24 and d26. So to reach it exactly, we want to either overtrain d24 or undertrain d26. In this particular example, I am choosing to slightly undertrain a d26. Note that odd depths (e.g. d25) are not super recommended to use because the math around the transformer sizing and its head dimensions doesn't come out neatly.
 - `--fp8` turns on fp8 training. If your GPU does not support fp8, you can leave this out and the code will simply train in bf16. bf16 is higher precision than fp8, so you can actually expect that you might be able to do fewer steps (lower the `target-param-data-ratio`) to achieve the same capability.
 
-Once you kick off the run, you wait ~2 hours and then at the end you'll see something like:
+Once you kick off the run, you wait ~1.5 hours and then at the end you'll see something like:
 
 ```
 wandb: Run summary:
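Editor's note: the LEADERBOARD and README text touched by patches 3, 4, and 6 relies on two small calculations, the training horizon (non-embedding parameters multiplied by `target-param-data-ratio`) and the node cost (~$3/GPU/hr across 8 H100s). A minimal sketch of that arithmetic, where the 1.9B non-embedding parameter count is a hypothetical illustration and not a value taken from the repo:

```python
# Sketch of the LEADERBOARD/README arithmetic. The $3/GPU/hr rate and the
# 8-GPU node come from the README text; the parameter count below is a
# made-up illustrative figure, NOT a number from nanochat.

GPU_HOURLY_RATE = 3.0  # ~$3/GPU/hr, per the README's cost estimate
GPUS_PER_NODE = 8      # 8XH100 node

def training_tokens(non_embedding_params: int, target_param_data_ratio: float) -> int:
    """Training horizon: non-embedding params times the Tokens:Params ratio."""
    return int(non_embedding_params * target_param_data_ratio)

def node_cost(hours: float) -> float:
    """Wall-clock cost of a full-node run at the assumed hourly rate."""
    return hours * GPU_HOURLY_RATE * GPUS_PER_NODE

# A hypothetical ~1.9B-non-embedding-parameter model at the leaderboard's
# slightly-undertrained ratio of 8.25:
print(training_tokens(1_900_000_000, 8.25))  # prints 15675000000 (~15.7B tokens)
print(node_cost(2.0))                        # prints 48.0 ($24/hr * 2 hours)
```

This also makes the fp8 remark concrete: if bf16 reaches the target capability in fewer tokens, lowering `target-param-data-ratio` shrinks `training_tokens` and hence the run time and cost proportionally.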