mirror of https://github.com/karpathy/nanochat.git
synced 2026-04-03 06:05:25 +00:00
Merge d96558bcb0 into f068604948
This commit is contained in: commit 395ffcf64e
@@ -33,7 +33,7 @@ Every latex source usually has an entrypoint, such as `main.tex` or something li
 Once you've found the entrypoint, Read the contents and then recurse through all other relevant source files to read the paper.
 
-#### Part 6: Report
+### Part 6: Report
 
 Once you've read the paper, produce a summary of the paper into a markdown file at `./knowledge/summary_{tag}.md`. Notice that 1) use the local knowledge directory here (it's easier for me to open and reference here), not in `~/.cache`, and 2) generate some reasonable `tag` like e.g. `conditional_memory` or whatever seems appropriate given the paper. Probably make sure that the tag doesn't exist yet so you're not overwriting files.
 
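The "make sure the tag doesn't exist yet" step above can be sketched as a small stdlib check. This is an illustrative sketch, not code from the repo; the `summary_path` helper name is hypothetical.

```python
from pathlib import Path

def summary_path(tag: str, knowledge_dir: str = "./knowledge") -> Path:
    """Build the target path for a paper summary, refusing to overwrite.

    Hypothetical helper: e.g. tag = "conditional_memory" gives
    ./knowledge/summary_conditional_memory.md
    """
    path = Path(knowledge_dir) / f"summary_{tag}.md"
    if path.exists():
        # the tag is taken; the caller should pick a different one
        raise FileExistsError(f"tag '{tag}' already in use: {path}")
    return path
```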
@@ -72,7 +72,7 @@ OMP_NUM_THREADS=1 torchrun --standalone --nproc_per_node=8 -m scripts.base_train
 This uses wandb (run name "d12"), only runs the CORE metric on last step, and it doesn't sample and save intermediate checkpoints. I like to change something in the code, re-run a d12 (or a d16 etc) and see if it helped, in an iteration loop. To see if a run helps, I like to monitor the wandb plots for:
 
 1. `val_bpb` (validation loss in vocab-size-invariant units of bits per byte) as a function of `step`, `total_training_time` and `total_training_flops`.
-2. `core_metric` (the DCLM CORE socre)
+2. `core_metric` (the DCLM CORE score)
 3. VRAM utilization, `train/mfu` (Model FLOPS utilization), `train/tok_per_sec` (training throughput)
 
 See an example [here](https://github.com/karpathy/nanochat/pull/498#issuecomment-3850720044).
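For readers unfamiliar with the `val_bpb` metric mentioned in the hunk above: cross-entropy loss in nats per token depends on the tokenizer's vocabulary size, but renormalizing to bits per byte of raw text makes runs with different tokenizers comparable. A minimal sketch of that conversion (assuming the loss is in nats per token, and that token and byte counts are over the same text):

```python
import math

def nats_per_token_to_bpb(loss_nats: float, total_tokens: int, total_bytes: int) -> float:
    """Convert average cross-entropy (nats/token) to bits per byte.

    nats -> bits via division by ln(2), then rescale from per-token
    to per-byte using the token/byte counts of the evaluated text.
    """
    bits_per_token = loss_nats / math.log(2)
    return bits_per_token * total_tokens / total_bytes
```

For example, a loss of 2.0 nats/token on text where tokens average 4 bytes each gives roughly 0.72 bits per byte.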
@@ -102,7 +102,7 @@ NANOCHAT_DTYPE=bfloat16 torchrun --nproc_per_node=8 -m scripts.base_train # for
 
 How it works: model weights are stored in fp32 (for optimizer precision), but our custom `Linear` layer casts them to `COMPUTE_DTYPE` during the forward pass. Embeddings are stored directly in `COMPUTE_DTYPE` to save memory. This gives us the same mixed-precision benefit as autocast but with full explicit control over what runs in which precision.
 
-Note: `float16` training automatically enables a `GradScaler` in `base_train.py` to prevent gradient underflow. SFT suppors this too but RL currently does not. Inference in fp16 works fine everywhere.
+Note: `float16` training automatically enables a `GradScaler` in `base_train.py` to prevent gradient underflow. SFT supports this too but RL currently does not. Inference in fp16 works fine everywhere.
 
 ## Guides
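The "fp32 master weights, cast on the forward pass" scheme described in the hunk above can be illustrated with a framework-agnostic numpy sketch. This is not nanochat's actual `Linear` (which is a PyTorch module); the `CastLinear` name and shapes here are hypothetical, and numpy's `float16` stands in for the configurable `COMPUTE_DTYPE` (numpy has no bfloat16):

```python
import numpy as np

class CastLinear:
    """Sketch of explicit mixed precision: the master weight copy stays in
    float32 (so optimizer updates don't lose small increments), and only a
    temporary cast to the compute dtype is used in the forward matmul."""

    def __init__(self, in_features: int, out_features: int, compute_dtype=np.float16):
        # full-precision master weights, as an optimizer would see them
        self.weight = (np.random.randn(out_features, in_features) * 0.02).astype(np.float32)
        self.compute_dtype = compute_dtype

    def forward(self, x: np.ndarray) -> np.ndarray:
        # cast on the fly; the fp32 master copy is left untouched
        w = self.weight.astype(self.compute_dtype)
        return x.astype(self.compute_dtype) @ w.T
```

The design point is that, unlike autocast, nothing is implicit: every op runs in exactly the dtype you cast to, and the fp32 master copy is the only state the optimizer touches.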