DeoJin
40ecb443f0
Merge f7ecfa0366 into a445144d39
2026-03-28 04:20:40 +00:00
Andrej Karpathy
a445144d39
create a group for dev dependencies; there is no need to install all this other stuff just for speedrun, and it's exposing people to dependency chain attacks. we need to delete more dependencies. dependencies bad bad bad
2026-03-26 03:41:28 +00:00
Andrej Karpathy
03be953668
delete non-essential deps from legacy use
2026-03-26 03:41:28 +00:00
Andrej
7808dc7159
Merge pull request #595 from svlandeg/fix/typo
Small fixes
2026-03-25 14:40:25 -07:00
Andrej
a4ed96687b
Merge pull request #634 from 2bitbit/fix-docs-and-comments
fix: correct minor typos in help text, README, and comments
2026-03-25 14:31:49 -07:00
Andrej
7b70f6b411
Merge pull request #639 from mathieu-lacage/master
Verified: both paths return identical data (99,842 rows), and all splits under subset='all' load cleanly.
2026-03-25 14:29:30 -07:00
RoomWithOutRoof
47e983eea7
fix: use meta device in disable_fp8 to avoid VRAM spike (#616)
When swapping Float8Linear to Linear in the disable_fp8 context manager,
using device=fp8_module.weight.device directly allocates new tensors
on the GPU, causing an unnecessary VRAM spike (~1GB for large models).
This fix uses device='meta' to avoid physical memory allocation,
then swaps in the weight tensor reference, eliminating the
unnecessary VRAM spike during the evaluation phase.
Fixes issue #592
Co-authored-by: RoomWithOutRoof <roomwithoutroof@sparklab.ai>
2026-03-25 14:24:57 -07:00
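The meta-device swap in the commit above can be sketched as follows. This is a minimal illustration, not nanochat's actual code: the function name and the plain `nn.Linear` stand-in for `Float8Linear` are assumptions.

```python
import torch
import torch.nn as nn

def swap_to_linear(fp8_module: nn.Linear) -> nn.Linear:
    # Allocate the replacement module on the meta device: no physical
    # memory is touched, so there is no transient VRAM allocation.
    new_linear = nn.Linear(
        fp8_module.in_features,
        fp8_module.out_features,
        bias=fp8_module.bias is not None,
        device="meta",
    )
    # Swap in references to the existing tensors instead of copying them.
    new_linear.weight = fp8_module.weight
    if fp8_module.bias is not None:
        new_linear.bias = fp8_module.bias
    return new_linear

orig = nn.Linear(8, 4)
swapped = swap_to_linear(orig)
assert swapped.weight is orig.weight  # same storage, no new GPU allocation
```

The key point is that `device="meta"` defers any real allocation, so the only live weight tensor is the one already owned by the fp8 module.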
Andrej Karpathy
c0dbf1f3ff
use COMPUTE_DTYPE-aware cast in Muon polar express step
The bf16 cast is intentional for speed on Hopper+ GPUs, but should be
skipped on other platforms rather than blindly applied. fp16 is unstable
here due to its limited exponent range, and fp32 platforms don't benefit
from the cast. Now: bf16 when COMPUTE_DTYPE is bf16, no cast otherwise.
Inspired by PR #667.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-25 20:19:14 +00:00
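The cast decision described above reduces to a small rule. A sketch, with the function name and string-valued dtypes as illustrative assumptions:

```python
from typing import Optional

def polar_express_cast_dtype(compute_dtype: str) -> Optional[str]:
    """Return the dtype to cast to inside the polar express step, or None.

    bf16 is fast on Hopper+ GPUs and keeps fp32's exponent range; fp16's
    narrow exponent range makes it unstable here, and fp32 platforms gain
    nothing from casting, so in both cases we skip the cast entirely.
    """
    return "bfloat16" if compute_dtype == "bfloat16" else None

assert polar_express_cast_dtype("bfloat16") == "bfloat16"
assert polar_express_cast_dtype("float32") is None
assert polar_express_cast_dtype("float16") is None
```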
Andrej Karpathy
4e1694cc95
bunch of ideas tried from openai/parameter-golf, all negative results for nanochat
2026-03-24 22:13:13 +00:00
Andrej Karpathy
1cd94d768f
bump D:N ratio to 12 per recent scaling laws re-run
2026-03-24 19:25:50 +00:00
Andrej Karpathy
c16db281ff
fix small bug with params logging and batch size
2026-03-24 19:25:34 +00:00
svlandeg
dfe7d39ce8
Merge branch 'master' into fix/typo
2026-03-18 17:01:45 +01:00
Andrej Karpathy
5019accc5b
fix scaling laws scripts after the bigram embeddings were removed
2026-03-17 16:55:56 +00:00
DeoJin
f7ecfa0366
docs: show model-tag in chat CLI examples
2026-03-17 12:56:14 +01:00
svlandeg
51f42a4406
~1.5h :-)
2026-03-15 22:29:27 +01:00
svlandeg
1f9e42a855
two more typos, from PR 645
2026-03-15 22:27:18 +01:00
svlandeg
bd6e9c8d5f
fix numbering
2026-03-15 22:18:18 +01:00
svlandeg
02e865c2ab
Merge branch 'master' into fix/typo
2026-03-15 22:18:01 +01:00
Andrej Karpathy
1b1cc3c599
submit new time to GPT-2 leaderboard entry: 99 minutes
2026-03-14 17:15:01 +00:00
Andrej Karpathy
a825e63f81
Autoresearch round 2: smear, backout, and hyperparameter tuning
New architectural features:
- Smear: mix previous token embedding into current position via learned
gate, providing cheap bigram-like info (works in training + KV cache)
- Backout: subtract learned fraction of mid-layer residual before logit
projection to remove low-level features
Hyperparameter tuning:
- Muon momentum warmdown 0.97→0.90 during LR warmdown phase
- Non-uniform per-layer init: resid_lambdas 1.15→1.05, x0_lambdas 0.20→0.05
- c_fc init scale 0.4x, QK norm scale 1.2, sliding window seq_len/4
- Speedrun data:params ratio reduced to 8
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-14 17:03:06 +00:00
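The "smear" idea above can be sketched in a few lines: mix the previous token's embedding into the current position through a sigmoid gate. Names and the exact gating form are assumptions, not nanochat's code; the point is that the mixing is strictly causal, which is why it works both in training and with a KV cache.

```python
import math

def smear(embeddings: list[list[float]], gate_logit: float) -> list[list[float]]:
    g = 1.0 / (1.0 + math.exp(-gate_logit))  # learned scalar gate in (0, 1)
    out = []
    prev = None
    for x in embeddings:  # causal: position t only looks at position t-1
        if prev is None:
            out.append(list(x))
        else:
            out.append([xi + g * pi for xi, pi in zip(x, prev)])
        prev = x
    return out

smeared = smear([[1.0, 0.0], [0.0, 1.0]], gate_logit=0.0)  # g = 0.5
assert smeared == [[1.0, 0.0], [0.5, 1.0]]
```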
svlandeg
6405b26d24
Merge branch 'master' into fix/typo
2026-03-13 13:56:50 +01:00
svlandeg
1052d25d45
we only need to wait 2h now!
2026-03-13 13:46:16 +01:00
Mathieu Lacage
a641b6ca96
MMLU main split is named auxiliary_train, not train
2026-03-13 13:19:10 +01:00
2bitbit
2bb93b2ae4
fix: correct minor typos in help text, README, and comments
2026-03-12 17:03:26 +08:00
svlandeg
d96558bcb0
fix heading, cf #622
2026-03-10 09:57:30 +01:00
Andrej Karpathy
f068604948
new leaderboard entry from the improvements of autoresearch round 1: time to GPT-2 goes from 2.02 hours to 1.80 hours
2026-03-10 06:26:39 +00:00
Andrej Karpathy
6ed7d1d82c
All of these improvements were developed by Claude running autonomously over ~2 days using autoresearch. I didn't touch anything - incredible. All tuning was done on d12 but generalized easily to larger models (e.g. d24 in particular). This means we will also get a new "Time to GPT-2" Leaderboard entry, which I will push separately.
Optimizer & schedule changes:
- Increase unembedding LR 0.004 -> 0.008, weight decay 0.2 -> 0.28
- Per-group Adam betas and weight decay (instead of shared global betas)
- Muon beta2 0.95 -> 0.9, momentum warmup target 0.95 -> 0.97 over 400 steps
- Warmup: ratio-based -> absolute steps (default 40)
- Warmdown ratio 0.5 -> 0.65, final LR fraction 0.0 -> 0.05
- Weight decay schedule: linear -> cosine decay
- Polar express norm factor 1.02 -> 1.01
Architecture & init changes:
- VE gate: channels 32 -> 12, scale range 2x -> 3x, init small positive
- Add post-QK-norm scaling (q,k *= 1.15) for sharper attention
- Embedding init std 1.0 -> 0.8, MLP c_fc init 0.5x smaller
- RoPE base theta 10K -> 100K
- Short attention window: seq_len/2 -> ~seq_len/3 (ceil to 128 tile)
- Logit softcap 20 -> 15
2026-03-09 20:45:17 +00:00
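The optimizer and schedule changes listed above could be captured in a config fragment like the following. The key names are hypothetical and only summarize the commit; they are not nanochat's actual kwargs.

```python
# Summary of the commit's optimizer/schedule changes; old values in comments.
optim_config = {
    "unembedding_lr": 0.008,            # was 0.004
    "unembedding_weight_decay": 0.28,   # was 0.2
    "muon_beta2": 0.9,                  # was 0.95
    "muon_momentum_target": 0.97,       # was 0.95, warmed up over 400 steps
    "warmup_steps": 40,                 # absolute steps; was ratio-based
    "warmdown_ratio": 0.65,             # was 0.5
    "final_lr_frac": 0.05,              # was 0.0
    "wd_schedule": "cosine",            # was linear
    "polar_express_norm_factor": 1.01,  # was 1.02
}
```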
svlandeg
f8ff0439b9
two more small typos
2026-03-06 11:03:00 +01:00
Andrej Karpathy
1076f97059
delete autocast, an unnecessary thorn in my side, manage dtypes directly
2026-03-04 23:55:30 +00:00
Sofie Van Landeghem
752abc836e
Ensure that inputs and targets are contiguous (#569)
...
* call reshape instead of view in case the tensors are not contiguous
* fix directly in data loader instead
2026-03-04 13:58:27 -08:00
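A small demonstration of why this fix matters: `.view()` requires a contiguous tensor, while `.reshape()` copies only when needed. The PR's final approach makes the data loader emit contiguous tensors up front so `.view()` stays valid downstream.

```python
import torch

x = torch.arange(12).reshape(3, 4).t()  # transpose makes it non-contiguous
assert not x.is_contiguous()

try:
    x.view(-1)                # view fails on non-contiguous input
    raised = False
except RuntimeError:
    raised = True
assert raised

flat = x.reshape(-1)          # reshape succeeds (copies here)
assert flat.numel() == 12

y = x.contiguous()            # the data-loader fix: contiguous up front
assert y.view(-1).numel() == 12
```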
Andrej Karpathy
4b4077425b
Document new Leaderboard entry; congrats @ddudek for pointing out ClimbMix. Time to GPT-2 is now 2.01 hours, down from 2.76 previously
2026-03-04 20:02:07 +00:00
Andrej Karpathy
324e69c45d
big, breaking change but large upside: swap the previous FineWeb-EDU dataset for the NVIDIA ClimbMix dataset. Requires people to download the data shards. The upside is that training a GPT-2 capability model now only takes ~2 hours, down from 2.76 hours, so this is a huge win data-wise
2026-03-04 19:47:12 +00:00
Andrej Karpathy
b07604ebaa
document the legacy fineweb100b dataset and the new climbmix400b dataset
2026-03-03 17:24:31 +00:00
Andrej Karpathy
aba30cb037
tune logit softcap?
2026-03-03 00:38:53 +00:00
Anish
83dccc20ae
Restore completion-only loss masking in SFT dataloader (#582)
* printing steps count
* adding reply-only loss for chat
* using the mask from the tokeniser's render_conversation function
* undoing some changes
* putting back the comment which got removed accidentally, no functionality change
2026-03-02 16:37:47 -08:00
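Completion-only loss masking, as described in the PR notes above, can be sketched like this. It assumes `render_conversation` returns token ids plus a mask that is 1 on assistant-reply tokens; the function name `build_targets` and the exact mask convention are illustrative, not nanochat's actual code.

```python
IGNORE_INDEX = -100  # the conventional "ignore" label for cross-entropy losses

def build_targets(token_ids: list[int], mask: list[int]) -> list[int]:
    # Keep the label only on completion tokens; prompt tokens contribute
    # no gradient because the loss skips IGNORE_INDEX positions.
    return [tok if m == 1 else IGNORE_INDEX for tok, m in zip(token_ids, mask)]

ids  = [10, 11, 12, 13, 14]
mask = [0,  0,  1,  1,  1]   # first two tokens are the prompt
assert build_targets(ids, mask) == [IGNORE_INDEX, IGNORE_INDEX, 12, 13, 14]
```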
Dipesh Babu
c7ba252142
docs: fix typos in experiment log (#547)
2026-02-20 08:03:45 -08:00
Andrej Karpathy
2dffdc8cf6
document MoE exploration
2026-02-19 02:53:47 +00:00
Andrej Karpathy
48804bff3a
report negative result on fineweb dataset
2026-02-18 23:45:31 +00:00
Andrej Karpathy
bb5137860e
fix comment
2026-02-18 23:26:22 +00:00
Andrej Karpathy
458555117b
Merge branch 'Chetter2-patch-1'
2026-02-18 23:17:39 +00:00
Andrej Karpathy
bac5a35dd7
fix minor bug in fp8 application to skip tiny matmuls
2026-02-18 23:17:29 +00:00
George Shakan
ad55575326
Fix bug in setting precision (#538)
2026-02-18 15:49:18 +00:00
Sofie Van Landeghem
cac43e8511
Fix MockModel's device definition (#535)
* fix MockModel's device definition
* cleanup
2026-02-18 15:49:18 +00:00
Andrej Karpathy
f5fe7925ed
update dev log with recent changes
2026-02-18 15:49:18 +00:00
Andrej Karpathy
1415fb7617
tune the data mixture a bit, and load the optimizer by default when doing SFT. These were confirmed to be the best settings from SFT sweeps
2026-02-18 15:49:18 +00:00
Andrej Karpathy
77f8fb8303
a number of upgrades to the SFT script to bring it up to date w.r.t. pretraining, and tuning of some of its kwargs based on sweeps
2026-02-18 15:49:18 +00:00
George Shakan
0a23f87643
Fix bug in setting precision (#538)
2026-02-18 07:42:11 -08:00
Sofie Van Landeghem
4800c62f6e
Fix MockModel's device definition (#535)
* fix MockModel's device definition
* cleanup
2026-02-17 16:03:46 -08:00
Andrej Karpathy
4a6e47b0c6
update dev log with recent changes
2026-02-17 15:44:54 +00:00
Andrej Karpathy
8180e1d8c1
tune the data mixture a bit, and load the optimizer by default when doing SFT. These were confirmed to be the best settings from SFT sweeps
2026-02-16 20:23:04 +00:00