Suraj-Self
cf0d5fe6b2
Merge bb8e371256 into 7808dc7159
2026-03-26 10:58:33 +08:00
suraj-self
bb8e371256
Merge branch 'master' into fix-scaling-zero-division
2026-03-26 08:27:51 +05:30
Andrej
7808dc7159
Merge pull request #595 from svlandeg/fix/typo
...
Small fixes
2026-03-25 14:40:25 -07:00
Andrej
a4ed96687b
Merge pull request #634 from 2bitbit/fix-docs-and-comments
...
fix: correct minor typos in help text, README, and comments
2026-03-25 14:31:49 -07:00
Andrej
7b70f6b411
Merge pull request #639 from mathieu-lacage/master
...
Verified: both paths return identical data (99,842 rows), and all splits under subset='all' load cleanly.
2026-03-25 14:29:30 -07:00
RoomWithOutRoof
47e983eea7
fix: use meta device in disable_fp8 to avoid VRAM spike ( #616 )
...
When swapping Float8Linear to Linear in disable_fp8 context manager,
using device=fp8_module.weight.device directly allocates new tensors
on GPU, causing unnecessary VRAM spike (~1GB for large models).
This fix uses device='meta' to avoid physical memory allocation,
then swaps in the weight tensor reference. This eliminates the
unnecessary VRAM spike during evaluation phase.
Fixes issue #592
Co-authored-by: RoomWithOutRoof <roomwithoutroof@sparklab.ai>
2026-03-25 14:24:57 -07:00
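A minimal sketch of the pattern this commit describes (the real `disable_fp8` context manager and `Float8Linear` class live in the repo; plain `nn.Linear` stands in for both here): construct the replacement module on the meta device so no physical memory is allocated, then point its parameters at the existing weight storage.

```python
import torch
import torch.nn as nn

def swap_to_linear(fp8_module: nn.Linear) -> nn.Linear:
    """Replace a module with a plain Linear without allocating new tensors.

    Building the new Linear with device='meta' allocates only shape/dtype
    metadata; reassigning .weight/.bias then reuses the original storage,
    avoiding the ~1GB VRAM spike from device=fp8_module.weight.device.
    """
    out_f, in_f = fp8_module.weight.shape
    has_bias = fp8_module.bias is not None
    new_mod = nn.Linear(in_f, out_f, bias=has_bias, device="meta")
    new_mod.weight = fp8_module.weight          # reuse storage, no copy
    if has_bias:
        new_mod.bias = fp8_module.bias
    return new_mod
```
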
Andrej Karpathy
c0dbf1f3ff
use COMPUTE_DTYPE-aware cast in Muon polar express step
...
The bf16 cast is intentional for speed on Hopper+ GPUs, but should be
skipped on other platforms rather than blindly applied. fp16 is unstable
here due to its limited exponent range, and fp32 platforms don't benefit
from the cast. Now: bf16 when COMPUTE_DTYPE is bf16, no cast otherwise.
Inspired by PR #667 .
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-25 20:19:14 +00:00
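The cast policy described above can be sketched as follows (illustrative only; `COMPUTE_DTYPE` is assumed to be a module-level config flag, and the repo's actual Muon code may structure this differently):

```python
import torch

COMPUTE_DTYPE = torch.bfloat16  # assumed global config, set per platform

def maybe_cast(G: torch.Tensor) -> torch.Tensor:
    """Cast the polar express iterate to bf16 only when computing in bf16.

    bf16 is a speed win on Hopper+ GPUs; fp16 is unstable here because of
    its narrow exponent range, and fp32 platforms gain nothing from the
    cast, so in every other case the tensor passes through unchanged.
    """
    if COMPUTE_DTYPE == torch.bfloat16:
        return G.to(torch.bfloat16)
    return G
```
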
Andrej Karpathy
4e1694cc95
bunch of ideas tried from openai/parameter-golf, all negative results for nanochat
2026-03-24 22:13:13 +00:00
Andrej Karpathy
1cd94d768f
bump D:N ratio to 12 per recent scaling laws re-run
2026-03-24 19:25:50 +00:00
Andrej Karpathy
c16db281ff
fix small bug with params logging and batch size
2026-03-24 19:25:34 +00:00
suraj-self
db0cbb9110
Merge branch 'master' into fix-scaling-zero-division
2026-03-18 23:47:53 +05:30
svlandeg
dfe7d39ce8
Merge branch 'master' into fix/typo
2026-03-18 17:01:45 +01:00
Andrej Karpathy
5019accc5b
fix scaling laws scripts after the bigram embeddings were removed
2026-03-17 16:55:56 +00:00
svlandeg
51f42a4406
~1.5h :-)
2026-03-15 22:29:27 +01:00
svlandeg
1f9e42a855
two more typos, from PR 645
2026-03-15 22:27:18 +01:00
svlandeg
bd6e9c8d5f
fix numbering
2026-03-15 22:18:18 +01:00
svlandeg
02e865c2ab
Merge branch 'master' into fix/typo
2026-03-15 22:18:01 +01:00
suraj-self
daaea1537a
Merge branch 'master' into fix-scaling-zero-division
2026-03-15 21:07:07 +05:30
Andrej Karpathy
1b1cc3c599
submit new time to GPT-2 leaderboard entry: 99 minutes
2026-03-14 17:15:01 +00:00
Andrej Karpathy
a825e63f81
Autoresearch round 2: smear, backout, and hyperparameter tuning
...
New architectural features:
- Smear: mix previous token embedding into current position via learned
gate, providing cheap bigram-like info (works in training + KV cache)
- Backout: subtract learned fraction of mid-layer residual before logit
projection to remove low-level features
Hyperparameter tuning:
- Muon momentum warmdown 0.97→0.90 during LR warmdown phase
- Non-uniform per-layer init: resid_lambdas 1.15→1.05, x0_lambdas 0.20→0.05
- c_fc init scale 0.4x, QK norm scale 1.2, sliding window seq_len/4
- Speedrun data:params ratio reduced to 8
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-14 17:03:06 +00:00
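The "smear" idea as described can be sketched like this (a hedged toy version: the real implementation's gate shape and placement in the model may differ):

```python
import torch

def smear(x: torch.Tensor, gate: torch.Tensor) -> torch.Tensor:
    """Mix the previous token's embedding into the current position.

    x:    (B, T, C) token embeddings
    gate: learned scalar; sigmoid squashes it into (0, 1)

    Because each position only looks one token back, this works the same
    way in training and in incremental KV-cache decoding.
    """
    prev = torch.roll(x, shifts=1, dims=1)
    prev[:, 0, :] = 0.0  # position 0 has no previous token
    g = torch.sigmoid(gate)
    return x + g * prev
```
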
svlandeg
6405b26d24
Merge branch 'master' into fix/typo
2026-03-13 13:56:50 +01:00
svlandeg
1052d25d45
we only need to wait 2h now!
2026-03-13 13:46:16 +01:00
Mathieu Lacage
a641b6ca96
MMLU main split is named auxiliary_train, not train
2026-03-13 13:19:10 +01:00
2bitbit
2bb93b2ae4
fix: correct minor typos in help text, README, and comments
2026-03-12 17:03:26 +08:00
suraj-self
781b53078c
Merge branch 'master' into fix-scaling-zero-division
2026-03-11 21:39:46 +05:30
svlandeg
d96558bcb0
fix heading, cf #622
2026-03-10 09:57:30 +01:00
Andrej Karpathy
f068604948
new leaderboard entry coming from improvements of autoresearch round 1, time to GPT-2 down from 2.02 hours to 1.80 hours
2026-03-10 06:26:39 +00:00
suraj-self
e3bd5545b5
Merge branch 'master' into fix-scaling-zero-division
2026-03-10 07:38:01 +05:30
Andrej Karpathy
6ed7d1d82c
All of these improvements were developed by Claude running autonomously over ~2 days using autoresearch. I didn't touch anything - incredible. All tuning was done on d12 but generalized easily to larger models (e.g. d24 in particular). This means we will also get a new "Time to GPT-2" Leaderboard entry, which I will push separately.
...
Optimizer & schedule changes:
- Increase unembedding LR 0.004 -> 0.008, weight decay 0.2 -> 0.28
- Per-group Adam betas and weight decay (instead of shared global betas)
- Muon beta2 0.95 -> 0.9, momentum warmup target 0.95 -> 0.97 over 400 steps
- Warmup: ratio-based -> absolute steps (default 40)
- Warmdown ratio 0.5 -> 0.65, final LR fraction 0.0 -> 0.05
- Weight decay schedule: linear -> cosine decay
- Polar express norm factor 1.02 -> 1.01
Architecture & init changes:
- VE gate: channels 32 -> 12, scale range 2x -> 3x, init small positive
- Add post-QK-norm scaling (q,k *= 1.15) for sharper attention
- Embedding init std 1.0 -> 0.8, MLP c_fc init 0.5x smaller
- RoPE base theta 10K -> 100K
- Short attention window: seq_len/2 -> ~seq_len/3 (ceil to 128 tile)
- Logit softcap 20 -> 15
2026-03-09 20:45:17 +00:00
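"Per-group Adam betas and weight decay" uses PyTorch's standard per-parameter-group option override; a sketch with dummy parameters (the unembedding values are from the commit message, the second group's values are purely illustrative):

```python
import torch

# Stand-ins for real model parameter groups.
unembed = torch.nn.Parameter(torch.zeros(16, 8))
other = torch.nn.Parameter(torch.zeros(8, 8))

# Any AdamW kwarg can be overridden per group, including betas.
opt = torch.optim.AdamW([
    {"params": [unembed], "lr": 0.008, "weight_decay": 0.28, "betas": (0.9, 0.95)},
    {"params": [other],   "lr": 0.004, "weight_decay": 0.2,  "betas": (0.8, 0.95)},
])
unembed.grad = torch.ones_like(unembed)
other.grad = torch.ones_like(other)
opt.step()
```
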
svlandeg
f8ff0439b9
two more small typos
2026-03-06 11:03:00 +01:00
suraj-self
1bce71e03d
Merge branch 'master' into fix-scaling-zero-division
2026-03-05 08:37:48 +05:30
Andrej Karpathy
1076f97059
delete autocast, an unnecessary thorn in my side, manage dtypes directly
2026-03-04 23:55:30 +00:00
Sofie Van Landeghem
752abc836e
Ensure that inputs and targets are contiguous ( #569 )
...
* call reshape instead of view in case the tensors are not contiguous
* fix directly in data loader instead
2026-03-04 13:58:27 -08:00
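The distinction behind this fix in one small example: `.view()` requires contiguous memory while `.reshape()` copies when needed, which is why the merged change makes the data loader emit contiguous inputs/targets so downstream `.view()` calls stay valid.

```python
import torch

x = torch.arange(6).reshape(2, 3).t()  # transposed -> non-contiguous (3, 2)
assert not x.is_contiguous()

y = x.reshape(-1)               # works: copies under the hood when needed
z = x.contiguous().view(-1)     # equivalent: make contiguous, then view
assert torch.equal(y, z)
```
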
Andrej Karpathy
4b4077425b
Document new Leaderboard entry; congrats @ddudek for pointing out ClimbMix. Time to GPT-2 is now 2.01 hours, down from 2.76 previously
2026-03-04 20:02:07 +00:00
Andrej Karpathy
324e69c45d
big, breaking change but large upside: swap the previous FineWeb-EDU dataset for the NVIDIA ClimbMix dataset. Requires people to download the new data shards. The upside is that training a GPT-2 capability model now takes only ~2 hours, down from 2.76 hours, so this is a huge win data-wise
2026-03-04 19:47:12 +00:00
Andrej Karpathy
b07604ebaa
document the legacy fineweb100b dataset and the new climbmix400b dataset
2026-03-03 17:24:31 +00:00
suraj-self
be723b7afb
Merge branch 'master' into fix-scaling-zero-division
2026-03-03 11:22:31 +05:30
Andrej Karpathy
aba30cb037
tune logit softcap?
2026-03-03 00:38:53 +00:00
Anish
83dccc20ae
Restore completion-only loss masking in SFT dataloader ( #582 )
...
* printing steps count
* adding reply only loss for chat
* using the mask by render_conversation function of tokeniser
* undoing some changes
* putting back the comment which got removed accidentally; no functionality change
2026-03-02 16:37:47 -08:00
suraj-self
58465e3bf5
fix: guard target-param-data-ratio against zero to avoid ZeroDivisionError
2026-02-28 13:10:50 +05:30
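The guard pattern from this fix, as a minimal sketch (function name and signature are hypothetical, assuming the ratio is ultimately used as part of a division when deriving the step count):

```python
def steps_from_ratio(num_params: int, tokens_per_step: int, target_ratio: float) -> int:
    """Derive the training step count from a target data:params ratio.

    Guard the ratio up front: a 0 (or negative) value would otherwise
    propagate into a ZeroDivisionError further down.
    """
    if target_ratio <= 0:
        raise ValueError(f"target_param_data_ratio must be positive; got {target_ratio!r}")
    target_tokens = num_params * target_ratio
    return max(1, round(target_tokens / tokens_per_step))
```
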
Dipesh Babu
c7ba252142
docs: fix typos in experiment log ( #547 )
2026-02-20 08:03:45 -08:00
Andrej Karpathy
2dffdc8cf6
document MoE exploration
2026-02-19 02:53:47 +00:00
Andrej Karpathy
48804bff3a
report negative result on fineweb dataset
2026-02-18 23:45:31 +00:00
Andrej Karpathy
bb5137860e
fix comment
2026-02-18 23:26:22 +00:00
Andrej Karpathy
458555117b
Merge branch 'Chetter2-patch-1'
2026-02-18 23:17:39 +00:00
Andrej Karpathy
bac5a35dd7
fix minor bug in fp8 application to skip tiny matmuls
2026-02-18 23:17:29 +00:00
George Shakan
ad55575326
Fix bug in setting precision ( #538 )
2026-02-18 15:49:18 +00:00
Sofie Van Landeghem
cac43e8511
Fix MockModel's device definition ( #535 )
...
* fix MockModel's device definition
* cleanup
2026-02-18 15:49:18 +00:00
Andrej Karpathy
f5fe7925ed
update dev log with recent changes
2026-02-18 15:49:18 +00:00
Andrej Karpathy
1415fb7617
tune the data mixture a bit, load optimizer by default when doing SFT. These were confirmed to be the best settings from SFT sweeps
2026-02-18 15:49:18 +00:00