Commit Graph

383 Commits

Author SHA1 Message Date
suraj-self
9f9ef95adc Merge branch 'master' into fix-batch-size-assertion 2026-03-26 08:26:25 +05:30
Andrej
7808dc7159
Merge pull request #595 from svlandeg/fix/typo
Small fixes
2026-03-25 14:40:25 -07:00
Andrej
a4ed96687b
Merge pull request #634 from 2bitbit/fix-docs-and-comments
fix: correct minor typos in help text, README, and comments
2026-03-25 14:31:49 -07:00
Andrej
7b70f6b411
Merge pull request #639 from mathieu-lacage/master
Verified: both paths return identical data (99,842 rows), and all splits under subset='all' load cleanly.
2026-03-25 14:29:30 -07:00
RoomWithOutRoof
47e983eea7
fix: use meta device in disable_fp8 to avoid VRAM spike (#616)
When swapping Float8Linear to Linear in disable_fp8 context manager,
using device=fp8_module.weight.device directly allocates new tensors
on GPU, causing unnecessary VRAM spike (~1GB for large models).

This fix uses device='meta' to avoid physical memory allocation,
then swaps in the weight tensor reference. This eliminates the
unnecessary VRAM spike during evaluation phase.

Fixes issue #592

Co-authored-by: RoomWithOutRoof <roomwithoutroof@sparklab.ai>
2026-03-25 14:24:57 -07:00
Andrej Karpathy
c0dbf1f3ff use COMPUTE_DTYPE-aware cast in Muon polar express step
The bf16 cast is intentional for speed on Hopper+ GPUs, but should be
skipped on other platforms rather than blindly applied. fp16 is unstable
here due to its limited exponent range, and fp32 platforms don't benefit
from the cast. Now: bf16 when COMPUTE_DTYPE is bf16, no cast otherwise.

Inspired by PR #667.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-25 20:19:14 +00:00
Andrej Karpathy
4e1694cc95 bunch of ideas tried from openai/parameter-golf, all negative results for nanochat 2026-03-24 22:13:13 +00:00
Andrej Karpathy
1cd94d768f bump D:N ratio to 12 per recent scaling laws re-run 2026-03-24 19:25:50 +00:00
Andrej Karpathy
c16db281ff fix small bug with params logging and batch size 2026-03-24 19:25:34 +00:00
suraj-self
92250671a5 Merge branch 'master' into fix-batch-size-assertion 2026-03-18 23:47:01 +05:30
svlandeg
dfe7d39ce8 Merge branch 'master' into fix/typo 2026-03-18 17:01:45 +01:00
Andrej Karpathy
5019accc5b fix scaling laws scripts after the bigram embeddings were removed 2026-03-17 16:55:56 +00:00
svlandeg
51f42a4406 ~1.5h :-) 2026-03-15 22:29:27 +01:00
svlandeg
1f9e42a855 two more typos, from PR 645 2026-03-15 22:27:18 +01:00
svlandeg
bd6e9c8d5f fix numbering 2026-03-15 22:18:18 +01:00
svlandeg
02e865c2ab Merge branch 'master' into fix/typo 2026-03-15 22:18:01 +01:00
suraj-self
daba23cbb5 Merge branch 'master' into fix-batch-size-assertion 2026-03-15 21:06:31 +05:30
Andrej Karpathy
1b1cc3c599 submit new time to GPT-2 leaderboard entry: 99 minutes 2026-03-14 17:15:01 +00:00
Andrej Karpathy
a825e63f81 Autoresearch round 2: smear, backout, and hyperparameter tuning
New architectural features:
- Smear: mix previous token embedding into current position via learned
  gate, providing cheap bigram-like info (works in training + KV cache)
- Backout: subtract learned fraction of mid-layer residual before logit
  projection to remove low-level features

Hyperparameter tuning:
- Muon momentum warmdown 0.97→0.90 during LR warmdown phase
- Non-uniform per-layer init: resid_lambdas 1.15→1.05, x0_lambdas 0.20→0.05
- c_fc init scale 0.4x, QK norm scale 1.2, sliding window seq_len/4
- Speedrun data:params ratio reduced to 8

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-14 17:03:06 +00:00
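
The smear feature in the commit above can be pictured with a small sketch. This is an illustration only, assuming one scalar gate per position passed through a sigmoid. The name `smear`, the gate parameterization, and the list-based vectors are hypothetical and are not taken from the repository.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def smear(embeddings, gate_logits):
    """Mix the previous token's embedding into each position.

    embeddings:  list of T vectors (lists of floats), one per token
    gate_logits: list of T floats; sigmoid(gate_logits[t]) is the mixing
                 weight at position t (assumed parameterization)
    """
    out = [list(embeddings[0])]  # position 0 has no predecessor, so it is copied as-is
    for t in range(1, len(embeddings)):
        g = sigmoid(gate_logits[t])
        prev, cur = embeddings[t - 1], embeddings[t]
        # add a gated copy of the previous (un-smeared) embedding
        out.append([c + g * p for c, p in zip(cur, prev)])
    return out

print(smear([[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0]))
```

Each position needs only the embedding of the token directly before it, which is consistent with the commit's note that the feature works with a KV cache.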
svlandeg
6405b26d24 Merge branch 'master' into fix/typo 2026-03-13 13:56:50 +01:00
svlandeg
1052d25d45 we only need to wait 2h now! 2026-03-13 13:46:16 +01:00
Mathieu Lacage
a641b6ca96 MMLU main split is named auxiliary_train, not train 2026-03-13 13:19:10 +01:00
2bitbit
2bb93b2ae4 fix: correct minor typos in help text, README, and comments 2026-03-12 17:03:26 +08:00
suraj-self
fc5a32a70e Merge branch 'master' into fix-batch-size-assertion 2026-03-11 21:42:04 +05:30
svlandeg
d96558bcb0 fix heading, cf #622 2026-03-10 09:57:30 +01:00
Andrej Karpathy
f068604948 new leaderboard entry coming from improvements of autoresearch round 1, time to gpt-2 from 2.02 hours to 1.80 hours 2026-03-10 06:26:39 +00:00
suraj-self
0e5403e7f6 Merge branch 'master' into fix-batch-size-assertion 2026-03-10 07:41:07 +05:30
Andrej Karpathy
6ed7d1d82c All of these improvements were developed by Claude running autonomously over ~2 days using autoresearch. I didn't touch anything - incredible. All tuning was done on d12 but generalized easily to larger models (e.g. d24 in particular). This means we will also get a new "Time to GPT-2" Leaderboard entry, which I will push separately.
Optimizer & schedule changes:
- Increase unembedding LR 0.004 -> 0.008, weight decay 0.2 -> 0.28
- Per-group Adam betas and weight decay (instead of shared global betas)
- Muon beta2 0.95 -> 0.9, momentum warmup target 0.95 -> 0.97 over 400 steps
- Warmup: ratio-based -> absolute steps (default 40)
- Warmdown ratio 0.5 -> 0.65, final LR fraction 0.0 -> 0.05
- Weight decay schedule: linear -> cosine decay
- Polar express norm factor 1.02 -> 1.01
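
A rough sketch of how the schedule settings listed above could fit together, assuming a linear warmup, a constant plateau, and a linear warmdown to the final LR fraction. The function names and the exact shape of each phase are assumptions for illustration; only the numeric defaults (40 warmup steps, warmdown ratio 0.65, final LR fraction 0.05) come from the commit message.

```python
import math

def lr_multiplier(step, num_steps, warmup_steps=40,
                  warmdown_ratio=0.65, final_lr_frac=0.05):
    """Multiplier applied to the base learning rate at a given step."""
    warmdown_steps = round(warmdown_ratio * num_steps)
    if step < warmup_steps:
        # absolute-step warmup: independent of the total run length
        return (step + 1) / warmup_steps
    if step <= num_steps - warmdown_steps:
        return 1.0
    # progress runs from 1.0 at the start of warmdown to 0.0 at the last step
    progress = (num_steps - step) / warmdown_steps
    return progress + (1.0 - progress) * final_lr_frac

def weight_decay_at(step, num_steps, base_wd):
    """Cosine decay of weight decay from base_wd to 0 (assumed form)."""
    return base_wd * 0.5 * (1.0 + math.cos(math.pi * step / num_steps))

for s in (0, 39, 100, 675, 1000):
    print(s, round(lr_multiplier(s, 1000), 4))
```

With `num_steps=1000` the warmdown covers the last 650 steps and the multiplier ends at 0.05 rather than 0.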

Architecture & init changes:
- VE gate: channels 32 -> 12, scale range 2x -> 3x, init small positive
- Add post-QK-norm scaling (q,k *= 1.15) for sharper attention
- Embedding init std 1.0 -> 0.8, MLP c_fc init 0.5x smaller
- RoPE base theta 10K -> 100K
- Short attention window: seq_len/2 -> ~seq_len/3 (ceil to 128 tile)
- Logit softcap 20 -> 15
2026-03-09 20:45:17 +00:00
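
Two of the architecture items in the commit above reduce to short formulas. The sketch below assumes the common tanh form of logit softcapping and reads "ceil to 128 tile" as rounding seq_len/3 up to the next multiple of 128. Both function names and both readings are assumptions, not the repository's code.

```python
import math

def softcap(logit: float, cap: float = 15.0) -> float:
    """Squash a logit smoothly into the open interval (-cap, cap)."""
    return cap * math.tanh(logit / cap)

def short_window(seq_len: int, divisor: int = 3, tile: int = 128) -> int:
    """Round seq_len / divisor up to a whole number of tiles."""
    tiles = -(-seq_len // (divisor * tile))  # integer ceiling division
    return tiles * tile

print(short_window(2048))
print(round(softcap(1.0), 4), softcap(1000.0))
```

For a sequence length of 2048, a third of the sequence is about 683 tokens, which rounds up to 768 under this reading. Small logits pass through the softcap almost unchanged, while large ones saturate near the cap.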
suraj-self
f2899a1b4a Extend informative assertion message to chat_sft.py for consistency 2026-03-08 16:30:53 +05:30
svlandeg
f8ff0439b9 two more small typos 2026-03-06 11:03:00 +01:00
suraj-self
28894e1262 Merge branch 'master' into fix-batch-size-assertion 2026-03-05 08:41:31 +05:30
Andrej Karpathy
1076f97059 delete autocast, an unnecessary thorn in my side, manage dtypes directly 2026-03-04 23:55:30 +00:00
Sofie Van Landeghem
752abc836e
Ensure that inputs and targets are contiguous (#569)
* call reshape instead of view in case the tensors are not contiguous

* fix directly in data loader instead
2026-03-04 13:58:27 -08:00
Andrej Karpathy
4b4077425b Document new Leaderboard entry congrats @ddudek for pointing out ClimbMix, time to GPT-2 is now 2.01 hours, down from 2.76 previously 2026-03-04 20:02:07 +00:00
Andrej Karpathy
324e69c45d big, breaking change but large upside: swap previous FineWeb-EDU dataset to NVIDIA ClimbMix dataset. Requires people to download the data shards. The upside is that training GPT-2 capability model now only takes ~2 hours, down from 2.76 hours, so this is a huge win data-wise 2026-03-04 19:47:12 +00:00
Andrej Karpathy
b07604ebaa document the legacy fineweb100b dataset and the new climbmix400b dataset 2026-03-03 17:24:31 +00:00
suraj-self
6e9ef8f565 Merge branch 'master' into fix-batch-size-assertion 2026-03-03 11:25:58 +05:30
Andrej Karpathy
aba30cb037 tune logit softcap? 2026-03-03 00:38:53 +00:00
Anish
83dccc20ae
Restore completion-only loss masking in SFT dataloader (#582)
* printing steps count

* adding reply only loss for chat

* using the mask by render_conversation function of tokeniser

* undoing some changes

* putting back the comment which got removed accidentally, no functionality change
2026-03-02 16:37:47 -08:00
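
Completion-only loss masking, as restored in the commit above, can be sketched as follows. This assumes the tokenizer's `render_conversation` yields a list of token ids plus a parallel 0/1 mask marking assistant tokens. That return shape, the `build_targets` helper, and the `-1` sentinel are assumptions for illustration.

```python
IGNORE_INDEX = -1  # assumed sentinel; the loss must be configured to skip it

def build_targets(ids, mask):
    """Build next-token inputs/targets, keeping loss only on assistant tokens.

    ids:  token ids of one rendered conversation
    mask: same length as ids, 1 where the token is part of an assistant reply
    """
    inputs = ids[:-1]
    # the target at position t is ids[t + 1]; keep it only if that token is masked in
    targets = [tok if m == 1 else IGNORE_INDEX
               for tok, m in zip(ids[1:], mask[1:])]
    return inputs, targets

inputs, targets = build_targets([10, 11, 12, 13, 14], [0, 0, 1, 1, 1])
print(inputs, targets)
```

Positions whose target is a user or system token receive the sentinel, so they contribute nothing to the loss while still serving as context.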
suraj-self
998b8f846b Simplify batch size assertion message 2026-02-21 08:43:25 +05:30
suraj-self
d489a1fa22 Merge remote-tracking branch 'upstream/master' into fix-batch-size-assertion 2026-02-21 08:30:41 +05:30
Dipesh Babu
c7ba252142
docs: fix typos in experiment log (#547) 2026-02-20 08:03:45 -08:00
Andrej Karpathy
2dffdc8cf6 document MoE exploration 2026-02-19 02:53:47 +00:00
Andrej Karpathy
48804bff3a report negative result on fineweb dataset 2026-02-18 23:45:31 +00:00
Andrej Karpathy
bb5137860e fix comment 2026-02-18 23:26:22 +00:00
Andrej Karpathy
458555117b Merge branch 'Chetter2-patch-1' 2026-02-18 23:17:39 +00:00
Andrej Karpathy
bac5a35dd7 fix minor bug in fp8 application to skip tiny matmuls 2026-02-18 23:17:29 +00:00
George Shakan
ad55575326 Fix bug in setting precision (#538) 2026-02-18 15:49:18 +00:00
Sofie Van Landeghem
cac43e8511 Fix MockModel's device definition (#535)
* fix MockModel's device definition

* cleanup
2026-02-18 15:49:18 +00:00
Andrej Karpathy
f5fe7925ed update dev log with recent 2026-02-18 15:49:18 +00:00