Andrej Karpathy
c16db281ff
fix small bug with params logging and batch size
2026-03-24 19:25:34 +00:00
Andrej Karpathy
5019accc5b
fix scaling laws scripts after the bigram embeddings were removed
2026-03-17 16:55:56 +00:00
Andrej Karpathy
a825e63f81
Autoresearch round 2: smear, backout, and hyperparameter tuning
...
New architectural features:
- Smear: mix previous token embedding into current position via learned
gate, providing cheap bigram-like info (works in training + KV cache)
- Backout: subtract learned fraction of mid-layer residual before logit
projection to remove low-level features
Hyperparameter tuning:
- Muon momentum warmdown 0.97→0.90 during LR warmdown phase
- Non-uniform per-layer init: resid_lambdas 1.15→1.05, x0_lambdas 0.20→0.05
- c_fc init scale 0.4x, QK norm scale 1.2, sliding window seq_len/4
- Speedrun data:params ratio reduced to 8
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-14 17:03:06 +00:00
Andrej Karpathy
324e69c45d
big, breaking change but large upside: swap previous FineWeb-EDU dataset to NVIDIA ClimbMix dataset. Requires people to download the data shards. The upside is that training GPT-2 capablity model now only takes ~2 hours, down from 2.76 hours, so this is a huge win data-wise
2026-03-04 19:47:12 +00:00
Andrej Karpathy
bb5137860e
fix comment
2026-02-18 23:26:22 +00:00
Andrej Karpathy
1ec0a34779
at 28 and above we start to need batch size 8
2026-02-08 18:26:34 +00:00
Andrej Karpathy
ff46300720
tune miniseries just a bit, fairly cosmetic, keep to even depths where the math works out nicely in model sizing
2026-02-08 17:54:12 +00:00
Andrej Karpathy
685271dc8d
new optimal ratio for d26 training
2026-02-06 19:21:27 +00:00
Andrej Karpathy
542beb0c8c
bump speedrun to be the up to date leaderboard run
2026-02-04 02:12:04 +00:00
Andrej Karpathy
b19b4f3e49
fix bug in speedrun script, batch size that doesn't OOM on 8XH100 for d24 is 16
2026-02-02 15:50:14 +00:00
Andrej Karpathy
0307997f9b
merge two files base_loss and base_eval into a single file, it's nicer this way, and unify the huggingface code associated with both
2026-02-01 02:36:43 +00:00
Andrej Karpathy
1ddaad1c1c
nuke midtraining from orbit, it's not as needed now that we have a BOS-aligned dataloader. Also change the README a lot. midtrianing is not yet fully properly erased across the board, but good enough for step 1
2026-01-31 19:12:25 +00:00
Andrej Karpathy
02baa15405
i am feeling in a delete mood today. i need to delete a lot of code. there is too much code and surface area and complexity. ew
2026-01-30 17:08:53 +00:00
Andrej Karpathy
067daa7758
small fix cpu script ty PR #474
2026-01-30 02:11:25 +00:00
Andrej Karpathy
c88bbf8133
Merge branch 'engram'
2026-01-27 22:33:16 +00:00
Andrej Karpathy
c8d93beed2
add engram-lite, add log, tune scaling laws analysis scripts
2026-01-27 22:31:17 +00:00
Andrej Karpathy
8630d32be4
quick fix to not OOM main speedrun script
2026-01-26 22:31:42 +00:00
Andrej Karpathy
63bb5831e2
something i've wanted to do for a while - move all .sh runs to their own directory so they don't pollute root dir
2026-01-18 15:27:41 +00:00