Yury Kirpichev
e4d7efe5ff
Merge 52f1a5ee5c into 348fbb301b
2026-01-31 13:26:35 -05:00
Andrej Karpathy
348fbb301b
fix dataloader for midtrain to never crop data. we can't just throw it away like we do in pretraining
2026-01-31 18:21:36 +00:00
Andrej Karpathy
3c3a3d7042
warmdown of 0.5 is slightly better:
2026-01-31 01:08:44 +00:00
Andrei Panferov
4d8dbaf6e0
Fix escape character in README bibtex entry (#454)
2026-01-30 09:34:02 -08:00
Andrej Karpathy
3ba42e8135
Fix SDPA KV-cache decode to respect sliding window (#456)
SDPA fallback now respects sliding window during single-token KV-cache
decode by slicing K/V to the last (window + 1) tokens.
Also simplifies the mask building for chunk inference to properly apply
sliding window in that path as well.
Fixes #452
Co-Authored-By: Kartik Vashishta <kartikv776@gmail.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-30 17:32:12 +00:00
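The K/V slicing described in the commit above can be sketched in plain Python. This is a minimal illustration, not the repository's code: the function name `decode_step_attention`, the list-based tensors, and the convention that `window <= 0` means "no sliding window" are all assumptions made for the example. It also assumes the current token's key and value are already appended to the cache, which is why the slice keeps `window + 1` entries.

```python
import math

def decode_step_attention(q, k_cache, v_cache, window):
    """Attend one query vector over the last (window + 1) cached tokens.

    q: list[float]; k_cache, v_cache: list of list[float], oldest first.
    Hypothetical convention: window <= 0 disables the sliding window.
    """
    if window > 0:
        # Keep the current token plus the `window` tokens before it.
        k_cache = k_cache[-(window + 1):]
        v_cache = v_cache[-(window + 1):]
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in k_cache]
    # Numerically stable softmax over the retained positions.
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    d_v = len(v_cache[0])
    return [sum(w * v[j] for w, v in zip(weights, v_cache)) / total
            for j in range(d_v)]
```

Slicing the cache before the softmax means tokens outside the window contribute nothing, so no mask is needed for the single-token case.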
Aarushi Singh
ace6740bdd
feat: allow top_k=0 in web api to disable filtering (#458)
* allow top_k=0 in web api to disable filtering
* adding a comment for clear reasoning
* adding change to docstring
2026-01-30 09:21:41 -08:00
Harsh Gupta
2e17723817
Fix generate() crash when top_k=0 (#467)
Prevent a crash in generate() by skipping top-k filtering when top_k is set to 0
2026-01-30 09:21:02 -08:00
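The behaviour described above, where `top_k=0` skips top-k filtering, can be sketched as follows. The helper name `apply_top_k` and the list-of-floats logits are hypothetical and chosen only to keep the example self-contained; the actual `generate()` implementation may differ.

```python
def apply_top_k(logits, top_k):
    """Mask everything below the k-th largest logit with -inf.

    Hypothetical helper: top_k of None or 0 (or negative) means
    "no filtering", so the logits are returned unchanged.
    """
    if top_k is None or top_k <= 0:
        return list(logits)
    k = min(top_k, len(logits))  # never ask for more entries than exist
    threshold = sorted(logits, reverse=True)[k - 1]
    # Ties at the threshold are all kept, so more than k entries may survive.
    return [x if x >= threshold else float("-inf") for x in logits]
```

Returning early for `top_k <= 0` avoids requesting the zeroth-largest element, which is the kind of edge case that can crash a top-k call.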
Andrej Karpathy
02baa15405
i am feeling in a delete mood today. i need to delete a lot of code. there is too much code and surface area and complexity. ew
2026-01-30 17:08:53 +00:00
Andrej Karpathy
d6c4f3b923
i think this is the new torch 2.9+ API for declaring tf32 preference
2026-01-30 17:03:15 +00:00
Andrej Karpathy
067daa7758
small fix cpu script ty PR #474
2026-01-30 02:11:25 +00:00
Andrej Karpathy
6a341f2ecf
contiguous views and single HtoD transfer for inputs/targets much cleaner
2026-01-30 00:23:01 +00:00
Andrej Karpathy
ebd4d9bbf5
tried muonh, appealing but didn't work out of the box
2026-01-29 19:01:36 +00:00
Andrej Karpathy
41bb2eac32
Combine AdamW and Muon into single MuonAdamW optimizer, cleaner, ty @chrisjmccormick for idea/help
2026-01-29 00:52:08 +00:00
Andrej Karpathy
64a651a63c
include .claude is ok
2026-01-29 00:35:02 +00:00
Andrej Karpathy
65df0de42b
add arxiv reading skill
2026-01-29 00:34:24 +00:00
Andrej Karpathy
74554be3b5
revert engram, not seeing an improvement at larger scale
2026-01-28 20:07:39 +00:00
Sofie Van Landeghem
d5418ea5a1
Fix link to DeepSeek Engram paper (#470)
* Fix link to DeepSeek Engram paper in LOG.md
Updated link to the DeepSeek Engram paper in the log.
* remove www
2026-01-28 08:31:44 -08:00
Andrej Karpathy
c88bbf8133
Merge branch 'engram'
2026-01-27 22:33:16 +00:00
Andrej Karpathy
c8d93beed2
add engram-lite, add log, tune scaling laws analysis scripts
2026-01-27 22:31:17 +00:00
Andrej Karpathy
8630d32be4
quick fix to not OOM main speedrun script
2026-01-26 22:31:42 +00:00
Andrej Karpathy
59e36cc727
first version of engram following modded nanogpt style
2026-01-25 18:59:51 +00:00
Andrej Karpathy
85b3e95e09
320 experiments just to tune the adam beta1 of x0 a little bit up from 0.8 to 0.96
2026-01-25 00:04:02 +00:00
xiayan0118
6a477eedbd
fix: pass device_type to compute_init in engine.__main__ (#451)
When running engine.py directly on non-GPU devices (CPU, MPS),
compute_init() needs the device_type parameter to initialize correctly.
This fixes failures on machines without CUDA support.
2026-01-19 17:19:51 -08:00
Yury Kirpichev
52f1a5ee5c
Add support for ROCm backend in speedrun script
2026-01-18 14:38:21 -08:00
Andrej Karpathy
63bb5831e2
something i've wanted to do for a while - move all .sh runs to their own directory so they don't pollute root dir
2026-01-18 15:27:41 +00:00
Andrej Karpathy
a91743c168
Merge branch 've'
2026-01-18 15:14:39 +00:00
Andrej Karpathy
d58fcd9d73
log for jan 17
2026-01-18 03:01:17 +00:00
Andrej Karpathy
babde18ce1
small tweaks
2026-01-18 03:00:38 +00:00
Andrej Karpathy
cf5c9e5b8e
resolve a crash for odd depths because FA3 needs head_dim % 8 == 0
2026-01-18 00:07:08 +00:00
Andrej Karpathy
413e91aa0f
optimal ratio is now around 4
2026-01-17 23:51:09 +00:00
Andrej Karpathy
e7ed2082b8
update the default GPTConfig kwargs otherwise they are confusing
2026-01-17 21:16:46 +00:00
karpathy
f9a7e0f111
update the CPU/MPS script to give reasonable results. The model can at least answer that Paris is the capital of France and knows that the sky is blue, for about 40 minutes of training on my macbook. Also fixed a bug that existed due to KVCache bfloat16 dtype assumption
2026-01-17 12:27:30 -08:00
Andrej Karpathy
f5425245f9
more GPU types from PR 147 thanks @Qubitium
2026-01-17 03:22:20 +00:00
Andrej Karpathy
2955650327
add detection of device to report more correct mfu for bf16
2026-01-17 03:16:14 +00:00
Yury Kirpichev
77a46902e4
Fix WANDB_RUN parameter passing in runcpu.sh (#407)
- Add --run=$WANDB_RUN to base_train, mid_train, and chat_sft calls
- Ensures wandb logging works when WANDB_RUN environment variable is set
- Matches the behavior in speedrun.sh
Co-authored-by: svlandeg <svlandeg@github.com>
2026-01-16 18:59:44 -08:00
Barış Özmen
bbc4413c58
Add high value engine tests for core invariants (33 LoC) (#396)
* test: add engine generation tests for expected invariants
- test_seed_reproducibility
- test_temperature_zero_determinism
- test_max_tokens_respected
- test_num_samples_count
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* Fix temperature test
* add test for seed variation in sampling
Add test for seed variation in sampling with temperature > 0.
* Rename test for clarity
* Shorten assert msg
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2026-01-16 18:59:12 -08:00
Nitish Pandey
f42ae9e901
fix condition to perform bpb evaluation (#324)
Co-authored-by: svlandeg <svlandeg@github.com>
2026-01-16 18:56:43 -08:00
Yamahammer
e1dafc510f
Reduce token waste in BOS bestfit by cropping shortest doc (#445)
When no document fits the remaining row space, crop the shortest
document in the buffer instead of the first. This minimizes
discarded tokens.
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-16 18:50:34 -08:00
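The cropping rule in the commit above can be illustrated with a small best-fit packer. This is a sketch under assumptions: `fill_row`, the list-of-token-lists buffer, and "largest document that fits" as the best-fit criterion are all hypothetical, and BOS handling is omitted entirely.

```python
def fill_row(buffer, row_len):
    """Pack documents from `buffer` into one row of exactly `row_len` tokens.

    buffer: list of token lists (mutated in place).
    Returns (row, discarded), where `discarded` counts cropped-off tokens.
    The row may come back short if the buffer runs out.
    """
    row = []
    discarded = 0
    while len(row) < row_len and buffer:
        remaining = row_len - len(row)
        fitting = [i for i, doc in enumerate(buffer) if len(doc) <= remaining]
        if fitting:
            # Best fit: take the largest document that still fits whole.
            i = max(fitting, key=lambda j: len(buffer[j]))
            row.extend(buffer.pop(i))
        else:
            # Nothing fits: crop the shortest document, not the first one,
            # so that as few tokens as possible are thrown away.
            i = min(range(len(buffer)), key=lambda j: len(buffer[j]))
            doc = buffer.pop(i)
            row.extend(doc[:remaining])
            discarded += len(doc) - remaining
    return row, discarded
```

With documents of length 10, 6, and 3 and a row of 8 tokens, this packs the 6-token document, then crops the 3-token one and discards a single token. Cropping the first document in the buffer would instead discard 8.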
Andrej Karpathy
6460dc6382
tweaks to readme a bit
2026-01-17 02:28:31 +00:00
Andrej Karpathy
1933e85046
brief update to log
2026-01-17 00:25:50 +00:00
Andrej Karpathy
3b95d4fd39
allow label for scaling laws script
2026-01-17 00:23:30 +00:00
Andrej Karpathy
e85db6b4a4
alternating design
2026-01-16 23:52:12 +00:00
Andrej Karpathy
9a88194c3f
simply one VE per layer, works best
2026-01-16 22:08:52 +00:00
Andrej Karpathy
0b58d70e99
full ve version works very well
2026-01-16 21:16:47 +00:00
Andrej Karpathy
e3f58b838e
ranked version
2026-01-16 20:59:42 +00:00
Andrej Karpathy
184d4c12b1
also add to log about the FA3 changes
2026-01-16 18:25:04 +00:00
Andrej Karpathy
b62a5bc44a
naturally i failed to include the actual code in the previous commit facepalm
2026-01-16 17:39:41 +00:00
Andrej Karpathy
8203efa919
implement flash attention 3 fallback to pytorch sdpa by touching as few lines of code as possible in main files and keeping all implementation to a single file. add tests. add helpful warning messages for the user.
2026-01-16 17:37:51 +00:00
Haoyu Wang
50413d2d67
typo in comments: change "GAPO" to "DAPO"
2026-01-15 22:03:42 -08:00
Andrej Karpathy
fbf2bbea25
update log with a bunch of attempts
2026-01-16 02:21:17 +00:00