VishalKrishnaKumar
176a0b01a8
Merge 89d2741cba into 63bb5831e2
2026-01-18 23:32:55 -08:00
Andrej Karpathy
63bb5831e2
something i've wanted to do for a while - move all .sh runs to their own directory so they don't pollute root dir
2026-01-18 15:27:41 +00:00
Andrej Karpathy
a91743c168
Merge branch 've'
2026-01-18 15:14:39 +00:00
Andrej Karpathy
d58fcd9d73
log for jan 17
2026-01-18 03:01:17 +00:00
Andrej Karpathy
babde18ce1
small tweaks
2026-01-18 03:00:38 +00:00
Andrej Karpathy
cf5c9e5b8e
resolve a crash for odd depths because FA3 needs head_dim % 8 == 0
2026-01-18 00:07:08 +00:00
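The constraint in the commit above can be sketched as a small guard. This is a hypothetical helper (name and signature are illustrative, not nanochat's actual code): FA3 kernels require the per-head dimension to be a multiple of 8, which certain model depths can violate when `model_dim` is derived from depth.

```python
def can_use_fa3(model_dim: int, n_heads: int) -> bool:
    # FA3 kernels require head_dim % 8 == 0; otherwise fall back to SDPA.
    # Hypothetical check, not the repo's actual implementation.
    head_dim = model_dim // n_heads
    return head_dim % 8 == 0
```

A caller would use this to pick the attention backend instead of crashing at kernel launch time.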
Andrej Karpathy
413e91aa0f
optimal ratio is now around 4
2026-01-17 23:51:09 +00:00
Andrej Karpathy
e7ed2082b8
update the default GPTConfig kwargs otherwise they are confusing
2026-01-17 21:16:46 +00:00
karpathy
f9a7e0f111
update the CPU/MPS script to give reasonable results. The model can at least answer that Paris is the capital of France and knows that the sky is blue, after about 40 minutes of training on my macbook. Also fixed a bug caused by the KVCache bfloat16 dtype assumption
2026-01-17 12:27:30 -08:00
Andrej Karpathy
f5425245f9
more GPU types from PR 147 thanks @Qubitium
2026-01-17 03:22:20 +00:00
Andrej Karpathy
2955650327
add device detection to report more correct mfu for bf16
2026-01-17 03:16:14 +00:00
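One way to sketch the device detection above: map the detected GPU name to a peak bf16 throughput and use it as the MFU denominator. The table below is illustrative (ballpark dense bf16 figures, not authoritative specs), and the function name is an assumption, not nanochat's API.

```python
# Illustrative peak bf16 TFLOPS per device family (assumed values for the sketch).
PEAK_BF16_TFLOPS = {"H100": 989, "A100": 312}

def peak_flops(device_name: str, default_tflops: float = 312) -> float:
    """Return peak bf16 FLOPS for the detected device, with a fallback so
    MFU is still reported on unknown hardware (sketch of the described idea)."""
    for key, tflops in PEAK_BF16_TFLOPS.items():
        if key in device_name:
            return tflops * 1e12
    return default_tflops * 1e12
```

MFU is then simply `achieved_flops / peak_flops(name)`.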
Yury Kirpichev
77a46902e4
Fix WANDB_RUN parameter passing in runcpu.sh (#407)
- Add --run=$WANDB_RUN to base_train, mid_train, and chat_sft calls
- Ensures wandb logging works when WANDB_RUN environment variable is set
- Matches the behavior in speedrun.sh
Co-authored-by: svlandeg <svlandeg@github.com>
2026-01-16 18:59:44 -08:00
Barış Özmen
bbc4413c58
Add high value engine tests for core invariants (33 LoC) (#396)
* test: add engine generation tests for expected invariants
- test_seed_reproducibility
- test_temperature_zero_determinism
- test_max_tokens_respected
- test_num_samples_count
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* Fix temperature test
* add test for seed variation in sampling
Add test for seed variation in sampling with temperature > 0.
* Rename test for clarity
* Shorten assert msg
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2026-01-16 18:59:12 -08:00
Nitish Pandey
f42ae9e901
fix condition to perform bpb evaluation (#324)
Co-authored-by: svlandeg <svlandeg@github.com>
2026-01-16 18:56:43 -08:00
Yamahammer
e1dafc510f
Reduce token waste in BOS bestfit by cropping shortest doc (#445)
When no document fits the remaining row space, crop the shortest
document in the buffer instead of the first. This minimizes
discarded tokens.
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-16 18:50:34 -08:00
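The strategy described in this commit body can be sketched as follows. This is a minimal illustration of the idea (function name and data layout are assumptions): fill a row best-fit from a buffer of tokenized docs, and when nothing fits the remaining space, crop the *shortest* buffered doc so the fewest tokens are discarded.

```python
def pack_row(buffer, row_space):
    """Fill one row of `row_space` tokens from `buffer` (a list of docs,
    each a list of token ids). Best-fit packing; when no doc fits, crop
    the shortest doc to minimize discarded tokens. Sketch only."""
    row = []
    while row_space > 0 and buffer:
        fitting = [d for d in buffer if len(d) <= row_space]
        if fitting:
            doc = max(fitting, key=len)  # best fit: largest doc that fits
            buffer.remove(doc)
            row.extend(doc)
            row_space -= len(doc)
        else:
            doc = min(buffer, key=len)   # crop the shortest, not the first
            buffer.remove(doc)
            row.extend(doc[:row_space])
            row_space = 0
    return row
```

Cropping the shortest doc discards `len(shortest) - row_space` tokens, which is minimal across the buffer.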
Andrej Karpathy
6460dc6382
tweaks to readme a bit
2026-01-17 02:28:31 +00:00
Andrej Karpathy
1933e85046
brief update to log
2026-01-17 00:25:50 +00:00
Andrej Karpathy
3b95d4fd39
allow label for scaling laws script
2026-01-17 00:23:30 +00:00
Andrej Karpathy
e85db6b4a4
alternating design
2026-01-16 23:52:12 +00:00
Andrej Karpathy
9a88194c3f
simply one VE per layer, works best
2026-01-16 22:08:52 +00:00
Andrej Karpathy
0b58d70e99
full ve version works very well
2026-01-16 21:16:47 +00:00
Andrej Karpathy
e3f58b838e
ranked version
2026-01-16 20:59:42 +00:00
Andrej Karpathy
184d4c12b1
also add to log about the FA3 changes
2026-01-16 18:25:04 +00:00
Andrej Karpathy
b62a5bc44a
naturally i failed to include the actual code in the previous commit facepalm
2026-01-16 17:39:41 +00:00
Andrej Karpathy
8203efa919
implement flash attention 3 fallback to pytorch sdpa by touching as few lines of code as possible in main files and keeping all implementation to a single file. add tests. add helpful warning messages for the user.
2026-01-16 17:37:51 +00:00
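The single-file fallback described above follows a familiar try-import shape. A minimal sketch, assuming the FA3 package import name (`flash_attn_interface` is an assumption here, as is the function name):

```python
import warnings

def get_attention_impl():
    """Try Flash Attention 3, fall back to PyTorch SDPA with a helpful
    warning if it is unavailable. Sketch of the described pattern, not
    nanochat's actual code."""
    try:
        import flash_attn_interface  # noqa: F401  (assumed FA3 package name)
        return "fa3"
    except ImportError:
        warnings.warn("Flash Attention 3 not available, falling back to PyTorch SDPA")
        return "sdpa"
```

Keeping the branch in one file means the main model code only asks which backend to call.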
Haoyu Wang
50413d2d67
typo in comments: change "GAPO" to "DAPO"
2026-01-15 22:03:42 -08:00
Andrej Karpathy
fbf2bbea25
update log with a bunch of attempts
2026-01-16 02:21:17 +00:00
Andrej Karpathy
747ed4491f
add negative result on olmo3 pretraining mix
2026-01-16 00:44:01 +00:00
Andrej Karpathy
7d1700c521
add zstd lib
2026-01-16 00:44:01 +00:00
Sofie Van Landeghem
d4ea28d4e2
Fix args in readme (#438)
* fix commands in readme, using new arg format
* fix typo
* add required -i flag to chat_eval example runs
2026-01-15 16:26:38 -08:00
Andrej Karpathy
bdcc030ffa
oops legacy spurious line now
2026-01-15 23:32:20 +00:00
Andrej Karpathy
22a71aa3d3
fuse adamw into a single torch compiled kernel similar to muon. it's about 1.7X faster, but overall it's so tiny that it's not making a major dent
2026-01-15 23:30:44 +00:00
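For reference, the update that such a fused kernel executes in a single pass is the standard AdamW step. A scalar sketch of the math (plain Python rather than the repo's torch.compile'd version; hyperparameter defaults here are illustrative):

```python
import math

def adamw_step(p, g, m, v, t, lr=1e-3, b1=0.9, b2=0.95, eps=1e-8, wd=0.1):
    """One scalar AdamW update: moment updates, bias correction, and the
    decoupled weight decay, all of which a fused kernel does in one pass."""
    m = b1 * m + (1 - b1) * g            # first moment (EMA of grad)
    v = b2 * v + (1 - b2) * g * g        # second moment (EMA of grad^2)
    mhat = m / (1 - b1 ** t)             # bias correction
    vhat = v / (1 - b2 ** t)
    p = p - lr * (mhat / (math.sqrt(vhat) + eps) + wd * p)
    return p, m, v
```

Fusing these five expressions into one compiled kernel avoids launching a separate CUDA kernel per line, which is where the ~1.7X comes from.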
Andrej Karpathy
255f8b9af6
cleanly separate cpu and gpu sections
2026-01-15 23:30:11 +00:00
svlandeg
89d2741cba
Merge branch 'master' into issue-183-nvshmem-install-fix
2026-01-15 20:21:15 +01:00
Andrej Karpathy
6bb92403d5
changes and optimizations to muon, making it more efficient and simpler/cleaner a bit
2026-01-15 03:20:48 +00:00
Andrej Karpathy
3142ca1a28
minor helpful message
2026-01-15 03:20:21 +00:00
Andrej Karpathy
7312ec9898
fix buggy midtrain and update all kwargs to be idiomatic. that is, argparse uses dashes, variables use underscores. the underscores are just a remnant of the previous Configurator object. This is the right way
2026-01-13 22:45:27 +00:00
Andrej Karpathy
3b50b77ed3
fix base_loss to report correct loss by switching the dataloader to the new default
2026-01-13 22:09:36 +00:00
Andrej Karpathy
f92efce169
add negative result about not allowing attention across BOS tokens. A lot more code complexity for basically no gain in performance
2026-01-13 21:33:54 +00:00
Andrej Karpathy
43c29dd9d5
Big DataLoader refactor: BOS-aligned dataloaders with epoch tracking for pre/mid-training
The new DataLoader ensures that every token sequence in train/val batches has a BOS token
at the beginning. Therefore, no token streams start abruptly in the middle of a document,
which could be confusing for the model. Note that this changes the loss scale because there
are fewer confusing tokens in the train/val batches. The main downside is that we now waste
about 35% of tokens due to cropping. This is ok because we have a lot of data. See dev/LOG.md
entry for this change for a lot more information.
2026-01-13 20:05:47 +00:00
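The BOS-alignment guarantee described above can be illustrated with a toy packer. This is a simplified sketch under assumptions (flat Python lists, a made-up `BOS` id, greedy rather than bestfit packing): every row starts at a document boundary, and a document that does not fit the remaining space is cropped, which is where the ~35% token waste comes from.

```python
BOS = 0  # illustrative token id

def bos_aligned_rows(docs, row_len):
    """Pack tokenized docs into rows of length `row_len`, each starting
    with BOS; docs that overflow the remaining space are cropped rather
    than continued into the next row. Sketch of the described scheme."""
    rows, row = [], []
    for doc in docs:
        seq = [BOS] + doc
        space = row_len - len(row)
        if len(seq) <= space:
            row.extend(seq)
        else:
            row.extend(seq[:space])  # crop: the remainder is discarded
            rows.append(row)
            row = []
        if len(row) == row_len:
            rows.append(row)
            row = []
    return rows
```

Because no sequence ever continues across a row boundary, the model never sees a stream that starts mid-document.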
Andrej Karpathy
23985413aa
adjust the comment on the regex pattern per recent experiment, see dev/LOG.md
2026-01-13 17:50:39 +00:00
Andrej Karpathy
64b48d0e5c
validated that \p{N}{1,2} is the correct number of digits to group up to in the regex pattern of the GPT-4 tokenizer (2 down from 3), leading to the best val_bpb for 32K vocabs
2026-01-13 17:45:06 +00:00
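To see what the `\p{N}{1,2}` change does in isolation: runs of digits are tokenized in chunks of at most two. A minimal stand-in using the stdlib (`re` lacks `\p{N}`, so ASCII `[0-9]` substitutes here; the real tokenizer pattern is much richer):

```python
import re

# \p{N}{1,2} groups digit runs into chunks of at most two digits;
# [0-9] stands in for \p{N} since stdlib re has no Unicode category classes.
digit_chunks = re.compile(r"[0-9]{1,2}")

chunks = digit_chunks.findall("12345")  # greedy, left-to-right, non-overlapping
```

So "12345" splits as 12|34|5, versus 123|45 under the old `{1,3}` grouping.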
Andrej Karpathy
238353c998
document my struggle with fp8 integration yesterday, it's not working like i thought it would and i suffered. one day i will return to continue the fight.
2026-01-13 17:14:29 +00:00
Andrej Karpathy
4610a838a1
record negative result on MTP
2026-01-12 05:23:47 +00:00
Andrej Karpathy
21608ec51e
allow base_loss to report the loss of any arbitrary huggingface model similar to base_eval. had to change dataloader to be a lot better and just take tokenizer, not load the nanochat one. much better this way anyway
2026-01-12 03:10:13 +00:00
Andrej Karpathy
aa95fb2e03
make miniseries more generic and easier to run and less hard coded
2026-01-12 02:54:35 +00:00
Andrej Karpathy
b33e394528
oops actually make SSSL the default window pattern
2026-01-11 21:50:35 +00:00
Andrej Karpathy
fbc1484e8c
add alternating window size patterns for the GPT layers, following GPT-3. Experimented a bit and found the pattern SSSL to work well - 3 short, 1 long alternating. This is now the new default and the plots look quite a bit better on flops vs. bpb
2026-01-11 21:49:54 +00:00
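The SSSL assignment above is just a repeating pattern over layers. A sketch under assumptions (function name, signature, and the concrete window sizes are illustrative, not nanochat's actual API):

```python
def window_sizes(n_layers, short, long_, pattern="SSSL"):
    """Assign each layer a sliding-window size by cycling `pattern`:
    'S' -> short window, 'L' -> long/full context. Illustrative sketch
    of the alternating scheme, following GPT-3."""
    return [short if pattern[i % len(pattern)] == "S" else long_
            for i in range(n_layers)]
```

With SSSL, every fourth layer attends over the long window while the rest stay cheap.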
Andrej Karpathy
2ff7d51252
integrate Flash Attention 3. +9% tok_per_sec for d12 even with ctx as low as 2048, out of the box, nice. also, ready to tune windows, huge
2026-01-11 20:33:19 +00:00
Andrej Karpathy
201d705957
recover the ability to load old checkpoints by patching the lambdas if they don't exist in checkpoints
2026-01-11 20:13:12 +00:00