Andrej Karpathy
da8b7ea4cb
also delete the rustbpe test code, this now lives in rustbpe repo that is separate
2026-01-04 01:23:34 +00:00
Andrej Karpathy
aa42f40e66
delete the inline rustbpe project. it was ugly to have a project within a project, and rustbpe is now nicely a separate repo on my github karpathy/rustbpe and it's on pypi etc., so we just add it as a dependency via uv. i think it is appropriate that this is a separate repo because 1) it doesn't have too many knobs, other than the ones that are exposed - the regex pattern and vocab size, and 2) all of its complexity is not algorithmic (it's equivalent to minbpe); instead it is efficiency-related, so it is ok to hide, relatively speaking
2026-01-03 23:55:28 +00:00
Andrej Karpathy
48abd7d85f
simplify, clarify and slightly tune model initialization. should be very slightly better possibly, but certainly a lot clearer
2026-01-01 21:15:09 +00:00
Paweł Krefta
10231dfb40
Fix conversation scroll to bottom on some browsers + remove duplicated padding (#348)
2025-12-31 13:03:22 -08:00
helloaidank
389d019a0b
small change to doc string at top of tok_train.py (#402)
2025-12-31 12:57:26 -08:00
Hossein-Lakzaei
8c89661465
Update README to match current d34 demo (#314) (#381)
...
* Update README: switch hosted model description from d32 to d34 per discussion #314
* link to discussion thread
* parameter in quotes
---------
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2025-12-30 10:17:11 +01:00
Andrej Karpathy
8f979a8bda
fix: sample first token independently for each row in multi-sample generation
...
Previously, when generating multiple samples (num_samples > 1), the first
token after prefill was sampled once and broadcast to all rows, causing
all samples to start identically. Now the prefill logits are expanded to
num_samples and sampled independently for each row.
Also simplified the generation loop by moving the forward pass to the end
of the loop, eliminating the first_iteration flag and if/else branching.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-28 04:52:13 +00:00
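The fix above can be sketched in plain Python (nanochat itself samples from torch logits; the function names and the uniform toy logits here are illustrative):

```python
import math
import random

def sample_from_logits(logits, rng):
    # softmax over the logits, then draw one token index
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return rng.choices(range(len(logits)), weights=[e / total for e in exps], k=1)[0]

def first_tokens(prefill_logits, num_samples, rng):
    # buggy version: sample once and broadcast, so every row starts identically:
    #   tok = sample_from_logits(prefill_logits, rng)
    #   return [tok] * num_samples
    # fixed version: expand the prefill logits to num_samples rows and
    # sample each row independently
    return [sample_from_logits(prefill_logits, rng) for _ in range(num_samples)]

rng = random.Random(0)
toks = first_tokens([0.0, 0.0, 0.0, 0.0], 8, rng)
```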
Dipesh Babu
2f2d7ab80c
fix: safe DDP cleanup (check initialized PG, not just env) (#256)
2025-12-27 20:27:40 -08:00
Andrej Karpathy
91d76cc690
Replace speedup assertion with warning in batch_encode test
...
Performance varies by machine and load, making hard assertions flaky.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-28 04:10:49 +00:00
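Replacing the hard speedup assertion with a warning might look like this (the function name and the expected-speedup threshold are illustrative, not the repo's actual test):

```python
import warnings

def check_speedup(t_single, t_batch, expect=3.0):
    # performance varies by machine and load, so a hard assert is flaky;
    # emit a warning instead of failing the test when the speedup falls short
    speedup = t_single / t_batch
    if speedup < expect:
        warnings.warn(f"batch_encode speedup only {speedup:.1f}x (expected ~{expect}x)")
    return speedup

# e.g. sequential encode took 3.0s, batch_encode took 1.0s
check_speedup(3.0, 1.0)
```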
Andrej
7a8769a40c
Merge pull request #383 from barisozmen/master
...
3x faster rust encode (`batch_encode`) (12 LoC + 2 tests)
2025-12-27 20:06:57 -08:00
Andrej
088726aa7d
clean up model_tag handling across scripts a bit more.
2025-12-27 20:01:09 -08:00
Andrej Karpathy
2874eda59a
update to new os env var to get rid of deprecation warning
2025-12-28 03:32:46 +00:00
Andrej Karpathy
e1770a3061
remove spurious cast, gets compiled away anyway but it's confusing people
2025-12-27 23:07:48 +00:00
Andrej Karpathy
49389ecaa8
fix tf32 warning for deprecated api use
2025-12-27 22:03:06 +00:00
DU Wenjie
ea4229851b
bugfix
2025-12-26 19:02:12 +08:00
DU Wenjie
7840049189
bugfix: keep the same args style in scripts/base_eval.py
2025-12-26 17:29:08 +08:00
Andrej
bc51da8bac
pad vocab size to 64 for DDP optimizers and efficiency
2025-12-23 09:13:31 -08:00
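Padding the vocab size up to the next multiple of 64 is simple ceiling arithmetic; a minimal sketch (the function name is hypothetical):

```python
def pad_vocab_size(vocab_size, multiple=64):
    # round up to the next multiple so the embedding/unembedding shapes
    # divide evenly, which keeps DDP optimizer sharding and GPU kernels happy
    return ((vocab_size + multiple - 1) // multiple) * multiple
```

For example, a GPT-2-style vocab of 50257 pads up to 50304.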
duwenjie
92c6654b95
bugfix save and load ckpt from model_tag dir
2025-12-21 15:07:04 +08:00
Barış Özmen
790f3be65c
add rust batch encode as a faster option over encode
2025-12-18 19:17:59 +03:00
Matěj Kripner
d314e96aa2
formatting
2025-12-09 12:48:46 +01:00
Matěj Kripner
bbc57da7d5
slightly nicer error message
2025-12-09 12:46:48 +01:00
Matěj Kripner
f1bf69d562
feat: pad vocab size to 64 for DDP optimizers and efficiency
2025-12-09 12:38:18 +01:00
Andrej
d5759400f9
fixing two typos in comments
2025-12-08 20:03:08 -08:00
Andrej
e72c3299df
fix random.seed() footgun bug for SpellingBee data generation
2025-12-08 19:58:45 -08:00
Andrej
7931e0903a
rename checkpoint_dir to checkpoints_dir for consistency.
2025-12-08 18:32:12 -08:00
Andrej
849d95ae1f
remove unnecessary check to make the logic in CausalSelfAttention.forward() clearer
2025-12-08 18:30:37 -08:00
Andrej
39cccc527f
small bugfix: make mid_train script work even with a tiny number of iterations
2025-12-08 18:27:32 -08:00
Andrej
8b1cecaa95
Apply suggestion from @svlandeg for nicer looking comparison
...
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2025-12-08 18:27:06 -08:00
Andrej
58f3e84e01
clean up train/val loader in sft for consistency with mid/base
2025-12-08 18:23:57 -08:00
Andrej
1b2a675c88
Improve KV cache code readability
2025-12-08 18:19:05 -08:00
Andrej
d75e6ed711
Fix script comment to reference correct file
2025-12-08 18:16:42 -08:00
Andrej
72a7cf2bc4
Fix distributed Parquet dataloader resume for multi-epoch training
2025-12-08 18:15:02 -08:00
Andrej Karpathy
bffdb2ef91
group common code to make things neater in gpt logit computation
2025-12-09 02:01:05 +00:00
Andrej
cbf30c842c
apply float32 cast before logits softcapping so the tanh is in fp32. torch compile fuses this correctly with no extra memory costs.
2025-12-08 14:17:43 -08:00
Andrej Karpathy
90442de35f
fix bug where any rank has to be able to create checkpoint_dir if saving optim
2025-12-08 20:45:19 +00:00
Andrej
2fd0440355
fix: missing val_bpb on resume
2025-12-08 12:35:08 -08:00
sunyujun03
01ea71be39
Fix distributed Parquet dataloader resume for multi-epoch training
2025-12-08 00:10:19 -06:00
KimYeongHyeon
a8847a0f83
Fix script comment to reference correct file
2025-12-02 10:46:20 +09:00
deepbuilder
06677c30e0
Refactor dimension validation for KV cache
2025-11-28 15:22:18 -05:00
deepbuilder
a770dcef2e
Fix kv_cache indexing to explicitly include head dimension
2025-11-28 15:00:14 -05:00
spjosyula
16788eed3c
fix(model): apply float32 cast before logits softcapping
...
This change ensures that the logits softcapping operation (tanh) is performed in float32 precision rather than bfloat16. Previously, the code cast to float32 after the tanh operation, which meant the non-linearity was computed in bfloat16 precision.
2025-11-23 20:12:09 +05:30
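The precision issue can be demonstrated without torch by approximating bfloat16 as a float32 with its low 16 bits zeroed (the softcap value and logit below are made up, and truncating only the tanh input is a rough stand-in for full bfloat16 arithmetic):

```python
import math
import struct

def to_bf16(x):
    # simulate bfloat16: keep only the top 16 bits of the float32 representation
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    (y,) = struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))
    return y

cap = 15.0          # softcap value (illustrative)
logit = 7.3456789   # raw logit (illustrative)

# old order: tanh sees a bfloat16-precision input, cast to fp32 only afterwards
before_fix = cap * math.tanh(to_bf16(logit) / cap)
# new order: cast to float32 first, so the tanh non-linearity runs in full precision
after_fix = cap * math.tanh(logit / cap)
```

The two results differ, which is exactly the drift the reordering removes.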
Sanzo00
53b3a4fb81
fix: missing val_bpb on resume
2025-11-22 11:04:20 +08:00
svlandeg
4bcc3bb698
clarify comment
2025-11-21 13:19:45 +01:00
Eric Silberstein
f37d45c21f
remove unneeded iter()
2025-11-20 15:14:56 -05:00
Eric Silberstein
5c93a56be5
remove unnecessary check
2025-11-19 16:31:41 -05:00
Eric Silberstein
dddb95caac
make mid_train script work even with a tiny number of iterations
2025-11-19 15:52:20 -05:00
Eric Silberstein
a4a0959c73
renamed find_largest_model() argument checkpoint_dir to checkpoints_dir for clarity
2025-11-19 15:33:36 -05:00
Eric Silberstein
024781f9df
fixing two typos in comments
2025-11-19 15:12:53 -05:00
Eric Silberstein
97770700f2
change test/train split approach because random.seed(1) and random.seed(-1) do the same thing
2025-11-19 14:51:02 -05:00
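The footgun here: CPython's `random.seed()` takes the absolute value of an integer seed, so a split keyed on the sign of the seed silently collapses to one stream. A quick demonstration:

```python
import random

random.seed(1)
a = random.random()
random.seed(-1)
b = random.random()
# identical streams: integer seeds are reduced to their absolute value,
# so seed(1) and seed(-1) initialize the generator the same way
assert a == b
```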
Andrej
4a87a0d19f
Merge pull request #299 from samjabrahams/rotary_embedding_head_dim_comment_cleanup
...
Fix comment: rotary embeddings final dimension size
2025-11-17 13:29:21 -08:00