Andrej Karpathy
63bb5831e2
something i've wanted to do for a while - move all .sh runs to their own directory so they don't pollute root dir
2026-01-18 15:27:41 +00:00
Andrej Karpathy
d58fcd9d73
log for jan 17
2026-01-18 03:01:17 +00:00
karpathy
f9a7e0f111
update the CPU/MPS script to give reasonable results. The model can at least answer that Paris is the capital of France and knows that the sky is blue, after about 40 minutes of training on my macbook. Also fixed a bug caused by a bfloat16 dtype assumption in the KVCache
2026-01-17 12:27:30 -08:00
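The dtype fix above suggests a simple pattern. Below is a minimal sketch, assuming a lazily allocated cache; the class and field names are illustrative, not nanochat's actual KVCache. The point: take the dtype and device from the incoming keys instead of hard-coding bfloat16, which is what breaks on CPU/MPS.

```python
# Minimal sketch (assumed names, not nanochat's actual KVCache): allocate the
# cache with the dtype/device of the first keys seen, not a hard-coded bfloat16.
import torch

class KVCache:
    def __init__(self, batch, heads, max_seq, head_dim):
        self.shape = (2, batch, heads, max_seq, head_dim)
        self.cache = None  # allocated lazily, once the real dtype is known
        self.pos = 0

    def insert(self, k, v):
        if self.cache is None:
            # previously: dtype=torch.bfloat16 hard-coded, which breaks on
            # CPU/MPS where the model runs in float32/float16
            self.cache = torch.empty(self.shape, dtype=k.dtype, device=k.device)
        t = k.size(2)  # k, v: (batch, heads, t, head_dim)
        self.cache[0, :, :, self.pos:self.pos + t] = k
        self.cache[1, :, :, self.pos:self.pos + t] = v
        self.pos += t
        return self.cache[0, :, :, :self.pos], self.cache[1, :, :, :self.pos]
```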
Yury Kirpichev
77a46902e4
Fix WANDB_RUN parameter passing in runcpu.sh ( #407 )
- Add --run=$WANDB_RUN to base_train, mid_train, and chat_sft calls
- Ensures wandb logging works when WANDB_RUN environment variable is set
- Matches the behavior in speedrun.sh
Co-authored-by: svlandeg <svlandeg@github.com>
2026-01-16 18:59:44 -08:00
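On the Python side, the plumbing this fix relies on presumably looks something like the hedged sketch below; the --run flag name matches the commit, but the "dummy" sentinel and the project name are assumptions for illustration, not nanochat's exact code.

```python
# Hedged sketch of the receiving end of --run=$WANDB_RUN; the "dummy" default
# and project name are assumptions for illustration.
import argparse
import wandb

parser = argparse.ArgumentParser()
parser.add_argument("--run", type=str, default="dummy", help="wandb run name")
args = parser.parse_args()

# only log to wandb when an actual run name was passed through
if args.run != "dummy":
    wandb.init(project="nanochat", name=args.run)
```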
Andrej Karpathy
1933e85046
brief update to log
2026-01-17 00:25:50 +00:00
Andrej Karpathy
184d4c12b1
also add to log about the FA3 changes
2026-01-16 18:25:04 +00:00
Andrej Karpathy
fbf2bbea25
update log with a bunch of attempts
2026-01-16 02:21:17 +00:00
Andrej Karpathy
747ed4491f
add negative result on olmo3 pretraining mix
2026-01-16 00:44:01 +00:00
Andrej Karpathy
7312ec9898
fix buggy midtrain and update all kwargs to be idiomatic. that is, argparse uses dashes, variables use underscores. the underscores are just a remnant of the previous Configurator object. This is the right way
2026-01-13 22:45:27 +00:00
Andrej Karpathy
f92efce169
add negative result about not allowing attention across BOS tokens. A lot more code complexity for basically no gain in performance
2026-01-13 21:33:54 +00:00
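For concreteness, a generic reconstruction of the rejected idea, not the code that was actually tried: a causal mask that additionally forbids attention across BOS-delimited document boundaries.

```python
# Sketch of the rejected idea: tokens may only attend within their own
# BOS-delimited document (plus the usual causal constraint).
import torch

def doc_causal_mask(tokens, bos_id):
    T = tokens.size(0)
    # doc_id[t] = number of BOS tokens at or before position t, so all
    # positions within one document share the same id
    doc_id = (tokens == bos_id).cumsum(0)
    causal = torch.tril(torch.ones(T, T, dtype=torch.bool))
    same_doc = doc_id[:, None] == doc_id[None, :]
    return causal & same_doc  # True where attention is allowed

tokens = torch.tensor([1, 5, 6, 1, 7, 8])  # 1 = BOS, two short documents
print(doc_causal_mask(tokens, bos_id=1))
```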
Andrej Karpathy
43c29dd9d5
Big DataLoader refactor: BOS-aligned dataloaders with epoch tracking for pre/mid-training
The new DataLoader ensures that every token sequence in train/val batches has a BOS token
at the beginning. Therefore, no token streams start abruptly in the middle of a document,
which could be confusing for the model. Note that this changes the loss scale because there
are fewer confusing tokens in the train/val batches. The main downside is that we now waste
about 35% of tokens due to cropping. This is ok because we have a lot of data. See dev/LOG.md
entry for this change for a lot more information.
2026-01-13 20:05:47 +00:00
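A minimal sketch of BOS-aligned batching, under the assumption of a flat token stream (a plain list of ids) long enough to fill the batch, with illustrative names: every row starts at a BOS, and the tokens skipped between the end of one row and the next BOS are where the roughly 35% cropping waste comes from.

```python
# Minimal sketch of BOS-aligned batching; names and structure are
# illustrative, not nanochat's actual DataLoader.
import torch

def bos_aligned_batch(stream, bos_id, B, T):
    rows, i = [], 0
    while len(rows) < B:
        while stream[i] != bos_id:  # advance to the next BOS so no row starts
            i += 1                  # mid-document; these skipped tokens are
                                    # the cropping waste
        rows.append(torch.tensor(stream[i:i + T + 1]))  # +1 for shifted target
        i += T + 1
    x = torch.stack([r[:-1] for r in rows])  # inputs
    y = torch.stack([r[1:] for r in rows])   # next-token targets
    return x, y
```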
Andrej Karpathy
64b48d0e5c
validated that \p{N}{1,2} is the correct number of digits to group up to in the regex pattern of the GPT-4 tokenizer (2 down from 3), leading to the best val_bpb for 32K vocabs
2026-01-13 17:45:06 +00:00
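The effect of the change is easy to see with the third-party `regex` module (which, unlike `re`, supports \p{N}): with {1,2}, pre-tokenization splits digit runs into chunks of at most two.

```python
# Demonstration of the digit-grouping change in the pre-tokenization pattern.
import regex

print(regex.findall(r"\p{N}{1,3}", "1234567"))  # ['123', '456', '7']   (old, GPT-4 style)
print(regex.findall(r"\p{N}{1,2}", "1234567"))  # ['12', '34', '56', '7'] (new default)
```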
Andrej Karpathy
238353c998
document my struggle with fp8 integration yesterday, it's not working like i thought it would and i suffered. one day i will return to continue the fight.
2026-01-13 17:14:29 +00:00
Andrej Karpathy
4610a838a1
record negative result on MTP
2026-01-12 05:23:47 +00:00
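MTP here is multi-token prediction. As an assumption-level sketch of the kind of setup such an experiment typically involves (names, shapes, and the loss weight are illustrative; the commit records only that it did not help): an auxiliary head predicts the token two positions ahead, and its loss is added with a small weight.

```python
# Assumption-level sketch of a multi-token prediction (MTP) loss; all names
# are illustrative.
import torch.nn.functional as F

def mtp_loss(hidden, lm_head, mtp_head, tokens, alpha=0.3):
    # standard next-token loss: hidden[t] predicts tokens[t+1]
    next_logits = lm_head(hidden[:, :-1])                     # (B, T-1, vocab)
    loss_next = F.cross_entropy(next_logits.transpose(1, 2), tokens[:, 1:])
    # auxiliary MTP loss: hidden[t] also predicts tokens[t+2]
    mtp_logits = mtp_head(hidden[:, :-2])                     # (B, T-2, vocab)
    loss_mtp = F.cross_entropy(mtp_logits.transpose(1, 2), tokens[:, 2:])
    return loss_next + alpha * loss_mtp
```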
Andrej Karpathy
fbc1484e8c
add alternating window size patterns for the GPT layers, following GPT-3. Experimented a bit and found the pattern SSSL to work well - 3 short, 1 long alternating. This is now the new default and the plots look quite a bit better on flops vs. bpb
2026-01-11 21:49:54 +00:00
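A sketch of how the SSSL pattern might expand over the layer stack; the concrete window sizes below are placeholders, only the 3-short-1-long alternation comes from the commit.

```python
# Expand a repeating window pattern (e.g. "SSSL") over n_layer layers.
def layer_windows(n_layer, short=1024, long=4096, pattern="SSSL"):
    return [short if pattern[i % len(pattern)] == "S" else long
            for i in range(n_layer)]

print(layer_windows(12))  # [1024, 1024, 1024, 4096, ...] for a d12 model
```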
Andrej Karpathy
2ff7d51252
integrate Flash Attention 3. +9% tok_per_sec for d12 out of the box, even with ctx as low as 2048, nice. also, we are now ready to tune window sizes, which is huge
2026-01-11 20:33:19 +00:00
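A hedged sketch of an FA3 call, assuming the flash_attn_interface entry point that ships with the Flash Attention 3 (Hopper) build; the exact signature and return value vary by version, so treat this as an assumption rather than nanochat's actual integration.

```python
# Hedged FA3 sketch; signature details may differ across FA3 builds.
import torch
from flash_attn_interface import flash_attn_func

# (batch, seqlen, heads, head_dim), bf16, on a Hopper GPU
q = k = v = torch.randn(1, 2048, 12, 64, dtype=torch.bfloat16, device="cuda")
out = flash_attn_func(q, k, v, causal=True, window_size=(1024, 0))  # sliding window
# NOTE: some builds return an (out, lse) tuple instead of just the output
```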
Andrej Karpathy
aa530cdad5
Add learnable lambdas that gate the residual connection and a skip connection to the input embeddings, solid bump to val_bpb
2026-01-11 18:47:35 +00:00
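An illustrative sketch of the two learnable scalars described, with assumed names and initializations in the spirit of modded-nanogpt: one gates the residual stream, the other mixes the original input embeddings back in at every layer.

```python
# Sketch (assumed names/inits): learnable gates on the residual stream and on
# a skip connection from the input embeddings x0.
import torch
import torch.nn as nn

class GatedBlock(nn.Module):
    def __init__(self, block):
        super().__init__()
        self.block = block
        self.resid_lambda = nn.Parameter(torch.ones(1))   # gates the residual
        self.embed_lambda = nn.Parameter(torch.zeros(1))  # gates the x0 skip

    def forward(self, x, x0):
        # x0 is the input embedding stream, re-injected at every layer
        x = self.resid_lambda * x + self.embed_lambda * x0
        return x + self.block(x)
```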
Andrej Karpathy
2c4473dd1b
Big Muon optimizer changes inspired by the latest modded-nanogpt. Added Polar Express, Adafactor-style variance reduction, cautious weight decay, and a linear schedule that ramps weight decay down to zero. Tuned the optimum weight decay for multiple model sizes (d8, d12, d16, d20) and found a scaling law with optimum wd \propto 1/channels^2, now included as the default in the code. --weight_decay of base_train is now on by default and configured optimally according to all of these experiments. Solid bump to val_bpb observed as a result of these changes.
2026-01-11 16:56:59 +00:00
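The scaling law lends itself to a one-liner default. The sketch below assumes a single fitted constant; the reference values and the channel counts are placeholders, not the tuned numbers from these experiments.

```python
# Sketch of a scaling-law default: optimum wd ~ 1/channels^2, so rescale a
# value tuned at one width to another. Reference numbers are placeholders.
def default_weight_decay(channels, ref_channels=768, ref_wd=0.1):
    return ref_wd * (ref_channels / channels) ** 2

for name, ch in [("d8", 512), ("d12", 768), ("d16", 1024), ("d20", 1280)]:
    print(name, round(default_weight_decay(ch), 4))
```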
Sofie Van Landeghem
a1ccb3dc0b
remove rust compilation as rustbpe is now installed from separate package ( #416 )
2026-01-08 06:18:37 -08:00
Andrej Karpathy
061f83c152
delete grad_clip. appears to not be necessary at all. not only was it buggy because the clipping happened per gpu before grad synchronization, it also cost ~2% MFU and didn't even help. I tried deleting it a while ago and back then it did help, so I'm guessing that some hyperparameter tuning since then has obviated the reason for it
2026-01-08 02:16:50 +00:00
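The per-GPU clipping bug is easy to demonstrate in isolation: clipping each rank's gradient before averaging is not equivalent to clipping the averaged (synchronized) gradient. A toy two-rank example:

```python
# Toy demonstration: per-rank clipping before averaging vs. clipping the
# synchronized average give different results.
import torch

def clip_to_unit_norm(g, max_norm=1.0):
    n = g.norm()
    return g * (max_norm / n) if n > max_norm else g

g_rank0 = torch.tensor([3.0, 0.0])
g_rank1 = torch.tensor([0.0, 0.5])

buggy = (clip_to_unit_norm(g_rank0) + clip_to_unit_norm(g_rank1)) / 2
correct = clip_to_unit_norm((g_rank0 + g_rank1) / 2)
print(buggy)    # tensor([0.5000, 0.2500])
print(correct)  # tensor([0.9864, 0.1644])
```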
Andrej Karpathy
e8c30c3b19
add notebook used for scaling laws analysis
2026-01-07 22:28:53 +00:00
Andrej Karpathy
54e59c38ad
add notebook on deriving the CORE estimates for the GPT-3 miniseries.
2026-01-05 18:40:28 +00:00
Andrej Karpathy
ed2082fbc4
sane secrets management
2026-01-04 19:29:22 +00:00
svlandeg
2ce62ec076
ensure consistency of quotes within each statement
2025-11-03 21:52:02 +01:00
svlandeg
e22fc6f2fa
few more explicit UTF-8 encodings
2025-11-03 21:46:39 +01:00
Andrej
b6da6982f6
fix nanochat logo: the t was placed too far to the right
2025-11-02 08:17:00 -08:00
Andrej Karpathy
cf587acb1a
move eval bundle download to be lazy and inside the python code so that we can substantially simplify the run bash scripts
2025-11-01 16:04:38 +00:00
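The lazy-download pattern described here is roughly the following; the URL, cache path, and function name are placeholders, not nanochat's actual values.

```python
# Rough shape of a lazy download: fetch the eval bundle only on first use,
# so the run scripts need no explicit download step. All values placeholder.
import os
import urllib.request

EVAL_BUNDLE_URL = "https://example.com/eval_bundle.zip"  # placeholder URL

def get_eval_bundle(cache_dir="~/.cache/nanochat"):
    cache_dir = os.path.expanduser(cache_dir)
    path = os.path.join(cache_dir, "eval_bundle.zip")
    if not os.path.exists(path):  # fetch lazily, only on first use
        os.makedirs(cache_dir, exist_ok=True)
        urllib.request.urlretrieve(EVAL_BUNDLE_URL, path)
    return path
```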
svlandeg
b996131570
Merge branch 'master' into logo/kerning-update
2025-10-29 11:45:40 +01:00
Andrej
a1de1f46ad
Merge pull request #156 from tlepoint/fix/export-base-dir
Export the base dir variable in runcpu.sh
2025-10-28 15:19:08 -07:00
svlandeg
8c9b004c99
typo fixes in scripts
2025-10-28 20:17:31 +01:00
Tancrède Lepoint
d5cda11ab8
Export the base dir variable
2025-10-22 18:15:02 -04:00
Luke Stanley
901b075605
Fix GPU-less CPU use on Linux with specific Torch indexes
2025-10-21 23:14:16 +00:00
Andrej Karpathy
94ee507054
quick fix base eval due to fewshot requirement
2025-10-21 17:56:08 +00:00
Andrej Karpathy
5bdc99abfb
merge and resolve conflict
2025-10-21 17:19:10 +00:00
Andrej Karpathy
fe5aed940b
add personality to nanochat. breaks previous code on git pull and requires download of a new file from s3, but there is a helpful error message so hopefully it's ok
2025-10-21 15:04:58 +00:00
karpathy
2e9669e03a
upgrading all other files to be able to use cpu/mps as well as cuda. various other minor changes, e.g. changing max_iterations to num_iterations in the sft script for consistency in naming
2025-10-20 10:15:17 -07:00
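The usual shape of such an upgrade is a small device-autodetection helper like the sketch below; nanochat's actual helper may differ in name and details.

```python
# Pick the best available backend: cuda, then mps (Apple Silicon), then cpu.
import torch

def autodetect_device():
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():  # Apple Silicon
        return "mps"
    return "cpu"

device = autodetect_device()
```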
obxium
938cb31f1a
Update logo
2025-10-14 14:19:44 -04:00
karpathy
a53833d04f
add nanochat logo png
2025-10-13 06:59:59 -07:00
karpathy
3a5e0bc50b
initial commit
2025-10-13 06:49:24 -07:00