Commit Graph

31 Commits

Andrej Karpathy
7312ec9898 fix buggy midtrain and update all kwargs to be idiomatic: argparse flags use dashes, variables use underscores. The underscores were just a remnant of the previous Configurator object. This is the right way. 2026-01-13 22:45:27 +00:00
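A minimal sketch of the convention this commit settles on, using standard argparse behavior (the flag names here are illustrative, not necessarily the repo's actual flags): argparse itself converts dashes in flag names to underscores in the parsed attribute names.

```python
import argparse

# argparse converts dashes in optional-argument names to underscores in
# the parsed attribute names, so flags stay idiomatic on the command line
# while the code reads as idiomatic Python. Flag names are illustrative.
parser = argparse.ArgumentParser()
parser.add_argument("--device-batch-size", type=int, default=32)
parser.add_argument("--weight-decay", type=float, default=0.0)
args = parser.parse_args(["--device-batch-size", "16"])
print(args.device_batch_size)  # 16
print(args.weight_decay)       # 0.0
```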
Andrej Karpathy
f92efce169 add negative result about not allowing attention across BOS tokens: a lot more code complexity for basically no gain in performance 2026-01-13 21:33:54 +00:00
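For context, a minimal sketch of the masking idea the commit tried (a hypothetical helper, not the repo's implementation, which the commit says was considerably more complex): each BOS-delimited document in a packed sequence gets its own id, and attention is allowed only within the same document.

```python
import torch

def document_mask(tokens: torch.Tensor, bos_id: int) -> torch.Tensor:
    """Causal mask that additionally forbids attention across BOS
    boundaries. tokens is a 1D tensor of ids for one packed sequence."""
    # doc_id[t] = number of BOS tokens at positions <= t, so every
    # BOS-delimited document in the packed stream gets its own id
    doc_id = (tokens == bos_id).long().cumsum(dim=0)
    T = tokens.shape[0]
    causal = torch.tril(torch.ones(T, T, dtype=torch.bool))
    same_doc = doc_id[:, None] == doc_id[None, :]
    return causal & same_doc  # attend iff causal AND same document
```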
Andrej Karpathy
43c29dd9d5 Big DataLoader refactor: BOS-aligned dataloaders with epoch tracking for pre/mid-training
The new DataLoader ensures that every token sequence in train/val batches has a BOS token
at the beginning. Therefore, no token streams start abruptly in the middle of a document,
which could be confusing for the model. Note that this changes the loss scale because there
are fewer confusing tokens in the train/val batches. The main downside is that we now waste
about 35% of tokens due to cropping. This is ok because we have a lot of data. See the
dev/LOG.md entry for this change for a lot more information.
2026-01-13 20:05:47 +00:00
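A rough sketch of BOS-aligned packing under the constraints described above; `bos_aligned_batches` is a hypothetical helper, not the repo's DataLoader, and the cropping here is what produces the token waste the commit mentions.

```python
import torch

def bos_aligned_batches(doc_stream, T, B, device="cpu"):
    """Yield (inputs, targets) batches where every row starts at a
    document boundary (BOS). doc_stream yields lists of token ids, each
    beginning with BOS. Docs are packed back to back into rows of T+1
    tokens; when a doc overflows its row, the overflow is discarded."""
    rows, row = [], []
    for doc in doc_stream:
        space = (T + 1) - len(row)
        row.extend(doc[:space])  # tokens in doc[space:] are dropped (the waste)
        if len(row) == T + 1:
            rows.append(row)
            row = []             # next row starts at the next doc's BOS
        if len(rows) == B:
            buf = torch.tensor(rows, dtype=torch.long, device=device)
            yield buf[:, :-1], buf[:, 1:]  # (B, T) inputs and shifted targets
            rows = []
```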
Andrej Karpathy
64b48d0e5c validated that \p{N}{1,2} is the correct number of digits to group up to in the regex pattern of the GPT-4 tokenizer (2 down from 3), leading to the best val_bpb for 32K vocabs 2026-01-13 17:45:06 +00:00
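To make the change concrete, here is how the digit-grouping clause behaves under the third-party `regex` module (which, unlike `re`, supports `\p{N}`); the full GPT-4 split pattern has more pieces, only the digit clause is shown.

```python
import regex  # third-party `regex` module; stdlib re does not support \p{N}

# The stock GPT-4 pattern groups runs of digits up to 3 long; the commit
# finds grouping up to 2 gives the best val_bpb for 32K vocabs.
print(regex.findall(r"\p{N}{1,3}", "123456"))  # ['123', '456']
print(regex.findall(r"\p{N}{1,2}", "123456"))  # ['12', '34', '56']
```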
Andrej Karpathy
238353c998 document my struggle with fp8 integration yesterday; it's not working like I thought it would and I suffered. One day I will return to continue the fight. 2026-01-13 17:14:29 +00:00
Andrej Karpathy
4610a838a1 record negative result on MTP (multi-token prediction) 2026-01-12 05:23:47 +00:00
Andrej Karpathy
fbc1484e8c add alternating window size patterns for the GPT layers, following GPT-3. Experimented a bit and found the pattern SSSL (three short, one long, alternating) to work well. This is now the new default, and the flops vs. bpb plots look quite a bit better. 2026-01-11 21:49:54 +00:00
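A small sketch of how such a pattern could map onto layers; the helper and the concrete window lengths are made up for illustration, only the SSSL cycling comes from the commit.

```python
def layer_window_sizes(n_layer, short=1024, long_=4096, pattern="SSSL"):
    """Assign each layer a sliding-window size by cycling the pattern:
    SSSL = three short windows, then one long one."""
    sizes = {"S": short, "L": long_}
    return [sizes[pattern[i % len(pattern)]] for i in range(n_layer)]

print(layer_window_sizes(12))
# [1024, 1024, 1024, 4096, 1024, 1024, 1024, 4096, 1024, 1024, 1024, 4096]
```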
Andrej Karpathy
2ff7d51252 integrate Flash Attention 3. +9% tok_per_sec for d12 out of the box, even with ctx as low as 2048, nice. Also, we're now ready to tune attention windows, which is huge. 2026-01-11 20:33:19 +00:00
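A hedged sketch of calling a flash-attn kernel with a sliding window; the exact module path differs between FA2 (`flash_attn`) and FA3 (`flash_attn_interface`), and the shapes below are illustrative rather than the repo's configuration.

```python
import torch
# Entry point as in the flash-attn package; FA3 exposes the same-named
# function under `flash_attn_interface` instead of `flash_attn`.
from flash_attn import flash_attn_func

B, T, H, D = 1, 2048, 6, 64  # batch, ctx, heads, head_dim (illustrative)
q = torch.randn(B, T, H, D, device="cuda", dtype=torch.bfloat16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Causal attention with a sliding window: window_size=(left, right)
# bounds how far back each query may look, which is the knob that makes
# the window tuning mentioned above cheap to experiment with.
out = flash_attn_func(q, k, v, causal=True, window_size=(1024, 0))
```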
Andrej Karpathy
aa530cdad5 Add learnable lambdas that gate the residual connection and a skip connection to the input embeddings, solid bump to val_bpb 2026-01-11 18:47:35 +00:00
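A sketch of the idea, not the repo's exact Block: learnable scalars gate the residual stream and a skip connection back to the original input embeddings (`x0` below). The MLP and initial lambda values are assumptions for illustration.

```python
import torch
import torch.nn as nn

class GatedBlock(nn.Module):
    """Learnable scalars gate (a) the residual stream and (b) a skip
    connection back to the original input embeddings x0."""
    def __init__(self, d_model):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.resid_lambda = nn.Parameter(torch.ones(1))  # gates the residual stream
        self.x0_lambda = nn.Parameter(torch.zeros(1))    # gates the embedding skip

    def forward(self, x, x0):
        x = self.resid_lambda * x + self.x0_lambda * x0
        return x + self.mlp(x)
```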
Andrej Karpathy
2c4473dd1b Big Muon optimizer changes inspired by the latest modded-nanogpt: added Polar Express, Adafactor-style variance reduction, cautious weight decay, and a linear schedule that ramps weight decay down to zero. Tuned the optimum weight decay for multiple model sizes (d8, d12, d16, d20) and found a scaling law, optimum wd ∝ 1/channels^2, which is now baked into the code as the default. --weight_decay of base_train is now on by default and configured optimally according to all of these experiments. Solid bump to val_bpb observed as a result. 2026-01-11 16:56:59 +00:00
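A sketch of the scaling law as a default-setting helper; the reference point (768 channels -> 0.1) and the channels-per-depth mapping are made up for illustration, only the 1/channels^2 shape comes from the commit's experiments.

```python
def default_weight_decay(channels, ref_channels=768, ref_wd=0.1):
    """Optimum weight decay scaling as 1/channels^2, anchored at an
    assumed reference point (ref_channels, ref_wd)."""
    return ref_wd * (ref_channels / channels) ** 2

# channels per depth assume the usual depth*64 sizing; also an assumption
for name, ch in [("d8", 512), ("d12", 768), ("d16", 1024), ("d20", 1280)]:
    print(name, round(default_weight_decay(ch), 4))
```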
Sofie Van Landeghem
a1ccb3dc0b remove rust compilation as rustbpe is now installed from separate package (#416) 2026-01-08 06:18:37 -08:00
Andrej Karpathy
061f83c152 delete grad_clip; it appears to not be necessary at all. Not only was it buggy (the clipping happened per GPU, before grad synchronization), but it costs ~2% MFU, and it doesn't even help. I tried deleting it a while ago and back then it did help, so I'm guessing that some hyperparameter tuning since then obviated the reason for it. 2026-01-08 02:16:50 +00:00
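For reference, a sketch of the ordering the commit says was violated (a generic helper, not nanochat code): gradients must be averaged across ranks before any global-norm clipping, otherwise each rank clips a different local gradient and the averaged result is not a clipped global gradient.

```python
import torch
import torch.distributed as dist

def sync_then_clip(model, max_norm=1.0):
    """Average gradients across ranks *before* clipping. Assumes the
    process group is already initialized and grads are populated."""
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
```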
Andrej Karpathy
e8c30c3b19 add notebook used for scaling laws analysis 2026-01-07 22:28:53 +00:00
Andrej Karpathy
54e59c38ad add notebook on deriving the CORE estimates for the GPT-3 miniseries. 2026-01-05 18:40:28 +00:00
Andrej Karpathy
ed2082fbc4 sane secrets management 2026-01-04 19:29:22 +00:00
svlandeg
2ce62ec076 ensure consistency of quotes within each statement 2025-11-03 21:52:02 +01:00
svlandeg
e22fc6f2fa few more explicit UTF-8 encodings 2025-11-03 21:46:39 +01:00
Andrej
b6da6982f6 fix nanochat logo: the t was placed too far to the right 2025-11-02 08:17:00 -08:00
Andrej Karpathy
cf587acb1a move eval bundle download to be lazy and inside the python code so that we can substantially simplify the run bash scripts 2025-11-01 16:04:38 +00:00
svlandeg
b996131570 Merge branch 'master' into logo/kerning-update 2025-10-29 11:45:40 +01:00
Andrej
a1de1f46ad Merge pull request #156 from tlepoint/fix/export-base-dir: Export the base dir variable in runcpu.sh 2025-10-28 15:19:08 -07:00
svlandeg
8c9b004c99 typo fixes in scripts 2025-10-28 20:17:31 +01:00
Tancrède Lepoint
d5cda11ab8 Export the base dir variable 2025-10-22 18:15:02 -04:00
Luke Stanley
901b075605 Fix GPU-less CPU use on Linux with specific Torch indexes 2025-10-21 23:14:16 +00:00
Andrej Karpathy
94ee507054 quick fix base eval due to fewshot requirement 2025-10-21 17:56:08 +00:00
Andrej Karpathy
5bdc99abfb merge and resolve conflict 2025-10-21 17:19:10 +00:00
Andrej Karpathy
fe5aed940b add personality to nanochat. This breaks previous code on git pull and requires download of a new file from s3, but there is a helpful error message, so hopefully it's ok 2025-10-21 15:04:58 +00:00
karpathy
2e9669e03a upgrading all other files to be able to use cpu/mps as well as cuda. Various other minor changes, e.g. changing max_iterations to num_iterations in the sft script for consistency in naming 2025-10-20 10:15:17 -07:00
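A minimal sketch of the kind of device autodetection this implies (an illustrative helper, not necessarily the repo's exact logic), using standard torch availability checks:

```python
import torch

def autodetect_device():
    """Pick the best available backend: CUDA if present, else Apple MPS,
    else CPU."""
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"

x = torch.randn(4, 4, device=autodetect_device())
```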
obxium
938cb31f1a Update logo 2025-10-14 14:19:44 -04:00
karpathy
a53833d04f add nanochat logo png 2025-10-13 06:59:59 -07:00
karpathy
3a5e0bc50b initial commit 2025-10-13 06:49:24 -07:00