Andrej Karpathy
|
aa530cdad5
|
Add learnable lambdas that gate the residual connection and a skip connection to the input embeddings, solid bump to val_bpb
|
2026-01-11 18:47:35 +00:00 |
|
Andrej Karpathy
|
2c4473dd1b
|
Big Muon optimizer changes inspired by latest of modded-nanogpt. Added Polar Express, Adafactor-style variance reduction, cautious weight decay, and a weight decay schedule that ramps down linearly to zero. Tuned optimum weight decay for multiple model sizes d8, d12, d16, d20 and found a scaling law with optimum wd \propto 1/channels^2, including it as default in the code. --weight_decay of base_train is now on by default and configured optimally according to all of these experiments. Solid bump to val_bpb observed as a result of these changes.
|
2026-01-11 16:56:59 +00:00 |
|
Sofie Van Landeghem
|
a1ccb3dc0b
|
remove rust compilation as rustbpe is now installed from separate package (#416)
|
2026-01-08 06:18:37 -08:00 |
|
Andrej Karpathy
|
061f83c152
|
delete grad_clip. appears to not be necessary at all. not only was it buggy because the clipping happened per gpu before grad synchronization, but it costs ~2% MFU, and it also doesn't even help. I tried deleting it a while ago and back then it did help. So I'm guessing that some hyperparameter tuning obviated the reason for it since then
|
2026-01-08 02:16:50 +00:00 |
|
Andrej Karpathy
|
e8c30c3b19
|
add notebook used for scaling laws analysis
|
2026-01-07 22:28:53 +00:00 |
|
Andrej Karpathy
|
54e59c38ad
|
add notebook on deriving the CORE estimates for the GPT-3 miniseries.
|
2026-01-05 18:40:28 +00:00 |
|
Andrej Karpathy
|
ed2082fbc4
|
sane secrets management
|
2026-01-04 19:29:22 +00:00 |
|
svlandeg
|
2ce62ec076
|
ensure consistency of quotes within each statement
|
2025-11-03 21:52:02 +01:00 |
|
svlandeg
|
e22fc6f2fa
|
few more explicit UTF-8 encodings
|
2025-11-03 21:46:39 +01:00 |
|
Andrej
|
b6da6982f6
|
fix nanochat logo: the t was placed too far to the right
|
2025-11-02 08:17:00 -08:00 |
|
Andrej Karpathy
|
cf587acb1a
|
move eval bundle download to be lazy and inside the python code so that we can substantially simplify the run bash scripts
|
2025-11-01 16:04:38 +00:00 |
|
svlandeg
|
b996131570
|
Merge branch 'master' into logo/kerning-update
|
2025-10-29 11:45:40 +01:00 |
|
Andrej
|
a1de1f46ad
|
Merge pull request #156 from tlepoint/fix/export-base-dir
Export the base dir variable in runcpu.sh
|
2025-10-28 15:19:08 -07:00 |
|
svlandeg
|
8c9b004c99
|
typo fixes in scripts
|
2025-10-28 20:17:31 +01:00 |
|
Tancrède Lepoint
|
d5cda11ab8
|
Export the base dir variable
|
2025-10-22 18:15:02 -04:00 |
|
Luke Stanley
|
901b075605
|
Fix GPU-less CPU use on Linux with specific Torch indexes
|
2025-10-21 23:14:16 +00:00 |
|
Andrej Karpathy
|
94ee507054
|
quick fix base eval due to fewshot requirement
|
2025-10-21 17:56:08 +00:00 |
|
Andrej Karpathy
|
5bdc99abfb
|
merge and resolve conflict
|
2025-10-21 17:19:10 +00:00 |
|
Andrej Karpathy
|
fe5aed940b
|
add personality to nanochat. breaks previous code on git pull and requires download of a new file from s3, but there is a helpful error message so hopefully it's ok
|
2025-10-21 15:04:58 +00:00 |
|
karpathy
|
2e9669e03a
|
upgrading all other files to be able to use cpu/mps as well as cuda. various minor other changes, e.g. changing max_iterations to num_iterations in sft script for consistency in naming
|
2025-10-20 10:15:17 -07:00 |
|
obxium
|
938cb31f1a
|
Update logo
|
2025-10-14 14:19:44 -04:00 |
|
karpathy
|
a53833d04f
|
add nanochat logo png
|
2025-10-13 06:59:59 -07:00 |
|
karpathy
|
3a5e0bc50b
|
initial commit
|
2025-10-13 06:49:24 -07:00 |
|
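Commit 2c4473dd1b above describes two weight decay rules: an optimum that scales as 1/channels^2, and a schedule that ramps linearly down to zero over training. The following is a minimal sketch of how those two rules could be combined. It is not the code from that commit. The function names, the reference values `ref_wd` and `ref_channels`, and the example channel counts are all hypothetical and chosen only for illustration.

```python
def scaled_weight_decay(channels, ref_wd=0.1, ref_channels=512):
    """Weight decay proportional to 1/channels^2.

    ref_wd and ref_channels are hypothetical anchor values, not tuned numbers
    from the repository: at channels == ref_channels the result is ref_wd.
    """
    return ref_wd * (ref_channels / channels) ** 2


def weight_decay_at(step, num_steps, base_wd):
    """Linear ramp from base_wd at step 0 down to zero at step num_steps."""
    frac = min(max(step / num_steps, 0.0), 1.0)  # clamp training progress to [0, 1]
    return base_wd * (1.0 - frac)


if __name__ == "__main__":
    # Hypothetical model widths; doubling the width quarters the weight decay.
    for channels in (512, 1024, 2048):
        base = scaled_weight_decay(channels)
        mid = weight_decay_at(500, 1000, base)
        print(f"channels={channels} base_wd={base:.6f} wd_at_halfway={mid:.6f}")
```

Under these assumptions, doubling the channel count divides the base weight decay by four, and the scheduled value reaches exactly zero on the final step.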