Andrej Karpathy
fbf2bbea25
update log with a bunch of attempts
2026-01-16 02:21:17 +00:00
Andrej Karpathy
747ed4491f
add negative result on olmo3 pretraining mix
2026-01-16 00:44:01 +00:00
Andrej Karpathy
7312ec9898
fix buggy midtrain and update all kwargs to be idiomatic. That is, argparse uses dashes, variables use underscores. The underscores were just a remnant of the previous Configurator object. This is the right way
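A minimal sketch of the convention (the flag name here is just an example, not necessarily one from the repo):

```python
import argparse

# argparse maps dashes in flag names to underscores in attribute names,
# so the CLI uses dashes while the Python side uses underscores.
parser = argparse.ArgumentParser()
parser.add_argument("--weight-decay", type=float, default=0.0)
args = parser.parse_args(["--weight-decay", "0.1"])
print(args.weight_decay)  # 0.1
```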
2026-01-13 22:45:27 +00:00
Andrej Karpathy
f92efce169
add negative result about not allowing attention across BOS tokens. A lot more code complexity for basically no gain in performance
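For reference, a minimal sketch of the rejected idea (not the actual implementation): combine the causal mask with a same-document constraint derived from BOS positions.

```python
import torch

# Sketch only: tokens may attend only within their own document,
# where each document starts at a BOS token.
def doc_causal_mask(tokens: torch.Tensor, bos_id: int) -> torch.Tensor:
    doc_id = (tokens == bos_id).cumsum(dim=0)      # document index per token
    same_doc = doc_id[:, None] == doc_id[None, :]  # same-document pairs
    T = tokens.size(0)
    causal = torch.tril(torch.ones(T, T, dtype=torch.bool))
    return same_doc & causal                       # True = attention allowed

toks = torch.tensor([1, 5, 6, 1, 7, 8])  # pretend 1 is the BOS id
print(doc_causal_mask(toks, bos_id=1).int())
```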
2026-01-13 21:33:54 +00:00
Andrej Karpathy
43c29dd9d5
Big DataLoader refactor: BOS-aligned dataloaders with epoch tracking for pre/mid-training
...
The new DataLoader ensures that every token sequence in train/val batches has a BOS token
at the beginning. Therefore, no token streams start abruptly in the middle of a document,
which could be confusing for the model. Note that this changes the loss scale because there
are fewer confusing tokens in the train/val batches. The main downside is that we now waste
about 35% of tokens due to cropping. This is ok because we have a lot of data. See the
dev/LOG.md entry for this change for a lot more information.
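A minimal sketch of the mechanism (not the actual DataLoader; assumes a flat token stream and a known BOS id):

```python
import torch

def bos_aligned_batch(tokens, bos_id, B, T):
    # Sketch only: every row of the batch starts at a BOS token.
    bos_positions = [i for i, t in enumerate(tokens) if t == bos_id]
    rows, pos = [], 0
    while len(rows) < B and pos < len(bos_positions):
        start = bos_positions[pos]
        chunk = tokens[start:start + T + 1]
        if len(chunk) == T + 1:
            rows.append(chunk)
        pos += 1
        # skip BOS positions already consumed inside this row; the next
        # row starts at the first BOS after the row ends, so the tail of
        # the document cut off at the row boundary is cropped (the waste)
        while pos < len(bos_positions) and bos_positions[pos] < start + T + 1:
            pos += 1
    batch = torch.tensor(rows)
    return batch[:, :-1], batch[:, 1:]  # inputs, targets
```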
2026-01-13 20:05:47 +00:00
Andrej Karpathy
64b48d0e5c
validated that \p{N}{1,2} is the correct number of digits to group up to in the regex pattern of the GPT-4 tokenizer (2 down from 3), leading to the best val_bpb for 32K vocabs
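For illustration, with the third-party `regex` module (the pattern below approximates the GPT-4 split pattern; treat it as an assumption, not a copy of the repo's):

```python
import regex  # third-party 'regex'; stdlib re lacks \p{N} and possessive quantifiers

# GPT-4-style split pattern with digit runs capped at 2 (\p{N}{1,2})
# instead of the original 3.
pat = regex.compile(
    r"'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,2}"
    r"| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"
)
print(pat.findall("price 12345"))  # ['price', ' ', '12', '34', '5']
```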
2026-01-13 17:45:06 +00:00
Andrej Karpathy
238353c998
document my struggle with fp8 integration yesterday; it's not working like I thought it would and I suffered. One day I will return to continue the fight.
2026-01-13 17:14:29 +00:00
Andrej Karpathy
4610a838a1
record negative result on MTP
2026-01-12 05:23:47 +00:00
Andrej Karpathy
fbc1484e8c
add alternating window size patterns for the GPT layers, following GPT-3. Experimented a bit and found the pattern SSSL (3 short, 1 long, alternating) to work well. This is now the new default, and the flops vs. bpb plots look quite a bit better
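A minimal sketch of the layout (window sizes here are illustrative, not the tuned defaults):

```python
# SSSL: every 4th layer gets a long attention window, the rest short.
def layer_windows(n_layer, short=1024, long=4096, pattern="SSSL"):
    return [long if pattern[i % len(pattern)] == "L" else short
            for i in range(n_layer)]

print(layer_windows(12))  # d12: [1024, 1024, 1024, 4096] repeated 3x
```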
2026-01-11 21:49:54 +00:00
Andrej Karpathy
2ff7d51252
integrate Flash Attention 3. +9% tok_per_sec for d12 out of the box, even with ctx as low as 2048, nice. Also, we're now ready to tune the attention windows, which is huge
2026-01-11 20:33:19 +00:00
Andrej Karpathy
aa530cdad5
Add learnable lambdas that gate the residual connection and a skip connection to the input embeddings, solid bump to val_bpb
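A minimal sketch of the idea (not the repo's code; the sublayer is a stand-in):

```python
import torch
import torch.nn as nn

class GatedResidual(nn.Module):
    # Learnable scalars gate the residual stream and a skip connection
    # from the input embeddings x0.
    def __init__(self, d_model: int):
        super().__init__()
        self.ff = nn.Linear(d_model, d_model)            # stand-in sublayer
        self.lambda_resid = nn.Parameter(torch.ones(1))  # gates the residual
        self.lambda_x0 = nn.Parameter(torch.zeros(1))    # gates the embedding skip

    def forward(self, x: torch.Tensor, x0: torch.Tensor) -> torch.Tensor:
        return self.lambda_resid * x + self.lambda_x0 * x0 + self.ff(x)
```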
2026-01-11 18:47:35 +00:00
Andrej Karpathy
2c4473dd1b
Big Muon optimizer changes inspired by the latest modded-nanogpt. Added Polar Express, Adafactor-style variance reduction, cautious weight decay, and a linear schedule that ramps weight decay down to zero. Tuned the optimum weight decay for multiple model sizes (d8, d12, d16, d20) and found a scaling law, optimum wd ∝ 1/channels^2, which is now the default in the code. --weight_decay of base_train is now on by default and configured optimally according to all of these experiments. Solid bump to val_bpb observed as a result of these changes.
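The scaling rule, sketched (the reference anchor below is made up for illustration, not the fitted constant from these experiments):

```python
# Optimum weight decay scales as 1/channels^2.
def default_weight_decay(channels: int, ref_channels: int = 768,
                         ref_wd: float = 0.1) -> float:
    return ref_wd * (ref_channels / channels) ** 2

for ch in (512, 768, 1024, 1280):  # illustrative widths
    print(ch, round(default_weight_decay(ch), 4))
```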
2026-01-11 16:56:59 +00:00
Sofie Van Landeghem
a1ccb3dc0b
remove rust compilation as rustbpe is now installed from separate package (#416)
2026-01-08 06:18:37 -08:00
Andrej Karpathy
061f83c152
delete grad_clip. Appears not to be necessary at all. Not only was it buggy (the clipping happened per GPU before grad synchronization), but it costs ~2% MFU, and it doesn't even help. I tried deleting it a while ago and back then it did help, so I'm guessing some hyperparameter tuning since then has obviated the reason for it
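A minimal single-process sketch of the ordering issue (with DDP, gradients are all-reduced during backward(), so any clipping must come after that, never per GPU before synchronization):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 4)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss = model(torch.randn(2, 4)).sum()
loss.backward()  # with DDP, grads are averaged across ranks here
# clip the *synchronized* grads; clipping each rank's local grads
# before the sync (the old bug) bounds the wrong quantity
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
opt.step()
```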
2026-01-08 02:16:50 +00:00
Andrej Karpathy
e8c30c3b19
add notebook used for scaling laws analysis
2026-01-07 22:28:53 +00:00
Andrej Karpathy
54e59c38ad
add notebook on deriving the CORE estimates for the GPT-3 miniseries.
2026-01-05 18:40:28 +00:00
Andrej Karpathy
ed2082fbc4
sane secrets management
2026-01-04 19:29:22 +00:00
svlandeg
2ce62ec076
ensure consistency of quotes within each statement
2025-11-03 21:52:02 +01:00
svlandeg
e22fc6f2fa
few more explicit UTF-8 encodings
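The fix pattern, for reference:

```python
# Pass encoding explicitly so file I/O doesn't depend on the platform
# default (e.g. cp1252 on Windows).
with open("somefile.txt", "w", encoding="utf-8") as f:
    f.write("nanochat")
with open("somefile.txt", "r", encoding="utf-8") as f:
    text = f.read()
```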
2025-11-03 21:46:39 +01:00
Andrej
b6da6982f6
fix nanochat logo: the t was placed too far to the right
2025-11-02 08:17:00 -08:00
Andrej Karpathy
cf587acb1a
move eval bundle download to be lazy and inside the python code so that we can substantially simplify the run bash scripts
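A hypothetical sketch of the lazy-download pattern (the URL and cache path are placeholders, not the real ones):

```python
import os
import urllib.request

def get_eval_bundle(url="https://example.com/eval_bundle.zip",
                    cache="~/.cache/nanochat/eval_bundle.zip"):
    cache = os.path.expanduser(cache)
    if not os.path.exists(cache):  # download on first use only
        os.makedirs(os.path.dirname(cache), exist_ok=True)
        urllib.request.urlretrieve(url, cache)
    return cache
```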
2025-11-01 16:04:38 +00:00
svlandeg
b996131570
Merge branch 'master' into logo/kerning-update
2025-10-29 11:45:40 +01:00
Andrej
a1de1f46ad
Merge pull request #156 from tlepoint/fix/export-base-dir
...
Export the base dir variable in runcpu.sh
2025-10-28 15:19:08 -07:00
svlandeg
8c9b004c99
typo fixes in scripts
2025-10-28 20:17:31 +01:00
Tancrède Lepoint
d5cda11ab8
Export the base dir variable
2025-10-22 18:15:02 -04:00
Luke Stanley
901b075605
Fix GPU-less CPU use on Linux with specific Torch indexes
2025-10-21 23:14:16 +00:00
Andrej Karpathy
94ee507054
quick fix base eval due to fewshot requirement
2025-10-21 17:56:08 +00:00
Andrej Karpathy
5bdc99abfb
merge and resolve conflict
2025-10-21 17:19:10 +00:00
Andrej Karpathy
fe5aed940b
add personality to nanochat. Breaks previous code on git pull and requires download of a new file from S3, but there is a helpful error message, so hopefully it's ok
2025-10-21 15:04:58 +00:00
karpathy
2e9669e03a
upgrading all other files to be able to use cpu/mps as well as cuda. Various minor other changes, e.g. changing max_iterations to num_iterations in the sft script for consistency in naming
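The device fallback, sketched (function name is mine, not the repo's):

```python
import torch

def autodetect_device() -> str:
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"
```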
2025-10-20 10:15:17 -07:00
obxium
938cb31f1a
Update logo
2025-10-14 14:19:44 -04:00
karpathy
a53833d04f
add nanochat logo png
2025-10-13 06:59:59 -07:00
karpathy
3a5e0bc50b
initial commit
2025-10-13 06:49:24 -07:00