nanochat

mirror of https://github.com/karpathy/nanochat.git synced 2026-01-29 14:54:46 +00:00

Author	SHA1	Message	Date
kiankyars	cc9bcbf6fd	Merge `00f1a3219d` into `f5425245f9`	2026-01-17 15:24:26 +08:00
Andrej Karpathy	f5425245f9	more GPU types from PR 147 thanks @Qubitium	2026-01-17 03:22:20 +00:00
Andrej Karpathy	2955650327	add detection of device to report more correct mfu for bf16	2026-01-17 03:16:14 +00:00
Yury Kirpichev	77a46902e4	Fix WANDB_RUN parameter passing in runcpu.sh (#407) - Add --run=$WANDB_RUN to base_train, mid_train, and chat_sft calls - Ensures wandb logging works when WANDB_RUN environment variable is set - Matches the behavior in speedrun.sh Co-authored-by: svlandeg <svlandeg@github.com>	2026-01-16 18:59:44 -08:00
Barış Özmen	bbc4413c58	Add high value engine tests for core invariants (33 LoC) (#396) * test: add engine generation tests for expected invariants - test_seed_reproducibility - test_temperature_zero_determinism - test_max_tokens_respected - test_num_samples_count 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * Fix temperature test * add test for seed variation in sampling Add test for seed variation in sampling with temperature > 0. * Rename test for clarity * Shorten assert msg --------- Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2026-01-16 18:59:12 -08:00
Nitish Pandey	f42ae9e901	fix condition to perform bpb evaluation (#324) Co-authored-by: svlandeg <svlandeg@github.com>	2026-01-16 18:56:43 -08:00
Yamahammer	e1dafc510f	Reduce token waste in BOS bestfit by cropping shortest doc (#445) When no document fits the remaining row space, crop the shortest document in the buffer instead of the first. This minimizes discarded tokens. Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-16 18:50:34 -08:00
Andrej Karpathy	6460dc6382	tweaks to readme a bit	2026-01-17 02:28:31 +00:00
Andrej Karpathy	1933e85046	brief update to log	2026-01-17 00:25:50 +00:00
Kian Kyars	00f1a3219d	speedrun	2026-01-16 16:22:58 -08:00
Kian Kyars	2f7841cd50	remove all uv venv	2026-01-16 16:22:39 -08:00
Andrej Karpathy	184d4c12b1	also add to log about the FA3 changes	2026-01-16 18:25:04 +00:00
Andrej Karpathy	b62a5bc44a	naturally i failed to include the actual code in the previous commit facepalm	2026-01-16 17:39:41 +00:00
Andrej Karpathy	8203efa919	implement flash attention 3 fallback to pytorch sdpa by touching as few lines of code as possible in main files and keeping all implementation to a single file. add tests. add helpful warning messages for the user.	2026-01-16 17:37:51 +00:00
Haoyu Wang	50413d2d67	typo in comments: change "GAPO" to "DAPO"	2026-01-15 22:03:42 -08:00
Andrej Karpathy	fbf2bbea25	update log with a bunch of attempts	2026-01-16 02:21:17 +00:00
Andrej Karpathy	747ed4491f	add negative result on olmo3 pretraining mix	2026-01-16 00:44:01 +00:00
Andrej Karpathy	7d1700c521	add zstd lib	2026-01-16 00:44:01 +00:00
Sofie Van Landeghem	d4ea28d4e2	Fix args in readme (#438) * fix commands in readme, using new arg format * fix typo * add required -i flag to chat_eval example runs	2026-01-15 16:26:38 -08:00
Andrej Karpathy	bdcc030ffa	oops legacy spurious line now	2026-01-15 23:32:20 +00:00
Andrej Karpathy	22a71aa3d3	fuse adamw into a single torch compiled kernel similar to muon. it's about 1.7X faster, but overall it's so tiny that it's not making a major dent	2026-01-15 23:30:44 +00:00
Andrej Karpathy	255f8b9af6	cleanly separate cpu and gpu sections	2026-01-15 23:30:11 +00:00
Andrej Karpathy	6bb92403d5	changes and optimizations to muon, making it more efficient and simpler/cleaner a bit	2026-01-15 03:20:48 +00:00
Andrej Karpathy	3142ca1a28	minor helpful message	2026-01-15 03:20:21 +00:00
Andrej Karpathy	7312ec9898	fix buggy midtrain and update all kwargs to be idiomatic. that is, argparse uses dashes variables use underscores. the underscores are just a remnant of the previous Configurator object. This is the right way	2026-01-13 22:45:27 +00:00
Andrej Karpathy	3b50b77ed3	fix base_loss to report correct loss by switching the dataloader to the new default	2026-01-13 22:09:36 +00:00
Andrej Karpathy	f92efce169	add negative result about not allowing attention across BOS tokens. A lot more code complexity for basically no gain in performance	2026-01-13 21:33:54 +00:00
Andrej Karpathy	43c29dd9d5	Big DataLoader refactor: BOS-aligned dataloaders with epoch tracking for pre/mid-training The new DataLoader ensures that every token sequence in train/val batches has a BOS token at the beginning. Therefore, no token streams start abruptly in the middle of a document, which could be confusing for the model. Note that this changes the loss scale because there are fewer confusing tokens in the train/val batches. The main downside is that we now waste about 35% of tokens due to cropping. This is ok because we have a lot of data. See dev/LOG.md entry for this change for a lot more information.	2026-01-13 20:05:47 +00:00
Andrej Karpathy	23985413aa	adjust the comment on the regex pattern per recent experiment see dev/LOG.md	2026-01-13 17:50:39 +00:00
Andrej Karpathy	64b48d0e5c	validated that \p{N}{1,2} is the correct number of digits to group up to in the regex pattern of the GPT-4 tokenizer (2 down from 3), leading to the best val_bpb for 32K vocabs	2026-01-13 17:45:06 +00:00
Andrej Karpathy	238353c998	document my struggle with fp8 integration yesterday, it's not working like i thought it would and i suffered. one day i will return to continue the fight.	2026-01-13 17:14:29 +00:00
Andrej Karpathy	4610a838a1	record negative result on MTP	2026-01-12 05:23:47 +00:00
Andrej Karpathy	21608ec51e	allow base_loss to report the loss of any arbitrary huggingface model similar to base_eval. had to change dataloader to be a lot better and just take tokenizer, not load the nanochat one. much better this way anyway	2026-01-12 03:10:13 +00:00
Andrej Karpathy	aa95fb2e03	make miniseries more generic and easier to run and less hard coded	2026-01-12 02:54:35 +00:00
Andrej Karpathy	b33e394528	oops actually make SSSL the default window pattern	2026-01-11 21:50:35 +00:00
Andrej Karpathy	fbc1484e8c	add alternating window size patterns for the GPT layers, following GPT-3. Experimented a bit and found the pattern SSSL to work well - 3 short, 1 long alternating. This is now the new default and the plots look quite a bit better on flops vs. bpb	2026-01-11 21:49:54 +00:00
Andrej Karpathy	2ff7d51252	integrate Flash Attention 3. +9% tok_per_sec for d12 with ctx even as low as 2048 out of the box nice. also, ready to tune windows huge	2026-01-11 20:33:19 +00:00
Andrej Karpathy	201d705957	recover the ability to load old checkpoints by patching the lambdas if they don't exist in checkpoints	2026-01-11 20:13:12 +00:00
Andrej Karpathy	aa530cdad5	Add learnable lambdas that gate the residual connection and a skip connection to the input embeddings, solid bump to val_bpb	2026-01-11 18:47:35 +00:00
Andrej Karpathy	2c4473dd1b	Big Muon optimizer changes inspired by latest of modded-nanogpt. Added Polar Express, Adafactor-style variance reduction, cautious weight decay, schedule weight decay linearly to ramp down to zero. Tuned optimum weight decay for multiple model sizes d8, d12, d16, d20 and found a scaling law with optimum wd \propto 1/channels^2, including it as default into code. --weight_decay of base_train is now default on and configured optimally according to all of these experiments. Solid bump to val_bpb observed as a result of these changes.	2026-01-11 16:56:59 +00:00
Andrej Karpathy	f5a0ea4d3f	take out these gitignore dirs	2026-01-08 18:18:42 +00:00
Andrej Karpathy	4ddc803797	fix adamw slight bug. this chunk was copy pasted originally from modded-nanogpt, which still seems to have the bug	2026-01-08 18:18:42 +00:00
Sofie Van Landeghem	a1ccb3dc0b	remove rust compilation as rustbpe is now installed from separate package (#416)	2026-01-08 06:18:37 -08:00
Andrej Karpathy	061f83c152	delete grad_clip. appears to not be necessary at all. not only was it buggy because the clipping happened per gpu before grad synchronization, but it costs ~2% MFU, and it also doesn't even help. I tried deleting it a while ago and back then it did help. So I'm guessing that some hyperparameter tuning obviated the reason for it since then	2026-01-08 02:16:50 +00:00
Andrej Karpathy	e8c30c3b19	add notebook used for scaling laws analysis	2026-01-07 22:28:53 +00:00
Andrej Karpathy	3af4dcf6ee	also add scaling_laws.sh script if it's a useful reference	2026-01-07 22:25:13 +00:00
Andrej Karpathy	4cc605b940	quick pointer to miniseries post in readme for now	2026-01-07 22:14:21 +00:00
Andrej Karpathy	ccf4b7f9bf	nudge hyperparameters of the base script with the results of the sweeps and miniseries. vocab size down to 32K. D:N ratio from 20 to 8. add miniseries script	2026-01-07 22:11:59 +00:00
Adria Blancafort	1b5de29e71	Fix undefined variable in chat_rl after recent refactor * Fix undefined variable * Remove unused import Remove unused import 're' from chat_rl.py	2026-01-07 09:08:57 -08:00
Andrej Karpathy	ae0bf52529	tune hyperparameters based on overnight sweeps. warmdown_ratio is the biggest free win, increasing 0.2 -> 0.4, and embedding lr can be larger bumping 0.2 -> 0.3	2026-01-05 18:57:46 +00:00
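
Commit e1dafc510f in the table above describes cropping the shortest buffered document when nothing fits the remaining row space. The following is only a minimal sketch of that idea, not the repository's code: the function name `fill_row`, the list-of-token-lists buffer, and the "largest document that still fits" selection rule are all assumptions made for illustration.

```python
from typing import List, Tuple


def fill_row(buffer: List[List[int]], row_len: int) -> Tuple[List[int], int]:
    """Pack documents from `buffer` into one row of `row_len` tokens.

    Hypothetical helper: returns (row, number_of_discarded_tokens) and
    removes the documents it consumes from `buffer`.
    """
    row: List[int] = []
    discarded = 0
    while len(row) < row_len and buffer:
        remaining = row_len - len(row)
        # Indices of documents that fit entirely in the remaining space.
        fitting = [i for i, doc in enumerate(buffer) if len(doc) <= remaining]
        if fitting:
            # Assumed best-fit rule: take the largest document that still fits.
            idx = max(fitting, key=lambda i: len(buffer[i]))
            row.extend(buffer.pop(idx))
        else:
            # Nothing fits: crop the shortest document, so the discarded
            # tail is as small as possible.
            idx = min(range(len(buffer)), key=lambda i: len(buffer[i]))
            doc = buffer.pop(idx)
            row.extend(doc[:remaining])
            discarded += len(doc) - remaining
    return row, discarded


# Documents of length 10, 6 and 4 packed into a row of 8 tokens.
docs = [[1] * 10, [2] * 6, [3] * 4]
row, wasted = fill_row(docs, 8)
```

In this toy case the 6-token document is placed first, then the 4-token document is cropped to 2 tokens, discarding 2. Cropping the first buffered document (length 10) instead would have discarded 8.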


253 Commits
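
Commit 2c4473dd1b above reports an optimum weight decay proportional to 1/channels^2 and a weight decay schedule that ramps linearly down to zero. A rough sketch of how those two statements combine is shown below. The function names and the reference values (`ref_channels=768`, `ref_wd=0.1`) are hypothetical placeholders, not numbers taken from the repository.

```python
def scaled_weight_decay(channels: int,
                        ref_channels: int = 768,
                        ref_wd: float = 0.1) -> float:
    """Weight decay following wd ∝ 1 / channels**2.

    `ref_channels` and `ref_wd` are made-up anchor values for illustration.
    """
    return ref_wd * (ref_channels / channels) ** 2


def weight_decay_at(step: int, total_steps: int, wd0: float) -> float:
    """Linearly ramp weight decay from `wd0` at step 0 to zero at the end."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return wd0 * (1.0 - frac)


# Doubling the channel count divides the initial weight decay by four.
wd_start = scaled_weight_decay(channels=1536)
wd_mid = weight_decay_at(step=50, total_steps=100, wd0=wd_start)
```

Under these assumed reference values, a model twice as wide starts at a quarter of the reference weight decay and reaches half of that starting value midway through training.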