Commit Graph

264 Commits

Author SHA1 Message Date
xiayan0118
6a477eedbd
fix: pass device_type to compute_init in engine.__main__ (#451)
When running engine.py directly on non-GPU devices (CPU, MPS),
compute_init() needs the device_type parameter to initialize correctly.
This fixes failures on machines without CUDA support.
2026-01-19 17:19:51 -08:00
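A minimal sketch of what this fix implies, assuming a hypothetical `detect_device_type` helper; the actual `compute_init` signature in nanochat is not reproduced here:

```python
import torch

def detect_device_type() -> str:
    # Prefer CUDA, then Apple MPS, else plain CPU (hypothetical helper).
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"

# Per the commit message, engine.__main__ now forwards the detected device
# type instead of letting compute_init assume CUDA:
# compute_init(device_type=detect_device_type())
```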
Andrej Karpathy
63bb5831e2 something i've wanted to do for a while - move all .sh runs to their own directory so they don't pollute root dir 2026-01-18 15:27:41 +00:00
Andrej Karpathy
a91743c168 Merge branch 've' 2026-01-18 15:14:39 +00:00
Andrej Karpathy
d58fcd9d73 log for jan 17 2026-01-18 03:01:17 +00:00
Andrej Karpathy
babde18ce1 small tweaks 2026-01-18 03:00:38 +00:00
Andrej Karpathy
cf5c9e5b8e resolve a crash for odd depths because FA3 needs head_dim % 8 == 0 2026-01-18 00:07:08 +00:00
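A sketch of the guard this commit implies: at odd depths the derived head dimension can violate FA3's divisibility requirement, so attention must route to a safe path. The `fa3_fn` argument stands in for a hypothetical FA3 kernel wrapper, not the repo's function:

```python
import torch.nn.functional as F

def can_use_fa3(head_dim: int) -> bool:
    # FA3 kernels reject head dims not divisible by 8 (the crash this fixes).
    return head_dim % 8 == 0

def attend(q, k, v, fa3_fn=None):
    if fa3_fn is not None and can_use_fa3(q.shape[-1]):
        return fa3_fn(q, k, v)  # caller-supplied FA3 kernel (hypothetical)
    # PyTorch SDPA has no such divisibility constraint.
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)
```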
Andrej Karpathy
413e91aa0f optimal ratio is now around 4 2026-01-17 23:51:09 +00:00
Andrej Karpathy
e7ed2082b8 update the default GPTConfig kwargs otherwise they are confusing 2026-01-17 21:16:46 +00:00
karpathy
f9a7e0f111 update the CPU/MPS script to give reasonable results. The model can at least answer that Paris is the capital of France and knows that the sky is blue, after about 40 minutes of training on my macbook. Also fixed a bug that existed due to KVCache bfloat16 dtype assumption 2026-01-17 12:27:30 -08:00
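An illustrative sketch of the dtype bug, not the repo's actual KVCache class: a cache allocated with a hard-coded `torch.bfloat16` breaks on CPU/MPS where the model runs in float32, so the dtype should be derived from the model itself:

```python
import torch

def alloc_kv_cache(model: torch.nn.Module, shape):
    # Derive dtype/device from the model instead of assuming bfloat16,
    # which is what breaks on CPU/MPS where parameters are float32.
    p = next(model.parameters())
    k = torch.empty(shape, dtype=p.dtype, device=p.device)
    v = torch.empty(shape, dtype=p.dtype, device=p.device)
    return k, v
```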
Andrej Karpathy
f5425245f9 more GPU types from PR 147 thanks @Qubitium 2026-01-17 03:22:20 +00:00
Andrej Karpathy
2955650327 add detection of device to report more correct mfu for bf16 2026-01-17 03:16:14 +00:00
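A sketch of device-aware MFU accounting; the lookup table and default below are illustrative assumptions, not the repo's values:

```python
import torch

# Assumed peak numbers for illustration (dense bf16, no sparsity):
PEAK_BF16_FLOPS = {
    "H100": 989e12,
    "A100": 312e12,
}

def mfu(achieved_flops_per_sec: float, default_peak: float = 312e12) -> float:
    name = torch.cuda.get_device_name()  # e.g. "NVIDIA H100 80GB HBM3"
    peak = next((v for k, v in PEAK_BF16_FLOPS.items() if k in name),
                default_peak)
    return achieved_flops_per_sec / peak
```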
Yury Kirpichev
77a46902e4
Fix WANDB_RUN parameter passing in runcpu.sh (#407)
- Add --run=$WANDB_RUN to base_train, mid_train, and chat_sft calls
- Ensures wandb logging works when WANDB_RUN environment variable is set
- Matches the behavior in speedrun.sh

Co-authored-by: svlandeg <svlandeg@github.com>
2026-01-16 18:59:44 -08:00
Barış Özmen
bbc4413c58
Add high-value engine tests for core invariants (33 LoC) (#396)
* test: add engine generation tests for expected invariants

- test_seed_reproducibility
- test_temperature_zero_determinism
- test_max_tokens_respected
- test_num_samples_count

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix temperature test

* add test for seed variation in sampling

Add test for seed variation in sampling with temperature > 0.

* Rename test for clarity

* Shorten assert msg

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2026-01-16 18:59:12 -08:00
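A hedged sketch of the first invariant in the list above; the engine/`generate` API shape here is assumed, not copied from the PR:

```python
def test_seed_reproducibility(engine):
    a = engine.generate("hi", max_tokens=32, temperature=1.0, seed=42)
    b = engine.generate("hi", max_tokens=32, temperature=1.0, seed=42)
    c = engine.generate("hi", max_tokens=32, temperature=1.0, seed=43)
    assert a == b   # same seed must reproduce the sample exactly
    assert a != c   # different seeds should (almost surely) diverge
```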
Nitish Pandey
f42ae9e901
fix condition to perform bpb evaluation (#324)
Co-authored-by: svlandeg <svlandeg@github.com>
2026-01-16 18:56:43 -08:00
Yamahammer
e1dafc510f
Reduce token waste in BOS bestfit by cropping shortest doc (#445)
When no document fits the remaining row space, crop the shortest
document in the buffer instead of the first. This minimizes
discarded tokens.

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-16 18:50:34 -08:00
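An illustrative sketch of the packing tweak, not the repo's exact code: best-fit a buffered document into the remaining row space, and when nothing fits, crop the shortest buffered document so the fewest tokens are discarded:

```python
def pick_doc(buffer, space):
    # Best fit: the longest buffered doc that still fits the remaining space.
    fitting = [d for d in buffer if len(d) <= space]
    if fitting:
        return max(fitting, key=len), False
    # Nothing fits: crop the *shortest* doc, discarding the fewest tokens
    # (the old behavior cropped the first doc regardless of its length).
    doc = min(buffer, key=len)
    return doc[:space], True
```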
Andrej Karpathy
6460dc6382 tweaks to readme a bit 2026-01-17 02:28:31 +00:00
Andrej Karpathy
1933e85046 brief update to log 2026-01-17 00:25:50 +00:00
Andrej Karpathy
3b95d4fd39 allow label for scaling laws script 2026-01-17 00:23:30 +00:00
Andrej Karpathy
e85db6b4a4 alternating design 2026-01-16 23:52:12 +00:00
Andrej Karpathy
9a88194c3f simply one VE per layer, works best 2026-01-16 22:08:52 +00:00
Andrej Karpathy
0b58d70e99 full ve version works very well 2026-01-16 21:16:47 +00:00
Andrej Karpathy
e3f58b838e ranked version 2026-01-16 20:59:42 +00:00
Andrej Karpathy
184d4c12b1 also add to log about the FA3 changes 2026-01-16 18:25:04 +00:00
Andrej Karpathy
b62a5bc44a naturally i failed to include the actual code in the previous commit facepalm 2026-01-16 17:39:41 +00:00
Andrej Karpathy
8203efa919 implement flash attention 3 fallback to pytorch sdpa by touching as few lines of code as possible in main files and keeping all implementation to a single file. add tests. add helpful warning messages for the user. 2026-01-16 17:37:51 +00:00
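A sketch of the single-file fallback pattern the commit describes; the import and return-value handling are hedged, since FA3's Python interface varies by version:

```python
import warnings
import torch.nn.functional as F

try:
    # FA3's Python interface; exact module name may differ per install.
    from flash_attn_interface import flash_attn_func
    HAS_FA3 = True
except ImportError:
    HAS_FA3 = False
    warnings.warn("FlashAttention 3 not found; falling back to PyTorch SDPA")

def attend(q, k, v, causal=True):
    if HAS_FA3:
        # Note: FA3 expects (B, T, H, D) layout; layout handling is omitted.
        out = flash_attn_func(q, k, v, causal=causal)
        # Some FA3 versions return (out, lse); unpack if needed.
        return out[0] if isinstance(out, tuple) else out
    return F.scaled_dot_product_attention(q, k, v, is_causal=causal)
```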
Haoyu Wang
50413d2d67
typo in comments: change "GAPO" to "DAPO" 2026-01-15 22:03:42 -08:00
Andrej Karpathy
fbf2bbea25 update log with a bunch of attempts 2026-01-16 02:21:17 +00:00
Andrej Karpathy
747ed4491f add negative result on olmo3 pretraining mix 2026-01-16 00:44:01 +00:00
Andrej Karpathy
7d1700c521 add zstd lib 2026-01-16 00:44:01 +00:00
Sofie Van Landeghem
d4ea28d4e2
Fix args in readme (#438)
* fix commands in readme, using new arg format

* fix typo

* add required -i flag to chat_eval example runs
2026-01-15 16:26:38 -08:00
Andrej Karpathy
bdcc030ffa oops, remove a legacy line that is now spurious 2026-01-15 23:32:20 +00:00
Andrej Karpathy
22a71aa3d3 fuse adamw into a single torch compiled kernel similar to muon. it's about 1.7X faster, but overall it's so tiny that it's not making a major dent 2026-01-15 23:30:44 +00:00
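An illustrative sketch of the idea, not the repo's kernel: express one AdamW step as a single tensor program and let torch.compile fuse it, mirroring the approach already used for Muon:

```python
import torch

@torch.compile
def adamw_step(p, grad, m, v, lr, beta1, beta2, eps, wd, t):
    m.lerp_(grad, 1 - beta1)                             # first moment
    v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)  # second moment
    m_hat = m / (1 - beta1 ** t)                         # bias correction
    v_hat = v / (1 - beta2 ** t)
    p.mul_(1 - lr * wd)                                  # decoupled weight decay
    p.add_(m_hat / (v_hat.sqrt() + eps), alpha=-lr)
```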
Andrej Karpathy
255f8b9af6 cleanly separate cpu and gpu sections 2026-01-15 23:30:11 +00:00
Andrej Karpathy
6bb92403d5 changes and optimizations to muon, making it a bit more efficient and simpler/cleaner 2026-01-15 03:20:48 +00:00
Andrej Karpathy
3142ca1a28 minor helpful message 2026-01-15 03:20:21 +00:00
Andrej Karpathy
7312ec9898 fix buggy midtrain and update all kwargs to be idiomatic. that is, argparse uses dashes, variables use underscores. the underscores are just a remnant of the previous Configurator object. This is the right way 2026-01-13 22:45:27 +00:00
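A tiny demonstration of the convention the commit adopts (`--device-batch-size` is an illustrative flag, not necessarily the repo's): argparse takes dashes on the command line and maps them to underscore attributes in Python:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--device-batch-size", type=int, default=32)  # dashes on CLI
args = parser.parse_args(["--device-batch-size", "16"])
assert args.device_batch_size == 16  # underscores in Python
```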
Andrej Karpathy
3b50b77ed3 fix base_loss to report correct loss by switching the dataloader to the new default 2026-01-13 22:09:36 +00:00
Andrej Karpathy
f92efce169 add negative result about not allowing attention across BOS tokens. A lot more code complexity for basically no gain in performance 2026-01-13 21:33:54 +00:00
Andrej Karpathy
43c29dd9d5 Big DataLoader refactor: BOS-aligned dataloaders with epoch tracking for pre/mid-training
The new DataLoader ensures that every token sequence in train/val batches has a BOS token
at the beginning. Therefore, no token streams start abruptly in the middle of a document,
which could be confusing for the model. Note that this changes the loss scale because there
are fewer confusing tokens in the train/val batches. The main downside is that we now waste
about 35% of tokens due to cropping. This is ok because we have a lot of data. See dev/LOG.md
entry for this change for a lot more information.
2026-01-13 20:05:47 +00:00
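A minimal sketch of the BOS-alignment invariant, illustrative rather than the repo's code: each row is filled with whole documents that begin at BOS, and a document that overflows the row is cropped, which is where the ~35% token waste comes from (the best-fit cropping in #445, listed earlier, reduces that waste):

```python
def build_row(doc_iter, row_len, bos_id):
    row = []
    while len(row) < row_len:
        doc = next(doc_iter)        # each doc already starts with bos_id
        assert doc[0] == bos_id     # the invariant: rows begin at a BOS
        take = min(len(doc), row_len - len(row))
        row.extend(doc[:take])      # overflow tokens are cropped (the waste)
    return row
```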
Andrej Karpathy
23985413aa adjust the comment on the regex pattern per recent experiment, see dev/LOG.md 2026-01-13 17:50:39 +00:00
Andrej Karpathy
64b48d0e5c validated that \p{N}{1,2} is the correct number of digits to group up to in the regex pattern of the GPT-4 tokenizer (2 down from 3), leading to the best val_bpb for 32K vocabs 2026-01-13 17:45:06 +00:00
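A quick demonstration of what changing the bound from 3 to 2 does to digit grouping during pre-tokenization (this uses the third-party `regex` module, which supports `\p{...}` unlike stdlib `re`):

```python
import regex

print(regex.findall(r"\p{N}{1,2}", "20260113"))  # ['20', '26', '01', '13']
print(regex.findall(r"\p{N}{1,3}", "20260113"))  # old grouping: ['202', '601', '13']
```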
Andrej Karpathy
238353c998 document my struggle with fp8 integration yesterday, it's not working like i thought it would and i suffered. one day i will return to continue the fight. 2026-01-13 17:14:29 +00:00
Andrej Karpathy
4610a838a1 record negative result on MTP 2026-01-12 05:23:47 +00:00
Andrej Karpathy
21608ec51e allow base_loss to report the loss of any arbitrary huggingface model similar to base_eval. had to change dataloader to be a lot better and just take tokenizer, not load the nanochat one. much better this way anyway 2026-01-12 03:10:13 +00:00
Andrej Karpathy
aa95fb2e03 make miniseries more generic and easier to run and less hard coded 2026-01-12 02:54:35 +00:00
Andrej Karpathy
b33e394528 oops actually make SSSL the default window pattern 2026-01-11 21:50:35 +00:00
Andrej Karpathy
fbc1484e8c add alternating window size patterns for the GPT layers, following GPT-3. Experimented a bit and found the pattern SSSL to work well - 3 short, 1 long alternating. This is now the new default and the plots look quite a bit better on flops vs. bpb 2026-01-11 21:49:54 +00:00
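An illustrative sketch of the alternating pattern (the window sizes below are assumptions): cycle the string "SSSL" across layers, giving three short-window layers followed by one long-window layer:

```python
def layer_windows(n_layer, short, long, pattern="SSSL"):
    # Cycle the pattern across layers: 'S' -> short window, 'L' -> long.
    return [short if pattern[i % len(pattern)] == "S" else long
            for i in range(n_layer)]

print(layer_windows(8, short=512, long=2048))
# [512, 512, 512, 2048, 512, 512, 512, 2048]
```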
Andrej Karpathy
2ff7d51252 integrate Flash Attention 3. +9% tok_per_sec for d12 with ctx even as low as 2048 out of the box nice. also, ready to tune windows huge 2026-01-11 20:33:19 +00:00
Andrej Karpathy
201d705957 recover the ability to load old checkpoints by patching the lambdas if they don't exist in checkpoints 2026-01-11 20:13:12 +00:00
Andrej Karpathy
aa530cdad5 Add learnable lambdas that gate the residual connection and a skip connection to the input embeddings, solid bump to val_bpb 2026-01-11 18:47:35 +00:00
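A hedged sketch of the gating idea; parameter placement and initialization here are assumptions, not the repo's exact code. Learnable scalar lambdas scale the residual stream and a skip connection straight from the input embeddings (the follow-up commit above patches these lambdas into old checkpoints that lack them):

```python
import torch
import torch.nn as nn

class GatedBlock(nn.Module):
    def __init__(self, block: nn.Module):
        super().__init__()
        self.block = block
        self.resid_lambda = nn.Parameter(torch.ones(1))   # gates the residual
        self.x0_lambda = nn.Parameter(torch.zeros(1))     # gates the embedding skip

    def forward(self, x, x0):
        # x0 is the original input-embedding stream, re-injected each layer.
        return self.resid_lambda * x + self.block(x) + self.x0_lambda * x0
```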