nanochat

mirror of https://github.com/karpathy/nanochat.git synced 2026-01-20 18:34:14 +00:00

Author	SHA1	Message	Date
Andrej Karpathy	061f83c152	delete grad_clip. appears to not be necessary at all. not only was it buggy because the clipping happened per gpu before grad synchronization, but it costs ~2% MFU, and it also doesn't even help. I tried deleting it a while ago and back then it did help. So I'm guessing that some hyperparameter tuning obviated the reason for it since then	2026-01-08 02:16:50 +00:00
Andrej Karpathy	ccf4b7f9bf	nudge hyperparameters of the base script with the results of the sweeps and miniseries. vocab size down to 32K. D:N ratio from 20 to 8. add miniseries script	2026-01-07 22:11:59 +00:00
Andrej Karpathy	ae0bf52529	tune hyperparameters based on overnight sweeps. warmdown_ratio is the biggest free win, increasing 0.2 -> 0.4, and embedding lr can be larger bumping 0.2 -> 0.3	2026-01-05 18:57:46 +00:00
Andrej Karpathy	9d4c9b786d	many small fixes to base_train: reporting ETA, allowing some additional kwarg flexibility, making sure we don't crash when e.g. depth = 11 - we now calculate the closest num_heads that works	2026-01-05 00:38:09 +00:00
Andrej Karpathy	eb7bbc1b66	delete the configurator in favor of argparse and clean up a lot of kwarg details to make them more consistent across all scripts	2026-01-04 19:14:23 +00:00
Andrej Karpathy	48abd7d85f	simplify, clarify and slightly tune model initialization. should be very slightly better possibly, but certainly a lot clearer	2026-01-01 21:15:09 +00:00
Andrej Karpathy	2874eda59a	update to new os env var to get rid of deprecation warning	2025-12-28 03:32:46 +00:00
Sanzo00	53b3a4fb81	fix: missing val_bpb on resume	2025-11-22 11:04:20 +08:00
Andrej Karpathy	c6abcdfe3a	big change: add pretraining resumption logic so that checkpoints can now be approximately resumed and training can continue. this is useful for very long runs when you don't want the anxiety of your run crashing for some reason. alternatively, it's a way to recover training in the event of loss spikes. i mean, this should have been there in v0 but it's ok. the resumption is approximate to control complexity and bloat, but it's possible we want to change that in the future. to use, set --save_every to a step interval to write checkpoints with, and then use --resume_from_step to resume optimization from a given step. only base model training (pretraining) supports this atm, but it's ok because midtraining is comparably quite a bit faster.	2025-11-13 15:34:40 +00:00
Andrej Karpathy	c6b7ab7440	grad clip logging and printing and cosmetics	2025-11-05 21:08:30 +00:00
Andrej	dfc88334b6	fix tok/sec calculation bug when grad accum steps > 1. Fix tok/sec metrics for base_train and mid_train when gradient accumulation is not 1	2025-10-30 08:36:32 -07:00
svlandeg	8c9b004c99	typo fixes in scripts	2025-10-28 20:17:31 +01:00
water-vapor	a9de4b1038	Fix tok/sec metrics for base_train and mid_train when gradient accumulation is not 1	2025-10-26 01:43:49 -05:00
Andrej Karpathy	81597cd616	move the lr schedule args up in base_train so they are tunable in configurator	2025-10-24 13:27:31 +00:00
Andrej Karpathy	a088b7a6ec	use enable_gqa of pytorch sdpa, allows us to delete some code, didnt realize it's available	2025-10-21 18:07:33 +00:00
Andrej Karpathy	5bdc99abfb	merge and resolve conflict	2025-10-21 17:19:10 +00:00
Andrej Karpathy	dfcb1c16f1	Merge branch 'master' into cpu-mps-dev	2025-10-21 17:15:53 +00:00
Andrej Karpathy	c1d2ed1c13	use orig_model in sampling, silly of me to miss this	2025-10-20 00:05:09 +00:00
Andrej Karpathy	2bc521a6de	use orig_model in sampling, silly of me to miss this	2025-10-20 00:04:15 +00:00
karpathy	df600b6ed5	many small tweaks. base, eval, core work now i think	2025-10-16 15:46:18 -07:00
karpathy	786119d593	add autodetect of device and related stuff. getting weird warnings/errors still, so wip	2025-10-16 10:26:19 -07:00
karpathy	279b74312c	adjust comment/guidance on device type	2025-10-16 10:06:39 -07:00
karpathy	306bc380ab	add support for CPU and for MPS. I had to change a few cosmetic things. I also discovered I think a bit of a bug, where I was casting wte to bfloat16 in the wrong place (the model init) instead of in init_weights	2025-10-16 10:04:43 -07:00
Andrej Karpathy	722da4f543	trying to add basic cpu support, will try mps too	2025-10-16 16:14:38 +00:00
karpathy	3a5e0bc50b	initial commit	2025-10-13 06:49:24 -07:00

25 Commits
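
Note on the tok/sec fixes (a9de4b1038, dfc88334b6): the commit messages say the throughput metric was wrong whenever gradient accumulation was not 1. The sketch below shows one way such a metric could be computed so that every micro-step in an optimizer step is counted. It is an illustration under assumptions, not the code from the repository: the function name and all parameter names (device_batch_size, seq_len, world_size, grad_accum_steps, step_seconds) are hypothetical.

```python
def tokens_per_second(device_batch_size, seq_len, world_size,
                      grad_accum_steps, step_seconds):
    """Throughput of one optimizer step, in tokens per second.

    All names here are hypothetical and chosen for illustration only.
    step_seconds is the wall-clock time of the full optimizer step,
    i.e. it already covers all grad_accum_steps micro-steps.
    """
    if step_seconds <= 0:
        raise ValueError("step_seconds must be positive")
    # Tokens consumed by a single forward/backward pass across all ranks.
    tokens_per_micro_step = device_batch_size * seq_len * world_size
    # An optimizer step runs grad_accum_steps micro-steps, so the token
    # count scales with it. Leaving this factor out under-reports
    # throughput by exactly grad_accum_steps.
    tokens_per_step = tokens_per_micro_step * grad_accum_steps
    return tokens_per_step / step_seconds


if __name__ == "__main__":
    # 32 sequences of 2048 tokens on each of 8 ranks, 4 micro-steps, 2 s per step.
    print(tokens_per_second(32, 2048, 8, 4, 2.0))
```

With grad_accum_steps = 1 the factor drops out, which would be consistent with the bug only showing up once accumulation is enabled.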