Implement weight tying between token embeddings and lm_head to reduce
parameter count. When enabled, logits are scaled by 1/√d_model, lm_head
zeroing is skipped, and optimizer groups are deduplicated. Parameter counting
uses unique parameters, while the Chinchilla ratio calculation adds back the
would-be lm_head size so runs remain comparable.
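A minimal sketch of the mechanism, assuming a GPT-style module with a `wte` embedding and an untied-by-default `lm_head` (names and structure are illustrative, not the repo's exact code):

```python
import math
import torch
import torch.nn as nn

class TiedLMHead(nn.Module):
    """Sketch: token embedding whose weight tensor is shared with the output
    projection. Logits are scaled by 1/sqrt(d_model) to keep their magnitude
    reasonable without the usual zero-initialized lm_head."""

    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.d_model = d_model
        self.wte = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        self.lm_head.weight = self.wte.weight  # tie: one shared tensor

    def logits(self, hidden: torch.Tensor) -> torch.Tensor:
        return self.lm_head(hidden) / math.sqrt(self.d_model)

model = TiedLMHead(vocab_size=64, d_model=16)

# Parameter counting (and optimizer group construction) must count the
# shared tensor exactly once:
unique = {id(p): p for p in model.parameters()}
n_params = sum(p.numel() for p in unique.values())
assert n_params == 64 * 16  # embedding and lm_head share one matrix
```

Note that `nn.Module.parameters()` already deduplicates shared tensors; the explicit `id()`-keyed dict just makes the invariant visible.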
Also adds boolean flag parsing (--flag without =value) to the configurator,
an auto-derived log_every interval, and minor shell script fixes.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
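The boolean flag parsing added to the configurator can be sketched as follows (a standalone illustration with hypothetical option names, not the actual `configurator.py` code):

```python
def parse_overrides(argv, config):
    """Sketch of --key=value override parsing where a bare --flag (no =value)
    sets a boolean option to True. `config` maps known option names to their
    current, typed default values."""
    for arg in argv:
        assert arg.startswith("--"), f"unrecognized argument: {arg}"
        if "=" in arg:
            key, val = arg[2:].split("=", 1)
        else:
            key, val = arg[2:], "True"  # bare --flag means True
        assert key in config, f"unknown option: {key}"
        default = config[key]
        if isinstance(default, bool):
            config[key] = val in ("1", "true", "True")
        else:
            config[key] = type(default)(val)  # cast to the default's type
    return config

cfg = parse_overrides(["--weight_tying", "--lr=0.02"],
                      {"weight_tying": False, "lr": 0.1})
# cfg == {"weight_tying": True, "lr": 0.02}
```

The `isinstance(default, bool)` check must come before the generic cast, since `bool("False")` would otherwise evaluate truthy.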
- Introduced `MOE_DEBUG_INTERVAL` parameter in `runmps.sh` to control debug logging frequency during training.
- Enhanced `base_train.py` to log gradients of the routed and shared expert weights at the configured interval, to aid in diagnosing training behavior.
- Updated `gpt.py` to adjust router bias calculations, improving load balancing among experts.
- Added unit tests in `test_moe.py` to validate the behavior of the MoE implementation and ensure correctness of gradient calculations.
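The exact bias adjustment in `gpt.py` isn't reproduced here; one common bias-based balancing scheme (in the spirit of auxiliary-loss-free load balancing) nudges a per-expert routing bias against each expert's recent load, as in this hedged sketch:

```python
import torch

def update_router_bias(bias, expert_load, lr=0.01):
    """Sketch: lower the routing bias of overloaded experts and raise it for
    underused ones, steering future top-k selection toward balance.
    `expert_load` holds the fraction of tokens each expert just handled."""
    target = 1.0 / bias.numel()  # ideal uniform share per expert
    with torch.no_grad():
        bias -= lr * torch.sign(expert_load - target)
    return bias

bias = torch.zeros(4)
load = torch.tensor([0.55, 0.25, 0.10, 0.10])  # expert 0 is overloaded
bias = update_router_bias(bias, load)
# expert 0's bias decreases; experts 2 and 3 increase; expert 1 is untouched
```

The bias only influences routing decisions, not the mixing weights, so it shifts load without distorting expert outputs.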
- Introduced parameters for Mixture of Experts (MoE) in `runmps.sh`, `base_train.py`, and `gpt.py`, allowing for dynamic configuration of experts during training.
- Enhanced `gpt.py` with new classes `MoEFeedForward` and `ExpertFFN` to implement MoE functionality in the model architecture.
- Updated `configurator.py` to handle type conversions for new MoE parameters.
- Improved logging in `base_train.py` to include MoE-related metrics and configurations during training.
- Added assertions and derived defaults for MoE parameters to ensure valid configurations.
- Implemented methods to estimate and log FLOPs for both dense and MoE active configurations during training.
- Enhanced gradient handling in `muon.py` to accommodate potential absence of gradients for unused experts.
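The class names `MoEFeedForward` and `ExpertFFN` come from the commit; their internals below are an illustrative sketch of top-k routing with an always-on shared expert, not the repo's exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExpertFFN(nn.Module):
    """One expert: a plain two-layer feed-forward block."""
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden, bias=False)
        self.down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x):
        return self.down(F.relu(self.up(x)))

class MoEFeedForward(nn.Module):
    """Sketch: route each token to its top-k of n_experts, plus a shared
    expert that sees every token."""
    def __init__(self, d_model, d_hidden, n_experts=4, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            ExpertFFN(d_model, d_hidden) for _ in range(n_experts))
        self.shared = ExpertFFN(d_model, d_hidden)

    def forward(self, x):                       # x: (tokens, d_model)
        scores = self.router(x)                 # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = self.shared(x)                    # shared expert: every token
        for e, expert in enumerate(self.experts):
            tok, slot = (idx == e).nonzero(as_tuple=True)
            if tok.numel():                     # unused experts get no grad
                contrib = weights[tok, slot].unsqueeze(-1) * expert(x[tok])
                out = out.index_add(0, tok, contrib)
        return out

moe = MoEFeedForward(d_model=8, d_hidden=16)
y = moe(torch.randn(5, 8))
assert y.shape == (5, 8)
```

The `if tok.numel()` guard is why the Muon change above matters: an expert that receives no tokens in a step contributes nothing to the graph, so its parameters arrive at the optimizer with `grad is None`.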
- Added model tagging functionality to `runmps.sh`, allowing for dynamic model tagging based on the W&B run name.
- Updated `base_train.py`, `mid_train.py`, and `chat_sft.py` to utilize model tags for checkpoint management.
- Enhanced `base_eval.py` to accept model tags for loading models during evaluation.
- Improved handling of model tags to ensure proper checkpoint directory naming and logging.
- Introduced `kv_head_mult` to control the number of query heads sharing a key/value head in `base_train.py`, `mid_train.py`, and `runmps.sh`.
- Updated logging to include global tokens-per-second metrics during training.
- Added assertions to ensure `kv_head_mult` is valid and properly integrated into model calculations.
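One way the `kv_head_mult` arithmetic and its assertion can look (illustrative; the actual derivation lives in `base_train.py`):

```python
def derive_kv_heads(n_head: int, kv_head_mult: int) -> int:
    """Sketch: kv_head_mult query heads share each key/value head
    (grouped-query attention). kv_head_mult=1 recovers standard
    multi-head attention."""
    assert kv_head_mult >= 1 and n_head % kv_head_mult == 0, \
        "kv_head_mult must evenly divide the number of query heads"
    return n_head // kv_head_mult

assert derive_kv_heads(n_head=8, kv_head_mult=1) == 8  # plain MHA
assert derive_kv_heads(n_head=8, kv_head_mult=4) == 2  # 4 q-heads per kv-head
```

Fewer key/value heads shrink the KV cache and the attention projection parameters by the same factor, which is the usual motivation for GQA.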
- Added `dev/runmps_evals.sh` for evaluating checkpoints and logging results to W&B.
- Introduced `dev/runmps.sh` for orchestrating training stages with W&B support.
- Updated `.gitignore` to include `wandb/` and `.runmps_wandb_ids`.
- Marked `dev/runcpu.sh` as executable.
- Enhanced existing scripts to log metrics to W&B during training and evaluation.