nanochat

mirror of https://github.com/karpathy/nanochat.git synced 2026-06-15 18:49:10 +00:00

Author	SHA1	Message	Date
Santhosh Kumar Ravindran	29dcbc92e2	Merge `ac5927e158` into `4e1694cc95`	2026-03-24 22:14:00 -07:00
Andrej Karpathy	4e1694cc95	bunch of ideas tried from openai/parameter-golf, all negative results for nanochat	2026-03-24 22:13:13 +00:00
Andrej Karpathy	1cd94d768f	bump D:N ratio to 12 per recent scaling laws re-run	2026-03-24 19:25:50 +00:00
Andrej Karpathy	c16db281ff	fix small bug with params logging and batch size	2026-03-24 19:25:34 +00:00
Andrej Karpathy	5019accc5b	fix scaling laws scripts after the bigram embeddings were removed	2026-03-17 16:55:56 +00:00
Andrej Karpathy	1b1cc3c599	submit new time to GPT-2 leaderboard entry: 99 minutes	2026-03-14 17:15:01 +00:00
Andrej Karpathy	a825e63f81	Autoresearch round 2: smear, backout, and hyperparameter tuning New architectural features: - Smear: mix previous token embedding into current position via learned gate, providing cheap bigram-like info (works in training + KV cache) - Backout: subtract learned fraction of mid-layer residual before logit projection to remove low-level features Hyperparameter tuning: - Muon momentum warmdown 0.97→0.90 during LR warmdown phase - Non-uniform per-layer init: resid_lambdas 1.15→1.05, x0_lambdas 0.20→0.05 - c_fc init scale 0.4x, QK norm scale 1.2, sliding window seq_len/4 - Speedrun data:params ratio reduced to 8 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-14 17:03:06 +00:00
Andrej Karpathy	f068604948	new leaderboard entry coming from improvements of autoresearch round 1, time to gpt-2 from 2.02 hours to 1.80 hours	2026-03-10 06:26:39 +00:00
Andrej Karpathy	6ed7d1d82c	All of these improvements were developed by Claude running autonomously over ~2 days using autoresearch. I didn't touch anything - incredible. All tuning was done on d12 but generalized easily to larger models (e.g. d24 in particular). This means we will also get a new "Time to GPT-2" Leaderboard entry, which I will push separately. Optimizer & schedule changes: - Increase unembedding LR 0.004 -> 0.008, weight decay 0.2 -> 0.28 - Per-group Adam betas and weight decay (instead of shared global betas) - Muon beta2 0.95 -> 0.9, momentum warmup target 0.95 -> 0.97 over 400 steps - Warmup: ratio-based -> absolute steps (default 40) - Warmdown ratio 0.5 -> 0.65, final LR fraction 0.0 -> 0.05 - Weight decay schedule: linear -> cosine decay - Polar express norm factor 1.02 -> 1.01 Architecture & init changes: - VE gate: channels 32 -> 12, scale range 2x -> 3x, init small positive - Add post-QK-norm scaling (q,k *= 1.15) for sharper attention - Embedding init std 1.0 -> 0.8, MLP c_fc init 0.5x smaller - RoPE base theta 10K -> 100K - Short attention window: seq_len/2 -> ~seq_len/3 (ceil to 128 tile) - Logit softcap 20 -> 15	2026-03-09 20:45:17 +00:00
santhoshravindran7	ac5927e158	security: add rate limiting, CORS fix, stats key guard, log redaction, macOS memory limits H-2 (High) — scripts/chat_web.py Fix CORS misconfiguration: remove allow_credentials=True (incompatible with wildcard origin) and restrict allow_methods/allow_headers to the minimum required set (GET, POST / Content-Type, X-Stats-Key). M-5 (Medium) — scripts/chat_web.py Add sliding-window rate limiter on /chat/completions keyed by client IP. Implemented without additional dependencies using asyncio + defaultdict. Configurable via NANOCHAT_RATE_LIMIT and NANOCHAT_RATE_WINDOW env vars (defaults: 10 requests per 60 seconds). M-1 (Medium) — scripts/chat_web.py Protect /health and /stats with an optional API key dependency. When NANOCHAT_STATS_KEY env var is set, both endpoints require the value in the X-Stats-Key header. Uses secrets.compare_digest to prevent timing attacks. No-op when env var is unset (backwards compatible). M-4 (Medium) — scripts/chat_web.py Redact full conversation content from server logs. User message bodies are no longer logged at INFO level; only message count and a 120-char preview at DEBUG level. Assistant response logs now record character count only, not content. L-2 (Low) — nanochat/execution.py Enforce memory limits on macOS in the code execution sandbox. Previously the entire resource limit block was skipped on Darwin with a comment 'seem to fail'. RLIMIT_AS is indeed unsupported on macOS, but RLIMIT_DATA is. Linux now uses both RLIMIT_AS and RLIMIT_DATA; macOS uses RLIMIT_DATA. Both paths are guarded by a None check. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-03-08 23:19:32 -07:00
santhoshravindran7	3fa394c93f	security: fix unsafe deserialization, XSS, HTTPS enforcement, and temp file race Five targeted security fixes — all non-breaking, no behaviour change on the happy path. H-1 (High) — nanochat/checkpoint_manager.py Add weights_only=True to all three torch.load() calls. torch.load() uses pickle by default; loading a malicious .pt file from an untrusted source allows arbitrary code execution. weights_only=True restricts deserialization to tensors and primitives, blocking this attack surface. Refs: https://pytorch.org/docs/stable/generated/torch.load.html H-3 (High) — nanochat/ui.html Replace innerHTML injection with createElement + textContent for error display. error.message was interpolated directly into innerHTML, creating an XSS sink: a crafted server error response could inject and execute arbitrary JavaScript. textContent escapes all HTML entities, closing the injection path. L-1 (Low) — scripts/chat_web.py Fix misleading role validation error message. The error string claimed 'system' was a valid role, but the guard only accepts 'user' and 'assistant'. Corrected to reflect the actual allowed values. M-3 (Medium) — nanochat/common.py Reject non-HTTPS URLs in download_file_with_lock(). urlopen() follows redirects including HTTPS->HTTP downgrades, enabling MITM attacks on downloaded model/tokenizer files. Added an explicit scheme check that raises ValueError for any non-HTTPS URL before the request is made. L-3 (Low) — nanochat/dataset.py Replace predictable .tmp suffix with tempfile.NamedTemporaryFile. The previous filepath + '.tmp' naming caused a TOCTOU race when multiple worker processes downloaded the same shard concurrently, and is vulnerable to symlink attacks on shared filesystems. NamedTemporaryFile generates a unique path; os.replace() provides an atomic rename on POSIX. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-03-08 23:12:50 -07:00
Andrej Karpathy	1076f97059	delete autocast, an unnecessary thorn in my side, manage dtypes directly	2026-03-04 23:55:30 +00:00
Sofie Van Landeghem	752abc836e	Ensure that inputs and targets are contiguous (#569) * call reshape instead of view in case the tensors are not contiguous * fix directly in data loader instead	2026-03-04 13:58:27 -08:00
Andrej Karpathy	4b4077425b	Document new Leaderboard entry congrats @ddudek for pointing out ClimbMix, time to GPT-2 is now 2.01 hours, down from 2.76 previously	2026-03-04 20:02:07 +00:00
Andrej Karpathy	324e69c45d	big, breaking change but large upside: swap previous FineWeb-EDU dataset to NVIDIA ClimbMix dataset. Requires people to download the data shards. The upside is that training GPT-2 capability model now only takes ~2 hours, down from 2.76 hours, so this is a huge win data-wise	2026-03-04 19:47:12 +00:00
Andrej Karpathy	b07604ebaa	document the legacy fineweb100b dataset and the new climbmix400b dataset	2026-03-03 17:24:31 +00:00
Andrej Karpathy	aba30cb037	tune logit softcap?	2026-03-03 00:38:53 +00:00
Anish	83dccc20ae	Restore completion-only loss masking in SFT dataloader (#582) * printing steps count * adding reply only loss for chat * using the mask by render_conversation function of tokeniser * undoing some changes * putting back the comment which got removed accidentally, no functionality change	2026-03-02 16:37:47 -08:00
Dipesh Babu	c7ba252142	docs: fix typos in experiment log (#547)	2026-02-20 08:03:45 -08:00
Andrej Karpathy	2dffdc8cf6	document MoE exploration	2026-02-19 02:53:47 +00:00
Andrej Karpathy	48804bff3a	report negative result on fineweb dataset	2026-02-18 23:45:31 +00:00
Andrej Karpathy	bb5137860e	fix comment	2026-02-18 23:26:22 +00:00
Andrej Karpathy	458555117b	Merge branch 'Chetter2-patch-1'	2026-02-18 23:17:39 +00:00
Andrej Karpathy	bac5a35dd7	fix minor bug in fp8 application to skip tiny matmuls	2026-02-18 23:17:29 +00:00
George Shakan	ad55575326	Fix bug in setting precision (#538)	2026-02-18 15:49:18 +00:00
Sofie Van Landeghem	cac43e8511	Fix MockModel's device definition (#535) * fix MockModel's device definition * cleanup	2026-02-18 15:49:18 +00:00
Andrej Karpathy	f5fe7925ed	update dev log with recent	2026-02-18 15:49:18 +00:00
Andrej Karpathy	1415fb7617	tune the data mixture a bit, load optimizer by default when SFT. These were confirmed to be best settings from sweeps of sft	2026-02-18 15:49:18 +00:00
Andrej Karpathy	77f8fb8303	a number of upgrades to SFT script to bring it up to date w.r.t. pretraining and tuning some of its kwargs based on sweeps	2026-02-18 15:49:18 +00:00
George Shakan	0a23f87643	Fix bug in setting precision (#538)	2026-02-18 07:42:11 -08:00
Sofie Van Landeghem	4800c62f6e	Fix MockModel's device definition (#535) * fix MockModel's device definition * cleanup	2026-02-17 16:03:46 -08:00
Andrej Karpathy	4a6e47b0c6	update dev log with recent	2026-02-17 15:44:54 +00:00
Andrej Karpathy	8180e1d8c1	tune the data mixture a bit, load optimizer by default when SFT. These were confirmed to be best settings from sweeps of sft	2026-02-16 20:23:04 +00:00
Andrej Karpathy	788dadeb88	a number of upgrades to SFT script to bring it up to date w.r.t. pretraining and tuning some of its kwargs based on sweeps	2026-02-16 14:41:53 +00:00
Alan	124f49be98	Removed redundant quantization of gradients	2026-02-15 15:41:33 +00:00
Alan	d9678ff0f9	Save FP8 tensors in autograd ctx instead of full-precision inputs Store quantized input/weight and their inverse scales in _Float8Matmul ctx to avoid re-quantization in backward and reduce saved-activation memory without changing numerics.	2026-02-15 14:31:54 +00:00
Andrej Karpathy	2f09686724	clarify that this is bf16 mfu we're talking about	2026-02-10 23:35:00 +00:00
Andrej Karpathy	e569b59f92	delete torchao dependency, create our own exact API-matched version of Float8Linear, document it very well. for some poorly understood reason, the performance is not only ~identical but actually runs 3% faster, despite it being significantly simpler and much less code. i don't fully understand why/how atm	2026-02-10 18:46:39 +00:00
Andrej Karpathy	1ec0a34779	at 28 and above we start to need batch size 8	2026-02-08 18:26:34 +00:00
Andrej Karpathy	ff46300720	tune miniseries just a bit, fairly cosmetic, keep to even depths where the math works out nicely in model sizing	2026-02-08 17:54:12 +00:00
Andrej Karpathy	aeff095e97	better comments/flow on all the hyperparameter transfer stuff, and change the WD scaling from my empirical 1/d^2 to a bit more principled version based on Tepoch. All of that theory is based on AdamW and could be suboptimal for Muon	2026-02-06 19:22:28 +00:00
Andrej Karpathy	685271dc8d	new optimal ratio for d26 training	2026-02-06 19:21:27 +00:00
Andrej Karpathy	e527521a3f	briefly mention batch ramp experimentation too, too weak to merge in my few attempts	2026-02-05 22:21:03 +00:00
Andrej Karpathy	96522798f1	docs docs docs	2026-02-05 20:27:07 +00:00
Andrej Karpathy	5fdd5cdb24	new leaderboard record via new auto-calculated optimal batch size. for d26 it is 1M, up from 0.5M that was default earlier	2026-02-05 20:11:32 +00:00
Andrej Karpathy	2c062aaa94	nit: don't mutate args, create new var for total_batch_size	2026-02-05 19:59:46 +00:00
Andrej Karpathy	f41dd3cbd7	auto-calculate optimal batch size. the original setting of 0.5M was only optimal for d12, but d26 prefers 1M and so on	2026-02-05 19:40:37 +00:00
Andrej Karpathy	98eed6df18	bring back an assert guarding against bad param sizing	2026-02-05 18:14:30 +00:00
Sofie Van Landeghem	012da1a78b	Typo fixes (#480) * small typo * few more small fixes * small fixes in leaderboard.md	2026-02-05 19:12:50 +01:00
Andrej Karpathy	75b302f331	fix hash commit on leaderboard and a paragraph clarification	2026-02-05 16:14:28 +00:00


358 Commits