Remove NextAuth and replace with token-based auth against the backend
auth service (OAuth + JWT). The frontend now redirects login to
/api/auth/google and /api/auth/github (proxied by nginx to the auth
service), captures the JWT from the redirect query param, and uses it
for all API calls.
Key changes:
- Remove next-auth dependency and all NextAuth config/routes
- Add lib/auth-client.ts (JWT token storage + auth headers)
- Add hooks/useAuth.ts (client-side auth state + token capture)
- Rewrite middleware.ts to pass-through (client-side auth only)
- Login page uses plain <a> links to /api/auth/{provider}
- Chat page captures access_token from OAuth redirect
- Zustand store fetches conversations from real chat-api via JWT
- API routes proxy /api/conversations/* to chat-api with auth
- chat/stream route supports conversationId + auth header forwarding
- useSSE hook accepts auth headers for authenticated streaming
- Sidebar loads conversations from API, supports delete
- Landing page (Hero, LandingNav) uses useAuth instead of useSession
- Add .env.production.example and scripts/generate-jwt-keys.sh
Mock echo fallback preserved when CHAT_API_URL is not set.
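The token flow above (capture access_token from the OAuth redirect, attach it to every API call) can be sketched in Python; this mirrors what lib/auth-client.ts does on the frontend. The `access_token` query param name comes from the commit; the Bearer scheme is an assumption.

```python
# Hedged sketch of the JWT capture + auth-header flow (not the actual
# TypeScript implementation in lib/auth-client.ts).
from urllib.parse import urlparse, parse_qs

def capture_token(redirect_url: str):
    """Extract the access_token query param from the OAuth redirect."""
    qs = parse_qs(urlparse(redirect_url).query)
    return qs.get("access_token", [None])[0]

def auth_headers(token: str) -> dict:
    """Headers attached to every chat-api request (Bearer scheme assumed)."""
    return {"Authorization": f"Bearer {token}"}
```

Usage: `auth_headers(capture_token("https://app.example/chat?access_token=eyJ..."))`.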
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds tooling and documentation for Day 2 cluster operations:
- scripts/rotate-nodes.sh: interactive node-rotation driver that applies
terraform to pick up the latest SSM-resolved EKS AMI and watches the
rolling replacement.
- scripts/demo-schema-change.sh: end-to-end demo of the zero-downtime
is_favorited column migration via helm upgrade + migration hook.
- scripts/verify-deployment.sh: post-deploy health check across pods,
per-service HTTP health endpoints, rollout status, and PDBs.
- docs/chaos-runbook.md: failure-mode playbook with simulate / Grafana /
Loki / recovery steps for six scenarios (pod kill, node failure, DB
pool exhaustion, inference OOM, high latency, SSL issues) plus a
Loki quick-reference.
- terraform/modules/eks: expose current_node_ami_id output, add
update_config.max_unavailable_percentage (configurable, default 33)
so node-group rolls are controlled.
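The per-service health sweep in scripts/verify-deployment.sh boils down to probing each service's health endpoint and collecting failures; a minimal sketch, assuming a `/healthz` path and hypothetical service names (neither is taken from the repo):

```python
# Hedged sketch of the post-deploy health sweep; service names and the
# /healthz path are illustrative assumptions.
from urllib.request import urlopen
from urllib.error import URLError

SERVICES = ["chat-api", "auth", "inference"]  # hypothetical names

def health_status(base_url: str, services: list) -> dict:
    """Return {service: healthy?} by probing each health endpoint."""
    status = {}
    for svc in services:
        try:
            with urlopen(f"{base_url}/{svc}/healthz", timeout=5) as resp:
                status[svc] = resp.status == 200
        except URLError:
            status[svc] = False
    return status

def unhealthy(status: dict) -> list:
    """Services that failed the probe, for the script's exit report."""
    return [svc for svc, ok in status.items() if not ok]
```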
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Landing page with desi street-food aesthetic: lemon-mirchi toran with
pendulum animation, dual-script hero (Devanagari + English cursive),
samosa illustration with floating animation, brass chai kettle with
steam wisps, ambient chilli/lemon doodles.
Chat page carries the warm samosa-chaat palette with cream/gold user
bubbles, steam-wisp typing indicator, and WebGPU integration hooks
(window.samosaChaat API for local inference mode switching).
Added scripts/export_onnx.py for ONNX model export with KV cache
support, targeting WebGPU browser inference.
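The KV-cache contract that export_onnx.py has to bake into the exported graph can be sketched with plain lists standing in for tensors: each decode step receives the cached keys/values from prior steps as inputs and emits the extended cache as outputs, so the browser runtime can feed it back in. The real export wires this up as past/present key-value inputs and outputs on the ONNX model (the exact tensor names are assumptions).

```python
# Toy sketch of one autoregressive decode step with a KV cache;
# plain lists stand in for tensors.
def decode_step(new_key, new_value, past_keys=None, past_values=None):
    """Append this token's K/V to the cache and return the new cache."""
    present_keys = (past_keys or []) + [new_key]
    present_values = (past_values or []) + [new_value]
    # attention for this step would read over all of present_keys
    return present_keys, present_values
```

After two steps the cache holds both tokens' keys and values, so attention never recomputes earlier positions.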
Credit to Andrej Karpathy's nanochat in footer.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New architectural features:
- Smear: mix previous token embedding into current position via learned
gate, providing cheap bigram-like info (works in training + KV cache)
- Backout: subtract learned fraction of mid-layer residual before logit
projection to remove low-level features
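A toy sketch of the two features, with plain lists standing in for tensors and a single scalar where the real model learns parameters (the scalar gate and fraction here are illustrative assumptions, not the actual parameterization):

```python
# Hedged sketches of Smear and Backout; lists stand in for tensors.
def smear(embs, gate):
    """Mix gate * (previous token's raw embedding) into each position.
    Position 0 has no predecessor; only the previous token is needed,
    so this is causal and works during KV-cache decoding too."""
    out, prev = [], None
    for e in embs:
        out.append(e if prev is None
                   else [x + gate * p for x, p in zip(e, prev)])
        prev = e  # keep the raw embedding, not the mixed one
    return out

def backout(final_resid, mid_resid, frac):
    """Subtract a learned fraction of the mid-layer residual before the
    logit projection, removing low-level features from the final state."""
    return [f - frac * m for f, m in zip(final_resid, mid_resid)]
```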
Hyperparameter tuning:
- Muon momentum warmdown 0.97→0.90 during LR warmdown phase
- Non-uniform per-layer init: resid_lambdas 1.15→1.05, x0_lambdas 0.20→0.05
- c_fc init scale 0.4x, QK norm scale 1.2, sliding window seq_len/4
- Speedrun data:params ratio reduced to 8
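The momentum warmdown can be read as a schedule coupled to the LR warmdown phase; a linear ramp from 0.97 to 0.90 is one plausible shape (the linearity and the step bookkeeping here are assumptions for illustration):

```python
# Hedged sketch: Muon momentum held at 0.97, then eased to 0.90 across
# the LR warmdown phase (linear shape assumed).
def muon_momentum(step, warmdown_start, total_steps, hi=0.97, lo=0.90):
    if step <= warmdown_start:
        return hi
    frac = (step - warmdown_start) / (total_steps - warmdown_start)
    return hi + frac * (lo - hi)
```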
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* print the step count
* add reply-only loss for chat
* use the mask returned by the tokenizer's render_conversation function
* undo some changes
* restore a comment that was accidentally removed; no functionality change