Commit Graph

349 Commits

Author SHA1 Message Date
Jason Kneen
5a21667990
Merge 16c37b7d1d into c7ba252142 2026-02-22 15:50:32 +00:00
Jason Kneen
16c37b7d1d fix: MPS/CPU compatibility for training and inference on Mac
Resolves all crashes and silent errors when running on Apple Silicon (MPS)
or CPU after merging upstream/master. Tested on torch 2.2.2 and 2.9.1.

nanochat/engine.py
- Replace signal.SIGALRM timeout with concurrent.futures.ThreadPoolExecutor
  so use_calculator() works from FastAPI worker threads (SIGALRM is
  Unix main-thread only, silently broken in any threaded web server)
- Guard torch.cuda.synchronize() behind device_type == 'cuda'
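The SIGALRM-to-thread-pool swap above is a general pattern worth spelling out. A minimal torch-free sketch (names are illustrative, not the actual engine.py helper): a worker thread runs the function while the caller waits with a deadline, which works from any thread, unlike `signal.SIGALRM`, which only fires in the Unix main thread.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FuturesTimeout

def run_with_timeout(fn, timeout_s, *args):
    # Portable timeout usable from FastAPI worker threads, where
    # signal.SIGALRM would be silently ignored.
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn, *args)
    try:
        return future.result(timeout=timeout_s)
    except FuturesTimeout:
        return None  # treat a timeout as "no result"
    finally:
        # don't block on a still-running worker after a timeout
        pool.shutdown(wait=False)
```

Note the trade-off: the timed-out worker keeps running in the background until `fn` returns; the caller just stops waiting for it.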

nanochat/gpt.py
- Extract init_rotary_embeddings() from init_weights() so checkpoint
  loading can restore non-persistent cos/sin buffers without
  re-randomising all weights
- Cast rotary cos/sin to bfloat16 on CUDA only (MPS bfloat16 requires
  torch>=2.4; float32 used on MPS/CPU)
- Update forward() dtype assertion to match device
- Add F.rms_norm fallback for torch<2.4 (rms_norm added in 2.4)
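For reference, the normalisation the `F.rms_norm` fallback has to reproduce is `x / sqrt(mean(x^2) + eps)`. A scalar sketch of that formula over a plain list (not the tensor code in gpt.py):

```python
import math

def rms_norm(x, eps=1e-6):
    # RMSNorm without a learned scale: divide by the root of the
    # mean squared value, matching what F.rms_norm (torch>=2.4) computes.
    ms = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(ms + eps) for v in x]
```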

nanochat/optim.py
- _cuda_compile(): skip torch.compile(fullgraph=True) on MPS/CPU;
  return function unchanged so eager execution is used
- adamw_step_fused / muon_step_fused: move 0-D CPU scalar tensors to
  parameter device at start of function (cross-device ops crash in
  eager mode on MPS)
- muon_step_fused: use bfloat16 in polar express on CUDA only;
  fall back to float32 on MPS/CPU
- _step_muon: replace torch._foreach_copy_() with p.copy_(s) loop on
  non-CUDA (_foreach_copy_ not implemented on MPS in torch<2.4)

nanochat/flash_attention.py
- Probe SDPA for enable_gqa support at import time (added in torch 2.5;
  inspect.signature raises on C builtins in older Python/torch)
- Fall back to manual KV head repetition via repeat_interleave when
  enable_gqa is unavailable

nanochat/checkpoint_manager.py
- Call model.init_rotary_embeddings() instead of model.init_weights()
  after load_state_dict() to restore non-persistent rotary buffers
  without clobbering loaded weights

scripts/base_train.py
- Guard torch.compile(model) behind device_type == 'cuda'
- Set mfu = None on non-CUDA instead of computing 0/inf = 0.00%
- Handle mfu is None in end-of-run report

tests/test_mps_compat.py (new)
- 16 tests covering every fix; all pass on MPS (torch 2.2.2 and 2.9.1)
2026-02-22 15:50:11 +00:00
Jason Kneen
d6d8f1be45 Merge remote-tracking branch 'upstream/master' into cpu-mps-dev 2026-02-22 15:31:10 +00:00
Dipesh Babu
c7ba252142
docs: fix typos in experiment log (#547) 2026-02-20 08:03:45 -08:00
Andrej Karpathy
2dffdc8cf6 document MoE exploration 2026-02-19 02:53:47 +00:00
Andrej Karpathy
48804bff3a report negative result on fineweb dataset 2026-02-18 23:45:31 +00:00
Andrej Karpathy
bb5137860e fix comment 2026-02-18 23:26:22 +00:00
Andrej Karpathy
458555117b Merge branch 'Chetter2-patch-1' 2026-02-18 23:17:39 +00:00
Andrej Karpathy
bac5a35dd7 fix minor bug in fp8 application to skip tiny matmuls 2026-02-18 23:17:29 +00:00
George Shakan
ad55575326 Fix bug in setting precision (#538) 2026-02-18 15:49:18 +00:00
Sofie Van Landeghem
cac43e8511 Fix MockModel's device definition (#535)
* fix MockModel's device definition

* cleanup
2026-02-18 15:49:18 +00:00
Andrej Karpathy
f5fe7925ed update dev log with recent 2026-02-18 15:49:18 +00:00
Andrej Karpathy
1415fb7617 tune the data mixture a bit, load optimizer by default when SFT. These were confirmed to be best settings from sweeps of sft 2026-02-18 15:49:18 +00:00
Andrej Karpathy
77f8fb8303 a number of upgrades to SFT script to bring it up to date w.r.t. pretraining and tuning some of its kwargs based on sweeps 2026-02-18 15:49:18 +00:00
George Shakan
0a23f87643
Fix bug in setting precision (#538) 2026-02-18 07:42:11 -08:00
Sofie Van Landeghem
4800c62f6e
Fix MockModel's device definition (#535)
* fix MockModel's device definition

* cleanup
2026-02-17 16:03:46 -08:00
Andrej Karpathy
4a6e47b0c6 update dev log with recent 2026-02-17 15:44:54 +00:00
Andrej Karpathy
8180e1d8c1 tune the data mixture a bit, load optimizer by default when SFT. These were confirmed to be best settings from sweeps of sft 2026-02-16 20:23:04 +00:00
Andrej Karpathy
788dadeb88 a number of upgrades to SFT script to bring it up to date w.r.t. pretraining and tuning some of its kwargs based on sweeps 2026-02-16 14:41:53 +00:00
Alan
124f49be98
Removed redundant quantization of gradients 2026-02-15 15:41:33 +00:00
Alan
d9678ff0f9
Save FP8 tensors in autograd ctx instead of full-precision inputs
Store quantized input/weight and their inverse scales in _Float8Matmul ctx to avoid re-quantization in backward and reduce saved-activation memory without changing numerics.
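The quantized-value-plus-inverse-scale pair cached in the ctx can be illustrated with a toy symmetric int8-style scheme (a sketch only; real FP8 quantization differs in format and scaling details):

```python
def quantize(xs):
    # Toy symmetric quantization: keep scaled integer values plus the
    # inverse scale needed to dequantize them later (e.g. in backward).
    amax = max(abs(v) for v in xs) or 1.0
    scale = 127.0 / amax
    q = [round(v * scale) for v in xs]
    return q, 1.0 / scale

def dequantize(q, inv_scale):
    # Recover approximate full-precision values from the cached pair.
    return [v * inv_scale for v in q]
```

Caching `(q, inv_scale)` instead of the full-precision inputs is what lets backward skip re-quantization and shrinks saved-activation memory.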
2026-02-15 14:31:54 +00:00
Andrej Karpathy
2f09686724 clarify that this is bf16 mfu we're talking about 2026-02-10 23:35:00 +00:00
Andrej Karpathy
e569b59f92 delete torchao dependency, create our own exact API-matched version of Float8Linear, document it very well. for some poorly understood reason, the performance is not only ~identical but actually runs 3% faster, despite it being significantly simpler and much less code. i don't fully understand why/how atm 2026-02-10 18:46:39 +00:00
Andrej Karpathy
1ec0a34779 at 28 and above we start to need batch size 8 2026-02-08 18:26:34 +00:00
Andrej Karpathy
ff46300720 tune miniseries just a bit, fairly cosmetic, keep to even depths where the math works out nicely in model sizing 2026-02-08 17:54:12 +00:00
Andrej Karpathy
aeff095e97 better comments/flow on all the hyperparameter transfer stuff, and change the WD scaling from my empirical 1/d^2 to a bit more principled version based on Tepoch. All of that theory is based on AdamW and could be suboptimal for Muon 2026-02-06 19:22:28 +00:00
Andrej Karpathy
685271dc8d new optimal ratio for d26 training 2026-02-06 19:21:27 +00:00
Andrej Karpathy
e527521a3f briefly mention batch ramp experimentation too, too weak to merge in my few attempts 2026-02-05 22:21:03 +00:00
Andrej Karpathy
96522798f1 docs docs docs 2026-02-05 20:27:07 +00:00
Andrej Karpathy
5fdd5cdb24 new leaderboard record via new auto-calculated optimal batch size. for d26 it is 1M, up from 0.5M that was default earlier 2026-02-05 20:11:32 +00:00
Andrej Karpathy
2c062aaa94 nit: don't mutate args, create new var for total_batch_size 2026-02-05 19:59:46 +00:00
Andrej Karpathy
f41dd3cbd7 auto-calculate optimal batch size. the original setting of 0.5M was only optimal for d12, but d26 prefers 1M and so on 2026-02-05 19:40:37 +00:00
Andrej Karpathy
98eed6df18 bring back an assert guarding against bad param sizing 2026-02-05 18:14:30 +00:00
Sofie Van Landeghem
012da1a78b
Typo fixes (#480)
* small typo

* few more small fixes

* small fixes in leaderboard.md
2026-02-05 19:12:50 +01:00
Andrej Karpathy
75b302f331 fix hash commit on leaderboard and a paragraph clarification 2026-02-05 16:14:28 +00:00
Andrej Karpathy
1144d186ed try and fail relu^2 -> swiglu 2026-02-05 02:42:46 +00:00
Andrej Karpathy
d63b7ab9ac try and fail relu^2 -> swiglu 2026-02-05 02:41:46 +00:00
Andrej Karpathy
718e5e9d67 correctly reference NorMuon and fix misleading terms that i may have hastily ported over from modded-nanogpt 2026-02-05 01:39:26 +00:00
Andrej Karpathy
542beb0c8c bump speedrun to be the up to date leaderboard run 2026-02-04 02:12:04 +00:00
Andrej Karpathy
d510b1385b quick experiments to log 2026-02-03 23:21:39 +00:00
Andrej Karpathy
16b8ac7da3 oops forgot to attach leaderboard file too 2026-02-03 21:06:12 +00:00
Andrej Karpathy
fe55b092b8 minor cosmetics for the table 2026-02-03 21:05:28 +00:00
Andrej Karpathy
a67eba35dc add feb2 new leaderboard record from upgrading to fp8 training, +4.3% speedup to time to GPT-2 2026-02-03 21:03:42 +00:00
Andrej Karpathy
6079f78fc3 add fp8 training with torchao 2026-02-03 21:03:42 +00:00
Andrej Karpathy
8ebc14b348 small touchups to the eval script, re-order items etc, cosmetic 2026-02-03 21:03:42 +00:00
Sofie Van Landeghem
72b9064f9d
remove leftover mid references (#491) 2026-02-02 08:33:46 -08:00
Andrej Karpathy
b19b4f3e49 fix bug in speedrun script, batch size that doesn't OOM on 8XH100 for d24 is 16 2026-02-02 15:50:14 +00:00
Andrej Karpathy
230d6cf6c6 tune the synthetic data generation script. delete the king andrej stuff lol. also, upgrade to gemini 3 2026-02-02 01:45:59 +00:00
Andrej Karpathy
07c4dd4cd9 manually control the over-active garbage collector, save a small few minutes from a typical run 2026-02-02 01:44:30 +00:00
Andrej Karpathy
e8fec97d4c slightly more efficient dataloader that reduces the number of python objects flying around and causing strain on runtime and garbage collector 2026-02-02 01:17:30 +00:00