nanochat/scripts
Matt Langston ab89d04dca
multi-node torch.compile warmup, fixes NCCL watchdog timeouts
Trigger compilation on every rank with a dummy fwd+bwd after torch.compile, then a barrier, before the training loop begins. Guarded by ddp_world_size > 1. Without this, if one node has cached kernels from a prior run and another does not, DDP's async all-reduce lets the fast rank race ahead; the slow rank's NCCL ops then lose their peer, and the watchdog kills the job (see Edward Yang, "State of torch.compile for training", Aug 2025).

The warmup also pre-compiles the BF16 eval graph (FP8 disabled), so the recompile triggered by disable_fp8 does not happen lazily under full training-memory pressure at the first eval step (which can crash UMA systems like DGX Spark; relates to #446).
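A minimal sketch of the idea, not the actual nanochat code: the function, its signature, and the plain loss_fn call are illustrative stand-ins (nanochat's model returns the loss directly, and its eval path is where FP8 is disabled), but the shape is the same: dummy fwd+bwd, a throwaway eval pass, then a barrier so no rank enters the loop before every rank has compiled.

```python
import torch
import torch.distributed as dist

def compile_warmup(model, loss_fn, dummy_inputs, dummy_targets, ddp_world_size):
    """Force torch.compile to trace and compile on every rank up front.

    If one node has a warm compile cache and another does not, the warm
    rank's async all-reduces outrun the cold rank's NCCL collectives and
    the watchdog kills the job; compiling everything before the training
    loop and syncing on a barrier removes the race.
    """
    if ddp_world_size <= 1:
        return
    # Dummy fwd+bwd compiles the training graph (and its backward) on this rank.
    loss = loss_fn(model(dummy_inputs), dummy_targets)
    loss.backward()
    model.zero_grad(set_to_none=True)
    # Pre-compile the eval graph too, so the first eval step does not
    # trigger a lazy recompile under full training-memory pressure.
    model.eval()
    with torch.no_grad():
        model(dummy_inputs)
    model.train()
    # Barrier: no rank proceeds until every rank has finished compiling.
    if dist.is_available() and dist.is_initialized():
        dist.barrier()
```

The `is_initialized()` guard is only there to keep the sketch runnable single-process; in the multi-node path the process group is always up, so the barrier always fires.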

Small drive-bys: GB10 added to the peak-FLOPS table for MFU reporting, and mfu=0 initialized before the loop to avoid a NameError in the edge case where --resume-from-step == num_iterations.

Context: https://github.com/karpathy/nanochat/discussions/710 (the writeup was produced from my dgx-spark branch at https://github.com/matt-langston/nanochat/tree/dgx-spark, which carries these two PRs plus a DGX-Spark-Bundle-specific speedrun script I kept separate).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 19:51:22 -07:00
base_eval.py delete autocast, an unnecessary thorn in my side, manage dtypes directly 2026-03-04 23:55:30 +00:00
base_train.py multi-node torch.compile warmup, fixes NCCL watchdog timeouts 2026-04-17 19:51:22 -07:00
chat_cli.py delete autocast, an unnecessary thorn in my side, manage dtypes directly 2026-03-04 23:55:30 +00:00
chat_eval.py delete autocast, an unnecessary thorn in my side, manage dtypes directly 2026-03-04 23:55:30 +00:00
chat_rl.py delete autocast, an unnecessary thorn in my side, manage dtypes directly 2026-03-04 23:55:30 +00:00
chat_sft.py multi-node torch.compile warmup, fixes NCCL watchdog timeouts 2026-04-17 19:51:22 -07:00
chat_web.py delete autocast, an unnecessary thorn in my side, manage dtypes directly 2026-03-04 23:55:30 +00:00
tok_eval.py initial commit 2025-10-13 06:49:24 -07:00
tok_train.py fix: correct minor typos in help text, README, and comments 2026-03-12 17:03:26 +08:00