trigger compilation on every rank with a dummy fwd+bwd after torch.compile, then a barrier, before the training loop begins. guarded by ddp_world_size > 1. without this, if one node has cached kernels from a prior run and another does not, DDP's async all-reduce lets the fast rank race ahead, the slow rank's NCCL ops lose their peer, and the watchdog kills the job (see Edward Yang, "State of torch.compile for training", Aug 2025).
the warmup also pre-compiles the BF16 eval graph (FP8 disabled) so the recompile triggered by disable_fp8 does not happen lazily, under full training-memory pressure, at the first eval step (this can crash UMA systems like DGX Spark; relates to #446).
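roughly, the warmup looks like this (a sketch only; the dummy batch shape, the loss-returning forward, and the exact disable_fp8 call signature are assumptions, not the actual code):

```python
import torch
import torch.distributed as dist

# warm up torch.compile on every rank before the real loop so a rank with a
# cold kernel cache cannot fall behind DDP's async all-reduces
if ddp_world_size > 1:
    dummy_x = torch.randint(0, vocab_size, (1, seq_len), device=device)
    dummy_y = torch.randint(0, vocab_size, (1, seq_len), device=device)

    # compile and run the training (FP8) graph once, fwd + bwd
    loss = model(dummy_x, dummy_y)  # assumes the forward returns the loss
    loss.backward()
    model.zero_grad(set_to_none=True)

    # also compile the BF16 eval graph so the disable_fp8 recompile does not
    # fire lazily at the first eval step under full training-memory pressure
    with torch.no_grad(), disable_fp8(model):  # disable_fp8 signature assumed
        model(dummy_x, dummy_y)

    # every rank must finish compiling before the loop's collectives start
    dist.barrier()
```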
small drive-bys: GB10 added to the peak-FLOPS table for MFU reporting, and mfu=0 initialized before the loop to avoid a NameError in the edge case where --resume-from-step == num_iterations.
context: https://github.com/karpathy/nanochat/discussions/710 (the writeup was produced from my dgx-spark branch at https://github.com/matt-langston/nanochat/tree/dgx-spark, which carries these two PRs plus a DGX-Spark-Bundle-specific speedrun script I kept separate)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When swapping Float8Linear for Linear in the disable_fp8 context manager,
constructing the replacement with device=fp8_module.weight.device allocates
new tensors on the GPU, causing an unnecessary VRAM spike (~1GB for large
models). This fix constructs the replacement on device='meta', so no physical
memory is allocated, then swaps in a reference to the existing weight tensor,
eliminating the spike during the evaluation phase.
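Roughly, the swap looks like this (a sketch, not the exact diff; it assumes
Float8Linear exposes the same in_features/out_features/weight/bias attributes
as nn.Linear):

```python
import torch.nn as nn

def to_plain_linear(fp8_module: nn.Module) -> nn.Linear:
    # build the replacement on the meta device so no real GPU memory is
    # allocated for a weight we are about to replace anyway
    plain = nn.Linear(
        fp8_module.in_features,
        fp8_module.out_features,
        bias=fp8_module.bias is not None,
        device="meta",
    )
    # swap in references to the existing tensors instead of copying them
    plain.weight = fp8_module.weight
    if fp8_module.bias is not None:
        plain.bias = fp8_module.bias
    return plain
```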
Fixes issue #592
Co-authored-by: RoomWithOutRoof <roomwithoutroof@sparklab.ai>
The bf16 cast is intentional for speed on Hopper+ GPUs, but should be
skipped on other platforms rather than blindly applied. fp16 is unstable
here due to its limited exponent range, and fp32 platforms don't benefit
from the cast. Now: bf16 when COMPUTE_DTYPE is bf16, no cast otherwise.
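In sketch form (assuming COMPUTE_DTYPE holds a torch.dtype; where exactly the
cast sits in the code is omitted here):

```python
import torch

def maybe_cast(x: torch.Tensor) -> torch.Tensor:
    # cast only when bf16 is the configured compute dtype; fp16 is skipped on
    # purpose (its 5-bit exponent overflows too easily here) and fp32
    # platforms gain nothing from an extra cast
    if COMPUTE_DTYPE == torch.bfloat16:
        return x.to(torch.bfloat16)
    return x
```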
Inspired by PR #667.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New architectural features:
- Smear: mix the previous token's embedding into the current position via a
  learned gate, providing cheap bigram-like info (works in training and with a KV cache)
- Backout: subtract a learned fraction of a mid-layer residual before the
  logit projection to remove low-level features (both sketched below)
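Both are small residual-stream tweaks; a rough sketch of the two ideas (module
names, the gate parameterization, and where they hook in are assumptions):

```python
import torch
import torch.nn as nn

class Smear(nn.Module):
    """Mix the previous token's embedding into the current position."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Parameter(torch.zeros(dim))  # learned per-channel gate

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); shift right so position t also sees token t-1,
        # which works the same in training and with a KV cache (the previous
        # token's embedding is always available at decode time)
        prev = torch.cat([torch.zeros_like(x[:, :1]), x[:, :-1]], dim=1)
        return x + torch.sigmoid(self.gate) * prev

def backout(x_final: torch.Tensor, x_mid: torch.Tensor, alpha: torch.Tensor) -> torch.Tensor:
    # subtract a learned fraction of a mid-layer residual before the logit
    # projection, removing low-level features from the final representation
    return x_final - alpha * x_mid
```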
Hyperparameter tuning:
- Muon momentum warmdown 0.97→0.90 during the LR warmdown phase (schedule
  sketched after this list)
- Non-uniform per-layer init: resid_lambdas 1.15→1.05, x0_lambdas 0.20→0.05
- c_fc init scale 0.4x, QK norm scale 1.2, sliding window seq_len/4
- Speedrun data:params ratio reduced to 8
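For the momentum warmdown specifically, a minimal sketch assuming a linear
ramp over the LR warmdown window (warmdown_frac is a placeholder; the real
schedule may differ):

```python
def muon_momentum(step: int, num_iterations: int, warmdown_frac: float = 0.2) -> float:
    # hold momentum at 0.97, then anneal linearly to 0.90 over the LR warmdown
    warmdown_start = int(num_iterations * (1 - warmdown_frac))
    if step < warmdown_start:
        return 0.97
    t = (step - warmdown_start) / max(1, num_iterations - warmdown_start)
    return 0.97 + t * (0.90 - 0.97)
```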
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* printing the step count
* adding reply-only loss for chat
* using the mask produced by the tokeniser's render_conversation function (see the sketch after this list)
* undoing some changes
* putting back a comment that was accidentally removed; no functionality change
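A sketch of the reply-only loss, assuming render_conversation returns token
ids plus a mask that is 1 on assistant-reply tokens (names and shapes here are
illustrative, not the actual interface):

```python
import torch
import torch.nn.functional as F

def chat_loss(logits: torch.Tensor, ids: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # logits: (B, T, V) next-token predictions at each position
    # ids:    (B, T)    token ids from tokenizer.render_conversation(...)
    # mask:   (B, T)    1 where the token belongs to an assistant reply, else 0
    targets = ids[:, 1:].clone()
    targets[mask[:, 1:] == 0] = -1  # ignore everything outside assistant replies
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=-1,
    )
```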