vivek varikuti
f4162c9daf
Merge 55eb345515 into 7808dc7159
2026-03-26 06:26:41 +08:00
Andrej
a4ed96687b
Merge pull request #634 from 2bitbit/fix-docs-and-comments
fix: correct minor typos in help text, README, and comments
2026-03-25 14:31:49 -07:00
vivekvar-dl
55eb345515
fix: prevent NaN loss in SFT with fully-masked micro-batches (#590)
## Problem
When running SFT with a small device batch size (≤8), fully-masked micro-batches
cause NaN loss from step 1, permanently corrupting gradients. This happens when
a micro-batch contains only 'User' tokens (all targets = -1), which is especially
common with small batch sizes on consumer GPUs.
Root cause: torch.nn.functional.cross_entropy with reduction='mean' returns NaN
when all labels are -1 (division by zero in mean computation).
## Solution
Added validation in the training loop to detect and skip fully-masked batches:
- Check (y != -1).any() before computing loss
- Skip backward() for batches with no valid targets (zero gradient contribution)
- Track skipped batches and warn user if >5% in first 100 steps
- Log skipped batches as loss=0 for transparency
## Testing
- Added comprehensive test suite (test_sft_masked_batches.py)
- Tests cover: fully masked, partially masked, and unmasked batches
- Documents cross_entropy behavior with ignore_index=-1
- Validates the fix logic
## Impact
- Fixes #590: NaN loss with small batch sizes
- No performance impact for normal batches
- Helps users on consumer GPUs (RTX 3060, etc.)
- Prevents silent gradient corruption
Resolves #590
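The guard described above can be sketched in plain Python (a minimal illustration of the logic, not the actual nanochat training-loop code; `cross_entropy` with `reduction='mean'` divides the summed loss by the number of non-ignored targets, which is 0/0 = NaN when everything is masked):

```python
import math

IGNORE_INDEX = -1  # targets equal to -1 are masked out of the loss

def mean_loss(targets, per_token_losses):
    # Mimics cross_entropy(reduction='mean'): sum over valid targets / count.
    total, count = 0.0, 0
    for t, l in zip(targets, per_token_losses):
        if t != IGNORE_INDEX:
            total += l
            count += 1
    return total / count if count else float("nan")  # 0/0 -> NaN

def safe_step(targets, per_token_losses):
    # The fix: detect a fully-masked micro-batch before computing the loss,
    # skip backward() for it, and log it as loss=0 for transparency.
    if all(t == IGNORE_INDEX for t in targets):   # i.e. not (y != -1).any()
        return 0.0, True                          # (logged loss, skipped)
    return mean_loss(targets, per_token_losses), False

print(safe_step([-1, -1, -1], [0.5, 0.2, 0.9]))  # fully masked -> (0.0, True)
print(safe_step([3, -1, 7], [0.5, 0.2, 0.9]))    # mean over the 2 valid targets
```

Without the guard, `mean_loss` on a fully-masked batch returns NaN, which then poisons every subsequent gradient update.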
2026-03-22 18:15:40 +00:00
Mathieu Lacage
a641b6ca96
MMLU main split is named auxiliary_train, not train
2026-03-13 13:19:10 +01:00
2bitbit
2bb93b2ae4
fix: correct minor typos in help text, README, and comments
2026-03-12 17:03:26 +08:00
Andrej Karpathy
1076f97059
delete autocast, an unnecessary thorn in my side, manage dtypes directly
2026-03-04 23:55:30 +00:00
Sofie Van Landeghem
752abc836e
Ensure that inputs and targets are contiguous (#569)
* call reshape instead of view in case the tensors are not contiguous
* fix directly in data loader instead
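The distinction these bullets rely on can be shown with a small PyTorch snippet (illustrative only; the tensor names are made up): `.view` requires contiguous memory, while `.reshape` copies when it has to, and the PR ultimately made the dataloader's tensors contiguous at the source.

```python
import torch

x = torch.arange(6).reshape(2, 3)
xt = x.t()                      # transpose is a view: non-contiguous memory
assert not xt.is_contiguous()

try:
    xt.view(-1)                 # view needs contiguous memory -> RuntimeError
except RuntimeError as e:
    print("view failed:", e)

flat = xt.reshape(-1)           # reshape copies when necessary
# or fix at the source, as the PR did, so .view keeps working downstream:
flat2 = xt.contiguous().view(-1)
assert torch.equal(flat, flat2)
```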
2026-03-04 13:58:27 -08:00
Anish
83dccc20ae
Restore completion-only loss masking in SFT dataloader (#582)
* printing steps count
* adding reply-only loss for chat
* using the mask returned by the tokeniser's render_conversation function
* undoing some changes
* putting back the comment which got removed accidentally; no functionality change
2026-03-02 16:37:47 -08:00
Andrej Karpathy
8180e1d8c1
tune the data mixture a bit, load the optimizer by default when doing SFT. These were confirmed to be the best settings from SFT sweeps
2026-02-16 20:23:04 +00:00
Andrej Karpathy
788dadeb88
a number of upgrades to the SFT script to bring it up to date w.r.t. pretraining, and tuning of some of its kwargs based on sweeps
2026-02-16 14:41:53 +00:00
Andrej Karpathy
8b4849d548
fix bug in chat_sft, the attention window must be preserved sigh
2026-02-01 20:58:44 +00:00
Andrej Karpathy
eaf49a33c8
fix path which i think was modified during the refactor and this is a bug introduced by claude i believe
2026-02-01 20:15:19 +00:00
Andrej Karpathy
1ddaad1c1c
nuke midtraining from orbit, it's not as needed now that we have a BOS-aligned dataloader. Also change the README a lot. midtraining is not yet fully erased across the board, but good enough for step 1
2026-01-31 19:12:25 +00:00
Andrej Karpathy
41bb2eac32
Combine AdamW and Muon into single MuonAdamW optimizer, cleaner, ty @chrisjmccormick for idea/help
2026-01-29 00:52:08 +00:00
Andrej Karpathy
7312ec9898
fix buggy midtrain and update all kwargs to be idiomatic, that is: argparse uses dashes, variables use underscores. The underscores are just a remnant of the previous Configurator object. This is the right way
2026-01-13 22:45:27 +00:00
Andrej Karpathy
eb7bbc1b66
delete the configurator in favor of argparse and clean up a lot of kwarg details to make them more consistent across all scripts
2026-01-04 19:14:23 +00:00
Andrej
088726aa7d
clean up model_tag handling across scripts a bit more.
2025-12-27 20:01:09 -08:00
Andrej Karpathy
2874eda59a
update to new os env var to get rid of deprecation warning
2025-12-28 03:32:46 +00:00
duwenjie
92c6654b95
bugfix save and load ckpt from model_tag dir
2025-12-21 15:07:04 +08:00
Eric Silberstein
f37d45c21f
remove unneeded iter()
2025-11-20 15:14:56 -05:00
svlandeg
70319851fc
fix typo
2025-10-29 19:48:34 +01:00
Andrej Karpathy
8892470f29
add the SpellingBee task so that nanochat can count the r's in strawberry etc. Along the way we had to add a bunch of new functionality, e.g. extend the calculator to support Python's count function. Possibly the current TaskMixture uses way too many synthetic examples of SpellingBee, because the eval gives us exactly 100% performance on spelling. We can tune this later to reclaim some wall clock time here I think
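The count operation the commit refers to is just Python's `str.count`; a trivial illustration (not nanochat's calculator tool itself, whose interface isn't shown here):

```python
def count(haystack: str, needle: str) -> int:
    # mirrors Python's str.count, the operation the calculator was extended with
    return haystack.count(needle)

print(count("strawberry", "r"))  # -> 3
```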
2025-10-24 14:02:48 +00:00
Andrej Karpathy
5bdc99abfb
merge and resolve conflict
2025-10-21 17:19:10 +00:00
Andrej Karpathy
fe5aed940b
add personality to nanochat. breaks previous code on git pull and requires download of a new file from s3, but there is a helpful error message so hopefully it's ok
2025-10-21 15:04:58 +00:00
karpathy
2e9669e03a
upgrading all other files to be able to use cpu/mps as well as cuda. various other minor changes, e.g. changing max_iterations to num_iterations in the sft script for naming consistency
2025-10-20 10:15:17 -07:00
Andrej Karpathy
190d9515d0
dont evaluate the sampling evals during SFT they are too slow. keep the multiple choice evals. delete unused imports
2025-10-15 16:42:23 +00:00
karpathy
3a5e0bc50b
initial commit
2025-10-13 06:49:24 -07:00