nanochat

mirror of https://github.com/karpathy/nanochat.git synced 2026-03-30 00:25:14 +00:00


Andrej Karpathy 43c29dd9d5	2026-01-13 20:05:47 +00:00
Big DataLoader refactor: BOS-aligned dataloaders with epoch tracking for pre/mid-training

The new DataLoader ensures that every token sequence in train/val batches has a BOS token at the beginning. Therefore, no token streams start abruptly in the middle of a document, which could be confusing for the model. Note that this changes the loss scale because there are fewer confusing tokens in the train/val batches. The main downside is that we now waste about 35% of tokens due to cropping. This is ok because we have a lot of data. See dev/LOG.md entry for this change for a lot more information.
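The BOS alignment and cropping trade-off described in this commit message can be illustrated with a small sketch. This is not the code in dataloader.py: the function name `bos_aligned_rows`, the token id `BOS = 0`, and the greedy packing strategy are all assumptions made for illustration. The sketch only shows why every row starts with BOS and where cropped tokens are lost.

```python
BOS = 0  # hypothetical BOS token id, chosen only for this sketch


def bos_aligned_rows(docs, row_len):
    """Greedily pack tokenized documents into rows of exactly row_len tokens.

    Each document is assumed to start with BOS. A new row always begins with
    the start of a document, so every row begins with BOS. When a document
    does not fit in the remaining space it is cropped, and the cropped tail
    is discarded (these are the "wasted" tokens).
    """
    row = []
    for doc in docs:
        if not doc:
            continue  # skip empty documents
        assert doc[0] == BOS, "each document must start with BOS"
        room = row_len - len(row)
        row.extend(doc[:room])  # crop the document to the remaining space
        if len(row) == row_len:
            yield row
            row = []
    # A trailing partial row is dropped in this sketch.


docs = [[0, 1, 2, 3, 4], [0, 5], [0, 6, 7, 8, 9, 10, 11]]
rows = list(bos_aligned_rows(docs, row_len=4))
total = sum(len(d) for d in docs)
used = sum(len(r) for r in rows)
wasted_fraction = (total - used) / total
```

On this toy input the wasted fraction is 6 of 14 tokens. The roughly 35% figure quoted in the commit depends on the real document length distribution and sequence length, which this sketch does not model.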
__init__.py	initial commit	2025-10-13 06:49:24 -07:00
adamw.py	Add learnable lambdas that gate the residual connection and a skip connection to the input embeddings, solid bump to val_bpb	2026-01-11 18:47:35 +00:00
checkpoint_manager.py	add alternating window size patterns for the GPT layers, following GPT-3. Experimented a bit and found the pattern SSSL to work well - 3 short, 1 long alternating. This is now the new default and the plots look quite a bit better on flops vs. bpb	2026-01-11 21:49:54 +00:00
common.py	fix: safe DDP cleanup (check initialized PG, not just env) (#256)	2025-12-27 20:27:40 -08:00
core_eval.py	initial commit	2025-10-13 06:49:24 -07:00
dataloader.py	Big DataLoader refactor: BOS-aligned dataloaders with epoch tracking for pre/mid-training	2026-01-13 20:05:47 +00:00
dataset.py	initial commit	2025-10-13 06:49:24 -07:00
engine.py	integrate Flash Attention 3. +9% tok_per_sec for d12 with ctx even as low as 2048 out of the box nice. also, ready to tune windows huge	2026-01-11 20:33:19 +00:00
execution.py	nit delete redundant catch/raise in execute	2025-10-29 08:10:03 -07:00
gpt.py	add alternating window size patterns for the GPT layers, following GPT-3. Experimented a bit and found the pattern SSSL to work well - 3 short, 1 long alternating. This is now the new default and the plots look quite a bit better on flops vs. bpb	2026-01-11 21:49:54 +00:00
logo.svg	initial commit	2025-10-13 06:49:24 -07:00
loss_eval.py	fix typos	2025-11-14 11:20:25 +01:00
muon.py	Big Muon optimizer changes inspired by latest of modded-nanogpt. Added Polar Express, Adafactor-style variance reduction, cautious weight decay, schedule weight decay linearly to ramp down to zero. Tuned optimum weight decay for multiple model sizes d8, d12, d16, d20 and found a scaling law with optimum wd \propto 1/channels^2, including it as default into code. --weight_decay of base_train is now default on and configured optimally according to all of these experiments. Solid bump to val_bpb observed as a result of these changes.	2026-01-11 16:56:59 +00:00
report.py	fix small bug where this would break if git stage has deleted files	2026-01-04 19:11:43 +00:00
tokenizer.py	adjust the comment on the regex pattern per recent experiment, see dev/LOG.md	2026-01-13 17:50:39 +00:00
ui.html	Fix conversation scroll to bottom on some browsers + remove duplicated padding (#348)	2025-12-31 13:03:22 -08:00
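The SSSL window pattern mentioned in the gpt.py and checkpoint_manager.py commit (3 short, 1 long, alternating) can be read as a string tiled across the layers. The sketch below is one possible interpretation, not the repository's implementation: the function name `layer_window_sizes`, its parameters, and the example window sizes are hypothetical.

```python
def layer_window_sizes(pattern, n_layers, short, long):
    """Tile a pattern such as "SSSL" across n_layers attention layers.

    "S" maps to the short window size and "L" to the long one. The pattern
    repeats from the start once it is exhausted.
    """
    if not pattern or any(c not in "SL" for c in pattern):
        raise ValueError("pattern must be a non-empty string of 'S' and 'L'")
    sizes = []
    for i in range(n_layers):
        kind = pattern[i % len(pattern)]
        sizes.append(short if kind == "S" else long)
    return sizes


# Example values only: 512 and 2048 are illustrative, not taken from the repo.
sizes = layer_window_sizes("SSSL", n_layers=6, short=512, long=2048)
```

With these illustrative values, every fourth layer gets the long window and the rest get the short one. Any special handling of particular layers in the real model is outside what this sketch assumes.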