tacit synced commits to refs/pull/141/merge at tacit/nanochat from mirror 2026-01-12 06:23:49 +00:00
b33e394528 oops actually make SSSL the default window pattern
fbc1484e8c add alternating window size patterns for the GPT layers, following GPT-3. Experimented a bit and found the pattern SSSL to work well - 3 short, 1 long alternating. This is now the new default and the plots look quite a bit better on flops vs. bpb
2ff7d51252 integrate Flash Attention 3. +9% tok_per_sec for d12 out of the box, even with ctx as low as 2048, nice. Also, this gets us ready to tune the attention windows, which is huge
201d705957 recover the ability to load old checkpoints by patching the lambdas if they don't exist in checkpoints
Compare 7 commits »
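The SSSL pattern referenced in the two window commits above alternates sliding-window sizes across layers: three short-window layers, then one long-window layer, repeating. Below is a minimal sketch of how such an assignment could be expressed; the helper name, pattern string handling, and concrete window sizes are illustrative assumptions, not nanochat's actual code.

```python
# Sketch: assign a sliding-window size to each transformer layer by cycling
# through a pattern string, e.g. "SSSL" = 3 short layers, then 1 long layer.
# The window sizes and function name here are assumptions for illustration.

def layer_window_sizes(n_layer: int, pattern: str = "SSSL",
                       short_window: int = 1024, long_window: int = 4096) -> list[int]:
    """Map each layer index to a window size by repeating the pattern."""
    sizes = {"S": short_window, "L": long_window}
    return [sizes[pattern[i % len(pattern)]] for i in range(n_layer)]

# Example: a 12-layer model gets [1024, 1024, 1024, 4096, ...] repeating.
print(layer_window_sizes(12))
```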
tacit synced commits to refs/pull/151/merge at tacit/nanochat from mirror 2026-01-12 06:23:49 +00:00
b33e394528 oops actually make SSSL the default window pattern
fbc1484e8c add alternating window size patterns for the GPT layers, following GPT-3. Experimented a bit and found the pattern SSSL to work well - 3 short, 1 long alternating. This is now the new default and the plots look quite a bit better on flops vs. bpb
2ff7d51252 integrate Flash Attention 3. +9% tok_per_sec for d12 out of the box, even with ctx as low as 2048, nice. Also, this gets us ready to tune the attention windows, which is huge
201d705957 recover the ability to load old checkpoints by patching the lambdas if they don't exist in checkpoints
Compare 5 commits »
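The Flash Attention 3 commit above pairs naturally with the window tuning, since FlashAttention exposes sliding-window (local) attention directly in its kernel call. A hedged sketch follows, assuming the flash_attn_func signature of the public flash-attn package (the FA3/Hopper build exposes a similar function under its own module); tensor shapes and the window value are placeholders.

```python
# Sketch: calling FlashAttention with a causal sliding window, roughly how a
# windowed attention layer might invoke it. Import path and shapes are assumptions
# based on the public flash-attn package; this is not nanochat's attention code.
import torch
from flash_attn import flash_attn_func  # the FA3 (Hopper) build ships a similar function

B, T, H, D = 2, 2048, 8, 64                      # batch, sequence, heads, head_dim
q = torch.randn(B, T, H, D, device="cuda", dtype=torch.bfloat16)
k = torch.randn(B, T, H, D, device="cuda", dtype=torch.bfloat16)
v = torch.randn(B, T, H, D, device="cuda", dtype=torch.bfloat16)

window = 1024                                     # a "short" layer in the SSSL pattern
out = flash_attn_func(q, k, v, causal=True,
                      window_size=(window, 0))    # attend at most `window` tokens back
```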
tacit synced commits to master at tacit/nanochat from mirror 2026-01-12 06:23:48 +00:00
4610a838a1 record negative result on MTP
21608ec51e allow base_loss to report the loss of any arbitrary Hugging Face model, similar to base_eval. Had to make the dataloader a lot better: it now just takes a tokenizer rather than loading the nanochat one, which is much better anyway
aa95fb2e03 make miniseries more generic, easier to run, and less hard-coded
Compare 3 commits »
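Both val_bpb and the base_loss change above revolve around bits per byte: mean cross-entropy is measured in nats per token and then normalized by the raw byte count of the text, which makes models with different tokenizers comparable. A rough sketch of that computation for an arbitrary Hugging Face model; the model id, text file, and truncation are placeholders, not the actual base_loss script.

```python
# Sketch: approximate bits-per-byte (bpb) of any causal LM on some validation text.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"                                 # placeholder: any causal LM id works
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

text = open("val.txt").read()[:2000]              # placeholder validation text, kept short
ids = tok(text, return_tensors="pt").input_ids

with torch.no_grad():
    # HF shifts labels internally; .loss is mean cross-entropy in nats per predicted token
    loss = model(ids, labels=ids).loss

n_pred = ids.numel() - 1                          # tokens actually predicted
n_bytes = len(text.encode("utf-8"))               # normalize by raw bytes, not tokens
bpb = loss.item() * n_pred / (math.log(2) * n_bytes)
print(f"bpb ~= {bpb:.3f}")
```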
tacit synced commits to refs/pull/432/merge at tacit/nanochat from mirror 2026-01-11 22:13:43 +00:00
aa530cdad5 Add learnable lambdas that gate the residual connection and a skip connection to the input embeddings, solid bump to val_bpb
2c4473dd1b Big Muon optimizer changes inspired by the latest modded-nanogpt. Added Polar Express, Adafactor-style variance reduction, cautious weight decay, and a linear schedule that ramps weight decay down to zero. Tuned the optimum weight decay at multiple model sizes (d8, d12, d16, d20), found a scaling law of optimum wd ∝ 1/channels^2, and made it the default in the code. --weight_decay of base_train is now on by default and configured optimally according to these experiments. Solid bump to val_bpb observed as a result.
Compare 3 commits »
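A hedged sketch of what "learnable lambdas that gate the residual connection and a skip connection to the input embeddings" could look like inside a transformer block; the block structure, parameter names, and initial values are assumptions for illustration only, not nanochat's implementation.

```python
# Sketch: per-block learnable scalars gating the residual stream and a skip
# connection back to the token embeddings. Naming and init are illustrative.
import torch
import torch.nn as nn

class GatedBlock(nn.Module):
    def __init__(self, attn: nn.Module, mlp: nn.Module):
        super().__init__()
        self.attn, self.mlp = attn, mlp
        self.resid_lambda = nn.Parameter(torch.ones(1))   # gates the running residual
        self.x0_lambda = nn.Parameter(torch.zeros(1))     # gates the embedding skip

    def forward(self, x: torch.Tensor, x0: torch.Tensor) -> torch.Tensor:
        # x is the residual stream, x0 the original input embeddings
        x = self.resid_lambda * x + self.x0_lambda * x0
        x = x + self.attn(x)
        x = x + self.mlp(x)
        return x
```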
tacit synced commits to refs/pull/433/merge at tacit/nanochat from mirror 2026-01-11 22:13:43 +00:00
2ff7d51252 integrate Flash Attention 3. +9% tok_per_sec for d12 out of the box, even with ctx as low as 2048, nice. Also, this gets us ready to tune the attention windows, which is huge
201d705957 recover the ability to load old checkpoints by patching the lambdas if they don't exist in checkpoints
aa530cdad5 Add learnable lambdas that gate the residual connection and a skip connection to the input embeddings, solid bump to val_bpb
2c4473dd1b Big Muon optimizer changes inspired by the latest modded-nanogpt. Added Polar Express, Adafactor-style variance reduction, cautious weight decay, and a linear schedule that ramps weight decay down to zero. Tuned the optimum weight decay at multiple model sizes (d8, d12, d16, d20), found a scaling law of optimum wd ∝ 1/channels^2, and made it the default in the code. --weight_decay of base_train is now on by default and configured optimally according to these experiments. Solid bump to val_bpb observed as a result.
Compare 5 commits »
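Two of the Muon-commit ingredients are easy to write down concretely: the optimum weight decay scaling as wd ∝ 1/channels^2, and the linear ramp of weight decay down to zero over training. The sketch below uses a made-up reference width and base value (the commit does not state them); only the functional forms come from the commit message.

```python
# Sketch: weight decay set by a 1/channels^2 scaling law, then ramped linearly to zero.
# The anchor point (768 channels, base_wd=0.1) is a hypothetical placeholder.

def scaled_weight_decay(channels: int, base_wd: float = 0.1, base_channels: int = 768) -> float:
    """Optimum weight decay scaling as 1/channels^2, anchored at a reference width."""
    return base_wd * (base_channels / channels) ** 2

def wd_at_step(step: int, total_steps: int, wd0: float) -> float:
    """Linearly ramp weight decay from wd0 down to zero over training."""
    return wd0 * max(0.0, 1.0 - step / total_steps)

wd0 = scaled_weight_decay(channels=1024)          # hypothetical wider model
print(wd0, wd_at_step(step=500, total_steps=1000, wd0=wd0))
```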
tacit synced commits to refs/pull/431/merge at tacit/nanochat from mirror 2026-01-11 22:13:42 +00:00
201d705957 recover the ability to load old checkpoints by patching the lambdas if they don't exist in checkpoints
aa530cdad5 Add learnable lambdas that gate the residual connection and a skip connection to the input embeddings, solid bump to val_bpb
2c4473dd1b Big Muon optimizer changes inspired by the latest modded-nanogpt. Added Polar Express, Adafactor-style variance reduction, cautious weight decay, and a linear schedule that ramps weight decay down to zero. Tuned the optimum weight decay at multiple model sizes (d8, d12, d16, d20), found a scaling law of optimum wd ∝ 1/channels^2, and made it the default in the code. --weight_decay of base_train is now on by default and configured optimally according to these experiments. Solid bump to val_bpb observed as a result.
Compare 4 commits »
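The "patch the lambdas" commit above is a standard backward-compatibility trick: when loading an old checkpoint that predates newly added parameters, fill the missing entries from the freshly initialized model before calling load_state_dict. A minimal sketch, with the key-matching logic and naming as assumptions.

```python
# Sketch: load an old checkpoint and fill in lambda parameters it doesn't have,
# using the current model's initial values. Key names are illustrative.
import torch

def load_state_dict_with_lambda_defaults(model: torch.nn.Module, ckpt_path: str) -> None:
    state = torch.load(ckpt_path, map_location="cpu")
    # Old checkpoints predate the learnable lambdas: patch in the missing keys
    # with whatever the current model already initializes them to.
    for key, value in model.state_dict().items():
        if key not in state and "lambda" in key:
            state[key] = value.clone()
    model.load_state_dict(state)
```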
tacit synced commits to refs/pull/414/merge at tacit/nanochat from mirror 2026-01-11 22:13:42 +00:00
2ff7d51252 integrate Flash Attention 3. +9% tok_per_sec for d12 out of the box, even with ctx as low as 2048, nice. Also, this gets us ready to tune the attention windows, which is huge
201d705957 recover the ability to load old checkpoints by patching the lambdas if they don't exist in checkpoints
aa530cdad5 Add learnable lambdas that gate the residual connection and a skip connection to the input embeddings, solid bump to val_bpb
2c4473dd1b Big Muon optimizer changes inspired by the latest modded-nanogpt. Added Polar Express, Adafactor-style variance reduction, cautious weight decay, and a linear schedule that ramps weight decay down to zero. Tuned the optimum weight decay at multiple model sizes (d8, d12, d16, d20), found a scaling law of optimum wd ∝ 1/channels^2, and made it the default in the code. --weight_decay of base_train is now on by default and configured optimally according to these experiments. Solid bump to val_bpb observed as a result.
Compare 5 commits »
tacit synced commits to refs/pull/429/merge at tacit/nanochat from mirror 2026-01-11 22:13:42 +00:00
2ff7d51252 integrate Flash Attention 3. +9% tok_per_sec for d12 out of the box, even with ctx as low as 2048, nice. Also, this gets us ready to tune the attention windows, which is huge
201d705957 recover the ability to load old checkpoints by patching the lambdas if they don't exist in checkpoints
aa530cdad5 Add learnable lambdas that gate the residual connection and a skip connection to the input embeddings, solid bump to val_bpb
2c4473dd1b Big Muon optimizer changes inspired by the latest modded-nanogpt. Added Polar Express, Adafactor-style variance reduction, cautious weight decay, and a linear schedule that ramps weight decay down to zero. Tuned the optimum weight decay at multiple model sizes (d8, d12, d16, d20), found a scaling law of optimum wd ∝ 1/channels^2, and made it the default in the code. --weight_decay of base_train is now on by default and configured optimally according to these experiments. Solid bump to val_bpb observed as a result.
Compare 5 commits »
tacit synced commits to refs/pull/151/merge at tacit/nanochat from mirror 2026-01-11 22:13:42 +00:00
aa530cdad5 Add learnable lambdas that gate the residual connection and a skip connection to the input embeddings, solid bump to val_bpb
2c4473dd1b Big Muon optimizer changes inspired by the latest modded-nanogpt. Added Polar Express, Adafactor-style variance reduction, cautious weight decay, and a linear schedule that ramps weight decay down to zero. Tuned the optimum weight decay at multiple model sizes (d8, d12, d16, d20), found a scaling law of optimum wd ∝ 1/channels^2, and made it the default in the code. --weight_decay of base_train is now on by default and configured optimally according to these experiments. Solid bump to val_bpb observed as a result.
Compare 3 commits »
tacit synced commits to master at tacit/nanochat from mirror 2026-01-11 22:13:41 +00:00
b33e394528 oops actually make SSSL the default window pattern
fbc1484e8c add alternating window size patterns for the GPT layers, following GPT-3. Experimented a bit and found the pattern SSSL to work well - 3 short, 1 long alternating. This is now the new default and the plots look quite a bit better on flops vs. bpb
2ff7d51252 integrate Flash Attention 3. +9% tok_per_sec for d12 out of the box, even with ctx as low as 2048, nice. Also, this gets us ready to tune the attention windows, which is huge
201d705957 recover the ability to load old checkpoints by patching the lambdas if they don't exist in checkpoints
aa530cdad5 Add learnable lambdas that gate the residual connection and a skip connection to the input embeddings, solid bump to val_bpb
Compare 6 commits »
tacit synced commits to refs/pull/204/merge at tacit/nanochat from mirror 2026-01-10 21:43:42 +00:00
f5a0ea4d3f take out these gitignore dirs
4ddc803797 fix slight adamw bug. This chunk was originally copy-pasted from modded-nanogpt, which still seems to have the bug
Compare 3 commits »
tacit synced commits to refs/pull/85/merge at tacit/nanochat from mirror 2026-01-10 21:43:42 +00:00
f5a0ea4d3f take out these gitignore dirs
4ddc803797 fix slight adamw bug. This chunk was originally copy-pasted from modded-nanogpt, which still seems to have the bug
Compare 3 commits »
tacit synced commits to refs/pull/141/merge at tacit/nanochat from mirror 2026-01-10 05:24:03 +00:00
f5a0ea4d3f take out these gitignore dirs
4ddc803797 fix slight adamw bug. This chunk was originally copy-pasted from modded-nanogpt, which still seems to have the bug
a1ccb3dc0b remove rust compilation as rustbpe is now installed from separate package (#416)
061f83c152 delete grad_clip. It appears not to be necessary at all: not only was it buggy (the clipping happened per GPU before gradient synchronization), it also costs ~2% MFU and doesn't even help. I tried deleting it a while ago and back then it did help, so I'm guessing some hyperparameter tuning has since obviated the need for it
Compare 10 commits »
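The grad_clip commit above calls out an ordering bug: clipping per GPU before gradient synchronization rescales each rank's partial gradient by a different factor, which is not the same as clipping the true averaged gradient. Below is a sketch of the correct ordering in a hand-rolled DDP-style loop (nanochat ultimately removed clipping altogether); the helper name and explicit all_reduce are illustrative, not the project's code.

```python
# Sketch: synchronize gradients across ranks first, then clip the global norm.
# Assumes torch.distributed has already been initialized with an NCCL backend.
import torch
import torch.distributed as dist

def sync_then_clip(model: torch.nn.Module, max_norm: float = 1.0) -> None:
    # 1) Average gradients across ranks so every rank sees the same gradient...
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.AVG)
    # 2) ...then clip the global gradient norm once. Clipping before the all_reduce
    #    would scale each rank's partial gradient differently, which is not
    #    equivalent to clipping the averaged gradient.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
```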
tacit synced commits to refs/pull/93/merge at tacit/nanochat from mirror 2026-01-10 05:24:03 +00:00
f5a0ea4d3f take out these gitignore dirs
4ddc803797 fix slight adamw bug. This chunk was originally copy-pasted from modded-nanogpt, which still seems to have the bug
a1ccb3dc0b remove rust compilation as rustbpe is now installed from separate package (#416)
061f83c152 delete grad_clip. It appears not to be necessary at all: not only was it buggy (the clipping happened per GPU before gradient synchronization), it also costs ~2% MFU and doesn't even help. I tried deleting it a while ago and back then it did help, so I'm guessing some hyperparameter tuning has since obviated the need for it
Compare 10 commits »
tacit synced commits to refs/pull/296/merge at tacit/nanochat from mirror 2026-01-09 21:22:41 +00:00
f5a0ea4d3f take out these gitignore dirs
4ddc803797 fix slight adamw bug. This chunk was originally copy-pasted from modded-nanogpt, which still seems to have the bug
a1ccb3dc0b remove rust compilation as rustbpe is now installed from separate package (#416)
061f83c152 delete grad_clip. It appears not to be necessary at all: not only was it buggy (the clipping happened per GPU before gradient synchronization), it also costs ~2% MFU and doesn't even help. I tried deleting it a while ago and back then it did help, so I'm guessing some hyperparameter tuning has since obviated the need for it
Compare 14 commits »
tacit synced commits to refs/pull/400/merge at tacit/nanochat from mirror 2026-01-09 21:22:41 +00:00
f5a0ea4d3f take out these gitignore dirs
4ddc803797 fix slight adamw bug. This chunk was originally copy-pasted from modded-nanogpt, which still seems to have the bug
a1ccb3dc0b remove rust compilation as rustbpe is now installed from separate package (#416)
Compare 4 commits »
tacit synced commits to refs/pull/396/merge at tacit/nanochat from mirror 2026-01-09 13:12:48 +00:00
f5a0ea4d3f take out these gitignore dirs
4ddc803797 fix slight adamw bug. This chunk was originally copy-pasted from modded-nanogpt, which still seems to have the bug
a1ccb3dc0b remove rust compilation as rustbpe is now installed from separate package (#416)
061f83c152 delete grad_clip. It appears not to be necessary at all: not only was it buggy (the clipping happened per GPU before gradient synchronization), it also costs ~2% MFU and doesn't even help. I tried deleting it a while ago and back then it did help, so I'm guessing some hyperparameter tuning has since obviated the need for it
Compare 9 commits »
tacit synced commits to refs/pull/151/merge at tacit/nanochat from mirror 2026-01-09 13:12:47 +00:00
f5a0ea4d3f take out these gitignore dirs
4ddc803797 fix slight adamw bug. This chunk was originally copy-pasted from modded-nanogpt, which still seems to have the bug
a1ccb3dc0b remove rust compilation as rustbpe is now installed from separate package (#416)
Compare 4 commits »
tacit synced commits to refs/pull/59/merge at tacit/nanochat from mirror 2026-01-09 05:02:39 +00:00
f5a0ea4d3f take out these gitignore dirs
4ddc803797 fix slight adamw bug. This chunk was originally copy-pasted from modded-nanogpt, which still seems to have the bug
a1ccb3dc0b remove rust compilation as rustbpe is now installed from separate package (#416)
061f83c152 delete grad_clip. It appears not to be necessary at all: not only was it buggy (the clipping happened per GPU before gradient synchronization), it also costs ~2% MFU and doesn't even help. I tried deleting it a while ago and back then it did help, so I'm guessing some hyperparameter tuning has since obviated the need for it
Compare 9 commits »
tacit synced commits to refs/pull/407/merge at tacit/nanochat from mirror 2026-01-09 05:02:39 +00:00
f5a0ea4d3f take out these gitignore dirs
4ddc803797 fix slight adamw bug. This chunk was originally copy-pasted from modded-nanogpt, which still seems to have the bug
a1ccb3dc0b remove rust compilation as rustbpe is now installed from separate package (#416)
Compare 4 commits »