• Joined on 2024-05-31
tacit synced commits to refs/pull/59/merge at tacit/nanochat from mirror 2026-01-12 06:23:55 +00:00
b33e394528 oops actually make SSSL the default window pattern
fbc1484e8c add alternating window size patterns for the GPT layers, following GPT-3. Experimented a bit and found the pattern SSSL to work well - 3 short, 1 long alternating. This is now the new default and the plots look quite a bit better on flops vs. bpb
2ff7d51252 integrate Flash Attention 3. +9% tok_per_sec for d12 with ctx even as low as 2048 out of the box nice. also, ready to tune windows huge
201d705957 recover the ability to load old checkpoints by patching the lambdas if they don't exist in checkpoints
Compare 7 commits »
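The "SSSL" pattern named in the commit messages above (3 short windows, then 1 long, repeating across layers) can be sketched as follows. This is only an illustrative sketch, not the repository's code: the function name `layer_windows`, its parameters, and the window sizes 1024 and 2048 are all assumptions made for the example.

```python
def layer_windows(n_layer, pattern="SSSL", short_window=1024, long_window=2048):
    """Assign one attention window size per layer by tiling the pattern string.

    Hypothetical helper: the names and the default window sizes are
    assumptions for illustration, not values taken from the commits.
    """
    sizes = {"S": short_window, "L": long_window}
    # Layer i takes the i-th pattern character, wrapping around at the end.
    return [sizes[pattern[i % len(pattern)]] for i in range(n_layer)]


if __name__ == "__main__":
    # For 8 layers with "SSSL", every fourth layer gets the long window.
    print(layer_windows(8))
```

Tiling a short string keeps the pattern independent of model depth, so the same setting can be reused for different layer counts.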
tacit synced commits to refs/pull/433/merge at tacit/nanochat from mirror 2026-01-12 06:23:54 +00:00
b33e394528 oops actually make SSSL the default window pattern
fbc1484e8c add alternating window size patterns for the GPT layers, following GPT-3. Experimented a bit and found the pattern SSSL to work well - 3 short, 1 long alternating. This is now the new default and the plots look quite a bit better on flops vs. bpb
Compare 3 commits »
tacit synced commits to refs/pull/431/merge at tacit/nanochat from mirror 2026-01-12 06:23:54 +00:00
4610a838a1 record negative result on MTP
21608ec51e allow base_loss to report the loss of any arbitrary huggingface model similar to base_eval. had to change dataloader to be a lot better and just take tokenizer, not load the nanochat one. much better this way anyway
1a9df65ee7 Merge branch 'karpathy:master' into master
d47431d87d feat: restore flash attention
Compare 10 commits »
tacit synced commits to refs/pull/431/head at tacit/nanochat from mirror 2026-01-12 06:23:54 +00:00
1a9df65ee7 Merge branch 'karpathy:master' into master
d47431d87d feat: restore flash attention
aa95fb2e03 make miniseries more generic and easier to run and less hard coded
8d89db3195 Merge branch 'master' into master
b33e394528 oops actually make SSSL the default window pattern
Compare 10 commits »
tacit synced commits to refs/pull/432/merge at tacit/nanochat from mirror 2026-01-12 06:23:54 +00:00
4610a838a1 record negative result on MTP
21608ec51e allow base_loss to report the loss of any arbitrary huggingface model similar to base_eval. had to change dataloader to be a lot better and just take tokenizer, not load the nanochat one. much better this way anyway
aa95fb2e03 make miniseries more generic and easier to run and less hard coded
b33e394528 oops actually make SSSL the default window pattern
Compare 8 commits »
tacit synced commits to refs/pull/429/merge at tacit/nanochat from mirror 2026-01-12 06:23:53 +00:00
b33e394528 oops actually make SSSL the default window pattern
fbc1484e8c add alternating window size patterns for the GPT layers, following GPT-3. Experimented a bit and found the pattern SSSL to work well - 3 short, 1 long alternating. This is now the new default and the plots look quite a bit better on flops vs. bpb
Compare 3 commits »
tacit synced commits to refs/pull/425/merge at tacit/nanochat from mirror 2026-01-12 06:23:53 +00:00
b33e394528 oops actually make SSSL the default window pattern
fbc1484e8c add alternating window size patterns for the GPT layers, following GPT-3. Experimented a bit and found the pattern SSSL to work well - 3 short, 1 long alternating. This is now the new default and the plots look quite a bit better on flops vs. bpb
2ff7d51252 integrate Flash Attention 3. +9% tok_per_sec for d12 with ctx even as low as 2048 out of the box nice. also, ready to tune windows huge
201d705957 recover the ability to load old checkpoints by patching the lambdas if they don't exist in checkpoints
Compare 7 commits »
tacit synced commits to refs/pull/414/merge at tacit/nanochat from mirror 2026-01-12 06:23:52 +00:00
b33e394528 oops actually make SSSL the default window pattern
fbc1484e8c add alternating window size patterns for the GPT layers, following GPT-3. Experimented a bit and found the pattern SSSL to work well - 3 short, 1 long alternating. This is now the new default and the plots look quite a bit better on flops vs. bpb
Compare 3 commits »
tacit synced commits to refs/pull/412/merge at tacit/nanochat from mirror 2026-01-12 06:23:52 +00:00
b33e394528 oops actually make SSSL the default window pattern
fbc1484e8c add alternating window size patterns for the GPT layers, following GPT-3. Experimented a bit and found the pattern SSSL to work well - 3 short, 1 long alternating. This is now the new default and the plots look quite a bit better on flops vs. bpb
2ff7d51252 integrate Flash Attention 3. +9% tok_per_sec for d12 with ctx even as low as 2048 out of the box nice. also, ready to tune windows huge
201d705957 recover the ability to load old checkpoints by patching the lambdas if they don't exist in checkpoints
Compare 7 commits »
tacit synced commits to refs/pull/409/merge at tacit/nanochat from mirror 2026-01-12 06:23:51 +00:00
b33e394528 oops actually make SSSL the default window pattern
fbc1484e8c add alternating window size patterns for the GPT layers, following GPT-3. Experimented a bit and found the pattern SSSL to work well - 3 short, 1 long alternating. This is now the new default and the plots look quite a bit better on flops vs. bpb
2ff7d51252 integrate Flash Attention 3. +9% tok_per_sec for d12 with ctx even as low as 2048 out of the box nice. also, ready to tune windows huge
201d705957 recover the ability to load old checkpoints by patching the lambdas if they don't exist in checkpoints
Compare 7 commits »
tacit synced commits to refs/pull/407/merge at tacit/nanochat from mirror 2026-01-12 06:23:51 +00:00
b33e394528 oops actually make SSSL the default window pattern
fbc1484e8c add alternating window size patterns for the GPT layers, following GPT-3. Experimented a bit and found the pattern SSSL to work well - 3 short, 1 long alternating. This is now the new default and the plots look quite a bit better on flops vs. bpb
2ff7d51252 integrate Flash Attention 3. +9% tok_per_sec for d12 with ctx even as low as 2048 out of the box nice. also, ready to tune windows huge
201d705957 recover the ability to load old checkpoints by patching the lambdas if they don't exist in checkpoints
Compare 7 commits »
tacit synced commits to refs/pull/400/merge at tacit/nanochat from mirror 2026-01-12 06:23:51 +00:00
21608ec51e allow base_loss to report the loss of any arbitrary huggingface model similar to base_eval. had to change dataloader to be a lot better and just take tokenizer, not load the nanochat one. much better this way anyway
aa95fb2e03 make miniseries more generic and easier to run and less hard coded
b33e394528 oops actually make SSSL the default window pattern
fbc1484e8c add alternating window size patterns for the GPT layers, following GPT-3. Experimented a bit and found the pattern SSSL to work well - 3 short, 1 long alternating. This is now the new default and the plots look quite a bit better on flops vs. bpb
Compare 9 commits »
tacit synced commits to refs/pull/396/merge at tacit/nanochat from mirror 2026-01-12 06:23:51 +00:00
b33e394528 oops actually make SSSL the default window pattern
fbc1484e8c add alternating window size patterns for the GPT layers, following GPT-3. Experimented a bit and found the pattern SSSL to work well - 3 short, 1 long alternating. This is now the new default and the plots look quite a bit better on flops vs. bpb
2ff7d51252 integrate Flash Attention 3. +9% tok_per_sec for d12 with ctx even as low as 2048 out of the box nice. also, ready to tune windows huge
201d705957 recover the ability to load old checkpoints by patching the lambdas if they don't exist in checkpoints
Compare 7 commits »
tacit synced commits to refs/pull/370/merge at tacit/nanochat from mirror 2026-01-12 06:23:50 +00:00
21608ec51e allow base_loss to report the loss of any arbitrary huggingface model similar to base_eval. had to change dataloader to be a lot better and just take tokenizer, not load the nanochat one. much better this way anyway
aa95fb2e03 make miniseries more generic and easier to run and less hard coded
b33e394528 oops actually make SSSL the default window pattern
fbc1484e8c add alternating window size patterns for the GPT layers, following GPT-3. Experimented a bit and found the pattern SSSL to work well - 3 short, 1 long alternating. This is now the new default and the plots look quite a bit better on flops vs. bpb
Compare 9 commits »
tacit synced commits to refs/pull/324/merge at tacit/nanochat from mirror 2026-01-12 06:23:50 +00:00
b33e394528 oops actually make SSSL the default window pattern
fbc1484e8c add alternating window size patterns for the GPT layers, following GPT-3. Experimented a bit and found the pattern SSSL to work well - 3 short, 1 long alternating. This is now the new default and the plots look quite a bit better on flops vs. bpb
2ff7d51252 integrate Flash Attention 3. +9% tok_per_sec for d12 with ctx even as low as 2048 out of the box nice. also, ready to tune windows huge
201d705957 recover the ability to load old checkpoints by patching the lambdas if they don't exist in checkpoints
Compare 7 commits »
tacit synced commits to refs/pull/296/merge at tacit/nanochat from mirror 2026-01-12 06:23:50 +00:00
b33e394528 oops actually make SSSL the default window pattern
fbc1484e8c add alternating window size patterns for the GPT layers, following GPT-3. Experimented a bit and found the pattern SSSL to work well - 3 short, 1 long alternating. This is now the new default and the plots look quite a bit better on flops vs. bpb
2ff7d51252 integrate Flash Attention 3. +9% tok_per_sec for d12 with ctx even as low as 2048 out of the box nice. also, ready to tune windows huge
201d705957 recover the ability to load old checkpoints by patching the lambdas if they don't exist in checkpoints
Compare 7 commits »
tacit synced commits to refs/pull/151/merge at tacit/nanochat from mirror 2026-01-12 06:23:49 +00:00
b33e394528 oops actually make SSSL the default window pattern
fbc1484e8c add alternating window size patterns for the GPT layers, following GPT-3. Experimented a bit and found the pattern SSSL to work well - 3 short, 1 long alternating. This is now the new default and the plots look quite a bit better on flops vs. bpb
2ff7d51252 integrate Flash Attention 3. +9% tok_per_sec for d12 with ctx even as low as 2048 out of the box nice. also, ready to tune windows huge
201d705957 recover the ability to load old checkpoints by patching the lambdas if they don't exist in checkpoints
Compare 5 commits »
tacit synced commits to refs/pull/141/merge at tacit/nanochat from mirror 2026-01-12 06:23:49 +00:00
b33e394528 oops actually make SSSL the default window pattern
fbc1484e8c add alternating window size patterns for the GPT layers, following GPT-3. Experimented a bit and found the pattern SSSL to work well - 3 short, 1 long alternating. This is now the new default and the plots look quite a bit better on flops vs. bpb
2ff7d51252 integrate Flash Attention 3. +9% tok_per_sec for d12 with ctx even as low as 2048 out of the box nice. also, ready to tune windows huge
201d705957 recover the ability to load old checkpoints by patching the lambdas if they don't exist in checkpoints
Compare 7 commits »
tacit synced commits to master at tacit/nanochat from mirror 2026-01-12 06:23:48 +00:00
4610a838a1 record negative result on MTP
21608ec51e allow base_loss to report the loss of any arbitrary huggingface model similar to base_eval. had to change dataloader to be a lot better and just take tokenizer, not load the nanochat one. much better this way anyway
aa95fb2e03 make miniseries more generic and easier to run and less hard coded
Compare 3 commits »
tacit synced commits to refs/pull/432/merge at tacit/nanochat from mirror 2026-01-11 22:13:43 +00:00
aa530cdad5 Add learnable lambdas that gate the residual connection and a skip connection to the input embeddings, solid bump to val_bpb
2c4473dd1b Big Muon optimizer changes inspired by latest of modded-nanogpt. Added Polar Express, Adafactor-style variance reduction, cautious weight decay, schedule weight decay linearly to ramp down to zero. Tuned optimum weight decay for multiple model sizes d8, d12, d16, d20 and found a scaling law with optimum wd \propto 1/channels^2, including it as default into code. --weight_decay of base_train is now default on and configured optimally according to all of these experiments. Solid bump to val_bpb observed as a result of these changes.
Compare 3 commits »
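The weight decay rule described in commit 2c4473dd1b (optimum wd proportional to 1/channels^2, with the decay then ramped linearly down to zero over training) can be illustrated with a small sketch. The reference point (`ref_channels=768`, `ref_wd=0.1`) and both function names are assumptions chosen for the example; the commit message does not state the proportionality constant.

```python
def scaled_weight_decay(channels, ref_channels=768, ref_wd=0.1):
    """Scale weight decay as 1/channels^2 from an assumed reference point.

    ref_channels and ref_wd are hypothetical values for illustration only.
    """
    return ref_wd * (ref_channels / channels) ** 2


def weight_decay_at(step, total_steps, base_wd):
    """Linearly ramp weight decay from base_wd at step 0 to zero at the end."""
    frac = min(max(step / total_steps, 0.0), 1.0)  # clamp progress to [0, 1]
    return base_wd * (1.0 - frac)


if __name__ == "__main__":
    wd = scaled_weight_decay(1536)  # doubling the width quarters the decay
    for step in (0, 500, 1000):
        print(step, weight_decay_at(step, 1000, wd))
```

Under this rule, doubling the channel count divides the starting weight decay by four, and the schedule then reduces it to zero by the final step.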