tacit

tacit synced commits to refs/pull/59/merge at tacit/nanochat from mirror 2026-01-12 06:23:55 +00:00

7547624a69 Merge 23393eae83 into b33e394528

b33e394528 oops actually make SSSL the default window pattern

fbc1484e8c add alternating window size patterns for the GPT layers, following GPT-3. Experimented a bit and found the pattern SSSL to work well - 3 short, 1 long alternating. This is now the new default and the plots look quite a bit better on flops vs. bpb

2ff7d51252 integrate Flash Attention 3. +9% tok_per_sec for d12 with ctx even as low as 2048 out of the box nice. also, ready to tune windows huge

201d705957 recover the ability to load old checkpoints by patching the lambdas if they don't exist in checkpoints

Compare 7 commits »
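
Note on fbc1484e8c: the SSSL pattern ("3 short, 1 long alternating") can be read as a string that is tiled across the layer stack. The following is a minimal sketch of that reading, not the nanochat implementation. The function name layer_windows, its parameters, and the window sizes 1024 and 2048 are assumptions made for illustration; the commit message does not state the actual sizes.

```python
def layer_windows(n_layers, pattern="SSSL", short_window=1024, long_window=2048):
    """Tile a window pattern over the layers and return one window size per layer.

    'S' selects the short attention window and 'L' the long one.
    The default sizes are illustrative assumptions, not nanochat's values.
    """
    sizes = {"S": short_window, "L": long_window}
    if not pattern or any(ch not in sizes for ch in pattern):
        raise ValueError("pattern must be a non-empty string of 'S' and 'L'")
    # Layer i uses the pattern character at position i modulo the pattern length.
    return [sizes[pattern[i % len(pattern)]] for i in range(n_layers)]


if __name__ == "__main__":
    # With the assumed sizes, every fourth layer gets the long window.
    print(layer_windows(8))
```

Under these assumptions, three of every four layers attend over the short window, which is where a compute saving relative to all-long layers would come from.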

tacit synced commits to refs/pull/433/merge at tacit/nanochat from mirror 2026-01-12 06:23:54 +00:00

1892bae753 Merge c0618a6b7e into b33e394528

b33e394528 oops actually make SSSL the default window pattern

fbc1484e8c add alternating window size patterns for the GPT layers, following GPT-3. Experimented a bit and found the pattern SSSL to work well - 3 short, 1 long alternating. This is now the new default and the plots look quite a bit better on flops vs. bpb

Compare 3 commits »

tacit synced commits to refs/pull/431/merge at tacit/nanochat from mirror 2026-01-12 06:23:54 +00:00

238c4d1a3c Merge 1a9df65ee7 into 4610a838a1

4610a838a1 record negative result on MTP

21608ec51e allow base_loss to report the loss of any arbitrary huggingface model similar to base_eval. had to change dataloader to be a lot better and just take tokenizer, not load the nanochat one. much better this way anyway

1a9df65ee7 Merge branch 'karpathy:master' into master

d47431d87d feat: restore flash attention

Compare 10 commits »

tacit synced commits to refs/pull/431/head at tacit/nanochat from mirror 2026-01-12 06:23:54 +00:00

1a9df65ee7 Merge branch 'karpathy:master' into master

d47431d87d feat: restore flash attention

aa95fb2e03 make miniseries more generic and easier to run and less hard coded

8d89db3195 Merge branch 'master' into master

b33e394528 oops actually make SSSL the default window pattern

Compare 10 commits »

tacit synced commits to refs/pull/432/merge at tacit/nanochat from mirror 2026-01-12 06:23:54 +00:00

a88a94f5bd Merge 07e5509662 into 4610a838a1

4610a838a1 record negative result on MTP

21608ec51e allow base_loss to report the loss of any arbitrary huggingface model similar to base_eval. had to change dataloader to be a lot better and just take tokenizer, not load the nanochat one. much better this way anyway

aa95fb2e03 make miniseries more generic and easier to run and less hard coded

b33e394528 oops actually make SSSL the default window pattern

Compare 8 commits »

tacit synced commits to refs/pull/429/merge at tacit/nanochat from mirror 2026-01-12 06:23:53 +00:00

d7482c97e2 Merge b510e6648e into b33e394528

b33e394528 oops actually make SSSL the default window pattern

fbc1484e8c add alternating window size patterns for the GPT layers, following GPT-3. Experimented a bit and found the pattern SSSL to work well - 3 short, 1 long alternating. This is now the new default and the plots look quite a bit better on flops vs. bpb

Compare 3 commits »

tacit synced commits to refs/pull/425/merge at tacit/nanochat from mirror 2026-01-12 06:23:53 +00:00

9f661bc358 Merge eebab89a11 into b33e394528

b33e394528 oops actually make SSSL the default window pattern

fbc1484e8c add alternating window size patterns for the GPT layers, following GPT-3. Experimented a bit and found the pattern SSSL to work well - 3 short, 1 long alternating. This is now the new default and the plots look quite a bit better on flops vs. bpb

2ff7d51252 integrate Flash Attention 3. +9% tok_per_sec for d12 with ctx even as low as 2048 out of the box nice. also, ready to tune windows huge

201d705957 recover the ability to load old checkpoints by patching the lambdas if they don't exist in checkpoints

Compare 7 commits »
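
Note on 201d705957: patching missing lambdas at load time might look roughly like the sketch below. It uses plain dictionaries in place of real tensors, and the key names resid_lambdas.N and x0_lambdas.N, the function name, and the default values are all hypothetical. The defaults are chosen on the assumption that a residual gate of 1.0 and an embedding-skip gate of 0.0 leave an old checkpoint's behavior unchanged.

```python
def patch_missing_lambdas(state_dict, n_layers, resid_default=1.0, x0_default=0.0):
    """Return a copy of state_dict with any missing per-layer lambdas filled in.

    Key names and defaults are hypothetical; existing entries are never overwritten.
    """
    patched = dict(state_dict)  # shallow copy so the caller's dict is untouched
    for i in range(n_layers):
        patched.setdefault(f"resid_lambdas.{i}", resid_default)
        patched.setdefault(f"x0_lambdas.{i}", x0_default)
    return patched


if __name__ == "__main__":
    old_checkpoint = {"wte.weight": [0.0, 0.1]}  # stand-in for a pre-lambda checkpoint
    print(sorted(patch_missing_lambdas(old_checkpoint, n_layers=2)))
```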

tacit synced commits to refs/pull/414/merge at tacit/nanochat from mirror 2026-01-12 06:23:52 +00:00

a0617c2616 Merge 57bcf6786e into b33e394528

b33e394528 oops actually make SSSL the default window pattern

fbc1484e8c add alternating window size patterns for the GPT layers, following GPT-3. Experimented a bit and found the pattern SSSL to work well - 3 short, 1 long alternating. This is now the new default and the plots look quite a bit better on flops vs. bpb

Compare 3 commits »

tacit synced commits to refs/pull/412/merge at tacit/nanochat from mirror 2026-01-12 06:23:52 +00:00

0a3b7450b6 Merge db5e62fc2a into b33e394528

b33e394528 oops actually make SSSL the default window pattern

fbc1484e8c add alternating window size patterns for the GPT layers, following GPT-3. Experimented a bit and found the pattern SSSL to work well - 3 short, 1 long alternating. This is now the new default and the plots look quite a bit better on flops vs. bpb

2ff7d51252 integrate Flash Attention 3. +9% tok_per_sec for d12 with ctx even as low as 2048 out of the box nice. also, ready to tune windows huge

201d705957 recover the ability to load old checkpoints by patching the lambdas if they don't exist in checkpoints

Compare 7 commits »

tacit synced commits to refs/pull/409/merge at tacit/nanochat from mirror 2026-01-12 06:23:51 +00:00

78062e4651 Merge 489075bdbd into b33e394528

b33e394528 oops actually make SSSL the default window pattern

fbc1484e8c add alternating window size patterns for the GPT layers, following GPT-3. Experimented a bit and found the pattern SSSL to work well - 3 short, 1 long alternating. This is now the new default and the plots look quite a bit better on flops vs. bpb

2ff7d51252 integrate Flash Attention 3. +9% tok_per_sec for d12 with ctx even as low as 2048 out of the box nice. also, ready to tune windows huge

201d705957 recover the ability to load old checkpoints by patching the lambdas if they don't exist in checkpoints

Compare 7 commits »

tacit synced commits to refs/pull/407/merge at tacit/nanochat from mirror 2026-01-12 06:23:51 +00:00

a86320d7cf Merge 47885e743b into b33e394528

b33e394528 oops actually make SSSL the default window pattern

fbc1484e8c add alternating window size patterns for the GPT layers, following GPT-3. Experimented a bit and found the pattern SSSL to work well - 3 short, 1 long alternating. This is now the new default and the plots look quite a bit better on flops vs. bpb

2ff7d51252 integrate Flash Attention 3. +9% tok_per_sec for d12 with ctx even as low as 2048 out of the box nice. also, ready to tune windows huge

201d705957 recover the ability to load old checkpoints by patching the lambdas if they don't exist in checkpoints

Compare 7 commits »

tacit synced commits to refs/pull/400/merge at tacit/nanochat from mirror 2026-01-12 06:23:51 +00:00

76a9627814 Merge 32ce342c88 into 21608ec51e

21608ec51e allow base_loss to report the loss of any arbitrary huggingface model similar to base_eval. had to change dataloader to be a lot better and just take tokenizer, not load the nanochat one. much better this way anyway

aa95fb2e03 make miniseries more generic and easier to run and less hard coded

b33e394528 oops actually make SSSL the default window pattern

fbc1484e8c add alternating window size patterns for the GPT layers, following GPT-3. Experimented a bit and found the pattern SSSL to work well - 3 short, 1 long alternating. This is now the new default and the plots look quite a bit better on flops vs. bpb

Compare 9 commits »

tacit synced commits to refs/pull/396/merge at tacit/nanochat from mirror 2026-01-12 06:23:51 +00:00

c7bc9000bf Merge 7f6219e092 into b33e394528

b33e394528 oops actually make SSSL the default window pattern

fbc1484e8c add alternating window size patterns for the GPT layers, following GPT-3. Experimented a bit and found the pattern SSSL to work well - 3 short, 1 long alternating. This is now the new default and the plots look quite a bit better on flops vs. bpb

2ff7d51252 integrate Flash Attention 3. +9% tok_per_sec for d12 with ctx even as low as 2048 out of the box nice. also, ready to tune windows huge

201d705957 recover the ability to load old checkpoints by patching the lambdas if they don't exist in checkpoints

Compare 7 commits »

tacit synced commits to refs/pull/370/merge at tacit/nanochat from mirror 2026-01-12 06:23:50 +00:00

a491c61da7 Merge 9a9b12b1be into 21608ec51e

21608ec51e allow base_loss to report the loss of any arbitrary huggingface model similar to base_eval. had to change dataloader to be a lot better and just take tokenizer, not load the nanochat one. much better this way anyway

aa95fb2e03 make miniseries more generic and easier to run and less hard coded

b33e394528 oops actually make SSSL the default window pattern

fbc1484e8c add alternating window size patterns for the GPT layers, following GPT-3. Experimented a bit and found the pattern SSSL to work well - 3 short, 1 long alternating. This is now the new default and the plots look quite a bit better on flops vs. bpb

Compare 9 commits »

tacit synced commits to refs/pull/324/merge at tacit/nanochat from mirror 2026-01-12 06:23:50 +00:00

8b3200b4e7 Merge e00c73322c into b33e394528

b33e394528 oops actually make SSSL the default window pattern

fbc1484e8c add alternating window size patterns for the GPT layers, following GPT-3. Experimented a bit and found the pattern SSSL to work well - 3 short, 1 long alternating. This is now the new default and the plots look quite a bit better on flops vs. bpb

2ff7d51252 integrate Flash Attention 3. +9% tok_per_sec for d12 with ctx even as low as 2048 out of the box nice. also, ready to tune windows huge

201d705957 recover the ability to load old checkpoints by patching the lambdas if they don't exist in checkpoints

Compare 7 commits »

tacit synced commits to refs/pull/296/merge at tacit/nanochat from mirror 2026-01-12 06:23:50 +00:00

54aa1a18d1 Merge 5172ea11bb into b33e394528

b33e394528 oops actually make SSSL the default window pattern

fbc1484e8c add alternating window size patterns for the GPT layers, following GPT-3. Experimented a bit and found the pattern SSSL to work well - 3 short, 1 long alternating. This is now the new default and the plots look quite a bit better on flops vs. bpb

2ff7d51252 integrate Flash Attention 3. +9% tok_per_sec for d12 with ctx even as low as 2048 out of the box nice. also, ready to tune windows huge

201d705957 recover the ability to load old checkpoints by patching the lambdas if they don't exist in checkpoints

Compare 7 commits »

tacit synced commits to refs/pull/151/merge at tacit/nanochat from mirror 2026-01-12 06:23:49 +00:00

c58fb46025 Merge 7f3154f025 into b33e394528

b33e394528 oops actually make SSSL the default window pattern

fbc1484e8c add alternating window size patterns for the GPT layers, following GPT-3. Experimented a bit and found the pattern SSSL to work well - 3 short, 1 long alternating. This is now the new default and the plots look quite a bit better on flops vs. bpb

2ff7d51252 integrate Flash Attention 3. +9% tok_per_sec for d12 with ctx even as low as 2048 out of the box nice. also, ready to tune windows huge

201d705957 recover the ability to load old checkpoints by patching the lambdas if they don't exist in checkpoints

Compare 5 commits »

tacit synced commits to refs/pull/141/merge at tacit/nanochat from mirror 2026-01-12 06:23:49 +00:00

0c9bf1fa6e Merge c79559674b into b33e394528

b33e394528 oops actually make SSSL the default window pattern

fbc1484e8c add alternating window size patterns for the GPT layers, following GPT-3. Experimented a bit and found the pattern SSSL to work well - 3 short, 1 long alternating. This is now the new default and the plots look quite a bit better on flops vs. bpb

2ff7d51252 integrate Flash Attention 3. +9% tok_per_sec for d12 with ctx even as low as 2048 out of the box nice. also, ready to tune windows huge

201d705957 recover the ability to load old checkpoints by patching the lambdas if they don't exist in checkpoints

Compare 7 commits »

tacit synced commits to master at tacit/nanochat from mirror 2026-01-12 06:23:48 +00:00

4610a838a1 record negative result on MTP

21608ec51e allow base_loss to report the loss of any arbitrary huggingface model similar to base_eval. had to change dataloader to be a lot better and just take tokenizer, not load the nanochat one. much better this way anyway

aa95fb2e03 make miniseries more generic and easier to run and less hard coded

Compare 3 commits »

tacit synced commits to refs/pull/432/merge at tacit/nanochat from mirror 2026-01-11 22:13:43 +00:00

eee143c409 Merge 07e5509662 into aa530cdad5

aa530cdad5 Add learnable lambdas that gate the residual connection and a skip connection to the input embeddings, solid bump to val_bpb

2c4473dd1b Big Muon optimizer changes inspired by latest of modded-nanogpt. Added Polar Express, Adafactor-style variance reduction, cautious weight decay, schedule weight decay linearly to ramp down to zero. Tuned optimum weight decay for multiple model sizes d8, d12, d16, d20 and found a scaling law with optimum wd \propto 1/channels^2, including it as default into code. --weight_decay of base_train is now default on and configured optimally according to all of these experiments. Solid bump to val_bpb observed as a result of these changes.

Compare 3 commits »
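
Note on 2c4473dd1b: the reported scaling law, optimum wd \propto 1/channels^2, implies that a weight decay tuned at one width can be rescaled to another. The sketch below shows only that arithmetic. The function name and the reference pair (ref_channels, ref_wd) are assumptions; the commit message does not give the constant of proportionality.

```python
def scaled_weight_decay(channels, ref_channels, ref_wd):
    """Rescale a weight decay tuned at ref_channels to a model with `channels`.

    Implements wd = ref_wd * (ref_channels / channels) ** 2, i.e. wd proportional
    to 1 / channels^2. The reference pair is a caller-supplied assumption.
    """
    if channels <= 0 or ref_channels <= 0:
        raise ValueError("channel counts must be positive")
    return ref_wd * (ref_channels / channels) ** 2


if __name__ == "__main__":
    # Doubling the width divides the rescaled weight decay by four.
    print(scaled_weight_decay(1024, ref_channels=512, ref_wd=0.4))
```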