tacit synced commits to master at tacit/nanochat from mirror 2026-01-11 22:13:41 +00:00
b33e394528 oops actually make SSSL the default window pattern
fbc1484e8c add alternating window size patterns for the GPT layers, following GPT-3. Experimented a bit and found the pattern SSSL to work well - 3 short, 1 long alternating. This is now the new default and the plots look quite a bit better on flops vs. bpb
2ff7d51252 integrate Flash Attention 3. +9% tok_per_sec for d12 out of the box, even with ctx as low as 2048. Also sets us up to tune window sizes.
201d705957 recover the ability to load old checkpoints by patching the lambdas if they don't exist in checkpoints
aa530cdad5 Add learnable lambdas that gate the residual connection and a skip connection to the input embeddings, solid bump to val_bpb
Compare 6 commits »
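The "SSSL" window pattern from the commits above (three short-window layers, then one long-window layer, repeating) can be sketched as follows. This is an illustrative sketch only; the function name and the default window sizes here are assumptions, not nanochat's actual identifiers or values.

```python
# Illustrative sketch of an alternating sliding-window attention pattern
# ("SSSL"): every fourth layer attends over a long window, the rest over a
# short one. Window sizes and the helper name are assumed for illustration.

def window_sizes(n_layers: int, short: int = 1024, long: int = 4096) -> list[int]:
    """Return a per-layer attention window size, repeating short, short, short, long."""
    pattern = [short, short, short, long]  # "SSSL"
    return [pattern[i % len(pattern)] for i in range(n_layers)]

print(window_sizes(8))  # [1024, 1024, 1024, 4096, 1024, 1024, 1024, 4096]
```

With a pattern like this, most layers pay only for short-window attention while the periodic long-window layers preserve long-range information flow, which is the flops-vs-bpb trade the commit message describes.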
tacit synced commits to refs/pull/204/merge at tacit/nanochat from mirror 2026-01-10 21:43:42 +00:00
f5a0ea4d3f take out these gitignore dirs
4ddc803797 fix slight adamw bug. This chunk was originally copy-pasted from modded-nanogpt, which still seems to have the bug.
Compare 3 commits »
tacit synced commits to refs/pull/93/merge at tacit/nanochat from mirror 2026-01-10 05:24:03 +00:00
f5a0ea4d3f take out these gitignore dirs
4ddc803797 fix slight adamw bug. This chunk was originally copy-pasted from modded-nanogpt, which still seems to have the bug.
a1ccb3dc0b remove rust compilation as rustbpe is now installed from separate package (#416)
061f83c152 delete grad_clip. It appears not to be necessary at all: not only was it buggy (the clipping happened per GPU, before gradient synchronization), it also costs ~2% MFU and doesn't even help. I tried deleting it a while ago and back then it did help, so I'm guessing hyperparameter tuning since then has obviated the reason for it.
Compare 10 commits »
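The bug called out in the grad_clip commit above is an ordering problem: clipping each rank's local gradient before the all-reduce is not equivalent to clipping the synchronized (averaged) gradient once. A minimal scalar sketch of why the two orders disagree, with made-up numbers and a 1-D stand-in for global-norm clipping:

```python
# Hedged illustration of clipping-before-sync vs clipping-after-sync.
# Values and the helper are invented for demonstration; this is not nanochat code.

def clip(g: float, max_norm: float) -> float:
    """Scale a scalar 'gradient' down to max_norm if it exceeds it
    (a 1-D analogue of global gradient-norm clipping)."""
    norm = abs(g)
    return g * (max_norm / norm) if norm > max_norm else g

local_grads = [4.0, -2.0]   # local gradients on two hypothetical ranks
max_norm = 1.0

# Buggy order: clip on each rank, then average (clip before synchronization).
buggy = sum(clip(g, max_norm) for g in local_grads) / len(local_grads)

# Correct order: average (synchronize) first, then clip once globally.
correct = clip(sum(local_grads) / len(local_grads), max_norm)

print(buggy, correct)  # 0.0 1.0 -- the two orders give different updates
```

Per-rank clipping throws away the relative magnitudes of the local gradients before they are combined, so the averaged result can differ arbitrarily from clipping the true global gradient.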
tacit synced commits to refs/pull/400/merge at tacit/nanochat from mirror 2026-01-08 20:52:39 +00:00
061f83c152 delete grad_clip. appears to not be necessary at all. not only was it buggy because the clipping happened per gpu before grad synchronization, but it costs ~2% MFU, and it also doesn't even help. I tried deleting it a while ago and back then it did help. So I'm guessing that some hyperparameter tuning obviated the reason for it since then
e8c30c3b19 add notebook used for scaling laws analysis
3af4dcf6ee also add scaling_laws.sh script if it's a useful reference
4cc605b940 quick pointer to miniseries post in readme for now
Compare 6 commits »