tacit synced commits to refs/pull/516/merge, refs/pull/513/merge, refs/pull/515/merge, refs/pull/510/merge, refs/pull/509/merge, and refs/pull/501/merge at tacit/nanochat from mirror 2026-02-10 20:49:58–20:49:59 +00:00
e569b59f92 delete torchao dependency, create our own exact API-matched version of Float8Linear, and document it very well. For some poorly understood reason, the performance is not only ~identical but actually runs 3% faster, despite it being significantly simpler and much less code. I don't fully understand why/how atm.
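The e569b59f92 commit replaces torchao's Float8Linear with an in-repo, API-matched equivalent. Below is a minimal sketch of that drop-in pattern only, not the nanochat or torchao implementation: an nn.Linear subclass that keeps the exact nn.Linear constructor and forward signature while emulating fp8 numerics by round-tripping weights and activations through torch.float8_e4m3fn (real fp8 training uses scaled fp8 matmul kernels instead). The class name and scaling constants are illustrative assumptions.

```python
# Hypothetical sketch (not the nanochat implementation): a drop-in nn.Linear
# replacement that keeps the nn.Linear API while emulating fp8 numerics by
# round-tripping tensors through torch.float8_e4m3fn. Real fp8 training code
# (e.g. torchao's Float8Linear) performs scaled fp8 matmuls in fused kernels.
import torch
import torch.nn as nn
import torch.nn.functional as F


def fp8_roundtrip(t: torch.Tensor) -> torch.Tensor:
    """Quantize to float8_e4m3fn and back, with a per-tensor scale to use the fp8 range."""
    amax = t.abs().amax().clamp(min=1e-12)
    scale = 448.0 / amax          # 448 is the largest normal value representable in e4m3fn
    q = (t * scale).to(torch.float8_e4m3fn)
    return q.to(t.dtype) / scale


class Float8LinearSketch(nn.Linear):
    """API-matched stand-in for nn.Linear: identical __init__, parameters, and forward signature."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = fp8_roundtrip(self.weight)
        x = fp8_roundtrip(x)
        return F.linear(x, w, self.bias)


if __name__ == "__main__":
    torch.manual_seed(0)
    ref = nn.Linear(64, 32)
    fp8 = Float8LinearSketch(64, 32)
    fp8.load_state_dict(ref.state_dict())   # drop-in: same parameter names and shapes
    x = torch.randn(4, 64)
    print((ref(x) - fp8(x)).abs().max())    # small fp8 quantization error
```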
tacit synced commits to refs/pull/489/merge at tacit/nanochat from mirror 2026-02-10 20:49:58 +00:00
e569b59f92 delete torchao dependency, create our own exact API-matched version of Float8Linear (as above)
1ec0a34779 at 28 and above we start to need batch size 8
ff46300720 tune miniseries just a bit, fairly cosmetic, keep to even depths where the math works out nicely in model sizing
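The two tuning commits above (1ec0a34779, ff46300720) concern how the miniseries derives model shape and per-device batch size from depth. The sketch below is a hypothetical illustration of why even depths keep the derived sizes integral and why deeper models force a smaller micro-batch; the aspect ratio, head dimension, and batch sizes are assumptions, not nanochat's actual constants.

```python
# Illustrative only: hypothetical helpers showing why a depth-indexed model series
# might (a) stick to even depths so derived sizes stay integral, and (b) drop the
# per-device batch size once depth reaches 28. The constants (aspect ratio 64,
# head_dim 128, the fallback batch size 16) are assumptions, not nanochat's values.
def model_dims(depth: int, aspect: int = 64, head_dim: int = 128):
    if depth % 2 != 0:
        raise ValueError("keep to even depths so the sizing math works out")
    d_model = aspect * depth            # e.g. depth 26 -> d_model 1664
    n_heads = d_model // head_dim       # = depth / 2, an integer only for even depths
    return d_model, n_heads


def device_batch_size(depth: int) -> int:
    # deeper models need more activation memory per sample, so shrink the micro-batch
    return 8 if depth >= 28 else 16


for depth in (20, 26, 28, 32):
    print(depth, model_dims(depth), device_batch_size(depth))
```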
tacit synced commits to refs/pull/485/merge, refs/pull/492/merge, and refs/pull/486/merge at tacit/nanochat from mirror 2026-02-10 20:49:58 +00:00
e569b59f92 delete torchao dependency, create our own exact API-matched version of Float8Linear (as above)
tacit synced commits to refs/pull/442/merge at tacit/nanochat from mirror 2026-02-10 20:49:57 +00:00
e569b59f92, 1ec0a34779, ff46300720 (as above)
aeff095e97 better comments/flow on all the hyperparameter transfer stuff, and change the WD scaling from my empirical 1/d^2 to a bit more principled version based on Tepoch. All of that theory is based on AdamW and could be suboptimal for Muon
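The aeff095e97 commit moves weight-decay scaling from an empirical 1/d^2 rule to one tied to Tepoch. One common reading of such a rule, sketched below as an assumption rather than the commit's actual formula: decoupled AdamW decay multiplies the weights by (1 - lr*wd) each step, an EMA with timescale of roughly 1/(lr*wd) steps, so choosing wd = 1/(lr * steps_per_epoch) matches the decay timescale to one pass over the data. As the commit notes, this reasoning assumes AdamW and may not carry over to Muon.

```python
# Hedged sketch of the "weight decay as timescale" argument for AdamW; this is a
# generic illustration, not necessarily the rule aeff095e97 actually adopts.
# Decoupled decay multiplies weights by (1 - lr * wd) per step, i.e. an EMA with
# timescale tau ~= 1 / (lr * wd) steps. Setting wd so that tau equals the number
# of steps in one epoch ties the decay to T_epoch rather than to model size.

def steps_per_epoch(dataset_tokens: int, tokens_per_step: int) -> int:
    return max(1, dataset_tokens // tokens_per_step)


def wd_from_epoch(lr: float, dataset_tokens: int, tokens_per_step: int) -> float:
    t_epoch = steps_per_epoch(dataset_tokens, tokens_per_step)
    return 1.0 / (lr * t_epoch)        # so that tau = 1 / (lr * wd) == t_epoch


def wd_empirical(d: int, base_wd: float = 0.1, base_d: int = 20) -> float:
    # the older empirical rule mentioned in the commit: wd scaled like 1/d^2
    # (whether d means depth or width is not stated; constants are assumptions)
    return base_wd * (base_d / d) ** 2


lr, dataset_tokens, tokens_per_step = 0.02, 10_000_000_000, 524_288
print(wd_from_epoch(lr, dataset_tokens, tokens_per_step))   # epoch-based decay
print(wd_empirical(26))                                      # old 1/d^2-style decay
```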
tacit synced commits to master at tacit/nanochat from mirror 2026-02-10 20:49:57 +00:00
e569b59f92 delete torchao dependency, create our own exact API-matched version of Float8Linear (as above)
tacit synced commits to refs/pull/437/merge at tacit/nanochat from mirror 2026-02-10 20:49:57 +00:00
e569b59f92, 1ec0a34779, ff46300720, aeff095e97 (the same four commits as the refs/pull/442/merge sync above)
tacit synced commits to refs/pull/483/merge at tacit/nanochat from mirror 2026-02-10 20:49:57 +00:00
e569b59f92 delete torchao dependency, create our own exact API-matched version of Float8Linear (as above)
tacit synced commits to refs/pull/425/merge at tacit/nanochat from mirror 2026-02-10 04:29:51 +00:00
1ec0a34779, ff46300720, aeff095e97 (the miniseries and hyperparameter-transfer commits listed above)
685271dc8d new optimal ratio for d26 training
tacit synced commits to refs/pull/483/head at tacit/nanochat from mirror 2026-02-10 04:29:51 +00:00
56660c690b Merge branch 'karpathy:master' into master
1ec0a34779, ff46300720, aeff095e97, 685271dc8d (as above)
tacit synced commits to refs/pull/85/merge at tacit/nanochat from mirror 2026-02-10 04:29:51 +00:00
1ec0a34779, ff46300720, aeff095e97, 685271dc8d (as above)
tacit synced commits to refs/pull/483/merge at tacit/nanochat from mirror 2026-02-10 04:29:51 +00:00
56660c690b Merge branch 'karpathy:master' into master
tacit synced commits to refs/pull/141/merge at tacit/nanochat from mirror 2026-02-10 04:29:50 +00:00
1ec0a34779, ff46300720 (as above)
tacit synced commits to refs/pull/510/head at tacit/nanochat from mirror 2026-02-09 20:19:56 +00:00
f999d06d58 Merge branch 'master' into fix/comment
1ec0a34779, ff46300720, aeff095e97, 685271dc8d (as above)