• Joined on 2024-05-31
tacit synced commits to refs/pull/370/merge at tacit/nanochat from mirror 2026-02-09 12:09:49 +00:00
1ec0a34779 at 28 and above we start to need batch size 8
ff46300720 tune miniseries just a bit, fairly cosmetic, keep to even depths where the math works out nicely in model sizing
Compare 3 commits »
tacit synced the same two commits (1ec0a34779, ff46300720) to refs/pull/512/merge, refs/pull/511/merge, refs/pull/486/merge, refs/pull/509/merge, refs/pull/483/merge, refs/pull/513/merge, refs/pull/492/merge, and refs/pull/501/merge at tacit/nanochat from mirror between 2026-02-08 19:49:50 and 2026-02-09 03:59:52 +00:00
tacit synced commits to master at tacit/nanochat from mirror 2026-02-08 19:49:49 +00:00
1ec0a34779 at 28 and above we start to need batch size 8
ff46300720 tune miniseries just a bit, fairly cosmetic, keep to even depths where the math works out nicely in model sizing
Compare 2 commits »
tacit synced commits to refs/pull/489/merge at tacit/nanochat from mirror 2026-02-07 11:09:58 +00:00
aeff095e97 better comments/flow on all the hyperparameter transfer stuff, and change the WD scaling from my empirical 1/d^2 to a bit more principled version based on Tepoch. All of that theory is based on AdamW and could be suboptimal for Muon
685271dc8d new optimal ratio for d26 training
Compare 3 commits »
tacit synced the same two commits (aeff095e97, 685271dc8d) to refs/pull/370/merge at tacit/nanochat from mirror 2026-02-07 11:09:57 +00:00
tacit synced commits to refs/pull/141/merge at tacit/nanochat from mirror 2026-02-07 11:09:56 +00:00
aeff095e97 better comments/flow on all the hyperparameter transfer stuff, and change the WD scaling from my empirical 1/d^2 to a bit more principled version based on Tepoch. All of that theory is based on AdamW and could be suboptimal for Muon
685271dc8d new optimal ratio for d26 training
e527521a3f briefly mention batch ramp experimentation too, too weak to merge in my few attempts
96522798f1 docs docs docs
Compare 14 commits »
tacit synced the same two commits (aeff095e97, 685271dc8d) to refs/pull/501/merge, refs/pull/486/merge, refs/pull/492/merge, refs/pull/509/merge, and refs/pull/483/merge at tacit/nanochat from mirror between 2026-02-07 02:59:54 and 02:59:55 +00:00
tacit synced commits to master at tacit/nanochat from mirror 2026-02-07 02:59:54 +00:00
aeff095e97 better comments/flow on all the hyperparameter transfer stuff, and change the WD scaling from my empirical 1/d^2 to a bit more principled version based on Tepoch. All of that theory is based on AdamW and could be suboptimal for Muon
685271dc8d new optimal ratio for d26 training
Compare 2 commits »
tacit synced and deleted reference refs/tags/refs/pull/508/merge at tacit/nanochat from mirror 2026-02-07 02:59:53 +00:00