• Joined on 2024-05-31
tacit synced commits to refs/pull/370/merge at tacit/nanochat from mirror 2026-02-09 12:09:49 +00:00
1ec0a34779 at 28 and above we start to need batch size 8
ff46300720 tune miniseries just a bit, fairly cosmetic, keep to even depths where the math works out nicely in model sizing
Compare 3 commits »
tacit synced the same two commits (1ec0a34779, ff46300720) to refs/pull/512/merge, refs/pull/511/merge, refs/pull/486/merge, refs/pull/509/merge, refs/pull/483/merge, refs/pull/513/merge, refs/pull/492/merge, and refs/pull/501/merge at tacit/nanochat from mirror between 2026-02-08 19:49:50 and 2026-02-09 03:59:52 +00:00
tacit synced commits to master at tacit/nanochat from mirror 2026-02-08 19:49:49 +00:00
1ec0a34779 at 28 and above we start to need batch size 8
ff46300720 tune miniseries just a bit, fairly cosmetic, keep to even depths where the math works out nicely in model sizing
Compare 2 commits »
tacit synced commits to refs/pull/489/merge at tacit/nanochat from mirror 2026-02-07 11:09:58 +00:00
aeff095e97 better comments/flow on all the hyperparameter transfer stuff, and change the WD scaling from my empirical 1/d^2 to a bit more principled version based on Tepoch. All of that theory is based on AdamW and could be suboptimal for Muon
685271dc8d new optimal ratio for d26 training
Compare 3 commits »
tacit synced the same two commits (aeff095e97, 685271dc8d) to refs/pull/370/merge at tacit/nanochat from mirror 2026-02-07 11:09:57 +00:00
tacit synced commits to refs/pull/141/merge at tacit/nanochat from mirror 2026-02-07 11:09:56 +00:00
aeff095e97 better comments/flow on all the hyperparameter transfer stuff, and change the WD scaling from my empirical 1/d^2 to a bit more principled version based on Tepoch. All of that theory is based on AdamW and could be suboptimal for Muon
685271dc8d new optimal ratio for d26 training
e527521a3f briefly mention batch ramp experimentation too, too weak to merge in my few attempts
96522798f1 docs docs docs
Compare 14 commits »
tacit synced the same two commits (aeff095e97, 685271dc8d) to refs/pull/501/merge, refs/pull/486/merge, refs/pull/492/merge, refs/pull/509/merge, and refs/pull/483/merge at tacit/nanochat from mirror between 2026-02-07 02:59:54 and 02:59:55 +00:00
tacit synced commits to master at tacit/nanochat from mirror 2026-02-07 02:59:54 +00:00
aeff095e97 better comments/flow on all the hyperparameter transfer stuff, and change the WD scaling from my empirical 1/d^2 to a bit more principled version based on Tepoch. All of that theory is based on AdamW and could be suboptimal for Muon
685271dc8d new optimal ratio for d26 training
Compare 2 commits »
tacit synced and deleted reference refs/tags/refs/pull/508/merge at tacit/nanochat from mirror 2026-02-07 02:59:53 +00:00