tacit

tacit synced commits to refs/pull/492/merge at tacit/nanochat from mirror 2026-02-08 19:49:50 +00:00

19ea3598b3 Merge 7ac837cff8 into 1ec0a34779

1ec0a34779 at 28 and above we start to need batch size 8

ff46300720 tune miniseries just a bit, fairly cosmetic, keep to even depths where the math works out nicely in model sizing

tacit synced commits to refs/pull/501/merge at tacit/nanochat from mirror 2026-02-08 19:49:50 +00:00

9e95480bbd Merge 005daea668 into 1ec0a34779

1ec0a34779 at 28 and above we start to need batch size 8

ff46300720 tune miniseries just a bit, fairly cosmetic, keep to even depths where the math works out nicely in model sizing

tacit synced commits to refs/pull/509/merge at tacit/nanochat from mirror 2026-02-08 19:49:50 +00:00

e3bb618163 Merge 9caf6690a1 into 1ec0a34779

1ec0a34779 at 28 and above we start to need batch size 8

ff46300720 tune miniseries just a bit, fairly cosmetic, keep to even depths where the math works out nicely in model sizing

tacit synced commits to refs/pull/513/merge at tacit/nanochat from mirror 2026-02-08 19:49:50 +00:00

79aa242302 Merge 23acb17f17 into 1ec0a34779

1ec0a34779 at 28 and above we start to need batch size 8

ff46300720 tune miniseries just a bit, fairly cosmetic, keep to even depths where the math works out nicely in model sizing

tacit synced commits to master at tacit/nanochat from mirror 2026-02-08 19:49:49 +00:00

1ec0a34779 at 28 and above we start to need batch size 8

ff46300720 tune miniseries just a bit, fairly cosmetic, keep to even depths where the math works out nicely in model sizing

tacit synced commits to refs/pull/489/merge at tacit/nanochat from mirror 2026-02-07 11:09:58 +00:00

88dea55f24 Merge 79b7b04ca0 into aeff095e97

aeff095e97 better comments/flow on all the hyperparameter transfer stuff, and change the WD scaling from my empirical 1/d^2 to a bit more principled version based on Tepoch. All of that theory is based on AdamW and could be suboptimal for Muon

685271dc8d new optimal ratio for d26 training

tacit synced commits to refs/pull/370/merge at tacit/nanochat from mirror 2026-02-07 11:09:57 +00:00

ad7e3f03bf Merge 9a9b12b1be into aeff095e97

aeff095e97 better comments/flow on all the hyperparameter transfer stuff, and change the WD scaling from my empirical 1/d^2 to a bit more principled version based on Tepoch. All of that theory is based on AdamW and could be suboptimal for Muon

685271dc8d new optimal ratio for d26 training

tacit synced commits to refs/pull/141/merge at tacit/nanochat from mirror 2026-02-07 11:09:56 +00:00

fce7fea168 Merge 65865df300 into aeff095e97

aeff095e97 better comments/flow on all the hyperparameter transfer stuff, and change the WD scaling from my empirical 1/d^2 to a bit more principled version based on Tepoch. All of that theory is based on AdamW and could be suboptimal for Muon

685271dc8d new optimal ratio for d26 training

e527521a3f briefly mention batch ramp experimentation too, too weak to merge in my few attempts

96522798f1 docs docs docs

14 commits in total; 5 listed above.

tacit synced commits to refs/pull/486/merge at tacit/nanochat from mirror 2026-02-07 02:59:55 +00:00

e1d6e5ff9c Merge 7c0eb3f00b into aeff095e97

aeff095e97 better comments/flow on all the hyperparameter transfer stuff, and change the WD scaling from my empirical 1/d^2 to a bit more principled version based on Tepoch. All of that theory is based on AdamW and could be suboptimal for Muon

685271dc8d new optimal ratio for d26 training

tacit synced commits to refs/pull/492/merge at tacit/nanochat from mirror 2026-02-07 02:59:55 +00:00

8282bc3f77 Merge 7ac837cff8 into aeff095e97

aeff095e97 better comments/flow on all the hyperparameter transfer stuff, and change the WD scaling from my empirical 1/d^2 to a bit more principled version based on Tepoch. All of that theory is based on AdamW and could be suboptimal for Muon

685271dc8d new optimal ratio for d26 training

tacit synced commits to refs/pull/501/merge at tacit/nanochat from mirror 2026-02-07 02:59:55 +00:00

2134348a5c Merge 005daea668 into aeff095e97

aeff095e97 better comments/flow on all the hyperparameter transfer stuff, and change the WD scaling from my empirical 1/d^2 to a bit more principled version based on Tepoch. All of that theory is based on AdamW and could be suboptimal for Muon

685271dc8d new optimal ratio for d26 training

tacit synced commits to refs/pull/509/merge at tacit/nanochat from mirror 2026-02-07 02:59:55 +00:00

1def718457 Merge 9caf6690a1 into aeff095e97

aeff095e97 better comments/flow on all the hyperparameter transfer stuff, and change the WD scaling from my empirical 1/d^2 to a bit more principled version based on Tepoch. All of that theory is based on AdamW and could be suboptimal for Muon

685271dc8d new optimal ratio for d26 training

tacit synced commits to master at tacit/nanochat from mirror 2026-02-07 02:59:54 +00:00

aeff095e97 better comments/flow on all the hyperparameter transfer stuff, and change the WD scaling from my empirical 1/d^2 to a bit more principled version based on Tepoch. All of that theory is based on AdamW and could be suboptimal for Muon

685271dc8d new optimal ratio for d26 training

tacit synced commits to refs/pull/483/merge at tacit/nanochat from mirror 2026-02-07 02:59:54 +00:00

851463e628 Merge 5e5c609b05 into aeff095e97

aeff095e97 better comments/flow on all the hyperparameter transfer stuff, and change the WD scaling from my empirical 1/d^2 to a bit more principled version based on Tepoch. All of that theory is based on AdamW and could be suboptimal for Muon

685271dc8d new optimal ratio for d26 training

tacit synced and deleted reference refs/tags/refs/pull/508/merge at tacit/nanochat from mirror 2026-02-07 02:59:53 +00:00

tacit synced commits to refs/pull/489/merge at tacit/nanochat from mirror 2026-02-06 18:49:50 +00:00

993b75d121 Merge 79b7b04ca0 into e527521a3f

e527521a3f briefly mention batch ramp experimentation too, too weak to merge in my few attempts

96522798f1 docs docs docs

5fdd5cdb24 new leaderboard record via new auto-calculated optimal batch size. for d26 it is 1M, up from 0.5M that was default earlier

2c062aaa94 nit: don't mutate args, create new var for total_batch_size

9 commits in total; 5 listed above.

tacit synced commits to refs/pull/414/merge at tacit/nanochat from mirror 2026-02-06 18:49:49 +00:00

8c71b092d2 Merge 57bcf6786e into e527521a3f

e527521a3f briefly mention batch ramp experimentation too, too weak to merge in my few attempts

96522798f1 docs docs docs

5fdd5cdb24 new leaderboard record via new auto-calculated optimal batch size. for d26 it is 1M, up from 0.5M that was default earlier

2c062aaa94 nit: don't mutate args, create new var for total_batch_size

9 commits in total; 5 listed above.

tacit synced commits to refs/pull/425/merge at tacit/nanochat from mirror 2026-02-06 18:49:49 +00:00

f94cee74eb Merge eebab89a11 into e527521a3f

e527521a3f briefly mention batch ramp experimentation too, too weak to merge in my few attempts

96522798f1 docs docs docs

5fdd5cdb24 new leaderboard record via new auto-calculated optimal batch size. for d26 it is 1M, up from 0.5M that was default earlier

2c062aaa94 nit: don't mutate args, create new var for total_batch_size

9 commits in total; 5 listed above.

tacit synced commits to refs/pull/485/merge at tacit/nanochat from mirror 2026-02-06 18:49:49 +00:00

9260fbde65 Merge 181e7f1c15 into e527521a3f

e527521a3f briefly mention batch ramp experimentation too, too weak to merge in my few attempts

96522798f1 docs docs docs

5fdd5cdb24 new leaderboard record via new auto-calculated optimal batch size. for d26 it is 1M, up from 0.5M that was default earlier

2c062aaa94 nit: don't mutate args, create new var for total_batch_size

9 commits in total; 5 listed above.

tacit synced commits to refs/pull/151/merge at tacit/nanochat from mirror 2026-02-06 18:49:49 +00:00

b68a941741 Merge 7f3154f025 into e527521a3f

e527521a3f briefly mention batch ramp experimentation too, too weak to merge in my few attempts

96522798f1 docs docs docs

5fdd5cdb24 new leaderboard record via new auto-calculated optimal batch size. for d26 it is 1M, up from 0.5M that was default earlier

2c062aaa94 nit: don't mutate args, create new var for total_batch_size

9 commits in total; 5 listed above.