tacit synced commits to refs/pull/407/merge at tacit/nanochat from mirror 2026-01-08 04:32:39 +00:00
7e3a197c43 Merge 1e04f9846e44fd602ac2232db056fe95c891adb8 into 061f83c152
061f83c152 delete grad_clip. appears to not be necessary at all. not only was it buggy because the clipping happened per gpu before grad synchronization, but it costs ~2% MFU, and it also doesn't even help. I tried deleting it a while ago and back then it did help. So I'm guessing that some hyperparameter tuning obviated the reason for it since then
e8c30c3b19 add notebook used for scaling laws analysis
3af4dcf6ee also add scaling_laws.sh script if it's a useful reference
4cc605b940 quick pointer to miniseries post in readme for now
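The grad_clip bug described above can be shown with plain numbers: clipping each GPU's local gradient before the all-reduce generally gives a different result than clipping the synchronized global gradient. A minimal stdlib-only sketch with hypothetical gradient values (not nanochat's actual code):

```python
import math

def clip(grad, max_norm):
    # Scale grad down so its L2 norm is at most max_norm.
    norm = math.sqrt(sum(g * g for g in grad))
    return [g * max_norm / norm for g in grad] if norm > max_norm else grad

# Hypothetical per-GPU gradients for one parameter tensor.
gpu0 = [3.0, 4.0]   # local norm 5.0
gpu1 = [0.1, 0.1]   # local norm ~0.14
max_norm = 1.0

# Buggy order: clip each local gradient, then average (the sync step).
buggy = [(a + b) / 2 for a, b in zip(clip(gpu0, max_norm), clip(gpu1, max_norm))]

# Correct order: average (sync) first, then clip the global gradient.
correct = clip([(a + b) / 2 for a, b in zip(gpu0, gpu1)], max_norm)

print(buggy)    # norm ~0.57: the update was over-shrunk
print(correct)  # norm exactly 1.0, as clipping intends
```

The buggy order effectively clips against per-GPU norms rather than the global norm, so the resulting update can land well inside (or outside) the intended bound.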
tacit synced commits to refs/pull/400/merge at tacit/nanochat from mirror 2026-01-08 04:32:35 +00:00
1b5de29e71 Fix undefined variable in chat_rl after recent refactor
tacit synced commits to master at tacit/nanochat from mirror 2026-01-08 04:32:34 +00:00
ccf4b7f9bf nudge hyperparameters of the base script with the results of the sweeps and miniseries. vocab size down to 32K. D:N ratio from 20 to 8. add miniseries script
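Assuming the D:N ratio in the commit above is the Chinchilla-style data-to-parameter ratio (pretraining tokens per model parameter), dropping it from 20 to 8 cuts the token budget for a given model size to 40% of its old value. A quick illustrative calculation (the 561M parameter count is a made-up example, not a nanochat config):

```python
def token_budget(num_params, dn_ratio):
    # D:N ratio = pretraining tokens (D) per model parameter (N).
    return dn_ratio * num_params

n = 561e6                      # hypothetical model size
old = token_budget(n, 20)      # ~11.2B tokens
new = token_budget(n, 8)       # ~4.5B tokens, i.e. 40% of the old budget
```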
tacit synced commits to master at tacit/nanochat from mirror 2026-01-07 20:22:44 +00:00
1b5de29e71 Fix undefined variable in chat_rl after recent refactor
tacit synced and deleted reference refs/tags/refs/pull/417/merge at tacit/nanochat from mirror 2026-01-07 20:22:44 +00:00
tacit synced and deleted reference refs/tags/refs/pull/40/merge at tacit/nanochat from mirror 2026-01-07 20:22:43 +00:00
tacit synced commits to refs/pull/407/merge at tacit/nanochat from mirror 2026-01-07 12:12:44 +00:00
fdccec1819 Merge 1e04f9846e44fd602ac2232db056fe95c891adb8 into ae0bf52529
ae0bf52529 tune hyperparameters based on overnight sweeps. warmdown_ratio is the biggest free win, increasing 0.2 -> 0.4, and embedding lr can be larger bumping 0.2 -> 0.3
eec0c79563 also add matplotlib dep so that we can have jupyter notebooks
54e59c38ad add notebook on deriving the CORE estimates for the GPT-3 miniseries.
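The warmdown_ratio tuned above presumably sets what fraction of training is spent linearly decaying the learning rate at the end. The exact nanochat schedule is not reproduced here; this is a hedged sketch of a generic warmup-stable-warmdown (trapezoidal) multiplier with assumed parameter names:

```python
def lr_mult(step, total_steps, warmup_ratio=0.0, warmdown_ratio=0.4):
    # Trapezoid: linear warmup, flat plateau, then linear decay toward 0
    # over the final warmdown_ratio fraction of steps.
    warmup_steps = int(warmup_ratio * total_steps)
    warmdown_steps = int(warmdown_ratio * total_steps)
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    if step >= total_steps - warmdown_steps:
        return (total_steps - step) / warmdown_steps
    return 1.0
```

Raising warmdown_ratio from 0.2 to 0.4 simply widens the decay phase, which the commit reports as the biggest free win from the sweeps.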
tacit synced commits to refs/pull/370/merge at tacit/nanochat from mirror 2026-01-06 11:42:35 +00:00
9d4c9b786d many small fixes to base_train: reporting ETA, allowing some additional kwarg flexibility, making sure we don't crash when e.g. depth = 11 - we now calculate the closest num_heads that works
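The "closest num_heads that works" fix above can be sketched as a divisor search. This is not nanochat's actual code: the helper name, the target head dimension, and the 64-per-layer width rule are all assumptions; the only grounded constraint is that the model width must divide evenly by the head count:

```python
def closest_num_heads(model_dim, target_head_dim=128):
    # Pick a head count that divides model_dim evenly, with a per-head
    # dimension as close as possible to target_head_dim.
    divisors = [h for h in range(1, model_dim + 1) if model_dim % h == 0]
    return min(divisors, key=lambda h: abs(model_dim / h - target_head_dim))

# e.g. depth = 11 with an assumed 64*depth width gives model_dim = 704,
# which a 128-dim head does not divide evenly:
print(closest_num_heads(704))  # 8 heads -> head_dim 88
print(closest_num_heads(768))  # 6 heads -> head_dim 128 exactly
```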
tacit synced commits to refs/pull/141/merge at tacit/nanochat from mirror 2026-01-06 11:42:35 +00:00
ae0bf52529 tune hyperparameters based on overnight sweeps. warmdown_ratio is the biggest free win, increasing 0.2 -> 0.4, and embedding lr can be larger bumping 0.2 -> 0.3
eec0c79563 also add matplotlib dep so that we can have jupyter notebooks
54e59c38ad add notebook on deriving the CORE estimates for the GPT-3 miniseries.
Compare 4 commits »
tacit synced commits to refs/pull/85/merge at tacit/nanochat from mirror 2026-01-06 03:33:07 +00:00
ae0bf52529 tune hyperparameters based on overnight sweeps. warmdown_ratio is the biggest free win, increasing 0.2 -> 0.4, and embedding lr can be larger bumping 0.2 -> 0.3
eec0c79563 also add matplotlib dep so that we can have jupyter notebooks
54e59c38ad add notebook on deriving the CORE estimates for the GPT-3 miniseries.
9d4c9b786d many small fixes to base_train: reporting ETA, allowing some additional kwarg flexibility, making sure we don't crash when e.g. depth = 11 - we now calculate the closest num_heads that works
Compare 14 commits »
tacit synced commits to refs/pull/93/merge at tacit/nanochat from mirror 2026-01-06 03:33:07 +00:00
ae0bf52529 tune hyperparameters based on overnight sweeps. warmdown_ratio is the biggest free win, increasing 0.2 -> 0.4, and embedding lr can be larger bumping 0.2 -> 0.3
eec0c79563 also add matplotlib dep so that we can have jupyter notebooks
54e59c38ad add notebook on deriving the CORE estimates for the GPT-3 miniseries.
Compare 4 commits »
tacit synced commits to refs/pull/414/merge at tacit/nanochat from mirror 2026-01-06 03:33:06 +00:00
ae0bf52529 tune hyperparameters based on overnight sweeps. warmdown_ratio is the biggest free win, increasing 0.2 -> 0.4, and embedding lr can be larger bumping 0.2 -> 0.3
eec0c79563 also add matplotlib dep so that we can have jupyter notebooks
54e59c38ad add notebook on deriving the CORE estimates for the GPT-3 miniseries.
Compare 4 commits »
tacit synced commits to refs/pull/59/merge at tacit/nanochat from mirror 2026-01-06 03:33:06 +00:00
ae0bf52529 tune hyperparameters based on overnight sweeps. warmdown_ratio is the biggest free win, increasing 0.2 -> 0.4, and embedding lr can be larger bumping 0.2 -> 0.3
eec0c79563 also add matplotlib dep so that we can have jupyter notebooks
54e59c38ad add notebook on deriving the CORE estimates for the GPT-3 miniseries.
9d4c9b786d many small fixes to base_train: reporting ETA, allowing some additional kwarg flexibility, making sure we don't crash when e.g. depth = 11 - we now calculate the closest num_heads that works
Compare 10 commits »
tacit synced commits to refs/pull/412/merge at tacit/nanochat from mirror 2026-01-06 03:33:03 +00:00
ae0bf52529 tune hyperparameters based on overnight sweeps. warmdown_ratio is the biggest free win, increasing 0.2 -> 0.4, and embedding lr can be larger bumping 0.2 -> 0.3
eec0c79563 also add matplotlib dep so that we can have jupyter notebooks
54e59c38ad add notebook on deriving the CORE estimates for the GPT-3 miniseries.
Compare 4 commits »
tacit synced commits to refs/pull/400/merge at tacit/nanochat from mirror 2026-01-06 03:33:02 +00:00
ae0bf52529 tune hyperparameters based on overnight sweeps. warmdown_ratio is the biggest free win, increasing 0.2 -> 0.4, and embedding lr can be larger bumping 0.2 -> 0.3
eec0c79563 also add matplotlib dep so that we can have jupyter notebooks
54e59c38ad add notebook on deriving the CORE estimates for the GPT-3 miniseries.
Compare 4 commits »
tacit synced commits to refs/pull/405/merge at tacit/nanochat from mirror 2026-01-06 03:33:02 +00:00
ae0bf52529 tune hyperparameters based on overnight sweeps. warmdown_ratio is the biggest free win, increasing 0.2 -> 0.4, and embedding lr can be larger bumping 0.2 -> 0.3
eec0c79563 also add matplotlib dep so that we can have jupyter notebooks
54e59c38ad add notebook on deriving the CORE estimates for the GPT-3 miniseries.
Compare 4 commits »
tacit synced commits to refs/pull/40/merge at tacit/nanochat from mirror 2026-01-06 03:33:01 +00:00
ae0bf52529 tune hyperparameters based on overnight sweeps. warmdown_ratio is the biggest free win, increasing 0.2 -> 0.4, and embedding lr can be larger bumping 0.2 -> 0.3
eec0c79563 also add matplotlib dep so that we can have jupyter notebooks
54e59c38ad add notebook on deriving the CORE estimates for the GPT-3 miniseries.
9d4c9b786d many small fixes to base_train: reporting ETA, allowing some additional kwarg flexibility, making sure we don't crash when e.g. depth = 11 - we now calculate the closest num_heads that works
Compare 14 commits »