diff --git a/dev/LEADERBOARD.md b/dev/LEADERBOARD.md
index 556ec3c..6fdeaa3 100644
--- a/dev/LEADERBOARD.md
+++ b/dev/LEADERBOARD.md
@@ -36,7 +36,7 @@ Note that:
 - `target-param-data-ratio=8.25` controls the training horizon, which is determined in the script by taking the number of non-embedding model parameters and simply multiplying by this number. The current optimal Tokens:Params ratio can be seen in the defaults of the `base_train.py` script (it is 10.5). 10.5 would produce the *compute optimal* model given the currently measured scaling laws. However, GPT-2 capability is currently somewhere in between a d24 and d26. So to reach it exactly, we want to either overtrain d24 or undertrain d26. In this particular example, I am choosing to slightly undertrain a d26. Note that odd depths (e.g. d25) are not super recommended to use because the math around the transformer sizing and its head dimensions doesn't come out neatly.
 - `--fp8` turns on fp8 training. If your GPU does not support fp8, you can leave this out and the code will simply train in bf16. bf16 is higher precision than fp8, so you can actually expect that you might be able to do fewer steps (lower the `target-param-data-ratio`) to achieve the same capability.
 
-Once you kick off the run, you wait ~3 hours and then at the end you'll see something like:
+Once you kick off the run, you wait ~2 hours and then at the end you'll see something like:
 
 ```
 wandb: Run summary: