fix: correct CSV extraction in scaling_laws.sh

Two bugs caused all parameter columns and tokens_trained to be silently empty/wrong in the results CSV:

1. Parameter grep patterns did not account for the padded key format. base_train.py prints parameters as `{key:24s}: {value:,}`, e.g. `wte                     : 33,554,432`, so patterns like `grep "wte:"` never matched. Fixed by using `grep -P "wte\s+:"` to handle the spaces.

2. tokens_trained was hardcoded as `NUM_ITERS * 524288`, but the batch size is auto-computed by base_train.py and may differ from 524288 depending on the FLOPs budget and model size. Fixed by extracting the actual value from the log line "Total number of training tokens: X".
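
The first bug can be reproduced without a training run. The sketch below is only an illustration: it assumes GNU grep (for `-P`) and uses `printf '%-24s'` to imitate the `{key:24s}` padding described above. The variable names `line`, `old`, and `new` are not part of scaling_laws.sh.

```shell
# Imitate the padded log line: key left-aligned to 24 characters, then ": value".
# (Assumed shape, based on the `{key:24s}: {value:,}` format quoted above.)
line=$(printf '%-24s: %s' "wte" "33,554,432")

# Old pattern: "wte:" never appears, because spaces sit between key and colon.
old=$(printf '%s\n' "$line" | grep "wte:" | tail -1 | grep -oP '[\d,]+' | tr -d ',')

# New pattern: \s+ absorbs the padding before the colon.
new=$(printf '%s\n' "$line" | grep -P "wte\s+:" | tail -1 | grep -oP '[\d,]+' | tr -d ',')

echo "old='${old}' new='${new}'"
```

The old pipeline yields an empty string rather than an error, which is why the CSV columns were silently blank.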
2026-03-07 01:40:30 +00:00 · 2026-02-28 16:37:04 +00:00 · 2026-02-28 16:37:04 +00:00 · fb2be07e17
commit fb2be07e17
parent c7ba252142
1 changed file with 9 additions and 9 deletions
--- a/runs/scaling_laws.sh
+++ b/runs/scaling_laws.sh
@@ -86,17 +86,17 @@ for flops in "${FLOPS_BUDGETS[@]}"; do
        LOG_FILE="$RESULTS_DIR/${TAG}_train.log"

        # Extract detailed parameter counts (for scaling law analysis with different conventions)
-        PARAMS_WTE=$(grep "wte:" "$LOG_FILE" | tail -1 | grep -oP '[\d,]+' | tr -d ',')
-        PARAMS_BIGRAM=$(grep "bigram_embed:" "$LOG_FILE" | tail -1 | grep -oP '[\d,]+' | tr -d ',')
-        PARAMS_VE=$(grep "value_embeds:" "$LOG_FILE" | tail -1 | grep -oP '[\d,]+' | tr -d ',')
-        PARAMS_LM=$(grep "lm_head:" "$LOG_FILE" | tail -1 | grep -oP '[\d,]+' | tr -d ',')
-        PARAMS_TRANSFORMER=$(grep "transformer_matrices:" "$LOG_FILE" | tail -1 | grep -oP '[\d,]+' | tr -d ',')
-        PARAMS_SCALARS=$(grep "scalars:" "$LOG_FILE" | tail -1 | grep -oP '[\d,]+' | tr -d ',')
-        PARAMS_TOTAL=$(grep "total:" "$LOG_FILE" | tail -1 | grep -oP '[\d,]+' | tr -d ',')
+        PARAMS_WTE=$(grep -P "wte\s+:" "$LOG_FILE" | tail -1 | grep -oP '[\d,]+' | tr -d ',')
+        PARAMS_BIGRAM=$(grep -P "bigram_embed\s+:" "$LOG_FILE" | tail -1 | grep -oP '[\d,]+' | tr -d ',')
+        PARAMS_VE=$(grep -P "value_embeds\s+:" "$LOG_FILE" | tail -1 | grep -oP '[\d,]+' | tr -d ',')
+        PARAMS_LM=$(grep -P "lm_head\s+:" "$LOG_FILE" | tail -1 | grep -oP '[\d,]+' | tr -d ',')
+        PARAMS_TRANSFORMER=$(grep -P "transformer_matrices\s+:" "$LOG_FILE" | tail -1 | grep -oP '[\d,]+' | tr -d ',')
+        PARAMS_SCALARS=$(grep -P "scalars\s+:" "$LOG_FILE" | tail -1 | grep -oP '[\d,]+' | tr -d ',')
+        PARAMS_TOTAL=$(grep -P "total\s+:" "$LOG_FILE" | tail -1 | grep -oP '[\d,]+' | tr -d ',')

        NUM_ITERS=$(grep "Calculated number of iterations" "$LOG_FILE" | tail -1 | sed 's/.*: //' | tr -d ',')
-        # Calculate tokens trained (iterations * batch_size, default 524288)
-        TOKENS_TRAINED=$((NUM_ITERS * 524288))
+        # Extract actual tokens trained from log (batch size is auto-computed, may differ from 524288)
+        TOKENS_TRAINED=$(grep "Total number of training tokens:" "$LOG_FILE" | tail -1 | grep -oP '[\d,]+' | tr -d ',')
        # Model dim
        MODEL_DIM=$((d * 64))
        # Val BPB from final eval
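
The tokens_trained change in this hunk can be checked in isolation. The sketch below is a rough illustration, assuming GNU grep. The log excerpt is hypothetical: its numbers are made up, and the iterations line is shortened to the minimum the `sed` expression needs. `TOKENS_HARDCODED` is a name introduced here for comparison only.

```shell
# Hypothetical log excerpt; the values are invented for illustration.
LOG_FILE=$(mktemp)
cat > "$LOG_FILE" <<'EOF'
Calculated number of iterations: 1,000
Total number of training tokens: 262,144,000
EOF

NUM_ITERS=$(grep "Calculated number of iterations" "$LOG_FILE" | tail -1 | sed 's/.*: //' | tr -d ',')

# Old approach: assumes every iteration processes exactly 524288 tokens.
TOKENS_HARDCODED=$((NUM_ITERS * 524288))

# New approach: read the total that the training script itself reported.
TOKENS_TRAINED=$(grep "Total number of training tokens:" "$LOG_FILE" | tail -1 | grep -oP '[\d,]+' | tr -d ',')

echo "hardcoded=$TOKENS_HARDCODED extracted=$TOKENS_TRAINED"
rm -f "$LOG_FILE"
```

In this made-up case the batch size is half of 524288, so the hardcoded estimate is twice the extracted value. Note that `grep -oP '[\d,]+'` returns every run of digits on the line, so the extraction relies on the matched line containing no other numbers.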