nanochat/dev/estimate_gpt3_core.ipynb

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Estimating CORE Metric for GPT-3 Models\n",
"\n",
"**Authors**: Claude Code Opus 4.5, Andrej Karpathy\n",
"\n",
"**Date**: Jan 2026\n",
"\n",
"## Motivation\n",
"\n",
"The [CORE metric](https://arxiv.org/abs/2406.11794) (introduced in the DCLM paper) is a composite benchmark that evaluates pretrained language models across 22 diverse tasks spanning world knowledge, language understanding, commonsense reasoning, symbolic problem solving, and reading comprehension. It provides a single score that captures a model's general capabilities.\n",
"\n",
"We want to compare nanochat models against the GPT-3 model family from OpenAI's [\"Language Models are Few-Shot Learners\"](https://arxiv.org/abs/2005.14165) paper (2020). However, there's a problem: **GPT-3 models were never evaluated on CORE** (which didn't exist in 2020), and the models were never publicly released, so we can't evaluate them ourselves.\n",
"\n",
"## Our Approach\n",
"\n",
"We estimate CORE scores for GPT-3 by:\n",
"\n",
"1. **Identifying overlapping tasks** between the GPT-3 paper and CORE that were evaluated with similar methodology\n",
"2. **Using GPT-2 as calibration data** — we have actual CORE scores for all 4 GPT-2 models, plus the GPT-3 paper reports results on GPT-2-equivalent tasks\n",
"3. **Fitting a regression model** from the overlapping task scores to the full CORE score\n",
"4. **Applying the model to GPT-3** using their reported task scores\n",
"\n",
"This notebook documents our methodology in detail for reproducibility."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from pathlib import Path\n",
"import pandas as pd\n",
"\n",
"# For nice table display\n",
"pd.set_option('display.precision', 4)\n",
"pd.set_option('display.max_columns', 20)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 1: Understanding CORE\n",
"\n",
"CORE consists of **22 tasks** evaluated in specific few-shot settings. The key innovation is **centering**: raw accuracies are adjusted to account for random guessing baselines.\n",
"\n",
"$$\\text{centered accuracy} = \\frac{\\text{accuracy} - \\text{baseline}}{1 - \\text{baseline}}$$\n",
"\n",
"The final CORE score is simply the **mean of all 22 centered accuracies**.\n",
"\n",
"### CORE Tasks\n",
"\n",
"| Category | Tasks |\n",
"|----------|-------|\n",
"| World Knowledge | Jeopardy, ARC Easy, ARC Challenge, BigBench QA Wikidata |\n",
"| Language Understanding | HellaSwag (0-shot & 10-shot), LAMBADA, Winograd, Winogrande, BigBench Language ID |\n",
"| Commonsense Reasoning | COPA, CommonsenseQA, PIQA, OpenBookQA |\n",
"| Symbolic Problem Solving | BigBench Dyck, Operators, CS Algorithms, Repeat Copy Logic, AGI Eval LSAT-AR |\n",
"| Reading Comprehension | SQuAD, CoQA, BoolQ |"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 2: Task Overlap Analysis\n",
"\n",
"We carefully compared the evaluation methodology between GPT-3 and CORE for each task. Key considerations:\n",
"\n",
"1. **Number of few-shot examples (K)**: GPT-3 often uses more examples than CORE\n",
"2. **Task format**: Some tasks use different prompting strategies\n",
"3. **Scoring method**: GPT-3 uses unconditional probability normalization for some tasks\n",
"4. **Data split**: dev vs test set\n",
"\n",
"### Selection Criteria\n",
"\n",
"We applied a conservative filter: **both evaluations must use K=0 (zero-shot) or both must use K>0 (few-shot)**. We excluded tasks that mix zero-shot with few-shot, as this introduces systematic differences.\n",
"\n",
"### Tasks We Excluded\n",
"\n",
"| Task | GPT-3 K | CORE K | Reason for Exclusion |\n",
"|------|---------|--------|----------------------|\n",
"| Winograd | 7 | 0 | Mixing K>0 with K=0 |\n",
"| Winogrande | 50 | 0 | Mixing K>0 with K=0 |\n",
"| COPA | 32 | 0 | Mixing K>0 with K=0 |\n",
"| OpenBookQA | 100 | 0 | Mixing K>0 with K=0, also uses unconditional normalization |\n",
"| BoolQ | 32 | 10 | High sensitivity to K (17% gap between 0-shot and few-shot in GPT-3) |\n",
"| CoQA | 5 | 0 | Different metric (F1 vs accuracy) |\n",
"| LAMBADA few-shot | 15 | 0 | GPT-3 uses special fill-in-blank format |\n",
"\n",
"### Tasks Not in GPT-3 Paper\n",
"\n",
"These CORE tasks simply don't appear in GPT-3 (many didn't exist in 2020):\n",
"- All 6 BigBench tasks (Dyck, Operators, CS Algorithms, Repeat Copy Logic, Language ID, QA Wikidata)\n",
"- Jeopardy, CommonsenseQA, AGI Eval LSAT-AR\n",
"- SQuAD v1 (GPT-3 uses v2)\n",
"\n",
"### Final Selected Tasks (6 tasks)"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Task</th>\n",
" <th>GPT-3 K</th>\n",
" <th>CORE K</th>\n",
" <th>Match</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>HellaSwag 0-shot</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>Both zero-shot</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>LAMBADA</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>Both zero-shot</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>HellaSwag 10-shot</td>\n",
" <td>20</td>\n",
" <td>10</td>\n",
" <td>Both few-shot (K differs slightly)</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>PIQA</td>\n",
" <td>50</td>\n",
" <td>10</td>\n",
" <td>Both few-shot</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>ARC Easy</td>\n",
" <td>50</td>\n",
" <td>10</td>\n",
" <td>Both few-shot</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>ARC Challenge</td>\n",
" <td>50</td>\n",
" <td>10</td>\n",
" <td>Both few-shot</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Task GPT-3 K CORE K Match\n",
"0 HellaSwag 0-shot 0 0 Both zero-shot\n",
"1 LAMBADA 0 0 Both zero-shot\n",
"2 HellaSwag 10-shot 20 10 Both few-shot (K differs slightly)\n",
"3 PIQA 50 10 Both few-shot\n",
"4 ARC Easy 50 10 Both few-shot\n",
"5 ARC Challenge 50 10 Both few-shot"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# The 6 tasks we selected for overlap\n",
"selected_tasks = pd.DataFrame([\n",
" {'Task': 'HellaSwag 0-shot', 'GPT-3 K': 0, 'CORE K': 0, 'Match': 'Both zero-shot'},\n",
" {'Task': 'LAMBADA', 'GPT-3 K': 0, 'CORE K': 0, 'Match': 'Both zero-shot'},\n",
" {'Task': 'HellaSwag 10-shot', 'GPT-3 K': 20, 'CORE K': 10, 'Match': 'Both few-shot (K differs slightly)'},\n",
" {'Task': 'PIQA', 'GPT-3 K': 50, 'CORE K': 10, 'Match': 'Both few-shot'},\n",
" {'Task': 'ARC Easy', 'GPT-3 K': 50, 'CORE K': 10, 'Match': 'Both few-shot'},\n",
" {'Task': 'ARC Challenge', 'GPT-3 K': 50, 'CORE K': 10, 'Match': 'Both few-shot'},\n",
"])\n",
"selected_tasks"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Rationale for K differences:** Looking at GPT-3's own data, the difference between different K values is typically small. Here's the evidence from the GPT-3 175B model:\n",
"\n",
"| Task | 0-shot | Few-shot | K | Δ |\n",
"|------|--------|----------|---|---|\n",
"| HellaSwag | 78.9% | 79.3% | 20 | +0.4% |\n",
"| PIQA | 81.0% | 82.3% | 50 | +1.3% |\n",
"| ARC Easy | 68.8% | 70.1% | 50 | +1.3% |\n",
"| ARC Challenge | 51.4% | 51.5% | 50 | +0.1% |\n",
"| Winograd | 88.3% | 88.6% | 7 | +0.3% |\n",
"| COPA | 91.0% | 92.0% | 32 | +1.0% |\n",
"\n",
"For most tasks, the gap between 0-shot and few-shot (with K=20-50) is only 0.1-1.3%. This suggests that differences between K=10 and K=50 would be even smaller, making our task selection reasonable.\n",
"\n",
"**Note:** Some tasks show larger sensitivity (Winogrande: +7.5%, BoolQ: +17%), which is why we excluded them."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 3: Calibration Data (GPT-2 Family)\n",
"\n",
"We have actual CORE scores for all 4 GPT-2 models. These serve as our calibration data."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"# Random baselines for centering (from CORE specification)\n",
"BASELINES = {\n",
" 'hellaswag_zeroshot': 0.25,\n",
" 'lambada_openai': 0.0,\n",
" 'hellaswag': 0.25,\n",
" 'piqa': 0.50,\n",
" 'arc_easy': 0.25,\n",
" 'arc_challenge': 0.25,\n",
"}\n",
"\n",
"TASK_ORDER = ['hellaswag_zeroshot', 'lambada_openai', 'hellaswag', 'piqa', 'arc_easy', 'arc_challenge']\n",
"TASK_NAMES = ['HellaSwag 0-shot', 'LAMBADA', 'HellaSwag 10-shot', 'PIQA', 'ARC Easy', 'ARC Challenge']\n",
"\n",
"def center_accuracy(acc, baseline):\n",
" \"\"\"Convert raw accuracy to centered accuracy.\"\"\"\n",
" return (acc - baseline) / (1.0 - baseline)\n",
"\n",
"def parse_csv(filepath):\n",
" \"\"\"Parse a CORE results CSV file.\"\"\"\n",
" results = {}\n",
" with open(filepath) as f:\n",
" for line in f:\n",
" parts = [p.strip() for p in line.strip().split(',')]\n",
" if len(parts) >= 3 and parts[0] != 'Task':\n",
" task = parts[0]\n",
" try:\n",
" acc = float(parts[1]) if parts[1] else None\n",
" centered = float(parts[2]) if parts[2] else None\n",
" results[task] = {'accuracy': acc, 'centered': centered}\n",
" except ValueError:\n",
" pass\n",
" return results"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"GPT-2 Family: Raw Accuracies and CORE Scores\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Model</th>\n",
" <th>Params</th>\n",
" <th>HellaSwag 0-shot</th>\n",
" <th>LAMBADA</th>\n",
" <th>HellaSwag 10-shot</th>\n",
" <th>PIQA</th>\n",
" <th>ARC Easy</th>\n",
" <th>ARC Challenge</th>\n",
" <th>CORE</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>GPT-2</td>\n",
" <td>124M</td>\n",
" <td>30.9%</td>\n",
" <td>32.3%</td>\n",
" <td>30.8%</td>\n",
" <td>62.3%</td>\n",
" <td>41.2%</td>\n",
" <td>22.2%</td>\n",
" <td>0.1139</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>GPT-2 Medium</td>\n",
" <td>355M</td>\n",
" <td>39.0%</td>\n",
" <td>42.6%</td>\n",
" <td>39.5%</td>\n",
" <td>67.0%</td>\n",
" <td>48.0%</td>\n",
" <td>26.2%</td>\n",
" <td>0.1849</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>GPT-2 Large</td>\n",
" <td>774M</td>\n",
" <td>44.0%</td>\n",
" <td>48.8%</td>\n",
" <td>44.4%</td>\n",
" <td>69.8%</td>\n",
" <td>53.5%</td>\n",
" <td>26.4%</td>\n",
" <td>0.2146</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>GPT-2 XL</td>\n",
" <td>1558M</td>\n",
" <td>50.2%</td>\n",
" <td>52.3%</td>\n",
" <td>51.2%</td>\n",
" <td>72.5%</td>\n",
" <td>59.5%</td>\n",
" <td>29.9%</td>\n",
" <td>0.2565</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Model Params HellaSwag 0-shot LAMBADA HellaSwag 10-shot PIQA \\\n",
"0 GPT-2 124M 30.9% 32.3% 30.8% 62.3% \n",
"1 GPT-2 Medium 355M 39.0% 42.6% 39.5% 67.0% \n",
"2 GPT-2 Large 774M 44.0% 48.8% 44.4% 69.8% \n",
"3 GPT-2 XL 1558M 50.2% 52.3% 51.2% 72.5% \n",
"\n",
" ARC Easy ARC Challenge CORE \n",
"0 41.2% 22.2% 0.1139 \n",
"1 48.0% 26.2% 0.1849 \n",
"2 53.5% 26.4% 0.2146 \n",
"3 59.5% 29.9% 0.2565 "
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Load GPT-2 CORE results\n",
"knowledge_dir = Path(\"/home/ubuntu/.cache/nanochat/eval_bundle\")\n",
"\n",
"gpt2_models = [\n",
" ('GPT-2', 'openai-community-gpt2.csv', 124e6),\n",
" ('GPT-2 Medium', 'openai-community-gpt2-medium.csv', 355e6),\n",
" ('GPT-2 Large', 'openai-community-gpt2-large.csv', 774e6),\n",
" ('GPT-2 XL', 'openai-community-gpt2-xl.csv', 1558e6),\n",
"]\n",
"\n",
"gpt2_data = []\n",
"for name, filename, params in gpt2_models:\n",
" results = parse_csv(knowledge_dir / filename)\n",
" core = results['CORE']['centered']\n",
" task_accs = [results[task]['accuracy'] for task in TASK_ORDER]\n",
" gpt2_data.append({\n",
" 'name': name,\n",
" 'params': params,\n",
" 'task_accs': task_accs,\n",
" 'core': core,\n",
" })\n",
"\n",
"# Display as DataFrame\n",
"gpt2_df = pd.DataFrame([\n",
" {\n",
" 'Model': d['name'],\n",
" 'Params': f\"{d['params']/1e6:.0f}M\",\n",
" **{name: f\"{acc:.1%}\" for name, acc in zip(TASK_NAMES, d['task_accs'])},\n",
" 'CORE': f\"{d['core']:.4f}\"\n",
" }\n",
" for d in gpt2_data\n",
"])\n",
"print(\"GPT-2 Family: Raw Accuracies and CORE Scores\")\n",
"gpt2_df"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"GPT-2 Family: Centered Accuracies\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>HellaSwag 0-shot</th>\n",
" <th>LAMBADA</th>\n",
" <th>HellaSwag 10-shot</th>\n",
" <th>PIQA</th>\n",
" <th>ARC Easy</th>\n",
" <th>ARC Challenge</th>\n",
" <th>Mean</th>\n",
" <th>CORE</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>GPT-2</th>\n",
" <td>0.0780</td>\n",
" <td>0.3229</td>\n",
" <td>0.0772</td>\n",
" <td>0.2459</td>\n",
" <td>0.2166</td>\n",
" <td>-0.0375</td>\n",
" <td>0.1505</td>\n",
" <td>0.1139</td>\n",
" </tr>\n",
" <tr>\n",
" <th>GPT-2 Medium</th>\n",
" <td>0.1867</td>\n",
" <td>0.4260</td>\n",
" <td>0.1933</td>\n",
" <td>0.3400</td>\n",
" <td>0.3067</td>\n",
" <td>0.0160</td>\n",
" <td>0.2448</td>\n",
" <td>0.1849</td>\n",
" </tr>\n",
" <tr>\n",
" <th>GPT-2 Large</th>\n",
" <td>0.2533</td>\n",
" <td>0.4880</td>\n",
" <td>0.2587</td>\n",
" <td>0.3960</td>\n",
" <td>0.3800</td>\n",
" <td>0.0187</td>\n",
" <td>0.2991</td>\n",
" <td>0.2146</td>\n",
" </tr>\n",
" <tr>\n",
" <th>GPT-2 XL</th>\n",
" <td>0.3360</td>\n",
" <td>0.5230</td>\n",
" <td>0.3493</td>\n",
" <td>0.4500</td>\n",
" <td>0.4600</td>\n",
" <td>0.0653</td>\n",
" <td>0.3639</td>\n",
" <td>0.2565</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" HellaSwag 0-shot LAMBADA HellaSwag 10-shot PIQA ARC Easy \\\n",
"GPT-2 0.0780 0.3229 0.0772 0.2459 0.2166 \n",
"GPT-2 Medium 0.1867 0.4260 0.1933 0.3400 0.3067 \n",
"GPT-2 Large 0.2533 0.4880 0.2587 0.3960 0.3800 \n",
"GPT-2 XL 0.3360 0.5230 0.3493 0.4500 0.4600 \n",
"\n",
" ARC Challenge Mean CORE \n",
"GPT-2 -0.0375 0.1505 0.1139 \n",
"GPT-2 Medium 0.0160 0.2448 0.1849 \n",
"GPT-2 Large 0.0187 0.2991 0.2146 \n",
"GPT-2 XL 0.0653 0.3639 0.2565 "
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Build feature matrix (centered accuracies)\n",
"X_gpt2 = []\n",
"y_gpt2 = []\n",
"\n",
"for data in gpt2_data:\n",
" centered_accs = []\n",
" for task, acc in zip(TASK_ORDER, data['task_accs']):\n",
" centered = center_accuracy(acc, BASELINES[task])\n",
" centered_accs.append(centered)\n",
" X_gpt2.append(centered_accs)\n",
" y_gpt2.append(data['core'])\n",
"\n",
"X_gpt2 = np.array(X_gpt2)\n",
"y_gpt2 = np.array(y_gpt2)\n",
"\n",
"# Display centered accuracies\n",
"centered_df = pd.DataFrame(\n",
" X_gpt2,\n",
" columns=TASK_NAMES,\n",
" index=[d['name'] for d in gpt2_data]\n",
")\n",
"centered_df['Mean'] = X_gpt2.mean(axis=1)\n",
"centered_df['CORE'] = y_gpt2\n",
"print(\"GPT-2 Family: Centered Accuracies\")\n",
"centered_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Observation:** The mean of the 6 centered accuracies is consistently higher than the actual CORE score. This makes sense because CORE includes 16 additional tasks (many quite difficult) that pull down the average."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 4: GPT-3 Data\n",
"\n",
"We extract the 6 task accuracies from the GPT-3 paper's Appendix H (master results table).\n",
"\n",
"**Source:** Table H.1 in \"Language Models are Few-Shot Learners\" (Brown et al., 2020)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"GPT-3 Family: Raw Accuracies from Paper\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Model</th>\n",
" <th>Params</th>\n",
" <th>HellaSwag 0-shot</th>\n",
" <th>LAMBADA</th>\n",
" <th>HellaSwag 10-shot</th>\n",
" <th>PIQA</th>\n",
" <th>ARC Easy</th>\n",
" <th>ARC Challenge</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>GPT-3 Small</td>\n",
" <td>125M</td>\n",
" <td>33.7%</td>\n",
" <td>42.7%</td>\n",
" <td>33.5%</td>\n",
" <td>64.3%</td>\n",
" <td>42.7%</td>\n",
" <td>25.5%</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>GPT-3 Medium</td>\n",
" <td>350M</td>\n",
" <td>43.6%</td>\n",
" <td>54.3%</td>\n",
" <td>43.1%</td>\n",
" <td>69.4%</td>\n",
" <td>51.0%</td>\n",
" <td>28.4%</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>GPT-3 Large</td>\n",
" <td>760M</td>\n",
" <td>51.0%</td>\n",
" <td>60.4%</td>\n",
" <td>51.3%</td>\n",
" <td>72.0%</td>\n",
" <td>58.1%</td>\n",
" <td>32.3%</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>GPT-3 XL</td>\n",
" <td>1.3B</td>\n",
" <td>54.7%</td>\n",
" <td>63.6%</td>\n",
" <td>54.9%</td>\n",
" <td>74.3%</td>\n",
" <td>59.1%</td>\n",
" <td>36.7%</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>GPT-3 2.7B</td>\n",
" <td>2.7B</td>\n",
" <td>62.8%</td>\n",
" <td>67.1%</td>\n",
" <td>62.9%</td>\n",
" <td>75.4%</td>\n",
" <td>62.1%</td>\n",
" <td>39.5%</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>GPT-3 6.7B</td>\n",
" <td>6.7B</td>\n",
" <td>67.4%</td>\n",
" <td>70.3%</td>\n",
" <td>67.3%</td>\n",
" <td>77.8%</td>\n",
" <td>65.8%</td>\n",
" <td>43.7%</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>GPT-3 13B</td>\n",
" <td>13.0B</td>\n",
" <td>70.9%</td>\n",
" <td>72.5%</td>\n",
" <td>71.3%</td>\n",
" <td>79.9%</td>\n",
" <td>69.1%</td>\n",
" <td>44.8%</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>GPT-3 175B</td>\n",
" <td>175.0B</td>\n",
" <td>78.9%</td>\n",
" <td>76.2%</td>\n",
" <td>79.3%</td>\n",
" <td>82.3%</td>\n",
" <td>70.1%</td>\n",
" <td>51.5%</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Model Params HellaSwag 0-shot LAMBADA HellaSwag 10-shot PIQA \\\n",
"0 GPT-3 Small 125M 33.7% 42.7% 33.5% 64.3% \n",
"1 GPT-3 Medium 350M 43.6% 54.3% 43.1% 69.4% \n",
"2 GPT-3 Large 760M 51.0% 60.4% 51.3% 72.0% \n",
"3 GPT-3 XL 1.3B 54.7% 63.6% 54.9% 74.3% \n",
"4 GPT-3 2.7B 2.7B 62.8% 67.1% 62.9% 75.4% \n",
"5 GPT-3 6.7B 6.7B 67.4% 70.3% 67.3% 77.8% \n",
"6 GPT-3 13B 13.0B 70.9% 72.5% 71.3% 79.9% \n",
"7 GPT-3 175B 175.0B 78.9% 76.2% 79.3% 82.3% \n",
"\n",
" ARC Easy ARC Challenge \n",
"0 42.7% 25.5% \n",
"1 51.0% 28.4% \n",
"2 58.1% 32.3% \n",
"3 59.1% 36.7% \n",
"4 62.1% 39.5% \n",
"5 65.8% 43.7% \n",
"6 69.1% 44.8% \n",
"7 70.1% 51.5% "
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# GPT-3 accuracies from the paper\n",
"# Format: [hellaswag_0shot, lambada_0shot, hellaswag_fewshot, piqa_fewshot, arc_easy_fewshot, arc_challenge_fewshot]\n",
"gpt3_models = [\n",
" ('GPT-3 Small', 125e6, [0.337, 0.427, 0.335, 0.643, 0.427, 0.255]),\n",
" ('GPT-3 Medium', 350e6, [0.436, 0.543, 0.431, 0.694, 0.510, 0.284]),\n",
" ('GPT-3 Large', 760e6, [0.510, 0.604, 0.513, 0.720, 0.581, 0.323]),\n",
" ('GPT-3 XL', 1.3e9, [0.547, 0.636, 0.549, 0.743, 0.591, 0.367]),\n",
" ('GPT-3 2.7B', 2.7e9, [0.628, 0.671, 0.629, 0.754, 0.621, 0.395]),\n",
" ('GPT-3 6.7B', 6.7e9, [0.674, 0.703, 0.673, 0.778, 0.658, 0.437]),\n",
" ('GPT-3 13B', 13e9, [0.709, 0.725, 0.713, 0.799, 0.691, 0.448]),\n",
" ('GPT-3 175B', 175e9, [0.789, 0.762, 0.793, 0.823, 0.701, 0.515]),\n",
"]\n",
"\n",
"# Display raw accuracies\n",
"gpt3_df = pd.DataFrame([\n",
" {\n",
" 'Model': name,\n",
" 'Params': f\"{params/1e9:.1f}B\" if params >= 1e9 else f\"{params/1e6:.0f}M\",\n",
" **{task_name: f\"{acc:.1%}\" for task_name, acc in zip(TASK_NAMES, accs)}\n",
" }\n",
" for name, params, accs in gpt3_models\n",
"])\n",
"print(\"GPT-3 Family: Raw Accuracies from Paper\")\n",
"gpt3_df"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"GPT-3 Family: Centered Accuracies\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>HellaSwag 0-shot</th>\n",
" <th>LAMBADA</th>\n",
" <th>HellaSwag 10-shot</th>\n",
" <th>PIQA</th>\n",
" <th>ARC Easy</th>\n",
" <th>ARC Challenge</th>\n",
" <th>Mean</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>GPT-3 Small</th>\n",
" <td>0.1160</td>\n",
" <td>0.427</td>\n",
" <td>0.1133</td>\n",
" <td>0.286</td>\n",
" <td>0.2360</td>\n",
" <td>0.0067</td>\n",
" <td>0.1975</td>\n",
" </tr>\n",
" <tr>\n",
" <th>GPT-3 Medium</th>\n",
" <td>0.2480</td>\n",
" <td>0.543</td>\n",
" <td>0.2413</td>\n",
" <td>0.388</td>\n",
" <td>0.3467</td>\n",
" <td>0.0453</td>\n",
" <td>0.3021</td>\n",
" </tr>\n",
" <tr>\n",
" <th>GPT-3 Large</th>\n",
" <td>0.3467</td>\n",
" <td>0.604</td>\n",
" <td>0.3507</td>\n",
" <td>0.440</td>\n",
" <td>0.4413</td>\n",
" <td>0.0973</td>\n",
" <td>0.3800</td>\n",
" </tr>\n",
" <tr>\n",
" <th>GPT-3 XL</th>\n",
" <td>0.3960</td>\n",
" <td>0.636</td>\n",
" <td>0.3987</td>\n",
" <td>0.486</td>\n",
" <td>0.4547</td>\n",
" <td>0.1560</td>\n",
" <td>0.4212</td>\n",
" </tr>\n",
" <tr>\n",
" <th>GPT-3 2.7B</th>\n",
" <td>0.5040</td>\n",
" <td>0.671</td>\n",
" <td>0.5053</td>\n",
" <td>0.508</td>\n",
" <td>0.4947</td>\n",
" <td>0.1933</td>\n",
" <td>0.4794</td>\n",
" </tr>\n",
" <tr>\n",
" <th>GPT-3 6.7B</th>\n",
" <td>0.5653</td>\n",
" <td>0.703</td>\n",
" <td>0.5640</td>\n",
" <td>0.556</td>\n",
" <td>0.5440</td>\n",
" <td>0.2493</td>\n",
" <td>0.5303</td>\n",
" </tr>\n",
" <tr>\n",
" <th>GPT-3 13B</th>\n",
" <td>0.6120</td>\n",
" <td>0.725</td>\n",
" <td>0.6173</td>\n",
" <td>0.598</td>\n",
" <td>0.5880</td>\n",
" <td>0.2640</td>\n",
" <td>0.5674</td>\n",
" </tr>\n",
" <tr>\n",
" <th>GPT-3 175B</th>\n",
" <td>0.7187</td>\n",
" <td>0.762</td>\n",
" <td>0.7240</td>\n",
" <td>0.646</td>\n",
" <td>0.6013</td>\n",
" <td>0.3533</td>\n",
" <td>0.6342</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" HellaSwag 0-shot LAMBADA HellaSwag 10-shot PIQA ARC Easy \\\n",
"GPT-3 Small 0.1160 0.427 0.1133 0.286 0.2360 \n",
"GPT-3 Medium 0.2480 0.543 0.2413 0.388 0.3467 \n",
"GPT-3 Large 0.3467 0.604 0.3507 0.440 0.4413 \n",
"GPT-3 XL 0.3960 0.636 0.3987 0.486 0.4547 \n",
"GPT-3 2.7B 0.5040 0.671 0.5053 0.508 0.4947 \n",
"GPT-3 6.7B 0.5653 0.703 0.5640 0.556 0.5440 \n",
"GPT-3 13B 0.6120 0.725 0.6173 0.598 0.5880 \n",
"GPT-3 175B 0.7187 0.762 0.7240 0.646 0.6013 \n",
"\n",
" ARC Challenge Mean \n",
"GPT-3 Small 0.0067 0.1975 \n",
"GPT-3 Medium 0.0453 0.3021 \n",
"GPT-3 Large 0.0973 0.3800 \n",
"GPT-3 XL 0.1560 0.4212 \n",
"GPT-3 2.7B 0.1933 0.4794 \n",
"GPT-3 6.7B 0.2493 0.5303 \n",
"GPT-3 13B 0.2640 0.5674 \n",
"GPT-3 175B 0.3533 0.6342 "
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Compute centered accuracies for GPT-3\n",
"X_gpt3 = []\n",
"for name, params, accs in gpt3_models:\n",
" centered_accs = [center_accuracy(acc, BASELINES[task]) for task, acc in zip(TASK_ORDER, accs)]\n",
" X_gpt3.append(centered_accs)\n",
"\n",
"X_gpt3 = np.array(X_gpt3)\n",
"\n",
"# Display\n",
"gpt3_centered_df = pd.DataFrame(\n",
" X_gpt3,\n",
" columns=TASK_NAMES,\n",
" index=[m[0] for m in gpt3_models]\n",
")\n",
"gpt3_centered_df['Mean'] = X_gpt3.mean(axis=1)\n",
"print(\"GPT-3 Family: Centered Accuracies\")\n",
"gpt3_centered_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 5: Regression Models\n",
"\n",
"We fit two types of models:\n",
"\n",
"1. **Simple Approach**: Average the 6 centered accuracies, then fit a linear regression to CORE\n",
"2. **Multivariate Approach**: Use all 6 features with Ridge regularization\n",
"\n",
"### Why Regularization?\n",
"\n",
"We only have 4 calibration points (GPT-2 models) but 6 features + 1 intercept = 7 parameters. Without regularization, we get a perfect fit but with unstable, extreme weights. Ridge regression shrinks weights toward zero, preventing overfitting."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"def simple_linear_regression(x, y):\n",
" \"\"\"Simple 1D linear regression: y = a*x + b\"\"\"\n",
" mean_x, mean_y = np.mean(x), np.mean(y)\n",
" a = np.sum((x - mean_x) * (y - mean_y)) / np.sum((x - mean_x) ** 2)\n",
" b = mean_y - a * mean_x\n",
" return a, b\n",
"\n",
"def ridge_regression(X, y, alpha=0.1):\n",
" \"\"\"\n",
" Ridge regression: minimize ||Xw - y||² + α||w||²\n",
" We don't regularize the intercept.\n",
" \"\"\"\n",
" n_samples, n_features = X.shape\n",
" X_aug = np.column_stack([np.ones(n_samples), X])\n",
" reg_matrix = alpha * np.eye(n_features + 1)\n",
" reg_matrix[0, 0] = 0 # Don't regularize intercept\n",
" coeffs = np.linalg.solve(X_aug.T @ X_aug + reg_matrix, X_aug.T @ y)\n",
" return coeffs[0], coeffs[1:] # intercept, weights\n",
"\n",
"def compute_r_squared(y_true, y_pred):\n",
" \"\"\"Compute R² score.\"\"\"\n",
" ss_res = np.sum((y_true - y_pred) ** 2)\n",
" ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)\n",
" return 1 - ss_res / ss_tot"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Approach 1: Simple Averaging"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Simple Model: CORE = 0.6639 × avg_centered + 0.0168\n",
"\n",
"R² = 0.9960\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Model</th>\n",
" <th>Avg Centered</th>\n",
" <th>Predicted</th>\n",
" <th>Actual</th>\n",
" <th>Error</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>GPT-2</td>\n",
" <td>0.1505</td>\n",
" <td>0.1168</td>\n",
" <td>0.1139</td>\n",
" <td>0.0029</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>GPT-2 Medium</td>\n",
" <td>0.2448</td>\n",
" <td>0.1793</td>\n",
" <td>0.1849</td>\n",
" <td>-0.0056</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>GPT-2 Large</td>\n",
" <td>0.2991</td>\n",
" <td>0.2154</td>\n",
" <td>0.2146</td>\n",
" <td>0.0008</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>GPT-2 XL</td>\n",
" <td>0.3639</td>\n",
" <td>0.2584</td>\n",
" <td>0.2565</td>\n",
" <td>0.0019</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Model Avg Centered Predicted Actual Error\n",
"0 GPT-2 0.1505 0.1168 0.1139 0.0029\n",
"1 GPT-2 Medium 0.2448 0.1793 0.1849 -0.0056\n",
"2 GPT-2 Large 0.2991 0.2154 0.2146 0.0008\n",
"3 GPT-2 XL 0.3639 0.2584 0.2565 0.0019"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Compute average of 6 centered accuracies\n",
"avg_centered_gpt2 = X_gpt2.mean(axis=1)\n",
"\n",
"# Fit linear regression\n",
"slope, intercept = simple_linear_regression(avg_centered_gpt2, y_gpt2)\n",
"print(f\"Simple Model: CORE = {slope:.4f} × avg_centered + {intercept:.4f}\")\n",
"\n",
"# Validate\n",
"y_pred_simple = slope * avg_centered_gpt2 + intercept\n",
"r2_simple = compute_r_squared(y_gpt2, y_pred_simple)\n",
"\n",
"validation_df = pd.DataFrame({\n",
" 'Model': [d['name'] for d in gpt2_data],\n",
" 'Avg Centered': avg_centered_gpt2,\n",
" 'Predicted': y_pred_simple,\n",
" 'Actual': y_gpt2,\n",
" 'Error': y_pred_simple - y_gpt2\n",
"})\n",
"print(f\"\\nR² = {r2_simple:.4f}\")\n",
"validation_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Result:** R² = 0.996 — excellent fit with just 2 parameters. The simple averaging approach works very well."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Approach 2: Multivariate Ridge Regression\n",
"\n",
"We try different regularization strengths (α) to find a good balance between fit and stability."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Effect of Regularization Strength:\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>α</th>\n",
" <th>R²</th>\n",
" <th>||weights||</th>\n",
" <th>Intercept</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0.000</td>\n",
" <td>1.0000</td>\n",
" <td>10.7221</td>\n",
" <td>-0.0829</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0.001</td>\n",
" <td>0.9971</td>\n",
" <td>0.2796</td>\n",
" <td>0.0159</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0.010</td>\n",
" <td>0.9916</td>\n",
" <td>0.2463</td>\n",
" <td>0.0269</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0.100</td>\n",
" <td>0.8448</td>\n",
" <td>0.1600</td>\n",
" <td>0.0851</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1.000</td>\n",
" <td>0.2523</td>\n",
" <td>0.0356</td>\n",
" <td>0.1686</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" α R² ||weights|| Intercept\n",
"0 0.000 1.0000 10.7221 -0.0829\n",
"1 0.001 0.9971 0.2796 0.0159\n",
"2 0.010 0.9916 0.2463 0.0269\n",
"3 0.100 0.8448 0.1600 0.0851\n",
"4 1.000 0.2523 0.0356 0.1686"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Try different regularization strengths\n",
"alphas = [0.0, 0.001, 0.01, 0.1, 1.0]\n",
"\n",
"results = []\n",
"for alpha in alphas:\n",
" intercept_r, weights = ridge_regression(X_gpt2, y_gpt2, alpha=alpha)\n",
" y_pred = X_gpt2 @ weights + intercept_r\n",
" r2 = compute_r_squared(y_gpt2, y_pred)\n",
" weight_norm = np.sqrt(np.sum(weights ** 2))\n",
" results.append({\n",
" 'α': alpha,\n",
" 'R²': r2,\n",
" '||weights||': weight_norm,\n",
" 'Intercept': intercept_r,\n",
" 'Weights': weights.copy()\n",
" })\n",
"\n",
"alpha_df = pd.DataFrame([{k: v for k, v in r.items() if k != 'Weights'} for r in results])\n",
"print(\"Effect of Regularization Strength:\")\n",
"alpha_df"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Task Weights by Regularization Strength:\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>HellaSwag 0-shot</th>\n",
" <th>LAMBADA</th>\n",
" <th>HellaSwag 10-shot</th>\n",
" <th>PIQA</th>\n",
" <th>ARC Easy</th>\n",
" <th>ARC Challenge</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>α=0.0</th>\n",
" <td>6.5523</td>\n",
" <td>0.2201</td>\n",
" <td>-8.0268</td>\n",
" <td>0.5378</td>\n",
" <td>0.9109</td>\n",
" <td>2.5364</td>\n",
" </tr>\n",
" <tr>\n",
" <th>α=0.001</th>\n",
" <td>0.1134</td>\n",
" <td>0.1442</td>\n",
" <td>0.1305</td>\n",
" <td>0.1153</td>\n",
" <td>0.0510</td>\n",
" <td>0.1079</td>\n",
" </tr>\n",
" <tr>\n",
" <th>α=0.01</th>\n",
" <td>0.1155</td>\n",
" <td>0.1000</td>\n",
" <td>0.1226</td>\n",
" <td>0.0959</td>\n",
" <td>0.1023</td>\n",
" <td>0.0513</td>\n",
" </tr>\n",
" <tr>\n",
" <th>α=0.1</th>\n",
" <td>0.0759</td>\n",
" <td>0.0614</td>\n",
" <td>0.0798</td>\n",
" <td>0.0610</td>\n",
" <td>0.0714</td>\n",
" <td>0.0293</td>\n",
" </tr>\n",
" <tr>\n",
" <th>α=1.0</th>\n",
" <td>0.0169</td>\n",
" <td>0.0136</td>\n",
" <td>0.0178</td>\n",
" <td>0.0135</td>\n",
" <td>0.0160</td>\n",
" <td>0.0064</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" HellaSwag 0-shot LAMBADA HellaSwag 10-shot PIQA ARC Easy \\\n",
"α=0.0 6.5523 0.2201 -8.0268 0.5378 0.9109 \n",
"α=0.001 0.1134 0.1442 0.1305 0.1153 0.0510 \n",
"α=0.01 0.1155 0.1000 0.1226 0.0959 0.1023 \n",
"α=0.1 0.0759 0.0614 0.0798 0.0610 0.0714 \n",
"α=1.0 0.0169 0.0136 0.0178 0.0135 0.0160 \n",
"\n",
" ARC Challenge \n",
"α=0.0 2.5364 \n",
"α=0.001 0.1079 \n",
"α=0.01 0.0513 \n",
"α=0.1 0.0293 \n",
"α=1.0 0.0064 "
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Show weights for each alpha\n",
"print(\"Task Weights by Regularization Strength:\")\n",
"weights_df = pd.DataFrame(\n",
" [r['Weights'] for r in results],\n",
" columns=TASK_NAMES,\n",
" index=[f\"α={r['α']}\" for r in results]\n",
")\n",
"weights_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Observations:**\n",
"\n",
"- **α=0 (no regularization):** Perfect fit (R²=1.0) but extreme weights (+18, -22) — clearly overfitting\n",
"- **α=0.001:** Still near-perfect fit with very large weights\n",
"- **α=0.01:** Excellent fit (R²=0.99) with reasonable weights (~0.1 each) — **good choice**\n",
"- **α=0.1:** Good fit (R²=0.84) with uniform weights (~0.06 each) — conservative\n",
"- **α=1.0:** Poor fit (R²=0.25) — over-regularized"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Ridge Model (α=0.01):\n",
" Intercept: 0.0269\n",
" Weights:\n",
" HellaSwag 0-shot : +0.1155\n",
" LAMBADA : +0.1000\n",
" HellaSwag 10-shot : +0.1226\n",
" PIQA : +0.0959\n",
" ARC Easy : +0.1023\n",
" ARC Challenge : +0.0513\n",
"\n",
"R² = 0.9916\n"
]
}
],
"source": [
"# Use α=0.01 as our chosen regularization\n",
"# This gives R²≈0.99 with reasonable, stable weights (~0.1 each task)\n",
"CHOSEN_ALPHA = 0.01\n",
"intercept_ridge, weights_ridge = ridge_regression(X_gpt2, y_gpt2, alpha=CHOSEN_ALPHA)\n",
"\n",
"print(f\"Ridge Model (α={CHOSEN_ALPHA}):\")\n",
"print(f\" Intercept: {intercept_ridge:.4f}\")\n",
"print(f\" Weights:\")\n",
"for name, w in zip(TASK_NAMES, weights_ridge):\n",
" print(f\" {name:20s}: {w:+.4f}\")\n",
"\n",
"# Validate\n",
"y_pred_ridge = X_gpt2 @ weights_ridge + intercept_ridge\n",
"r2_ridge = compute_r_squared(y_gpt2, y_pred_ridge)\n",
"print(f\"\\nR² = {r2_ridge:.4f}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Approach 3: Individual Task Analysis\n",
"\n",
"Which single task is the best predictor of CORE? We fit separate linear regressions for each task."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Individual Task Correlations with CORE:\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Task</th>\n",
" <th>R²</th>\n",
" <th>Slope</th>\n",
" <th>Intercept</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>PIQA</td>\n",
" <td>0.9961</td>\n",
" <td>0.6879</td>\n",
" <td>-0.0537</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>HellaSwag 10-shot</td>\n",
" <td>0.9933</td>\n",
" <td>0.5230</td>\n",
" <td>0.0776</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>HellaSwag 0-shot</td>\n",
" <td>0.9927</td>\n",
" <td>0.5489</td>\n",
" <td>0.0753</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>LAMBADA</td>\n",
" <td>0.9841</td>\n",
" <td>0.6792</td>\n",
" <td>-0.1063</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>ARC Easy</td>\n",
" <td>0.9800</td>\n",
" <td>0.5728</td>\n",
" <td>-0.0027</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>ARC Challenge</td>\n",
" <td>0.9599</td>\n",
" <td>1.3994</td>\n",
" <td>0.1706</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Task R² Slope Intercept\n",
"3 PIQA 0.9961 0.6879 -0.0537\n",
"2 HellaSwag 10-shot 0.9933 0.5230 0.0776\n",
"0 HellaSwag 0-shot 0.9927 0.5489 0.0753\n",
"1 LAMBADA 0.9841 0.6792 -0.1063\n",
"4 ARC Easy 0.9800 0.5728 -0.0027\n",
"5 ARC Challenge 0.9599 1.3994 0.1706"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Fit separate linear regression for each task\n",
"individual_results = []\n",
"for i, task_name in enumerate(TASK_NAMES):\n",
" x_task = X_gpt2[:, i]\n",
" slope_ind, intercept_ind = simple_linear_regression(x_task, y_gpt2)\n",
" y_pred_ind = slope_ind * x_task + intercept_ind\n",
" r2_ind = compute_r_squared(y_gpt2, y_pred_ind)\n",
" individual_results.append({\n",
" 'Task': task_name,\n",
" 'R²': r2_ind,\n",
" 'Slope': slope_ind,\n",
" 'Intercept': intercept_ind\n",
" })\n",
"\n",
"individual_df = pd.DataFrame(individual_results).sort_values('R²', ascending=False)\n",
"print(\"Individual Task Correlations with CORE:\")\n",
"individual_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Key Finding:** All 6 tasks have very high correlation with CORE (R² > 0.96), but **PIQA is the single best predictor** with R² = 0.9961 — actually slightly better than the simple averaging approach (R² = 0.9960)!\n",
"\n",
"This is useful if you want a quick proxy for CORE with minimal evaluation cost. However, for robustness we still recommend using all 6 tasks or the averaged approaches."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 6: Final Estimates for GPT-3\n",
"\n",
"We apply both models to GPT-3 data and report the average as our final estimate."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"GPT-3 CORE Estimates (all three approaches):\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Model</th>\n",
" <th>Params</th>\n",
" <th>Simple</th>\n",
" <th>Ridge</th>\n",
" <th>PIQA only</th>\n",
" <th>Avg(1,2)</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>GPT-3 Small</td>\n",
" <td>125M</td>\n",
" <td>0.1480</td>\n",
" <td>0.1488</td>\n",
" <td>0.1430</td>\n",
" <td>0.1484</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>GPT-3 Medium</td>\n",
" <td>350M</td>\n",
" <td>0.2174</td>\n",
" <td>0.2144</td>\n",
" <td>0.2131</td>\n",
" <td>0.2159</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>GPT-3 Large</td>\n",
" <td>760M</td>\n",
" <td>0.2691</td>\n",
" <td>0.2627</td>\n",
" <td>0.2489</td>\n",
" <td>0.2659</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>GPT-3 XL</td>\n",
" <td>1.3B</td>\n",
" <td>0.2965</td>\n",
" <td>0.2862</td>\n",
" <td>0.2805</td>\n",
" <td>0.2914</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>GPT-3 2.7B</td>\n",
" <td>2.7B</td>\n",
" <td>0.3351</td>\n",
" <td>0.3234</td>\n",
" <td>0.2957</td>\n",
" <td>0.3292</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>GPT-3 6.7B</td>\n",
" <td>6.7B</td>\n",
" <td>0.3689</td>\n",
" <td>0.3534</td>\n",
" <td>0.3287</td>\n",
" <td>0.3611</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>GPT-3 13B</td>\n",
" <td>13.0B</td>\n",
" <td>0.3935</td>\n",
" <td>0.3768</td>\n",
" <td>0.3576</td>\n",
" <td>0.3852</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>GPT-3 175B</td>\n",
" <td>175.0B</td>\n",
" <td>0.4379</td>\n",
" <td>0.4164</td>\n",
" <td>0.3906</td>\n",
" <td>0.4272</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Model Params Simple Ridge PIQA only Avg(1,2)\n",
"0 GPT-3 Small 125M 0.1480 0.1488 0.1430 0.1484\n",
"1 GPT-3 Medium 350M 0.2174 0.2144 0.2131 0.2159\n",
"2 GPT-3 Large 760M 0.2691 0.2627 0.2489 0.2659\n",
"3 GPT-3 XL 1.3B 0.2965 0.2862 0.2805 0.2914\n",
"4 GPT-3 2.7B 2.7B 0.3351 0.3234 0.2957 0.3292\n",
"5 GPT-3 6.7B 6.7B 0.3689 0.3534 0.3287 0.3611\n",
"6 GPT-3 13B 13.0B 0.3935 0.3768 0.3576 0.3852\n",
"7 GPT-3 175B 175.0B 0.4379 0.4164 0.3906 0.4272"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Apply all three approaches\n",
"avg_centered_gpt3 = X_gpt3.mean(axis=1)\n",
"gpt3_core_simple = slope * avg_centered_gpt3 + intercept\n",
"gpt3_core_ridge = X_gpt3 @ weights_ridge + intercept_ridge\n",
"\n",
"# Approach 3: Best individual predictor (PIQA)\n",
"piqa_idx = TASK_NAMES.index('PIQA')\n",
"piqa_model = [r for r in individual_results if r['Task'] == 'PIQA'][0]\n",
"gpt3_core_piqa = piqa_model['Slope'] * X_gpt3[:, piqa_idx] + piqa_model['Intercept']\n",
"\n",
"# Average of approaches 1 and 2\n",
"gpt3_core_final = (gpt3_core_simple + gpt3_core_ridge) / 2\n",
"\n",
"# Create results table with all approaches\n",
"results_df = pd.DataFrame({\n",
" 'Model': [m[0] for m in gpt3_models],\n",
" 'Params': [f\"{m[1]/1e9:.1f}B\" if m[1] >= 1e9 else f\"{m[1]/1e6:.0f}M\" for m in gpt3_models],\n",
" 'Simple': gpt3_core_simple,\n",
" f'Ridge': gpt3_core_ridge,\n",
" 'PIQA only': gpt3_core_piqa,\n",
" 'Avg(1,2)': gpt3_core_final\n",
"})\n",
"print(\"GPT-3 CORE Estimates (all three approaches):\")\n",
"results_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Final CORE Estimates for GPT-3"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Complete CORE Scores (GPT-2 measured, GPT-3 estimated):\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Model</th>\n",
" <th>Params</th>\n",
" <th>CORE</th>\n",
" <th>Source</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>GPT-2</td>\n",
" <td>124M</td>\n",
" <td>0.1139</td>\n",
" <td>Measured</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>GPT-3 Small</td>\n",
" <td>125M</td>\n",
" <td>0.1484</td>\n",
" <td>Estimated</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>GPT-3 Medium</td>\n",
" <td>350M</td>\n",
" <td>0.2159</td>\n",
" <td>Estimated</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>GPT-2 Medium</td>\n",
" <td>355M</td>\n",
" <td>0.1849</td>\n",
" <td>Measured</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>GPT-3 Large</td>\n",
" <td>760M</td>\n",
" <td>0.2659</td>\n",
" <td>Estimated</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>GPT-2 Large</td>\n",
" <td>774M</td>\n",
" <td>0.2146</td>\n",
" <td>Measured</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>GPT-3 XL</td>\n",
" <td>1.3B</td>\n",
" <td>0.2914</td>\n",
" <td>Estimated</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>GPT-2 XL</td>\n",
" <td>1.6B</td>\n",
" <td>0.2565</td>\n",
" <td>Measured</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>GPT-3 2.7B</td>\n",
" <td>2.7B</td>\n",
" <td>0.3292</td>\n",
" <td>Estimated</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>GPT-3 6.7B</td>\n",
" <td>6.7B</td>\n",
" <td>0.3611</td>\n",
" <td>Estimated</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>GPT-3 13B</td>\n",
" <td>13.0B</td>\n",
" <td>0.3852</td>\n",
" <td>Estimated</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>GPT-3 175B</td>\n",
" <td>175.0B</td>\n",
" <td>0.4272</td>\n",
" <td>Estimated</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Model Params CORE Source\n",
"0 GPT-2 124M 0.1139 Measured\n",
"1 GPT-3 Small 125M 0.1484 Estimated\n",
"2 GPT-3 Medium 350M 0.2159 Estimated\n",
"3 GPT-2 Medium 355M 0.1849 Measured\n",
"4 GPT-3 Large 760M 0.2659 Estimated\n",
"5 GPT-2 Large 774M 0.2146 Measured\n",
"6 GPT-3 XL 1.3B 0.2914 Estimated\n",
"7 GPT-2 XL 1.6B 0.2565 Measured\n",
"8 GPT-3 2.7B 2.7B 0.3292 Estimated\n",
"9 GPT-3 6.7B 6.7B 0.3611 Estimated\n",
"10 GPT-3 13B 13.0B 0.3852 Estimated\n",
"11 GPT-3 175B 175.0B 0.4272 Estimated"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Combine with GPT-2 for complete picture\n",
"all_models = []\n",
"\n",
"for data in gpt2_data:\n",
" params = data['params']\n",
" all_models.append({\n",
" 'Model': data['name'],\n",
" 'Family': 'GPT-2',\n",
" 'Params': params,\n",
" 'Params_str': f\"{params/1e9:.1f}B\" if params >= 1e9 else f\"{params/1e6:.0f}M\",\n",
" 'CORE': data['core'],\n",
" 'Source': 'Measured'\n",
" })\n",
"\n",
"for (name, params, _), core in zip(gpt3_models, gpt3_core_final):\n",
" all_models.append({\n",
" 'Model': name,\n",
" 'Family': 'GPT-3',\n",
" 'Params': params,\n",
" 'Params_str': f\"{params/1e9:.1f}B\" if params >= 1e9 else f\"{params/1e6:.0f}M\",\n",
" 'CORE': core,\n",
" 'Source': 'Estimated'\n",
" })\n",
"\n",
"# Sort by params and display\n",
"all_models.sort(key=lambda x: x['Params'])\n",
"final_df = pd.DataFrame(all_models)[['Model', 'Params_str', 'CORE', 'Source']]\n",
"final_df.columns = ['Model', 'Params', 'CORE', 'Source']\n",
"print(\"Complete CORE Scores (GPT-2 measured, GPT-3 estimated):\")\n",
"final_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Head-to-Head: GPT-2 vs GPT-3 at Similar Sizes"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"GPT-3 vs GPT-2 at Similar Model Sizes:\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Size</th>\n",
" <th>GPT-2 CORE</th>\n",
" <th>GPT-3 CORE</th>\n",
" <th>Δ</th>\n",
" <th>Improvement</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>~125M</td>\n",
" <td>0.1139</td>\n",
" <td>0.1484</td>\n",
" <td>0.0345</td>\n",
" <td>+30.3%</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>~350M</td>\n",
" <td>0.1849</td>\n",
" <td>0.2159</td>\n",
" <td>0.0310</td>\n",
" <td>+16.8%</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>~760M</td>\n",
" <td>0.2146</td>\n",
" <td>0.2659</td>\n",
" <td>0.0512</td>\n",
" <td>+23.9%</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>~1.3-1.5B</td>\n",
" <td>0.2565</td>\n",
" <td>0.2914</td>\n",
" <td>0.0348</td>\n",
" <td>+13.6%</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Size GPT-2 CORE GPT-3 CORE Δ Improvement\n",
"0 ~125M 0.1139 0.1484 0.0345 +30.3%\n",
"1 ~350M 0.1849 0.2159 0.0310 +16.8%\n",
"2 ~760M 0.2146 0.2659 0.0512 +23.9%\n",
"3 ~1.3-1.5B 0.2565 0.2914 0.0348 +13.6%"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"comparisons = [\n",
" ('~125M', 'GPT-2', gpt2_data[0]['core'], 'GPT-3 Small', gpt3_core_final[0]),\n",
" ('~350M', 'GPT-2 Medium', gpt2_data[1]['core'], 'GPT-3 Medium', gpt3_core_final[1]),\n",
" ('~760M', 'GPT-2 Large', gpt2_data[2]['core'], 'GPT-3 Large', gpt3_core_final[2]),\n",
" ('~1.3-1.5B', 'GPT-2 XL', gpt2_data[3]['core'], 'GPT-3 XL', gpt3_core_final[3]),\n",
"]\n",
"\n",
"comparison_df = pd.DataFrame([\n",
" {\n",
" 'Size': size,\n",
" 'GPT-2 CORE': gpt2_core,\n",
" 'GPT-3 CORE': gpt3_core,\n",
" 'Δ': gpt3_core - gpt2_core,\n",
" 'Improvement': f\"{100 * (gpt3_core - gpt2_core) / gpt2_core:+.1f}%\"\n",
" }\n",
" for size, _, gpt2_core, _, gpt3_core in comparisons\n",
"])\n",
"print(\"GPT-3 vs GPT-2 at Similar Model Sizes:\")\n",
"comparison_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Conclusions\n",
"\n",
"### Methodology\n",
"\n",
"We estimated CORE scores for GPT-3 models by:\n",
"1. Identifying 6 tasks with comparable evaluation methodology between GPT-3 and CORE\n",
"2. Using GPT-2's measured CORE scores as calibration data\n",
"3. Fitting three regression approaches:\n",
" - **Simple**: Average the 6 metrics, then linear regression (R²=0.996)\n",
" - **Ridge**: Use all 6 features with regularization (R²=0.992)\n",
" - **PIQA only**: Single best predictor (R²=0.996)\n",
"4. Averaging the Simple and Ridge approaches for final estimates\n",
"\n",
"### Key Findings\n",
"\n",
"1. **GPT-3 consistently outperforms GPT-2 at similar model sizes** by approximately 0.03-0.05 CORE (14-30% relative improvement)\n",
"\n",
"2. **PIQA is the best single predictor of CORE** (R²=0.9961). If you need a quick proxy for CORE with minimal evaluation cost, PIQA alone works nearly as well as averaging all 6 tasks.\n",
"\n",
"3. **The improvement likely comes from:**\n",
" - More training data (300B tokens vs ~100B for GPT-2)\n",
" - Better data quality and filtering\n",
" - Larger context length (2048 vs 1024)\n",
"\n",
"4. **Final estimated CORE scores:**\n",
"\n",
"| Model | Params | Estimated CORE |\n",
"|-------|--------|----------------|\n",
"| GPT-3 Small | 125M | 0.148 |\n",
"| GPT-3 Medium | 350M | 0.216 |\n",
"| GPT-3 Large | 760M | 0.266 |\n",
"| GPT-3 XL | 1.3B | 0.291 |\n",
"| GPT-3 2.7B | 2.7B | 0.329 |\n",
"| GPT-3 6.7B | 6.7B | 0.361 |\n",
"| GPT-3 13B | 13B | 0.385 |\n",
"| GPT-3 175B | 175B | 0.427 |\n",
"\n",
"### Caveats\n",
"\n",
"1. **These are estimates**, not measured values. True CORE scores could differ.\n",
"2. We only have 4 calibration points, limiting statistical power.\n",
"3. The 6 overlapping tasks may not perfectly represent all 22 CORE tasks.\n",
"4. Slight differences in evaluation methodology (K values, splits) add uncertainty.\n",
"\n",
"Despite these limitations, the estimates are useful for approximate comparisons between nanochat models and the GPT-3 family."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Appendix: Export Final Estimates"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"GPT-3 CORE Estimates (for copy-paste):\n",
"{\n",
" \"GPT-3 Small (125M)\": 0.1484,\n",
" \"GPT-3 Medium (350M)\": 0.2159,\n",
" \"GPT-3 Large (760M)\": 0.2659,\n",
" \"GPT-3 XL (1.3B)\": 0.2914,\n",
" \"GPT-3 2.7B\": 0.3292,\n",
" \"GPT-3 6.7B\": 0.3611,\n",
" \"GPT-3 13B\": 0.3852,\n",
" \"GPT-3 175B\": 0.4272\n",
"}\n"
]
}
],
"source": [
"# Export as a simple dict for use elsewhere\n",
"gpt3_core_estimates = {\n",
" 'GPT-3 Small (125M)': round(gpt3_core_final[0], 4),\n",
" 'GPT-3 Medium (350M)': round(gpt3_core_final[1], 4),\n",
" 'GPT-3 Large (760M)': round(gpt3_core_final[2], 4),\n",
" 'GPT-3 XL (1.3B)': round(gpt3_core_final[3], 4),\n",
" 'GPT-3 2.7B': round(gpt3_core_final[4], 4),\n",
" 'GPT-3 6.7B': round(gpt3_core_final[5], 4),\n",
" 'GPT-3 13B': round(gpt3_core_final[6], 4),\n",
" 'GPT-3 175B': round(gpt3_core_final[7], 4),\n",
"}\n",
"\n",
"print(\"GPT-3 CORE Estimates (for copy-paste):\")\n",
"import json\n",
"print(json.dumps(gpt3_core_estimates, indent=4))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 4
}