
Statistical significance for prompt testing: how many runs do you actually need?

## The "I ran it 10 times" problem

Somebody on your team changes a prompt, runs it 10 times, sees 8 successes instead of the usual 7, and declares the new version 14% better. This happens constantly. It's also meaningless.

With 10 runs, the difference between 7/10 and 8/10 has a p-value of about 0.6. For reference, you want p < 0.05 to have any confidence. A p-value of 0.6 means there's a 60% chance you'd see a difference at least this large even if the two prompts were identical. You've learned nothing except that coins sometimes land heads.
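You can reproduce a number like this with a pooled two-proportion z-test — one common choice for comparing success rates (exact p-values shift a bit depending on which test you pick). A minimal sketch, with `two_proportion_p_value` as an illustrative name:

```python
from math import erf, sqrt

def two_proportion_p_value(wins_a, n_a, wins_b, n_b):
    """Two-sided p-value for comparing two success rates (pooled z-test)."""
    p_a, p_b = wins_a / n_a, wins_b / n_b
    pooled = (wins_a + wins_b) / (n_a + n_b)           # pooled success rate
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = abs(p_a - p_b) / se
    # Convert z to a two-sided p-value via the standard normal CDF
    return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))

print(two_proportion_p_value(7, 10, 8, 10))  # ~0.61
```

At 10 runs per variant, even a 10-point gap in observed success rate is statistically indistinguishable from noise.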

I'm not trying to be harsh. Sample size intuition is genuinely bad for humans. We see patterns in noise. It's the same reason people think they have a "hot hand" in basketball or that a stock is "due for a correction." Our brains are pattern-matching machines running on insufficient data.

## The math, briefly

Statistical significance for comparing two proportions (like success rates) uses a formula that depends on four things:

- **Baseline rate**: Your current prompt's success rate (e.g., 75%)
- **Minimum detectable effect (MDE)**: The smallest improvement you care about (e.g., 5 percentage points)
- **Significance level (alpha)**: Usually 0.05, meaning you accept a 5% chance of a false positive
- **Power**: Usually 0.80, meaning you want an 80% chance of detecting a real effect

The formula itself involves z-scores and pooled variances. I'm going to skip the derivation because you can look it up, and what matters more is the output. Here are the numbers that actually matter for prompt testing:
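If you'd rather compute than look up, here is the textbook two-sided version of that formula as a sketch (`n_per_variant` is an illustrative name, not a library function). Note that it tends to give somewhat larger numbers than the rounded figures in this post — the conservative direction:

```python
from math import ceil, sqrt
from statistics import NormalDist

def n_per_variant(p1, p2, alpha=0.05, power=0.80):
    """Per-variant sample size for a two-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for alpha=0.05
    z_beta = NormalDist().inv_cdf(power)           # 0.84 for power=0.80
    p_bar = (p1 + p2) / 2                          # pooled rate under H0
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p1 - p2) ** 2)

print(n_per_variant(0.75, 0.85))  # 250
```

Requires Python 3.8+ for `NormalDist.inv_cdf`.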

## Sample size lookup table

All numbers are per variant (so double for total tasks):

| Baseline rate | Detect 20% relative improvement | Detect 10% relative | Detect 5% relative |
|---|---|---|---|
| 50% | 200 | 800 | 3,200 |
| 70% | 250 | 950 | 3,800 |
| 80% | 300 | 1,100 | 4,400 |
| 90% | 500 | 2,000 | 8,000 |
| 95% | 1,000 | 4,000 | 16,000 |

Notice something? As your baseline gets better, you need dramatically more data to detect improvements. Going from 50% to 60% (a 20% relative improvement) takes 200 tasks per variant. Going from 90% to 95% (a smaller absolute change, and only about a 5.6% relative improvement) takes several thousand.

This is why optimizing an already-good prompt is so much harder than fixing a bad one. The better your starting point, the more data you need to confirm that a change actually helped.

## Why 10 runs is never enough

Let's make this concrete. Say your prompt has a 75% success rate and you want to test a change you think will push it to 85%. That's a meaningful improvement -- 10 percentage points. You still need about 175 tasks per variant to detect it reliably.

If the improvement is subtler -- 75% to 80% -- you need around 700 per variant.

And if you're chasing a 75% to 77% improvement? About 3,000 per variant.

Ten runs can only reliably detect a change from 75% to roughly 100%. If your improvement is that dramatic, you didn't need a test. You needed a bug fix.
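You can see this noise directly with a quick simulation: run two *identical* prompts (same 75% true success rate) for 10 tasks each, many times over, and count how often their observed success counts differ anyway. A minimal sketch:

```python
import random

random.seed(0)  # deterministic for reproducibility

def runs(p, n):
    """Simulate n tasks with true success probability p; return success count."""
    return sum(random.random() < p for _ in range(n))

TRIALS = 20_000
differ = sum(abs(runs(0.75, 10) - runs(0.75, 10)) >= 1 for _ in range(TRIALS))
frac = differ / TRIALS
print(frac)  # roughly 0.8: identical prompts "differ" most of the time
```

About 80% of the time, two prompts with exactly the same true success rate will score differently over 10 runs. That's the baseline noise any real improvement has to beat.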

## Continuous metrics need less data

The numbers above are for binary outcomes (success/fail). If your metric is continuous -- like response quality on a 1-10 scale, token cost, or latency in milliseconds -- you typically need less data. Continuous metrics carry more information per observation.

For continuous metrics, sample size depends on the effect size relative to the standard deviation. A rough guide:

- **Large effect** (improvement > 0.8 standard deviations): ~25 per variant
- **Medium effect** (0.5 SD): ~65 per variant
- **Small effect** (0.2 SD): ~400 per variant

So if your grader model scores responses on a 1-10 scale with a standard deviation of about 2 points, and you're looking for a 1-point improvement, that's a 0.5 SD effect. You need around 65 runs per variant. Much more manageable than the binary case.
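The continuous case has an even simpler formula: per-variant n scales with the square of (z-scores over effect size in SDs). A sketch using the normal approximation (`n_continuous` is an illustrative name; an exact t-test calculation adds a few more samples, which is why the guide above says ~65 rather than 63):

```python
from math import ceil
from statistics import NormalDist

def n_continuous(effect_sd, alpha=0.05, power=0.80):
    """Per-variant n for a two-sample comparison; effect in standard deviations."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return ceil(2 * (z / effect_sd) ** 2)

print(n_continuous(0.5))  # 63: a 1-point gain on a 1-10 scale with SD ~2
```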

This is a good argument for using continuous quality scores instead of binary pass/fail when you can. A human or model grader that rates responses 1-10 gives you more statistical power per task than a simple "did it work" check.

## When you can get away with less

Not every experiment needs full statistical rigor. Here are situations where smaller samples are reasonable:

**You're screening, not deciding.** If you have 10 prompt variants and want to narrow it to 2-3 for a proper test, running 50 tasks each is fine for eliminating the obvious losers. Just don't declare any of them the winner at this stage.

**The effect is huge.** If your change fixes a category of failure that accounts for 30% of all failures, you'll see it in 50-100 tasks. Statistical tests will confirm what's already obvious.

**You're testing a guardrail.** If you added a rule like "never output raw SQL" and you're testing whether it works, you just need enough adversarial inputs to probe the boundary. This is more like security testing than A/B testing. 50-100 targeted test cases is often enough.

**The cost of being wrong is low.** If this is an internal tool and the worst case is slightly worse responses for a week before you revert, you can accept less certainty. Run 100 tasks, see a directional improvement, ship it, and monitor.

## When you need more

Some situations demand extra rigor:

**High-stakes decisions.** If your agent handles medical triage, financial transactions, or legal documents, you want alpha=0.01 (99% confidence) and power=0.95. This roughly doubles the sample sizes in the table above.

**You're testing multiple variants.** If you're comparing 5 prompt variants simultaneously, you need to correct for multiple comparisons. The Bonferroni correction is the simplest: divide your alpha by the number of comparisons. With 5 variants and alpha=0.05, you'd need each pairwise comparison to hit p < 0.01. This requires about 50% more data per variant.
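The "about 50% more data" figure falls out of the formula: required n scales with the squared sum of the two z-scores, so you can compare that factor at the corrected and uncorrected alpha. A quick check (`z_factor` is an illustrative name):

```python
from statistics import NormalDist

def z_factor(alpha, power=0.80):
    """(z_{alpha/2} + z_power)^2 -- the factor sample size scales with."""
    nd = NormalDist()
    return (nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(power)) ** 2

# Bonferroni with 5 comparisons: test each at alpha = 0.05 / 5 = 0.01
ratio = z_factor(0.01) / z_factor(0.05)
print(round(ratio, 2))  # 1.49, i.e. about 50% more data per variant
```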

**Your metric is noisy.** If the same prompt gives wildly different quality scores on the same input (high variance), you need more data to see through the noise. Run a pilot with 30-50 tasks to estimate variance before sizing the full experiment.

## A practical workflow

Here's what I actually recommend for most teams:

1. **Estimate your baseline.** Run 200 tasks through your current prompt. Measure your primary metric.
2. **Decide your MDE.** What's the smallest improvement that would be worth the effort of maintaining a different prompt? Usually 5-10% relative improvement.
3. **Look up the sample size.** Use the table above or a power calculator. ClawSplit calculates this for you.
4. **Run the experiment.** Don't peek until you hit the required sample size. If you must peek, use sequential testing methods that adjust for repeated looks (ClawSplit supports this).
5. **Make the call.** If the variant wins with statistical significance and the effect size is practically meaningful, ship it. If not, keep the control and move on.

The whole process might take a few days of production traffic. That's fine. A few days of patience beats months of shipping changes that might be improvements, might be regressions, and you'll never actually know.
