Statistics in plain language

You do not need a statistics degree to use ClawSplit. This page explains the key concepts behind A/B testing in everyday language. When ClawSplit says “Variant B is the winner with p = 0.003,” here is what that actually means for your prompt.

What does “statistically significant” mean?

When ClawSplit declares a result “statistically significant,” it means the difference between your two prompt variants is very unlikely to be random luck.

Think of it like flipping a coin. If you flip 10 times and get 7 heads, that could easily be luck. But if you flip 10,000 times and get 7,000 heads, something is definitely off about that coin. Statistical significance is the math that tells you when you have flipped enough times to trust the result.

In plain terms: “Statistically significant” means “we ran enough tasks to be confident this difference is real, not a fluke.”
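The coin-flip intuition can be checked directly with exact binomial probabilities. This is a standalone illustration in Python, not how ClawSplit computes anything:

```python
from math import comb

def tail_prob(n, k):
    """Probability of getting at least k heads in n flips of a fair coin."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n

# 7+ heads in 10 flips happens about 17% of the time by pure luck
print(round(tail_prob(10, 7), 3))      # prints 0.172
# 7,000+ heads in 10,000 flips is so unlikely it underflows to zero
print(tail_prob(10_000, 7_000))        # prints 0.0
```

With 10 flips, a 70% heads rate is unremarkable; with 10,000 flips, it is overwhelming evidence the coin is biased. Sample size is what turns the same percentage into proof.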

p-values: how confident are we?

The p-value answers one question: If there were no real difference between the two prompts, how likely would we be to see a result this extreme?

p = 0.50: Coin-flip territory. A difference this size would show up half the time even if the prompts were identical. Do not trust this result.
p = 0.10: Suggestive but not convincing. Results like this come up 10% of the time by chance alone. You might want to collect more data.
p = 0.05: The standard threshold. Only a 5% chance of seeing this by luck if the prompts were equal. Most teams ship on this.
p = 0.01: Very strong evidence. Only a 1% chance of a fluke this size. High-stakes decisions should aim for this.
p = 0.001: Near certain. The difference is almost definitely real. You can ship with full confidence.
For your prompts: When ClawSplit shows p = 0.003 for your experiment, it means a difference this large would appear only 0.3% of the time if the two variants were actually equal. That is strong enough to ship.
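For completion rates, this kind of p-value comes from a two-proportion z-test (the test ClawSplit uses for pass/fail metrics). Here is a minimal sketch with made-up counts, using only the Python standard library:

```python
from math import erf, sqrt

def two_proportion_p_value(wins_a, n_a, wins_b, n_b):
    """Two-sided two-proportion z-test: how likely is a gap this big
    if both prompts share the same true completion rate?"""
    p_a, p_b = wins_a / n_a, wins_b / n_b
    pooled = (wins_a + wins_b) / (n_a + n_b)   # shared rate under "no difference"
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # two-sided p-value from the standard normal distribution
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

# 80% vs 88% completion on 400 tasks each: p is roughly 0.002
print(round(two_proportion_p_value(320, 400, 352, 400), 4))
```

The same 8-point gap on only 50 tasks per variant yields a much larger p-value, which is exactly why sample size matters.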

Confidence intervals: the range of truth

A single number like “89% task completion” is a point estimate. But the true completion rate is somewhere in a range. A confidence interval tells you that range.

Example

Variant B task completion: 89% (95% CI: 84% – 94%)

This means: we are 95% confident the true completion rate for this prompt is somewhere between 84% and 94%. The most likely value is 89%, but it could be a bit higher or lower.

Confidence intervals are especially useful when comparing two variants. If the intervals do not overlap, you have strong evidence of a real difference. If they overlap a lot, you probably need more data.

Reading the overlap
No overlap: Control 78% – 84%, Variant 88% – 94% → Clear winner. Ship the variant.
Some overlap: Control 78% – 88%, Variant 84% – 94% → Probably different but need more data to be sure.
Heavy overlap: Control 80% – 90%, Variant 82% – 92% → Too close to call. Collect more samples or test a bigger change.
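A 95% interval like the ones above can be sketched with the standard normal approximation. The counts here are invented for illustration; ClawSplit's exact method may differ:

```python
from math import sqrt

def completion_ci(wins, n, z=1.96):
    """95% confidence interval for a completion rate
    (normal approximation; z = 1.96 covers 95%)."""
    p = wins / n
    margin = z * sqrt(p * (1 - p) / n)
    return p - margin, p + margin

control = completion_ci(162, 200)   # 81% observed
variant = completion_ci(182, 200)   # 91% observed
print([round(x, 3) for x in control])
print([round(x, 3) for x in variant])
# No overlap between the intervals → strong evidence of a real difference
print(control[1] < variant[0])      # prints True
```

Notice that the margin shrinks with sqrt(n): quadrupling your task count halves the width of the interval.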

Sample sizes: how many tasks do I need?

The most common mistake in A/B testing is stopping too early. If you run 20 tasks and see Variant B “winning,” that is not enough data. You need a minimum sample size to trust the result.

The required sample size depends on how big the difference is that you are trying to detect:

Large (20%+ better): ~100 tasks per variant. Example changes: completely rewritten SOUL.md, different model.
Medium (10% better): ~400 tasks per variant. Example changes: new few-shot examples, restructured instructions.
Small (5% better): ~1,600 tasks per variant. Example changes: minor wording changes, tweaked guardrails.
Tiny (2% better): ~10,000 tasks per variant. Example changes: subtle tone shifts, formatting adjustments.
Rule of thumb: If you are not sure, start with 200 tasks per variant. That is enough to detect a meaningful improvement (15%+) while keeping experiment duration reasonable for most teams.
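Estimates like these come from a standard sample-size formula for comparing two proportions. The sketch below assumes a 70% baseline completion rate, 5% significance, and 80% power; the table's figures likely use different assumptions, so the numbers will not match exactly:

```python
from math import ceil

def tasks_per_variant(p_base, p_new, z_alpha=1.96, z_power=0.8416):
    """Approximate tasks needed per variant to detect a jump from
    p_base to p_new (5% significance, 80% power, normal approximation)."""
    variance = p_base * (1 - p_base) + p_new * (1 - p_new)
    return ceil((z_alpha + z_power) ** 2 * variance / (p_new - p_base) ** 2)

print(tasks_per_variant(0.70, 0.90))   # large effect: fewest tasks
print(tasks_per_variant(0.70, 0.80))   # medium effect
print(tasks_per_variant(0.70, 0.75))   # small effect: far more tasks
```

The key takeaway survives the approximation: halving the effect you want to detect roughly quadruples the tasks you need.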

Why can’t I just look at the numbers?

You ran 50 tasks. Variant A completed 40 (80%). Variant B completed 44 (88%). Variant B is better, right? Not necessarily.

With only 50 tasks, this 8% difference could easily be random variation. If you ran another 50 tasks, Variant A might come out ahead. The statistics exist to prevent you from shipping a change based on noise.

Without statistics
You ship Variant B because 88% > 80%. Two weeks later, performance drops back to 80%. You wasted time on a change that was never real.
With ClawSplit
ClawSplit tells you p = 0.34 — not significant. You keep the experiment running until you have enough data, or you realize the difference is too small to matter.
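You can see how often pure noise produces a "win" by simulating many experiments where both variants are secretly identical. This is a standalone demonstration, not ClawSplit's implementation:

```python
import random

random.seed(7)

def simulate_identical_prompts(true_rate=0.80, n_tasks=50, trials=10_000):
    """Run many pretend experiments where BOTH variants share the same
    true 80% rate, and count how often B 'beats' A by 8+ points anyway."""
    big_gaps = 0
    for _ in range(trials):
        a = sum(random.random() < true_rate for _ in range(n_tasks))
        b = sum(random.random() < true_rate for _ in range(n_tasks))
        if (b - a) / n_tasks >= 0.08:
            big_gaps += 1
    return big_gaps / trials

# Roughly one experiment in five shows an 8-point gap by luck alone
print(simulate_identical_prompts())
```

At 50 tasks per variant, an 8-point gap between two identical prompts is routine, which is why a raw comparison of percentages is not enough.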

Common questions

What if my experiment shows "no significant difference"?
That is a valid and useful result. It means both prompts perform similarly, so you should pick the simpler or cheaper one. It also means the change you tested was not impactful enough — try a bigger change next time.
Can I stop an experiment early if one variant looks way better?
ClawSplit has built-in early stopping for safety — if one variant is performing dramatically worse, it will stop automatically. But for positive results, resist the urge to peek and stop early. Early stopping inflates false positive rates.
What is the difference between "significant" and "meaningful"?
A result can be statistically significant (very likely real) but not practically meaningful (too small to matter). If Variant B is 0.5% better with p = 0.01, the difference is real but probably not worth the complexity. Always check the effect size, not just the p-value.
Do I need to understand the math to use ClawSplit?
No. ClawSplit handles all the calculations automatically. The dashboard shows you a clear verdict: winner, no winner, or needs more data. This page exists so you can understand what is happening under the hood — but you do not need it to run experiments.
What tests does ClawSplit use under the hood?
Two-proportion z-test for completion rates (pass/fail metrics) and Welch's t-test for continuous metrics like cost and latency. Bonferroni correction is applied when multiple metrics are tested to prevent false discoveries.
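The Bonferroni part is simple enough to sketch: with k metrics in one experiment, each p-value is compared against a threshold of 0.05 / k. The metric names and p-values below are hypothetical:

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Bonferroni correction: each metric's p-value must clear
    alpha divided by the number of metrics tested together."""
    threshold = alpha / len(p_values)
    return {name: p < threshold for name, p in p_values.items()}

# Three metrics tested at once → each must beat 0.05 / 3 ≈ 0.0167
print(bonferroni_significant({"completion": 0.003, "cost": 0.030, "latency": 0.200}))
# prints {'completion': True, 'cost': False, 'latency': False}
```

Note that cost at p = 0.030 would pass a lone 0.05 threshold but fails after correction; that is the false-discovery protection doing its job.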

Let the math work for you

ClawSplit handles all the statistics automatically. Join the waitlist and start making data-driven decisions about your agent prompts.