Quickstart
Your first experiment in 5 minutes
This tutorial walks you through creating an A/B test that compares two SOUL.md versions for a support agent. By the end, you will know how to create experiments, define variants, set success criteria, and read results.
Before you start
- A ClawSplit account (free Starter plan works)
- An API key from Settings → API Keys
- An existing OpenClaw agent with a SOUL.md you want to improve
1
Create your first experiment
An experiment is a controlled comparison between two or more prompt configurations. You give it a name, define the variants, and tell ClawSplit what to measure.
From the dashboard, click New Experiment, or use the API:
curl -X POST https://api.clawsplit.com/v1/experiments \
-H "Authorization: Bearer $CLAWSPLIT_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"name": "support-concise-vs-verbose",
"variants": [
{ "name": "control", "soul_md_url": "./soul-control.md" },
{ "name": "concise", "soul_md_url": "./soul-concise.md" }
],
"sample_size": 200,
"primary_metric": "task_completion",
"confidence": 0.95
}'
ClawSplit creates the experiment and returns an ID (exp_abc123). The experiment starts in pending status until you start sending traffic.
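A sketch of what the creation response might look like, based on the fields mentioned above (`id` and `status`); the exact shape may differ, so check the API reference:

```json
{
  "id": "exp_abc123",
  "name": "support-concise-vs-verbose",
  "status": "pending",
  "variants": ["control", "concise"],
  "sample_size": 200
}
```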
2
Define your variants
Each variant is a different SOUL.md version. For this tutorial, we are testing whether a concise prompt performs as well as a verbose one, with the hypothesis that it will cut token cost significantly.
Variant A (control) — your current prompt
# SOUL.md — Control (Variant A)
You are a customer support agent for Acme Corp.
Be thorough and professional. Always greet the customer,
ask clarifying questions, and provide step-by-step solutions.
Include relevant documentation links when available.
End every conversation by asking if there's anything else
you can help with.
Variant B — the challenger
# SOUL.md — Variant B (Concise)
You are Acme Corp support. Be direct and helpful.
Answer the question, provide the fix, link to docs if relevant.
Skip pleasantries unless the customer initiates them.
Tip: Change only one thing per experiment. If Variant B has a new tone and different instructions and fewer examples, you will not know which change caused the result.
3
Set sample size and success criteria
Before you start, decide two things: how many tasks each variant needs to handle, and what metric determines the winner.
Sample size
We recommend at least 100 tasks per variant for detecting large effects (20%+ improvement) and 400+ for smaller effects. Our example uses 200 total (100 per variant).
Primary metric
The single metric that determines the winner. Common choices: task_completion, avg_token_cost, or avg_latency. Pick the one that matters most for this experiment.
Confidence level
How sure you want to be. 0.95 (95%) is standard: if there were truly no difference between the variants, there is less than a 5% chance of seeing a result this strong by chance.
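The confidence level maps to a p-value threshold: a result counts as significant when p_value < 1 − confidence. A sketch of that check, not ClawSplit internals:

```shell
# Significance check: p_value must beat alpha = 1 - confidence.
# Values here are illustrative.
awk 'BEGIN {
  p_value = 0.008
  confidence = 0.95
  alpha = 1 - confidence           # 0.05
  printf "%s\n", (p_value < alpha ? "significant" : "not significant")
}'
# prints: significant
```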
ClawSplit randomly assigns each incoming task to a variant. You do not need to manage the split manually — it handles randomization and ensures balanced allocation.
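You never implement this yourself, but conceptually the split behaves like deterministic bucketing on a task id, so the same task always lands in the same variant. An illustrative sketch, not ClawSplit's actual implementation:

```shell
# Illustrative only: ClawSplit handles assignment server-side.
# Deterministic 50/50 bucketing of a task id across two variants.
assign_variant() {
  task_id="$1"
  # Hash the id, then take it mod 2 to pick a bucket.
  bucket=$(( $(printf '%s' "$task_id" | cksum | cut -d' ' -f1) % 2 ))
  if [ "$bucket" -eq 0 ]; then echo "control"; else echo "concise"; fi
}

assign_variant "task-42"   # same id -> same variant, every time
```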
4
Run and interpret results
Once your experiment reaches the target sample size, ClawSplit runs the statistical analysis automatically. You will see one of three outcomes:
Winner declared
One variant is statistically significantly better on the primary metric. Ship it with confidence.
No significant difference
Both variants performed similarly. Pick the cheaper or simpler one, or design a new experiment with a bigger change.
Stopped early
ClawSplit detected one variant performing significantly worse and stopped the experiment to protect your users.
Example results payload
{
"status": "significant",
"winner": "concise",
"p_value": 0.008,
"sample_size": { "control": 102, "concise": 98 },
"metrics": {
"task_completion": {
"control": 0.84,
"concise": 0.89,
"significant": false
},
"avg_token_cost": {
"control": 0.0041,
"concise": 0.0022,
"significant": true
},
"avg_latency_ms": {
"control": 3200,
"concise": 1800,
"significant": true
}
}
}
In this example, the concise variant won on cost (46% cheaper) and latency (44% faster), while task completion improved slightly (89% vs 84%) but did not reach significance on its own. The overall verdict: ship the concise version — it is cheaper, faster, and at least as effective.
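The headline percentages come straight from the payload; as a sanity check, you can recompute them yourself:

```shell
# Recompute the cost and latency deltas from the results payload above.
awk 'BEGIN {
  cost_control = 0.0041; cost_concise = 0.0022
  lat_control  = 3200;   lat_concise  = 1800
  printf "cost:    %.0f%% cheaper\n", (cost_control - cost_concise) / cost_control * 100
  printf "latency: %.0f%% faster\n",  (lat_control  - lat_concise)  / lat_control  * 100
}'
# prints:
# cost:    46% cheaper
# latency: 44% faster
```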
What to do next
Ready to run your first experiment?
Join the ClawSplit waitlist and start optimizing your agents with data, not guesswork.