Quickstart
Your first experiment in 5 minutes
This tutorial walks you through creating an A/B test that compares two SOUL.md versions for a support agent. By the end, you will know how to create experiments, define variants, set success criteria, and read results.
Before you start
- A ClawSplit account (free Starter plan works)
- An API key from Settings → API Keys
- An existing OpenClaw agent with a SOUL.md you want to improve
1
Create your first experiment
An experiment is a controlled comparison between two or more prompt configurations. You give it a name, define the variants, and tell ClawSplit what to measure.
From the dashboard, click New Experiment, or use the API:
curl -X POST https://api.clawsplit.com/v1/experiments \
-H "Authorization: Bearer $CLAWSPLIT_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"name": "support-concise-vs-verbose",
"variants": [
{ "name": "control", "soul_md_url": "./soul-control.md" },
{ "name": "concise", "soul_md_url": "./soul-concise.md" }
],
"sample_size": 200,
"primary_metric": "task_completion",
"confidence": 0.95
}'
ClawSplit creates the experiment and returns an ID (exp_abc123). The experiment starts in pending status until you start sending traffic.
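A sketch of what the creation response might look like, based on the fields mentioned above (`id` and `status`); the exact shape may differ, so check the API reference:

```json
{
  "id": "exp_abc123",
  "name": "support-concise-vs-verbose",
  "status": "pending",
  "variants": ["control", "concise"],
  "sample_size": 200
}
```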
2
Define your variants
Each variant is a different SOUL.md version. For this tutorial, we are testing whether a concise prompt performs as well as a verbose one, with the hypothesis that it will cut token cost significantly.
Variant A (control) — your current prompt
# SOUL.md — Control (Variant A)
You are a customer support agent for Acme Corp.
Be thorough and professional. Always greet the customer,
ask clarifying questions, and provide step-by-step solutions.
Include relevant documentation links when available.
End every conversation by asking if there's anything else
you can help with.
Variant B — the challenger
# SOUL.md — Variant B (Concise)
You are Acme Corp support. Be direct and helpful.
Answer the question, provide the fix, link to docs if relevant.
Skip pleasantries unless the customer initiates them.
Tip: Change only one thing per experiment. If Variant B has a new tone and different instructions and fewer examples, you will not know which change caused the result.
3
Set sample size and success criteria
Before you start, decide two things: how many tasks each variant needs to handle, and what metric determines the winner.
Sample size
We recommend at least 100 tasks per variant for detecting large effects (20%+ improvement) and 400+ for smaller effects. Our example uses 200 total (100 per variant).
Primary metric
The single metric that determines the winner. Common choices: task_completion, avg_token_cost, or avg_latency. Pick the one that matters most for this experiment.
Confidence level
How sure you want to be. 0.95 (95%) is standard: if there were truly no difference between the variants, there is less than a 5% chance of seeing a result this strong by chance.
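The confidence level maps to a p-value threshold: a result counts as significant when p_value < 1 − confidence. A sketch of that check, not ClawSplit internals:

```shell
# Significance check: p_value must beat alpha = 1 - confidence.
# Values here are illustrative.
awk 'BEGIN {
  p_value = 0.008
  confidence = 0.95
  alpha = 1 - confidence           # 0.05
  printf "%s\n", (p_value < alpha ? "significant" : "not significant")
}'
# prints: significant
```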
ClawSplit randomly assigns each incoming task to a variant. You do not need to manage the split manually — it handles randomization and ensures balanced allocation.
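You never implement this yourself, but conceptually the split behaves like deterministic bucketing on a task id, so the same task always lands in the same variant. An illustrative sketch, not ClawSplit's actual implementation:

```shell
# Illustrative only: ClawSplit handles assignment server-side.
# Deterministic 50/50 bucketing of a task id across two variants.
assign_variant() {
  task_id="$1"
  # Hash the id, then take it mod 2 to pick a bucket.
  bucket=$(( $(printf '%s' "$task_id" | cksum | cut -d' ' -f1) % 2 ))
  if [ "$bucket" -eq 0 ]; then echo "control"; else echo "concise"; fi
}

assign_variant "task-42"   # same id -> same variant, every time
```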
4
Run and interpret results
Once your experiment reaches the target sample size, ClawSplit runs the statistical analysis automatically. You will see one of three outcomes:
Winner declared
One variant is statistically significantly better on the primary metric. Ship it with confidence.
No significant difference
Both variants performed similarly. Pick the cheaper or simpler one, or design a new experiment with a bigger change.
Stopped early
ClawSplit detected one variant performing significantly worse and stopped the experiment to protect your users.
Example results payload
{
"status": "significant",
"winner": "concise",
"p_value": 0.008,
"sample_size": { "control": 102, "concise": 98 },
"metrics": {
"task_completion": {
"control": 0.84,
"concise": 0.89,
"significant": false
},
"avg_token_cost": {
"control": 0.0041,
"concise": 0.0022,
"significant": true
},
"avg_latency_ms": {
"control": 3200,
"concise": 1800,
"significant": true
}
}
}
In this example, the concise variant won on cost (46% cheaper) and latency (44% faster), while task completion improved slightly (89% vs 84%) but did not reach significance on its own. The overall verdict: ship the concise version — it is cheaper, faster, and at least as effective.
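The headline percentages come straight from the payload; as a sanity check, you can recompute them yourself:

```shell
# Recompute the cost and latency deltas from the results payload above.
awk 'BEGIN {
  cost_control = 0.0041; cost_concise = 0.0022
  lat_control  = 3200;   lat_concise  = 1800
  printf "cost:    %.0f%% cheaper\n", (cost_control - cost_concise) / cost_control * 100
  printf "latency: %.0f%% faster\n",  (lat_control  - lat_concise)  / lat_control  * 100
}'
# prints:
# cost:    46% cheaper
# latency: 44% faster
```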
What to do next
Ready to run your first experiment?
Join the ClawSplit waitlist and start optimizing your agents with data, not guesswork.