Public beta

Test your prompts. Ship the winner.

Run two prompt variants in parallel. Ship the winner with statistical proof.

Join 100+ builders optimizing their prompts
Backed by the OpenClaw ecosystem
Data-driven prompt optimization · Statistical A/B testing · Cost-aware routing

Features

Stop guessing which prompt is better

ClawSplit replaces gut-feel iteration with statistical evidence. Run experiments, measure outcomes, ship the winner.

⚡ Find which prompt actually works
Run two or more prompt versions in parallel with automatic statistical significance testing. Math, not hunches — know which SOUL.md wins before you ship it.
🛡️ Know before you ship
Benchmark any ClawHub skill against a test suite measuring accuracy, cost, and latency. Catch regressions before your users do.
💰 Stop overspending on tokens
Route simple tasks to cheaper models like Haiku or Flash automatically. Cut your token bill 20–40% without sacrificing quality where it counts. A sketch of such a routing rule appears right after this feature list.
🏆 See the winner at a glance
Confidence intervals, winner declarations, and cost breakdowns in one dashboard. Share results with your team in a single click.
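
Curious what a routing rule might look like? ClawSplit's actual rule format isn't shown on this page, so the Python sketch below is only an illustration; the model names, keyword markers, and length threshold are all placeholder assumptions:

    # Hypothetical sketch of a complexity-based routing rule. Model names,
    # markers, and the length threshold are placeholder assumptions, not
    # ClawSplit's real rule format.
    def pick_model(task: str) -> str:
        complex_markers = ("analyze", "refactor", "plan", "multi-step")
        is_complex = len(task) > 500 or any(m in task.lower() for m in complex_markers)
        # Cheap model for simple tasks, stronger model where quality counts.
        return "claude-sonnet" if is_complex else "claude-haiku"

    assert pick_model("What's our refund policy?") == "claude-haiku"
    assert pick_model("Plan a three-phase data migration.") == "claude-sonnet"

The useful shift is that the rule becomes an artifact you can version and A/B test, rather than a habit.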

How it works

From hypothesis to winner in 4 steps

No statistics degree required. ClawSplit handles the math.

01

Create an experiment

Pick which config to test — SOUL.md, a skill, or model routing rules. Upload your variants.

02

Define your test suite

Add the tasks your agent should handle. ClawSplit runs each variant against the same inputs for a fair comparison.
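
ClawSplit's configuration schema isn't documented on this page, so treat the sketch below as purely illustrative; every field name, path, and value is a hypothetical stand-in for whatever the real experiment definition looks like:

    # Illustrative experiment definition. All field names, paths, and values
    # are hypothetical stand-ins, not ClawSplit's actual schema.
    experiment = {
        "name": "support-agent-soul-test",
        "target": "SOUL.md",                      # config under test
        "variants": {
            "A": "souls/SOUL.baseline.md",
            "B": "souls/SOUL.concise.md",
        },
        # Each variant runs against the same inputs for a fair comparison.
        "test_suite": [
            {"task": "Refund a duplicate charge", "expect": "refund issued"},
            {"task": "Reset a locked account",    "expect": "reset link sent"},
        ],
        "metrics": ["completion_rate", "cost", "latency"],
        "confidence": 0.95,
    }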

03

Let it run

ClawSplit executes both variants in parallel, tracking completion rate, cost, latency, and quality scores.
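
Conceptually this step is a fan-out: the same task list goes to every variant at once, and each run records its own metrics. Here is a minimal Python sketch of that shape; the agent invocation is stubbed out, since the real call depends on your setup:

    import asyncio
    import time

    async def run_variant(variant: str, task: str) -> dict:
        # Stand-in for a real agent invocation; replace with your own call.
        start = time.perf_counter()
        await asyncio.sleep(0)  # the actual LLM call would await here
        return {
            "variant": variant,
            "task": task,
            "completed": True,          # did the agent finish the task?
            "cost_usd": 0.0,            # token spend for this run
            "latency_s": time.perf_counter() - start,
        }

    async def run_experiment(variants: list[str], tasks: list[str]) -> list[dict]:
        # Fan out: every variant sees the identical task list, concurrently.
        jobs = [run_variant(v, t) for v in variants for t in tasks]
        return await asyncio.gather(*jobs)

    results = asyncio.run(run_experiment(["A", "B"], ["task 1", "task 2"]))
    print(f"{len(results)} runs recorded")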

04

Ship the winner

Review the results dashboard, see statistical significance, and promote the winning variant to production with one click.

Compare

What changes with ClawSplit

                    Without ClawSplit                With ClawSplit
Prompt iteration    Change, hope, repeat             Hypothesis → experiment → proof
Cost optimization   Guess which model is cheapest    Auto-route by task complexity
Skill testing       Manual spot checks               Automated benchmark suites
Confidence          "I think it got better"          p < 0.01 significance
Time to decision    Days of manual review            Hours with automated analysis
Team alignment      Opinions and debates             Shared data dashboard

Trusted by teams

What builders are saying

We were iterating on our support agent SOUL.md by feel for weeks. One experiment showed Variant B was 30% cheaper with identical completion rates.

David, Head of AI

The cost optimizer cut our token spend by routing simple queries to cheaper models. Took five minutes to set up.

Elena, Engineering Lead

Prompt engineering finally feels like engineering. Hypothesize, test, measure, ship. The experiment loop is exactly what was missing.

Kai, Prompt Engineer

FAQ

Common questions

  • Do I need to change my agent setup to use ClawSplit? No. ClawSplit works with your existing SOUL.md and skills. Upload two variants, point it at your agent, and start an experiment. Zero config changes to your running agent.
  • How does the statistical testing work? ClawSplit uses standard hypothesis testing: a two-proportion z-test for completion rates and Welch's t-test for cost and latency. You set your confidence threshold and ClawSplit tells you when a winner is clear. A worked example of both tests follows this FAQ.
  • What metrics are tracked? Task completion rate, token cost, latency, and custom quality scores. Pro users also get model-level breakdowns and cost optimizer analytics.
  • Can I A/B test model routing rules? Yes. The cost optimizer lets you define routing rules (e.g., send simple tasks to Haiku) and A/B test those rules against your current setup.
  • Is there a free plan? Yes. The Starter plan is free forever: 2 concurrent experiments, basic metrics, and 7-day history. No credit card required.
  • Does ClawSplit support local models? Yes. ClawSplit works with any model your OpenClaw agent can talk to, including local models via Ollama, LM Studio, and other self-hosted inference servers. Point your variants at different local models, or compare a local model against an API-hosted one. All metrics (completion rate, latency, cost) work the same way.
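
The worked example promised above: both tests named in the FAQ are textbook statistics and easy to reproduce yourself. The completion counts and per-run costs below are invented for illustration; the z-test is hand-rolled and the Welch's t-test uses SciPy:

    import math
    from scipy import stats

    def two_prop_ztest(x1: int, n1: int, x2: int, n2: int) -> tuple[float, float]:
        """Two-sided two-proportion z-test, e.g. for task completion rates."""
        p1, p2 = x1 / n1, x2 / n2
        pooled = (x1 + x2) / (n1 + n2)
        se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
        z = (p1 - p2) / se
        p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
        return z, p_value

    # Invented counts: variant A completed 180/200 tasks, variant B 192/200.
    z, p = two_prop_ztest(180, 200, 192, 200)
    print(f"completion: z = {z:.2f}, p = {p:.4f}")

    # Welch's t-test on per-run cost (no equal-variance assumption);
    # the per-run dollar figures are invented for illustration.
    cost_a = [0.042, 0.051, 0.038, 0.047, 0.044]
    cost_b = [0.031, 0.029, 0.035, 0.028, 0.033]
    t, p = stats.ttest_ind(cost_a, cost_b, equal_var=False)
    print(f"cost: t = {t:.2f}, p = {p:.4f}")

With these invented counts, the completion-rate difference comes out significant at the 5% level (p ≈ 0.02); your own data decides the real answer.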

Run your first experiment in 30 seconds

No signup needed. See real results with live LLM calls.

Try it free →