Public beta

Test your prompts. Ship the winner.

Run two prompt variants in parallel. Ship the winner with statistical proof.

Join 100+ builders optimizing their prompts
Backed by the OpenClaw ecosystem
Data-driven prompt optimization · Statistical A/B testing · Cost-aware routing

Features

Stop guessing which prompt is better

ClawSplit replaces gut-feel iteration with statistical evidence. Run experiments, measure outcomes, ship the winner.

⚡ Find which prompt actually works
Run two or more prompt versions in parallel with automatic statistical significance testing. Math, not hunches — know which SOUL.md wins before you ship it.
🛡️ Know before you ship
Benchmark any ClawHub skill against a test suite measuring accuracy, cost, and latency. Catch regressions before your users do.
💰 Stop overspending on tokens
Route simple tasks to cheaper models like Haiku or Flash automatically. Cut your token bill 20–40% without sacrificing quality where it counts. A sketch of such a routing rule appears right after this feature list.
🏆 See the winner at a glance
Confidence intervals, winner declarations, and cost breakdowns in one dashboard. Share results with your team in a single click.
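
Curious what a routing rule might look like? ClawSplit's actual rule format isn't shown on this page, so the Python sketch below is only an illustration; the model names, keyword markers, and length threshold are all placeholder assumptions:

    # Hypothetical sketch of a complexity-based routing rule. Model names,
    # markers, and the length threshold are placeholder assumptions, not
    # ClawSplit's real rule format.
    def pick_model(task: str) -> str:
        complex_markers = ("analyze", "refactor", "plan", "multi-step")
        is_complex = len(task) > 500 or any(m in task.lower() for m in complex_markers)
        # Cheap model for simple tasks, stronger model where quality counts.
        return "claude-sonnet" if is_complex else "claude-haiku"

    assert pick_model("What's our refund policy?") == "claude-haiku"
    assert pick_model("Plan a three-phase data migration.") == "claude-sonnet"

The useful shift is that the rule becomes an artifact you can version and A/B test, rather than a habit.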

How it works

From hypothesis to winner in 4 steps

No statistics degree required. ClawSplit handles the math.

01

Create an experiment

Pick which config to test — SOUL.md, a skill, or model routing rules. Upload your variants.

02

Define your test suite

Add the tasks your agent should handle. ClawSplit runs each variant against the same inputs for a fair comparison.
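
ClawSplit's configuration schema isn't documented on this page, so treat the sketch below as purely illustrative; every field name, path, and value is a hypothetical stand-in for whatever the real experiment definition looks like:

    # Illustrative experiment definition. All field names, paths, and values
    # are hypothetical stand-ins, not ClawSplit's actual schema.
    experiment = {
        "name": "support-agent-soul-test",
        "target": "SOUL.md",                      # config under test
        "variants": {
            "A": "souls/SOUL.baseline.md",
            "B": "souls/SOUL.concise.md",
        },
        # Each variant runs against the same inputs for a fair comparison.
        "test_suite": [
            {"task": "Refund a duplicate charge", "expect": "refund issued"},
            {"task": "Reset a locked account",    "expect": "reset link sent"},
        ],
        "metrics": ["completion_rate", "cost", "latency"],
        "confidence": 0.95,
    }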

03

Let it run

ClawSplit executes both variants in parallel, tracking completion rate, cost, latency, and quality scores.
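
Conceptually this step is a fan-out: the same task list goes to every variant at once, and each run records its own metrics. Here is a minimal Python sketch of that shape; the agent invocation is stubbed out, since the real call depends on your setup:

    import asyncio
    import time

    async def run_variant(variant: str, task: str) -> dict:
        # Stand-in for a real agent invocation; replace with your own call.
        start = time.perf_counter()
        await asyncio.sleep(0)  # the actual LLM call would await here
        return {
            "variant": variant,
            "task": task,
            "completed": True,          # did the agent finish the task?
            "cost_usd": 0.0,            # token spend for this run
            "latency_s": time.perf_counter() - start,
        }

    async def run_experiment(variants: list[str], tasks: list[str]) -> list[dict]:
        # Fan out: every variant sees the identical task list, concurrently.
        jobs = [run_variant(v, t) for v in variants for t in tasks]
        return await asyncio.gather(*jobs)

    results = asyncio.run(run_experiment(["A", "B"], ["task 1", "task 2"]))
    print(f"{len(results)} runs recorded")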

04

Ship the winner

Review the results dashboard, see statistical significance, and promote the winning variant to production with one click.

Compare

What changes with ClawSplit

                    Without ClawSplit                With ClawSplit
Prompt iteration    Change, hope, repeat             Hypothesis → experiment → proof
Cost optimization   Guess which model is cheapest    Auto-route by task complexity
Skill testing       Manual spot checks               Automated benchmark suites
Confidence          "I think it got better"          p < 0.01 significance
Time to decision    Days of manual review            Hours with automated analysis
Team alignment      Opinions and debates             Shared data dashboard

Trusted by teams

What builders are saying

We were iterating on our support agent SOUL.md by feel for weeks. One experiment showed Variant B was 30% cheaper with identical completion rates.

David, Head of AI

The cost optimizer cut our token spend by routing simple queries to cheaper models. Took five minutes to set up.

Elena, Engineering Lead

Prompt engineering finally feels like engineering. Hypothesize, test, measure, ship. The experiment loop is exactly what was missing.

Kai, Prompt Engineer

FAQ

Common questions

  • Do I need to change my agent setup to use ClawSplit? No. ClawSplit works with your existing SOUL.md and skills. Upload two variants, point it at your agent, and start an experiment. Zero config changes to your running agent.
  • How does the statistical testing work? ClawSplit uses standard hypothesis testing: a two-proportion z-test for completion rates and Welch's t-test for cost and latency. You set your confidence threshold and ClawSplit tells you when a winner is clear. A worked example of both tests follows this FAQ.
  • What metrics are tracked? Task completion rate, token cost, latency, and custom quality scores. Pro users also get model-level breakdowns and cost optimizer analytics.
  • Can I A/B test model routing rules? Yes. The cost optimizer lets you define routing rules (e.g., send simple tasks to Haiku) and A/B test those rules against your current setup.
  • Is there a free plan? Yes. The Starter plan is free forever: 2 concurrent experiments, basic metrics, and 7-day history. No credit card required.
  • Does ClawSplit support local models? Yes. ClawSplit works with any model your OpenClaw agent can talk to, including local models via Ollama, LM Studio, and other self-hosted inference servers. Point your variants at different local models, or compare a local model against an API-hosted one. All metrics (completion rate, latency, cost) work the same way.
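
The worked example promised above: both tests named in the FAQ are textbook statistics and easy to reproduce yourself. The completion counts and per-run costs below are invented for illustration; the z-test is hand-rolled and the Welch's t-test uses SciPy:

    import math
    from scipy import stats

    def two_prop_ztest(x1: int, n1: int, x2: int, n2: int) -> tuple[float, float]:
        """Two-sided two-proportion z-test, e.g. for task completion rates."""
        p1, p2 = x1 / n1, x2 / n2
        pooled = (x1 + x2) / (n1 + n2)
        se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
        z = (p1 - p2) / se
        p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
        return z, p_value

    # Invented counts: variant A completed 180/200 tasks, variant B 192/200.
    z, p = two_prop_ztest(180, 200, 192, 200)
    print(f"completion: z = {z:.2f}, p = {p:.4f}")

    # Welch's t-test on per-run cost (no equal-variance assumption);
    # the per-run dollar figures are invented for illustration.
    cost_a = [0.042, 0.051, 0.038, 0.047, 0.044]
    cost_b = [0.031, 0.029, 0.035, 0.028, 0.033]
    t, p = stats.ttest_ind(cost_a, cost_b, equal_var=False)
    print(f"cost: t = {t:.2f}, p = {p:.4f}")

With these invented counts, the completion-rate difference comes out significant at the 5% level (p ≈ 0.02); your own data decides the real answer.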

Run your first experiment in 30 seconds

No signup needed. See real results with live LLM calls.

Try it free →