The SOUL.md optimization playbook
A systematic approach to improving your OpenClaw agent through controlled experiments. This playbook covers the full optimization loop: forming hypotheses, designing experiments, choosing metrics, determining sample sizes, and interpreting results.
1. The optimization loop
Every prompt improvement follows the same cycle: observe a problem, form a hypothesis, design an experiment, run it, and act on the results. The key difference between ad-hoc tweaking and systematic optimization is rigor at each step.
Observe: Identify a specific behavior you want to change — high token cost, low task completion, poor tone.
Hypothesize: Form a testable prediction: "Shortening the system prompt by 40% will reduce token cost by 15% without affecting task completion."
Design: Create your variant, choose metrics, and calculate the required sample size.
Run: Execute the experiment, with ClawSplit routing tasks randomly between control and variant.
Analyze: Check statistical significance, look at all metrics (not just the primary one), and decide: ship, iterate, or revert.
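The random routing in the Run step can be sketched as deterministic hash bucketing, so the same task ID always lands in the same arm. ClawSplit's internal routing is not public; the function name and 50/50 split below are assumptions for illustration:

```python
import hashlib

def assign_variant(task_id: str, experiment: str, split: float = 0.5) -> str:
    """Deterministically route a task to 'control' or 'variant'.

    Hashing the task ID together with the experiment name yields a stable,
    roughly uniform bucket in [0, 1) -- no assignment table to store, and
    re-running the experiment reproduces the same routing.
    """
    digest = hashlib.sha256(f"{experiment}:{task_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF
    return "control" if bucket < split else "variant"
```

Because the hash is keyed on the experiment name, the same task can land in different arms of different experiments, which keeps concurrent tests independent.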
2. Choosing what to test
Not all changes are worth testing. Focus experiments on changes that are likely to have measurable impact. Here are the highest-leverage areas for SOUL.md optimization:
Personality and tone
Formal vs. conversational, verbose vs. concise, cautious vs. confident. These affect user satisfaction and task acceptance rates.
Instruction specificity
Vague guidelines vs. explicit step-by-step instructions. More specific often means better task completion but higher token cost.
Guardrail configuration
Strict vs. permissive boundaries. Tighter guardrails reduce harmful outputs but may increase refusal rates on legitimate requests.
Skill ordering and selection
Which skills are available and in what priority. Affects which tools the agent reaches for first.
Few-shot examples
Adding, removing, or changing examples in the prompt. Often the single highest-impact change you can make.
Output format
Structured vs. freeform output. JSON vs. markdown vs. plain text. Affects downstream integration reliability.
3. Metrics that matter
Always measure multiple metrics simultaneously. A change that improves one metric often degrades another. The goal is to find changes that improve your primary metric without unacceptable regressions elsewhere.
Task completion rate
Did the agent successfully complete the assigned task? The most fundamental quality metric.
Token cost per task
Total input + output tokens. Directly maps to API cost, and often differs by 2-3x between verbose and concise prompts.
Latency (time to completion)
End-to-end time from task start to finish. Correlates with token count but also affected by tool use patterns.
Guardrail trigger rate
How often the agent hits safety boundaries. Too high means the prompt is too aggressive. Too low might mean guardrails are too loose.
Tool use efficiency
Number of tool calls per task. Fewer calls usually means the agent understood the task better.
User satisfaction score
If you have human evaluators, their ratings. The ultimate ground truth but expensive to collect.
4. Sample sizes and statistical significance
The most common mistake in prompt testing is stopping too early. Here is how to determine the right sample size for your experiment:
# Minimum tasks per variant (80% power, p < 0.05)
Expected improvement 20% → ~100 tasks per variant
Expected improvement 10% → ~400 tasks per variant
Expected improvement 5% → ~1,600 tasks per variant
Expected improvement 2% → ~10,000 tasks per variant
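The figures above follow from the standard power formula for comparing two proportions. A minimal stdlib-only sketch (the exact numbers depend on your baseline rate; this assumes a two-sided z-test at the stated power and significance level):

```python
import math
from statistics import NormalDist

def tasks_per_variant(p_control: float, p_variant: float,
                      alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-arm sample size for a two-proportion z-test (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # 1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)           # 0.84 for 80% power
    variance = p_control * (1 - p_control) + p_variant * (1 - p_variant)
    delta = abs(p_variant - p_control)
    return math.ceil((z_alpha + z_power) ** 2 * variance / delta ** 2)

# Detecting a lift from 50% to 60% task completion:
print(tasks_per_variant(0.50, 0.60))  # 385 tasks per variant
```

Note how quadratically the requirement grows: halving the effect you want to detect roughly quadruples the tasks needed, which is why 2% improvements demand experiments in the thousands.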
Key concepts:
p-value
The probability of seeing a result this extreme if there is no real difference. We use p < 0.05 as the threshold — less than 5% chance of a false positive.
Statistical power
The probability of detecting a real improvement when one exists. We target 80% power — meaning we will catch a real improvement 80% of the time.
Effect size
How big the improvement is. Larger effects need fewer samples to detect. If you are looking for a 2% improvement, you need a lot more data than for a 20% improvement.
Multiple comparisons
If you test 5 metrics, one will likely be "significant" by chance. ClawSplit applies Bonferroni correction to prevent false discoveries.
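The Bonferroni correction itself is simple: divide the significance threshold by the number of metrics tested. The function below is an illustrative sketch, not ClawSplit's actual implementation:

```python
def bonferroni_significant(p_values: dict[str, float],
                           alpha: float = 0.05) -> dict[str, bool]:
    """Flag each metric as significant only at the corrected threshold alpha / m."""
    corrected = alpha / len(p_values)
    return {metric: p < corrected for metric, p in p_values.items()}

results = bonferroni_significant({
    "task_completion": 0.004,   # survives the corrected 0.05 / 5 = 0.01 bar
    "token_cost": 0.030,        # "significant" alone, but not after correction
    "latency": 0.200,
    "guardrail_rate": 0.600,
    "satisfaction": 0.045,
})
```

With five metrics, a raw p of 0.03 looks significant but fails the corrected 0.01 bar — exactly the false discovery the correction is designed to prevent.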
5. Example SOUL.md experiments
Here are three real experiment patterns that teams commonly run with ClawSplit:
Experiment: Concise vs. verbose system prompt
Hypothesis: Reducing the SOUL.md from 2,000 tokens to 800 tokens will lower cost by 40% without affecting task completion.
Control (A): Current 2,000-token SOUL.md with detailed instructions for each task type.
Variant (B): Condensed 800-token SOUL.md that merges overlapping instructions and removes redundant examples.
Metrics: Primary: task completion rate. Secondary: token cost, latency.
Expected result: Typical outcome — cost drops 35-45%, task completion stays flat or drops 1-2%. Usually a clear win.
Experiment: Guardrail threshold tuning
Hypothesis: Relaxing the content filter threshold from 0.8 to 0.6 will reduce the false refusal rate by 50% without increasing actual policy violations.
Control (A): Strict threshold (0.8) — the agent refuses any request whose safety-classifier score falls below 0.8.
Variant (B): Relaxed threshold (0.6) — the agent only refuses requests scoring below 0.6.
Metrics: Primary: false refusal rate. Secondary: policy violation rate, user satisfaction.
Expected result: Highly dependent on use case. Run with care — monitor the policy violation metric closely.
Experiment: Few-shot examples
Hypothesis: Adding 3 task-specific few-shot examples to the SOUL.md will improve task completion by 15% for that task category.
Control (A): Zero-shot SOUL.md — no examples, just instructions.
Variant (B): Few-shot SOUL.md — same instructions plus 3 input/output examples for the target task type.
Metrics: Primary: task completion rate for the target task type. Secondary: token cost, performance on non-target tasks.
Expected result: Few-shot examples almost always improve the target task. Watch for regression on other task types due to the longer context.
6. Common pitfalls
Peeking at results
Checking results before the experiment reaches target sample size inflates false positive rates. Let the experiment run to completion.
Testing too many changes at once
If Variant B has a new tone, new examples, and new guardrails, you cannot tell which change caused the improvement. Change one thing at a time.
Ignoring secondary metrics
A 20% task completion improvement that comes with 5x token cost is not a win for most teams. Always check the full metric picture.
Selection bias in task routing
If certain task types are routed to specific variants, your results are contaminated. ClawSplit uses random assignment by default — do not override it.
Seasonal effects
If task difficulty varies by time of day or day of week, short experiments may be biased. Run experiments for at least a full business cycle.
7. Building an optimization culture
The biggest gains come not from any single experiment, but from making experimentation a habit. Teams that test every prompt change ship agents that are measurably better every month. Here is how to build that culture:
1. Make "what is the experiment?" the first question in every prompt review.
2. Track a leaderboard of experiment wins and their cumulative impact.
3. Celebrate experiments that show no difference — they save you from shipping noise.
4. Set a team goal: no prompt change ships without an A/B test.
5. Review experiment results weekly. Patterns across experiments reveal systematic opportunities.
Ready to start optimizing?
Join the ClawSplit waitlist and be the first to run scientific experiments on your OpenClaw agents.