The SOUL.md optimization playbook
A systematic approach to improving your OpenClaw agent through controlled experiments. This playbook covers the full optimization loop: forming hypotheses, designing experiments, choosing metrics, determining sample sizes, and interpreting results.
1. The optimization loop
Every prompt improvement follows the same cycle: observe a problem, form a hypothesis, design an experiment, run it, and act on the results. The key difference between ad-hoc tweaking and systematic optimization is rigor at each step.
Observe: Identify a specific behavior you want to change — high token cost, low task completion, poor tone.
Hypothesize: Form a testable prediction: "Shortening the system prompt by 40% will reduce token cost by 15% without affecting task completion."
Design: Create your variant, choose metrics, and calculate the required sample size.
Run: Execute the experiment, with ClawSplit routing tasks randomly between control and variant.
Analyze: Check statistical significance, look at all metrics (not just the primary one), and decide: ship, iterate, or revert.
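The random routing in the Run step can be sketched as deterministic hash bucketing, so the same task ID always lands in the same arm. ClawSplit's internal routing is not public; the function name and 50/50 split below are assumptions for illustration:

```python
import hashlib

def assign_variant(task_id: str, experiment: str, split: float = 0.5) -> str:
    """Deterministically route a task to 'control' or 'variant'.

    Hashing the task ID together with the experiment name yields a stable,
    roughly uniform bucket in [0, 1) -- no assignment table to store, and
    re-running the experiment reproduces the same routing.
    """
    digest = hashlib.sha256(f"{experiment}:{task_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF
    return "control" if bucket < split else "variant"
```

Because the hash is keyed on the experiment name, the same task can land in different arms of different experiments, which keeps concurrent tests independent.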
2. Choosing what to test
Not all changes are worth testing. Focus experiments on changes that are likely to have measurable impact. Here are the highest-leverage areas for SOUL.md optimization:
Personality and tone
Formal vs. conversational, verbose vs. concise, cautious vs. confident. These affect user satisfaction and task acceptance rates.
Instruction specificity
Vague guidelines vs. explicit step-by-step instructions. More specific often means better task completion but higher token cost.
Guardrail configuration
Strict vs. permissive boundaries. Tighter guardrails reduce harmful outputs but may increase refusal rates on legitimate requests.
Skill ordering and selection
Which skills are available and in what priority. Affects which tools the agent reaches for first.
Few-shot examples
Adding, removing, or changing examples in the prompt. Often the single highest-impact change you can make.
Output format
Structured vs. freeform output. JSON vs. markdown vs. plain text. Affects downstream integration reliability.
3. Metrics that matter
Always measure multiple metrics simultaneously. A change that improves one metric often degrades another. The goal is to find changes that improve your primary metric without unacceptable regressions elsewhere.
Task completion rate
Did the agent successfully complete the assigned task? The most fundamental quality metric.
Token cost per task
Total input + output tokens. Directly maps to API cost, and often differs by 2-3x between verbose and concise prompts.
Latency (time to completion)
End-to-end time from task start to finish. Correlates with token count but also affected by tool use patterns.
Guardrail trigger rate
How often the agent hits safety boundaries. Too high means the prompt is too aggressive. Too low might mean guardrails are too loose.
Tool use efficiency
Number of tool calls per task. Fewer calls usually means the agent understood the task better.
User satisfaction score
If you have human evaluators, their ratings. The ultimate ground truth but expensive to collect.
4. Sample sizes and statistical significance
The most common mistake in prompt testing is stopping too early. Here is how to determine the right sample size for your experiment:
# Minimum tasks per variant (80% power, p < 0.05)
Expected improvement 20% → ~100 tasks per variant
Expected improvement 10% → ~400 tasks per variant
Expected improvement 5% → ~1,600 tasks per variant
Expected improvement 2% → ~10,000 tasks per variant
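The figures above follow from the standard power formula for comparing two proportions. A minimal stdlib-only sketch (the exact numbers depend on your baseline rate; this assumes a two-sided z-test at the stated power and significance level):

```python
import math
from statistics import NormalDist

def tasks_per_variant(p_control: float, p_variant: float,
                      alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-arm sample size for a two-proportion z-test (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # 1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)           # 0.84 for 80% power
    variance = p_control * (1 - p_control) + p_variant * (1 - p_variant)
    delta = abs(p_variant - p_control)
    return math.ceil((z_alpha + z_power) ** 2 * variance / delta ** 2)

# Detecting a lift from 50% to 60% task completion:
print(tasks_per_variant(0.50, 0.60))  # 385 tasks per variant
```

Note how quadratically the requirement grows: halving the effect you want to detect roughly quadruples the tasks needed, which is why 2% improvements demand experiments in the thousands.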
Key concepts:
p-value
The probability of seeing a result this extreme if there is no real difference. We use p < 0.05 as the threshold — less than 5% chance of a false positive.
Statistical power
The probability of detecting a real improvement when one exists. We target 80% power — meaning we will catch a real improvement 80% of the time.
Effect size
How big the improvement is. Larger effects need fewer samples to detect. If you are looking for a 2% improvement, you need a lot more data than for a 20% improvement.
Multiple comparisons
If you test 5 metrics, one will likely be "significant" by chance. ClawSplit applies Bonferroni correction to prevent false discoveries.
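The Bonferroni correction itself is simple: divide the significance threshold by the number of metrics tested. The function below is an illustrative sketch, not ClawSplit's actual implementation:

```python
def bonferroni_significant(p_values: dict[str, float],
                           alpha: float = 0.05) -> dict[str, bool]:
    """Flag each metric as significant only at the corrected threshold alpha / m."""
    corrected = alpha / len(p_values)
    return {metric: p < corrected for metric, p in p_values.items()}

results = bonferroni_significant({
    "task_completion": 0.004,   # survives the corrected 0.05 / 5 = 0.01 bar
    "token_cost": 0.030,        # "significant" alone, but not after correction
    "latency": 0.200,
    "guardrail_rate": 0.600,
    "satisfaction": 0.045,
})
```

With five metrics, a raw p of 0.03 looks significant but fails the corrected 0.01 bar — exactly the false discovery the correction is designed to prevent.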
5. Example SOUL.md experiments
Here are three real experiment patterns that teams commonly run with ClawSplit:
Experiment: Concise vs. verbose system prompt
Hypothesis: Reducing the SOUL.md from 2,000 tokens to 800 tokens will lower cost by 40% without affecting task completion.
Control (A): Current 2,000-token SOUL.md with detailed instructions for each task type.
Variant (B): Condensed 800-token SOUL.md that merges overlapping instructions and removes redundant examples.
Metrics: Primary: task completion rate. Secondary: token cost, latency.
Expected result: Typical outcome — cost drops 35-45%, task completion stays flat or drops 1-2%. Usually a clear win.
Experiment: Guardrail threshold tuning
Hypothesis: Relaxing the content filter threshold from 0.8 to 0.6 will reduce the false refusal rate by 50% without increasing actual policy violations.
Control (A): Strict threshold (0.8) — the agent refuses any request whose safety-classifier score falls below 0.8.
Variant (B): Relaxed threshold (0.6) — the agent only refuses requests scoring below 0.6.
Metrics: Primary: false refusal rate. Secondary: policy violation rate, user satisfaction.
Expected result: Highly dependent on use case. Run with care — monitor the policy violation metric closely.
Experiment: Few-shot examples
Hypothesis: Adding 3 task-specific few-shot examples to the SOUL.md will improve task completion by 15% for that task category.
Control (A): Zero-shot SOUL.md — no examples, just instructions.
Variant (B): Few-shot SOUL.md — same instructions plus 3 input/output examples for the target task type.
Metrics: Primary: task completion rate for the target task type. Secondary: token cost, performance on non-target tasks.
Expected result: Few-shot examples almost always improve the target task. Watch for regression on other task types due to the longer context.
6. Common pitfalls
Peeking at results
Checking results before the experiment reaches target sample size inflates false positive rates. Let the experiment run to completion.
Testing too many changes at once
If Variant B has a new tone, new examples, and new guardrails, you cannot tell which change caused the improvement. Change one thing at a time.
Ignoring secondary metrics
A 20% task completion improvement that comes with 5x token cost is not a win for most teams. Always check the full metric picture.
Selection bias in task routing
If certain task types are routed to specific variants, your results are contaminated. ClawSplit uses random assignment by default — do not override it.
Seasonal effects
If task difficulty varies by time of day or day of week, short experiments may be biased. Run experiments for at least a full business cycle.
7. Building an optimization culture
The biggest gains come not from any single experiment, but from making experimentation a habit. Teams that test every prompt change ship agents that are measurably better every month. Here is how to build that culture:
1. Make "what is the experiment?" the first question in every prompt review.
2. Track a leaderboard of experiment wins and their cumulative impact.
3. Celebrate experiments that show no difference — they save you from shipping noise.
4. Set a team goal: no prompt change ships without an A/B test.
5. Review experiment results weekly. Patterns across experiments reveal systematic opportunities.
Ready to start optimizing?
Join the ClawSplit waitlist and be the first to run scientific experiments on your OpenClaw agents.