SOUL.md Best Practices: Lessons From 1,000 Agent Deployments
By ClawSplit Team
SOUL.md is a simple concept: write instructions for your AI agent in Markdown. But the gap between a mediocre SOUL.md and an excellent one is enormous. Through ClawSplit's A/B testing platform, we've analyzed performance data from over 1,000 production OpenClaw agents. Here are the patterns that separate top performers from the rest.
Keep It Under 500 Lines
The single strongest predictor of agent performance is SOUL.md length, and shorter is better. Agents with SOUL.md files under 500 lines outperform those over 1,000 lines on every metric we track: task completion, consistency, and user satisfaction. The reason is context window efficiency. Every line of SOUL.md competes with conversation context for the model's attention. A bloated SOUL.md pushes out the user's actual message.
The fix is ruthless editing. Every line should earn its place. If a rule has never been triggered, or a behavior instruction doesn't measurably affect output quality, cut it. ClawSplit can identify which SOUL.md sections actually influence agent behavior by running ablation experiments: it removes sections one at a time and measures the impact.
Structure With Headers, Not Paragraphs
SOUL.md files organized with clear Markdown headers outperform wall-of-text formats by 12-18% on consistency metrics. Headers help the model locate relevant instructions quickly. Use H2 headers for major sections (Identity, Behavior, Boundaries, Knowledge) and H3 for subsections.
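A skeleton following this structure might look like the snippet below. The section names mirror the four listed above; the agent identity and the individual lines are illustrative placeholders, not recommended content:

```markdown
## Identity
You are the support agent for Acme's billing dashboard.

## Behavior
### Tone
- Friendly and direct, no filler.
### Formatting
- Answer first, supporting detail second.

## Boundaries
- You do not handle refund requests.

## Knowledge
- Product documentation lives at docs.example.com.
```

H2 for the four major sections and H3 for subsections keeps the hierarchy shallow enough that every rule sits at most two levels deep.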
Within each section, use bullet points rather than paragraphs. Language models parse structured content more reliably than prose. "Never share pricing information" as a bullet point gets followed more consistently than the same instruction buried in a paragraph.
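To see the difference, compare the same rules written both ways (the rules themselves are illustrative examples, not recommendations):

```markdown
<!-- Harder to follow: rules buried in prose -->
Keep answers short where possible and never share pricing
information, and if the user seems frustrated, offer to escalate.

<!-- More reliable: one rule per bullet -->
- Keep answers short where possible.
- Never share pricing information.
- If the user seems frustrated, offer to escalate.
```

Each bullet is a discrete, checkable instruction; in the prose version, the middle rule is the one most likely to be dropped.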
Be Specific About Boundaries
The highest-performing agents have explicit boundary statements: "You do not handle refund requests. If a user asks about refunds, reply with: Please contact our refund team at refunds@example.com." Compare that to the vague version: "Redirect users to the appropriate team when their request is outside your scope." The specific version gets followed 95% of the time. The vague version: 60%.
Boundaries should cover three areas: topics the agent doesn't handle, actions it must never take, and information it must never share. Be exhaustive. Every boundary you leave implicit is one the agent will eventually cross.
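A Boundaries section covering all three areas might look like the sketch below. The refund line is taken from the example above; the remaining entries (legal questions, credits, enterprise pricing) are hypothetical placeholders for your own list:

```markdown
## Boundaries

### Topics you do not handle
- Refunds. Reply with: "Please contact our refund team at
  refunds@example.com."
- Legal questions. Reply with: "I'm not able to give legal advice."

### Actions you must never take
- Never issue credits or modify a subscription.
- Never promise a specific delivery date.

### Information you must never share
- Enterprise pricing.
- Internal ticket notes or any other customer's data.
```

Note that each "topic" boundary pairs the restriction with an exact redirect message, so the agent never has to improvise a response at the edge of its scope.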
Use Conditional Behavior Rules
Static rules like "be concise" apply the same behavior to every situation. Conditional rules adapt: "For simple factual questions, respond in one sentence. For troubleshooting, walk through diagnostic steps one at a time, waiting for user confirmation between steps." Agents with conditional behavior rules score 20-30% higher on user satisfaction because their responses feel appropriate to the situation instead of formulaic.
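In SOUL.md form, conditional rules read as condition-then-behavior bullets. The first two below come from the example above; the third is an illustrative addition showing the same pattern applied to another situation:

```markdown
## Behavior

- For simple factual questions, respond in one sentence.
- For troubleshooting, walk through diagnostic steps one at a
  time, waiting for user confirmation between steps.
- For billing disputes, summarize the account history first,
  then ask which charge the user is questioning.
```

The shared shape, "For [situation], [behavior]", gives the model an explicit trigger to match against, which is what static rules like "be concise" lack.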
Test Every Change
This is the most important practice and the one most teams skip. Every SOUL.md edit is a hypothesis about agent behavior. Test it. Run the new version through ClawSplit with at least 200 tasks before deploying to production. The five minutes it takes to set up an experiment can prevent a week of degraded performance that you'd otherwise never notice.