From 0.25 to 0.61, MOSS lets agents rewrite their own code

The MOSS paper proposes a self-evolution loop in which agents collect failure evidence, patch their own source code, validate the patch in trial containers, and promote it with a rollback path.

AI summary

What happened: The arXiv paper MOSS proposes a source-level self-modification loop for autonomous agents.
- In OpenClaw experiments, the authors report that the average grader score across four tasks rose from 0.25 to 0.61 after one evolution cycle.
The shift: Instead of only editing prompts, skills, or memory files, MOSS edits the harness code that controls routing, hook order, and runtime behavior.
Operating model: The safety envelope is built from failure evidence batches, trial worker replay, user consent, container swap, and health-probe rollback.
- The public GitHub link returned 404 when checked on May 22, 2026, so reproducibility still needs to be treated as unresolved.
Watch: The important question is not whether an agent can patch itself, but which code it may change and what validation can block a bad promotion.

The paper MOSS: Self-Evolution through Source-Level Rewriting in Autonomous Agent Systems, posted to arXiv on May 21, 2026, pushes the self-evolving agent discussion into a more practical and more dangerous layer. Until now, the phrase self-evolving agent has usually meant editing prompts, adding skill files, reshaping memory schemas, or tuning workflow graphs. MOSS asks a different question: what if the recurring failure is not in the instruction text, but in the source code of the agent harness itself?

The paper's answer is direct. The agent collects evidence from failures, turns that evidence into candidate source-code patches, validates the candidate inside an isolated trial worker by replaying the failure batch, and then asks for user consent before swapping the running container. The authors report that, on OpenClaw, the average grader score across four tasks improved from 0.25 to 0.61 after a single evolution cycle. That is a meaningful result as a demonstration. But the real story is less about the score and more about the boundary. Letting an AI agent revise its prompt and letting it revise the code that executes the agent are not the same operational risk.

Why move beyond prompts

Most agent systems in production are not just a model and a prompt. They read files, run commands, open browsers, call external tools, record failures, retry tasks, and preserve state across a long workflow. Around the model sits a harness: routing logic, hook ordering, state invariants, permission checks, dispatch rules, retry policies, logging, and cost controls. The user experiences this as "the agent responded," but the actual behavior is a combination of model calls and surrounding runtime code.

MOSS argues that existing self-evolution methods do not reach enough of that surrounding runtime. Prompts and skill files are text, so they can be edited easily. Memory schemas and workflow graphs can also be adjusted to some extent. But structural failures embedded in code may sit below that layer. If the failure is that an approval hook must run before a tool call but the order is inverted in the harness, telling the model to "check approval carefully" is not a stable fix. The harness order itself has to change.
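A minimal sketch of the hook-order case, assuming a harness that runs a list of steps in sequence. It is an illustration and not code from the paper; `approval_hook`, `tool_call`, `run_pipeline`, and the context keys are hypothetical names. The tool step fails closed, so the inverted order surfaces as an error that no prompt wording can fix.

```python
from typing import Callable, Dict, List

Step = Callable[[Dict], None]


def approval_hook(ctx: Dict) -> None:
    # Record whether the requested tool is on the allowed list.
    ctx["approved"] = ctx["tool"] in ctx["allowed_tools"]


def tool_call(ctx: Dict) -> None:
    # The harness, not the model, enforces that approval came first.
    if not ctx.get("approved", False):
        raise PermissionError(f"tool {ctx['tool']!r} ran without approval")
    ctx["result"] = f"ran {ctx['tool']}"


def run_pipeline(steps: List[Step], ctx: Dict) -> Dict:
    # Steps run strictly in list order; the order is part of the source code.
    for step in steps:
        step(ctx)
    return ctx


BUGGY_ORDER: List[Step] = [tool_call, approval_hook]  # inverted order
FIXED_ORDER: List[Step] = [approval_hook, tool_call]  # the source-level fix
```

Running `run_pipeline(FIXED_ORDER, ...)` completes the call, while `BUGGY_ORDER` raises `PermissionError` regardless of what the model was told.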

The paper frames source-level adaptation as a more general medium, and gives four reasons: harness code is written in a Turing-complete language, it subsumes many text-mutable artifacts such as prompts and skill files, it depends less on whether the base model follows instructions on every turn, and it is less vulnerable to long-context drift than a growing pile of rules. That is a strong claim, but the underlying problem is familiar to anyone operating agents. As more failures come from the surrounding harness rather than the model's knowledge, self-improvement pressure moves outside the prompt.

The MOSS loop

The loop MOSS proposes looks less like a romantic idea of an AI "evolving itself" and more like a CI/CD pipeline for an agent runtime. It starts by collecting failure evidence from production. The paper describes this as an automatically curated batch of production-failure evidence. That grounding matters. The improvement target is not a vague self-rating, but a fixed set of failures: the inputs, states, traces, and grader outcomes that made the agent underperform.
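One way to picture such a batch is as a list of fixed records. The sketch below is an assumption about shape, not the paper's schema: the field names, the 0 to 1 score range, and the `threshold` parameter are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass(frozen=True)
class FailureEvidence:
    task_id: str
    inputs: Dict[str, Any]   # what the agent was asked to do
    state: Dict[str, Any]    # relevant runtime state at failure time
    trace: List[str]         # ordered log of steps or tool calls
    grader_score: float      # assumed to lie in [0, 1]


@dataclass
class EvidenceBatch:
    records: List[FailureEvidence] = field(default_factory=list)

    def add(self, record: FailureEvidence, threshold: float = 0.5) -> bool:
        """Keep only records that scored below the (hypothetical) threshold."""
        if record.grader_score < threshold:
            self.records.append(record)
            return True
        return False

    def mean_score(self) -> float:
        if not self.records:
            raise ValueError("empty batch")
        return sum(r.grader_score for r in self.records) / len(self.records)
```

Freezing each record reflects the point made above: the improvement target is a fixed set of failures, not something the loop can quietly rewrite.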

Code modification is delegated to a pluggable external coding-agent CLI. MOSS itself is not the model that writes every patch. It orchestrates the stages and the verdicts, while an external coding agent proposes the actual code change. That separation is important. One of the riskiest designs for a self-evolving system is one where the patch author, judge, and deployer all blur together. In MOSS, at least at the design level, patch generation is delegated while validation and promotion are kept in a fixed protocol.

Candidate patches do not go straight into the production container. MOSS builds a candidate image and runs the failure batch inside an ephemeral trial worker. It checks whether the old failures still reproduce, whether the grader score improves, and whether the candidate breaks the system. This is the crucial line between "the agent wrote a plausible patch" and "the patch survived replay." Self-modifying agents can easily overfit to one failure log or improve one path while breaking another. Isolated replay is the minimum firewall.
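The replay check can be sketched as a comparison between baseline and candidate scores on the same batch. This is a simplified illustration under stated assumptions: `replay`, `ReplayVerdict`, and the rule "higher mean and no per-task regression" are hypothetical, and the paper's actual verdict logic may differ.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class ReplayVerdict:
    baseline_mean: float
    candidate_mean: float
    regressions: List[str]  # tasks where the candidate scored lower

    @property
    def promotable(self) -> bool:
        # Assumed rule: the mean must improve and no task may get worse.
        return self.candidate_mean > self.baseline_mean and not self.regressions


def replay(batch: Dict[str, float],
           run_candidate: Callable[[str], float]) -> ReplayVerdict:
    """batch maps task id to baseline score; run_candidate scores one task."""
    if not batch:
        raise ValueError("empty failure batch")
    candidate = {task: run_candidate(task) for task in batch}
    regressions = sorted(t for t in batch if candidate[t] < batch[t])
    n = len(batch)
    return ReplayVerdict(
        baseline_mean=sum(batch.values()) / n,
        candidate_mean=sum(candidate.values()) / n,
        regressions=regressions,
    )
```

Tracking per-task regressions alongside the mean is what catches a patch that improves one path while breaking another.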

Production failure evidence batch
↓
External coding-agent CLI proposes source-code patch
↓
Candidate image is validated by replay in an ephemeral trial worker
↓
In-place container swap after user consent
↓
Roll back if health probes fail

Promotion is also not described as fully autonomous deployment. The paper calls for a user-consent-gated in-place container swap. In other words, a candidate can pass trial replay and still wait for a human approval gate before it replaces the running container. After promotion, health probes can trigger rollback. This is where agent research starts to sound like operations engineering. If self-evolution is going to leave the demo environment, the promotion and rollback contract matters more than the word "evolution."
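The promotion contract described here can be written down in a few lines. The following is a sketch, not the paper's implementation: `promote` and its three callables are hypothetical stand-ins for the consent prompt, the container swap, and the health probe.

```python
from typing import Callable


def promote(
    current_image: str,
    candidate_image: str,
    user_consents: Callable[[str], bool],
    swap_to: Callable[[str], None],
    healthy: Callable[[], bool],
) -> str:
    """Return the image left running after the promotion attempt."""
    if not user_consents(candidate_image):
        return current_image      # no consent: nothing is swapped
    swap_to(candidate_image)      # in-place swap to the candidate
    if healthy():
        return candidate_image
    swap_to(current_image)        # health probe failed: roll back
    return current_image
```

The useful property is that every exit path names which image is running, so a failed promotion always ends on the previous image.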

What the 0.25 to 0.61 score does and does not prove

The most visible number in the paper is the OpenClaw result: a single MOSS cycle lifted the average grader score across four tasks from 0.25 to 0.61. As a proof point that source-level self-modification can improve an agentic substrate, that is substantial. It is especially notable because what was rewritten is not a prompt but the source code of the production agentic substrate.
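For scale, the reported averages can be turned into an absolute and a relative gain. The helper below is only arithmetic on the two published numbers; `score_gain` is a hypothetical name, and nothing here reflects per-task scores, which this article does not have.

```python
from typing import Tuple


def score_gain(before: float, after: float) -> Tuple[float, float]:
    """Return (absolute gain, gain relative to the starting score)."""
    if before <= 0:
        raise ValueError("relative gain needs a positive starting score")
    absolute = after - before
    return absolute, absolute / before


absolute, relative = score_gain(0.25, 0.61)
print(f"absolute gain: {absolute:.2f}, relative gain: {relative:.0%}")
```

That works out to an absolute gain of 0.36, or 144 percent relative to the starting average. The relative figure looks large partly because the baseline is low, which is one more reason to read the number with care.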

The number still needs careful reading. First, this is an arXiv preprint. Second, while the paper lists a public code link, the GitHub repository returned 404 when checked on May 22, 2026. Third, the reported evaluation covers four tasks, which is still a narrow slice. Fourth, a higher grader score does not automatically prove broader operational reliability, security boundary preservation, permission safety, or regression resistance.

That caution does not erase the result. Agent improvement is often described as "attach a better model" or "tune the prompt." MOSS points to a different lever: the harness can significantly shape agent performance and failure modes even when the model layer is unchanged. In the coding-agent era, performance is not only a function of model weights. It is also shaped by what failures are stored, what evidence is replayed, what source files can be modified, and what validation must pass before deployment.

Text-artifact evolution versus source-level evolution

This does not make prompt, skill, memory, or workflow evolution obsolete. Those layers are still important. They are cheap to edit, easier for humans to review, and usually carry a smaller blast radius. The issue is that not every agent failure lives at that layer.

Category	What changes	Strength	Failures it can miss
Text-artifact evolution	Prompts, skills, memory, workflows	Easy to review with a smaller change radius	Routing, hook order, state invariants, dispatch bugs
Source-level evolution	Agent harness and runtime code	Can repair structural failures	Regression, permission expansion, and deployment risk when validation is weak

Source-level evolution is more powerful, and that is precisely why it is riskier. A bad prompt usually degrades answer quality. A bad runtime patch can erase logs, bypass approvals, create runaway retry loops, break cost limits, or change data access behavior. So the most interesting part of MOSS is not the sentence "the AI patched code." It is the validation and promotion machinery around that sentence.

This also connects to the current coding-agent race. Tools such as Claude Code, Codex, Cursor, and OpenClaw are taking on larger units of work. As the work grows, the harness around the agent becomes more important. Context injection, tool-call control, execution observability, failure reproduction, and cost limits are not side details. If that surrounding system is weak, model capability alone will not produce reliable automation.

The real question is authority

If you imagine operating a MOSS-like system, the first question is not "how much does the score improve?" It is "which files is the agent allowed to change?" If an agent can modify its entire harness, then approval policy, logging policy, network access, cost limits, and even evaluation logic may become part of the writable surface. That opens the possibility that a self-improvement loop weakens its own guardrails.

The practical answer has to be a policy boundary around writable code. For example, routing behavior may be writable while authentication and secret-handling code are not. Log formatting may be editable while audit-log delivery paths remain immutable. The replay harness may need to be excluded from the patch surface. In a self-modifying system, once the evaluator and the evaluated component share the same authority, score improvement becomes harder to trust.
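A writable-surface policy of this kind could be expressed as path patterns checked against every proposed patch. The sketch below is an assumption, not something the paper specifies: the directory layout, the pattern lists, and `rejected_paths` are all hypothetical, and the default is to deny anything not explicitly writable.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List

# Hypothetical layout: what a patch may touch, and what it may never touch.
WRITABLE = ["harness/routing/*.py", "harness/logging/format*.py"]
PROTECTED = ["harness/auth/*", "harness/secrets/*", "harness/audit/*", "replay/*"]


def rejected_paths(changed: Iterable[str]) -> List[str]:
    """Return the changed paths that a patch is not allowed to touch."""
    rejected = []
    for path in changed:
        protected = any(fnmatchcase(path, pat) for pat in PROTECTED)
        writable = any(fnmatchcase(path, pat) for pat in WRITABLE)
        # Protected always wins; anything unlisted is denied by default.
        if protected or not writable:
            rejected.append(path)
    return rejected
```

Keeping `replay/*` in the protected list is the code-level version of the point above: the evaluator stays outside the authority of the thing it evaluates.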

The second question is the data boundary. A production-failure evidence batch can contain user input, internal documents, source code, API responses, and error logs. Sending that batch to an external coding-agent CLI changes the organization's data path. The paper's pluggable design is flexible, but operators still need to know exactly which CLI sees which evidence. In enterprise codebases, the failure log itself can be sensitive data.
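One partial mitigation is to scrub evidence before it leaves the organization's boundary. The sketch below is illustrative only: the two patterns and the `redact` helper are hypothetical, and pattern-based redaction will miss sensitive content that does not match a known shape.

```python
import re

# Hypothetical patterns; a real deployment would need a much broader set.
_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "bearer_token": re.compile(r"(?i)bearer\s+[a-z0-9._~+/=-]+"),
}


def redact(text: str) -> str:
    """Replace each matched span with a labeled placeholder."""
    for label, pattern in _PATTERNS.items():
        text = pattern.sub(f"[REDACTED:{label}]", text)
    return text
```

Redaction reduces what the external CLI sees but does not answer the underlying question of which tool is allowed to see which evidence.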

The third question is cost and runaway behavior. The self-evolution loop collects failure batches, generates patches, builds images, runs replay, and manages promotion and rollback. That is heavier than a model call. More failures can trigger more improvement work, and a poorly scoped loop can become a new compute and cost bottleneck. "The agent fixes itself" sounds attractive, but in practice it means adding another deployment pipeline and another budget surface.
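A scoped loop needs an explicit budget. As a rough sketch, assuming cost can be estimated per cycle, a guard like the following could refuse new evolution cycles once a limit is reached; `EvolutionBudget` and its fields are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass
class EvolutionBudget:
    max_cycles: int
    max_cost_usd: float
    cycles_used: int = 0
    cost_used_usd: float = 0.0

    def try_start_cycle(self, estimated_cost_usd: float) -> bool:
        """Reserve budget for one cycle, or refuse if a limit would be exceeded."""
        if self.cycles_used >= self.max_cycles:
            return False
        if self.cost_used_usd + estimated_cost_usd > self.max_cost_usd:
            return False
        self.cycles_used += 1
        self.cost_used_usd += estimated_cost_usd
        return True
```

Capping both cycle count and spend matters because a burst of failures would otherwise translate directly into a burst of patch, build, and replay work.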

What builders should take from it

This is not a signal that every team should deploy self-modifying agents immediately. The paper is still a preprint, and the public code was not available when checked. But it does point to a concrete direction for teams already operating coding agents: failure evidence is becoming a first-class asset.

If your team uses coding agents, "it failed" is not enough. You need to know the input, repository state, tool calls, output, and evaluation criteria behind the failure. That evidence makes improvement testable, whether a human writes the patch or an agent proposes it. Without failure evidence, self-improvement is not learning. It is guessing.

The agent harness also needs to be testable. Tool-call ordering, permission checks, state transitions, cost limits, and file-access policies should be expressible as tests or replayable assertions. MOSS-style trial replay only matters if there is an environment to replay and a verdict worth trusting. As agents do more work, the operational capability is not "we gave the task to an agent." It is "we encoded the invariants the agent is not allowed to break."
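As an example of a replayable assertion, a tool-call ordering invariant can be checked against a recorded trace. This sketch assumes traces are stored as `(kind, tool)` events; the event format and the `violations` function are hypothetical.

```python
from typing import List, Tuple

Event = Tuple[str, str]  # (kind, tool name), e.g. ("approve", "shell")


def violations(trace: List[Event]) -> List[str]:
    """Check that every tool call was preceded by an approval for that tool."""
    approved = set()
    problems = []
    for index, (kind, tool) in enumerate(trace):
        if kind == "approve":
            approved.add(tool)
        elif kind == "call" and tool not in approved:
            problems.append(f"event {index}: {tool} called without approval")
    return problems
```

A check like this can run against both old production traces and traces from a candidate image, which is what makes the invariant replayable rather than aspirational.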

Human approval does not disappear either. MOSS includes user-consent-gated promotion for a reason. If every runtime patch waits for a human, automation slows down. If runtime code changes without a human gate, the risk becomes too large. Real products will likely need approval tiers: low-risk logging changes may promote automatically, while permission checks or network routing changes need review.
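Approval tiers could be derived from which files a patch touches. The following is a sketch under assumed names: the `Tier` enum, the `LOW_RISK` patterns, and the directory layout are hypothetical, and a patch is auto-promotable only if every changed path is low risk.

```python
from enum import Enum
from fnmatch import fnmatchcase
from typing import Iterable


class Tier(Enum):
    AUTO = 1
    HUMAN_REVIEW = 2


# Hypothetical low-risk surface; everything else needs a human.
LOW_RISK = ["harness/logging/*"]


def required_tier(changed_paths: Iterable[str]) -> Tier:
    paths = list(changed_paths)
    all_low_risk = all(
        any(fnmatchcase(path, pat) for pat in LOW_RISK) for path in paths
    )
    # An empty patch is treated as suspicious rather than auto-approved.
    if paths and all_low_risk:
        return Tier.AUTO
    return Tier.HUMAN_REVIEW
```

The tier is decided by the riskiest file in the patch, so bundling a permission change with a logging change does not lower the bar.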

The next boundary of agent evolution

MOSS is not simply a story about agents becoming smarter. It is a sign that the space of agent self-improvement may expand from prompts and skills into source code and container deployment. That expansion is exciting, but it also turns the discussion back into software delivery. There are candidate patches, validation environments, promotion conditions, rollback paths, and audit logs.

In that sense, the future MOSS sketches is not a fully autonomous and magical agent. It is stronger automation that requires stronger operational boundaries. If an agent can edit its own code, humans may no longer write every patch by hand. But they still need to design what code may be changed, what evidence proves the change helped, what tests must pass before promotion, and what conditions force rollback.

The paper is an early marker. Code availability, reproducibility, larger task sets, regression behavior, security boundaries, and cost models still need follow-up. Yet one thing is already clear: the competition in AI agents is moving beyond model quality alone. Harnesses and operating loops are becoming part of the product. MOSS makes that shift explicit. If we want agents to work better, we may eventually put the code around the agent itself on the improvement surface. From that moment, the central question changes from "can it fix itself?" to "can we trust the fix?"