ECHO makes stderr part of the coding agent's world model

Microsoft Research's ECHO turns terminal output into a direct learning signal so coding agents can learn from failed logs, not only final rewards.

AI summary

What happened: Microsoft Research AI Frontiers released ECHO, a terminal-agent training method and implementation repository. ECHO adds an environment prediction loss on top of GRPO and uses stdout, stderr, logs, and file contents as prediction targets.
Key number: The paper reports TerminalBench-2.0 pass@1 for Qwen3-14B rising from 5.17% with GRPO to 10.79% with ECHO.
Why it matters: The coding-agent race is expanding beyond larger models and more tool connections toward learning from failed rollouts.
Watch: The evidence is centered on terminal and Docker tasks. It should not be read as immediate generalization to browsers, SaaS apps, or production deployment surfaces.

Microsoft Research AI Frontiers' ECHO looks small from the outside. The official repository is microsoft/echo-rl, and the paper title is "ECHO: Terminal Agents Learn World Models for Free." The core idea is also compact. When a coding agent runs a command in a terminal, the environment sends back stdout, stderr, file contents, logs, stack traces, and exit-state clues. Many agent RL setups already feed those observations into the next turn as context. ECHO asks a sharper question: why are those tokens not also direct training targets?

That is why this is more interesting than another coding-agent product announcement. Recent agent competition has widened the execution surface: IDE integrations, mobile approval, MCP connections, cloud agents, organization-level audit APIs, and remote sessions. ECHO attacks a different layer. It treats the terminal output that agents already see as supervision. Instead of only adding more tools, it tries to make the model learn more from the tools it already used, especially when the rollout failed.

What ECHO changes

A terminal-agent rollout is not a simple chat transcript. There is a user task, the agent proposes commands, the shell executes them, and the environment returns observations. A failed pytest run may reveal an import error. An ls command may show that a file is missing. A search result may expose a config key or an unexpected dependency boundary. Human developers use these outputs as reasoning material all the time.

In standard GRPO-style training, the final verifier usually supplies a sparse success or failure reward. Policy loss is applied to action tokens, while observation tokens from the terminal mostly remain context for later actions. The paper notes that in Qwen3-8B settings, on-policy rollout success can often sit below 15%. If most trajectories fail, then a final reward alone discards a lot of useful signal. Those failed trajectories still contain concrete facts about the environment: which files exist, which tests failed, which build step broke, and which command changed state.
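A minimal sketch of why a sparse final reward can leave failed rollouts without any policy signal, assuming the common GRPO-style recipe of normalizing each reward against its sampling group. The function name, the `eps` constant, and the choice of population standard deviation are illustrative assumptions, not details taken from the paper or repository.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (reward - group mean) / (group std + eps).

    `eps` is a hypothetical stabilizer so an all-equal group does not divide by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# A group where every rollout failed: each advantage is zero,
# so the policy term contributes no learning signal for this task.
all_failed = group_advantages([0.0, 0.0, 0.0, 0.0])

# A group with one success: only here does the reward separate the rollouts.
one_success = group_advantages([1.0, 0.0, 0.0, 0.0])

print(all_failed)
print(one_success)
```

Under these assumptions, a task whose rollouts all fail yields no gradient from the reward, even though every one of those rollouts still contains terminal output.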

ECHO tries to recover that signal. Its full name is Environment Cross-entropy Hybrid Objective. The method combines GRPO's policy-gradient loss with a cross-entropy loss over environment observation tokens. In practical terms, the same model is asked to predict what tokens the terminal will return after a command. The important implementation claim is that ECHO does not require a separate teacher model, extra rollouts, or a second forward pass. It reuses the same rollout and logits, then applies different masks to action tokens and observation tokens.
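The masking idea can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the repository's implementation: the policy term is a simplified advantage-weighted log-likelihood rather than the full GRPO objective, and the names `hybrid_loss`, `roles`, and `obs_weight` are hypothetical.

```python
import math

def hybrid_loss(token_logprobs, roles, advantage, obs_weight=0.1):
    """Combine a policy term on action tokens with cross entropy on observation tokens.

    token_logprobs: log p(token | prefix) for every token, from one forward pass.
    roles: "action" or "obs" for each token (hypothetical labels acting as masks).
    advantage: scalar advantage assigned to the whole rollout.
    obs_weight: hypothetical weight on the observation term.
    """
    action = [lp for lp, role in zip(token_logprobs, roles) if role == "action"]
    obs = [lp for lp, role in zip(token_logprobs, roles) if role == "obs"]
    # Simplified policy-gradient surrogate, applied to action tokens only.
    policy_term = -advantage * sum(action) / max(len(action), 1)
    # Mean negative log-likelihood of the terminal output tokens.
    obs_term = -sum(obs) / max(len(obs), 1)
    return policy_term + obs_weight * obs_term, policy_term, obs_term

logprobs = [math.log(0.5), math.log(0.25), math.log(0.5), math.log(0.125)]
roles = ["action", "obs", "action", "obs"]

# With zero advantage the policy term vanishes, but the observation term does not.
total, policy_term, obs_term = hybrid_loss(logprobs, roles, advantage=0.0)
print(total, policy_term, obs_term)
```

The point of the sketch is that both terms read the same per-token log-probabilities, so the observation loss needs no extra rollout or forward pass, only a different mask.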

[GIF: terminal-agent rollout example from the official ECHO repository]

The GIF is less a consumer product demo than a picture of the training data shape. Commands and observations alternate. Each observation becomes context for the next action. ECHO adds one more step: the observation is not only text to read, but also a target that teaches the model a small piece of the terminal world's behavior. "If I run this command in this environment, these are the kinds of results I should expect."

Why stderr is a world-model clue

"World model" can sound too grand for a terminal. It often evokes robotics or autonomous-driving simulators. But for a terminal agent, the world is the file system, shell commands, package managers, tests, build scripts, exit codes, logs, and error formats. rm deletes files. pytest exposes fixture failures. sed rewrites text. A wrong path produces an error. Much of that world's physics is visible in stdout and stderr.

Human developers routinely infer state from failure logs. "This module is missing, so the dependency was not installed or the path is wrong." "This test failed because a function signature changed but a caller was not updated." "This TypeScript error means the generic type no longer matches the returned shape." A coding agent that becomes stronger at long tasks needs the same habit. It should not merely carry a failure log forward in context. It should learn the relationship between actions and observations.

The official README describes ECHO as an extension built on SkyRL. SkyRL supplies the RL training stack, while the ECHO repository adds terminal-agent integration, the environment prediction loss, example configs, and a small SkyRL hook patch. Harbor is used as the terminal-task backend, starting Docker task containers, executing commands, and returning verifier rewards. In other words, this is not just a paper sketch. It is closer to a reproducible open-source training pipeline for terminal agents.

Agent action tokens (commands, reasoning, task-done signal) → Environment observation tokens (stdout, stderr, file contents, logs) → ECHO: GRPO loss plus observation-token cross entropy

The numbers are small, but the direction is clear

The TerminalBench-2.0 results in the paper are not high in absolute terms. For Qwen3-8B, GRPO reaches 2.70% pass@1, while ECHO reaches 5.17%. For Qwen3-14B, GRPO reaches 5.17%, while ECHO reaches 10.79%. Those numbers look modest beside the perceived quality of strong commercial coding agents. The right reading is not that a small open model suddenly catches the frontier. The useful signal is the relative improvement under a controlled change.

The researchers compare GRPO and ECHO from the same starting policy. The main difference is whether an auxiliary loss is applied to terminal-output tokens. Under that setup, the reported TerminalBench-2.0 pass@1 roughly doubles for both 8B and 14B models. The Qwen3-14B pass@5 result also rises from 13.48% to 19.10%. The paper reports broader improvements on internal evaluations such as val100, ITD, and TBLite as well.

Qwen3-8B GRPO TB2 pass@1: 2.70%
Qwen3-8B ECHO TB2 pass@1: 5.17%
Qwen3-14B ECHO TB2 pass@1: 10.79%
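The "roughly doubles" reading can be checked directly from the reported percentages. The short script below only divides the ECHO figure by the GRPO figure for each setting quoted in this article; the dictionary keys and function name are illustrative.

```python
# (GRPO, ECHO) TerminalBench-2.0 percentages as quoted in this article.
reported = {
    "Qwen3-8B pass@1": (2.70, 5.17),
    "Qwen3-14B pass@1": (5.17, 10.79),
    "Qwen3-14B pass@5": (13.48, 19.10),
}

def relative_gain(grpo, echo):
    """Ratio of the ECHO score to the GRPO score."""
    return echo / grpo

gains = {name: round(relative_gain(*pair), 2) for name, pair in reported.items()}
for name, gain in gains.items():
    print(f"{name}: {gain}x")
```

The pass@1 ratios come out at about 1.9x and 2.1x, while the 14B pass@5 ratio is closer to 1.4x, so the doubling applies to pass@1 specifically.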

Another important point is the relationship to expert SFT. The paper says GRPO plus ECHO from base Qwen3-8B matched, on internal evaluations, a setup that applied GRPO after OpenThinker-Agent-v1-SFT, which used about 15,000 expert terminal-agent demonstrations. On TerminalBench-2.0, it recovered about half of the SFT advantage. That should not be read as "ECHO replaces expert demonstrations." It is better understood as evidence that terminal-output prediction can recover some initialization benefit without needing expert trajectories.

That matters for smaller models and internal agent deployments. Not every team can keep every agent task on a frontier model. Cost, privacy, on-premise constraints, and self-hosted requirements all push organizations toward adapting smaller open-weight models for narrow environments. ECHO gives those teams another option besides collecting only successful expert examples. Failed rollouts can carry useful observation tokens too.

Read verifier-free carefully

One phrase in the paper is especially attractive: verifier-free self-improvement. In some settings, the researchers report that environment prediction loss alone can improve performance on unseen out-of-distribution tasks. That is a compelling idea because verifiers are hard to build for many real software tasks. Tests are incomplete. The desired output may not be unique. A user's actual intent is often under-specified.

But the phrase needs a narrow reading. ECHO shows that predicting observations in a terminal environment can help task success. It does not prove that agents can self-improve without rewards across arbitrary work. Terminal output is relatively structured, and the causal link between a command and its result is often short. Browser workflows, CRM systems, email, payments, cloud deployment, and database operations produce more partial observations. Mistakes can have higher cost, and some outcomes only appear much later.

So ECHO's stronger message is not "verifiers are unnecessary." It is "do not look only at verifiers." Final success and failure rewards still matter. The logs, errors, file changes, and test outputs between the first command and the final verifier can also become dense supervision. As coding agents handle longer tasks, this intermediate signal becomes more valuable. The cost and quality of an agent session are shaped not only by whether the final PR passes, but by how well the agent interprets every failure on the way there.

The appeal of a free signal

At the time of the Korean source article, a large independent Hacker News thread was not yet easy to identify. Even so, the work had spread through a Digg cluster mirroring X posts and through The Neuron's May 18 AI roundup. The recurring reaction was intuitive: agents already receive terminal replies, so why throw that signal away?

That intuition fits the broader 2026 agent-infrastructure shift. Coding agents are moving from autocomplete into cloud sessions, remote control, MCP, audit systems, and cost routing. Better operational surfaces help, but they do not solve repeated failures by themselves. If an agent keeps misreading logs or repeating a broken command pattern, the cost remains high. The agent needs to anticipate command results, read failure output more accurately, and pick the next action faster.

ECHO turns that product pain into a training objective. Improving agents is not only a question of model size, context-window length, or the number of external tools. It is also a question of how interaction traces become training data. This direction is especially practical for self-hosted agents and internal coding systems. Company terminal logs, CI output, build traces, and test failures are hard to ship outside the organization, but they are very specific internal learning signals if they can be scrubbed and governed.

What changes for enterprise engineering teams

First, agent logs become learning assets, not only audit artifacts. Many teams already want to store agent transcripts for debugging, compliance, or incident review. ECHO-style training suggests those transcripts can also improve model behavior. That does not mean raw logs should be dumped into training. Secrets, customer data, credentials, and personal information need redaction. But systematically collecting scrubbed terminal traces, CI failures, and test outputs may become part of operating a serious coding-agent platform.
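As a rough illustration of what scrubbing a trace before storage might look like, here is a minimal sketch using two hypothetical regular expressions. The patterns and placeholder strings are assumptions chosen for the example; a real pipeline would need a dedicated secret scanner and review, and nothing here comes from the ECHO repository.

```python
import re

# Hypothetical patterns: key=value style secrets and email addresses.
SECRET_ASSIGNMENT = re.compile(
    r"\b(api[_-]?key|token|password|secret)\b(\s*[=:]\s*)\S+", re.IGNORECASE
)
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def scrub(line):
    """Replace secret-looking values and email addresses with placeholders."""
    line = SECRET_ASSIGNMENT.sub(r"\1\2[REDACTED]", line)
    line = EMAIL.sub("[EMAIL]", line)
    return line

trace = [
    "export API_KEY=abc123",
    "mail from dev@example.com failed",
    "pytest: 3 failed",
]
scrubbed = [scrub(line) for line in trace]
print(scrubbed)
```

Lines that carry no matching pattern pass through unchanged, which keeps the useful failure text intact.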

Second, teams need evaluation that does not discard failure. A final PR merge status says little about where an agent struggled. Useful intermediate metrics include repeated errors, unnecessary file exploration, misunderstood test output, dependency installation mistakes, and commands that keep failing in the same way. ECHO pulls one slice of those intermediate signals into the training objective. Operations teams can borrow the same lens when observing agent sessions.
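One of the intermediate metrics named above, commands that keep failing in the same way, is cheap to compute from a session log. The sketch assumes a hypothetical log shape of `(command, exit_code)` pairs; real agent transcripts will need their own parsing.

```python
from collections import Counter

def repeated_failures(steps):
    """Return commands that exited non-zero more than once in a session.

    steps: list of (command, exit_code) tuples (a hypothetical log format).
    """
    failures = Counter(cmd for cmd, code in steps if code != 0)
    return {cmd: count for cmd, count in failures.items() if count > 1}

session = [
    ("pytest", 1),
    ("pip install foo", 0),
    ("pytest", 1),
    ("ls missing", 2),
]
repeats = repeated_failures(session)
print(repeats)
```

A count like this says nothing about why the agent repeated itself, but it flags sessions worth reading in full.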

Third, smaller models deserve another look. ECHO reports results on Qwen3-8B and Qwen3-14B models. The paper does not claim frontier-level absolute performance. It does suggest that environment-specific reinforcement can make smaller models more useful. That matters for organizations that cannot send every repository, terminal log, or internal build trace to an external frontier API.

Fourth, tool output should be designed as something models can learn from. If logs are noisy, nondeterministic, or constantly changing format, they become weaker training signals. If important state is hidden, the agent cannot infer it reliably. Human-readable logs and model-learnable logs are not always identical. Agent-friendly CI output, structured errors, deterministic harnesses, and stable task environments may become more important as training moves closer to real execution traces.
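One way to make logs more stable as a training signal is to normalize fields that change on every run. The sketch below assumes ISO-style timestamps and durations written like `0.42s`; both patterns and the placeholder tokens are illustrative choices, not a format used by ECHO or Harbor.

```python
import re

# Hypothetical patterns for fields that differ between otherwise identical runs.
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:\.\d+)?Z?")
DURATION = re.compile(r"\b\d+(?:\.\d+)?s\b")

def normalize(line):
    """Replace run-specific timestamps and durations with fixed placeholders."""
    line = TIMESTAMP.sub("<TS>", line)
    line = DURATION.sub("<DUR>", line)
    return line

first_run = normalize("2026-05-18T10:02:03Z 3 passed in 0.42s")
second_run = normalize("2026-05-19T08:15:40Z 3 passed in 0.57s")
print(first_run)
```

After normalization the two runs produce the same line, so a model is not asked to predict values that no amount of reasoning about the environment could determine.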

What ECHO still does not answer

The biggest limitation is environment scope. The terminal is powerful, but real agents increasingly operate across more surfaces: browsers, spreadsheets, design tools, Slack, Jira, cloud consoles, database admin tools, and internal web apps. In GUI environments, the relationship between action and observation is more complex. Pixels, accessibility trees, network state, hidden state, and delayed side effects all interact. It remains an open question whether terminal-output prediction gives the same efficiency in those settings.

The second limitation is safety. A model that better predicts environmental effects is not automatically safer. If it understands what a command will do, it can automate useful work more effectively, but it may also understand harmful commands better. ECHO-like training needs to be considered alongside sandboxing, permissioning, redaction, and policy enforcement. Microsoft's open-source blog around the same period argued that agentic systems need governance primitives; ECHO sits squarely inside that larger concern.

The third limitation is data quality. Terminal output is rich, but it is not always good supervision. Long meaningless logs, flaky errors, external network failures, nondeterministic timestamps, and leaked secrets can make training worse or riskier. The paper notes that some tokens, such as warning prefixes, are excluded while actual command-output tokens are targeted. Choosing what counts as an observation target is itself a design problem.
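Target selection can be pictured as a mask over observation lines. The sketch assumes harness-inserted lines are recognizable by a fixed prefix; the prefix tuple and function name are hypothetical, and the paper's actual rule operates on tokens and may differ.

```python
def target_mask(lines, skip_prefixes=("WARNING:",)):
    """Return True for lines that should count as prediction targets.

    skip_prefixes: hypothetical prefixes marking lines to exclude from the loss.
    """
    return [not line.startswith(skip_prefixes) for line in lines]

observation = [
    "WARNING: output truncated",
    "ModuleNotFoundError: No module named 'foo'",
]
mask = target_mask(observation)
print(mask)
```

Even in this toy form the design question is visible: every prefix added to the skip list removes noise, but it also removes something the model might have learned to expect.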

The next bottleneck is failure interpretation

ECHO will not immediately change a developer's screen the way a major product launch does. Its importance is quieter. It points to a training bottleneck for teams that actually operate coding agents. As agents work for longer stretches, the key capability is not only getting the first answer right. It is learning from failure, avoiding the same mistake twice, and turning terminal clues into the next useful action.

AI coding news often centers on stronger models, longer context, more tools, and smoother approval UI. ECHO surfaces the less visible bottleneck in between. Agents already receive a lot of feedback. Tests explain why they failed. Builds reveal missing packages. Shells expose permission and path problems. The hard question is how densely that feedback becomes learning.

That is the cleanest reading of the work: ECHO promotes stderr from annoying red text into training data. For developers, stderr is not just noise. It is the environment telling the truth. Coding agents should not only read that truth in context; they should predict it, internalize it, and use it to act better next time. The first useful world model for a coding agent may not begin in a giant 3D simulator. It may begin with a failed pytest run and a ModuleNotFoundError.