90% of PRs are agent-built: Warp exposes the new bottleneck

OpenAI and Warp show that the coding-agent race is shifting from code generation to open-source verification, observability, and agent orchestration.

AI summary

What happened: OpenAI published a Warp customer story on May 27, putting a GPT-5.5-based agentic development workflow in the foreground.
- Warp combines its open-source repository with Oz, an orchestration layer where humans set goals and verify outcomes while agents plan, implement, test, and draft PRs.
The numbers: In Warp's internal benchmark, GPT-5.5 used 30% fewer tokens per agentic coding task than GPT-5.4.
The signal: Warp says roughly 90% of its internal PRs are now created with agents.
- The bottleneck moves from writing code to issue selection, context management, runtime observability, and human review.
Watch: Community reaction is sharply split around AI terminals, login requirements, agent bloat, licensing, and data control.

OpenAI's Warp customer story, published on May 27, 2026, can be read as a simple startup success case. The headline mentions GPT-5.5. The body highlights token efficiency and growth numbers. For developers, the more important scene is elsewhere. Warp has opened its repository, and it is trying to turn that repository from a place where humans personally implement every change into a place where humans supervise fleets of agents. Agents write a lot of the code. Humans decide what should be built, verify whether the result is correct, and judge whether a change belongs in the product.

That shift captures the next phase of the coding-agent race. Over the past year, the main question was "how well can AI write code?" The question is now closer to "how do many agent-created changes become an operating software-production system?" Long-running agent work needs more than a stronger model. It needs reproducible environments, permissions, logs, memory, evaluation, review, recovery from failure, and an open-source feedback loop that can actually inspect the work. Warp's Oz layer is an attempt to make that operational layer visible.

OpenAI says Warp is used by nearly one million developers and by more than 56% of the Fortune 500. Warp also says agents help create roughly 90% of its internal PRs. In Warp's internal benchmark, GPT-5.5 used 30% fewer tokens per agentic coding task than GPT-5.4. Those numbers do not point to "smarter autocomplete." They point to a cost structure where running agents repeatedly, decomposing long tasks, and turning outputs into reviewable PRs become core product problems.

30% fewer tokens per GPT-5.5 task

90% of internal Warp PRs co-created with agents

56%+ Fortune 500 adoption cited by OpenAI

Warp opened more than a code repository

Warp announced on April 28 that it was making the terminal client open source in a post titled "Warp is now open-source". The important phrase is not only "open source." It is "agent-first workflow." Warp says the community can participate in product development through Oz, its cloud agent orchestration platform. In that model, humans provide ideas, direction, and validation. Agents take on more of the planning, code-writing, and testing load.

For open source, that is a meaningful change. Traditional contribution asks someone to read an issue, understand the code, create a patch, and negotiate the result with maintainers. Warp's proposed flow asks contributors to write less code directly. Instead, they judge which problems matter, what correct behavior looks like, and whether an agent's result matches user expectations. Product judgment and verification become more central than raw implementation ability.

The model is attractive because open-source projects have always had a backlog problem. Feature requests pile up. Maintainer time is scarce. If agents can handle a meaningful share of triage, planning, implementation drafts, and test execution, stale issues become easier to attack. Warp makes that argument explicitly: community ideas, Oz agents, structured process, context, and self-improvement loops could produce a better product than an internal team could produce alone.

But the same model raises uncomfortable questions. If open-source contribution moves from coding to verification, who is accountable? Who guarantees the quality of an agent-created PR? Where does the subtle product judgment between "the issue asked for this" and "the product should behave this way" get recorded? Can reviewers see the context an agent read, the commands it ran, the tests it failed, and the alternatives it skipped? Open source can become faster only if verification gets denser at the same time.
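One way to make those questions answerable is to attach a structured trace to each agent-created PR. The following is a minimal sketch, not a description of how Oz records agent activity: `AgentTrace`, its field names, `render_trace`, and the sample values are all hypothetical and exist only to illustrate what a reviewer-facing record could contain.

```python
from dataclasses import dataclass, field


@dataclass
class AgentTrace:
    """Hypothetical record of what an agent did while producing a PR."""
    files_read: list[str] = field(default_factory=list)
    commands_run: list[str] = field(default_factory=list)
    failed_tests: list[str] = field(default_factory=list)
    alternatives_skipped: list[str] = field(default_factory=list)


def render_trace(trace: AgentTrace) -> str:
    """Format the trace as plain text a reviewer can read in a PR body."""
    sections = [
        ("Context read", trace.files_read),
        ("Commands run", trace.commands_run),
        ("Tests that failed", trace.failed_tests),
        ("Alternatives skipped", trace.alternatives_skipped),
    ]
    lines = []
    for title, items in sections:
        lines.append(f"{title}:")
        # Show an explicit marker so an empty section is not mistaken for a gap.
        shown = items if items else ["(none recorded)"]
        lines.extend(f"  - {item}" for item in shown)
    return "\n".join(lines)


# Illustrative values only; none of these come from a real run.
trace = AgentTrace(
    files_read=["src/example.rs"],
    commands_run=["cargo test"],
    alternatives_skipped=["larger refactor of the module"],
)
text = render_trace(trace)
print(text)
```

The empty `failed_tests` section is rendered explicitly, so a reviewer can tell "no failures were recorded" apart from "nobody recorded the failures".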

Oz is closer to a control plane than a terminal

OpenAI describes Warp's Oz as an agent control plane that bridges local and cloud environments. A developer can start agents from a web interface, choose predefined skills and environments, select a model and hosting configuration, and monitor long-running workflows centrally. Agents continue running remotely. The developer can watch a live session, review generated artifacts, or move work between cloud and local environments without losing context.

Warp's Oz product page points in the same direction. It describes a platform for orchestrating Claude Code, Codex, Warp Agent, and other harnesses locally or in the cloud, with automation for recurring jobs, large feature builds, migrations, and production deployments. The crucial word is "orchestration." Whether a single agent can write good code is already a crowded competition. Oz aims at the coordination cost that appears when multiple agents and harnesses run at the same time.

[Image: Warp Oz dashboard showing cloud agent execution (official product screenshot)]

This is also a signal that coding agents are moving beyond chat panels inside IDEs. As agentic development grows, work separates into layers. One layer is the short local loop where a human talks to an agent in a terminal. Another is the asynchronous cloud runtime where longer tasks keep running. A third is the public collaboration loop across GitHub issues, PRs, reviews, and release notes. Warp is trying to connect those layers through the terminal, Oz, and an open-source repository.

That changes the basis of competition. Teams are no longer only choosing the best model. They are deciding which agent should handle which task, how parallel execution is limited, how costs are tracked, how failures are retried, and where humans enter the loop. This is why the OpenAI post talks about context compaction, persistent memory, and dedicated subagents for code search and file analysis. Long-horizon quality depends not only on momentary reasoning, but on state management and observability.
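The coordination concerns in that list (parallelism limits, retries, cost tracking, and human hand-off) can be sketched in a few lines. This is an illustration under stated assumptions, not Oz's implementation: `run_with_limits`, `RunResult`, and `fake_agent` are hypothetical names, the agent is a stub, and the flat 100 tokens per attempt is a made-up number.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class RunResult:
    task: str
    attempts: int
    tokens_used: int
    succeeded: bool  # False means the task should be escalated to a human


async def run_with_limits(tasks, run_agent, max_parallel=2, max_attempts=3):
    """Run agent tasks with bounded parallelism, retries, and token accounting.

    `run_agent` is a hypothetical async callable taking (task, attempt) and
    returning (ok, tokens_used) for that attempt.
    """
    gate = asyncio.Semaphore(max_parallel)

    async def run_one(task):
        tokens = 0
        for attempt in range(1, max_attempts + 1):
            async with gate:  # at most `max_parallel` attempts run at once
                ok, used = await run_agent(task, attempt)
            tokens += used  # failed attempts still cost tokens
            if ok:
                return RunResult(task, attempt, tokens, True)
        # Every attempt failed: stop retrying and hand the task to a person.
        return RunResult(task, max_attempts, tokens, False)

    # gather() returns results in the same order as the input tasks.
    return await asyncio.gather(*(run_one(t) for t in tasks))


async def fake_agent(task, attempt):
    """Stub agent: 'flaky' fails once, 'broken' always fails."""
    await asyncio.sleep(0)
    ok = task != "broken" and not (task == "flaky" and attempt == 1)
    return ok, 100  # assumed flat cost per attempt


results = asyncio.run(run_with_limits(["simple", "flaky", "broken"], fake_agent))
for result in results:
    print(result)
```

The semaphore is held per attempt rather than per task, so a task that is retrying does not block the queue between attempts. A real system would likely add backoff and per-task budgets on top of this.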

What the GPT-5.5 number says about cost

OpenAI's 30% token-efficiency claim can sound modest. In agentic coding, it matters differently than in a normal chat completion. An agent does not answer once and stop. It reads the repository, runs tests, interprets failures, edits code, and verifies again. It may call subagents, compact context, search files, and repeat the loop. When one task expands into dozens of turns and many tool calls, a 30% token reduction is not just an API bill improvement. It can determine how many tasks the same budget can run.
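The budget effect is easy to check with arithmetic. In the sketch below, only the 30% reduction comes from Warp's internal benchmark. The token budget and the per-task token count are made-up numbers, and `tasks_per_budget` is a hypothetical helper.

```python
def tasks_per_budget(budget_tokens: int, tokens_per_task: int) -> int:
    """Return how many whole tasks fit into a fixed token budget."""
    if tokens_per_task <= 0:
        raise ValueError("tokens_per_task must be positive")
    return budget_tokens // tokens_per_task


# Assumed, illustrative numbers (not from the source).
BUDGET_TOKENS = 1_000_000_000         # hypothetical token budget
BASELINE_TOKENS_PER_TASK = 2_000_000  # hypothetical cost of one agentic task

# The 30% reduction reported in Warp's internal benchmark.
REDUCTION = 0.30
reduced_tokens_per_task = round(BASELINE_TOKENS_PER_TASK * (1 - REDUCTION))

baseline_tasks = tasks_per_budget(BUDGET_TOKENS, BASELINE_TOKENS_PER_TASK)
reduced_tasks = tasks_per_budget(BUDGET_TOKENS, reduced_tokens_per_task)

print(baseline_tasks, reduced_tasks)  # prints "500 714"
```

Under these assumptions the same budget covers 714 tasks instead of 500. In general, a 30% reduction in tokens per task allows about 1 / 0.7 ≈ 1.43 times as many tasks, which is a larger gain than the headline percentage suggests.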

The 90% PR figure should be read in the same way. It does not mean people disappeared from the workflow. It means their work changed. Developers define problems, prepare useful context, review generated diffs, and judge product quality and risk. If the number of PRs to review grows, review fatigue becomes a real systems problem. The faster agents implement, the more important maintainer judgment and review tooling become.

The open-source version of this problem is especially sensitive. Inside a company, private repositories are surrounded by organizational policy and internal context. In an open-source repository, external contributors, public issues, public PRs, and public roadmap expectations all interact. If an agent reads a public issue and turns it into a plan, the plan needs to be understandable. The user's expectation needs to be explicit. Otherwise, a project can accumulate changes that an agent produced but nobody clearly asked for.

Community reaction is more cautious than celebratory

The Reddit discussion in r/rust around Warp's open-source announcement shows why the move is not straightforward. Some users said they had avoided Warp because it was a closed-source terminal but would reconsider after the open-source shift. Others praised aspects of the terminal itself, including its interface, block-based command output, and editing experience over SSH. A commenter who appeared to be associated with Warp framed the change as a way to build in public with users and reduce open-source maintenance burden through agents.

The opposing reaction was just as clear. Several commenters argued that Warp had pushed AI and authentication too hard into a tool that should stay lean. Some wanted a lighter version with agentic coding removed. Others said they would stay with terminal emulators such as kitty or Ghostty. There were also concerns about "free labor," agent training, licensing choices, and why the project was not AGPL from the beginning. This is the tension every AI developer tool faces when it invokes open source.

That reaction does not simply expose a weakness in the announcement. It describes the conditions for adoption. Developers may want agents, but they do not necessarily want every tool to reorganize itself around agents. The terminal is a particularly sensitive surface. Shell commands, SSH sessions, secrets, production logs, and deployment commands pass through it. When AI and cloud orchestration enter that space, the product has to balance convenience with control. Easy AI opt-outs, clear local-cloud boundaries, BYOK or model choice, and explicit log and data-retention policies become part of trust.

Open source becomes an evaluation ground for agents

Warp's experiment is interesting because open source is a hard evaluation environment for agent workflows. Public issues are often ambiguous. Old bugs, duplicated requests, environment-specific reproduction gaps, and maintainers' implicit design standards are mixed together. To create a useful PR in that setting, an agent needs more than code generation. It has to find related issues, narrow requirements, read existing design intent, solve the problem with a small change, and align tests and documentation.

Open source also exposes agent failure quickly. A misguided approach, an oversized refactor, an abstraction that does not match maintainer taste, or a change that skips tests can be visible in public review. In that sense, Warp's Open Agentic Development message is both product positioning and a public benchmark. The company has to show whether agents can collaborate in repositories that have real users and maintainers watching.

That makes the open-source shift more than a differentiation tactic. Cursor, Copilot, Codex, Claude Code, and Gemini CLI are all expanding in the coding-agent market. Warp is trying to connect the terminal, orchestration, and open collaboration into a different shape. The claim is not only "our agent writes code well." It is closer to "our workflow can coordinate many agents and many people."

What development teams should inspect now

The first practical question for a development team is where agent-created PRs actually bottleneck today. Is it code writing, context preparation, test runtime, review, approval policy, or release confidence? Warp's 90% PR figure is impressive, but review and observability have to carry the weight behind it.

The second question is how much agent orchestration should move into tooling. Some teams can get enough value from local CLI agents. Others that run migrations, dependency upgrades, and large feature builds may need cloud sessions, shared memory, recurring workflows, audit logs, and better queueing. The product name matters less than the operational requirement.

The third question is how to redesign human verification. The more PRs agents create, the more decisions maintainers must make in less time. PR bodies need useful diff summaries, test evidence, risk scoring, rollback plans, and product acceptance criteria. If agents write code faster, they also need to package the material humans use to judge that code.
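A team could enforce that packaging with a simple gate that runs before a human is asked to review. The sketch below is one possible shape, not an existing tool: `ReviewPackage`, its fields, `missing_evidence`, and the sample values are hypothetical and mirror the five items listed above.

```python
from dataclasses import dataclass, field

ALLOWED_RISK_LEVELS = {"low", "medium", "high"}  # assumed scale


@dataclass
class ReviewPackage:
    """Hypothetical evidence bundle attached to an agent-created PR."""
    diff_summary: str = ""
    test_evidence: list[str] = field(default_factory=list)
    risk_level: str = ""
    rollback_plan: str = ""
    acceptance_criteria: list[str] = field(default_factory=list)


def missing_evidence(pkg: ReviewPackage) -> list[str]:
    """Return the names of the items a reviewer would still need."""
    checks = {
        "diff_summary": bool(pkg.diff_summary.strip()),
        "test_evidence": len(pkg.test_evidence) > 0,
        "risk_level": pkg.risk_level in ALLOWED_RISK_LEVELS,
        "rollback_plan": bool(pkg.rollback_plan.strip()),
        "acceptance_criteria": len(pkg.acceptance_criteria) > 0,
    }
    return [name for name, present in checks.items() if not present]


# Illustrative values only.
pkg = ReviewPackage(
    diff_summary="Fix crash in tab completion",
    test_evidence=["unit test run attached"],
    risk_level="low",
)
gaps = missing_evidence(pkg)
print(gaps)  # prints "['rollback_plan', 'acceptance_criteria']"
```

A check like this only verifies that the material exists. Whether the summary is accurate and the criteria are the right ones remains a human judgment.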

Users or community members propose issues and goals

↓

Oz coordinates agents, models, environments, and workflows

↓

Agents plan, implement, test, and draft PRs

↓

Humans verify product judgment, security, quality, and merge readiness

The new bottleneck is verification

OpenAI's Warp case study gives a clear view of where coding agents are going. Models are taking on longer work. Token use per task is falling. Agent runtimes are moving between local and cloud environments. But the developer's job is not disappearing. It is moving up a layer. Writing good issues, preparing context, verifying agent-created changes, and protecting product and security boundaries become more important.

So the 90% figure in the headline is not only an automation win. It is also a warning about the next bottleneck. The more PRs agents create, the more teams need review protocols, acceptance criteria, and observability that can scale with them. For open source, the bar is even higher. Public repositories need a structure where the community can trust, inspect, modify, and reject agent-produced work.

Whether Warp's experiment works is still an open question. The Reddit reaction shows that many developers are wary of AI moving too deeply into the terminal. At the same time, it matters that a formerly closed tool is opening its code and testing agent-based collaboration in public. The next coding-tool competition may be less about who writes the most code and more about who can explain, verify, and operate code written by agents. Warp's question lands there: when agents create 90% of the PRs, what standard do we use to trust them?