The same PR, a 12.5x bill: what coding agents really cost

Joule Index V0.1 adds dollars, joules, and public traces to coding-agent benchmarks, shifting the question beyond accuracy alone.

AI summary

What happened: Joule Index published coding-agent cost, energy, and merge-readiness data from real open-source bug fixes.
- V0.1 is a preview built from 8 agent tiers, three bug-fix tasks from May 2026, and public observational traces.
The number: among five tiers that matched the same merged PRs, average cost still ranged from $0.082 to $1.025 per task.
Why it matters: coding-agent procurement now has to ask about prompt cache, token budget, trace export, and failed-run cost, not just SWE-bench scores.
Watch: the sample is still small. Joule Index calls V0.1 indicative and says V1 should move toward n>=30 plus direct power measurement.

The coding-agent market has mostly been organized around a simple question: which model fixes more GitHub issues? Which tool scores higher on SWE-bench Verified? Which agent survives longer terminal tasks? That question still matters. But once real teams start putting agents into daily work, a second question arrives quickly: what did the same result cost? Did the bill fall because of prompt caching, or because the model was cheaper? Who sees the cost of failed runs? Can a vendor-published benchmark score be recalculated?

Released in May 2026, Joule Index puts that second question at the center. Blankline Research describes it as an auditable benchmark for "AI cost, energy and merge-readiness." The interesting part is not just that it adds another performance table. It asks what happens when a coding agent produces a diff that matches a human-maintainer-merged PR on real open-source bugs, then presents that result with dollar cost, estimated joules, file attention, accessibility, and public trace evidence.

The strongest V0.1 number is 12.5x. According to the leaderboard, 8 agent tiers worked on the same three open-source bug-fix tasks from May 2026. Five tiers reached Attention F1 1.000, meaning the set of files they changed matched the PR that a human maintainer actually merged. Yet among those five tiers, average cost per task ranged from $0.082 for Dropstone Fast to $1.025 for Claude Opus 4.7. If the diff and merge-readiness are the same, that gap lands on the engineering team's bill and energy budget.
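To make those two numbers concrete, here is a minimal sketch. It assumes Attention F1 is an ordinary set-based F1 between the files an agent changed and the files in the merged PR; the article does not spell out Blankline's exact formula, so treat the function as an illustration rather than the benchmark's implementation. The file paths are hypothetical. The cost spread uses the two per-task averages quoted above.

```python
def attention_f1(agent_files: set[str], merged_files: set[str]) -> float:
    """Set-based F1 between files the agent changed and files in the merged PR.

    Assumption: this is a plain precision/recall F1 over file paths.
    """
    if not agent_files and not merged_files:
        return 1.0
    overlap = len(agent_files & merged_files)
    if overlap == 0:
        return 0.0
    precision = overlap / len(agent_files)
    recall = overlap / len(merged_files)
    return 2 * precision * recall / (precision + recall)


# Hypothetical file sets, for illustration only.
merged = {"src/parser.py", "tests/test_parser.py"}
exact = attention_f1({"src/parser.py", "tests/test_parser.py"}, merged)
partial = attention_f1({"src/parser.py", "src/utils.py", "docs/notes.md"}, merged)

# Per-task averages quoted from the leaderboard: $1.025 vs $0.082.
cost_spread = 1.025 / 0.082

print(f"exact match F1:   {exact:.3f}")
print(f"partial match F1: {partial:.3f}")
print(f"cost spread:      {cost_spread:.1f}x")
```

A score of 1.000 under this definition only says the touched files match. It says nothing about what was written inside them, which is why the cost column next to it matters.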

[Figure: Joule Index V0.1 verified-tier average cost and energy comparison]

The missing column in accuracy tables

AI benchmarks are usually designed around capability. Accuracy, pass@k, Elo, solved rate, and benchmark score are the headline columns. Coding-agent benchmarks followed the same path. SWE-bench Verified was a real step forward because it uses actual GitHub issues. Terminal-Bench and Aider Polyglot expanded the shape of measured development work. But most tables still ask "did it solve the task?" first, while "what did it cost to solve?" is secondary or missing.

Joule Index is interesting because it changes that order. Its benchmark question is closer to: can a coding agent fix a recent user-reported bug in a way that resembles a human-maintainer-mergeable PR, and what does that cost in dollars and joules? That makes the artifact feel less like a model leaderboard and more like a procurement document. For developers, that is often more practical. Once an organization deploys coding agents to tens or hundreds of engineers, per-task cost distribution, tail cost, and repeated failed-run cost become bigger operating variables than a single benchmark score.

Blankline's V0.1 is still small: 3 tasks, 8 verified tiers, and one retired task. It would be a mistake to conclude from this preview that one model or vendor is definitively the most economical. The release is newsworthy because of its direction. Accuracy alone is no longer enough for a coding-agent benchmark. Token budget, cache hits, wall time, energy estimate, and public traces need to travel with the score if teams are going to make deployment decisions from it.

Why the same diff can cost 12.5x more

The most practical part of the Joule Index paper is prompt caching. Coding agents rarely finish in one model call. They inspect a repository, open files, write a plan, edit code, run tests, interpret failures, and edit again. Across that loop, system prompts, tool descriptions, repository context, and prior conversation often repeat. When a provider can treat repeated input as prompt-cache reads, cost and estimated energy drop. When cache support is absent or the harness fails to preserve a cacheable prefix, the same task can repeatedly recompute the full context.

Blankline says the V0.1 data showed input-token cache-read rates of 66% for Dropstone Fast, 79% for Dropstone Pro, 95% for Claude Haiku 4.5, and 0% for Dropstone Heavy. Heavy did not fail the task: it also reached Attention F1 1.000. The problem is that it could not use caching in the same way, so it fell behind on the cost and joule axes. The paper frames this less as a capability gap and more as an inference-architecture gap.

That interpretation fits the broader coding-agent cost debate. A single prompt price sheet cannot explain real agent economics. Agents reread the same context, move between filesystem and shell, and feed failed test output back into the model. Actual cost is a function of model price, output length, prompt caching, number of tool-loop turns, repository size, and harness design. For many teams, "does our agent harness preserve a stable cacheable prefix?" may matter more than "which model looks cheapest on a token-price page?"
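The interaction between price and cache-read rate can be sketched in a few lines. Every price and token count below is a hypothetical placeholder, not a figure from Joule Index or from any provider's price sheet; the only point is to show how the same token volume produces very different bills once cache reads are priced separately.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prices:
    """USD per million tokens. All values used below are hypothetical."""
    fresh_input: float
    cache_read: float
    output: float


def run_cost(prices: Prices, input_tokens: int,
             cache_read_rate: float, output_tokens: int) -> float:
    """Dollar cost of one agent run, splitting input into fresh and cached."""
    cached = input_tokens * cache_read_rate
    fresh = input_tokens - cached
    total = (fresh * prices.fresh_input
             + cached * prices.cache_read
             + output_tokens * prices.output)
    return total / 1_000_000


# Hypothetical price sheet and a hypothetical multi-turn run.
prices = Prices(fresh_input=3.00, cache_read=0.30, output=15.00)
no_cache = run_cost(prices, input_tokens=2_000_000,
                    cache_read_rate=0.0, output_tokens=20_000)
with_cache = run_cost(prices, input_tokens=2_000_000,
                      cache_read_rate=0.95, output_tokens=20_000)

print(f"no cache:   ${no_cache:.2f}")
print(f"95% cached: ${with_cache:.2f}")
```

Under these made-up inputs the cached run costs less than a fifth of the uncached one, with the model and the work held constant. Changing the cache-read rate moves the bill more than a modest change in list price would.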

What the joule axis is trying to say

Dollar cost is familiar. API price sheets and billing records make it visible to teams. Joules are less familiar. Joule Index is not directly measuring the real GPU power draw of closed APIs. Instead, it starts from billed token counts and public per-token energy rates, then applies a cache-aware adjustment. Cache-read tokens are counted at 15% of fresh-input energy, while output decoding is treated as more expensive than fresh input. The methodology presents this as a conservative estimate and says V1 is intended to include direct GPU power measurement for open-weight runs.
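The estimation approach described above can be sketched as follows. The 15% cache-read factor comes from the methodology as reported here; the per-token joule rates and token counts are hypothetical placeholders chosen only so that output decoding is more expensive than fresh input. This is not Blankline's code.

```python
# From the article: cache-read tokens count at 15% of fresh-input energy.
CACHE_READ_FACTOR = 0.15


def estimate_joules(fresh_input_tokens: int, cache_read_tokens: int,
                    output_tokens: int, joules_per_input_token: float,
                    joules_per_output_token: float) -> float:
    """Cache-aware energy estimate built from billed token counts."""
    fresh = fresh_input_tokens * joules_per_input_token
    cached = cache_read_tokens * joules_per_input_token * CACHE_READ_FACTOR
    decoded = output_tokens * joules_per_output_token
    return fresh + cached + decoded


# Hypothetical per-token energy rates (output costlier than input).
J_IN = 0.001
J_OUT = 0.004

# Same 1,000,000 input tokens and 10,000 output tokens, with and without caching.
uncached_j = estimate_joules(1_000_000, 0, 10_000, J_IN, J_OUT)
cached_j = estimate_joules(50_000, 950_000, 10_000, J_IN, J_OUT)

print(f"uncached: {uncached_j:.1f} J")
print(f"cached:   {cached_j:.1f} J")
```

Because the estimate is a linear function of billed tokens, it inherits every uncertainty in the per-token rates. It is best read as a relative signal between tiers, not as a measurement.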

That axis is imperfect. External observers cannot know a closed provider's exact hardware, batching, region, utilization, cooling, or power mix. The joule number should not be read as precise carbon accounting. But it is useful as a comparative signal. It shows how much repeated input, decoding, and context reuse differ across agent tiers doing the same work. For long-horizon coding tasks, energy estimate is not just an ESG decoration. It is another view of the same cost structure.

On the leaderboard, Dropstone Heavy averages 1,693 J per task, the highest among the compared tiers. Claude Opus 4.7 averages 511 J, Dropstone Fast 224 J, and Claude Haiku 4.5 146 J. This does not mean smaller models are always better, or that expensive models should never be used. On harder work, a stronger model can reduce failed attempts and lower total cost. But when the task is a routine bug fix and the resulting diff is the same, teams should ask whether a larger model and heavier inference path should be the default.

When verified disclosure becomes procurement language

Joule Index's second important axis is disclosure. A verified entry must publish a full observational trace. That includes the agent's tool calls, read files, billed tokens, and final git diff. It does not require the vendor to disclose source code, system prompts, or internal reasoning. That line matters. Vendors can protect trade secrets, while users get enough evidence to recompute the score.
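As an illustration of what such a trace could contain, the sketch below models one as a small data structure. The field names, tool names, and JSON layout are assumptions made for this example; they are not the Joule Index trace schema. Note what is absent: no system prompt, no source for the agent, no reasoning text.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ToolCall:
    tool: str                # hypothetical tool name, e.g. "read_file"
    target: str              # file path or command
    fresh_input_tokens: int
    cache_read_tokens: int
    output_tokens: int


@dataclass
class ObservationalTrace:
    task_id: str
    tier: str
    tool_calls: list[ToolCall] = field(default_factory=list)
    final_diff: str = ""

    def files_read(self) -> list[str]:
        """Distinct files the agent opened, sorted for stable output."""
        return sorted({c.target for c in self.tool_calls if c.tool == "read_file"})

    def billed_tokens(self) -> dict[str, int]:
        """Token totals a third party could recompute from the trace."""
        return {
            "fresh_input": sum(c.fresh_input_tokens for c in self.tool_calls),
            "cache_read": sum(c.cache_read_tokens for c in self.tool_calls),
            "output": sum(c.output_tokens for c in self.tool_calls),
        }


# Hypothetical trace with made-up identifiers and counts.
trace = ObservationalTrace(
    task_id="example-task-1",
    tier="example-tier",
    tool_calls=[
        ToolCall("read_file", "src/parser.py", 1200, 0, 80),
        ToolCall("run_tests", "pytest -q", 300, 1200, 150),
        ToolCall("read_file", "tests/test_parser.py", 200, 1500, 60),
    ],
    final_diff="--- a/src/parser.py\n+++ b/src/parser.py\n",
)
exported = json.dumps(asdict(trace), indent=2)
print(trace.files_read())
print(trace.billed_tokens())
```

With token totals and a published price sheet, anyone holding the exported JSON can recompute the dollar figure without seeing the vendor's prompts.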

This resembles the way power-aware hardware benchmarks earn trust. A hardware benchmark does not become authoritative because a vendor says it is fast. The task, inputs, measurement method, and output have to be inspectable. Agent benchmarks need the same treatment. What task was attempted? Which files did the agent read? How many tokens were billed? How were cost and joules calculated? What final diff was produced? A coding agent is not just a model. It is a system made from a model, harness, tool policy, sandbox, cache behavior, and retry loop. Without system-level trace data, a model score does not explain where the cost came from.

If development teams adopt this framing, vendor evaluation questions change. "What is your SWE-bench score?" gets joined by "what is the token budget per task on repositories like ours?", "can you show prompt-cache hit rate?", "are failed and retried runs included in the cost report?", "can we export the tool-call trace?", and "what happens to cost when model routing changes?" Those questions are much closer to real agent operations.

Evaluation axis	Traditional coding benchmarks	What Joule Index adds
Success criterion	Tests passed, issue solved rate, pass rate	File attention and merge-readiness against a human-maintainer-merged PR
Cost	Usually secondary or undisclosed	Per-task dollar cost from billed tokens
Energy	Rarely measured	Cache-aware token energy estimate, with direct measurement planned for V1
Verifiability	Leaderboard score or vendor report	Observational trace with tool calls, file reads, tokens, and final diff

The small sample is part of the news

The part to be careful with is sample size. The Joule Index methodology says definitive claims should use n>=30 per model and category. The current preview is n=3 per cell. So "Dropstone Fast is better than Claude Opus 4.7" would be the wrong conclusion. Small, routine bug fixes may favor cheaper tiers or smaller models, while larger architecture changes, security patches, or ambiguous issues may produce different rankings.

But the small sample does not erase the release's value. It shows what a better benchmark should disclose. Joule Index retired one candidate task because of reviewer disagreement and explained why. Its methodology documents contamination defense, verified disclosure, the limits of pricing-preview rows, and the limits of energy estimates. That posture matters in an agent-benchmark market that can easily become a marketing artifact. Calling a small sample small is not a weakness. It is the start of trust.

For builders, this is also a practical lesson. An internal evaluation does not have to begin as a huge benchmark. A team can start with 10 real issues, repeated runs, token budget, cache-hit rate, human review outcome, and rollback status. The key is to be explicit about what the table can and cannot prove. Joule Index V0.1 is a useful example of that balance.

Pricing-preview rows require a different reading

The leaderboard also includes a pricing preview separate from verified runs. This section can easily produce eye-catching numbers. For example, applying a reference task's token budget to the May 2026 list price of some unsubmitted frontier models can push rows such as GPT-5.5 or GPT-5.5 Pro above the $10 cap. But the methodology separates these rows from measured capability claims. They are list-price calculations, not observed runs, and they do not contribute to the Joule Score.

That distinction is important. In a real agent run, provider cache behavior, compression, output style, tool loops, and model routing all change the result. A more expensive model might finish in fewer steps. A cheaper model might burn many cheap tokens. The right reading is not "that model is bad." It is: list price can shape benchmark eligibility itself on long-horizon agent work. In the agent era, model pricing directly affects both research leaderboards and procurement thresholds.

This matters especially for enterprise buyers. A monthly IDE subscription or seat price hides much of the operational cost. Underneath, API calls, cache reads, search, sandboxing, storage, and telemetry accumulate. Bundling that cost can make adoption easier, but it can also make large-scale forecasting harder. A good platform should not merely feel unlimited. It should expose budget by task type and make failed-run cost visible.

The next coding-agent competition is cache policy

The AI-coding-tool market still talks mostly in model names: Claude Code, Codex, Gemini CLI, Cursor, OpenHands, Aider, Continue, and enterprise agent platforms all attach stronger models and longer context windows. In production, cache policy can become the quieter differentiator. How stable is the prefix? When does the repository summary refresh? Are tool schemas inserted in the same order every turn? Is user-specific memory separated from cacheable context? How aggressively is test output compressed before it is fed back into the loop?

Teams can control much of this. Even with the same model, a harness can destroy cache hits by slightly changing the system prompt every turn, dynamically reordering tools, regenerating long policy text, or attaching a repository summary with a fresh timestamp. A more disciplined harness can preserve fixed instructions, versioned tool schemas, stable repo maps, and short structured event logs. That directly changes cost.
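A rough way to see the effect is to compare how much of each turn's prompt is an exact continuation of the previous turn. The sketch below measures that at the character level, which is a simplification: real provider caches work on tokens and have their own rules for minimum lengths and block boundaries. The prompt strings are hypothetical.

```python
def shared_prefix_ratio(previous: str, current: str) -> float:
    """Fraction of `current` that repeats `previous` from the first character."""
    shared = 0
    for a, b in zip(previous, current):
        if a != b:
            break
        shared += 1
    return shared / len(current) if current else 0.0


# Hypothetical fixed instructions and tool list.
STATIC = "SYSTEM: fix the bug.\nTOOLS: read_file, run_tests\n"

# Undisciplined harness: a fresh timestamp at the top of every turn.
turn1_unstable = "Generated at 10:00:01\n" + STATIC + "turn 1 log\n"
turn2_unstable = "Generated at 10:00:07\n" + STATIC + "turn 1 log\nturn 2 log\n"

# Disciplined harness: fixed text first, new events appended at the end.
turn1_stable = STATIC + "turn 1 log\n"
turn2_stable = STATIC + "turn 1 log\nturn 2 log\n"

unstable = shared_prefix_ratio(turn1_unstable, turn2_unstable)
stable = shared_prefix_ratio(turn1_stable, turn2_stable)

print(f"unstable harness reusable prefix: {unstable:.0%}")
print(f"stable harness reusable prefix:   {stable:.0%}")
```

One changing character near the top of the prompt is enough to invalidate everything after it. A check like this can run in CI against recorded prompts to catch harness changes that quietly break prefix stability.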

So the operational message from Joule Index is not "always use the cheapest model." It is: make the agent harness measurable. For every run, record model, prompt version, cache-read ratio, input and output tokens, tool-call count, wall time, retry count, and final human verdict. With that ledger, teams can decide which tasks need a stronger model and which can be routed to a cheaper tier.
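A minimal sketch of such a ledger, using the fields listed above. The record layout, the verdict strings, and the model and task names are assumptions for illustration; a real team would persist these rows in whatever store it already uses.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    task_type: str
    model: str
    prompt_version: str
    fresh_input_tokens: int
    cache_read_tokens: int
    output_tokens: int
    tool_calls: int
    wall_seconds: float
    retries: int
    human_verdict: str  # assumed values: "accepted" or "rejected"

    @property
    def cache_read_ratio(self) -> float:
        total_input = self.fresh_input_tokens + self.cache_read_tokens
        return self.cache_read_tokens / total_input if total_input else 0.0


def summarize(records: list[RunRecord]) -> dict[tuple[str, str], dict[str, float]]:
    """Group runs by (task_type, model) and report the routing-relevant stats."""
    groups: dict[tuple[str, str], list[RunRecord]] = defaultdict(list)
    for record in records:
        groups[(record.task_type, record.model)].append(record)
    return {
        key: {
            "runs": len(rows),
            "accept_rate": sum(r.human_verdict == "accepted" for r in rows) / len(rows),
            "mean_cache_read_ratio": sum(r.cache_read_ratio for r in rows) / len(rows),
        }
        for key, rows in groups.items()
    }


# Hypothetical ledger rows.
ledger = [
    RunRecord("bug_fix", "tier-a", "v3", 20_000, 80_000, 4_000, 12, 95.0, 0, "accepted"),
    RunRecord("bug_fix", "tier-a", "v3", 50_000, 50_000, 6_000, 20, 140.0, 1, "rejected"),
    RunRecord("migration", "tier-b", "v3", 90_000, 10_000, 9_000, 31, 300.0, 2, "accepted"),
]
summary = summarize(ledger)
for key, stats in summary.items():
    print(key, stats)
```

Keeping the prompt version on every row is what lets a team attribute a drop in cache-read ratio to a specific harness change.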

A checklist for teams adopting coding agents

First, define a task unit. "How much AI did we use today?" is too vague. Cost should be tracked per bug fix, test addition, migration, incident triage, or review pass. That is the level where teams can compare agent work against human time.

Second, record successes and failures together. Looking only at successful PR token cost creates an optimistic story. Failed runs, discarded patches, broken tests, reverted changes, and review rejections all belong in the real cost.
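The difference between the two views is simple arithmetic, sketched below with hypothetical run costs. The function divides total spend, including failed runs, by the number of accepted fixes.

```python
def cost_per_accepted_fix(runs: list[tuple[float, bool]]) -> float:
    """Total spend across all runs divided by the number of accepted results.

    Each run is (dollar_cost, accepted). Failed runs still count toward spend.
    """
    accepted = sum(1 for _, ok in runs if ok)
    if accepted == 0:
        raise ValueError("no accepted runs: cost per accepted fix is undefined")
    return sum(cost for cost, _ in runs) / accepted


def success_only_average(runs: list[tuple[float, bool]]) -> float:
    """The optimistic view: average cost of the runs that were accepted."""
    costs = [cost for cost, ok in runs if ok]
    if not costs:
        raise ValueError("no accepted runs")
    return sum(costs) / len(costs)


# Hypothetical week of runs on one task type: two accepted, two discarded.
runs = [(0.40, True), (0.35, False), (0.55, False), (0.45, True)]

optimistic = success_only_average(runs)
realistic = cost_per_accepted_fix(runs)

print(f"success-only average:  ${optimistic:.3f}")
print(f"cost per accepted fix: ${realistic:.3f}")
```

With these made-up numbers the realistic figure is roughly double the optimistic one, and the gap widens as the failure rate rises.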

Third, ask vendors for cache metrics. Does the provider support prompt caching? How are cache-read tokens represented in billing? Can cache hit rate be exported per agent run? This is not just a discount question. It is an architecture signal.

Fourth, separate public and private trace levels. A company may not be able to share full source code or prompts externally, but internally it can retain file-touch sets, diff summaries, token counts, tool-call types, and final verdicts. Agent-written code can later become part of an outage or security incident, so audit records matter.

Fifth, route small and large models by role. Lower-cost tiers may be enough for routine dependency updates, small UI copy edits, test snapshot changes, or narrow parser bugs. Security-sensitive refactors, payment paths, data migrations, and concurrency bugs may justify stronger models and stricter human review. The point is to make that decision with task-level data, not instinct.
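A routing rule of this kind can start as a few lines of explicit policy. The task-type labels, path prefixes, and tier names below are hypothetical; in practice the sets would be derived from a team's own per-task data rather than hard-coded.

```python
from dataclasses import dataclass

# Hypothetical policy inputs; replace with values backed by real run data.
SENSITIVE_PREFIXES = ("payments/", "migrations/", "auth/")
ROUTINE_TASKS = {"dependency_update", "ui_copy_edit", "test_snapshot", "parser_bug"}


@dataclass(frozen=True)
class Route:
    tier: str            # "low-cost" or "strong" (hypothetical tier names)
    strict_review: bool  # whether stricter human review is required


def choose_route(task_type: str, touched_paths: list[str]) -> Route:
    """Pick a model tier and review level from task type and touched paths."""
    sensitive = any(path.startswith(SENSITIVE_PREFIXES) for path in touched_paths)
    if sensitive:
        return Route("strong", strict_review=True)
    if task_type in ROUTINE_TASKS:
        return Route("low-cost", strict_review=False)
    # Unknown or non-routine work defaults to the stronger tier.
    return Route("strong", strict_review=False)


print(choose_route("dependency_update", ["package.json"]))
print(choose_route("dependency_update", ["payments/api.py"]))
print(choose_route("concurrency_bug", ["src/worker.py"]))
```

Checking sensitive paths before task type is a deliberate choice in this sketch: a "routine" label should not be able to route a change in a payment path to the cheapest tier.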

If benchmarks are going to beat marketing

Joule Index still has a lot to prove. Blankline operates the benchmark while also including its own Dropstone CLI, which raises fair conflict-of-interest questions. The public GitHub repository is still early, and community validation is limited. The energy estimate is not direct measurement. Attention F1 is also not the whole of merge-readiness. Real maintainers judge code quality, tests, maintainability, and style fit, not just whether the touched files match.

Even with those caveats, the release matters because it presses on a real weakness in the benchmark market. As agent benchmarks multiply, vendors can choose favorable numbers and tell a simple story: the score went up. Users often do not get the bill, the trace, or the failure mode. Joule Index asks for the invoice, the energy estimate, and enough trace data to recalculate the claim. That request is likely to stick, even if this specific V0.1 table remains only a preview.

The next stage of coding-agent maturity may be less about impressive demos and more about mundane ledgers: which files were read, which tools were called, which model spent how many tokens, how much cache was used, and what verdict a human reviewer gave. Builders will use that record to decide which work agents should take, which model tier should be the default, and when human review must be mandatory.

Joule Index V0.1 is a small table. But the question it raises is large. If the same PR can cost 8 cents or 1 dollar, we can no longer ask only whether AI can fix code. The sharper question is: inside what cost structure and evidence system did it fix the code? Once coding agents move from personal experiments into team production systems, that question becomes a source of real advantage.