The 7,000-return loop behind Codex self-improving agents

OpenAI Tax AI shows why production traces, eval sets, and practitioner feedback matter more than agent automation alone.

AI summary

What happened: OpenAI published how it used Codex inside a self-improvement loop for Tax AI, a system built with Thrive Holdings.
- The case comes from Crete's network of 30-plus accounting firms and a pilot that processed 7,000 tax returns.
Key numbers: OpenAI says Tax AI cut preparation time by roughly one-third, reached up to 97% draft accuracy, and lifted throughput by about 50%.
Why it matters: The bottleneck for improving agents is not another model call. It is production traces and evaluation infrastructure.
- Practitioner edits become useful only after review, clustering, target eval creation, and bounded Codex tasks.
Watch: This is a domain-specific system with strong expert oversight, not a proof that agents can safely rewrite production workflows on their own.

OpenAI's May 27, 2026 case study on building self-improving tax agents with Codex looks, at first, like a story about automating tax preparation. For developers, the more important subject is narrower and more useful: how do failures found in production become small, reviewable tasks an agent can actually fix?

OpenAI and Thrive Holdings spent six months building Tax AI with Crete's network of accounting firms. According to the official post, that network includes more than 30 firms, and Tax AI processed 7,000 tax returns during the pilot. The target workflow was preparation for U.S. 1040 and 1041 returns. OpenAI says the system reduced tax preparation time by about one-third, produced drafts with up to 97% accuracy, and increased throughput by roughly 50%.

Those numbers sound like a conventional enterprise AI success story. The more interesting part of the post is the loop behind them. Practitioner corrections, extracted fields, source documents, tax-engine mappings, final filed values, evaluation sets, regression tests, and Codex task environments are tied together. Without that structure, "self-improving agent" is mostly a marketing phrase. This case pushes the phrase down into engineering territory.

Figure: Tax AI self-improvement loop

The real unit is formatted failure

The first striking detail in OpenAI's write-up is the scale of the work. Crete practitioners prepare tens of thousands of returns in a season and handle millions of source documents. For moderately complex returns, data entry alone can take eight hours. The inputs are not clean API responses. They include prior-year material, handwritten notes, emails, spreadsheets, client-specific exceptions, and documents that only make sense in context.

In that kind of workflow, having AI draft an initial return is no longer the most surprising claim. The hard problem starts when the draft is wrong. A wrong value might be a simple extraction miss. It might come from stale prior-year data in the tax engine. It might reflect a preparer's preferred input convention. It might point to an unsupported field. Or it might be a plain model error in reading the document.

That means the event "a practitioner changed the value" is not enough for a product to learn from. The edit is a signal, but it is not yet training data, a regression test, or a product bug. OpenAI's design focuses on that gap. Tax AI keeps not only what the practitioner changed, but what the system first suggested, which source documents it relied on, which mapping path it took, and what finally went into the return.

That is a production trace. It is narrower than a general log and more product-oriented than an audit record. It is a bundle of evidence that can be turned into a task an agent is allowed to investigate.
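As a rough illustration of what such an evidence bundle could look like, here is a minimal sketch of a per-field trace record. The class name, field names, and sample values are all assumptions made for this example; the OpenAI post does not publish a schema.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class FieldTrace:
    """One field's path from source documents to the return (hypothetical schema)."""
    return_id: str
    field_name: str
    suggested_value: Optional[str]     # what the system first proposed
    final_value: Optional[str]         # what finally went into the return
    source_documents: Tuple[str, ...]  # documents the suggestion relied on
    mapping_path: Tuple[str, ...]      # steps from extraction to the tax-engine field

    @property
    def was_corrected(self) -> bool:
        # A difference is only a signal; review decides whether it is a failure.
        return self.suggested_value != self.final_value


# Illustrative values only; none of these come from the case study.
trace = FieldTrace(
    return_id="R-001",
    field_name="fair_rental_days",
    suggested_value=None,
    final_value="365",
    source_documents=("rental_summary.pdf",),
    mapping_path=("classify", "extract", "map_to_schedule_e"),
)
print(trace.was_corrected)  # True
```

The record is frozen on purpose: a trace is evidence, so later stages should read it rather than modify it.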

Experts, traces, and Codex form the loop

OpenAI describes the Tax AI self-improvement loop around three pillars. First, the product has to stay close to practitioners. In tax preparation, only domain experts can distinguish a true error from a legitimate judgment call. Second, the system has to preserve evidence from production. It cannot keep only inputs and outputs; it needs the path from document classification through field extraction, provenance, tax-engine mapping, and expert correction. Third, that evidence has to become custom evals and Codex tasks.

This is materially different from the popular image of an AI agent that simply fixes itself. The system is closer to a repair line than an autonomous rewrite engine. Repeated practitioner edits are grouped first. A pattern might show, for example, that the system often misses the fair rental days field on rental-property returns, mixes values when several properties appear in one document package, or maps "other expenses" to the wrong place.
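The grouping step can be sketched in a few lines. This is a simplified illustration, assuming each edit has already been reduced to a schedule and a field name; the `min_count` cutoff and the sample counts are invented, and the post does not describe how OpenAI actually clusters edits.

```python
from collections import Counter


def repeated_patterns(edits, min_count=3):
    """Group edits by (schedule, field) and keep groups seen at least min_count times."""
    counts = Counter((e["schedule"], e["field"]) for e in edits)
    return [(key, n) for key, n in counts.most_common() if n >= min_count]


# Field names echo the examples in the text; the counts are made up.
edits = (
    [{"schedule": "E", "field": "fair_rental_days"}] * 4
    + [{"schedule": "E", "field": "other_expenses"}] * 3
    + [{"schedule": "C", "field": "gross_receipts"}]
)
patterns = repeated_patterns(edits)
for (schedule, field_name), n in patterns:
    print(schedule, field_name, n)
```

A one-off edit such as the single `gross_receipts` change stays out of the list, which matches the idea that repetition, not any individual correction, is what earns a pattern a review.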

Once a pattern is reviewed, it can become an evaluation target. That target needs a representative source package, expected output, failure criteria, and regression expectations. Only then does Codex investigate inside the actual product scaffold. It can check whether an extraction schema is missing, whether source selection is weak, whether a mapper omitted a field, or whether a grader is treating normal workflow noise as a failure.
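One way to picture an evaluation target is as a small record plus a grader. The sketch below is hypothetical: `EvalTarget`, `ignored_fields`, and the sample values are assumptions for illustration, not the format used in Tax AI. The ignore list stands in for the point about graders that should not treat normal workflow noise as failure.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Tuple


@dataclass(frozen=True)
class EvalTarget:
    name: str
    source_package: Tuple[str, ...]               # representative input documents
    expected: Dict[str, str]                      # expected output fields
    ignored_fields: FrozenSet[str] = frozenset()  # reviewed workflow noise, not failures


def failing_fields(target: EvalTarget, actual: Dict[str, str]) -> List[str]:
    """Return the expected fields that the actual output got wrong or omitted."""
    return sorted(
        name
        for name, want in target.expected.items()
        if name not in target.ignored_fields and actual.get(name) != want
    )


target = EvalTarget(
    name="rental_fair_rental_days",
    source_package=("rental_summary.pdf", "client_note.txt"),
    expected={"fair_rental_days": "365", "rents_received": "24000", "preparer_memo": "n/a"},
    ignored_fields=frozenset({"preparer_memo"}),
)
print(failing_fields(target, {"rents_received": "24000"}))  # ['fair_rental_days']
```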

Codex is not receiving a vague "fix the tax bot" prompt. In OpenAI's example, the Codex task environment includes a repo, task.yaml, eval data, suites, graders, related documentation, and separated read-only production traces and source artifacts. The writable surface and the evidence are kept distinct. That boundary is a major part of what makes the word "self-improving" defensible.
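The writable-versus-evidence boundary can be expressed as a simple path check. This is a sketch under assumptions: the directory names `repo`, `traces`, and `artifacts` are placeholders, and the post does not say how the separation is enforced in practice (a real setup would more likely rely on filesystem or sandbox permissions than on application code).

```python
import posixpath


def _inside(path, root):
    """True if path is root itself or lies under it, after resolving '.' and '..'."""
    path = posixpath.normpath(path)
    root = posixpath.normpath(root)
    return path == root or path.startswith(root + "/")


def is_write_allowed(path, writable_roots, read_only_roots):
    # Read-only evidence always wins, even if a writable root would also match.
    if any(_inside(path, root) for root in read_only_roots):
        return False
    return any(_inside(path, root) for root in writable_roots)


WRITABLE = ["repo"]                  # hypothetical layout
READ_ONLY = ["traces", "artifacts"]  # hypothetical layout

print(is_write_allowed("repo/src/mapper.py", WRITABLE, READ_ONLY))       # True
print(is_write_allowed("traces/r-001.json", WRITABLE, READ_ONLY))        # False
print(is_write_allowed("repo/../traces/r-001.json", WRITABLE, READ_ONLY))  # False
```

Normalizing the path before comparing matters: without it, a path such as `repo/../traces/r-001.json` would look like it sits inside the writable repo.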

From 25% to 86%, with a very specific meaning

One of the most useful metrics in the case study is field completion. Tax AI gauged how little correction a return needed afterward by checking it against three thresholds: 75%, 90%, and 100% of fields completed correctly. At launch, about 25% of returns reached the 75% threshold. Within six weeks, 86% did.

That number is different from a benchmark score. It does not mean a model got a higher score on a frozen test set. It means issues found during real operation were routed back through a product-improvement loop. That is why the case is more interesting than a simple accuracy claim. AI products often "improve over time," but the sources of improvement are usually mixed together: model upgrades, prompt changes, UX fixes, data cleanup, and product patches. Tax AI gives a clearer picture of at least one layer: which signals become evals, and which evals become bounded coding tasks.
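To make the field-completion thresholds concrete, here is one plausible way to compute them. The post does not give the exact formula, so this sketch assumes that completion is the fraction of fields in the final return that the draft already had right; the function names and sample data are invented.

```python
def completion_rate(suggested, final):
    """Fraction of the final return's fields that the draft already had right."""
    if not final:
        return 1.0
    correct = sum(1 for name, value in final.items() if suggested.get(name) == value)
    return correct / len(final)


def share_reaching(returns, threshold):
    """Share of (suggested, final) pairs whose completion rate meets the threshold."""
    if not returns:
        raise ValueError("need at least one return")
    rates = [completion_rate(s, f) for s, f in returns]
    return sum(rate >= threshold for rate in rates) / len(rates)


final = {"wages": "50000", "interest": "120", "dividends": "40", "rents": "9000"}
returns = [
    ({**final, "rents": "0"}, final),                # 3 of 4 fields correct
    (dict(final), final),                            # 4 of 4
    ({"wages": "50000"}, final),                     # 1 of 4
    ({"wages": "50000", "interest": "120"}, final),  # 2 of 4
]
for threshold in (0.75, 0.90, 1.00):
    print(threshold, share_reaching(returns, threshold))
```

Under this reading, the reported move from about 25% to 86% is a change in the share of returns that clear the 0.75 line, not a change in average field accuracy.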

The direction of complexity also matters. OpenAI says the product began with simpler workflows such as W-2s and 1099s, then expanded into K-1s, multiple schedules, and harder edge cases as the season progressed. This is not just a story of higher accuracy on easy work. The more complex returns produced larger time savings per case, while the rental-property domain took about six weeks and significant engineering oversight to reach 90% precision and recall.

That phrasing is important. The system is called self-improving, but OpenAI does not erase human experts or engineers from the loop. Practitioners create the signals through corrections and judgment. Engineers still own architecture and product decisions. Codex works inside a bounded task, producing candidate investigations, implementations, and validations.
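For readers who want the 90% precision and recall figure pinned down, the sketch below shows the standard definitions applied to extracted fields. It assumes a field counts as correct only when both its name and value match the expected output; the post does not say how Tax AI scores matches, and the sample values are invented.

```python
def precision_recall(predicted, expected):
    """Precision and recall over extracted fields, matching on name and value."""
    true_positives = sum(
        1
        for name, value in predicted.items()
        if name in expected and expected[name] == value
    )
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


predicted = {"rents_received": "24000", "fair_rental_days": "365", "other_expenses": "900"}
expected = {
    "rents_received": "24000",
    "fair_rental_days": "365",
    "other_expenses": "1200",
    "insurance": "800",
}
precision, recall = precision_recall(predicted, expected)
print(round(precision, 3), recall)
```

Here two of the three extracted fields are right (precision) and two of the four expected fields were recovered (recall), so a wrong value hurts both numbers while a missed field hurts only recall.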

- 7,000 tax returns processed in the pilot
- 86% of returns reaching the 75% field-completion threshold after six weeks
- 50% throughput increase claimed by OpenAI

Why tax preparation is a useful test case

Tax preparation may not seem like a flashy AI demo. It is hard to show off in a slick interface, and mistakes can be expensive. That is exactly why it is a useful domain for understanding the next stage of agent products. The inputs are messy, the rules are dense, and final responsibility remains with experts. Being able to produce a plausible answer matters less than being able to trace why the answer appeared and where it failed.

OpenAI's rental-property example captures this well. To prepare Schedule E, the system has to read property-specific income and expense fields from multiple documents, preserve provenance, and map the fields into tax-engine concepts. When a number is wrong, the cause can vary. The model may have missed the value. It may have mixed up several properties. It may have misread a client note. It may have collided with an existing value in the tax engine.

Coding agents face the same pattern. A failed test can mean the implementation is wrong, the test is stale, or the requirement is ambiguous. A production bug can come from one line of code, but it can also reflect a data contract, permission boundary, deployment environment, or user workflow. Codex becomes more useful not merely because it can write code, but because the system around it can provide the evidence bundle in a form the agent can act on.

That makes the Tax AI case relevant beyond accounting. A serious agent-improvement loop needs four ingredients. It needs high-quality feedback from practitioners. It needs product traces that preserve context. It needs an operating process that converts repeated patterns into evaluation targets. And it needs regression tests and review so changes can be trusted or rolled back.

The risk inside "self-improving"

The phrase to handle carefully in this announcement is "self-improving." It is easy to imagine a model setting its own goals, modifying the product freely, and pushing changes into production. OpenAI's actual architecture is much more constrained.

First, the evidence comes from production, but Codex does not get to rewrite the original evidence. The representative task environment separates a writable repo from read-only production traces and source artifacts. Second, not every correction becomes a task. Practitioner edits are reviewed and clustered, then promoted into eval targets only when they represent a repeated product failure. Third, Codex outputs are checked against targeted evals and broader regression suites, with engineer review still in the path.

Those constraints are what make the self-improvement framing safer. Without them, self-improvement becomes a way to obscure responsibility. When a product is wrong, the team has to explain who judged the failure, what evidence justified the change, and which tests showed the fix did not break something else.

The practical lesson for development teams is not to add more agents everywhere. It is to shrink the surface an agent can touch and improve the structure of the failure data it receives. "Codex can see the repo" is only the starting point. More important questions are which task it receives, which eval defines success, which documents are read-only, and which changes must be sent back to a human.

Enterprise AI competition moves down the stack

AI companies are pushing deeper into enterprise workflows. Anthropic has put Claude into professional-services organizations such as PwC and KPMG. OpenAI is bringing Codex and workspace agents into work loops. Microsoft is building management layers such as Agent 365. On the surface, that can look like a fight over which model is smartest. Inside customer organizations, the sharper question is different.

Which company can turn domain feedback into product-improvement data? Which platform can leave traces that survive audit and review? Which agent can break failures into small, verifiable tasks? Which organization can separate what should be automated from what must return to an expert?

Tax AI shows how OpenAI is positioning Codex in that competition. Codex is not only an IDE-side coding assistant. It is also an improvement engine for domain products that have operational evidence and evaluation infrastructure. That is a meaningful signal for the AI coding-tool market. Agent quality will not be decided only by the answer in a chat window. It will depend on eval sets, product traces, sandboxes, permission boundaries, regression tests, and review flows.

The case also should not be overgeneralized. Thrive Holdings has an ownership-and-operations structure that let OpenAI connect product development closely to real services. Crete practitioners performed repeated work and could provide high-quality feedback. Many companies do not have that level of domain access, data rights, evaluation infrastructure, or engineering time.

So the conclusion is not "all work can now improve itself." A more accurate conclusion is that agent products which appear to improve themselves usually have a very human operating system behind the scenes.

A checklist for product and engineering teams

The questions for AI builders are fairly direct.

First, are user corrections structured? Saving only "the user edited the output" is different from saving the expected value, actual value, final value, source document, workflow stage, and candidate reason for the correction. The second version is where an improvement loop starts.

Second, can the team separate product noise from true failures? If every difference is treated as an error, the agent will optimize toward the wrong target. In tax preparation, prior-year values, practitioner judgment, and unsupported fields can all appear as differences. In software products, flaky tests, temporary operational workarounds, and undocumented customer exceptions play a similar role.

Third, is someone converting repeated patterns into evals? Good evals do not appear automatically. Someone has to group failures, choose representative cases, define expected output, and set success criteria. Codex is most useful when that boundary already exists.

Fourth, does the agent environment separate read-only evidence from writable code? Without that separation, debugging may feel easier in the short term, but auditability and reproducibility degrade quickly.

Fifth, is there a regression suite that catches new damage after a targeted fix? In the Tax AI case, the broader regression suite is as important as the targeted eval. Fixing one field while breaking another schedule is not self-improvement. It is local optimization.
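The last two checklist items can be combined into a simple acceptance gate. This is a minimal sketch, assuming eval and regression results are available as name-to-pass/fail mappings; `review_candidate` and the test names are hypothetical, and passing the gate would only make a change eligible for engineer review, not ready to ship.

```python
def review_candidate(targeted, regression_before, regression_after):
    """Return (ok, reasons). Each input maps a test name to True (pass) or False (fail)."""
    reasons = []
    failed_targets = sorted(name for name, ok in targeted.items() if not ok)
    if failed_targets:
        reasons.append("targeted eval still failing: " + ", ".join(failed_targets))
    # A regression is a test that passed before the change and does not pass after it.
    newly_broken = sorted(
        name
        for name, ok in regression_before.items()
        if ok and not regression_after.get(name, False)
    )
    if newly_broken:
        reasons.append("regressions introduced: " + ", ".join(newly_broken))
    return (not reasons, reasons)


before = {"schedule_e_totals": True, "k1_mapping": True}
after = {"schedule_e_totals": True, "k1_mapping": False}
print(review_candidate({"fair_rental_days": True}, before, after))
# (False, ['regressions introduced: k1_mapping'])
```

The example is the local-optimization case from the text: the targeted field is fixed, but another mapping that used to pass now fails, so the change is rejected.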

The signal that will last

Near the end of OpenAI's post, one senior accountant says last year's preparation work took 180 hours, while this year it took 15. That is the headline number. The longer-lasting signal is that the reduction did not come from a lone model call. It came from a system that turns practitioner feedback into evaluable work.

The AI-agent market is moving from "what can the agent do?" to "how does the agent get better?" The first question can be answered in a demo. The second can only be answered by an operating system: production traces, domain experts, eval sets, bounded Codex tasks, and human verification.

That shift will likely change how coding agents are evaluated as well. Longer context windows, faster models, and more tool calls still matter. But enterprises will ask drier questions. Can we give this agent our product failures in a format it understands? Can we measure whether corrections repeat? Does the cost of turning a failure into an eval go down? Can we trust the validation loop around the change Codex proposes?

OpenAI's Tax AI case is optimistic, but it is also demanding. It shows that agents can improve within a real product loop, but the loop is not automatic. It depends on human corrections, product evidence, explicit eval targets, and engineering boundaries. The core capability behind a self-improving agent is not the agent alone. It is the organization's ability to format failure so the agent can learn from it.