Codex Tax AI handled 7,000 returns, and the improvement loop starts with evals

OpenAI and Thrive showed how Tax AI links production traces, practitioner corrections, evals, and Codex tasks.

AI summary

What happened: OpenAI published a case study on Tax AI, built with Thrive Holdings and Crete. The pilot processed 7,000 tax returns across more than 30 accounting firms, mostly for U.S. 1040 and 1041 filings.
The numbers: OpenAI reported roughly one-third less preparation time, up to 97% draft-return accuracy, and about 50% higher throughput.
Builder point: Codex was not only writing code. It operated inside a loop of corrections, traces, evals, repos, and skills.
Watch the wording: Self-improving here means a reviewed product loop, not a model updating its own weights in production.

OpenAI published its case study on self-improving tax agents with Codex on May 27, 2026. The system is called Tax AI, and it was built with Thrive Holdings and Crete's accounting-firm network. The post is less about a new model name than about an operating design: corrected fields from accountants, source documents, tax-engine mappings, filed returns, eval suites, and Codex work branches are connected into one product-improvement loop.

The case matters to AI builders because the public numbers are unusually specific for a vertical agent. OpenAI says Tax AI processed 7,000 tax returns during the tax-season pilot across more than 30 accounting firms that participate in Crete. The work centered on U.S. 1040 and 1041 return preparation. Crete practitioners handle tens of thousands of tax returns and millions of underlying documents each season, and OpenAI says data entry alone can take up to eight hours for a moderately complex return.

OpenAI highlighted three performance claims. Tax AI reduced tax-preparation time by roughly one-third, produced draft returns with up to 97% accuracy, and increased throughput by about 50%. The case study also disclosed one accuracy-improvement metric: at launch, 25% of scored returns reached 75% correct field completion; within six weeks, 86% reached that threshold. The caveat is clear: these are figures from a jointly written OpenAI and Thrive production case, not an independent benchmark.
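The threshold metric above can be made concrete with a small sketch. The case study does not publish how "correct field completion" is scored, so the exact-match comparison, the function names, and the toy scores below are all assumptions for illustration.

```python
from typing import Dict, List


def field_completion(predicted: Dict[str, str], filed: Dict[str, str]) -> float:
    """Fraction of filed fields whose drafted value matches exactly (assumed scoring rule)."""
    if not filed:
        return 0.0
    correct = sum(1 for name, value in filed.items() if predicted.get(name) == value)
    return correct / len(filed)


def share_reaching(scores: List[float], threshold: float = 0.75) -> float:
    """Share of scored returns whose completion is at or above the threshold."""
    if not scores:
        return 0.0
    return sum(1 for score in scores if score >= threshold) / len(scores)


# Toy scores, not figures from the case study.
scores = [0.80, 0.60, 0.75, 0.90]
print(share_reaching(scores))  # 3 of the 4 toy returns reach the 0.75 threshold
```

Read this way, "25% to 86%" is a statement about the share of returns clearing a bar, not about average field accuracy, which is why the denominator and field set matter when comparing it to other figures.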

7,000: tax returns processed by Tax AI in the pilot

97%: maximum draft-return accuracy reported by OpenAI

25% → 86%: share of scored returns reaching 75% correct field completion, at launch versus within six weeks

The phrase self-improving needs a careful reading. The case study does not describe a model that rewrites its weights during production use. It describes a product loop where practitioner corrections become structured data, product traces expose the failure location, repeated errors become tailored eval targets, and Codex proposes product changes that must satisfy those evals. That is closer to eval-backed engineering than autonomous online learning.

OpenAI breaks the loop into three steps. First, the product has to capture what a practitioner changed. Second, the trace needs to connect source material, extracted fields, provenance, downstream submission, and the filed return. Third, when repeated corrections can be grouped into an actionable finding, Codex sees the repository, evals, skills, docs, and read-only production evidence before proposing a fix.

Tax preparation needs that structure because a changed value does not always mean extraction failed. The value might have been carried forward from a prior-year return, already existed inside the tax engine, reflected practitioner preference, or belonged to a workflow the product did not yet support. OpenAI says practitioners helped distinguish these cases, and ambiguous evidence was routed back to the product team instead of being pushed into the automated improvement loop.
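One way to picture that triage is a routing function over correction causes. This is a minimal sketch, not OpenAI's implementation: the `Cause` names and the destination labels are hypothetical, and sending unsupported workflows to the product team is an assumption (the case study only says ambiguous evidence went there).

```python
from enum import Enum


class Cause(Enum):
    """Hypothetical labels for why a practitioner changed a drafted value."""
    EXTRACTION_MISS = "extraction_miss"
    CARRIED_FORWARD = "carried_forward"            # value came from a prior-year return
    ALREADY_IN_ENGINE = "already_in_engine"        # value already existed in the tax engine
    PRACTITIONER_PREFERENCE = "practitioner_preference"
    UNSUPPORTED_WORKFLOW = "unsupported_workflow"
    AMBIGUOUS = "ambiguous"


def route(cause: Cause) -> str:
    """Decide where a labeled correction goes next."""
    if cause is Cause.EXTRACTION_MISS:
        return "improvement_loop"
    # Assumption: unsupported workflows need a product decision, like ambiguous cases.
    if cause in (Cause.AMBIGUOUS, Cause.UNSUPPORTED_WORKFLOW):
        return "product_team"
    # Remaining causes are expected workflow noise, not product failures.
    return "expected_noise"


print(route(Cause.AMBIGUOUS))        # product_team
print(route(Cause.CARRIED_FORWARD))  # expected_noise
```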

The rental-property example makes the design concrete at field level. Rental property income belongs on Schedule E for individual returns. Tax AI needs to extract rental-property fields from messy sources such as handwritten notes, emails, spreadsheets, and client files, then map them into concepts the tax engine understands. Small errors in fields such as fair rental days, other expenses, or multiple properties in the same source package can consume real review time before filing.

Tax AI processes corrections in three layers. It compares the filed return with Tax AI's output and creates field-level review rows that include the expected value, predicted value, and whether the difference is actionable. It clusters similar rows to separate recurring product failures from expected workflow noise. Repeated findings then become eval targets. Codex receives more than "the answer was wrong": it gets representative source packages, expected output, related code paths, a grader, and regression suites.
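The first two layers can be sketched in a few lines. The `ReviewRow` shape follows the three attributes named above, but everything else is an assumption: the field names are invented, the `noise_fields` set stands in for the human judgment about what is actionable, and grouping by field name is a deliberately crude substitute for whatever clustering the real system uses.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class ReviewRow:
    return_id: str
    field: str
    expected: str    # value on the filed return
    predicted: str   # value in the drafted return ("" if missing)
    actionable: bool


def build_rows(return_id: str, predicted: Dict[str, str],
               filed: Dict[str, str], noise_fields: Set[str]) -> List[ReviewRow]:
    """Emit one review row per field where the draft differs from the filed return."""
    rows = []
    for field, expected in filed.items():
        got = predicted.get(field, "")
        if got != expected:
            rows.append(ReviewRow(return_id, field, expected, got,
                                  actionable=field not in noise_fields))
    return rows


def recurring_fields(rows: List[ReviewRow], min_count: int = 2) -> List[str]:
    """Fields with repeated actionable differences; candidates for eval targets."""
    counts = Counter(row.field for row in rows if row.actionable)
    return sorted(field for field, n in counts.items() if n >= min_count)


noise = {"preparer_note"}  # hypothetical field treated as expected workflow noise
rows = build_rows("r1", {"fair_rental_days": "300", "other_expenses": "120"},
                  {"fair_rental_days": "365", "other_expenses": "120",
                   "preparer_note": "x"}, noise)
rows += build_rows("r2", {"fair_rental_days": "0"}, {"fair_rental_days": "180"}, noise)
print(recurring_fields(rows))  # ['fair_rental_days']
```

The point of the row format is that a recurring finding arrives with its evidence attached, so the later eval target can be built from concrete expected and predicted values.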

Practitioner correction: field value changed before filing

↓

Product trace: source file, extracted field, provenance, tax-engine mapping

↓

Review rows: expected value, predicted value, actionable status

↓

Codex task: repo, eval datasets, regression suites, skills, read-only evidence

↓

Product-change candidate shipped after human review

Codex's role in this case reads more like investigation than generic code generation. OpenAI says Codex inspects source packages, extraction schemas, mapper behavior, and code paths. It then separates unsupported fields, missed extraction patterns, source-selection problems, mapper gaps, and grader problems. The proposed changes are similarly narrow: extending extraction schemas, improving rental-property document selection, updating the tax-engine mapper, or adjusting a grader that was counting expected workflow noise as product error.

AGENTS.md files and skills appear directly in the architecture. OpenAI's bounded task environment includes a repository branch, AGENTS.md, a task spec, a plan, and a result file. Application code, eval datasets, regression suites, graders, skills, and docs live in the same working context. Production traces, source artifacts, and tax-engine documentation are separated as read-only context. In practice, the environment tells Codex which surfaces it can change and which surfaces are evidence only.
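The split between editable surfaces and evidence can be expressed as a simple path check. This is a sketch under stated assumptions: the directory names are hypothetical, the public post does not describe how the boundary is enforced, and a real sandbox would rely on filesystem or container permissions instead of a helper like this.

```python
import posixpath

# Hypothetical layout: surfaces the agent may change on its branch.
WRITABLE_ROOTS = ("app", "evals", "graders", "skills", "docs")
# Everything else, such as evidence/traces or evidence/sources, is read-only.


def can_write(path: str) -> bool:
    """Return True only for paths that stay inside a writable root."""
    # normpath collapses "a/../b" so a path cannot escape a root via "..".
    normalized = posixpath.normpath(path)
    return any(normalized == root or normalized.startswith(root + "/")
               for root in WRITABLE_ROOTS)


print(can_write("app/mapper.py"))                   # True
print(can_write("evidence/traces/t1.json"))         # False
print(can_write("app/../evidence/traces/t1.json"))  # False
```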

That structure lines up with the current coding-agent discussion. Many products foreground the execution surface: create a PR, run tests, address an issue, or work through a backlog item. The Tax AI case makes the pre-PR packaging more visible. Before Codex can be useful, the task folder already needs a success condition expressed as an eval, a representative failure expressed as a production trace, and human review that filters ambiguous tax judgment away from product defects.
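A success condition expressed as an eval implies an acceptance gate of roughly the following shape. This is an illustrative sketch, not the gate OpenAI describes: the eval names are invented, and the rule "fix every target, break nothing that passed before" is an assumption about what satisfying the evals means.

```python
from typing import Dict, Set


def accept_candidate(baseline: Dict[str, bool], candidate: Dict[str, bool],
                     targets: Set[str]) -> bool:
    """Gate a proposed change on target evals passing with no regressions."""
    # Every targeted eval must pass under the candidate change.
    fixes_targets = all(candidate.get(name, False) for name in targets)
    # Every eval that passed on the baseline must still pass.
    no_regressions = all(candidate.get(name, False)
                         for name, passed in baseline.items() if passed)
    return fixes_targets and no_regressions


baseline = {"rental_days": False, "w2_wages": True}
good = {"rental_days": True, "w2_wages": True}
bad = {"rental_days": True, "w2_wages": False}
print(accept_candidate(baseline, good, {"rental_days"}))  # True
print(accept_candidate(baseline, bad, {"rental_days"}))   # False
```

Passing this kind of gate only makes a change a candidate; in the case described here, human review still sits between the candidate and shipping.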

Component	Role in the OpenAI case	Question for teams to verify
Practitioner feedback	Structures corrected fields and final filed values	Who decides whether a correction is an error or tax judgment?
Production trace	Links source, extraction, provenance, mapper, and filed return	Which permissions protect sensitive documents and PII?
Tailored eval	Turns repeated corrections into bounded success conditions	Does the regression suite cover real filing complexity?
Codex task	Uses repo and eval context to propose fixes and evidence	Are branch, reviewer, permissions, and rollback rules sufficient?

The accuracy numbers should be read with measurement discipline. "Up to 97% accuracy" does not, from the public text alone, fully specify the field set, return type, sample definition, or denominator. The 75% correct field completion metric shows launch-to-six-week improvement, but the detailed curve for 90% and 100% completion is more visible in OpenAI's imagery than in the article text. For buyers and builders, the safer use of the case is as a measurement checklist, not as a direct procurement benchmark.

The time-savings number needs the same treatment. OpenAI cites a senior accountant who spent 180 hours on tax prep the prior year and 15 hours this year. That is a strong anecdote, but it is still one practitioner example rather than the pilot average. The useful product detail is what happened to the saved time: OpenAI says it shifted toward client calls and new services. That shows how a vertical agent's cost reduction can become product value only when the surrounding practice has higher-value work to absorb the freed capacity.

Thrive's role is part of the explanation. OpenAI describes Thrive Holdings as both owner and operator, giving the joint team deep access to real operating companies such as Crete. That is different from a SaaS vendor observing customer workflows at arm's length. Without direct practitioner access and production data, it becomes harder to collect field-level corrections, classify ambiguous cases, and convert recurring failures into scoped evals quickly.

Tax filing is not an easy agent domain. The input documents are messy, the rules vary across forms and schedules, and a wrong value can create filing risk for a client. OpenAI's claim is narrower than "AI replaces accountants." The case study says automation is being applied to extraction and mapping layers, while architecture, product decisions, and shipping remain engineering responsibilities. Dropping that qualification would turn a concrete operations case into an overbroad labor-replacement story.

Engineering teams can still take four immediate design patterns from the case. First, store production feedback as review rows with expected values and predicted values, not only as free-text complaints. Second, traces should preserve source provenance and downstream system mappings, not only model inputs and outputs. Third, have humans separate expected workflow noise from product failure before a recurring finding becomes an eval target. Fourth, give a coding agent read-only evidence and regression commands before giving it broad write authority.

The security and governance questions are large. Tax AI may touch tax records, client notes, prior-year returns, and supporting documents. If a Codex task can reference read-only production context, teams need to define what data enters the sandbox, which identifiers are masked, and where failed-run artifacts are retained. OpenAI shows a bounded task environment, but the public post does not fully spell out customer-by-customer data residency, audit logging, or reviewer policy.
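As a small illustration of the masking question, the sketch below redacts dash-formatted U.S. Social Security numbers before text enters a task sandbox. It is an assumption-laden example, not a description of Tax AI: the function name is hypothetical, and a real pipeline would need to cover many more identifier types and formats than this single pattern.

```python
import re

# Matches only the NNN-NN-NNNN form; undashed or spaced SSNs are not caught.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-(\d{4})\b")


def mask_ssn(text: str) -> str:
    """Replace all but the last four digits of dash-formatted SSNs."""
    return SSN_PATTERN.sub(r"***-**-\1", text)


print(mask_ssn("Taxpayer SSN 123-45-6789 on file"))
# Taxpayer SSN ***-**-6789 on file
```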

The wider developer community reaction appears limited so far. The Korean research note did not find an active Hacker News or GeekNews discussion under the same title. Secondary coverage mostly summarized OpenAI's numbers, while some commentary argued that self-improving is better understood as a structured correction loop than as model self-training. That reading fits the technical substance: the interesting part is not a model magically becoming smarter, but human correction being transformed into an engineering task Codex can inspect, test, and patch.

From an AI infrastructure angle, this is a convergence of observability, evals, and coding agents inside one product loop. It overlaps with the market that tools such as W&B Weave, LangSmith, Braintrust, and Datadog are addressing around agent traces and eval operations. OpenAI is not announcing a standalone observability product here, but the competitive bar is visible. In enterprise verticals, "the agent gives good answers" is less durable than "agent failures close through evidence, evals, patches, regressions, and review."

The conservative conclusion is the most useful one. OpenAI and Thrive reported a 7,000-return pilot, more than 30 accounting firms, up to 97% draft accuracy, and a six-week jump to 86% of scored returns reaching 75% correct field completion. Those numbers are internal product-case numbers. The part builders can reuse is the procedure: convert production corrections into evals and bounded Codex tasks. For teams already shipping agents, the next bottleneck may be trace design, field-level review, regression coverage, and the human gate rather than the model choice.

Sources