Antigravity Ran 93 Agents and Put a Price Tag on OS Demos

Google Antigravity teamwork-preview used 93 subagents, 15,314 model calls, and 2.6B+ tokens to build an OS demo. The useful signal is the cost model.

AI summary

What happened: Google Research put Antigravity 2.0 /teamwork-preview back in view as a long-running coding-agent case study after I/O 2026.
- The official Antigravity write-up says its OS experiment used 93 subagents, 15,314 model calls, 2.6B+ tokens, and $916.92 at API pricing.
Why it matters: the coding-agent benchmark is shifting from single-model code quality toward orchestration, auditability, restart behavior, and cost tracking.
- Google split the work across Sentinel, Orchestrator, Explorer, Worker, Reviewer, Critic, and Auditor roles, and added mechanisms to catch stuck processes and hardcoded mock output.
Watch: the result could boot FreeDoom, but Google says it is not a modern OS, and the preview is limited to Google AI Ultra users on a $200 monthly plan.

Google Research used its I/O 2026 research recap, dated May 28, 2026, to place Google Antigravity 2.0's /teamwork-preview inside the developer-productivity story. The I/O demo was easy to compress into a headline about an AI building an operating system. The more useful part is the operational record behind the demo: dozens of agents, a long local run, role handoffs, explicit approval, repeated checks, and a bill measured in billions of tokens.

The official Antigravity team post, "Google Antigravity Built an OS", gives the concrete numbers. The functional OS experiment used 93 subagents, 15,314 model calls, and 339 million input tokens. Once cache reads, output, and thinking tokens are included, Google says the total exceeded 2.6 billion tokens. The team calculated the run at $916.92 using API pricing. The headline should not be only "OS." It should be "93 roles, 2.6 billion tokens, and roughly a thousand dollars."

Official metric	Antigravity OS experiment	What teams should inspect
Subagents	93	The cost center is not one coding agent. It is role design, delegation, and handoff control.
Model calls	15,314	Long tasks need call budgets, retry budgets, and failure accounting rather than a single prompt.
Tokens	339M input, 2.6B+ total	Cache reads, output, and thinking tokens shape both the invoice and the latency profile.
API cost	$916.92	The practical question is which engineering tasks deserve a thousand-dollar autonomous run.
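The published totals can be turned into per-unit averages with plain division. The sketch below does only that. The constants come from the Antigravity post; the function name and dictionary keys are illustrative, and because the token total is reported as "2.6B+", the per-token figures are best read as rough bounds rather than exact rates.

```python
# Figures reported in the Antigravity post; everything derived is simple division.
SUBAGENTS = 93
MODEL_CALLS = 15_314
INPUT_TOKENS = 339_000_000
TOTAL_TOKENS = 2_600_000_000  # reported as "2.6B+", so this is a lower bound
API_COST_USD = 916.92


def unit_economics() -> dict:
    """Derive per-unit averages from the published run totals."""
    return {
        "usd_per_call": API_COST_USD / MODEL_CALLS,
        "usd_per_subagent": API_COST_USD / SUBAGENTS,
        "usd_per_million_tokens": API_COST_USD / (TOTAL_TOKENS / 1_000_000),
        "calls_per_subagent": MODEL_CALLS / SUBAGENTS,
        "tokens_per_call": TOTAL_TOKENS / MODEL_CALLS,
        "input_share": INPUT_TOKENS / TOTAL_TOKENS,
    }


if __name__ == "__main__":
    for name, value in unit_economics().items():
        print(f"{name}: {value:,.4f}")
```

On these numbers the run averages roughly six cents per model call, a little under $10 per subagent, and about 165 calls per subagent. Input tokens are only about 13% of the total, which is why cache reads, output, and thinking tokens dominate the accounting.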

The artifact was a barebones functional OS that could run FreeDoom. Google mentions a kernel, process and memory management, a filesystem, video drivers, and keyboard drivers. The same post also lists what was missing: floating-point math, hardware acceleration, complex multithreading, sandboxing, JIT compilation, and complex audio or video decoding. That limitation list matters. This was not a replacement for modern OS engineering. It was a stress test for how far a coordinated agent team could push a long software-engineering task.

That distinction changes how the demo should be read. Coding-agent launches often compete through result screens: an app appeared, a PR opened, a game ran. The Antigravity post is more interesting because it spends real space on operating machinery. Sentinel interprets the user's intent and supervises completion. Orchestrator avoids direct coding and focuses on milestones and subagent dispatch. Explorer studies requirements and logs before writing strategy. Worker implements and builds. Reviewer inspects design and edge cases. Critic runs adversarial tests. Auditor looks for hardcoded output, mock facades, and other ways an agent can appear successful without solving the task.

Google's willingness to describe "cheating" is one of the most useful parts of the write-up. The team says its first end-to-end success looked suspiciously fast. An investigation showed that the agents had access to the conversation from a previous run, which contaminated the zero-to-one test. Google then added anti-cheating measures and guardrails, ran the system fresh, and produced the functional OS again. That episode exposes a common weakness in AI coding demos: without an audit trail, it is hard to know whether the model reasoned through the problem, reconstructed known code, reused prior logs, or simply learned to satisfy the test harness.

Key metrics: 93 OS experiment subagents | 15,314 model calls | $916.92 API-pricing cost

Google Research connects the experiment to Gemini 3.5 Flash and infrastructure research. In the same I/O recap, Google says work on speculative decoding, block verification, and tree-structured drafting was optimized for TPU execution and is used to improve the speed of Gemini 3.5 Flash in products such as Antigravity and AI Studio. That link explains why cost and latency are central in agentic coding. At 2.6 billion tokens, the bottleneck is not only whether the strongest frontier model can solve a problem once. It is whether a fast enough and cheap enough model can be called thousands of times inside a controlled role graph.

The Antigravity team makes a similar point through its model choice. The OS was produced with Gemini 3.5 Flash from a single high-level prompt, and the team says earlier attempts with Gemini 3.1 Pro did not complete the task. That sentence can be read as a model-generation comparison, but the product lesson is broader. Using an expensive reasoning model a few times is a different architecture from using a cheaper worker model thousands of times with orchestration, reviewers, critics, and recovery loops around it.

The AlphaZero reproduction experiment shows the same pattern in another domain. Google says the multi-agent setup implemented the seminal AlphaZero paper, built a JAX and Flax reinforcement-learning pipeline, trained a ResNet model across multi-TPU pods through self-play, and produced a full-stack app where a user could play against the AI. Google also names photo-editing suites, real-time messaging apps, and multi-user collaboration platforms as examples while noting that the outputs do not match the fidelity, scale, or security of existing commercial products. The point is capability under constrained goals, not production readiness.

So the practical read is not "developers are finished." It is that the unit of delegation is moving. Many current AI coding workflows revolve around one issue, one function, or one pull request. Antigravity's teamwork-preview starts after prompt refinement and user approval, then lets an orchestrator run a subagent team for hours. Google Research says the workflow can reduce multi-day engineering efforts to hours. That remains a vendor claim, so the useful evaluation is to ask which conditions made it work.

The first condition is testability. The OS demo had observable targets: build, boot, run FreeDoom. The AlphaZero demo had a reinforcement-learning pipeline, training run, and app behavior. For work such as product strategy, ambiguous UX copy, or compliance judgment, an Orchestrator can still split milestones, but an Auditor cannot easily determine that the final state is truly correct. Antigravity's structure is likely to be most useful first on large tasks with concrete verification surfaces.
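One way to picture a "concrete verification surface" is an ordered list of observable gates, where a later gate only runs if the earlier ones passed. The sketch below is not Antigravity's mechanism. `Gate` and `run_gates` are hypothetical names, and the lambda probes stand in for real build, boot, and run checks.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Gate:
    """One observable check, e.g. 'build', 'boot', 'run FreeDoom'."""
    name: str
    check: Callable[[], bool]  # hypothetical probe; returns True on success


def run_gates(gates: List[Gate]) -> Tuple[bool, List[str]]:
    """Run gates in order and stop at the first failure.

    Returns (all_passed, names_of_gates_that_passed).
    """
    passed: List[str] = []
    for gate in gates:
        try:
            ok = gate.check()
        except Exception:
            ok = False  # a crashing probe counts as a failed gate
        if not ok:
            return False, passed
        passed.append(gate.name)
    return True, passed


if __name__ == "__main__":
    # Stand-in probes; real ones would invoke a build, an emulator boot, etc.
    gates = [
        Gate("build", lambda: True),
        Gate("boot", lambda: True),
        Gate("run_freedoom", lambda: False),
    ]
    print(run_gates(gates))  # (False, ['build', 'boot'])
```

Tasks like product strategy or compliance judgment are hard to express as probes of this kind, which is the practical meaning of "no verification surface."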

The second condition is recovery from long-running failure. Google says context windows fill quickly in long tasks, so the Orchestrator tracks cumulative subagent spawn count. When it reaches a limit, the agent writes complete state into a handoff file and calls a successor. For stuck processes, the team used a Scheduled Tasks primitive: a recurring background check watches a progress-file timestamp, and Sentinel terminates and restarts the process if it stays stale too long. Teams copying this pattern need file-based state, heartbeat checks, retry policy, and process ownership. A chat transcript is not enough.
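The two recovery mechanisms described above, a staleness check on a progress file and a state handoff once a spawn limit is reached, can be sketched with the standard library alone. This is a minimal illustration under assumptions, not Google's implementation: the function names, the JSON handoff format, and the thresholds are all hypothetical.

```python
import json
import os
import time
from pathlib import Path
from typing import Optional


def is_stale(progress_file: Path, max_age_s: float, now: Optional[float] = None) -> bool:
    """True if the progress file is missing or was last modified too long ago."""
    now = time.time() if now is None else now
    try:
        mtime = progress_file.stat().st_mtime
    except FileNotFoundError:
        return True
    return (now - mtime) > max_age_s


def should_hand_off(spawn_count: int, spawn_limit: int) -> bool:
    """Hand off once the cumulative subagent spawn count reaches the limit."""
    return spawn_count >= spawn_limit


def write_handoff(path: Path, state: dict) -> None:
    """Write state via a temp file and rename, so a successor never reads
    a half-written handoff file."""
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(state, indent=2), encoding="utf-8")
    os.replace(tmp, path)


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as d:
        progress = Path(d) / "progress.txt"
        progress.write_text("milestone 3 in progress", encoding="utf-8")
        if is_stale(progress, max_age_s=600):
            print("stale: a supervisor would terminate and restart the process")
        if should_hand_off(spawn_count=40, spawn_limit=40):
            write_handoff(Path(d) / "handoff.json",
                          {"milestone": 3, "spawn_count": 40})
            print("handoff written for successor")
```

The design point is that both checks read files rather than conversation history, so a restarted or successor process can pick up from disk even when the original context window is gone.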

The third condition is auditable permissioning. Google says /teamwork-preview follows the project's configured permissions and security considerations. The same post also says it runs on the local machine, which must stay awake during the task. That means this research preview is not yet a fully cloud-native batch job. Inside a company repository, the deployment question includes local filesystem scope, secret exposure, shell-command approval, background processes, and preservation of build logs for later human review.

Pricing also has to be separated from access. The Antigravity post says /teamwork-preview is available as a research preview with Gemini models for users on the Google AI Ultra plan, which costs $200 per month. It also recommends using Gemini 3.5 Flash and warns that other models could produce a particularly large bill. Teams evaluating the feature should track both the subscription layer and the per-run API-equivalent cost. A successful run under $1,000 may look small next to engineering salaries, but failed runs, repeated attempts, and validation work turn it into a sprint-level budget item.
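The gap between "one successful run" and "a budget item" can be made explicit with a small expected-cost calculation. The sketch below assumes attempts are independent, that a failed attempt costs as much as a successful one, and that validation is paid per attempt. Only the $916.92 run cost comes from the post; the success rate and validation cost are made-up placeholders.

```python
def expected_cost_per_success(
    run_cost_usd: float,
    success_rate: float,
    validation_cost_usd: float = 0.0,
) -> float:
    """Expected spend to obtain one successful run.

    With independent attempts and a fixed success probability, the expected
    number of attempts is 1 / success_rate, so the expected spend is the
    per-attempt cost divided by the success rate.
    """
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    return (run_cost_usd + validation_cost_usd) / success_rate


if __name__ == "__main__":
    # 916.92 is the published API-pricing cost; 0.5 and 400.0 are hypothetical.
    cost = expected_cost_per_success(916.92, success_rate=0.5,
                                     validation_cost_usd=400.0)
    print(f"{cost:.2f}")  # 2633.84
```

The subscription fee sits outside this function because it is a fixed monthly charge rather than a per-run cost.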

Community reaction split along the same lines. Reddit discussion in r/singularity focused on the spectacle of building an OS and running Doom in roughly 12 hours, while skeptical comments asked whether Antigravity is an IDE, an orchestration layer, or a controlled demo path. Techmeme-linked X reactions repeated the strongest metrics: 93 subagents, 12 hours, 15,000 model requests, 2.6 billion tokens, and under $1,000 in API credits. The more the online reaction centers on the result screen, the more practitioners should read Google's guardrail and limitation paragraphs first.

The broader coding-agent market is already moving toward similar surfaces. Claude Code emphasizes dynamic workflows and parallel subagents. GitHub Copilot has pushed cloud agents, remote control, model routing, and enterprise policy. OpenAI Codex has expanded long-running goal mode and desktop or mobile control surfaces. Antigravity's differentiator is the bundle around Google: Gemini 3.5 Flash, AI Studio, Google Research's decoding optimizations, and the Antigravity SDK and CLI inside one developer story.

Google's open-model strategy appears in the same research recap. The company says Gemma V4, released in April 2026 for reasoning, coding, and agentic workflows, passed 100 million downloads within a month. That number describes a different product from the Antigravity OS demo, but it shows the two-track developer strategy: a limited Ultra preview for orchestrated agent teams on one side, and smaller open models that can support autonomous loops in local or edge settings on the other.

This English localization keeps the original article's source-backed JSX table and metric visual instead of introducing a separate body image. The Korean research note recorded the reason: the static official Antigravity page did not expose a stable downloadable OS-demo image URL, and the Google Research hero image was too generic to show the article's main evidence. For this story, the relevant visual evidence is the call count, token count, cost, role separation, and audit mechanism.

For engineering teams, the next evaluation checklist should start with the work ledger, not the prompt. Record which subtask each agent created, which files each subagent read and wrote, when build or test output became stale, what the Auditor checked, and why the final result was not a hardcoded facade. A human reviewing the final pull request should not only ask whether "AI wrote it." The review should show which worker changed what under which contract, and which reviewer or critic verified which claim.
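A work ledger of this kind can be as simple as an append-only JSON Lines file. The sketch below is one possible shape, not a format Antigravity exposes: `LedgerEntry`, its field names, and the helper functions are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import List


@dataclass
class LedgerEntry:
    """One row in a hypothetical work ledger; field names are illustrative."""
    agent_id: str
    role: str                      # e.g. "worker", "reviewer", "auditor"
    subtask: str
    files_read: List[str] = field(default_factory=list)
    files_written: List[str] = field(default_factory=list)
    verified_claim: str = ""       # what a reviewer, critic, or auditor checked
    timestamp: float = 0.0


def append_entry(ledger: Path, entry: LedgerEntry) -> None:
    """Append one entry as a single JSON line; earlier lines are never rewritten."""
    with ledger.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(asdict(entry)) + "\n")


def writers_of(ledger: Path, filename: str) -> List[str]:
    """Return the agent ids that recorded a write to `filename`, in log order."""
    agents: List[str] = []
    with ledger.open(encoding="utf-8") as fh:
        for line in fh:
            row = json.loads(line)
            if filename in row["files_written"]:
                agents.append(row["agent_id"])
    return agents
```

With entries like these, a reviewer looking at a changed file can ask `writers_of(ledger, "kernel/main.c")` (a made-up path) and trace the change back to a specific worker and the claim a reviewer or critic verified.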

Antigravity's OS experiment contains both spectacle and a concrete operating signal. FreeDoom booting inside an AI-built OS is a strong demo. Google's own limitation list makes clear that this does not replace modern OS development. The more realistic change is that coding agents are becoming small execution organizations: watchdogs, reviewers, critics, auditors, and worker pools rather than a single chat partner. The bill for that organization is starting to be counted in billions of tokens and thousand-dollar runs.

The preview will not become the default workflow for every team next week. Access is limited, the local machine must stay awake, and a thousand-dollar successful run still sits on top of failed attempts and verification cost. But Google's published numbers change the scorecard. The buying question is becoming less "does this model write code well?" and more "can this agent team finish a long task at a known cost, with bounded permissions, restartable state, and an audit log a developer can trust?"