AWS cuts agent search costs with OpenSearch Serverless scale-to-zero
AWS released the next generation of Amazon OpenSearch Serverless for agentic AI apps, with scale-to-zero, 20x faster autoscaling, and up to 60% lower costs.
- What happened: AWS announced the next generation of Amazon OpenSearch Serverless on May 29, 2026.
- The stated numbers are scale from zero, 20x faster autoscaling, up to 60% cost savings, and 10x more concurrent queries.
- Why it matters: RAG systems, agent memory, and log-analysis agents depend on search backends that can absorb bursty tool calls without carrying a high idle floor.
- Watch: The 60% figure is not a universal budget line. It depends on collection type, vector dimensions, query bursts, ingestion rate, and billing granularity.
- Teams already debating OpenSearch Compute Unit costs should rerun measurements against their own corpus, query mix, and cold-start tolerance.
AWS announced the next generation of Amazon OpenSearch Serverless on May 29, 2026. The AWS News Blog framed the release as search infrastructure for building agentic AI applications, not merely as a managed-search update. The launch numbers are specific: scale from zero in seconds, autoscaling that AWS says is 20x faster than before, up to 60% cost savings, 10x more concurrent queries, and petabyte-scale data support. For AI builders, the story sits below the model layer. It targets the backend that agents use when they retrieve documents, search memory, inspect logs, and connect product context to model calls.
Search infrastructure is often less visible than the LLM in an AI product, but it shows up quickly in both latency and billing. A RAG pipeline embeds documents, builds a vector index, and often combines semantic search with keyword search on every user question. Coding agents and operations agents add their own search paths: repository history, tool results, tickets, logs, runbooks, incident reports, and past agent runs. Those indexes keep existing when no one is using the product, then face bursts when users run agents in parallel. The OpenSearch Serverless update is aimed at those two pain points: the idle cost floor and the burst-scaling curve.

The first phrase to read carefully in AWS's announcement is "scale from zero." Serverless search products can still carry minimum capacity costs, especially when a collection or workload has to stay provisioned for readiness. OpenSearch Serverless has faced repeated community questions around minimum OpenSearch Compute Unit usage, per-collection economics, and whether small workloads are actually cheaper than managed clusters. AWS putting scale-to-zero and up to 60% cost savings at the top of this release suggests the company is addressing that customer complaint directly.
The basic capacity unit in OpenSearch Serverless is the OCU, or OpenSearch Compute Unit. AWS abstracts indexing, search, and storage capacity so users do not have to choose instance types, size clusters, or operate shards manually. The Big Data Blog deep dive says the next generation changes how capacity is shared and allocated, reducing the need to pre-provision capacity per collection. That distinction matters for agentic AI workloads. A public search product with steady traffic has a different capacity profile from an internal RAG assistant that stays idle for hours, then receives a burst of parallel retrieval calls.
AWS is putting agentic AI in the foreground because the shape of search has changed. Plain full-text search was enough for many document and log workflows. Agents combine vector search, hybrid search, metadata filters, recent conversation memory, tool-output indexing, observability queries, and security analytics. Once an LLM calls a search backend as a tool, search latency becomes part of the model experience. If the backend is slow, the agent may retry, fall back, or call more tools. That cost lands on both sides: more model tokens and more search queries.
The release does not treat vector search and text search as separate product categories. The AWS News Blog walks through creating a collection, building a vector index, and connecting an application to search. It also includes examples around Vercel AI SDK and Kiro, which is AWS's AI development environment. That positioning matters. AWS is not only saying that OpenSearch got faster. It is presenting OpenSearch Serverless as the search backend that an AI app developer can wire into an agent workflow.

The headline metrics map to different developer checks. The 20x faster autoscaling claim is about traffic spikes: many users searching the same knowledge base, or an agent running multiple retrieval subtasks at once. The up-to-60% cost claim is about idle or low-utilization workloads. The 10x concurrent-query claim connects to agents that run parallel tool calls or dashboards that query logs while agents are also working. Petabyte-scale support is more relevant to enterprise logs, security analytics, and very large document corpora than to a small prototype.
| AWS claim | Where it touches agentic AI | What teams should validate |
|---|---|---|
| Scale from zero | Lower idle cost for RAG backends when no users are active | Cold-start latency, first-query delay, and minimum billing units |
| 20x faster autoscaling | Absorbs parallel agent search and sudden traffic bursts | Spike duration, index size, and query mix |
| Up to 60% cost savings | Reduces the floor for small RAG apps and intermittent workloads | OCU usage, ingestion rate, storage, and vector dimensions |
| 10x concurrent queries | Supports multi-agent tool calls and observability queries at the same time | p95 latency, throttling, and the share of hybrid-search requests |
That table is also why the 60% figure should not be copied directly into a product budget. AWS describes the savings as workload-dependent. If a service has large vector dimensions, continuous ingestion, and steady query traffic, scale-to-zero will not matter as much. If the workload is an internal document agent, a developer log-search assistant, or a periodic analysis agent that sleeps between runs, the economics can change more sharply. A real comparison needs the same corpus and query set across OpenSearch Service provisioned domains, current OpenSearch Serverless, the new generation, and alternatives such as Pinecone, Weaviate, Elastic, or MongoDB Atlas Vector Search.
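The workload dependence can be made concrete with a back-of-the-envelope model. The sketch below is an illustration, not AWS's pricing formula: the `Workload` fields, the idle floor of 2 OCUs, and the assumption that active-hour capacity is the same before and after are all hypothetical inputs that a team would replace with figures from its own bill. It also ignores storage and any minimum billing unit.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical workload shape; replace with measured values."""
    active_hours_per_day: float  # hours with ingestion or query traffic
    ocus_when_active: float      # average OCUs consumed during those hours


def monthly_ocu_hours(w: Workload, idle_floor_ocus: float, days: int = 30) -> float:
    """OCU-hours per month when idle hours are billed at `idle_floor_ocus`."""
    idle_hours = 24.0 - w.active_hours_per_day
    daily = w.active_hours_per_day * w.ocus_when_active + idle_hours * idle_floor_ocus
    return days * daily


def savings_fraction(w: Workload, idle_floor_ocus: float) -> float:
    """Share of OCU-hours removed if the idle floor drops to zero.

    Assumes active-hour consumption does not change, which is a simplification.
    """
    before = monthly_ocu_hours(w, idle_floor_ocus)
    after = monthly_ocu_hours(w, 0.0)
    return 0.0 if before == 0 else (before - after) / before


# Assumed example shapes: an intermittent internal assistant and a steady public search.
bursty = Workload(active_hours_per_day=2, ocus_when_active=4)
steady = Workload(active_hours_per_day=24, ocus_when_active=4)

if __name__ == "__main__":
    print(f"bursty: {savings_fraction(bursty, idle_floor_ocus=2.0):.0%}")
    print(f"steady: {savings_fraction(steady, idle_floor_ocus=2.0):.0%}")
```

Under these assumed inputs the intermittent workload sheds about 85% of its OCU-hours and the always-on workload sheds none, which is why a single headline percentage cannot be carried across workloads.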
The Big Data Blog describes a structural change behind the metrics. Earlier OpenSearch Serverless architecture gave collections capacity isolation, but that also meant small collections could leave capacity underused. The next generation is presented as a more elastic allocation model where indexing and search capacity move closer to workload demand. AWS is trying to keep the managed-service benefit, where the user does not pick nodes and shards, while making the cost curve look more like the serverless label implies.
For agent developers, the first practical use case is RAG. Documents are chunked, embedded, stored in OpenSearch, and retrieved through vector and keyword search whenever a user asks a question. The second is memory. An agent may store prior work results, user preferences, tool outputs, and run logs in a searchable index. The third is observability. An operations agent can search logs, metrics, incident tickets, deployment history, and runbooks before proposing an action. All three workloads mix idle intervals with query bursts.
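A minimal sketch of the RAG retrieval step might look like the following. It builds a k-NN request body and a keyword request body in OpenSearch query DSL, then merges the two ranked ID lists on the client with reciprocal rank fusion. The fusion step is a swapped-in technique chosen for illustration, not something the AWS announcement describes, and the field names `embedding` and `chunk_text` are hypothetical. Sending the requests is left to whatever client the application already uses.

```python
from typing import Dict, List, Sequence


def knn_query(vector: Sequence[float], k: int, field: str = "embedding") -> Dict:
    """Request body for a k-NN search on a vector field (field name is assumed)."""
    return {"size": k, "query": {"knn": {field: {"vector": list(vector), "k": k}}}}


def keyword_query(text: str, k: int, field: str = "chunk_text") -> Dict:
    """Request body for a full-text match on a text field (field name is assumed)."""
    return {"size": k, "query": {"match": {field: text}}}


def rrf_merge(ranked_lists: Sequence[Sequence[str]], rank_constant: int = 60) -> List[str]:
    """Merge ranked document-ID lists with reciprocal rank fusion.

    Each document scores sum(1 / (rank_constant + rank)) over the lists it
    appears in, with ranks starting at 1. Ties are broken by document ID so
    the output is deterministic.
    """
    scores: Dict[str, float] = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank_constant + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    vector_hits = ["a", "b", "c"]    # IDs returned by the k-NN request
    keyword_hits = ["b", "c", "a"]   # IDs returned by the keyword request
    print(rrf_merge([vector_hits, keyword_hits]))
```

Issuing two requests per question doubles the query count against the backend, so this shape is worth including in any concurrency or cost measurement.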
The Vercel AI SDK example in the AWS announcement fits this pattern. Vercel AI SDK is a developer surface for LLM calls, streaming, and tool use in application code. If OpenSearch Serverless plugs into that flow, an agent app can join model calls and search backend calls inside the same product path. The Kiro example reinforces the same point for coding agents and agentic IDEs. To modify code reliably, an agent has to search repository context, documentation, logs, issues, and previous decisions repeatedly.
The competitive market is now a mix of vector database products and managed search services. Pinecone and Weaviate lead with vector search and RAG workflows. Elastic Cloud and MongoDB Atlas Vector Search attach vector capabilities to existing data and search products. Azure AI Search and Google Vertex AI Search wrap retrieval into cloud AI stacks. Amazon OpenSearch Serverless offers a different set of advantages: the OpenSearch ecosystem, AWS IAM, VPC integration, Bedrock Knowledge Bases, CloudWatch, and log analytics. The tradeoff is that OCU pricing and collection-level planning can feel complex for small teams that want predictable costs.
The Bedrock connection is especially relevant. In AWS's RAG stack, Bedrock Knowledge Bases can manage data sources, embeddings, and retrieval configuration, while OpenSearch Serverless often sits behind it as the vector store. If OpenSearch Serverless lowers idle costs and scales faster during bursts, the operating cost of Bedrock-based agents can change as well. But total cost still includes Bedrock, embedding models, storage, data transfer, OpenSearch queries, and model tokens. A lower search bill does not automatically mean a lower full agent bill.
The first implementation check for a team is collection type. OpenSearch Serverless separates workload categories such as search, vector search, and time series. An agent product has to decide whether memory and RAG belong in the same collection, whether log analytics should be separated from document search, and whether security policies require separate indexes. Those choices affect both latency and cost. Serverless removes cluster sizing from the user, but it does not remove information architecture.
The second check is index design. Vector dimensions, HNSW parameters, metadata fields, text analyzers, and hybrid-search behavior all affect latency and storage. A serverless service can hide nodes, but it cannot make a poorly shaped index free. A product team still needs to measure retrieval quality, not only query speed. If a cheaper backend returns worse context, the downstream LLM may compensate with longer answers, more retries, or lower accuracy.
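To show where those index-design knobs live, here is a hedged sketch of a vector index body plus two rough size estimates. The mapping follows the general OpenSearch k-NN conventions (`knn_vector`, `dimension`, an `hnsw` method with `m` and `ef_construction`); the field names, the `faiss` engine, the `l2` space type, and the parameter defaults are assumptions to check against what the target collection supports. The HNSW figure uses a rule of thumb commonly attributed to the OpenSearch k-NN documentation, roughly 1.1 * (4 * dimension + 8 * m) bytes per vector, and should be verified before it feeds a budget.

```python
from typing import Dict


def vector_index_body(dimension: int, m: int = 16, ef_construction: int = 128) -> Dict:
    """Index body with one vector field and one text field (names are assumed)."""
    return {
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "embedding": {
                    "type": "knn_vector",
                    "dimension": dimension,
                    "method": {
                        "name": "hnsw",
                        "engine": "faiss",    # assumed; confirm supported engines
                        "space_type": "l2",   # assumed distance metric
                        "parameters": {"m": m, "ef_construction": ef_construction},
                    },
                },
                "chunk_text": {"type": "text"},
                "source": {"type": "keyword"},  # metadata field for filtering
            }
        },
    }


def raw_vector_bytes(num_vectors: int, dimension: int) -> int:
    """Bytes needed for the float32 vectors alone (4 bytes per component)."""
    return num_vectors * dimension * 4


def hnsw_bytes_estimate(num_vectors: int, dimension: int, m: int = 16) -> float:
    """Rule-of-thumb HNSW footprint; an approximation, not a billing figure."""
    return 1.1 * (4 * dimension + 8 * m) * num_vectors


if __name__ == "__main__":
    n, d = 1_000_000, 1024
    print(raw_vector_bytes(n, d) / 1e9, "GB raw")
    print(hnsw_bytes_estimate(n, d) / 1e9, "GB estimated with HNSW graph")
```

Halving the embedding dimension roughly halves both numbers, which is why vector dimensions appear next to OCU usage in the validation column of the table above.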
Cold start is the third check. Scale-to-zero reduces idle cost, but it can introduce first-query latency. AWS says the service can scale from zero in seconds. That may be acceptable for an overnight analysis agent or an internal knowledge base used intermittently. It may be more visible in a customer-support agent where the first user question needs a fast answer. Product teams should design prewarming, caching, progress UI, and fallback behavior according to the user experience, not just according to the infrastructure bill.
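One way to put a number on that first-query delay is to time the first call after an idle period separately from the calls that follow. The helper below is a sketch under stated assumptions: `search` is a hypothetical zero-argument callable that wraps whatever client the team uses, and the harness only measures wall-clock time as seen by the caller.

```python
import statistics
import time
from typing import Callable, Dict


def measure_cold_and_warm(
    search: Callable[[], object],
    warm_runs: int = 5,
    clock: Callable[[], float] = time.perf_counter,
) -> Dict[str, float]:
    """Time the first call (cold) and the median of the following calls (warm).

    `search` is a caller-supplied function that issues one query. `clock` is
    injectable so the harness can be tested without real waiting.
    """
    if warm_runs < 1:
        raise ValueError("warm_runs must be at least 1")

    start = clock()
    search()
    cold = clock() - start

    warm = []
    for _ in range(warm_runs):
        start = clock()
        search()
        warm.append(clock() - start)

    return {"cold_s": cold, "warm_median_s": statistics.median(warm)}


if __name__ == "__main__":
    # Stand-in for a real query: the first call is slow, later calls are fast.
    state = {"calls": 0}

    def fake_search() -> None:
        state["calls"] += 1
        time.sleep(0.2 if state["calls"] == 1 else 0.01)

    print(measure_cold_and_warm(fake_search))
```

The measurement is only meaningful if the collection has actually been idle long enough to scale down before the run, so the idle interval belongs in the test plan alongside the numbers.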
Parallel tool use is the fourth check. Agents increasingly issue several searches inside one user request. A coding agent might search dependency docs, error logs, repository files, and issue history in parallel. An operations agent might inspect CloudWatch logs, an OpenSearch index, a ticketing system, and a runbook at the same time. AWS's 10x concurrent-query claim is relevant because agent workflows create concurrency from a single user action. Teams should watch p95 and p99 latency, throttling, and the number of search calls per task, not just average latency.
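The fan-out pattern and the tail-latency check can be sketched together. In the example below, `search` is a hypothetical async callable standing in for a real client call, and the percentile uses the nearest-rank definition, which is one of several common conventions.

```python
import asyncio
import math
import time
from typing import Awaitable, Callable, List, Sequence


def percentile(values: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest value with at least p% at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


async def timed(search: Callable[[str], Awaitable[object]], query: str) -> float:
    """Run one search and return its latency in seconds."""
    start = time.perf_counter()
    await search(query)
    return time.perf_counter() - start


async def run_agent_task(
    search: Callable[[str], Awaitable[object]], queries: Sequence[str]
) -> List[float]:
    """Issue all searches for one user request concurrently, as an agent would."""
    return list(await asyncio.gather(*(timed(search, q) for q in queries)))


if __name__ == "__main__":
    async def fake_search(query: str) -> None:
        await asyncio.sleep(0.01)  # stand-in for a real backend call

    latencies = asyncio.run(
        run_agent_task(fake_search, ["dependency docs", "error logs", "repo files", "issues"])
    )
    print(len(latencies), "calls, p95 =", percentile(latencies, 95))
```

Collecting latencies per task rather than per query keeps the number of search calls per task visible, which is the figure that multiplies under multi-agent load.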
The fifth check is authorization. A search backend is a powerful tool in an agent system. Indexes can contain customer records, internal documents, logs with secrets, security events, and personal data. OpenSearch Serverless provides IAM and network policies, but teams still need agent-level policy: which indexes can this agent search, which fields may be returned, and which retrieved snippets can be placed into a model prompt. In RAG systems, data leakage can happen silently through retrieval before the model ever generates a bad sentence.
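Agent-level policy of that kind can be enforced in application code before a request leaves the agent and again before a snippet reaches the prompt. The sketch below is illustrative only: `AgentSearchPolicy`, the index names, and the field names are hypothetical, and it complements IAM and network policies without replacing them.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class AgentSearchPolicy:
    """Hypothetical per-agent allowlist of indexes and returnable fields."""
    allowed_indexes: FrozenSet[str]
    returnable_fields: FrozenSet[str]


def scoped_request(policy: AgentSearchPolicy, index: str, query: Dict) -> Dict:
    """Build a request body that only asks for allowlisted fields.

    Raises PermissionError if the agent may not search `index`.
    """
    if index not in policy.allowed_indexes:
        raise PermissionError(f"agent may not search index {index!r}")
    # `_source` limits which document fields the search response returns.
    return {"_source": sorted(policy.returnable_fields), "query": query}


def redact_hit(policy: AgentSearchPolicy, source: Dict) -> Dict:
    """Drop non-allowlisted fields from a hit before it is placed in a prompt."""
    return {k: v for k, v in source.items() if k in policy.returnable_fields}


if __name__ == "__main__":
    policy = AgentSearchPolicy(
        allowed_indexes=frozenset({"runbooks"}),
        returnable_fields=frozenset({"title", "body"}),
    )
    print(scoped_request(policy, "runbooks", {"match": {"body": "failover"}}))
    print(redact_hit(policy, {"title": "DB failover", "api_key": "secret"}))
```

Filtering twice is deliberate: the request-side allowlist reduces what leaves the backend, and the response-side redaction guards against a misconfigured query slipping a sensitive field into the model context.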
Community reaction around OpenSearch Serverless has often centered on cost predictability. Since its earlier releases, users have asked whether minimum OCU usage makes small workloads unexpectedly expensive, especially in development and test environments. The next-generation launch addresses that topic directly, but the answer will come from billing data, not launch phrasing. Two teams with the same corpus size can get different results if one has steady ingestion and another has short query bursts separated by long idle periods.
This release also shows how AWS is assembling its agentic AI stack. Bedrock handles models and knowledge-base workflows. Kiro targets AI-assisted development. OpenSearch Serverless provides vector search, full-text search, and analytics infrastructure. CloudWatch and OpenSearch together cover observability. The new OpenSearch Serverless economics make more sense if AWS expects agent apps to call search backends more frequently and more unpredictably than older web-search or log-search products.
For AI developers, the announcement reopens the "which vector database should we use?" question. Independent vector databases can offer fast RAG setup and specialized retrieval workflows. Managed search platforms can combine full-text search, vector search, log analytics, IAM, and cloud networking in one place. The new OpenSearch Serverless generation adds cost and scaling improvements to the managed-search side. Teams already deep in AWS, Bedrock, CloudWatch, S3, IAM, and VPC networking have a clearer reason to test OpenSearch. Multicloud teams and small prototypes still need to compare pricing and setup overhead carefully.
Benchmarks should be split into three layers. First, retrieval quality: does hybrid search improve answer quality on the team's real corpus? Second, latency: what happens on the first query after scale-to-zero, in a warmed state, and during burst traffic? Third, total cost: add OpenSearch OCUs, storage, ingestion, embeddings, model tokens, network traffic, and observability overhead. The 60% savings claim belongs in the third layer, but it is only one row in a wider operating-cost table.
The OpenSearch Serverless update touches a less glamorous bottleneck in the agent market. Better models do not help if an agent waits on slow retrieval, pays for idle search capacity all night, or hits throttling when several tool calls run in parallel. AWS's answer is infrastructure: scale-to-zero, faster autoscaling, lower claimed costs, and higher query concurrency. Whether teams adopt it will depend on their corpus shape, burst pattern, permission model, and actual bill.
The larger meaning of the launch is that serverless search is being repositioned as a default component of AI applications. As agents use more tools, keep more memory, and search more logs, the search layer is called almost as often as the model layer. AWS placed OpenSearch Serverless inside examples that include Bedrock, Kiro, and Vercel AI SDK because it wants that layer to sit inside the agent development loop. The immediate task for builders is not to memorize the launch numbers. It is to inspect when their agents search, how often they sit idle, and how many queries they fire when a single user asks for work.