Devlery
Blog/AI

SageMaker opens an OpenAI-compatible door for enterprise LLM infrastructure

AWS SageMaker AI now supports OpenAI-compatible inference endpoints, moving enterprise LLM friction from model deployment toward API surfaces, IAM, and routing layers.

SageMaker opens an OpenAI-compatible door for enterprise LLM infrastructure
AI 요약
  • What happened: AWS added an OpenAI-compatible /openai/v1 path to SageMaker AI real-time inference endpoints.
    • Teams using the OpenAI SDK, LangChain, or Strands Agents can point familiar clients at SageMaker-hosted models by changing the endpoint URL.
  • Why it matters: The enterprise LLM battleground is shifting from model hosting alone toward the shared API interface between agents, gateways, and private inference.
  • Watch: The bearer token can last up to 12 hours, and teams still need to design IAM scope, endpoint costs, and Chat Completions compatibility carefully.
    • OpenAI-compatible does not mean every provider-specific feature, tool-call behavior, or streaming edge case is automatically identical.

AWS has opened a small but important door inside SageMaker AI. The door is an OpenAI-compatible API path. At first glance, the announcement sounds like a developer convenience feature: existing OpenAI SDK code can call a SageMaker endpoint with fewer changes. But the more important signal is architectural. The more enterprises want to keep models, GPUs, VPC boundaries, and IAM control inside their own cloud accounts, the more application frameworks are converging on an OpenAI-style interface at the top of the stack.

The AWS Machine Learning Blog announced on May 20, 2026 that SageMaker AI real-time inference endpoints now support an OpenAI-compatible API. The core path is /openai/v1/chat/completions. Developers using the OpenAI SDK, LangChain, or Strands Agents can change the endpoint URL and call a model hosted on SageMaker without rewriting application code around a SageMaker-specific client, a custom SigV4 wrapper, or a separate request translator.

That does not mean AWS has surrendered to OpenAI as a model provider. It means the layers are becoming clearer. The model might be Qwen, Llama, Mistral, or a fine-tuned internal model. The runtime might be vLLM, SGLang, or a custom container. The cloud might be AWS. But the application and agent framework increasingly want to speak one familiar dialect: something close to OpenAI Chat Completions.

Why one URL is news

In generative AI infrastructure, the phrase "just change the URL" is often too simple. Production integrations also depend on authentication, streaming, error shapes, model names, tool calls, logs, cost tags, and permission boundaries. That is why many teams still build gateways even when a provider says its API is OpenAI-compatible. The interesting part of this SageMaker release is that AWS is putting that compatibility path inside its managed real-time inference endpoint rather than leaving it to an external proxy.

The official example is direct. Existing OpenAI client code sets base_url to the SageMaker runtime address and puts a bearer token generated by the SageMaker SDK into the api_key field. The endpoint shape looks like this:

https://runtime.sagemaker.{REGION}.amazonaws.com/endpoints/{ENDPOINT_NAME}/openai/v1

For inference components, the component name appears in the path:

https://runtime.sagemaker.{REGION}.amazonaws.com/endpoints/{ENDPOINT_NAME}/inference-components/{IC_NAME}/openai/v1

The value is not only fewer lines of code. Many agent applications already abstract model calls around Chat Completions-shaped requests. Before this release, SageMaker inference often meant using InvokeEndpoint, signing requests with AWS credentials, and adapting payloads around the serving container. Now SageMaker absorbs part of that interface gap.

OpenAI SDK / LangChain / Strands Agents

↓ replace base_url

SageMaker Runtime /openai/v1/chat/completions

↓ validate IAM-backed bearer token

Model response from vLLM, SGLang, or a custom container

That diagram compresses three facts from the AWS documentation. First, the endpoint accepts Chat Completions-style requests. Second, authentication uses a bearer token produced by the SageMaker SDK. Third, the container still needs to implement /v1/chat/completions and streaming response behavior. The adapter between cloud runtime and application framework is becoming a platform feature.

The compatibility layer is winning

Reading this as "the OpenAI API became the standard" is only partly right. More precisely, the OpenAI-style wire format has become stronger than any single hosted model. Enterprises still care about sensitive data, latency, GPU reservations, internal fine-tuning, cost structure, regional controls, and compliance. Developers still want the agent frameworks, eval tools, tracing systems, and gateways they already use. The compromise is: keep the model in your account, but expose it through the interface the application layer already understands.

AWS's own use cases point in that direction. One is agentic workflows on owned infrastructure: LangChain or Strands Agents at the application layer, with inference running on dedicated SageMaker GPU endpoints. Another is multi-model hosting through one interface: a general Llama model, a fine-tuned Mistral model, and a smaller classifier can sit behind inference components while the caller stays close to OpenAI SDK semantics. A third is serving fine-tuned models without forcing application teams to rewrite the model access path.

All three are combinations of model diversity and interface consolidation. Models are multiplying. Frameworks are becoming more automated. Agents are making more tool calls and running longer jobs. If every model provider SDK gets hard-coded into product logic, operational complexity grows quickly. If every workload has to use one hosted frontier model, teams may lose data placement, cost, and model-selection flexibility. SageMaker's OpenAI-compatible endpoint is AWS's practical middle path.

AreaDirect SageMaker invocationOpenAI-compatible path
Application codeSageMaker client and request translation are usually requiredOpenAI-family clients mainly change the endpoint URL
AuthenticationThe application handles SigV4 request flow directlyThe caller uses a time-limited bearer token generated by the SageMaker SDK
Agent integrationFramework-specific adapters may be neededThe interface matches OpenAI-compatible clients used by LangChain and Strands Agents
Operational controlAWS account and endpoint control remain in placeAWS control remains while the caller adopts a common API shape

The security model is not a static API key

The "OpenAI-compatible" label can make developers imagine a long-lived static API key. SageMaker's approach is different. The SageMaker developer documentation describes bearer token authentication built on AWS credentials. The SageMaker Python SDK's generate_token function creates a short-lived token. According to the documentation, the default and maximum validity period is 12 hours, and applications can set a shorter lifetime with timedelta.

The token's structure is also important. AWS describes it as a base64-encoded SigV4 pre-signed URL. Token generation happens locally and does not require a network call. When the request reaches SageMaker, the service decodes the token, verifies the signature and expiration, and checks the permissions of the original IAM principal. Developers place a string where the OpenAI SDK expects an API key, but the security model is closer to a short-lived IAM-backed delegation.

That creates several practical requirements. The caller needs both sagemaker:InvokeEndpoint and sagemaker:CallWithBearerToken. AWS recommends narrowing InvokeEndpoint to specific endpoint ARNs. CallWithBearerToken, however, does not support resource-level restrictions and therefore requires Resource: "*". The documentation also warns that the bearer token has the same permissions as the AWS identity that generated it, so teams should not store it on disk, in environment variables, in logs, in databases, or in distributed caches.

This is where the announcement becomes more serious than SDK convenience. The OpenAI-compatible path reduces adapter code. It does not remove permission design. Teams still have to decide which role can generate tokens, which endpoints it can invoke, how long the token should live, and where logs might accidentally capture it.

vLLM and SGLang become part of the managed path

The supported container list is another signal. AWS names SageMaker AI vLLM Deep Learning Containers, SageMaker AI SGLang Deep Learning Containers, and custom containers that implement /v1/chat/completions plus /ping. That means AWS is not only exposing its own inference API. It is pulling open-source model-server conventions into the managed SageMaker path.

vLLM and SGLang are not just ways to "run a model." They represent a serving layer concerned with batching, streaming, structured outputs, latency, and GPU memory management. In open-weight model communities, the OpenAI-compatible endpoint has already become a useful bridge because it reduces client-code churn when teams change serving backends. SageMaker's release shows the same pressure reaching managed cloud deployment.

The AWS sample notebook makes the pattern concrete. It uses Qwen3-4B as the example model and walks through a single-model endpoint, inference components, and a Strands Agents multi-agent workflow. The notebook estimates 20 to 30 minutes including endpoint provisioning and roughly 5 to 10 dollars for less than one hour on ml.g6.12xlarge. Those numbers are a reminder that SageMaker endpoints are dedicated inference infrastructure. Calling them through the OpenAI SDK does not turn them into serverless per-token APIs.

What changes for agent teams

For agent teams, the immediate change is more flexibility in model routing. Many production teams already put OpenAI, Anthropic, Gemini, Bedrock, and self-hosted vLLM endpoints behind a gateway. The gateway hides provider-specific SDK details, while the application speaks a Chat Completions-like interface. If SageMaker itself exposes that interface, private AWS-hosted models can enter the same routing layer with less glue code.

Imagine a coding agent that uses different models for general reasoning, code review, issue triage, test-log summarization, and classification. A team might keep frontier-model calls for broad reasoning, but send customer-sensitive or internal-code-heavy tasks to a SageMaker endpoint inside its AWS account. If the application does not have to deeply understand provider-specific payloads, model choice can become a policy decision based on data sensitivity, cost, latency, or evaluation results.

The same applies to observability. As OpenAI-compatible interfaces spread, tracing, eval, prompt logging, and gateway policy tools will continue to optimize around common request and response shapes. AWS needs SageMaker endpoints to fit that ecosystem. Developers need private models to fit existing evaluation and debugging pipelines without rewriting every tool.

The caveat is that compatibility is not equivalence. This release centers on Chat Completions. Tool-call details, structured output behavior, reasoning traces, multimodal inputs, provider-specific safety controls, and subtle response-field differences still depend on the model server and framework adapter. A simple chat completion may work quickly, while a production agent loop can fail in streaming chunks, retry behavior, tool-call parsing, or error handling.

AWS is defending and attacking at the same time

AWS already offers Bedrock for managed foundation model APIs, SageMaker for training and deployment, and agent-related layers such as AgentCore and Strands Agents. The OpenAI-compatible SageMaker endpoint fills a gap between those worlds. It is defensive because it reduces the chance that a team abandons private AWS inference only because its application code already uses OpenAI SDK patterns. It is offensive because it tells teams that OpenAI-compatible applications can still run against GPUs, IAM, and endpoints in their own AWS account.

The competitive map is broader than OpenAI versus AWS. Vertex AI, Azure AI Foundry, Together AI, Fireworks AI, self-hosted vLLM deployments, and LLM gateway vendors all face the same question: if application developers already expect an OpenAI-compatible interface, who can offer the least-friction mix of model diversity, data control, cost predictability, and security policy?

That is also why gateways such as Bifrost, OpenRouter, LiteLLM, Portkey, and Vercel AI Gateway keep showing up in infrastructure conversations. The interest around multi-provider OpenAI-compatible gateways is a signal that developers want a common connection layer more than they want to memorize every model provider SDK. Once SageMaker exposes the path officially, gateways do not have to exist only to make SageMaker look like OpenAI. They can focus more on routing, fallback, budgets, observability, and policy.

Four checks before production

First, design token generation and refresh. Long-running applications can create a new token per request or use an auto-refresh pattern such as an httpx.Auth implementation. Persisting the token in logs, environment variables, or shared caches goes against the AWS guidance.

Second, narrow IAM authority. Limit InvokeEndpoint to endpoint ARNs where possible, and make sure the role that creates tokens is not overpowered. Because CallWithBearerToken requires a wildcard resource, the scope of the original role matters even more.

Third, do not misunderstand the cost model. The AWS blog and documentation both caution that SageMaker endpoints incur charges while they are running, even without traffic. The OpenAI SDK client does not change the underlying endpoint economics. Dedicated endpoints can be the right choice for control, performance, and data placement, but idle cost and capacity planning remain.

Fourth, test compatibility at the framework level. A basic OpenAI SDK chat completion can pass while an agent framework still breaks around streaming chunks, tool call serialization, error objects, retry semantics, or partial execution recovery. Production agents expose integration problems in the loop, not only in one response.

The model war has an API layer underneath it

SageMaker's OpenAI-compatible API support is not a flashy model launch. There is no new benchmark table and no new foundation model brand. But infrastructure announcements like this can have a long tail because API shapes pull frameworks, gateways, eval tools, logs, and operating policy behind them.

The center of the story is not AWS accepting OpenAI's models. It is AWS accepting OpenAI-style API syntax as a practical interface layer for enterprise AI infrastructure. The model runs in a SageMaker endpoint inside an AWS account. The application speaks through the OpenAI SDK. Permissions are controlled through IAM and short-lived bearer tokens. That combination is not perfect, but it is close to the compromise many enterprise AI teams actually want.

More announcements like this are likely. Model providers will push their APIs. Clouds will push their runtimes. Developers will keep asking to change less application code. In that tension, the winning artifact may not be a branded SDK. It may be the connection grammar that the most tools already understand. SageMaker opening a /openai/v1 door shows that agent infrastructure competition is happening below the leaderboard, at the API surface where models meet real applications.