Devlery
Blog/AI

SageMaker Adds OpenAI API Support for AWS-Hosted Models

AWS SageMaker now supports /openai/v1 endpoints, lowering the migration cost for OpenAI SDK, LangChain, Strands Agents, and AI gateways.

SageMaker Adds OpenAI API Support for AWS-Hosted Models
AI 요약
  • What happened: AWS added an /openai/v1 route to SageMaker AI real-time inference endpoints.
    • OpenAI SDK, LangChain, and Strands Agents apps can point at SageMaker-hosted models by changing the endpoint URL and bearer token.
  • Builder impact: The OpenAI API shape is hardening into a shared deployment contract for LLM apps and agent gateways.
  • Watch: The bearer token carries AWS authority, and SageMaker endpoints cost money while InService, even without traffic.
    • AWS docs recommend scoping InvokeEndpoint to specific endpoint ARNs and avoiding token storage or logging.

AWS added OpenAI-compatible API support to Amazon SageMaker AI real-time inference endpoints on May 20, 2026. The AWS Machine Learning Blog announcement says OpenAI SDK, LangChain, and Strands Agents users can call a SageMaker endpoint without a custom client, SigV4 wrapper, or application rewrite. The main change is not the model itself. It is the request contract. If an existing app is organized around client.chat.completions.create(), the app can now target an endpoint inside an AWS account by changing base_url and the authentication token.

AWS published a What's New availability notice on May 21, 2026. The supported regions include US East, US West, Europe, Canada, South America, and Asia Pacific Seoul. For teams already using AWS, the practical question is not only whether SageMaker can host a model. It is whether an LLM prototype built against an OpenAI-style client can move into a production boundary with VPC controls, IAM permissions, CloudWatch logging, autoscaling, and dedicated endpoint capacity.

The route is intentionally simple. The SageMaker developer guide says real-time inference endpoints expose /openai/v1/chat/completions. The AWS blog describes the shorter base path as /openai/v1. The OpenAI Python SDK example uses a base URL in this shape:

from openai import OpenAI
from sagemaker.core.token_generator import generate_token

REGION = "us-west-2"
sme_base_url = (
    f"https://runtime.sagemaker.{REGION}.amazonaws.com"
    f"/endpoints/{SME_ENDPOINT_NAME}/openai/v1"
)

client = OpenAI(
    base_url=sme_base_url,
    api_key=generate_token(region=REGION),
)

In that code, the model field is not the main routing key. SageMaker routes the request through the endpoint name in the URL, while the model value is passed through to the container. AWS says it can be left empty or set to the model name expected by the container. That differs from public OpenAI API usage, where the model name usually determines the provider-side route. In SageMaker, the AWS resource boundary comes first: endpoint, inference component, and container.

OpenAI SDK / LangChain / Strands Agents

base_url: SageMaker endpoint /openai/v1

bearer token: client-side SigV4 signing from AWS credentials

SageMaker real-time endpoint / inference component / container

Authentication is the less visible half of the compatibility story. The caller passes a string that looks like an OpenAI API key, but the SageMaker bearer token is a base64-encoded SigV4 pre-signed URL generated by generate_token in the SageMaker Python SDK. It is created from existing AWS credentials. The default lifetime is 12 hours, and the expiry parameter can shorten it to anywhere from 1 second to 12 hours. Token generation happens through client-side signing, so creating the token does not require a separate network call.

That design is convenient, but it gives security teams specific work. AWS docs state that the bearer token has the same authorization as the underlying AWS credentials used to generate it. A broadly privileged role creates a broadly privileged token. AWS therefore recommends restricting sagemaker:InvokeEndpoint to the specific endpoint ARN and avoiding tokens generated from expansive permissions such as AdministratorAccess or AmazonSageMakerFullAccess. There is also an awkward detail: sagemaker:CallWithBearerToken does not support resource-level restriction, so the policy needs Resource: "*".

Token handling also cannot be treated like ordinary application configuration. AWS tells users not to store the bearer token on disk, in environment variables, in config files, in databases, or in distributed caches, and not to log it. For long-running applications, AWS points to auto-refresh patterns such as httpx.Auth, where the app generates a fresh token close to each request. "OpenAI-compatible" does not mean SageMaker has adopted OpenAI-style authentication. Developers can use an OpenAI client, but IAM roles, endpoint ARNs, and the AWS credential chain still define the operating boundary.

AWS's first highlighted use case is agentic workflow migration. A team using Strands Agents or LangChain for multi-step agents can keep the OpenAI-compatible client interface while running inference on a dedicated GPU instance inside the customer's AWS account. The AWS blog includes a customer quote from Caffeine.AI, which says its Bifrost gateway already talks to multiple LLM providers and can add SageMaker as a drop-in endpoint for Vercel AI SDK and standard OpenAI clients using the bearer-token feature.

The second use case is multi-model hosting. SageMaker inference components let teams place several model components under the same endpoint, each with its own resource allocation. A single endpoint might host Llama, a fine-tuned Mistral model, and a smaller classifier. The OpenAI SDK client can call a specific component by putting the inference component name in the URL path. Instead of swapping provider-specific clients in application code, the endpoint and component path become the routing surface.

AreaPrevious SageMaker invocationOpenAI-compatible invocation
ClientSageMaker runtime client and custom payloadsOpenAI SDK, LangChain, Strands Agents
AuthenticationAWS SDK with SigV4 signingShort-lived bearer token generated from AWS credentials
RoutingEndpoint invocation APIEndpoint URL and inference component path
Operational tradeoffAWS-native, but LLM apps often need client glue codeEasier app migration, with IAM and endpoint-cost management still required

AWS also named the supported serving containers. SageMaker AI vLLM Deep Learning Containers and SGLang Deep Learning Containers are supported, and custom containers can work if they implement the OpenAI API path and /ping. The AWS blog example deploys Hugging Face's Qwen/Qwen3-4B with a vLLM container on an ml.g6.2xlarge instance. That example is a useful boundary marker: this feature is not a call-through to an external model provider API. It is an OpenAI-style interface in front of open model serving containers hosted on AWS infrastructure.

The cost model deserves attention because API compatibility can hide it. AWS docs warn that SageMaker AI endpoints incur charges while they are InService, regardless of whether they receive traffic. Teams used to per-token hosted APIs need to price endpoint uptime, instance type, autoscaling policy, and inference component allocation separately. For short experiments, a hosted API may remain cheaper. For sustained traffic, data residency, or dedicated-capacity requirements, a SageMaker endpoint may make operational sense. The announcement reduces migration friction; it does not automatically reduce inference cost.

The broader developer signal is that the OpenAI API has become more than one company's API. vLLM and SGLang already treat OpenAI-compatible serving as a core selling point. OpenRouter, LiteLLM, Portkey, and Vercel AI Gateway all build provider routing around similar compatibility assumptions. By adding /openai/v1 directly to SageMaker endpoints, AWS is not routing around that ecosystem. It is absorbing the client contract that LLM apps and agent frameworks have already standardized on.

SageMaker also needs to be separated from Bedrock. Bedrock is closer to managed foundation model access, guardrails, agents, and marketplace-style integrations. SageMaker works at a lower operational layer: customer-owned model artifacts, custom containers, dedicated endpoints, and inference components. The OpenAI-compatible endpoint does not replace Bedrock. It gives teams running fine-tuned open models, self-hosted vLLM or SGLang containers, or internal model-serving stacks a cleaner way to attach those systems to the OpenAI SDK ecosystem.

For agent teams, gateway architecture becomes simpler. An app can put the public OpenAI API, Anthropic, Bedrock, a local vLLM server, and a SageMaker endpoint behind a router such as LiteLLM or Bifrost. Previously, adding SageMaker often meant writing a SigV4 wrapper or provider-specific adapter. With an OpenAI-compatible SageMaker path, the number of backend providers can grow while the client contract narrows. That difference shows up during incident response, authorization review, and regression testing.

Compatibility still has limits. The minimum common surface is Chat Completions and streaming. Tool calling, structured outputs, multimodal input, embeddings, batch processing, response metadata, and safety filter semantics can vary by provider and container implementation. AWS docs specify that supported containers must implement the /v1/chat/completions path and Server-Sent Events streaming responses. If an app depends on detailed OpenAI API behavior, a URL change alone should not be treated as proof of equivalent output or error handling.

An operations checklist falls out of the announcement. First, confirm that the IAM role generating tokens has endpoint-scoped minimum permissions. Second, verify that gateways and agent runners never log the bearer token. Third, price endpoint uptime and autoscaling against real traffic patterns rather than assuming per-token economics. Fourth, regression-test whether the vLLM or SGLang container matches the app's actual tool-call, streaming, timeout, and error-shape expectations.

This is infrastructure news rather than a generic SageMaker tutorial. AWS is treating OpenAI compatibility as the deployment grammar that AI applications already use, not merely as a rival vendor's API. That brings SageMaker's AWS account boundary, dedicated GPU capacity, IAM controls, and inference components closer to the OpenAI SDK world. AI builders now have one more place to ask the practical question: how far can the same client code move, and what authority, cost, and observability obligations remain after it moves?