QVAC AI SDK Provider lets local models plug into AI SDK apps
Tether QVAC published a Vercel AI SDK provider for local OpenAI-compatible servers. Here is what it changes for agents, TypeScript apps, and local AI routing.
- What happened: Tether's QVAC repository published @qvac/ai-sdk-provider 0.1.0.
  - The GitHub release appeared on June 1, 2026, and the npm package lists ai@^6.0 and @ai-sdk/openai-compatible@^2.0 as peer dependencies.
- Developer impact: QVAC's local runtime can now sit behind AI SDK calls such as streamText, generateText, embed, and transcribe.
- Watch the limits: v0.1.0 only supports external mode, so developers must run qvac serve openai themselves.
  - The README warns that port 11434 collides with Ollama and that smaller 4B/8B local models may be unreliable for coding-agent tool calls.
Tether's QVAC repository published the @qvac/ai-sdk-provider 0.1.0 release on June 1, 2026. On the same day, Tether announced QVAC SDK 0.12.0 with a TurboQuant implementation, but the most concrete new integration point for application developers is the GitHub release and npm package. QVAC's local runtime now has an official Vercel AI SDK provider, which means local models can be addressed through the same TypeScript call surface that many AI apps already use for cloud providers.
This is not another explainer about the TurboQuant algorithm. devlery has already covered Google Research's original TurboQuant work and KV cache compression. The event here is narrower and more practical: QVAC is starting to connect a local inference engine, an OpenAI-compatible HTTP server, a Vercel AI SDK provider, and coding-agent integration notes into one path. If local AI is going to move from demos into everyday development workflows, the connection surface has to become as repeatable as the model download.

What the 0.1.0 release actually ships
The GitHub release describes @qvac/ai-sdk-provider as a Vercel AI SDK provider for the QVAC local AI runtime. The developer first starts a local HTTP server with qvac serve openai, then configures the provider's baseURL to point at that server. After that, QVAC model aliases can be used from AI SDK APIs such as streamText, generateText, embed, transcribe, and generateImage.
The npm package description is shorter but consistent. It identifies the package as a Vercel AI SDK provider for the QVAC local AI runtime and names chat, embeddings, transcription, translation, speech, OCR, and image generation as the intended scope. The package uses the Apache-2.0 license. Its peer dependencies are ai@^6.0 and @ai-sdk/openai-compatible@^2.0, so QVAC is not trying to duplicate the AI SDK. It is joining the provider ecosystem around it.
The README code makes that architecture explicit. createQvac() wraps createOpenAICompatible. The provider name is qvac, the default API key is the literal string qvac, and the base URL is supplied by the developer. Vercel's AI SDK documentation uses the same general pattern for OpenAI-compatible providers: create a provider, pass a model identifier, and call generateText or another SDK primitive. QVAC's package adds QVAC naming, defaults, model metadata hooks, and a planned path toward managed mode.
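The defaults described above can be sketched in a few lines. This is an illustration of the behavior the README describes, not the package's source: resolveQvacSettings and both interfaces are hypothetical names, and the only values taken from the release are the provider name qvac, the default API key qvac, and a developer-supplied base URL.

```typescript
// Hypothetical sketch of the defaults the README describes.
// This is NOT the actual source of @qvac/ai-sdk-provider.
interface QvacSettings {
  baseURL: string; // where `qvac serve openai` is listening
  apiKey?: string; // optional; the documented default is the literal "qvac"
}

interface ResolvedSettings {
  name: string;
  baseURL: string;
  apiKey: string;
}

function resolveQvacSettings(settings: QvacSettings): ResolvedSettings {
  return {
    // Provider name that shows up inside AI SDK applications.
    name: "qvac",
    // Strip trailing slashes so route paths can be appended consistently.
    baseURL: settings.baseURL.replace(/\/+$/, ""),
    // Fall back to the literal default key when none is supplied.
    apiKey: settings.apiKey !== undefined ? settings.apiKey : "qvac",
  };
}

// In a real app, values like these would be handed to createOpenAICompatible
// from @ai-sdk/openai-compatible, or simply passed as createQvac({ baseURL }).
const resolved = resolveQvacSettings({ baseURL: "http://localhost:11434/v1/" });
```

The sketch makes the point of the wrapper visible: the only thing the developer must supply is the URL of the running server.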
That thin wrapper still matters. Teams building AI apps need a provider name, model catalog conventions, authentication defaults, documented base URLs, and framework examples. A server may be reachable through @ai-sdk/openai-compatible directly, but one-off configuration rarely becomes a team standard. @qvac/ai-sdk-provider gives the local runtime an official label inside AI SDK applications.
QVAC is aiming beyond a simple chat endpoint
The QVAC HTTP server documentation is the foundation for the provider release. It tells developers to install @qvac/sdk and @qvac/cli, run qvac serve openai, and expose a local OpenAI-compatible API. The example URL is http://localhost:11434/v1/. Requests are routed to model aliases configured in qvac.config.*.
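For readers who want to see what "OpenAI-compatible" means on the wire, here is a minimal sketch that builds, but does not send, a streaming chat request against the example URL. It assumes the server follows the standard OpenAI path /chat/completions under /v1; the alias my-local-chat and the helper buildChatRequest are hypothetical.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface PreparedRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Builds a streaming chat completions request without sending it.
// Assumption: the route is the standard OpenAI path under the /v1 base URL.
function buildChatRequest(
  baseURL: string,
  modelAlias: string,
  messages: ChatMessage[],
): PreparedRequest {
  const root = baseURL.replace(/\/+$/, "");
  return {
    url: root + "/chat/completions",
    method: "POST",
    headers: { "Content-Type": "application/json" },
    // `model` carries the alias configured in qvac.config.*
    body: JSON.stringify({ model: modelAlias, messages: messages, stream: true }),
  };
}

// "my-local-chat" is a made-up alias used only for illustration.
const chatRequest = buildChatRequest("http://localhost:11434/v1/", "my-local-chat", [
  { role: "user", content: "Summarize this file." },
]);
```

Switching an existing OpenAI-style client to a local server is, in the simple case, a change to the first argument only.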
The endpoint list is broader than a minimal chat server. The documentation names chat, responses, completions, embeddings, audio, image, files, and vector store routes under /v1. QVAC is not only trying to make text generation look like OpenAI. It is also trying to fit RAG, file workflows, speech, and image generation into the same compatibility surface.
The docs already name tools that depend on specific parts of that surface. Continue.dev needs streaming SSE and /v1/models. LangChain uses chat completions and embeddings. Open Interpreter needs chat completions with streaming and tool calls. That compatibility table makes QVAC's target clear: not a private SDK that only works in one application, but a local runtime that OpenAI-compatible tools can reach by changing the base URL.
| Component | Role in this release | What developers need to verify |
|---|---|---|
| qvac serve openai | Exposes local models through OpenAI-compatible HTTP endpoints. | Model aliases, ctx_size, authentication, and ports remain explicit configuration. |
| @qvac/ai-sdk-provider | Adds an AI SDK provider name and a createQvac() factory. | v0.1.0 supports external mode only, and the default base URL is a placeholder. |
| Coding agents | Targets harnesses such as OpenCode, Cline, Aider, and Continue. | Small instruct models may fail to produce reliable tool calls. |
The bottleneck for local AI is often the API shape
Local-model discussions usually start with model performance. Developers compare parameter counts, Q4 quantization, Metal or Vulkan support, and token throughput. In application code, a more immediate failure often comes from the API shape. One model server mimics only OpenAI chat completions. Another uses a different tool-call schema. A third exposes embeddings and image generation under separate conventions. The AI app slowly accumulates provider-specific branches.
The Vercel AI SDK provider abstraction exists for that problem. The application calls relatively stable functions such as generateText or streamText, and the provider handles runtime-specific details such as authentication, base URLs, model IDs, and streaming formats. When QVAC joins that layer, its local server becomes a provider candidate inside existing AI SDK apps rather than a separate tool that has to be wired by hand each time.
That is useful for small teams. Internal document summarization, local code analysis, offline transcription, and image OCR often involve sensitive inputs that cannot casually be sent to a cloud API. If QVAC exposes an OpenAI-compatible server and the provider exposes the AI SDK surface, teams can compare local and cloud providers inside the same application. High-quality paths can remain on frontier APIs, while privacy-sensitive or cost-sensitive paths can route to local models.
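A routing rule of that kind can be very small. The sketch below is one possible policy, not something QVAC or the AI SDK prescribes; the task kinds and the chooseTarget function are hypothetical.

```typescript
type Target = "local" | "cloud";

interface Task {
  kind: "summarize" | "transcribe" | "ocr" | "code-agent" | "reasoning";
  containsSensitiveData: boolean;
}

// One possible policy: privacy wins first, then task difficulty, then cost.
function chooseTarget(task: Task): Target {
  // Sensitive inputs stay on the machine regardless of task type.
  if (task.containsSensitiveData) return "local";
  // Hard agentic or reasoning work goes to a frontier API in this sketch.
  if (task.kind === "code-agent" || task.kind === "reasoning") return "cloud";
  // Everything else is routed locally to save cost.
  return "local";
}

const internalDoc = chooseTarget({ kind: "summarize", containsSensitiveData: true });
const publicAgent = chooseTarget({ kind: "code-agent", containsSensitiveData: false });
```

Because both targets sit behind the same AI SDK call surface, the return value only has to select which provider instance the application passes to generateText or streamText.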
The hybrid design does not automatically produce a strong product. The QVAC README presents v0.1.0 as plumbing more than a finished managed runtime. Actual agent performance still depends on the local model. The README says Q4-quantized 4B/8B Qwen3-Instruct models may be able to chat but may not reliably emit tool calls, and that dependable local tool use likely needs a 14B-class or larger model with coder or agent post-training. That warning is more useful than a generic local-AI promise.
The port 11434 warning is not a footnote
One of the most practical details in the release is the default base URL warning. The QVAC CLI currently uses 11434 as the default port for qvac serve. The README notes that this collides with Ollama, so the provider's default base URL is a placeholder: http://127.0.0.1:11435/v1. Developers are expected to pass the actual qvac serve URL into createQvac({ baseURL }).
That is more than setup trivia. A developer laptop can now have Ollama, LM Studio, a llama.cpp server, QVAC, a local proxy, and MCP servers running at the same time. If multiple tools claim similar OpenAI-compatible routes and ports, an app can send requests to the wrong server and still look partially functional. If model aliases overlap or API key checks are loose, the failure can be quiet.
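A startup check can catch the most obvious version of this problem. The sketch below assumes the application keeps its own list of expected local services; findPortCollisions and the LM Studio port shown are illustrative, and only the 11434 overlap between Ollama and qvac serve comes from the README.

```typescript
interface LocalService {
  label: string;
  port: number;
}

interface PortCollision {
  port: number;
  labels: string[];
}

// Returns every port claimed by more than one service, with the labels involved.
function findPortCollisions(services: LocalService[]): PortCollision[] {
  const byPort: Record<number, string[]> = {};
  for (const service of services) {
    if (!byPort[service.port]) byPort[service.port] = [];
    byPort[service.port].push(service.label);
  }
  const collisions: PortCollision[] = [];
  for (const key of Object.keys(byPort)) {
    const port = Number(key);
    if (byPort[port].length > 1) {
      collisions.push({ port: port, labels: byPort[port] });
    }
  }
  return collisions;
}

// The lm-studio entry is an illustrative value, not taken from the QVAC docs.
const collisions = findPortCollisions([
  { label: "ollama", port: 11434 },
  { label: "qvac serve", port: 11434 },
  { label: "lm-studio", port: 1234 },
]);
```

A check like this only compares configuration. It does not prove which process actually answers on a port, so a request to /v1/models at startup is still a sensible second step.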
The QVAC docs also spell out authentication behavior. By default, requests from 127.0.0.1 do not need authentication. If the developer passes --api-key, the server can require a bearer token. Localhost is still a meaningful boundary, but it is not a complete security model when desktop apps, browser extensions, local web apps, and agent harnesses share one machine. Teams need to review CORS, public base URLs, API keys, and file endpoint scope together.
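The client side of that review can be sketched as a small configuration check. The helper names here are hypothetical, and the rule is only an example: attach a bearer token when a key is configured, and warn when the base URL is not a loopback address and no key is set.

```typescript
// True for http(s) URLs whose host is localhost, 127.0.0.1, or [::1].
function isLoopbackBaseURL(baseURL: string): boolean {
  return /^https?:\/\/(localhost|127\.0\.0\.1|\[::1\])(:\d+)?(\/|$)/.test(baseURL);
}

// Adds an Authorization header only when an API key is configured.
function buildAuthHeaders(apiKey?: string): Record<string, string> {
  const headers: Record<string, string> = { "Content-Type": "application/json" };
  if (apiKey) headers["Authorization"] = "Bearer " + apiKey;
  return headers;
}

// Returns human-readable warnings for configurations that deserve a second look.
function checkAuthConfig(baseURL: string, apiKey?: string): string[] {
  const warnings: string[] = [];
  if (!isLoopbackBaseURL(baseURL) && !apiKey) {
    warnings.push("non-loopback base URL without an API key");
  }
  return warnings;
}

const localWarnings = checkAuthConfig("http://127.0.0.1:11434/v1");
const lanWarnings = checkAuthConfig("http://192.168.1.20:11434/v1");
```

This covers only what the calling app can see. CORS settings and file endpoint scope live on the server side and need their own review.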
How the TurboQuant announcement connects
Tether's June 1, 2026 press release said QVAC SDK 0.12.0 includes a TurboQuant implementation. The release describes a 4B model using roughly 8GB of KV cache at a context of about 262,000 tokens and gives the example of four sessions consuming up to 32GB for cache alone. Tether says TurboQuant can compress KV cache by up to 5x while preserving output quality close to that of the uncompressed model.
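The arithmetic behind those figures is simple enough to write down. The sketch below uses only the numbers quoted from the press release and treats the 5x ratio as a best case, since Tether states it as "up to" 5x; kvCacheTotalGB is a hypothetical helper.

```typescript
// Total KV cache in GB for a number of concurrent sessions.
// compressionRatio = 1 means uncompressed; 5 is the best case Tether quotes.
function kvCacheTotalGB(
  perSessionGB: number,
  sessions: number,
  compressionRatio: number,
): number {
  return (perSessionGB * sessions) / compressionRatio;
}

const uncompressedGB = kvCacheTotalGB(8, 4, 1); // 32
const bestCaseGB = kvCacheTotalGB(8, 4, 5); // 6.4
```

At the quoted best case, four long sessions would need about 6.4GB of cache instead of 32GB. Model weights and other runtime memory are not included in either number.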
Google Research first announced TurboQuant on March 24, 2026. Google described it as an algorithm for reducing KV cache bottlenecks and vector-search memory overhead, with the work scheduled to appear at ICLR 2026. That research already stood on its own. From the QVAC angle, the new detail is that the compression idea is now being packaged into a local runtime that developers can install and route to.
Local AI apps are not only about loading model weights. Long conversations, codebase analysis, document RAG, transcription followed by summarization, and OCR post-processing all produce longer sessions and more intermediate state. TurboQuant attacks part of the memory bottleneck. The AI SDK provider attacks the integration cost. One is about fitting more context into local hardware; the other is about making that runtime easier to call from TypeScript apps.
It would still be wrong to read the two announcements as evidence that local AI is about to replace cloud APIs. Tether's own press release acknowledges that large compute remains important. The README's warnings about small-model tool use point in the same direction. Local models are strong candidates for privacy, offline access, and cost control. Frontier cloud models may still be the better choice for complex coding agents and hard reasoning tasks. The new choice is routing, not replacement.
Failure modes for coding agents
The QVAC README gives coding-agent integration its own section. The first failure mode is concurrent requests. The underlying llama.cpp addon serializes inference per native model context, and it may reject a new job when another one is already running. Coding agents often issue overlapping calls for the main response, title generation, summaries, and tool-result processing. The README suggests loading the same model file under different aliases if parallel inference is needed right now.
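On the application side, the alias workaround can be paired with a small pool that hands out one alias per in-flight request. This is a sketch under the assumption that each alias maps to its own model context; AliasPool and the alias names are hypothetical and not part of QVAC.

```typescript
// Tracks which aliases currently have a job running.
class AliasPool {
  private aliases: string[];
  private busy: Record<string, boolean> = {};

  constructor(aliases: string[]) {
    this.aliases = aliases;
  }

  // Returns a free alias, or undefined when every alias already has a job.
  acquire(): string | undefined {
    for (const alias of this.aliases) {
      if (!this.busy[alias]) {
        this.busy[alias] = true;
        return alias;
      }
    }
    return undefined;
  }

  // Marks an alias as free again once its request has finished or failed.
  release(alias: string): void {
    this.busy[alias] = false;
  }
}

// Two made-up aliases that would point at the same model file.
const pool = new AliasPool(["coder-a", "coder-b"]);
const firstSlot = pool.acquire(); // "coder-a"
const secondSlot = pool.acquire(); // "coder-b"
const thirdSlot = pool.acquire(); // undefined: caller should queue or retry
```

When acquire returns undefined, the caller decides whether to wait, retry, or fall back to another provider, which is better than letting the server reject the overlapping job. Each extra alias presumably costs additional memory, so the pool size is a hardware decision as much as a software one.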
The second failure mode is context size. The documented default ctx_size for QVAC LLMs is 1024 tokens. That can work for a short chat, but it is not enough for a coding-agent harness. A tool-heavy agent can send a system prompt, more than 10 tool definitions, recent messages, and file snippets in its first request. The README recommends explicit values such as 16384 for chat and 4096 for title generation.
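A rough budget check shows why the default falls short. The sketch below uses a crude estimate of about four characters per token, which is an assumption and not how any real tokenizer works; the function names and the sample sizes are hypothetical, while 1024 and 16384 are the ctx_size values discussed above.

```typescript
// Crude heuristic: roughly 4 characters per token. Real tokenizers differ.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

interface AgentRequestParts {
  systemPrompt: string;
  toolDefinitions: string[];
  messages: string[];
}

// True when the estimated prompt plus the reserved output budget fits in ctxSize.
function fitsContext(
  parts: AgentRequestParts,
  ctxSize: number,
  reservedForOutput: number,
): boolean {
  let promptTokens = estimateTokens(parts.systemPrompt);
  for (const tool of parts.toolDefinitions) promptTokens += estimateTokens(tool);
  for (const message of parts.messages) promptTokens += estimateTokens(message);
  return promptTokens + reservedForOutput <= ctxSize;
}

// Builds a placeholder string of n characters.
function filler(n: number): string {
  return new Array(n + 1).join("x");
}

// Illustrative first request: one system prompt, 12 tool schemas, one message.
const tools: string[] = [];
for (let i = 0; i < 12; i++) tools.push(filler(600));
const firstRequest: AgentRequestParts = {
  systemPrompt: filler(2000),
  toolDefinitions: tools,
  messages: [filler(1200)],
};

const fitsDefault = fitsContext(firstRequest, 1024, 1024); // false
const fitsRecommended = fitsContext(firstRequest, 16384, 1024); // true
```

Under these made-up sizes the prompt alone is estimated at 2,600 tokens, more than twice the default window before the model has produced any output.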
The third failure mode is reasoning output. Reasoning-tuned models such as Qwen3 or DeepSeek-R1 variants can emit <think> blocks. If the host does not support a separate reasoning channel, those tokens appear in the UI and add latency for content the user may never need to see. The README recommends setting reasoning_budget: 0 at the addon level. That setting requires QVAC SDK 0.11.0 or later and a CLI version pinned to that SDK, such as CLI 0.5.0 or later.
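Where the addon-level setting is not available, a host can at least hide completed reasoning blocks before display. The sketch below is a host-side fallback and not the README's recommendation; stripThinkBlocks is a hypothetical helper that only handles closed <think>...</think> pairs in a finished response, not partial blocks in a live stream.

```typescript
// Removes complete <think>...</think> blocks and the whitespace that follows them.
// Unclosed blocks (for example, mid-stream) are left untouched.
function stripThinkBlocks(text: string): string {
  return text.replace(/<think>[\s\S]*?<\/think>\s*/g, "");
}

const raw = "<think>The user wants a short answer.</think>\nUse a 16384-token context.";
const visible = stripThinkBlocks(raw);
```

Filtering hides the tokens but does not recover the latency spent generating them, which is why disabling reasoning at the runtime level is the stronger fix.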
These are the operational questions that matter when local AI enters a product. Will concurrent requests queue or fail? How large should context be for each agent role? Should reasoning tokens be hidden or disabled? How will the app detect missing tool calls from smaller models? What happens to volatile Responses API state after a server restart? "It runs locally" is only the first checkpoint.
What teams should evaluate now
QVAC provider 0.1.0 looks more like an early integration standard than a production-ready managed runtime. The useful part is its transparency. The README says external mode is the only supported mode today, states that the default base URL is a placeholder, and marks the model catalog as a 0.1.0 placeholder. It also documents the limitations of small local models for tool use. That kind of boundary-setting is valuable when evaluating local AI tools.
The weakness is how much of that setup remains manual. Developers still have to handle installation, server startup, model configuration, port selection, alias mapping, and per-agent model slots themselves. The README says a future 0.2.0 release may add managed mode, where the provider auto-spawns or supervises the serve process, but that is not available in v0.1.0. A product team shipping to non-technical users would still need installer flows, model download management, updates, failure recovery, and disk-usage visibility.
Ollama is the obvious comparison point. It is already widely used for local model execution and OpenAI-compatible APIs. LM Studio offers a desktop UI and a local server. QVAC's differentiators are its P2P ambitions, broader media endpoint scope, and the attempt to package the SDK, CLI, and AI SDK provider inside one ecosystem. Developer adoption will be decided less by that positioning and more by repeatable execution, model catalog quality, package update cadence, and real compatibility with agent harnesses.
For teams testing it now, the safest path is to start with narrow tasks. Internal document summaries, small embedding jobs, transcription, and OCR are better first candidates than full coding-agent backends. For coding agents, evaluate a 14B-class or larger coder/agent model, enough RAM or VRAM, longer context windows, concurrent-request behavior, and cloud fallback. If a local model silently drops tool calls, the result is not a cheaper cloud substitute. It is automation that can fail without a clear signal.
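One way to turn a silent failure into a visible one is to inspect each response when a tool call was required. The sketch below assumes the standard OpenAI chat completion shape, where tool calls appear under choices[0].message.tool_calls with JSON-encoded arguments; shouldFallBackToCloud and the read_file tool name are hypothetical.

```typescript
interface ToolCall {
  id: string;
  type: "function";
  function: { name: string; arguments: string };
}

interface ChatCompletion {
  choices: { message: { content: string | null; tool_calls?: ToolCall[] } }[];
}

// Returns true when a tool call was required but the response has none,
// or when any returned call carries arguments that are not valid JSON.
function shouldFallBackToCloud(
  response: ChatCompletion,
  toolCallRequired: boolean,
): boolean {
  if (!toolCallRequired) return false;
  const first = response.choices[0];
  if (!first) return true;
  const calls = first.message.tool_calls;
  if (!calls || calls.length === 0) return true;
  for (const call of calls) {
    try {
      JSON.parse(call.function.arguments);
    } catch (e) {
      return true;
    }
  }
  return false;
}

// A model that answers in prose instead of calling the tool.
const proseOnly: ChatCompletion = {
  choices: [{ message: { content: "I will read the file now." } }],
};

// A well-formed call to a made-up tool.
const validCall: ChatCompletion = {
  choices: [{
    message: {
      content: null,
      tool_calls: [{
        id: "call_1",
        type: "function",
        function: { name: "read_file", arguments: "{\"path\":\"README.md\"}" },
      }],
    },
  }],
};
```

Logging every true result gives the team a measured failure rate for each local model, which is a sounder basis for choosing between a larger local model and a cloud fallback than impressions from a demo.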
Local AI is becoming a provider choice
The significance of this QVAC release is not a benchmark score. It is a sign that local AI is moving into provider competition. Application developers will increasingly want to place OpenAI, Anthropic, Google, and OpenRouter next to Ollama, QVAC, LM Studio, and internal vLLM endpoints, then compare them from the same code path. The deciding factors will be streaming, embeddings, tool calling, structured output, audio, image generation, auth, logging, and recovery behavior.
QVAC has shipped the first piece of that path in 0.1.0. It is a thin wrapper, but it is an official package. It connects to the OpenAI-compatible server documentation. It also names coding-agent limits instead of hiding them. If Tether's TurboQuant message is about reducing the memory bottleneck for the local runtime, @qvac/ai-sdk-provider is about making that runtime reachable from TypeScript AI apps and agent harnesses.
The practical question for developers is no longer whether local AI should be trusted for everything. It is which tasks should route locally and which should stay on cloud models. QVAC's provider gives AI SDK users one more way to run that experiment in application code. The rough edges are still visible in ports, context sizing, concurrency, and tool-use quality, but documenting those limits in a first release is a useful starting point.