
Thinking Machines Makes AI Collaboration Real Time

Thinking Machines' Interaction Models preview proposes full-duplex collaboration where an AI can listen, see, speak, and use tools at the same time.

AI Summary
  • What happened: Thinking Machines Lab published a research preview for Interaction Models.
    • The model processes audio, video, and text in 200ms micro-turns, aiming for real-time collaboration rather than turn-based chat.
  • Core shift: The AI does not wait for the user to finish speaking. It can listen, watch, speak, and use tools while the interaction continues.
  • Builder impact: Realtime APIs are becoming a full-duplex orchestration problem, not only a voice input-output problem.
  • Watch: This is still a limited research preview. Long sessions, network quality, and safety behavior remain unproven in broad use.

Thinking Machines Lab's May 11, 2026 preview of Interaction Models can look, at first glance, like another real-time voice AI announcement. The interesting part is not simply voice quality or demo polish. Thinking Machines is reframing the question of when an AI should listen and when it should speak as a model-architecture problem. Many AI products still assume a turn-based structure: the user enters something, the model responds, then the user enters the next thing. Even voice modes often follow the same pattern. Voice activity detection marks a turn boundary after the user stops talking, the model generates a response, and text-to-speech plays it back.

Thinking Machines wants to break that sequence. While the user is speaking, the model continues listening. While the user is showing something on screen, the model continues watching. When appropriate, it can briefly interject before the user finishes, speak at the same time, or hand longer reasoning and tool use to a background model. The company's framing is that interactivity should scale alongside intelligence. Better reasoning alone is not enough if the model still experiences collaboration through a narrow turn-taking channel.

[Image: Thinking Machines' official seamless dialog management demo thumbnail]

This sits in a slightly different place from much of the recent AI-agent conversation. In 2025 and 2026, the agent race has largely moved toward long-running autonomy: a person gives a goal, and the agent executes for minutes or hours. Coding agents take an issue, create a branch, run tests, and open a pull request. Research agents search and summarize. Work agents connect CRM, calendars, documents, and payment tools. Those systems matter, but they assume the user can express the requirement clearly up front.

Real collaboration is messier. Requirements emerge through conversation. A designer looks at a screen and says a section feels cramped. A developer points at a running app and says state is breaking here. A doctor reads not only the patient's words but also pauses and facial expression. In a meeting, someone can interrupt midway because a premise is wrong. Thinking Machines starts from that gap. Many tasks still need a human in the loop, but current AI interfaces compress that loop into a narrow text or audio channel.

The bottleneck in turn-based AI

Most mainstream LLM interfaces inherit the time structure of text chat. The user's input becomes one complete message, and the model's output becomes one complete assistant message. Streaming tokens change the display, but not the basic flow. The model begins its real answer after the user's message is complete. If the user changes direction mid-sentence, if a new visual event appears on screen, or if the user says "no, not that" while the model is responding, the system needs a separate interruption handler, turn detector, or dialog manager.

Thinking Machines argues that this external harness approach does not scale well. Voice activity detection can guess whether speech has stopped, but it does not deeply understand whether the user is thinking, correcting themselves, inviting the model in, or speaking to someone else nearby. Screen-based systems face a similar issue. A model can inspect periodic screenshots, but deciding whether a visual change deserves an immediate response becomes another policy layer.

The key idea behind an interaction model is to move that policy inside the model. The company describes current commercial frontier models as experiencing the world through a single thread. While the user speaks, the model waits. While the model speaks, perception pauses. Human collaboration is not like that. Listening, seeing, speaking, waiting, interrupting, nodding, and working can happen in overlapping time. Thinking Machines is arguing that AI should learn that temporal structure directly.

The 200ms micro-turn design

The preview's central model is TML-Interaction-Small. Thinking Machines says it trained the model as an interaction model from the start. The basic unit is 200ms. The model splits 200ms worth of input and 200ms worth of output into micro-turns, then interleaves audio, video, and text input with text and audio output along the timeline. Instead of waiting until a user utterance is complete and treating it as one turn, the model can see, hear, and respond in very small time slices.
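
To make that structure concrete, here is a minimal TypeScript sketch of what a time-aligned micro-turn record could look like. Only the 200ms slice size comes from the announcement; the `Modality` and `MicroTurn` names, the field shapes, and the generator are illustrative assumptions, not Thinking Machines' actual format.

```typescript
// Hypothetical shape of a time-aligned micro-turn stream. Only the
// 200ms slice size comes from the announcement; everything else is
// an illustrative assumption.

type Modality = "audio" | "video" | "text";

interface MicroTurn {
  t: number; // slice start time in ms
  inputs: Partial<Record<Modality, Uint8Array | string>>;
  outputs: { text?: string; audio?: Uint8Array };
}

const SLICE_MS = 200;

// Unlike a flattened turn transcript, this stream keeps overlap:
// inputs.audio may still carry user speech in the same slice where
// outputs.audio carries a short backchannel from the model.
function* timeline(durationMs: number): Generator<MicroTurn> {
  for (let t = 0; t < durationMs; t += SLICE_MS) {
    yield { t, inputs: {}, outputs: {} };
  }
}
```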

The official post includes a comparison between a turn-based sequence and a time-aligned micro-turn sequence. A turn-based model sees a flattened line: input 1, output 1, input 2, output 2. An interaction model sees video frames, audio streams, model output, silence, overlap, and interruptions aligned in time. That difference changes the user experience. The model can provide backchannels before the user finishes, correct something at the moment it becomes wrong, or speak up when a visual bug appears on screen.

The technical choices are also notable. Thinking Machines describes an early-fusion approach that does not rely heavily on large standalone encoders. Audio enters as dMel signals through a lightweight embedding layer. Images are split into 40x40 patches and encoded by hMLP. The audio decoder uses a flow head, and the components are co-trained with the transformer from the beginning. This is less an ASR plus LLM plus TTS pipeline and more an attempt to make time-synchronized audio, video, and text part of the model's native input structure.
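
As a small worked example of the patching arithmetic, the sketch below counts 40x40 patches per frame. Only the patch size comes from the announcement; the helper function is hypothetical.

```typescript
// Hypothetical helper: how many 40x40 patches a frame produces.
// Only the 40x40 patch size comes from the announcement.
function patchCount(width: number, height: number, patch = 40): number {
  return Math.ceil(width / patch) * Math.ceil(height / patch);
}

// e.g. a 1280x720 frame yields 32 * 18 = 576 patches, all of which
// must be embedded within each 200ms micro-turn budget.
console.log(patchCount(1280, 720)); // 576
```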

Inference is another important piece. Processing 200ms chunks continuously means many tiny prefill and decode operations. Existing LLM inference stacks are strong at high-volume token processing, but short and frequent requests carry overhead. Thinking Machines says it implemented streaming sessions where a client sends 200ms chunks and the inference server appends them to a persistent sequence in GPU memory. It also says some related functionality was upstreamed to SGLang. Realtime AI is not only a model paper problem. It is a serving runtime and network protocol problem.
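
To illustrate what such a streaming session implies for clients, here is a hedged TypeScript sketch of a send loop over a plain WebSocket. The URL, event names, and message framing are invented for illustration; only the 200ms chunking and the append-to-persistent-sequence idea come from the announcement.

```typescript
// Hypothetical client loop for a streaming session. The server is
// assumed to append each chunk to a persistent in-GPU sequence, so
// the client never resends history. URL and message shapes are
// invented for illustration.

const SLICE_MS = 200;
const ws = new WebSocket("wss://example.invalid/v1/interaction"); // placeholder

let clock = 0;
function sendSlice(audio: ArrayBuffer, frame?: ArrayBuffer): void {
  // Metadata goes as JSON; binary payloads follow as separate frames.
  ws.send(JSON.stringify({ type: "input.slice", t: clock, hasFrame: !!frame }));
  ws.send(audio);
  if (frame) ws.send(frame);
  clock += SLICE_MS;
}

ws.addEventListener("message", (ev) => {
  if (typeof ev.data !== "string") return; // binary output handled elsewhere
  const msg = JSON.parse(ev.data);
  // Output also arrives in slices: partial text, partial audio, or a
  // marker meaning "the model chose to stay silent this slice".
  if (msg.type === "output.text") process.stdout.write(msg.text);
});
```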

Fast models and slow models split the work

The system is not one interaction model doing everything. It has two layers. One is a fast interaction model that stays in front of the user. The other is a background model that handles longer reasoning, browsing, tool use, and background work. The interaction model follows the user's speech and screen in real time. When deeper thinking is needed, it passes a context package to the background model. The background model's result is not simply pasted into the conversation as a finished answer. The interaction model blends it back into the current flow of work.
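
The pattern is easy to express as an orchestration sketch. Below, `ContextPackage` and the blend step are invented names for the handoff the post describes; the actual contract between the two models is not public.

```typescript
// Hypothetical foreground/background split. ContextPackage and the
// blend step are invented names for the handoff the post describes.

interface ContextPackage {
  transcriptTail: string; // recent speech, already condensed
  screenSummary: string;  // what the user is currently looking at
  request: string;        // what the foreground model wants done
}

interface BackgroundResult {
  summary: string;
  artifacts: string[];
}

async function runBackground(pkg: ContextPackage): Promise<BackgroundResult> {
  // Stand-in for a slower reasoning / tool-use model call.
  return { summary: `Looked into: ${pkg.request}`, artifacts: [] };
}

async function onNeedsDeepWork(
  pkg: ContextPackage,
  speak: (line: string) => void,
): Promise<void> {
  const pending = runBackground(pkg); // start work without blocking
  speak("Let me check that while we keep going."); // stay in the flow
  const result = await pending;
  // The foreground model blends the result back into the live
  // interaction instead of pasting it as a finished answer.
  speak(`Quick update: ${result.summary}`);
}
```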

This is an important hint for AI product design. Future realtime agents may not be "the smartest model on a socket." The model in front of the user has to optimize for latency, turn-taking, interruption, and sensory grounding. The model in the background can optimize for planning, retrieval, tool execution, and verification. They need shared context, but the foreground interaction model decides how results are surfaced to the user.

That pattern resembles frontend and backend separation, but it is harder than an ordinary software split. The foreground model is not just a UI wrapper. It has intelligence and must judge the user's attention and timing. The background model is not just a worker. Its results have to arrive partially and contextually inside a live interaction. In that sense, the interaction model becomes part scheduler, part narrator, and part interruption manager for the agent runtime.

• 200ms: input-output micro-turn unit
• 0.40s: FD-bench v1 turn-taking latency
• 12B: active parameters inside a 276B MoE model

What the benchmarks say

Thinking Machines claims TML-Interaction-Small is the first model to combine intelligence and interactivity in this way. In its published table, the model records 0.40 seconds on FD-bench v1 turn-taking latency. The same table lists GPT-realtime-2.0 minimal at 1.18 seconds, GPT-realtime-1.5 at 0.59 seconds, and Gemini 3.1 Flash Live minimal at 0.57 seconds. On FD-bench v1.5 average, TML-Interaction-Small is reported at 77.8, compared with 46.8 for GPT-realtime-2.0 minimal and 54.3 for Gemini 3.1 Flash Live minimal.

Those numbers require caution. First, they come from Thinking Machines' own announcement. Some benchmark items are public, but until conditions and user experience are independently reproduced, the numbers are best read as early evidence in a favorable direction. Second, realtime model quality is not determined by latency alone. Voice quality, dropped connections, meeting noise, accents, screen-capture frequency, privacy behavior, and appropriate interruption all matter. Third, this is a research preview. TechCrunch also noted that because the system is not yet a public product, the real experience remains hard to judge until people can use it directly.

Even with those caveats, the benchmarks matter because they change what is being evaluated. Traditional model evaluation asks whether the model gives the right answer, fixes the code, solves the math problem, or calls the right tool. Interaction models add other questions: when should the model speak, when should it stay quiet, can it say the right short thing while the user is still talking, and can it react when the answer appears on screen? If AI is going to work with people in real settings, those questions are not secondary.

Why visual proactivity matters

The most practical part of the announcement is visual proactivity. Voice AI is already relatively natural, but much of it still behaves like audio-only turn-taking. The user speaks, the model listens, and after speech ends it replies. Work, however, is visual. Developers look at editors, terminals, browsers, logs, design drafts, and dashboards. Manufacturing, healthcare, education, and field operations also depend on screens or physical scenes. If a model is going to be a real collaborator, it needs to detect meaningful visual changes even when the user does not explicitly say "look at this."

Thinking Machines says it adapted video benchmarks such as RepCount-A, ProactiveVideoQA, and Charades into streaming settings. For example, if a user asks the model to count push-ups and then only video continues, the model has to track the action and speak at the right time without a new audio cue. The announcement reports that GPT-realtime-2.0 minimal mostly fails to respond or responds incorrectly in this visual-proactivity category. TML-Interaction-Small is reported at 35.4 off-by-one accuracy on RepCount-A, 33.5 on ProactiveVideoQA, and 32.4 mIoU on Charades.

[Image: Thinking Machines' official generative UI demo thumbnail]

In developer tools, the difference becomes even clearer. Today's coding agents usually start from an issue or a prompt. Pair programming often depends on moments before a full prompt exists: the user hesitates while reading a failing test log, edits the same file repeatedly, or sees a UI break in a browser. If a model can say "that state update just happened twice" at the moment the problem appears, the interaction pattern changes. This is not only about capturing the screen more often. The model has to learn when to speak, when to stay silent, and when a helpful comment becomes a distraction.

What the demos show and what remains unknown

Thinking Machines embedded several YouTube demos in the official post: seamless dialog management, verbal and visual interjection, simultaneous speech, time awareness, simultaneous tool calls and search, generative UI, and longer real sessions. The demos show experiences that are not "answer after the conversation ends." The model follows the user's speech, reacts to visual cues, interjects at times, and continues search or UI generation in parallel.

But demos are still demos. The difficult part of realtime AI products is not the beautiful single session. It is the messy everyday session. A meeting-room microphone is far away. Background noise is present. The screen keeps changing. A user starts a sentence and stops halfway. Sensitive information passes through the frame. There are many moments when the model should not interrupt. "Proactive" is attractive in a demo, but in a product it can easily become "intrusive." In developer tools and work tools, users may feel the boundary between helpful teammate and surveillance software very sharply.

The official post is also explicit about limitations. Continuous audio and video accumulate context quickly, making long-session management difficult. Low-latency streaming audio and video require stable connectivity, and weak connections can degrade the experience sharply. Realtime interfaces also change safety and alignment. A text chatbot can inspect one message and produce a refusal. In live speech, refusal has to be both natural in conversation and safe. Thinking Machines says it built additional data for refusals that sound natural while remaining firm.

How APIs and product design change

For AI developers, the question is broader than when this specific model becomes available. If interaction models become products, APIs will likely grow more complex than chat completions or simple realtime WebSockets. Audio chunks, video frames, text events, tool results, background task updates, and UI-generation events all have to move along the same timeline. Alternating user and assistant messages in a messages array is not enough.
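
One way to picture that requirement is a single discriminated union of timeline events, sketched below in TypeScript. The event names are assumptions; the point is that every modality and tool update shares one time-ordered stream instead of an alternating messages array.

```typescript
// Sketch of a single time-ordered event stream replacing an
// alternating messages array. All event names are assumptions.

type SessionEvent =
  | { type: "input.audio"; t: number; chunk: ArrayBuffer }
  | { type: "input.frame"; t: number; frame: ArrayBuffer }
  | { type: "input.text"; t: number; text: string }
  | { type: "output.text"; t: number; text: string }
  | { type: "output.audio"; t: number; chunk: ArrayBuffer }
  | { type: "tool.result"; t: number; callId: string; result: unknown }
  | { type: "background.update"; t: number; taskId: string; status: string }
  | { type: "ui.patch"; t: number; patch: unknown };

// Consumers reduce over the stream in time order, so a tool result
// can land between two slices of the user's ongoing sentence.
function eventsInWindow(events: SessionEvent[], from: number, to: number): SessionEvent[] {
  return events.filter((e) => e.t >= from && e.t < to);
}
```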

First, session memory becomes more important. As in the streaming-session design Thinking Machines described, the server has to append very small chunks to a persistent sequence. Product teams must decide which frames and audio segments remain, what gets summarized, and how sensitive screen information is discarded. Realtime context management reaches cost and privacy problems faster than ordinary long-context text.
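
A minimal sketch of that trade-off, with invented retention windows: raw media expires quickly, summaries live longer, and sensitive frames are never persisted.

```typescript
// Hypothetical rolling session memory. The retention windows are
// invented; the constraint they illustrate is from the post: raw
// audio/video accumulates context too fast to keep indefinitely.

interface StoredSlice {
  t: number;
  raw?: ArrayBuffer;  // original audio/frame bytes
  summary?: string;   // cheap textual digest of the slice
  sensitive: boolean; // e.g. a frame flagged as containing secrets
}

const RAW_WINDOW_MS = 30_000;      // keep raw media ~30s (assumption)
const SUMMARY_WINDOW_MS = 600_000; // keep summaries ~10min (assumption)

function evict(slices: StoredSlice[], now: number): StoredSlice[] {
  return slices
    .filter((s) => !s.sensitive)                   // never persist sensitive frames
    .filter((s) => now - s.t <= SUMMARY_WINDOW_MS) // hard cap on history
    .map((s) => (now - s.t > RAW_WINDOW_MS ? { ...s, raw: undefined } : s)); // keep only the digest
}
```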

Second, tool-call semantics change. In existing tool calling, the model calls a function during generation, receives the result, and then continues producing tokens. In an interaction model, the background model may search while the user keeps speaking, and the interaction model may continue responding to follow-up comments at the same time. When the tool result comes back, the system has to decide whether to say it aloud, wait, show it as UI, or use it only inside the background task. This is where agent orchestration and realtime UX merge.
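
That arbitration can be sketched as a small policy function. The states and rules below are illustrative, not a documented API.

```typescript
// Illustrative policy for surfacing an asynchronous tool result
// into a live conversation. States and rules are assumptions.

type Surface = "speak" | "show_ui" | "hold" | "background_only";

interface LiveState {
  userSpeaking: boolean;       // is the user mid-utterance right now?
  topicStillRelevant: boolean; // has the conversation moved on?
  resultIsVisual: boolean;     // tables, diffs, previews
}

function chooseSurface(s: LiveState): Surface {
  if (!s.topicStillRelevant) return "background_only"; // keep it for later
  if (s.resultIsVisual) return "show_ui";              // render, don't narrate
  if (s.userSpeaking) return "hold";                   // wait for a natural gap
  return "speak";
}
```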

Third, permission boundaries become more difficult. If a model is always watching, listening, and ready to speak, permissions need to be more granular. Screen viewing, microphone input, camera input, web search, internal document search, code execution, and external messaging all carry different risks. The advantage of an interaction model is natural collaboration, but that same naturalness can hide what the AI knows and what it is allowed to do. Good products should not say "enable everything and it gets smarter." They should show which senses and tools are active in which situations.
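
A possible shape for such granular permissions is sketched below, with conservative defaults. Every field is an assumption about what a product might expose, not an existing configuration format.

```typescript
// Hypothetical granular permission surface for an always-on
// assistant. Every field is an assumption about what a product
// might expose, not an existing configuration format.

interface InteractionPermissions {
  screen: "off" | "active_window" | "full_screen";
  microphone: "off" | "push_to_talk" | "always_on";
  camera: boolean;
  webSearch: boolean;
  internalDocs: boolean;
  codeExecution: "off" | "sandboxed" | "full";
  outboundMessages: boolean;
}

// Conservative defaults: the assistant can hear on request and
// search the web, but cannot watch, run code, or message anyone.
const defaults: InteractionPermissions = {
  screen: "off",
  microphone: "push_to_talk",
  camera: false,
  webSearch: true,
  internalDocs: false,
  codeExecution: "off",
  outboundMessages: false,
};
```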

What pressure this puts on competitors

OpenAI, Google, Anthropic, xAI, Alibaba, and NVIDIA-adjacent research teams are all moving on realtime, multimodal, and agentic workflows. OpenAI has the Realtime API and GPT Realtime models. Google has Gemini Live API and Gemini 3.1 Flash Live. Qwen Omni, Moshi, and Nemotron VoiceChat point toward full-duplex or audio-native research as well. It is too early to say Thinking Machines has overtaken those platforms. Existing large platforms are stronger in product distribution, ecosystem, pricing, reliability, and developer tooling.

Still, Thinking Machines has introduced a frame that competitors will have to answer. "Realtime voice response" is no longer enough if users begin to expect models that natively handle time, overlap, and visual cues. Existing voice assistants may feel slower if users get used to this interaction style. Coding agents could face the same shift. The important agent may not only be the background worker that opens a PR. It may also be the agent that watches the screen with the developer and reacts at the right moment.

The other pressure is benchmarking. Thinking Machines argues that existing evaluation does not measure interactivity well enough, so it introduced TimeSpeak, CueSpeak, RepCount-A adaptation, ProactiveVideoQA adaptation, and Charades streaming evaluation. It is unclear whether the broader community will adopt those metrics. The direction, however, is natural. AI model competition is widening from "is the answer correct?" to "when, how, and how unobtrusively does the model work with a person?" The quality users feel in real work is not captured by answer accuracy alone.

What development teams should watch now

Teams cannot put TML-Interaction-Small into production today. Thinking Machines only says it plans to open a limited research preview in the coming months and a broader release later this year. So this should not be read as a new API guide. It is more useful as a signal about product structure.

First, the default assumption that an AI app starts as turn-based chat and later adds voice may hit limits. In education, healthcare, design review, developer tools, field work, and customer support, mid-stream human feedback is central. Product teams may need to design interaction state from the beginning: whether the user is speaking, thinking, looking at something, or inviting interruption.

Second, teams should separate the foreground collaborator from the background agent. A model that performs long tasks well and a model that collaborates naturally in real time have different latency budgets. One model may eventually do both, but system design will often be cleaner if roles are separated. The foreground model handles immediacy and attention. The background model handles planning and verification.

Third, proactivity is policy, not a feature toggle. "The AI jumps in by itself" is powerful in a demo and risky in a product. Teams need to decide when the model stays silent, when it gives a short signal, when it explains at length, and how often each user wants intervention. In developer tools, it is more practical to begin with high-value moments such as test failures, security risks, destructive commands, and UI regressions.
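
As a sketch of proactivity-as-policy, the function below gates interventions on event value and a per-user tolerance setting. The event kinds mirror the high-value moments listed above; the thresholds are invented.

```typescript
// Sketch of proactivity as policy. Event kinds mirror the
// high-value moments above; thresholds are invented.

type Intervention = "silent" | "short_signal" | "full_explanation";

interface ObservedEvent {
  kind: "test_failure" | "security_risk" | "destructive_command" | "ui_regression" | "other";
}

function decide(e: ObservedEvent, userTolerance: number /* 0..1 */): Intervention {
  if (e.kind === "other") return "silent";                         // default to quiet
  if (e.kind === "destructive_command") return "full_explanation"; // always warn loudly
  return userTolerance > 0.5 ? "full_explanation" : "short_signal";
}
```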

Conclusion

Thinking Machines' Interaction Models are still more research signal than finished product. The API is not broadly available, outside reproduction is limited, and long-session reliability is unknown. But the announcement points to an important change. The next stage of AI collaboration is not explained only by longer context windows or stronger autonomous agents. The ability to share the user's timeline, follow speech and screen simultaneously, and intervene only when useful may become part of the model race.

The question for developers is simple: is the AI product still a system that answers after input, or is it a system that works inside the same scene as the user? Thinking Machines is treating the second option as a model-architecture problem. If that bet is right, realtime AI development is not just adding a voice UI. It is designing time, tools, permissions, safety, and attention together.
