Why Qwen3.7 is pairing 35-hour agents with custom chips

Alibaba's Qwen3.7-Max is more than a model launch. It packages agents, custom chips, 128-accelerator racks, and a cloud runtime into one stack.

AI summary

What happened: Alibaba introduced Qwen3.7-Max alongside its Zhenwu M890 AI chip and a 128-accelerator supernode. The official announcement frames 35-hour agent runs and more than 1,000 tool calls as the headline demo.
Why it matters: Model competition is moving toward a full stack of agent runtime, cloud rack, interconnect, and low-precision inference.
Builder angle: Choosing a strong LLM is no longer enough; teams also need execution layers, cost controls, observability, and supply-chain clarity.
Watch: Qwen3.7-Max is slated for Alibaba Model Studio, and the 35-hour and 1,000-tool-call figures are Alibaba's own claims.

Alibaba unveiled Qwen3.7-Max at the Alibaba Cloud Summit on May 20, 2026. On the surface, this looks like another flagship model announcement. The more important signal sits behind the model name. Alibaba introduced Qwen3.7-Max together with the Panjiu AL128 Supernode Server, T-Head's Zhenwu M890 AI processor, ICN Switch 1.0, and new agent-focused optimizations in Model Studio and Bailian.

The official framing is direct. Qwen3.7-Max is presented as a model designed for agentic coding, complex reasoning, and long-horizon task execution. Alibaba says it can handle agent tasks lasting up to 35 hours and more than 1,000 tool calls without performance degradation. That is not just a better-chatbot message. Alibaba is saying that long-running agents require a bundle: the model, tool calling, memory, interconnect, low-precision inference, and cloud racks all have to work together.

Most recent AI model launches follow a familiar script: a higher benchmark score, a longer context window, cheaper tokens, faster coding, or better multimodal behavior. Qwen3.7-Max fits that pattern, but the launch does not stop at the model card. Alibaba put a 128-accelerator single-rack system, native FP4 support in a new chip, and a 25.6 Tbps switch on the same stage. The message is that the next bottleneck for agents is not only inside the model weights.

Official Qwen Chat app image

Long-running execution is the real hook

The number that makes Qwen3.7-Max stand out is 35 hours. Alibaba's announcement says the model is built for code generation and debugging, office workflow automation, and complex tasks that require hundreds or thousands of steps. It specifically highlights continuous long-running agent work for up to 35 hours and more than 1,000 tool calls. Anyone who has operated coding agents or workflow agents can see why that number moved to the front of the announcement.

An agent is not a single model response. It reads files, drafts a plan, runs shell commands, launches tests, studies failures, calls browsers or search tools, and recombines the results. As the loop gets longer, raw reasoning matters, but it is only one part of the system. State retention, tool-call reliability, cost predictability, recovery from failure, logs, and traces become equally important. A 35-hour claim is aimed at an operational question: can the worker keep going without drifting, stalling, or losing the thread?
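
The operational concerns in that list (state retention, recovery from failure, logs) can be made concrete with a small sketch. This is a minimal illustration, not Alibaba's runtime or any framework's API: `AgentState`, `run_agent`, the fixed `plan` list, and the JSON checkpoint layout are all hypothetical names chosen here. A real agent would ask the model for its next step instead of walking a fixed plan.

```python
import json
import os
from dataclasses import asdict, dataclass, field


@dataclass
class AgentState:
    goal: str
    step: int = 0        # index of the next plan step to run
    tool_calls: int = 0  # every attempt counts, including retries
    log: list = field(default_factory=list)


def save_checkpoint(state, path):
    # Write to a temp file, then rename, so a crash mid-write
    # does not corrupt the previous checkpoint.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(asdict(state), f)
    os.replace(tmp, path)


def load_checkpoint(path, goal):
    # Resume from disk if a checkpoint exists, otherwise start fresh.
    if os.path.exists(path):
        with open(path) as f:
            return AgentState(**json.load(f))
    return AgentState(goal=goal)


def run_agent(state, plan, tools, path, max_retries=2):
    # plan: list of (tool_name, argument); tools: dict of name -> callable.
    while state.step < len(plan):
        tool_name, arg = plan[state.step]
        for _ in range(max_retries + 1):
            state.tool_calls += 1
            try:
                result = tools[tool_name](arg)
            except Exception as exc:
                # Record the failure in the trace and try again.
                state.log.append({"step": state.step, "tool": tool_name, "error": str(exc)})
                continue
            state.log.append({"step": state.step, "tool": tool_name, "result": result})
            break
        else:
            # Retries exhausted: persist the trace and stop instead of drifting.
            save_checkpoint(state, path)
            raise RuntimeError(f"step {state.step} failed after {max_retries + 1} attempts")
        state.step += 1
        save_checkpoint(state, path)
    return state
```

Because the checkpoint is written after every completed step, a process that dies at hour 20 can call `load_checkpoint` and continue from the last finished step rather than starting over. Whether a hosted agent runtime exposes anything comparable is exactly the kind of question the 35-hour claim raises.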

That places Alibaba's announcement in the same broader movement as Google Gemini API Managed Agents, NVIDIA's Vera CPU shipping story, and AWS-style security agents. Model companies and cloud companies increasingly see that a model API by itself is not enough. Agents that finish real work need execution environments, code sandboxes, networking, permissions, compute scheduling, and observability. Qwen3.7-Max is Alibaba's attempt to pull that execution layer toward Alibaba Cloud and its own silicon.

Qwen3.7 continues the move toward closed flagship models

devlery previously covered Qwen3.6-Plus as part of a wider shift: Qwen's Max and Plus tiers increasingly put agentic coding and API-delivered flagship models at the center. Qwen3.7-Max pushes that direction further. According to SCMP's reporting, Qwen3.7-Max-Preview and Qwen3.7-Plus-Preview appeared on LMArena and Qwen Chat before the formal announcement. Qwen3.7-Max-Preview reportedly ranked 13th globally for text, and Qwen3.7-Plus-Preview reportedly ranked 16th for vision. The same coverage also noted that the previews still trailed top US models from the Claude, Gemini, and ChatGPT families.

That tension matters. Alibaba's Qwen brand has built a strong presence in the open-model ecosystem. But the front edge of the Max and Plus line is moving toward hosted services, previews, and API access. Many developers will ask a different question from cloud buyers: when will a smaller, capable open-weight model arrive?

Those strategies can coexist. Qwen can keep developer goodwill through open-weight releases while Alibaba Cloud monetizes the frontier tier as a hosted product. Still, the practical test for Qwen3.7-Max is not benchmark rank alone. Developers will need to see Model Studio availability, pricing, tool-call quality, context and state behavior, and real compatibility with coding-agent frameworks.

Custom chips are a supply-chain statement

The chip name in this launch is Zhenwu M890. It comes from T-Head, Alibaba's semiconductor design subsidiary, and is presented as a new AI training and inference processor. Alibaba says the M890 delivers three times the performance of the prior Zhenwu 810E, includes 144 GB of memory, supports 800 GB/s inter-chip bandwidth, and natively supports precision formats from FP32 down to FP4.
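
To see why the FP4 end of that range matters, it helps to work out weights-only memory footprints. The sketch below is back-of-the-envelope arithmetic, not a statement about how the M890 allocates memory. The 144 GB figure comes from Alibaba's announcement; the intermediate FP16 and FP8 entries, the decimal-GB convention, and the 70-billion-parameter example model are assumptions made here for illustration.

```python
# Bytes needed to store one parameter at each precision.
# FP16 and FP8 are assumed intermediate formats; the announcement only
# states a range "from FP32 down to FP4".
BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "FP8": 1.0, "FP4": 0.5}

M890_MEMORY_GB = 144  # figure from Alibaba's announcement


def weight_footprint_gb(params_billion, precision):
    # One billion parameters at one byte each is 1 GB (decimal GB assumed).
    return params_billion * BYTES_PER_PARAM[precision]


def max_params_billion(memory_gb, precision):
    # Upper bound for weights alone; ignores KV cache, activations,
    # and runtime overhead, so real capacity is lower.
    return memory_gb / BYTES_PER_PARAM[precision]


if __name__ == "__main__":
    for precision in BYTES_PER_PARAM:
        print(
            precision,
            weight_footprint_gb(70, precision),  # hypothetical 70B model
            max_params_billion(M890_MEMORY_GB, precision),
        )
```

Under these assumptions, a hypothetical 70B-parameter model needs 140 GB for weights at FP16 but 35 GB at FP4, and a single 144 GB chip could hold at most 288 billion FP4 parameters before any working memory is counted. Whether quality holds up at that precision is a separate, workload-specific question.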

There are two useful ways to read those numbers. First, agent inference is a cost problem. Long tasks bring repeated model calls and tool execution. Once a model goes through hundreds of intermediate steps, the cumulative inference bill matters more than the price of one prompt. FP4 support points toward faster inference and lower cost, even though actual quality and cost tradeoffs have to be tested workload by workload.
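
The cumulative-bill point is easy to check with arithmetic. The sketch below compares a run where every step sends the same amount of input with a run where the context grows at each step, which is common when an agent keeps appending tool output to its history. All token counts and prices are made-up placeholders, not Alibaba pricing, which has not been part of the announcement details covered here; the function names are hypothetical.

```python
def flat_run_cost(steps, in_tokens, out_tokens, price_in_per_m, price_out_per_m):
    # Every step sends the same number of input tokens.
    total_in = steps * in_tokens
    total_out = steps * out_tokens
    return (total_in * price_in_per_m + total_out * price_out_per_m) / 1_000_000


def growing_run_cost(steps, base_in_tokens, growth_per_step, out_tokens,
                     price_in_per_m, price_out_per_m):
    # Step k (0-based) sends base_in_tokens + k * growth_per_step input tokens,
    # so total input is an arithmetic series that grows with steps squared.
    total_in = steps * base_in_tokens + (steps * (steps - 1) // 2) * growth_per_step
    total_out = steps * out_tokens
    return (total_in * price_in_per_m + total_out * price_out_per_m) / 1_000_000


if __name__ == "__main__":
    # Hypothetical: 1,000 steps, 2,000 input and 500 output tokens per step,
    # priced at 1.0 and 4.0 currency units per million tokens.
    print(flat_run_cost(1000, 2000, 500, 1.0, 4.0))
    # Same run, but the context grows by 200 tokens after every step.
    print(growing_run_cost(1000, 2000, 200, 500, 1.0, 4.0))
```

With these placeholder numbers the flat run costs 4.0 units while the growing-context run costs about 103.9, roughly 26 times more for the same number of steps. That gap, not the single-prompt price, is what cheaper low-precision inference and context management are competing to shrink.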

Second, this is about China's AI infrastructure supply chain. High-end GPU access and export controls are structural constraints for Chinese cloud providers. By launching a model, an AI chip, an interconnect, and a rack server together, Alibaba is signaling that it wants to be seen as an agent compute-stack company, not only a model company. That aligns with the global move toward vertical integration, from NVIDIA's systems roadmap to Google's TPU-based cloud strategy.

35 hours: claimed length of a long-running Qwen3.7-Max agent task
1,000+: claimed tool calls without performance degradation
128: accelerators in a single Panjiu AL128 rack

Panjiu AL128 targets agent concurrency

Panjiu AL128 Supernode Server is the infrastructure centerpiece of the launch. Alibaba says the system uses Zhenwu M890 and ICN Switch 1.0 to integrate 128 AI accelerators in one rack and provide PB/s-scale single-rack bandwidth. Alibaba Cloud frames the configuration as suitable for scalable agent inference and large-scale model training.

For agent workloads, concurrency is not just the number of incoming requests. A single user can assign one long task, and that task may internally trigger many model calls, tool calls, file accesses, code runs, searches, and verification loops. When many users run those jobs at the same time, the cluster behaves differently from a server built only for short inference requests. Long-lived tasks hold state, alternate between idle time and bursts, and often run small retry or evaluation loops.
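
Little's law gives a quick way to see how different that workload is: the average number of jobs in flight equals the arrival rate multiplied by the average time each job spends in the system. The sketch below applies it to short requests and long agent tasks. The arrival rates, durations, and the share of time a job spends in model calls are all hypothetical values picked for illustration.

```python
def jobs_in_flight(arrivals_per_hour, avg_duration_hours):
    # Little's law: average jobs in the system = arrival rate x average duration.
    return arrivals_per_hour * avg_duration_hours


def active_model_load(in_flight, model_time_fraction):
    # Long agent jobs spend part of their time waiting on tools.
    # Only the model-call share of each job occupies accelerators.
    return in_flight * model_time_fraction


if __name__ == "__main__":
    # Hypothetical: 10 requests per hour in both cases.
    short = jobs_in_flight(10, 0.01)  # 36-second chat-style requests
    long_ = jobs_in_flight(10, 35)    # 35-hour agent tasks
    print(short, long_)
    # Hypothetical: 20% of a long job's wall-clock time is model inference.
    print(active_model_load(long_, 0.2))
```

At the same arrival rate, the short requests average about 0.1 jobs in flight while the 35-hour tasks average 350, each holding state the whole time. Even if only a fifth of that time is spent in model calls, the platform still has to keep roughly 70 jobs' worth of inference busy and 350 jobs' worth of state alive.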

That explains why Alibaba presented AL128 together with Qwen3.7-Max. If the model is marketed as capable of long-horizon execution, the cloud infrastructure underneath it has to survive that workload pattern. Accelerator density, inter-chip bandwidth, low-latency interconnect, safety governance, and Agentic RL are all parts of one question: can the platform keep many agents running without turning tail latency and cost into the hidden failure mode?

Agentic RL and safety boundaries still need scrutiny

The announcement also mentions Bailian's Agentic RL. Alibaba says Agentic RL uses feedback from agent execution to keep improving model performance, and that it includes built-in safety governance so autonomous agents remain inside defined boundaries. At the level of product vocabulary, those are exactly the areas an agent platform has to address. Long-running agents fail in many small ways, and the system needs a feedback loop that can turn those failures into better model behavior and better policies.

But this is where independent verification becomes important. The announcement does not explain what data or reward signals Agentic RL uses, how customer tool traces affect improvement loops, how tenant boundaries are enforced, or how far the safety layer goes against prompt injection and tool misuse. Long-running agents are useful precisely because they can move through code, documents, internal APIs, and privileged tools. For enterprise customers, those boundaries can matter more than a model's headline benchmark.

This is not an Alibaba-only issue. Google Managed Agents had to talk explicitly about network allowlists, credential injection, and interaction retention. OpenAI Codex and Claude Code have to deal with sandboxing, permissions, git-change boundaries, and test-execution cost. Agent-platform competition is becoming less about who can answer smartest in one turn and more about who can work for a long time without creating an unacceptable operational or security risk.
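
What a boundary looks like in code is not exotic. The sketch below shows one generic pattern, a policy check that runs before every tool call and enforces a tool allowlist, a network host allowlist, and a call budget. It is an illustration of the idea only and does not describe how Alibaba, Google, OpenAI, or Anthropic implement their controls; `ToolPolicy`, `check_call`, and the example host name are hypothetical.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class ToolPolicy:
    allowed_tools: frozenset
    allowed_hosts: frozenset
    max_tool_calls: int


class PolicyViolation(Exception):
    pass


def check_call(policy, tool, args, calls_so_far):
    # Hard stop once the run has used its tool-call budget.
    if calls_so_far >= policy.max_tool_calls:
        raise PolicyViolation("tool-call budget exhausted")
    # Deny by default: only explicitly listed tools may run.
    if tool not in policy.allowed_tools:
        raise PolicyViolation(f"tool not allowed: {tool}")
    # If the call carries a URL, its host must be on the network allowlist.
    url = args.get("url")
    if url is not None:
        host = urlparse(url).hostname  # None when the URL has no scheme/host
        if host not in policy.allowed_hosts:
            raise PolicyViolation(f"host not allowed: {host}")


if __name__ == "__main__":
    policy = ToolPolicy(
        allowed_tools=frozenset({"http_get", "run_tests"}),
        allowed_hosts=frozenset({"api.internal.example"}),
        max_tool_calls=1000,
    )
    check_call(policy, "http_get", {"url": "https://api.internal.example/v1"}, 0)
    try:
        check_call(policy, "http_get", {"url": "https://evil.example/x"}, 1)
    except PolicyViolation as exc:
        print("blocked:", exc)
```

The check sits outside the model, so a prompt-injected instruction to fetch an unlisted host fails regardless of what the model decides. What the announcement leaves open is how much of this kind of enforcement the platform provides and how much customers must build themselves.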

Agent-framework optimization is an ecosystem fight

One striking detail is Alibaba's claim that Qwen3.7-Max is optimized for leading agent frameworks such as OpenClaw, Hermes Agent, Claude Code, Qwen Paw, and Qoder. Claude Code is an Anthropic product, which makes the mention especially notable. It shows that model providers now have to think about the tool-use patterns of specific agent runtimes and harnesses.

Coding agents do not behave like ordinary chatbots. They produce diffs, edit files, read test logs, summarize shell output, recover from failures, and keep small changes moving toward completion. A model that only writes fluent natural language is not enough. The ordering of tool calls, handling of partial failures, ability to preserve a goal across a long run, respect for permission boundaries, and willingness to finish small edits all matter.

Alibaba's framework-optimization language suggests that models will not be fully runtime-neutral commodities in this market. The same model can feel different depending on the harness around it. Buyers may soon ask not only which model ranks higher on a benchmark, but which agent harness it can survive inside for a long task.

Layer	Alibaba announcement	Practical question
Model	`Qwen3.7-Max`, agentic coding, long-horizon execution	How reliably does it preserve goals and context across long tasks?
Service	Model Studio/Bailian, Agentic RL, safety governance	How transparent are traces, permissions, data boundaries, and failure recovery?
Rack	Panjiu AL128, 128 accelerators, PB/s-scale bandwidth	How much can it reduce tail latency for concurrent long-running agents?
Chip	Zhenwu M890, 144 GB, 800 GB/s, FP4 support	How does it balance cost and quality on real inference workloads?

Community reaction sits between interest and skepticism

Qwen3.7 had already drawn attention before the formal announcement. In Qwen-related Reddit threads, some users praised Qwen3.7-Max Preview's math performance. In LocalLLaMA discussions, others focused on whether Max would remain closed and whether practical open-weight descendants in the 27B or 35B range might arrive. Reports about its LMArena position added fuel to the discussion.

There was skepticism too. One user said Qwen3.7 Plus Preview would not accept that the current year is 2026. That single report does not define the whole model, but it does illustrate a familiar gap between preview leaderboard impressions and live product behavior. In a long-running agent, small inconsistencies can become larger failures. If the model misunderstands dates, file state, execution results, or permission boundaries, a long task can quietly move in the wrong direction.

The other tension is open-source expectation. Qwen is strongly associated with open-weight models among developers. If Max and Plus previews represent the best capability but remain bound to Qwen Chat or Model Studio, the community will keep asking when the weights move downstream. Alibaba may have a rational split: monetize the top model through cloud services while widening the ecosystem with smaller open releases. For developers, that makes the choice between local control and API dependence more complex.

What China's full-stack AI push means for builders

This announcement is easy to misread if the only frame is whether Chinese AI models are catching up with US frontier models. Alibaba is positioning Qwen3.7-Max as competitive with GPT, Claude, and Gemini-class systems, but it is also putting chips and cloud servers at the center of the story. That means Chinese AI companies are tying model capability, cloud distribution, and hardware independence into one strategy.

For builders, the immediate effect is a broader decision matrix. Global teams already compare OpenAI, Anthropic, Google, xAI, Mistral, DeepSeek, Qwen, and other model families. The comparison used to focus mainly on quality, price, latency, context length, and API stability. Now more questions enter the picture. Does the provider offer an agent runtime? Can you inspect tool-call traces? Can long-running tasks pause and resume? Where is the data stored? Which chips and regions run the workload? What supply-chain risks come with that choice?

For Asian markets and enterprises operating in China, Alibaba Cloud's full-stack approach could have practical value. A model, cloud, chip roadmap, and enterprise sales channel inside the same company can make adoption simpler. For global developers, the same setup requires more due diligence around regional data rules, API availability, model policy, English and Korean quality, and ecosystem tooling.

Strong numbers are not the same as independent proof

Qwen3.7-Max arrives with many strong numbers: 35 hours, more than 1,000 tool calls, 128 accelerators, 144 GB of memory, 800 GB/s inter-chip bandwidth, a 25.6 Tbps switch, more than 560,000 Zhenwu chips shipped, and more than 400 external customers across 20 industries. They make the announcement memorable. But most of these numbers come from Alibaba's own announcement. The picture will become clearer only after independent benchmarks and real customer workload reports arrive.

Agent performance is also difficult to measure. A model that scores well on coding benchmarks does not always produce reliable pull requests in large real repositories. A 35-hour endurance claim depends on the task, tool set, retry policy, checkpointing, memory compaction, and the level of human intervention allowed. "More than 1,000 tool calls without performance degradation" also needs a definition of which performance metric is being held steady.
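
One way to pin that definition down is to compare the same metric early and late in a run. The sketch below uses per-call success as the metric and reports the drop between the first and last window of tool calls. It is one possible definition among many, not the one Alibaba used, which the announcement does not specify; the function names and the sample data are hypothetical.

```python
def success_rate(outcomes):
    # outcomes: sequence of booleans, one per tool call, True on success.
    return sum(outcomes) / len(outcomes)


def degradation(outcomes, window):
    # Early-window success rate minus late-window success rate.
    # Positive values mean the agent got worse as the run went on.
    if window <= 0 or len(outcomes) < 2 * window:
        raise ValueError("need at least two non-overlapping windows")
    early = success_rate(outcomes[:window])
    late = success_rate(outcomes[-window:])
    return early - late


if __name__ == "__main__":
    # Hypothetical trace of 1,000 tool calls: flawless for the first 900,
    # then 10 failures among the final 100.
    trace = [True] * 900 + [True] * 90 + [False] * 10
    print(degradation(trace, window=100))
```

For this made-up trace the drop is about 0.1, meaning ten percentage points lower success in the last 100 calls than in the first 100. A vendor claim of "no degradation" becomes testable only once the metric, the window, and the tolerated drop are stated.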

That does not make the numbers irrelevant. Instead, they show where model companies are likely to point next. Beyond single-answer quality and classic academic benchmarks, we should expect more claims about tool-call endurance, long-horizon task success, infrastructure efficiency, and rack-level bandwidth. As agents move toward the center of products, benchmarks will move closer to product operations.

The checkpoints to watch now

The first checkpoint is how Qwen3.7-Max actually ships. Alibaba says global developers will be able to access it through Alibaba Model Studio. Pricing, region availability, rate limits, context limits, tool-calling APIs, function schemas, streaming, logging, and safety policy will determine whether teams can really adopt it.

The second checkpoint is the open-weight roadmap. Even if Max remains a closed frontier model, developer reaction will depend on how quickly smaller Qwen3.7 models appear and how capable they are. Qwen's strength has been its presence across both local and cloud workflows.

The third checkpoint is independent validation of Panjiu AL128 and Zhenwu M890. The 144 GB memory figure, 800 GB/s inter-chip bandwidth, and FP4 support are interesting, but real inference throughput, power efficiency, software stack maturity, compiler quality, and cluster reliability will matter more than launch-stage specifications.

The fourth checkpoint is transparency around Agentic RL and safety governance. Long-running agents may touch customer code, documents, internal APIs, and privileged tools. The model's capability matters, but so do the answers to where data is stored, what feedback is used for improvement, and where the agent is forced to stop.

Conclusion: the real Qwen3.7 story sits outside the model

Qwen3.7-Max is clearly model news. It is Alibaba's latest flagship, aimed directly at coding, reasoning, and long-running agents. But the real meaning of this launch is outside the model. Alibaba placed Qwen3.7-Max on top of Model Studio, Agentic RL, Panjiu AL128, Zhenwu M890, and ICN Switch 1.0.

That is where the AI agent market is heading. Teams will still ask which model is best, but that question is no longer enough. They also need to know which platform can trace a long job to completion, which infrastructure can tolerate thousands of tool calls, which chip roadmap can reduce inference cost, and which cloud provider can explain data and permission boundaries.

Alibaba is trying to answer with a full-stack strategy: its own model, cloud, chip, and interconnect packaged for the agent era. The success of that strategy cannot be judged from an announcement alone. But Qwen3.7-Max leaves a clear message. Agent competition no longer ends inside the model card. If you want an AI system to work for 35 hours, the rack and the chip underneath it become part of the product.