Devlery
Blog/AI

Qwen3.6-Plus beats Claude on Terminal-Bench, but closes the flagship model

Alibaba released Qwen3.6-Plus for agentic coding with a 1M-token context window, free preview access, and a closed-source strategy that changes how builders should read Qwen.

Qwen3.6-Plus beats Claude on Terminal-Bench, but closes the flagship model
AI 요약
  • What happened: Alibaba released Qwen3.6-Plus on April 2, 2026 as an agentic-coding model with a 1M-token context window.
    • Alibaba reported 61.6 on Terminal-Bench 2.0, ahead of Claude Opus 4.5 at 59.3, and 78.8 on SWE-bench Verified.
  • What changed: Qwen3.6-Plus is Alibaba's third consecutive closed-source model release after Qwen became known for open releases.
  • Builder impact: The preview is free through OpenRouter and Qwen Code, making it attractive as a low-cost sub-agent in multi-agent coding systems.
  • Watch: Independent tests cited by the Korean source reported a 43.3% security score and 26.5% code-reasoning hallucination rate.

Alibaba released Qwen3.6-Plus on April 2. The model is positioned around "Towards Real World Agents" rather than ordinary chat completion: 1M tokens of context, up to 65,536 output tokens, always-on chain-of-thought reasoning, native function calling, and multimodal input for documents, images, and video. Alibaba reported 61.6 on Terminal-Bench 2.0, ahead of Claude Opus 4.5 at 59.3, and 78.8 on SWE-bench Verified, only 2.1 points behind Claude Opus 4.5.

The benchmark line is not the only story. Qwen3.6-Plus is Alibaba's third consecutive closed-source model. Qwen built much of its developer credibility through open releases and more than 600 million downloads, so the flagship model arriving behind an API changes how teams should evaluate the ecosystem. The question for builders is no longer only "how good is Qwen?" It is also "what parts of Qwen can we depend on if the best models are no longer open?"

Qwen's leadership change and strategy shift

The release sits behind a turbulent month for the Qwen team.

On March 4, 2026, Qwen technical lead Junyang "Justin" Lin announced his departure. He was one of the central figures who helped move Qwen from an internal experiment into a global model family. Two colleagues left with him. Bloomberg described the change as part of a broader reorganization as Alibaba put more emphasis on AI revenue.

Digitimes read the move as a sign that Alibaba was reconsidering open source under growing commercial pressure. Qwen3.6-Plus followed a closed image-generation platform and a closed multimodal model, making it the third closed release in three days. Before that run, Alibaba had used open Qwen models to build a developer ecosystem large enough that Airbnb reportedly chose Qwen instead of ChatGPT for some work.

The change therefore looks less like a narrow license decision and more like a monetization signal. Chinese AI companies are under pressure to turn model progress into cloud revenue, enterprise contracts, and sticky platform usage. In that context, the best Qwen model becoming API-first is a business move as much as a technical launch.

The Chinese coding-agent market is also getting crowded. ByteDance has Trae, Baidu has Comate, Tencent has CodeBuddy and QClaw, and Zhipu has CodeGeex. Alibaba's Qwen Code and Qoder put Qwen3.6-Plus into that same competition, with agentic coding as the differentiator.

A model built for agentic coding

SpecDetail
ArchitectureHybrid Gated DeltaNet plus sparse MoE
Context window1,000,000 tokens
Maximum output65,536 tokens, with up to 80K used in evaluation
Reasoning modeAlways-on chain-of-thought
Function callingNative, without prompt-hacking workarounds
New APIpreserve_thinking, which keeps reasoning context across turns
Multimodal inputText plus vision for documents, images, and video
API compatibilityOpenAI chat completions plus Anthropic API protocol
RegionsBeijing, Singapore, and U.S. Virginia

Alibaba says Qwen3.6-Plus uses a hybrid Gated DeltaNet architecture with sparse Mixture-of-Experts routing. The pitch is efficiency: combine linear-attention techniques with sparse expert activation, reduce inference energy use, keep stability, and reach conclusions faster than the Qwen 3.5 generation. Alibaba has not disclosed the parameter count.

The agentic feature set is more important than the parameter count for coding workflows:

  • 1M-token context window: large codebases can be loaded into a single task context.
  • 65,536 output tokens: Terminal-Bench evaluation used up to 80K.
  • Always-on chain-of-thought: reasoning is the default behavior, not an optional mode.
  • Native function calling: tool use is part of the model interface instead of a prompt convention.
  • preserve_thinking API: reasoning content can carry across turns, which is meant to reduce goal drift in long agent tasks.
  • Multimodal processing: the model handles text and vision together, including document parsing, real-world visual analysis, and long-video reasoning.

The API supports OpenAI-compatible chat completions and the Anthropic API protocol. Alibaba says compatibility has already been confirmed with Claude Code, Cline, OpenClaw, Kilo Code, OpenCode, and other coding-agent tools. That matters because model switching becomes a configuration problem rather than a full integration project.

Alibaba is framing Qwen3.6-Plus as the first Qwen model where agentic behavior is a core built-in capability. The official description focuses on a perceive-reason-act loop inside one workflow. For developers, that means Alibaba is trying to compete on sustained coding-agent execution, not only static answer quality.

What the benchmarks say, and what they do not

Benchmark comparison
Terminal-Bench 2.0, higher is better
Qwen3.6-Plus
61.6
No. 1
Claude Opus 4.5
59.3
GLM-5
56.2
Kimi K2.5
50.8
SWE-bench Verified, higher is better
Claude Opus 4.5
80.9
No. 1
Qwen3.6-Plus
78.8
SWE-bench Pro, higher is better
Claude Opus 4.5
57.1
No. 1
Qwen3.6-Plus
56.6

Official Qwen3.6-Plus benchmark results comparing language, coding, math, and multimodal scores

Coding-agent benchmarks

On Terminal-Bench 2.0, Qwen3.6-Plus scored 61.6, 2.3 points ahead of Claude Opus 4.5 at 59.3. The evaluation setup used a three-hour timeout, 32 CPUs, 48GB RAM, 256K context, and five-run averaging. The gap was also visible against GLM-5 at 56.2 and Kimi K2.5 at 50.8.

SWE-bench Verified tells a tighter story. Qwen3.6-Plus scored 78.8, behind Claude Opus 4.5 at 80.9 but closer than previous Qwen models. On SWE-bench Pro, Qwen3.6-Plus reached 56.6 against Claude Opus 4.5 at 57.1. On SWE-bench Multilingual, Qwen3.6-Plus scored 73.8, while Gemini 3 Pro led that benchmark at 77.5.

Multimodal and general performance

The model also posted strong non-coding numbers in the Korean source. Qwen3.6-Plus scored 91.2 on OmniDocBench v1.5, ahead of Claude Opus 4.5 at 87.7, and 85.4 on RealWorldQA, ahead of Claude Opus 4.5 at 77.0. MMMU at 86.0 and Video-MME at 87.8 were slightly behind Gemini 3 Pro.

Third-party evaluations expose weaker spots

BridgeBench and other third-party checks produced a more mixed picture. Throughput was reported at 158 tokens per second, about 1.7 times Claude Opus 4.6's 93.5 tokens per second. The free tier's time to first token, however, was 11,520 milliseconds. That delay is a real constraint for interactive coding loops even when post-start throughput is high.

The more serious concern is security. The Korean source cites a 43.3% security benchmark score, roughly half the level reported for GPT-5.4 Mini at 87.3% and Claude Sonnet 4.5 at 87.2%. It also cites a 26.5% code-reasoning hallucination rate. In agentic coding, security and correctness are not secondary to benchmark rank. A model that writes, edits, and runs code needs guardrails, review, and sandboxing before its output reaches production.

The comparison target is disputed

Alibaba compared Qwen3.6-Plus with Claude Opus 4.5. The Korean source notes that Claude Opus 4.6 was already the newer version at the announcement time. Hacker News commenters criticized the 4.5 comparison as misleading.

That does not erase the Terminal-Bench result, but it narrows what the result proves. Qwen3.6-Plus looks competitive against the comparison set Alibaba chose. Its exact position against the newest frontier models depends on independent evaluations using current model versions.

What it means for developers

ModelAccess pathPriceNote
Qwen3.6-PlusOpenRouterFreePreview period only
Qwen3.6-PlusQwen CodeFree1,000 free calls per day
Claude Opus 4.5Anthropic APIPaid$15 input and $75 output per million tokens
GPT-4oOpenAI APIPaid$2.5 input and $10 output per million tokens

The force of a free preview API

The immediate developer impact is price. Qwen3.6-Plus is available for free on OpenRouter during the preview under the model ID qwen/qwen3.6-plus-preview:free. Qwen Code also provides 1,000 free calls per day. The Korean source says the model handled about 400,000 requests and more than 400 million tokens within two days of release.

That pricing affects agent architecture. A team can reserve an expensive frontier model for the lead agent and use Qwen3.6-Plus for sub-agents that inspect files, draft tests, summarize traces, or explore alternatives. Hacker News discussion around the release focused on exactly that pattern: a model can be below the very top tier and still be useful if its marginal cost is close to zero.

Practical deployment caveats

The free tier should not hide operational risk.

First, a 43.3% security benchmark score means generated code should not be merged without review. That applies even more strongly when the model is allowed to call tools or edit a repository.

Second, a 26.5% code-reasoning hallucination rate can turn complex debugging into extra work. The Korean source cites user reports that the model sometimes ignores instructions and hallucinates more than Sonnet. Those reports do not replace systematic evaluation, but they are consistent with the idea that Qwen3.6-Plus should start as an assisted or sub-agent model, not an unchecked production committer.

Third, the free tier's 11.5-second time to first token can make quick interactive sessions feel slow. Throughput after generation starts may be fast, but a repeated 11-second wait changes the experience of tight edit-test loops.

Compatibility lowers the experiment cost

The positive side is integration. OpenAI and Anthropic API compatibility means teams can test Qwen3.6-Plus in existing agent stacks without rewriting the whole toolchain. Claude Code, Cline, OpenClaw, Kilo Code, and OpenCode support is enough for many builders to run a realistic trial.

Alibaba is also integrating Qwen3.6-Plus into its Wukong platform for enterprise task automation with multiple AI agents. That positions the release as part of a platform strategy, not only a standalone model endpoint.

Community reaction: useful model, contested framing

Hacker News discussion around the release reached about 70 points and more than 30 comments when the Korean source was written.

+Positive reactions
  • Lower cost with usable quality
  • 1M context for larger codebases
  • Potential as a sub-agent under SOTA models
  • More consistent tool calling than Qwen 3.5
-Critical reactions
  • Comparing against Opus 4.5 when 4.6 existed looked misleading
  • Some users saw 15-30 tokens per second in practice
  • Reports of instruction misses and more hallucination than Sonnet
  • The closed-source shift raised criticism of open source as acquisition marketing

The positive case is straightforward: the model is close enough to frontier coding models for many sub-tasks, and the preview price is hard to ignore. Users discussed cost-optimized multi-agent systems, YaRN-based context extension, fewer retries in multi-step agent tasks compared with Qwen 3.5, and improved tool-calling consistency.

The negative case focused on framing and reliability. The comparison against Claude Opus 4.5 drew complaints because Opus 4.6 was already available. Users also reported frequent tool-calling errors and real-world throughput of 15-30 tokens per second on the free tier, far below the official 158 tokens per second throughput figure.

The closed-source shift produced the bluntest criticism. One reaction in the Korean source framed the smaller free models as advertising rather than generosity. That criticism cuts at Qwen's previous trust base: if the open ecosystem was a path to developer adoption, teams now have to ask which future models will remain open and which will move behind Alibaba's commercial interfaces.

Agentic coding becomes a multi-dimensional competition

Qwen3.6-Plus points to four changes in the coding-agent market.

First, agentic coding is becoming a default model requirement. Alibaba is not offering tool use as an afterthought. The model is packaged around long context, persistent reasoning, function calling, and multimodal task execution. Claude Code, Cursor, and Copilot are moving in the same direction, and a free Qwen preview makes the race more aggressive.

Second, the post-preview price will decide much of the market impact. Free access can move developers into the ecosystem, but the long-term effect depends on the production pricing Alibaba chooses after the preview. A free-to-paid conversion is normal cloud strategy; the question is whether the final price still makes Qwen3.6-Plus attractive as a high-volume coding-agent model.

Third, open-source Qwen now carries uncertainty. A leadership departure plus three consecutive closed releases is a clear signal that flagship openness is no longer automatic. Projects and companies depending on open Qwen releases do not need to abandon them immediately, but they should plan for a world where the strongest future Qwen models arrive as APIs first.

Fourth, security and hallucination are becoming central agentic-coding metrics. The 43.3% security score and 26.5% hallucination rate in the Korean source are not small footnotes. When a model can autonomously edit and execute code, a security miss or confident false diagnosis becomes operational risk. Production adoption will depend on model quality, but also on sandboxing, code review, policy enforcement, and evaluation loops.

The coding-agent market is no longer a single leaderboard. Price, security, ecosystem openness, API compatibility, latency, and real-world reliability all matter. Qwen3.6-Plus captures that complexity in one release: a Terminal-Bench win, a closed-source controversy, a free API preview, and security numbers that demand caution.

For builders, the practical reading is narrow. Qwen3.6-Plus is worth testing as a cost-efficient agent or sub-agent, especially where OpenAI or Anthropic API compatibility already exists. It is not a model to trust blindly with production code. Its launch shows that agentic coding is now a platform competition, and Qwen's platform is moving closer to Alibaba's commercial cloud strategy.