Devlery

StepFun Step 3.5 Flash reaches frontier-class scores with 11B active parameters

StepFun Step 3.5 Flash activates only 11B parameters inside a 196B MoE model, posts strong math and coding benchmarks, and ships as Apache 2.0 open weights.

AI summary
  • What happened: StepFun released Step 3.5 Flash, a 196B-parameter sparse MoE model that activates about 11B parameters per token.
    • The company reports 97.3 on AIME 2025, 74.4% on SWE-bench Verified, and 86.4 on LiveCodeBench-V6.
    • The weights are published under Apache 2.0 with Hugging Face, ModelScope, llama.cpp, vLLM, SGLang, and Transformers support.
  • Why builders care: StepFun positions the model near DeepSeek V3.2-level coding performance at roughly one-sixth of the reported inference cost.
  • Architecture: The model routes each token through 8 of 288 routed experts plus one shared expert, with a 256K-token context window.
  • Watch: Community reports praise local speed and instruction following, but also flag hallucinations, long reasoning traces, and occasional reasoning loops.

An open-weight LLM has entered the frontier-efficiency race with a striking claim: a 196B-parameter model can deliver frontier-class math and coding scores while activating only about 11B parameters per token. StepFun, the Shanghai AI lab founded in 2023 by former Microsoft employees, published Step 3.5 Flash on February 12, 2026. The company reports 97.3 on AIME 2025, 74.4% on SWE-bench Verified, and an inference cost around one-sixth of DeepSeek V3.2 for similar software-engineering performance.

The model is not just another dense checkpoint with a permissive license. It is an aggressive sparse Mixture-of-Experts system: 196B total parameters, 288 routed experts per layer, one shared expert, and only the top 8 routed experts selected for each token. StepFun released it under Apache 2.0, which gives commercial teams a path to test, fine-tune, and deploy it without the use restrictions attached to many frontier APIs.

For global AI builders, the question is practical. If the numbers hold across real workloads, Step 3.5 Flash gives teams another option between closed frontier APIs and smaller local models that cannot yet handle hard coding or math tasks. The tradeoff is that community testing already points to familiar reasoning-model failure modes: hallucinated facts, overly long thinking traces, and occasional loops that can erase the latency advantage.

StepFun joins China's open-weight LLM race

StepFun sits inside the group often described in China as the country's "AI Six Tigers." The company was founded in Shanghai in April 2023 and reached unicorn status in its first funding round. In January 2026, it raised a Series B+ round of 5 billion yuan, about $717 million, with investors including Tencent, Qiming Venture Partners, and Shanghai State-owned Capital Investment.

That funding context matters because China's open-weight LLM ecosystem had been defined mainly by two names. DeepSeek proved that sparse MoE models could compete globally on reasoning and coding while keeping inference costs down. Alibaba's Qwen line kept pressure on the dense and open model side, especially for developers who wanted strong multilingual and coding coverage. Step 3.5 Flash gives the ecosystem a third major contender with a different efficiency profile.

The contrast with DeepSeek is the cleanest way to understand StepFun's claim. DeepSeek V3.2 uses a much larger 671B-parameter MoE with about 37B active parameters. Step 3.5 Flash uses 196B total parameters and activates about 11B. In active-parameter terms, StepFun is trying to reach a similar performance band with roughly one-third of DeepSeek's active compute budget.

The sparse MoE shape

Figure: Step 3.5 Flash MoE architecture flow. 196B total parameters (45 layers · 4,096 hidden dimension · 256K context) → per-layer router that routes each input token across 288 routed experts plus 1 shared expert → 1 shared expert (always active for every token) plus top-8 routed experts (only 8 selected from 288 per token) → 11B active parameters (about 5.6% of total parameters · roughly one-third of DeepSeek's active count).

Step 3.5 Flash has 45 layers, a 4,096 hidden dimension, and a 256K-token context window. Each layer includes 288 routed experts and one shared expert. The shared expert stays active for every token, while the router chooses 8 of the 288 routed experts token by token. That is the mechanism behind the 11B active-parameter figure.
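The routing step described above can be sketched in a few lines. This is a toy illustration, not StepFun's implementation: the softmax gating over the selected logits, the scalar "experts", and every function name here are assumptions made for the example. Only the counts (288 routed experts, top 8, one shared expert) come from the article.

```python
import math
import random

NUM_ROUTED = 288   # routed experts per MoE layer (from the article)
TOP_K = 8          # routed experts selected per token (from the article)

def route_token(router_logits, top_k=TOP_K):
    """Return (expert_index, weight) pairs for the top_k routed experts.

    Softmax over the selected logits is an assumption for illustration;
    the article does not say how StepFun normalizes router scores.
    """
    order = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    top = order[:top_k]
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_layer(x, router_logits, routed_experts, shared_expert):
    """Add the always-on shared expert to the weighted top-k routed experts."""
    out = shared_expert(x)
    for idx, weight in route_token(router_logits):
        out += weight * routed_experts[idx](x)
    return out

# Toy scalar "experts" (hypothetical): expert i scales its input by (i + 1) / NUM_ROUTED.
random.seed(0)
routed_experts = [lambda x, s=(i + 1) / NUM_ROUTED: s * x for i in range(NUM_ROUTED)]
shared_expert = lambda x: 0.5 * x
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_ROUTED)]

selected = route_token(logits)
print(len(selected), "of", NUM_ROUTED, "routed experts ran for this token")
print(moe_layer(2.0, logits, routed_experts, shared_expert))
```

The other 280 routed experts are never called for this token, which is where the gap between total and active parameters comes from.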

The attention stack is also tuned for efficiency. StepFun mixes Sliding-Window Attention and Full Attention in a 3:1 ratio. Most layers use the cheaper sliding-window pattern for local context, while strategically placed full-attention layers preserve long-range dependency handling. The model also raises sliding-window query heads to 96 instead of the more common 64, increasing representational capacity inside the local attention window.
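A minimal sketch of what a 3:1 mix could look like across 45 layers follows. The exact placement of full-attention layers and the window size are not given in the article, so the repeating "three sliding-window layers, then one full layer" pattern and the `window` value below are assumptions for illustration only.

```python
NUM_LAYERS = 45  # from the article

def attention_schedule(num_layers=NUM_LAYERS, swa_per_full=3):
    """Assumed layout: every (swa_per_full + 1)-th layer uses full attention."""
    period = swa_per_full + 1
    return ["full" if (i + 1) % period == 0 else "swa" for i in range(num_layers)]

def can_attend(query_pos, key_pos, kind, window):
    """Causal visibility rule for one query/key pair."""
    if key_pos > query_pos:
        return False                      # causal: never look at future tokens
    if kind == "full":
        return True                       # full attention sees the whole prefix
    return query_pos - key_pos < window   # sliding window: only the last `window` tokens

schedule = attention_schedule()
print(schedule[:8])
print(schedule.count("swa"), "sliding-window layers,", schedule.count("full"), "full-attention layers")
print(can_attend(10, 2, "swa", window=4), can_attend(10, 2, "full", window=4))
```

Under this assumed layout a token eight positions back is invisible to a sliding-window layer with a window of four, but still reachable through the next full-attention layer, which is the long-range role the article describes.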

StepFun did not build the model as pure sparse MoE from top to bottom. The paper describes a hybrid layout that mixes sparse MoE blocks with dense layers at selected depths. That choice is meant to stabilize training and preserve representation quality while still getting most of the inference savings from sparse routing.

The benchmark profile

Selected benchmark comparison
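| Model | Total / active | AIME 2025 | SWE-bench | LiveCode-V6 | Cost | License |
| --- | --- | --- | --- | --- | --- | --- |
| Step 3.5 Flash | 196B / 11B | 97.3 | 74.4% | 86.4 | $0.10 / 1M | Apache 2.0 |
| DeepSeek V3.2 | 671B / 37B | - | ~75% | - | ~$0.60 / 1M | MIT |
| Qwen 3 72B | 72B / 72B | - | ~60% | - | ~$0.30 / 1M | Apache 2.0 |
| Llama 4 Maverick | 400B / 17B | - | ~70% | - | ~$0.15 / 1M | Llama 4 |

SWE-bench Verified basis. Cost figures are approximate API prices cited in the source article. A dash marks a value the comparison does not list.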

The math results are the first signal. StepFun reports 97.3 on AIME 2025, 98.4 on HMMT 2025 February, 94.0 on HMMT 2025 November, and 85.4 on IMOAnswerBench. The model also topped MathArena in the source article's snapshot. Those scores explain why StepFun frames the model as a reasoning system rather than a general lightweight chat model.

The coding numbers are the second signal. SWE-bench Verified at 74.4% puts Step 3.5 Flash close to the roughly 75% band cited for DeepSeek V3.2, while LiveCodeBench-V6 at 86.4 and Terminal-Bench 2.0 at 51.0% make it relevant for agentic software-engineering workflows. The comparison is not just score versus score. It is score per active parameter and score per dollar.
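To make "score per active parameter and score per dollar" concrete, the sketch below divides the SWE-bench figures from the comparison above by active parameters and by price. The approximate ("~") values are treated as exact, and these ratios are a rough illustration rather than an established metric.

```python
# Figures copied from the comparison in this article; "~" values are treated as exact.
MODELS = {
    "Step 3.5 Flash":   {"swe_bench": 74.4, "active_b": 11, "usd_per_1m": 0.10},
    "DeepSeek V3.2":    {"swe_bench": 75.0, "active_b": 37, "usd_per_1m": 0.60},
    "Qwen 3 72B":       {"swe_bench": 60.0, "active_b": 72, "usd_per_1m": 0.30},
    "Llama 4 Maverick": {"swe_bench": 70.0, "active_b": 17, "usd_per_1m": 0.15},
}

def efficiency(row):
    """Return (points per billion active parameters, points per dollar per 1M tokens)."""
    return row["swe_bench"] / row["active_b"], row["swe_bench"] / row["usd_per_1m"]

for name, row in MODELS.items():
    per_param, per_dollar = efficiency(row)
    print(f"{name:<17} {per_param:5.2f} pts per B active   {per_dollar:6.1f} pts per $/1M")
```

On these cited numbers Step 3.5 Flash leads both ratios even though its raw SWE-bench score sits slightly below DeepSeek V3.2.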

The agent benchmarks fit the same story. StepFun reports 88.2 on tau2-Bench, 51.6 on BrowseComp, 69.0 on BrowseComp when using a context manager, and 84.5 on GAIA without files. The model is designed for orchestration across more than 80 MCP tools and supports integrations with Claude Code and OpenClaw. That positioning matters because the practical market for a fast open-weight reasoning model is increasingly tool use, repository editing, browser work, and automation.

MTP-3 and MIS-PO

StepFun points to several training and inference choices behind the benchmark profile. The most visible one is MTP-3, a three-way Multi-Token Prediction setup. Instead of predicting only the next token, Step 3.5 Flash predicts four tokens in parallel: the usual next token plus three additional lookahead tokens. In the source article, this is tied to generation speeds of 100 to 300 tokens per second, with peaks near 350 tokens per second in cloud API settings.
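Predicting four tokens per pass does not automatically mean a fourfold speedup. The sketch below assumes the extra tokens are verified the way speculative decoding verifies drafts, and that each one is accepted independently with the same probability; neither assumption comes from the article, and `accept_prob` is a hypothetical parameter.

```python
def expected_tokens_per_step(accept_prob, draft_tokens=3):
    """Expected tokens emitted per forward pass.

    The next token is always kept. Draft i only counts if drafts 1..i are
    all accepted, which happens with probability accept_prob ** i under
    the independence assumption stated above.
    """
    return 1.0 + sum(accept_prob ** i for i in range(1, draft_tokens + 1))

for p in (0.0, 0.5, 0.8, 1.0):
    print(f"acceptance {p:.1f} -> {expected_tokens_per_step(p):.3f} tokens per pass")
```

Under these assumptions the gain ranges from none (every draft rejected) to four tokens per pass (every draft accepted), so real throughput depends heavily on how predictable the text is.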

The second named method is MIS-PO, short for Metropolis Independence Sampling Filtered Policy Optimization. StepFun describes it as a reinforcement-learning framework that replaces conventional importance weighting with stricter sample filtering. The target problem is gradient instability in long reasoning sequences. That is directly relevant to math and coding because both tasks often require long chains of intermediate reasoning before the final answer or patch.
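The difference between weighting and filtering can be shown on three toy samples. This is not MIS-PO itself: the acceptance band, the per-sample fields, and both objective functions are assumptions chosen only to illustrate why dropping off-policy samples can be steadier than scaling them.

```python
import math

def importance_ratio(logp_new, logp_old):
    """Probability ratio between the current policy and the one that sampled the data."""
    return math.exp(logp_new - logp_old)

def weighted_objective(samples):
    """Conventional off-policy estimate: every sample counts, scaled by its ratio."""
    total = sum(importance_ratio(s["logp_new"], s["logp_old"]) * s["advantage"] for s in samples)
    return total / len(samples)

def filtered_objective(samples, low=0.8, high=1.25):
    """Filtering variant (band is hypothetical): keep in-band samples, weight them equally."""
    kept = [s for s in samples if low <= importance_ratio(s["logp_new"], s["logp_old"]) <= high]
    if not kept:
        return 0.0, 0
    return sum(s["advantage"] for s in kept) / len(kept), len(kept)

samples = [
    {"logp_new": -1.0, "logp_old": -1.1, "advantage": 1.0},   # ratio about 1.1
    {"logp_new": -2.0, "logp_old": -1.9, "advantage": -0.5},  # ratio about 0.9
    {"logp_new": -0.5, "logp_old": -3.5, "advantage": 2.0},   # ratio about 20, far off-policy
]

print(weighted_objective(samples))
print(filtered_objective(samples))
```

In the weighted version a single sample with a ratio near 20 dominates the estimate. The filtered version simply discards it, which is the kind of variance control that matters when long reasoning sequences make extreme ratios more likely.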

The architecture also includes head-wise gated attention, truncation-aware value bootstrapping, and routing confidence monitoring. These features are less marketable than "11B active parameters," but they are the pieces that decide whether a sparse MoE model behaves reliably under long context and multi-step tool use.

Cost and local deployment

Inference cost and generation speed comparison, Step 3.5 Flash = 1x

| Model | Price per 1M tokens | Generation speed | Relative cost |
| --- | --- | --- | --- |
| Step 3.5 Flash | $0.10 | 100-350 tok/s | 1x |
| Llama 4 Maverick | ~$0.15 | 80-200 tok/s | 1.5x |
| Qwen 3 72B | ~$0.30 | 30-80 tok/s | 3x |
| DeepSeek V3.2 | ~$0.60 | 50-150 tok/s | 6x |

Approximate OpenRouter-style API pricing and cloud generation-speed ranges cited in the source article.

The source article cites OpenRouter availability at about $0.10 per 1M tokens, compared with roughly $0.60 per 1M tokens for DeepSeek V3.2. For agent workflows, that ratio matters more than it does for single chat prompts. A repository editing agent can burn through context by reading files, producing plans, calling tools, revising patches, and validating tests. A sixfold cost gap changes which jobs are worth automating.
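A back-of-the-envelope calculation shows how the gap compounds over an agent run. The workload below (200 steps of 20,000 tokens each) is hypothetical, and a single flat per-token price is a simplification, since real APIs usually bill input and output tokens at different rates.

```python
# Approximate USD prices per 1M tokens, as cited in the article.
PRICE_PER_1M = {"Step 3.5 Flash": 0.10, "DeepSeek V3.2": 0.60}

def run_cost(steps, tokens_per_step, usd_per_1m):
    """Cost of one agent run if every step bills tokens_per_step tokens at a flat rate."""
    return steps * tokens_per_step * usd_per_1m / 1_000_000

# Hypothetical workload: 200 tool-calling steps that each process 20,000 tokens.
for model, price in PRICE_PER_1M.items():
    print(f"{model}: ${run_cost(200, 20_000, price):.2f} per run")
```

At thousands of runs per day, the difference between the two per-run totals is what decides whether a job is worth automating.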

Local deployment is plausible but not casual. The INT4 GGUF variant is cited at roughly 111.5GB, putting it within reach of machines such as a Mac Studio M4 Max with 128GB unified memory, NVIDIA DGX Spark, or AMD AI Max+ 395 systems. That is still expensive hardware. The practical local-use case is not "everyone runs this on a laptop"; it is teams with privacy, latency, or offline requirements getting a frontier-adjacent model without a server GPU cluster.
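The cited file size can be sanity-checked with simple arithmetic. The sketch below assumes decimal gigabytes and treats the memory budget and the reserve for KV cache and the operating system as hypothetical inputs; the article gives neither.

```python
TOTAL_PARAMS = 196e9   # from the article
GGUF_SIZE_GB = 111.5   # cited INT4 GGUF size; decimal gigabytes assumed

def weights_gb(params, bits_per_weight):
    """Size of the weights alone at a uniform bit width."""
    return params * bits_per_weight / 8 / 1e9

def implied_bits_per_weight(size_gb, params):
    """Average bits per parameter implied by a file size."""
    return size_gb * 1e9 * 8 / params

def fits(size_gb, memory_gb, reserve_gb):
    """True if the weights plus a reserve for KV cache and the OS fit in memory."""
    return size_gb + reserve_gb <= memory_gb

print(weights_gb(TOTAL_PARAMS, 4))
print(round(implied_bits_per_weight(GGUF_SIZE_GB, TOTAL_PARAMS), 2))
print(fits(GGUF_SIZE_GB, memory_gb=128, reserve_gb=12))  # hypothetical budget and reserve
```

A uniform 4-bit encoding would need about 98GB, so the cited size implies roughly 4.55 bits per parameter on average, which is consistent with quantization formats that store scales or keep some tensors at higher precision. Long contexts increase the reserve needed for the KV cache, so headroom shrinks as prompts grow.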

The runtime support list is broad: llama.cpp, vLLM, SGLang, and Transformers are all named. The model is available through Hugging Face and ModelScope, with INT4, INT8, FP8, and BF16 formats. The article also cites 96,535 monthly Hugging Face downloads at the time of writing, which suggests developer curiosity beyond a launch-day spike.

Community reports are mixed

The Hacker News response described in the Korean article splits into two camps. Positive reports focus on local performance and instruction following. One user reported 36 tokens per second generation and 300 tokens per second prompt processing on an M1 Ultra. Another said the model's instruction following and output quality beat most leading models they had tested, including comparisons with Opus 4.5 on word problems and reasoning tests.

The negative reports are just as important for deployment decisions. Several users reported hallucinations on fact-heavy questions. One example involved Pokemon championship deck information, where the model reportedly invented details. That failure mode is common in reasoning-optimized models: they can perform well when the answer can be derived from the prompt, but fail when the task requires reliable stored knowledge.

Long reasoning traces are another reported issue. A simple HTML coding prompt can trigger a large amount of thinking text, so high tokens-per-second does not necessarily mean low wall-clock latency. One user summarized the operational problem clearly: excessive reasoning can offset the speed advantage.
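The effect is easy to quantify. The sketch below uses the M1 Ultra speeds reported above (300 tokens per second for prompt processing, 36 for generation); the prompt, answer, and thinking token counts are hypothetical.

```python
def wall_clock_seconds(prompt_tokens, thinking_tokens, answer_tokens, prefill_tps, decode_tps):
    """Time to the end of the answer: prompt processing plus all generated tokens."""
    prefill = prompt_tokens / prefill_tps
    decode = (thinking_tokens + answer_tokens) / decode_tps
    return prefill + decode

# Speeds from the community report; token counts are made up for illustration.
short = wall_clock_seconds(1_500, 0, 400, prefill_tps=300, decode_tps=36)
long = wall_clock_seconds(1_500, 6_000, 400, prefill_tps=300, decode_tps=36)
print(f"no thinking:   {short:.1f} s")
print(f"6,000 thinking tokens: {long:.1f} s")
```

With these inputs the same 400-token answer takes about 16 seconds without a reasoning trace and about three minutes with one, even though tokens per second never changed.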

There are also reports of infinite or near-infinite reasoning loops. For an API user, that is a cost and latency risk. For an agent connected to tools, it can become a reliability risk because the model may never reach the tool call or final answer stage. Any production trial should set explicit token budgets, reasoning limits, and timeout behavior before giving the model write access to files, tickets, or external systems.
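One way to enforce those limits is a guard around the token stream. The sketch below is runtime-agnostic: `token_iter` stands in for whatever streaming iterator the serving stack provides, and the default budget, timeout, and repetition thresholds are hypothetical values that would need tuning per task.

```python
import itertools
import time
from collections import Counter

def guarded_stream(token_iter, max_tokens=2000, timeout_s=30.0, ngram=8, max_repeats=3):
    """Collect tokens, stopping early on budget, timeout, or a repeated n-gram."""
    tokens, seen = [], Counter()
    deadline = time.monotonic() + timeout_s
    for tok in token_iter:
        tokens.append(tok)
        if len(tokens) >= max_tokens:
            return tokens, "token_budget"
        if time.monotonic() > deadline:
            return tokens, "timeout"
        if len(tokens) >= ngram:
            key = tuple(tokens[-ngram:])
            seen[key] += 1
            if seen[key] >= max_repeats:   # same n-gram seen max_repeats times
                return tokens, "loop_detected"
    return tokens, "completed"

# Simulated reasoning loop: the same five tokens repeat forever.
looping = itertools.cycle(["let", "me", "reconsider", "that", "."])
tokens, reason = guarded_stream(looping)
print(reason, len(tokens))

print(guarded_stream(iter(["final", "answer"]))[1])
```

The n-gram check is deliberately crude and can misfire on legitimately repetitive output such as generated code or tables, so a production guard would likely pair it with the hard token and time limits rather than rely on it alone.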

What this says about MoE efficiency

Key figures: 196B → 11B (about 5.6% active parameters) · 96,535 monthly Hugging Face downloads cited in the article · Apache 2.0 (permissive commercial use)

Step 3.5 Flash reinforces a clear design trend: frontier-adjacent performance is moving from "how many total parameters can we train?" toward "how few parameters can we activate per token?" DeepSeek V3 showed the practical value of a 671B/37B MoE. Meta's Llama 4 Maverick is cited at 400B/17B. NVIDIA's Nemotron 3 Super is cited at 120B/12B. StepFun now pushes the active count to 11B while still claiming competitive math and coding scores.
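The cited total and active counts can be compared directly. The sketch below only does the division; all four pairs are the figures quoted in the paragraph above.

```python
# (total parameters, active parameters) in billions, as cited in the article.
MOE_MODELS = {
    "DeepSeek V3": (671, 37),
    "Llama 4 Maverick": (400, 17),
    "Nemotron 3 Super": (120, 12),
    "Step 3.5 Flash": (196, 11),
}

def active_fraction(total_b, active_b):
    """Share of parameters used per token."""
    return active_b / total_b

for name, (total_b, active_b) in MOE_MODELS.items():
    print(f"{name:<17} {active_b:>3}B of {total_b}B active = {active_fraction(total_b, active_b):.2%}")

print(f"Step 3.5 Flash vs DeepSeek active count: {11 / 37:.2f}")
```

Step 3.5 Flash has the smallest active count in this group, though not the smallest active fraction: on the cited figures Llama 4 Maverick activates a lower share of its total parameters.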

That trend changes the product surface for open models. Dense models still matter, especially for simpler deployment and predictable behavior. But sparse MoE systems increasingly define the cost-performance frontier for high-end open weights. They let a model store broad capacity across many experts while paying for a much smaller subset during inference.

StepFun's release also adds diversity to China's open-weight ecosystem. DeepSeek and Qwen already gave global developers permissive alternatives to closed U.S. frontier APIs. Step 3.5 Flash adds another Apache 2.0 model with explicit agent and tool-use positioning. That matters for teams that want to avoid single-vendor dependence or run evaluations across multiple open-weight providers.

The caution is that benchmark strength is not the same as operational maturity. The community reports on hallucination, verbose reasoning, and loops are not edge details; they are exactly the failure modes that can make an agent expensive or unsafe. Step 3.5 Flash should be evaluated with task-specific prompts, retrieval grounding for factual work, strict output budgets, and sandboxed tool access.

The technical signal is still meaningful. An 11B-active sparse MoE model reaching this benchmark band weakens the old assumption that higher total parameter count is the main proxy for capability. The more useful question for builders is now narrower: how well does the router spend active compute on the task in front of it, and how reliably does the model stop when it has done enough?

