A 75% Discount Becomes the Baseline for DeepSeek V4-Pro
DeepSeek is turning V4-Pro API discount pricing into the new baseline, forcing agent builders to recalculate inference cost and routing strategy.
- What happened: DeepSeek is making the 75% discount price for the
V4-ProAPI the new baseline.- The official pricing page says prices will be adjusted to
1/4of the original level after May 31, 2026 at 15:59 UTC.
- The official pricing page says prices will be adjusted to
- Key numbers: The new prices are
$0.435/Minput cache miss tokens and$0.87/Moutput tokens. - Why it matters: Long-context and repeated-call agent workloads get another lower-cost route for first passes, retries, and bulk analysis.
- Watch: Real cost still depends on cache-hit rate, output ratio, provider markup, quality, retry behavior, and data-policy constraints.
DeepSeek has moved the model pricing conversation again. The DeepSeek API pricing page says the 75% discount promotion for deepseek-v4-pro will end on May 31, 2026 at 15:59 UTC, but the model API price will then be officially adjusted to one quarter of the original price. In other words, the promotion is not simply expiring. The discounted price is becoming the new baseline.
The numbers are small on a page but large inside an agent product. V4-Pro's cache-miss input price is $0.435 per million tokens, and output is $0.87 per million tokens. Cache-hit input drops as low as $0.003625 per million tokens. DeepSeek also says the input cache-hit price for all models was lowered to one tenth of the launch price starting April 26, 2026 at 12:15 UTC.
This is not just another "Chinese model got cheaper" story. For teams building agents, it changes the cost spreadsheet. Agentic workflows read long context, call tools repeatedly, retry after failure, and emit a lot of intermediate output: plans, summaries, patches, test interpretations, log analysis, and explanations of why a previous attempt failed. Token price becomes a product-design constraint. DeepSeek's cut asks model routers and agent runtimes to recalculate which work should go to which model.
This is a baseline change, not a discount ending
Discounts are usually launch mechanics. A vendor lowers the price, attracts usage, and eventually returns to the original rate. The sentence on DeepSeek's page points the other way. After the 75% discount promotion ends, V4-Pro API pricing is formally adjusted to one quarter of its original price.
The strikethrough prices make the change clear. V4-Pro's cache-miss input price was originally displayed at $1.74 per million tokens, while output was $3.48 per million tokens. The current discounted prices are $0.435 and $0.87. The stated one-quarter adjustment makes those discounted numbers the normal price.
| Item | Previous displayed price | New baseline price |
|---|---|---|
| Input cache hit | $0.0145 / 1M tokens | $0.003625 / 1M tokens |
| Input cache miss | $1.74 / 1M tokens | $0.435 / 1M tokens |
| Output | $3.48 / 1M tokens | $0.87 / 1M tokens |
The most visible number is the $0.87 output price. Agents use more output than short-answer chatbots. They write plans, intermediate notes, code changes, test summaries, and recovery paths. When output becomes cheaper, the pressure to make every reasoning step terse becomes slightly less severe.
The cache-hit input price may matter just as much. Agents often reread the same system prompt, tool descriptions, repository summaries, policy documents, and operating instructions. If the cache works well, repeated long inputs become far cheaper. If every request is shaped differently and misses the cache, the lowest number on the pricing page will not describe the bill.
V4-Pro is not only a cheaper model
DeepSeek's pricing page lists V4-Pro with a 1M context length, up to 384K output, JSON output, tool calls, chat prefix completion beta, and FIM completion beta. It supports both thinking and non-thinking modes. That feature list is aimed directly at agent workloads.
A 1M context window can hold a large slice of a codebase, a bundle of documents, a conversation history, or long logs. Tool calls connect the model to search, code execution, database lookups, and business systems. FIM matters for code editing and completion. Thinking mode connects to product experiences that allow longer reasoning before an answer or action.
The meaning of that feature list changes when the price drops. At a premium price, 1M context can be a capability teams use carefully. At a lower price, it becomes a route a model router may try earlier. This does not mean every task should move to V4-Pro. It means builders have one more candidate for long-input and repetitive work where token volume used to be the blocker.
The router question changes
AI product teams are already past the single-model era. They mix fast models, strong models, long-context models, coding models, and sometimes local models. The usual routing rule has been quality first: send difficult work to expensive frontier models, and send classification, extraction, or lightweight summarization to cheaper models.
DeepSeek's price cut makes the cost gradient steeper. Repository-wide code understanding, bulk log summarization, RAG candidate compression, long research drafts, and synthetic-data cleanup are all workloads that can be expensive if they go straight to a premium model. If a low-cost long-context model is reliable enough, a product can use it for the first pass and repeated passes, then reserve the expensive model for final review or hard cases.
The key measurement is not average token price. It is the cost of failure. If a cheaper model needs two retries to reach the same quality, much of the discount disappears. If it misuses a tool and burns external API calls, or produces a patch that takes a human longer to repair, token price alone tells the wrong story. Agent economics has to combine model price with task success rate.
That is why routing needs traces, not just benchmark headlines. A team should know which steps are mostly reading, which steps generate long output, which steps call tools, which steps fail, and which steps are expensive when they fail. V4-Pro's pricing makes some routing experiments newly plausible, but the winner is still decided by full task cost and reliability.
Community attention moves to providers and routing
The early community reaction across r/LLMDevs, r/DeepSeek, r/opencodeCLI, and r/GitHubCopilot is practical. People are asking whether they should use DeepSeek's official API directly, route through aggregators such as OpenRouter, or plug V4-Pro into OpenCode, Claude Code-style workflows, and other coding-agent loops.
That is not just a procurement question. Aggregators are convenient, but provider markup and routing behavior can change the effective price. Some users noticed that not every provider reflected the lower price immediately. Others wanted allowlists or blocklists so an agent could be forced onto an official provider or kept away from a provider they did not trust.
The difference matters for agent products. A one-off chatbot call may not care about a few cents of markup. Evaluation loops, codebase analysis jobs, synthetic-data generation, and long-running research agents can spend hundreds of millions of tokens. A small provider spread becomes a monthly cost issue. Model routing is therefore not only a quality decision. It also includes the actual billing path.
Cheap does not automatically mean standard
Price is powerful, but it does not settle model choice. First, quality distribution matters. Coding, math, general reasoning, tool use, Korean and English document handling, and long-context retention are different capabilities. A model that looks strong on one benchmark may be uneven inside a real agent loop.
Second, operations matter. Concurrency limits, rate limits, incident response, regional availability, account restrictions, and abuse policies all affect production systems. DeepSeek's page lists V4-Pro with a concurrency limit of 500. That may be plenty for individual experiments, but high-volume services still need queuing, fallback, and load-shedding.
Third, data and regulation matter. DeepSeek is a Chinese AI company. Some companies and public-sector organizations may restrict adoption because of data location, legal jurisdiction, security review, or supply-chain risk. A low price does not mean every workload can be sent there.
Fourth, pricing durability matters. The current document says the price will be adjusted to the lower baseline, but the page also tells users to check the latest pricing page because product prices may change. Long-term contracts and product-level cost planning have to include that caveat.
The real condition for savings is caching
The standout number on DeepSeek's table is cache-hit input: $0.003625 per million tokens. But that number only matters when the cache actually hits. If an agent app rebuilds the system prompt, tool schema, rules document, and repository summary differently on every request, cache-hit rate will stay low.
The practical work is therefore architectural. Teams need to separate stable prompt regions, make repeated context cacheable, place changing user input and tool results later in the request, and split long context into reusable summaries and retrieval layers instead of appending everything blindly. Token caching is not just a price-table feature. It is a prompt and context-design problem.
This gives developers a slightly strange message. The model is cheaper, so they can run more experiments. At the same time, to make it truly cheap, they need stricter prompt structure and better context hygiene. Agent cost optimization is a system problem across model selection, prompt caching, retrieval, output budget, retry policy, and provider routing.
In practice, teams should start from real agent traces rather than a spreadsheet alone. Separate successful and failed tasks. Record cache-hit input, cache-miss input, output, tool calls, retries, and human repair time by step. Then the question becomes more concrete: which part of the workflow got cheaper, and which part stayed expensive? A team testing V4-Pro should probably start with high-token, low-failure-cost stages such as long-draft generation, document triage, or bulk candidate summarization, not a full replacement of its hardest premium-model path.
Token-price compression changes the pace of agent experiments
DeepSeek V4-Pro's 75% discount becoming the standard price is an economic signal even before the model-quality debate is settled. Agents consume many tokens. When token prices move, possible product experiences move with them. Longer analysis, larger candidate sets, more frequent eval runs, wider repository scans, and longer research loops become easier to test.
That does not justify a simple conclusion that the cheapest model wins. In an agent product, final cost is token price multiplied by success rate, retry rate, tool-call cost, human review time, and data risk. DeepSeek has lowered the most visible axis, not the whole equation.
But that axis moved enough to matter. If a 75% discount becomes the baseline, expensive frontier models may shift from the default route to a premium path reserved for genuinely difficult stages. The router's question becomes sharper: does this task really need the strongest model first, or is a $0.87-per-million output pass good enough to try before escalating?