Xiaomi MiMo-V2.5 cuts cache-hit API price 99% with KVCache optimizations

Xiaomi's MiMo-V2.5 series models cut the API price for cache-hit input by 99 percent by compressing KVCache storage to roughly one-seventh that of comparable full-attention solutions, the company said, a claim that challenges the narrative that Chinese AI pricing is driven by loss-leading tactics.

"The inference efficiency of the MiMo-V2.5 series does not stem from a single breakthrough but from multidimensional, coordinated optimizations across the entire stack," Luo Fuli, head of MiMo, said in a technical blog post. "Only then did Hybrid SWA fully realize its architectural advantages in long-context inference."

The optimizations restructure the entire inference stack, from KVCache management and hierarchical caching to scheduling strategies and the prefill-decode pipeline, around an architecture that combines hybrid Sliding Window Attention (SWA) with Mixture-of-Experts and multimodal support. KVCache storage now occupies one-seventh the memory of full-attention alternatives, sharply reducing inference costs in long-sequence scenarios. The system achieves a server cache hit rate of 93 to 95 percent, meaning the vast majority of requests that re-read earlier context require near-zero GPU computation.

The cost breakthrough positions Xiaomi to compete directly with DeepSeek, Zhipu, ByteDance's Doubao, and Alibaba's Tongyi in China's crowded large-model market — without the margin erosion that has characterized the sector's two-year price war. Xiaomi shares traded 2.5 percent higher at the time of the announcement, with a short-selling ratio of 31 percent, signaling active institutional hedging around the stock.

Six engineering pillars, one cost chain

The 99 percent discount applies specifically to the Input (Cache Hit) pricing tier — the portion tied to users re-reading historical context in long conversations. Luo Fuli's technical blog detailed six interconnected optimizations that make the discount sustainable.

First, the model architecture uses Sliding Window Attention (SWA) in 60 of its 70 layers, with each of those layers attending only to the most recent 128 tokens. Only the remaining 10 layers act as full-context "archivists," so at long context lengths the cache is dominated by those 10 layers and shrinks to roughly 10/70, or one-seventh, of a full-attention model's KVCache. Second, the team split KVCache into two independent memory pools, a large pool for the 10 full-attention layers and a small pool for the 60 SWA layers, which the company says allows a single GPU to serve five times as many concurrent users.
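The one-seventh figure can be checked with simple arithmetic. The sketch below is illustrative only: the layer counts and the 128-token window come from the article, while measuring cache size in "token slots per layer" (that is, assuming every layer stores the same number of bytes per cached token) is a simplifying assumption, and the function names are hypothetical.

```python
# Relative KVCache footprint of a hybrid SWA stack versus full attention.
# Layer counts and window size are taken from the article. Counting size in
# "cached token slots" assumes equal per-token cost in every layer.

TOTAL_LAYERS = 70
FULL_LAYERS = 10
SWA_LAYERS = TOTAL_LAYERS - FULL_LAYERS  # 60 sliding-window layers
WINDOW = 128                             # tokens kept by each SWA layer


def kv_slots_full(context_len: int) -> int:
    """Slots needed if all layers cached the whole context."""
    return TOTAL_LAYERS * context_len


def kv_slots_hybrid(context_len: int) -> int:
    """Slots needed when SWA layers keep at most WINDOW tokens."""
    return FULL_LAYERS * context_len + SWA_LAYERS * min(context_len, WINDOW)


def hybrid_ratio(context_len: int) -> float:
    """Hybrid footprint as a fraction of the full-attention footprint."""
    return kv_slots_hybrid(context_len) / kv_slots_full(context_len)


if __name__ == "__main__":
    for n in (128, 4_096, 131_072):
        print(f"{n:>7} tokens: hybrid/full = {hybrid_ratio(n):.4f}")
```

Under these assumptions the ratio is 1.0 for contexts no longer than the window and approaches 1/7 (about 0.143) only as the context grows, which is why the saving is framed around long-sequence scenarios.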

Third, the prefix caching system was upgraded with a "window security length" rule that prevents cache mismatches in SWA mode, pushing real-world hit rates above 93 percent. Fourth, Xiaomi's storage team built a distributed cache called GCache deployed directly on the SSDs inside GPU machines, eliminating the need for a separate storage cluster and its associated monthly costs.

Fifth, a custom scheduling system called LLM-Router performs affinity scheduling, length-based bucketing, and TTFT optimization — routing requests with the same prefix to the same server, separating short and long requests into different channels, and prioritizing cache-heavy requests in the inference queue. Tests showed a 25 percent increase in L2 cache hit rate and a 30 percent reduction in P90 latency for long requests.
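To make the routing ideas concrete, here is a minimal sketch of prefix-affinity routing combined with length-based bucketing. It is not Xiaomi's LLM-Router, whose internals the article does not describe. The server names, the 256-token affinity key, and the 8,192-token bucket boundary are all hypothetical values chosen for illustration, and the TTFT prioritization step is left out.

```python
import hashlib

# All constants below are hypothetical, chosen only for illustration.
SERVERS = ["gpu-0", "gpu-1", "gpu-2", "gpu-3"]
PREFIX_TOKENS = 256      # leading tokens used as the affinity key
LONG_THRESHOLD = 8_192   # requests at or above this length go to the "long" lane


def affinity_server(token_ids: list[int], servers: list[str] = SERVERS) -> str:
    """Send requests that share the same leading tokens to the same server,
    so a prefix cached by an earlier request is likely to be found there."""
    prefix = token_ids[:PREFIX_TOKENS]
    digest = hashlib.sha256(",".join(map(str, prefix)).encode()).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]


def length_bucket(token_ids: list[int]) -> str:
    """Separate short and long requests into different lanes."""
    return "long" if len(token_ids) >= LONG_THRESHOLD else "short"


def route(token_ids: list[int]) -> tuple[str, str]:
    """Return the (server, lane) pair for a tokenized request."""
    return affinity_server(token_ids), length_bucket(token_ids)


if __name__ == "__main__":
    shared = list(range(300))               # e.g. a common system prompt
    print(route(shared + [1, 2, 3]))        # short follow-up
    print(route(shared + list(range(9000))))  # long follow-up, same server
```

A simple modulo over the hash reshuffles most prefixes whenever the server list changes; a production router would more likely use consistent hashing or an explicit prefix index, but the affinity principle is the same.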

Sixth, the model natively supports three-layer Multi-Token Prediction, predicting the next three tokens at once and skipping intermediate computation when predictions are correct. In agentic scenarios, this delivered 2.3x acceleration for the first 128 tokens and 1.5x for tokens 128 to 256.
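The speed-up from multi-token prediction depends on how many drafted tokens survive verification. The article does not describe MiMo's verification rule, so the sketch below shows the generic greedy acceptance step used in speculative decoding: keep drafted tokens until the first disagreement with the main model, then take the main model's token. The function name and token values are hypothetical.

```python
def accept_draft(draft: list[int], verified: list[int]) -> list[int]:
    """Return the tokens emitted by one decoding step.

    draft:    k tokens proposed in one shot (k = 3 for three-layer MTP).
    verified: k + 1 tokens from the main model, where verified[i] is its
              choice given the context plus draft[:i]. The extra last entry
              is the "bonus" token that follows a fully accepted draft.
    """
    if len(verified) != len(draft) + 1:
        raise ValueError("verified must hold exactly one more token than draft")

    emitted: list[int] = []
    for proposed, correct in zip(draft, verified):
        if proposed != correct:
            # First mismatch: keep the main model's token and stop.
            emitted.append(correct)
            return emitted
        emitted.append(proposed)

    # Every drafted token matched, so the bonus token is also valid.
    emitted.append(verified[len(draft)])
    return emitted


if __name__ == "__main__":
    print(accept_draft([5, 6, 7], [5, 6, 7, 8]))  # all three drafts accepted
    print(accept_draft([5, 6, 7], [5, 9, 1, 2]))  # second draft rejected
```

Each step therefore emits between one and k + 1 tokens for a single verification pass, so the gain scales with how often the drafts are right. That fits the reported pattern of a larger speed-up early in agentic outputs than later, although the article does not state the cause.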

Developer ecosystem and competitive stakes

MiMo has launched a 100-trillion-Token Creator Incentive Program that has attracted more than 540,000 applicants, with a cumulative distribution of 100 trillion free tokens valued at more than 65 million yuan. The program aims to deepen developer adoption of the MiMo platform, creating a moat around the model's user base.

The cost structure matters beyond Xiaomi's own P&L. DeepSeek has dragged the entire Chinese AI industry's pricing benchmark to rock-bottom levels, forcing every competitor to either match or justify premiums. Xiaomi's approach — engineering-driven cost reduction rather than subsidy — suggests the company can sustain lower prices where rivals may be burning cash. The company recently disclosed that its profits halved this year while it pours 60 billion yuan into AI investment, making the break-even claim on the price cut a critical signal for investors tracking Xiaomi's capital allocation.

For investors, the question is whether Xiaomi can convert its inference cost advantage into developer market share before competitors replicate the architecture. DeepSeek, Alibaba's Tongyi, and ByteDance's Doubao all have comparable engineering resources and may respond with their own KVCache optimizations. Xiaomi shares trade with a short-selling ratio above 30 percent, suggesting the market remains divided on whether the company's AI bet will pay off against more established rivals.

This article is for informational purposes only and does not constitute investment advice.