Xiaomi breaks 1,000 tokens per second on trillion-parameter AI model

Xiaomi's MiMo-V2.5-Pro-UltraSpeed hits over 1,000 tokens per second on standard GPUs — 15 times faster than GPT-5.5 — using software alone.

Xiaomi's MiMo-V2.5-Pro-UltraSpeed hits over 1,000 tokens per second on a single 8-GPU commodity node, 15 times faster than GPT-5.5, using no custom silicon — a milestone that reshapes assumptions about inference cost and accessibility.

"Extreme model-system codesign is what makes this possible," the company said in its announcement. Per Artificial Analysis, GPT-5.5 runs at 68 tokens per second and Claude Opus 4.6 at 71, while MiMo-V2.5-Pro matches Opus on coding benchmarks.

The speed comes from two coordinated techniques and the engine that runs them. FP4 quantization shrinks the model's expert layers, which hold most of its 1 trillion parameters, to 4-bit precision, cutting memory footprint while keeping quality loss near zero. DFlash speculative decoding fills a full block of masked positions in a single forward pass, with the model accepting an average of 6.3 out of 8 proposed tokens per verification round in coding tasks. TileRT, the inference engine, keeps the entire pipeline resident on the GPU, eliminating per-operator launch overhead.
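The verification step can be illustrated with a toy example. The sketch below is a generic greedy block-verification loop, not Xiaomi's DFlash implementation, whose internals the announcement does not detail. The names `verify_block` and `toy_target` and the token values are hypothetical. A real engine would score every drafted position in one forward pass, while this loop checks them one at a time for readability.

```python
from typing import Callable, List, Sequence


def verify_block(
    context: Sequence[int],
    draft: Sequence[int],
    target_next: Callable[[Sequence[int]], int],
) -> List[int]:
    """Keep the longest draft prefix the target model agrees with.

    At the first disagreement, the target's own token is emitted instead
    and the rest of the draft is discarded.
    """
    emitted: List[int] = []
    for proposed in draft:
        expected = target_next(list(context) + emitted)
        if proposed != expected:
            emitted.append(expected)  # correction from the target model
            break
        emitted.append(proposed)
    return emitted


def toy_target(tokens: Sequence[int]) -> int:
    """Hypothetical stand-in for a model: always predicts last token + 1."""
    return tokens[-1] + 1


# An 8-token draft whose seventh position is wrong.
draft = [1, 2, 3, 4, 5, 6, 9, 10]
print(verify_block([0], draft, toy_target))  # [1, 2, 3, 4, 5, 6, 7]

# Article figure: 6.3 of 8 proposals accepted per round on coding tasks.
print(f"acceptance rate: {6.3 / 8:.0%}")  # 79%
```

In the toy run, one verification round yields seven tokens instead of one, which is where the speedup comes from: the more drafted tokens survive each round, the fewer sequential passes the large model needs.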

Cerebras hit 969 tokens per second on Meta's Llama 3.1 405B, a model less than half the size, using a wafer-scale chip the size of a dinner plate. Groq's custom LPU architecture runs at 300 to 750 tokens per second. Neither runs on hardware available from standard cloud providers. Xiaomi's approach does. UltraSpeed is priced at 3 times the standard MiMo rate and delivers roughly 10 times the generation speed. The API trial runs June 9 through June 23.

The achievement matters beyond the raw number. At 1,000 tokens per second, applications with hard latency constraints (fraud detection, real-time trading signals, parallel reasoning chains, live agent loops) become viable in ways they were not at 68 tokens per second. MiMo-V2.5-Pro already matched Claude Opus on most coding benchmarks at a fraction of the cost: roughly $0.43 input and $0.87 output per million tokens, compared with $5 and $25, respectively, for Opus.
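The figures quoted so far can be checked with simple arithmetic. The sketch below uses the article's throughput and price numbers. The workload sizes (1 million tokens each way, a 2,000-token response) are hypothetical, and the MiMo prices are those quoted for the standard MiMo-V2.5-Pro, so UltraSpeed pricing is not modeled.

```python
# Prices in US dollars per million tokens, as quoted in the article.
MIMO_PRICE = {"input": 0.43, "output": 0.87}
OPUS_PRICE = {"input": 5.00, "output": 25.00}


def job_cost(price: dict, input_tokens: int, output_tokens: int) -> float:
    """Cost in dollars of one job at the given per-million-token prices."""
    total = price["input"] * input_tokens + price["output"] * output_tokens
    return total / 1_000_000


def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Time to stream a response, ignoring prompt processing and network delay."""
    return output_tokens / tokens_per_second


# Speed multiple, treating 1,000 tokens per second as a floor.
print(f"{1000 / 68:.1f}x faster than 68 tokens per second")  # 14.7x

# Hypothetical workload: 1M input tokens and 1M output tokens.
mimo = job_cost(MIMO_PRICE, 1_000_000, 1_000_000)
opus = job_cost(OPUS_PRICE, 1_000_000, 1_000_000)
print(f"MiMo ${mimo:.2f} vs Opus ${opus:.2f} ({opus / mimo:.0f}x)")  # MiMo $1.30 vs Opus $30.00 (23x)

# Hypothetical 2,000-token response at the two speeds.
print(f"{generation_seconds(2000, 1000):.1f} s")  # 2.0 s
print(f"{generation_seconds(2000, 68):.1f} s")    # 29.4 s
```

At exactly 1,000 tokens per second the multiple is about 14.7, so the "15 times" figure is rounded up or reflects throughput somewhat above 1,000. The latency gap is the more practical number: a response that streams in about two seconds at the higher speed takes nearly half a minute at 68 tokens per second.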

The technical approach is notable for what it does not require. Cerebras designed a wafer-scale chip packing 44 GB of on-chip memory to eliminate the bandwidth bottleneck that slows GPU inference. Groq built a custom Language Processing Unit. Xiaomi used commodity GPUs, the same hardware available on AWS, and solved the problem through model-level optimization and a purpose-built inference engine.

FP4 quantization is surgical: only the expert layers are compressed, while everything else stays at full precision. DFlash skips the sequential drafting step used in standard speculative decoding, proposing an entire block of tokens at once. TileRT ties them together by keeping the compute pipeline continuously resident, removing execution gaps that normally slow generation.
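A rough calculation shows why compressing only the expert layers still removes most of the weight memory. This is a back-of-the-envelope sketch, not Xiaomi's published figures: the 90% expert share and the 16-bit baseline are assumptions (the article says only "most" and "full precision"), and the estimate ignores activations, the KV cache, and any per-block scaling metadata that 4-bit formats typically carry.

```python
def weight_bytes(params: float, bits: int) -> float:
    """Bytes needed to store `params` weights at `bits` bits each."""
    return params * bits / 8


TOTAL_PARAMS = 1e12     # "1 trillion parameters", per the article
EXPERT_FRACTION = 0.90  # ASSUMPTION: the article only says "most"
BASELINE_BITS = 16      # ASSUMPTION: 16-bit weights before quantization

expert_params = TOTAL_PARAMS * EXPERT_FRACTION
other_params = TOTAL_PARAMS - expert_params

# Everything at the baseline precision.
before = weight_bytes(TOTAL_PARAMS, BASELINE_BITS)
# Experts at 4 bits, all other weights left at the baseline precision.
after = weight_bytes(expert_params, 4) + weight_bytes(other_params, BASELINE_BITS)

TB = 1e12  # decimal terabyte
print(f"before: {before / TB:.2f} TB")  # before: 2.00 TB
print(f"after:  {after / TB:.2f} TB")   # after:  0.65 TB
```

Under these assumptions the weights shrink from about 2 TB to about 0.65 TB. The real numbers depend on the actual expert share and baseline format, which the announcement does not specify.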

Xiaomi (01810.HK) has been building AI capability largely outside the industry spotlight. MiMo-V2.5-Pro launched in April, matching frontier models on benchmarks at a fraction of their cost. UltraSpeed accelerates that same model, not a stripped-down version, and the FP4-DFlash checkpoint is already open-sourced on Hugging Face for community testing.

If independent benchmarks confirm the speed claims, Xiaomi will have achieved with software on standard hardware what took Cerebras and Groq hundreds of millions in custom silicon investment. That changes the calculus for which companies can deploy trillion-parameter models in production, and at what cost.

This article is for informational purposes only and does not constitute investment advice.