
Abstract
- 1,000+ tokens per second confirmed: Peak demos reached 1,200 tokens per second on the 1-trillion-parameter MiMo-V2.5-Pro model, a first at this scale without custom silicon.
- Three-layer engineering: FP4 quantization on expert layers, DFlash speculative decoding, and TileRT’s persistent-core GPU runtime combine to cut latency at every stage.
- 10x faster, 3x the price: The UltraSpeed API costs three times the standard MiMo-V2.5-Pro rate, but delivers roughly ten times the output speed.
- Limited trial June 9–23: Application-based access, enterprise and developer priority, two-week free Chat included with approval.
- Open-source checkpoint released: Xiaomi published the MiMo-V2.5-Pro-FP4-DFlash checkpoint on Hugging Face; TileRT open-sourced select modules on GitHub.
Why 1,000 Tokens Per Second Actually Matters
To understand why this is interesting, you need a reference point. Claude Opus 4.6 lands around 71 tokens per second, the lower-end model, Haiku, touches 98 tokens per second, and Gemini Flash hits 192 tokens per second. MiMo-V2.5-Pro in UltraSpeed mode runs at over 1,000. That is not a marginal improvement. It is a different class entirely.
At that speed, use cases that were previously off the table become viable. Real-time fraud detection, live trading signals, parallel reasoning chains, and multi-agent loops all have hard latency ceilings that standard inference speeds cannot meet. At 1,000 tokens per second, those ceilings come within reach.
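To make the gap concrete, here is a minimal arithmetic sketch that converts the decode rates quoted above into wall-clock time for one response. The 2,000-token response length is an assumption chosen for illustration, and the calculation ignores time to first token and network overhead.

```python
def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to stream num_tokens at a steady decode rate.

    Ignores time to first token, queueing, and network latency.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


# Decode rates quoted in the article, in tokens per second.
RATES = {
    "Claude Opus 4.6": 71,
    "Haiku": 98,
    "Gemini Flash": 192,
    "MiMo UltraSpeed": 1000,
}

RESPONSE_TOKENS = 2000  # assumed response length, for illustration only

timings = {
    name: generation_seconds(RESPONSE_TOKENS, rate)
    for name, rate in RATES.items()
}

if __name__ == "__main__":
    for name, seconds in timings.items():
        print(f"{name:>16}: {seconds:5.1f} s")
```

Under these assumptions, the same response takes roughly half a minute at 71 tokens per second and about two seconds at 1,000, which is the difference between a step that blocks an agent loop and one that does not.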
How They Did It
The speedup comes from three layers: FP4 quantization, DFlash speculative decoding, and the TileRT runtime. FP4 is applied only to the MoE experts, with quantization-aware training (QAT) keeping capability essentially on par. DFlash predicts an entire masked block per forward pass, reaching an average acceptance length of 6.30 on coding tasks. The TileRT runtime then restructures GPU execution with persistent cores and heterogeneous pipelines, eliminating the delay from operator switching and keeping the hardware running at full capacity throughout.
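The acceptance-length figure is easier to read with the verification step in view. The sketch below shows generic greedy block verification as used in speculative decoding in general; it is not DFlash's actual algorithm, and the function names are hypothetical. It also assumes the acceptance length counts the one token the target model always contributes per pass, which the article does not specify.

```python
import math
from typing import Iterable, List, Sequence, Tuple


def accept_block(draft: Sequence[int], target: Sequence[int]) -> List[int]:
    """Greedy verification of one drafted block.

    draft:  k tokens proposed by the drafter.
    target: k + 1 tokens the target model would pick at the same positions,
            where target[i] is its choice given the prefix plus draft[:i].
    Returns the tokens emitted for this single target forward pass.
    """
    if len(target) != len(draft) + 1:
        raise ValueError("target must have exactly one more token than draft")
    emitted: List[int] = []
    for i, tok in enumerate(draft):
        if tok != target[i]:
            break  # first disagreement: discard the rest of the draft
        emitted.append(tok)
    # The target's own token at the first mismatch (or right after a fully
    # accepted block) is always valid, so each pass emits at least one token.
    emitted.append(target[len(emitted)])
    return emitted


def mean_acceptance_length(
    blocks: Iterable[Tuple[Sequence[int], Sequence[int]]]
) -> float:
    """Average number of tokens emitted per target forward pass."""
    lengths = [len(accept_block(d, t)) for d, t in blocks]
    return sum(lengths) / len(lengths)


def target_passes(num_tokens: int, mean_accept: float) -> int:
    """Approximate target forward passes needed to emit num_tokens."""
    return math.ceil(num_tokens / mean_accept)
```

Under that reading, `target_passes(1000, 6.30)` gives the approximate number of expensive forward passes needed for 1,000 output tokens, compared with 1,000 passes for plain one-token-at-a-time decoding.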
The whole thing runs on a single standard 8-GPU node. No custom chips. No specialized hardware. That is the part that matters most: it means the barrier to deploying ultra-fast trillion-parameter inference drops significantly for any team with standard GPU infrastructure.
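A rough weight-memory estimate shows why 4-bit experts matter for fitting a trillion-parameter model on eight GPUs. This is a back-of-the-envelope sketch: the share of parameters held in expert layers (0.95) and the 16-bit precision for everything else are assumptions, not figures from Xiaomi, and the estimate ignores quantization scale factors, KV cache, and activations.

```python
def weight_gigabytes(
    total_params: float,
    expert_fraction: float,
    expert_bits: int = 4,
    other_bits: int = 16,
) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    expert_fraction is the assumed share of parameters in MoE expert layers,
    stored at expert_bits; the remainder is stored at other_bits.
    """
    if not 0.0 <= expert_fraction <= 1.0:
        raise ValueError("expert_fraction must be between 0 and 1")
    expert_bytes = total_params * expert_fraction * expert_bits / 8
    other_bytes = total_params * (1.0 - expert_fraction) * other_bits / 8
    return (expert_bytes + other_bytes) / 1e9


def per_gpu_gigabytes(
    total_params: float, expert_fraction: float, num_gpus: int = 8, **bits: int
) -> float:
    """Weight storage per GPU, assuming an even split across the node."""
    return weight_gigabytes(total_params, expert_fraction, **bits) / num_gpus


if __name__ == "__main__":
    params = 1e12          # 1 trillion parameters, as stated in the article
    expert_share = 0.95    # hypothetical value, for illustration only
    fp4 = per_gpu_gigabytes(params, expert_share)
    bf16 = per_gpu_gigabytes(params, expert_share, expert_bits=16)
    print(f"FP4 experts:    {fp4:.1f} GB per GPU")
    print(f"16-bit experts: {bf16:.1f} GB per GPU")
```

With these assumed inputs, moving the experts from 16 bits to 4 bits cuts the per-GPU weight footprint to well under a third, and the exact figure shifts with whatever the real expert share turns out to be.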
The Trial and the Caveats
Access is gated. The trial window runs June 9 to June 23, by application only, with enterprises and professional developers prioritized. Approved users get a two-week free Chat experience with usage guardrails: 10 queue entries per account per day, 30-minute session caps, and automatic release after 5 minutes idle. The Token Plan is not supported: API trial access only.
Independent third-party speed verification is not public yet. Xiaomi’s own numbers are the primary source. The open-source checkpoint on Hugging Face gives the community a path to verify the claims independently.