The numbers coming out of Xiaomi’s AI lab on June 8, 2026 are the kind that make inference engineers stop and reread: a 1.02-trillion-parameter Mixture-of-Experts model, running on a single standard 8-GPU commodity node, delivering sustained throughput above 1,000 tokens per second, with generation peaks approaching 1,200 tokens per second. For context, that speed was previously associated only with purpose-built silicon such as Groq’s LPU or Cerebras’s wafer-scale chips, both of which require hardware investments that put them out of reach for most organizations.
The model in question is MiMo-V2.5-Pro-UltraSpeed, built jointly by Xiaomi’s MiMo team and TileRT, a GPU inference systems group. What makes the result notable is not just the throughput — it is the hardware on which it was achieved. The 8-GPU node used is the kind of commodity server any team can rent on AWS or Azure today, not custom silicon requiring hundreds of millions in development spend. At a limited API trial price of 3× the standard MiMo-V2.5-Pro rate, users were getting roughly 10× the generation speed — a ratio that changes the economics of running frontier-scale models in production.
Three factors converge to produce the result: MXFP4 quantization applied selectively to the MoE expert layers, a novel block-level speculative decoding system called DFlash, and a persistent GPU execution runtime called TileRT. Together they form an inference stack that Xiaomi claims runs approximately 10× faster than the standard MiMo-V2.5-Pro baseline — and, in Xiaomi’s own comparisons, roughly 15× faster than the response speed users experience from ChatGPT and Claude via API. Those comparisons are vendor-reported and independent replication is pending, but the underlying architecture is not theoretical: Xiaomi has open-sourced the FP4-DFlash checkpoint on Hugging Face, and TileRT has released select modules on GitHub.
What TileRT Actually Does
TileRT is the engine underneath the speed claim. Most inference runtimes work by launching discrete GPU operators sequentially — each attention kernel, each linear projection, each activation function starts, finishes, and hands off to the next. That per-operator launch overhead accumulates across thousands of operations per forward pass, and at the scale of a 1T-parameter model, it becomes a significant fraction of total latency.
TileRT replaces this approach with a Persistent Engine Kernel: a single GPU kernel that stays resident on the hardware throughout the entire forward pass. Instead of launching and retiring thousands of separate operators, the runtime keeps warps active continuously, using Warp Specialization to assign some warps the role of moving data and others the role of computing. The two roles overlap in time: while compute warps are running matrix multiplications, memory warps are already staging the next set of weights. This continuous pipelining reduces idle GPU cycles and brings individual operator execution times into the microsecond range.
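The benefit of overlapping data movement with compute can be illustrated with a toy timing model. This is a sketch only, not TileRT code: the function names and the per-tile timings are assumptions made for illustration, and it models a simple two-buffer pipeline in which the next tile's load runs while the current tile computes.

```python
def serialized_time(load_us, compute_us):
    """Each tile is fully loaded, then computed, with no overlap."""
    return sum(load_us) + sum(compute_us)


def overlapped_time(load_us, compute_us):
    """Two-buffer pipeline: tile i+1 is loaded while tile i is computed."""
    if not load_us:
        return 0.0
    total = load_us[0]  # the very first load cannot be hidden behind compute
    for i, compute in enumerate(compute_us):
        next_load = load_us[i + 1] if i + 1 < len(load_us) else 0.0
        # Each stage lasts as long as the slower of the two overlapping roles.
        total += max(compute, next_load)
    return total


# Hypothetical per-tile timings in microseconds (illustrative, not measured).
loads = [2.0, 2.0, 2.0, 2.0]
computes = [3.0, 3.0, 3.0, 3.0]
print(serialized_time(loads, computes), overlapped_time(loads, computes))
```

In this model, whenever compute takes at least as long as the next load, every load after the first is hidden entirely, which is the effect the memory-warp and compute-warp split is aiming for.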
The practical result is that the gap between theoretical peak GPU FLOPS and realized throughput narrows substantially. In a standard inference stack, a large share of GPU time during decoding can go to waiting on memory transfers or kernel launches rather than computing. TileRT’s persistent kernel design is co-engineered with the quantization and speculative decoding layers above it, so the three components hand off work in patterns the persistent kernel is specifically designed to exploit. This tight co-design is what separates TileRT from a general-purpose inference optimization framework applied after the fact.
The Technical Stack: FP4 Quantization, Speculative Decoding, and MoE
MiMo-V2.5-Pro-UltraSpeed is a Mixture-of-Experts architecture. In a MoE model, each token is routed through only a subset of expert sub-networks rather than the full parameter set — which means the 1.02 trillion total parameters do not all activate for every token. This design is already more efficient than a dense model of equivalent parameter count, but it also creates a specific opportunity for quantization: the expert layers, which dominate the parameter count, are the right place to apply aggressive precision reduction.
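To make the routing idea concrete, here is a minimal pure-Python sketch of top-k expert routing. It is not MiMo's router: the function names, the choice of k, and the renormalization of weights over the selected experts are assumptions that follow one common MoE convention.

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def route_top_k(router_logits, k):
    """Return [(expert_index, weight)] for the k highest-scoring experts."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # renormalize over the chosen experts
    return [(i, probs[i] / norm) for i in top]


def moe_forward(x, router_logits, experts, k=2):
    """Run only the selected experts and mix their outputs by router weight."""
    out = [0.0] * len(x)
    for idx, weight in route_top_k(router_logits, k):
        y = experts[idx](x)  # experts that are not selected are never called
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out
```

Only k of the expert callables run for a given token, which is why total parameter count and per-token compute can diverge so sharply.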
Xiaomi applies MXFP4 — a block-scaled FP4 format — selectively to those MoE expert layers. Other modules, including attention mechanisms and layer norms, retain FP8 precision. The key challenge with FP4 is capability degradation: cutting weights to 4-bit representations typically loses meaningful accuracy. Xiaomi addresses this with Quantization-Aware Training (QAT), fine-tuning the model with quantization simulated during the forward pass so that the weights adapt to the reduced precision. The result, according to Xiaomi’s evaluation, is capability essentially on par with the full-precision original.
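A block-scaled FP4 format can be sketched in a few lines. The code below follows the general shape of the OCP Microscaling convention (32-element blocks, one shared power-of-two scale, 4-bit E2M1 elements); it is an illustrative assumption rather than Xiaomi's implementation, and it rounds to the nearest grid value without the tie-breaking rules real hardware applies.

```python
import math

# Magnitudes representable by a 4-bit E2M1 element (the sign is a separate bit).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 32  # number of elements sharing one scale


def quantize_block(block):
    """Return (scale, codes): a power-of-two scale and signed E2M1 values."""
    max_abs = max(abs(v) for v in block)
    if max_abs == 0.0:
        return 1.0, [0.0] * len(block)
    # Align the largest magnitude with E2M1's top binade (values in [4, 8)).
    scale = 2.0 ** (math.floor(math.log2(max_abs)) - 2)
    codes = []
    for v in block:
        target = min(abs(v) / scale, E2M1_GRID[-1])  # clamp to the max magnitude
        nearest = min(E2M1_GRID, key=lambda g: abs(g - target))
        codes.append(math.copysign(nearest, v))
    return scale, codes


def dequantize_block(scale, codes):
    return [scale * c for c in codes]


def quantize(weights):
    """Split a flat weight list into blocks and quantize each independently."""
    return [quantize_block(weights[i:i + BLOCK_SIZE])
            for i in range(0, len(weights), BLOCK_SIZE)]
```

In quantization-aware training generally, the forward pass uses the dequantized values (often called fake quantization) so that the weights learn to sit near representable points; how Xiaomi wires this into its training loop is not described in the source.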
The speculative decoding system, called DFlash, tackles a different bottleneck: the autoregressive generation loop. In standard autoregressive decoding, each token requires a full forward pass through the model, so 1,000 tokens require 1,000 sequential forward passes. Speculative decoding breaks this constraint by using a smaller draft model to predict multiple tokens ahead, then verifying a batch of predictions in a single forward pass of the large model. DFlash extends this with block-level masked parallel prediction: the draft model fills an entire masked block in one forward pass using Sliding Window Attention, with block size capped at 8. Rejection sampling ensures that the accepted tokens follow the same distribution the large model would have produced autoregressively. The acceptance rates Xiaomi reports are high: 6.30 average accepted tokens per verification round on coding tasks, 5.56 on math and reasoning, and 4.29 on agent tasks. These figures set the ceiling on the throughput gain over a naive autoregressive baseline; the realized gain is somewhat lower, because each round also pays for the draft pass and the verification pass.
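The verification step can be sketched with the standard speculative-sampling acceptance rule. This is the generic textbook rule, not DFlash's actual code: the function name and data layout are assumptions, and the extra "bonus" token that full implementations sample when every drafted token is accepted is omitted for brevity.

```python
import random


def verify_block(draft_tokens, draft_probs, target_probs, rng):
    """Check one drafted block against the target model's distributions.

    draft_probs[i] and target_probs[i] are probability lists indexed by
    token id at position i. Returns the accepted prefix, plus one corrected
    token at the first rejection.
    """
    accepted = []
    for tok, q, p in zip(draft_tokens, draft_probs, target_probs):
        # Accept the drafted token with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)
            continue
        # Rejected: resample from the residual distribution max(0, p - q),
        # which keeps the overall output distributed according to p.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        accepted.append(rng.choices(range(len(p)), weights=residual)[0])
        break  # everything after the first rejection is discarded
    return accepted


rng = random.Random(0)
same = [[0.5, 0.5]] * 3
print(verify_block([0, 1, 0], same, same, rng))  # draft matches target exactly
```

When the draft and target distributions agree, every drafted token is accepted; the further they diverge, the earlier a block is cut short, which is why acceptance rates differ across coding, math, and agent workloads.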
What AI Engineers, MLOps Teams, and Inference Platform Builders Should Do
The MiMo-UltraSpeed result is not just a benchmark headline — it is a technical signal that certain inference optimizations, previously considered research-grade, are now production-ready. Teams building or operating LLM inference infrastructure have concrete decisions to make in response.
1. Audit your current inference stack against persistent kernel runtimes
Most teams running large models in production are using vLLM, TGI, or TensorRT-LLM with default configurations. These frameworks are excellent but rely on operator-level launch patterns that TileRT’s persistent kernel design is engineered to outperform. The right immediate action is benchmarking: obtain the open-sourced TileRT modules, run them on a representative workload, and measure actual throughput and GPU utilization against your current stack. The comparison will tell you whether a migration is worth the engineering cost. For teams serving frontier-scale MoE models — 100B parameters and above — the gap is likely to be significant, because the overhead of per-operator kernel launches scales with model depth.
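A benchmark of this kind needs little more than a timing harness around whatever generation call each stack exposes. The sketch below is deliberately framework-agnostic: `generate` is a hypothetical callable that takes a prompt and returns a list of tokens, standing in for your own client code rather than any real vLLM, TGI, TensorRT-LLM, or TileRT API.

```python
import time


def measure_throughput(generate, prompts, warmup=1, clock=time.perf_counter):
    """Return generated tokens per second over `prompts`.

    `generate(prompt)` must return the list of generated tokens.
    `clock` is injectable so the harness itself can be tested.
    """
    for prompt in prompts[:warmup]:
        generate(prompt)  # warm caches and lazy initialization before timing
    start = clock()
    total_tokens = sum(len(generate(p)) for p in prompts)
    elapsed = clock() - start
    return total_tokens / elapsed
```

Run the same prompt set through each stack, and record GPU utilization alongside the tokens-per-second figure, since a throughput gap with similar utilization points to a different bottleneck than one with idle GPUs.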
2. Evaluate FP4 quantization paths with QAT for your own models
MXFP4 is now supported in recent CUDA and ROCm toolchains, and the Xiaomi QAT approach is documented in their open-sourced checkpoint release. For teams fine-tuning or pre-training their own MoE models, integrating QAT into the training loop adds compute cost upfront but changes the inference economics permanently: 4-bit expert weights take roughly half the memory of their FP8 equivalents, enabling higher batch sizes and better GPU memory utilization. The key decision is whether your use case tolerates the residual capability delta, which, based on Xiaomi’s reporting, is small but non-zero for reasoning-heavy tasks. Start with a targeted evaluation: run your production benchmark suite against a QAT-quantized checkpoint before committing to a full deployment migration.
3. Treat DFlash-style speculative decoding as a standard component, not an experimental one
Block-level speculative decoding with high acceptance rates — 6.30 tokens per verification round for coding tasks — is now demonstrated at 1T-parameter scale. If your inference pipeline is still running naive autoregressive generation for a large model, you are leaving significant throughput on the table. The implementation path is concrete: Xiaomi’s FP4-DFlash checkpoint is on Hugging Face, and the DFlash technique is documented in detail. Adoption does require a compatible draft model and careful rejection sampling implementation, but these are solved engineering problems, not open research questions. The payoff is multiplicative: speculative decoding compounds with quantization and runtime optimizations rather than competing with them.
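A back-of-envelope model helps set expectations before adopting speculative decoding. The sketch below assumes that one verification pass costs about the same as one ordinary decode step and that the draft pass costs some fraction of it; both ratios are hypothetical inputs you would measure yourself, and only the tokens-per-round figures come from Xiaomi's report.

```python
def estimated_speedup(tokens_per_round, draft_cost_ratio, verify_cost_ratio=1.0):
    """Speedup over one-token-per-pass decoding.

    Costs are expressed relative to a single ordinary decode step of the
    large model, which produces exactly one token.
    """
    round_cost = verify_cost_ratio + draft_cost_ratio
    return tokens_per_round / round_cost


# Reported average tokens per verification round, by workload.
reported = {"coding": 6.30, "math and reasoning": 5.56, "agent": 4.29}
for task, tokens in reported.items():
    # 0.2 is an assumed draft cost, not a figure from the source.
    print(task, round(estimated_speedup(tokens, draft_cost_ratio=0.2), 2))
```

The model makes the dependency explicit: a cheaper draft pass or a higher acceptance rate both raise the speedup, and a verification pass that costs more than a plain decode step lowers it.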
The Inference Cost Collapse and What Comes Next
The MiMo-UltraSpeed result fits into a larger pattern that has been accelerating through 2025 and 2026: frontier-scale inference is getting dramatically cheaper on commodity hardware, and the gap between specialized inference silicon and standard GPU clusters is narrowing faster than most roadmaps anticipated. Groq’s LPU achieves 300–750 tokens per second. Cerebras’s wafer-scale system hit 969 tokens per second on Meta’s Llama 3.1 405B, a model roughly 2.5× smaller than MiMo by total parameter count, although as a dense model it activates all of its parameters for every token while MiMo activates only a subset. Xiaomi’s result, if it holds under independent evaluation, puts commodity GPUs in the same throughput range as dedicated inference hardware, without the capital expenditure or the vendor lock-in that custom silicon entails.
The implications compound across the application layer. When 1,000 tokens per second becomes achievable on a rented 8-GPU node, the economic model for agentic AI applications shifts. Long-horizon coding agents, autonomous research pipelines, and real-time document processing loops all become dramatically more affordable to operate. Latency constraints that previously forced architectural compromises — smaller models, aggressive caching, reduced context windows — start to relax. The constraint moves from “can we run this fast enough” to “can we run this correctly enough,” which is a more tractable engineering problem.
The open-source release of the FP4-DFlash checkpoint and select TileRT modules is the part of this story that matters most for the broader ecosystem. Xiaomi is not just publishing a benchmark — it is releasing the artifacts that allow the community to verify, replicate, and build on the approach. If independent evaluations confirm the reported throughput numbers, the combination of MXFP4 quantization, DFlash speculative decoding, and persistent kernel runtimes is likely to become a standard part of the production inference toolkit within the next 12 to 18 months.
Frequently Asked Questions
What is a Mixture of Experts (MoE) model?
A Mixture of Experts model divides the neural network’s feedforward layers into multiple parallel “expert” sub-networks. When a token is processed, a learned routing mechanism activates only a small subset of these experts — typically 2 to 8 out of dozens or hundreds — rather than passing the token through the full network. This means the total parameter count can be extremely large (1.02 trillion in MiMo’s case) while the compute required per token stays manageable, because only a fraction of parameters activate per inference step. MoE is the architecture underlying several frontier models including Mixtral and Google’s Gemini 1.5, and it is specifically why selective quantization of expert layers is so effective: the experts hold most of the parameters and are the natural target for compression without touching the model’s routing or attention mechanisms.
How does FP4 quantization differ from FP8 or INT8?
FP4, FP8, and INT8 all represent a weight or activation value in fewer bits than the standard 32-bit or 16-bit floating-point formats, but they differ in precision and range. FP8 uses 8 bits with a floating-point representation, preserving a wide dynamic range suitable for most model weights with minimal accuracy loss. INT8 stores each value as an 8-bit integer paired with a scale factor, which is efficient for inference but requires careful calibration to avoid clipping high-magnitude values. FP4, specifically the MXFP4 block-scaled format used by Xiaomi, uses only 4 bits per value, roughly halving the memory footprint again compared to FP8. The tradeoff is higher quantization error. The block-scaled design addresses this by applying a shared scaling factor across small blocks of weights, which reduces error from outlier values. Quantization-Aware Training then fine-tunes the model to compensate for the remaining precision loss, yielding quantized layers that take roughly half the memory of their FP8 counterparts, which enables higher batch sizes and makes better use of memory bandwidth.
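The "roughly half" figure can be checked with simple arithmetic once the shared scale is counted. The sketch below assumes the OCP Microscaling layout of one 8-bit scale per 32-element block; the parameter count passed in is MiMo's total of 1.02 trillion, used only as an upper bound, because the source does not state what share of the parameters sits in the expert layers.

```python
def bits_per_weight(element_bits, scale_bits=0, block_size=1):
    """Effective storage per weight, including the amortized shared scale."""
    return element_bits + scale_bits / block_size


def footprint_gb(num_params, bpw):
    """Storage in gigabytes (10**9 bytes) for num_params weights."""
    return num_params * bpw / 8 / 1e9


fp8 = bits_per_weight(8)
mxfp4 = bits_per_weight(4, scale_bits=8, block_size=32)

# Upper-bound illustration: every parameter stored in one format.
total_params = 1.02e12
print(footprint_gb(total_params, fp8), footprint_gb(total_params, mxfp4))
print(mxfp4 / fp8)  # size of MXFP4 storage relative to FP8
```

The shared scale adds a quarter of a bit per weight, so the quantized layers land slightly above half the FP8 size rather than exactly at it.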
What does 1,000 tokens per second actually mean for real-world applications?
At 1,000 tokens per second, a typical 2,000-token response from a frontier AI model completes in about 2 seconds. For comparison, most cloud API endpoints for large frontier models deliver 50–150 tokens per second, meaning the same response takes 13–40 seconds. This difference is not just a user experience improvement — it changes what applications are architecturally feasible. Agentic loops that call a large model dozens of times per task become practical at 1,000 tokens per second; at 50 tokens per second, the latency accumulates to minutes per task cycle. Real-time voice interfaces, streaming code completion in IDEs, and multi-step document analysis pipelines all require sustained throughput that commodity inference stacks have not historically provided for 1T-parameter models. MiMo-UltraSpeed’s result moves that threshold significantly, putting frontier-scale model responsiveness within reach for any team with access to a standard multi-GPU cloud instance.
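The arithmetic behind these figures is easy to reproduce and extend to agentic loops. In the sketch below, the 30-call task is a hypothetical stand-in for "dozens of calls", and the model ignores prompt processing, time to first token, and network latency.

```python
def response_seconds(num_tokens, tokens_per_second):
    """Generation time for a single response."""
    return num_tokens / tokens_per_second


def agent_task_seconds(calls, tokens_per_call, tokens_per_second):
    """Generation time for a task that calls the model sequentially."""
    return calls * response_seconds(tokens_per_call, tokens_per_second)


for rate in (50, 150, 1000):
    single = response_seconds(2000, rate)
    task = agent_task_seconds(30, 2000, rate)  # 30 calls is an assumed workload
    print(f"{rate} tok/s: {single:.1f} s per response, {task / 60:.1f} min per task")
```

Because the calls in an agent loop run one after another, the per-response gap multiplies directly into the per-task gap.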
Sources & Further Reading
- Xiaomi MiMo and TileRT Push a 1-Trillion-Parameter Model Past 1000 Tokens Per Second on Commodity GPUs — MarkTechPost
- Xiaomi MiMo Hits 1,000 Tokens/Sec on 1T Model With 8 GPUs — ChinaBizInsider
- MiMo-V2.5-Pro-UltraSpeed: Pushing 1T-Parameter Model Past 1,000 Tokens/Second — Xiaomi MiMo Blog
- Xiaomi MiMo Hits 1,000 Tokens Per Second Inference — Let’s Data Science
- China’s Xiaomi MiMo Is Now 15X Faster Than ChatGPT and Claude — Decrypt
- MiMo UltraSpeed Hits 1,000 Tokens/Sec on Stock GPUs — ByteIota














