
Xiaomi has launched MiMo-V2.5, a major update to its AI model that combines visual, audio, and text understanding in a single system. The company positions this as a significant step forward for AI agents that can both perceive and act on what they see and hear.
The release comes as tech companies race to develop more capable AI systems that can handle multiple types of input beyond just text. MiMo-V2.5’s ability to process up to 1 million tokens of context puts it in competition with frontier models from OpenAI, Google, and others, while its open-source nature sets it apart from many commercial offerings.
Technical architecture and training
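MiMo-V2.5 is built on a 310B-parameter Sparse MoE (Mixture of Experts) architecture, though only 15B parameters are active for any given token during inference. Because each token is routed through a small subset of experts, per-token compute is closer to that of a 15B model, while the full 310B parameters still provide the model's overall capacity.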
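To make the sparse-activation idea concrete, here is a minimal toy sketch of top-k expert routing in plain Python. It is not Xiaomi's implementation: the article does not give the expert count, the number of experts selected per token, or the gating function, so the 8 experts, `k=2`, and the softmax gate below are all illustrative assumptions.

```python
import math

def top_k_route(gate_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    top = sorted(range(len(gate_logits)),
                 key=lambda i: gate_logits[i], reverse=True)[:k]
    # Softmax over the selected logits only, so the k weights sum to 1.
    m = max(gate_logits[i] for i in top)
    exps = [math.exp(gate_logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_forward(x, experts, gate_logits, k):
    """Run only the selected experts and mix their outputs by gate weight."""
    routes = top_k_route(gate_logits, k)
    y = sum(w * experts[i](x) for i, w in routes)
    return y, [i for i, _ in routes]

# Hypothetical toy setup: 8 "experts" that each just scale their input.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
gate_logits = [0.1, 2.0, -1.0, 0.5, 3.0, 0.0, -0.5, 1.0]  # made-up router scores

y, used = moe_forward(1.0, experts, gate_logits, k=2)
print(used)  # indices of the two experts that actually ran

# Figures from the article: 15B active out of 310B total parameters.
ACTIVE_FRACTION = 15 / 310
print(f"{ACTIVE_FRACTION:.1%} of parameters active per token")
```

In this toy, six of the eight experts are never evaluated for the input, which is the same reason a 310B-parameter MoE can run with roughly 15B parameters' worth of compute per token.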
The model went through five distinct training stages:
- Text pre-training on diverse datasets to build the core language capabilities
- Projector warmup to align audio and visual components with the language model
- Large-scale multimodal pre-training on cross-modal data
- Supervised fine-tuning with progressive context expansion from 32K to 256K to 1M tokens
- Reinforcement learning to strengthen perception, reasoning, and agent capabilities
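The five stages above can be written down as a simple schedule. The sketch below is only a way of organizing what the article reports. The `Stage` class and `expansion_factors` helper are hypothetical names, and it assumes "K" means 1,024 tokens and "1M" means 1,024K, which the article does not specify.

```python
from dataclasses import dataclass

K = 1024      # assumption: "K" means 1,024 tokens
M = 1024 * K  # assumption: "1M" means 1,024K tokens

@dataclass(frozen=True)
class Stage:
    name: str
    context_lengths: tuple = ()  # filled in only where the article gives numbers

PIPELINE = [
    Stage("text pre-training"),
    Stage("projector warmup"),
    Stage("multimodal pre-training"),
    Stage("supervised fine-tuning", (32 * K, 256 * K, M)),
    Stage("reinforcement learning"),
]

def expansion_factors(lengths):
    """Growth ratio between each consecutive pair of context lengths."""
    return [b // a for a, b in zip(lengths, lengths[1:])]

sft = PIPELINE[3]
print(expansion_factors(sft.context_lengths))  # [8, 4]
```

Under these assumptions the context window grows 8x in the first expansion step and 4x in the second, so the jump to 1M tokens is reached gradually instead of in a single step.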
The architecture inherits from MiMo-V2-Flash’s hybrid sliding-window attention system but adds dedicated visual and audio encoders developed in-house.
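For readers unfamiliar with sliding-window attention, the sketch below builds the attention mask for a causal sliding window and compares it with full causal attention. It illustrates the general technique only. The article does not describe MiMo's window size or how its hybrid design mixes windowed and other attention layers, so the sequence length and window of 3 are arbitrary toy values.

```python
def sliding_window_mask(seq_len, window):
    """mask[q][k] is True when query position q may attend to key position k.

    A position sees itself and at most the (window - 1) positions before it.
    """
    return [[(k <= q) and (q - k < window) for k in range(seq_len)]
            for q in range(seq_len)]

def attended_pairs(mask):
    """Count the query-key pairs the mask allows."""
    return sum(sum(row) for row in mask)

windowed = sliding_window_mask(seq_len=6, window=3)
full_causal = sliding_window_mask(seq_len=6, window=6)  # window covers everything

for row in windowed:
    print("".join("x" if allowed else "." for allowed in row))

print(attended_pairs(windowed), "pairs with a window of 3")
print(attended_pairs(full_causal), "pairs with full causal attention")
```

With a fixed window, the number of allowed pairs grows linearly with sequence length, while full causal attention grows quadratically. That difference is what makes windowed layers attractive at very long context lengths.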
Performance benchmarks
Xiaomi’s internal testing shows MiMo-V2.5 competing well against established models across coding and agent tasks. On the company’s MiMo Coding Bench, it scored 62.3, matching some frontier models while using fewer computational resources.
The model also performed strongly on multimodal tasks:
- CharXiv RQ (chart analysis): 81.0
- MMMU-Pro (multimodal understanding): 77.9
- Video-MME (video understanding): 87.7
- OmniDocBench (document processing): 87.2
These scores put MiMo-V2.5 roughly on par with models like Claude Sonnet 4.6 and Gemini 3 Pro in various categories, though direct comparisons can be tricky given different testing conditions.
Open source availability and pricing
Unlike many competing models, MiMo-V2.5 is fully open source. Xiaomi has released the complete model weights, tokenizer, and documentation on Hugging Face. Two versions are available: a base model with 256K context and the full version with 1M context support.
For commercial API access, Xiaomi has simplified its pricing structure. MiMo-V2.5 now costs 1 credit per token, while the Pro version costs 2 credits per token. The company has also removed additional charges for using the extended 1M-token context window.
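As a rough illustration of the pricing described above, the sketch below estimates credit usage for a request. It assumes every token is billed at the same per-token rate regardless of whether it is input or output, which the article does not confirm. The tier labels `"standard"` and `"pro"` and the function name are hypothetical, not Xiaomi API identifiers.

```python
# Rates reported in the article; the tier labels are illustrative only.
CREDITS_PER_TOKEN = {"standard": 1, "pro": 2}

def estimate_credits(tier, total_tokens):
    """Estimate credits for a request, assuming a flat per-token rate.

    No long-context surcharge is applied, matching the article's statement
    that extra charges for the 1M-token window were removed.
    """
    if total_tokens < 0:
        raise ValueError("total_tokens must be non-negative")
    return CREDITS_PER_TOKEN[tier] * total_tokens

# A request that fills a 1,000,000-token context:
print(estimate_credits("standard", 1_000_000))  # 1000000
print(estimate_credits("pro", 1_000_000))       # 2000000
```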
Industry implications
MiMo-V2.5’s release highlights the growing importance of multimodal AI systems that can process different types of input simultaneously. Practical AI applications often need to understand real-world scenarios involving images, audio, and text together, which is where this capability matters most.
The open-source approach also continues a trend of major tech companies releasing capable AI models for public use, potentially accelerating development across the industry. However, the computational requirements for running a 310B-parameter model mean practical deployment will still be limited to organizations with significant resources.
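A back-of-the-envelope calculation shows why. The sketch below estimates the memory needed just to hold the weights at a few common numeric precisions. It assumes all 310B parameters must be resident in memory even though only 15B are active per token, and it ignores activations, the KV cache, and runtime overhead. The article does not state which precisions the released weights use.

```python
TOTAL_PARAMS = 310_000_000_000   # from the article
ACTIVE_PARAMS = 15_000_000_000   # from the article

# Bytes per parameter for common storage formats (illustrative choices).
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(num_params, bytes_per_param):
    """Memory for weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

for fmt, nbytes in BYTES_PER_PARAM.items():
    total = weight_memory_gb(TOTAL_PARAMS, nbytes)
    active = weight_memory_gb(ACTIVE_PARAMS, nbytes)
    print(f"{fmt}: {total:.0f} GB for all weights, {active:.1f} GB touched per token")
```

Even at 4-bit precision the weights alone come to roughly 155 GB under these assumptions, well beyond a single consumer GPU, so sparse activation cuts compute per token but not the memory needed to host the model.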
Xiaomi says it’s already working on the next generation of models with improved reasoning and better tool integration, suggesting this competitive space will continue moving rapidly.




