Xiaomi launches MiMo-V2.5 with native visual and audio understanding

April 28, 2026

Xiaomi has launched MiMo-V2.5, a major update to its AI model that combines visual, audio, and text understanding in a single system. The company positions the release as a significant step toward AI agents that can both perceive and act on what they see and hear.

The release comes as tech companies race to develop more capable AI systems that can handle multiple types of input beyond just text. MiMo-V2.5’s ability to process up to 1 million tokens of context puts it in competition with frontier models from OpenAI, Google, and others, while its open-source nature sets it apart from many commercial offerings.

Technical architecture and training

MiMo-V2.5 is built on a 310B-parameter Sparse MoE (Mixture of Experts) architecture, though only 15B parameters are active during inference. This design helps balance performance with computational efficiency.
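To put the sparse design in numbers, the short sketch below works out what share of the model is active during inference, using only the two parameter counts quoted above. The memory estimate assumes 2 bytes per parameter (16-bit weights); that is an assumption for illustration, not a figure from Xiaomi.

```python
# Parameter counts as reported in the article.
TOTAL_PARAMS = 310e9
ACTIVE_PARAMS = 15e9

# ASSUMPTION: weights stored in a 16-bit format (2 bytes each).
BYTES_PER_PARAM = 2


def active_fraction(total: float, active: float) -> float:
    """Share of parameters that take part in inference."""
    return active / total


def weight_memory_gb(params: float, bytes_per_param: int) -> float:
    """Approximate storage for the weights alone, in gigabytes (1e9 bytes)."""
    return params * bytes_per_param / 1e9


fraction = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)
total_gb = weight_memory_gb(TOTAL_PARAMS, BYTES_PER_PARAM)

print(f"Active share: {fraction:.1%}")        # Active share: 4.8%
print(f"Weights at 16-bit: {total_gb:.0f} GB")  # Weights at 16-bit: 620 GB
```

Under that assumption, compute scales with the roughly 5% of parameters that are active, but storage scales with the full 310B, since every expert has to be available for routing.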

The model went through five distinct training stages:

1. Text pre-training on diverse datasets to build the core language capabilities
2. Projector warmup to align audio and visual components with the language model
3. Large-scale multimodal pre-training on cross-modal data
4. Supervised fine-tuning with progressive context expansion from 32K to 256K to 1M tokens
5. Reinforcement learning to strengthen perception, reasoning, and agent capabilities
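The five stages can be written down as a simple ordered structure. The sketch below is only a way of organizing what the article reports, not Xiaomi's training configuration. The names `Stage`, `STAGES`, and `SFT_CONTEXT_SCHEDULE` are hypothetical, and the token counts assume "K" means 1,024 and "1M" means 1,024K, which the article does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    goal: str


# Stage order and goals as described in the article.
STAGES = [
    Stage("text_pretraining", "build core language capabilities"),
    Stage("projector_warmup", "align audio and visual components with the language model"),
    Stage("multimodal_pretraining", "train on large-scale cross-modal data"),
    Stage("supervised_finetuning", "expand context progressively"),
    Stage("reinforcement_learning", "strengthen perception, reasoning, and agent capabilities"),
]

# ASSUMPTION: 32K, 256K and 1M interpreted as multiples of 1,024 tokens.
SFT_CONTEXT_SCHEDULE = [32 * 1024, 256 * 1024, 1024 * 1024]


def growth_factors(schedule: list[int]) -> list[float]:
    """Ratio between each context length and the one before it."""
    return [later / earlier for earlier, later in zip(schedule, schedule[1:])]


for index, stage in enumerate(STAGES, start=1):
    print(f"{index}. {stage.name}: {stage.goal}")

print(growth_factors(SFT_CONTEXT_SCHEDULE))  # [8.0, 4.0]
```

Read this way, the context window grows eightfold in the first expansion step and fourfold in the second.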

The architecture inherits from MiMo-V2-Flash’s hybrid sliding-window attention system but adds dedicated visual and audio encoders developed in-house.
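For readers unfamiliar with sliding-window attention, the sketch below shows the general idea: each position attends only to a fixed number of recent positions rather than to the whole history. The article does not give MiMo-V2.5's window size or say how windowed and full-attention layers are combined, so the values here are hypothetical and the code illustrates the generic technique, not Xiaomi's implementation.

```python
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True when position i may attend to position j.

    Position i sees itself and the (window - 1) positions before it.
    """
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def full_causal_mask(seq_len: int) -> list[list[bool]]:
    """Standard causal mask: position i sees every position up to i."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def count_allowed(mask: list[list[bool]]) -> int:
    """Number of (query, key) pairs the mask permits."""
    return sum(sum(row) for row in mask)


# Hypothetical toy sizes, chosen only to keep the output readable.
SEQ_LEN = 8
WINDOW = 3

windowed = sliding_window_mask(SEQ_LEN, WINDOW)
full = full_causal_mask(SEQ_LEN)

for row in windowed:
    print("".join("x" if allowed else "." for allowed in row))

print(count_allowed(windowed), "of", count_allowed(full))  # 21 of 36
```

The windowed mask grows linearly with sequence length while the full causal mask grows quadratically, which is why windowed layers are attractive for very long contexts.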

Performance benchmarks

Xiaomi’s internal testing shows MiMo-V2.5 competing well against established models across coding and agent tasks. On the company’s MiMo Coding Bench, it scored 62.3, matching some frontier models while using fewer computational resources.

The model also performed strongly on multimodal tasks:

CharXiv RQ (chart analysis): 81.0
MMMU-Pro (multimodal understanding): 77.9
Video-MME (video understanding): 87.7
OmniDocBench (document processing): 87.2

These scores put MiMo-V2.5 roughly on par with models like Claude Sonnet 4.6 and Gemini 3 Pro in various categories, though direct comparisons can be tricky given different testing conditions.

Open source availability and pricing

Unlike many competing models, MiMo-V2.5 is fully open source. Xiaomi has released the complete model weights, tokenizer, and documentation on Hugging Face. Two versions are available: a base model with 256K context and the full version with 1M context support.

For commercial API access, Xiaomi has simplified its pricing structure. MiMo-V2.5 now costs 1 credit per token, while the Pro version costs 2 credits per token. The company has also removed additional charges for using the extended 1M-token context window.
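As a rough illustration of the new pricing, the sketch below estimates the credit cost of a single request. It assumes input and output tokens are billed at the same per-token rate, which the article does not confirm. The tier labels and the `credits_for_request` helper are hypothetical and are not part of Xiaomi's API.

```python
# Credits per token as reported in the article.
CREDITS_PER_TOKEN = {
    "standard": 1,  # MiMo-V2.5
    "pro": 2,       # Pro version
}


def credits_for_request(tier: str, input_tokens: int, output_tokens: int) -> int:
    """Estimated credits for one request.

    ASSUMPTION: input and output tokens cost the same. No long-context
    surcharge is applied, in line with the article's statement that the
    extra charge for the 1M-token window was removed.
    """
    if tier not in CREDITS_PER_TOKEN:
        raise ValueError(f"unknown tier: {tier!r}")
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    return CREDITS_PER_TOKEN[tier] * (input_tokens + output_tokens)


# Example: a long-document request with a short answer.
standard_cost = credits_for_request("standard", 200_000, 2_000)
pro_cost = credits_for_request("pro", 200_000, 2_000)

print(standard_cost)  # 202000
print(pro_cost)       # 404000
```

Under these assumptions, the Pro version costs exactly twice as much as the standard model for the same request, regardless of how much of the context window is used.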

Industry implications

MiMo-V2.5’s release highlights the growing importance of multimodal AI systems that can process different types of input simultaneously. This capability matters for practical AI applications that need to understand real-world scenarios involving images, audio, and text together.

The open-source approach also continues a trend of major tech companies releasing capable AI models for public use, potentially accelerating development across the industry. However, the computational requirements for running a 310B-parameter model mean practical deployment will still be limited to organizations with significant resources.

Xiaomi says it’s already working on the next generation of models with improved reasoning and better tool integration, suggesting this competitive space will continue moving rapidly.
