
April 6, 2026

Mistral AI has pushed its speech capabilities forward with Voxtral Transcribe 2. The release brings two models that tackle everyday transcription needs while keeping costs and latency in check. One model focuses on batch work, the other on live conversations, and both deliver clear speaker labels plus solid results across multiple languages. 🎙️
The two models in Voxtral Transcribe 2
Voxtral Mini Transcribe V2 targets offline or large file processing. It supports audio up to three hours in a single request, adds word level timestamps, and lets users bias recognition toward specific names or technical terms with up to 100 context words. The model covers 13 languages including English, Chinese, Hindi, Spanish, Arabic, French, Portuguese, Russian, German, Japanese, Korean, Italian and Dutch. It also handles background noise well.
Voxtral Realtime shifts the focus to live scenarios. Its streaming design transcribes audio as it arrives, with configurable latency that can drop below 200ms. At a 4B parameter size, the model runs comfortably on laptops or phones. Mistral released the weights on Hugging Face under Apache 2.0, so developers can deploy it locally for privacy sensitive work.
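Streaming transcription works on short audio chunks rather than whole files, and the configurable latency effectively sets the chunk size: a smaller budget means smaller chunks delivered sooner. A toy illustration of that tradeoff (chunking logic only, no model involved):

```python
# Split a buffered audio signal into chunks sized to a latency budget.
# Chunking logic only -- this is a toy sketch, not Mistral's streaming API.

def chunk_audio(samples: list[float], sample_rate: int,
                latency_ms: int) -> list[list[float]]:
    """Return consecutive chunks, each covering at most latency_ms of audio."""
    chunk_len = max(1, int(sample_rate * latency_ms / 1000))
    return [samples[i:i + chunk_len]
            for i in range(0, len(samples), chunk_len)]
```

At 16 kHz, a 200 ms budget yields 3,200-sample chunks; halving the budget halves the chunk, trading context per chunk for responsiveness.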
Both models include precise diarization that tags who spoke when, even during overlaps, and they maintain strong accuracy in real world conditions.
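Diarized output is typically a list of speaker-labeled time segments, and overlapping speech shows up as segments whose time ranges intersect. A minimal sketch for spotting those overlaps (the `(speaker, start, end)` tuple format is an assumption for illustration, not Mistral's actual response schema):

```python
# Detect overlapping speech in diarized segments.
# The (speaker, start, end) tuple format is an assumed illustration,
# not Mistral's actual response schema.

def find_overlaps(segments: list[tuple[str, float, float]]) -> list[tuple[str, str]]:
    """Return speaker pairs whose segments overlap in time."""
    overlaps = []
    ordered = sorted(segments, key=lambda s: s[1])  # sort by start time
    for i, (spk_a, start_a, end_a) in enumerate(ordered):
        for spk_b, start_b, end_b in ordered[i + 1:]:
            if start_b >= end_a:   # later segments start even later: stop
                break
            overlaps.append((spk_a, spk_b))
    return overlaps
```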
Benchmarks and cost advantages
On the FLEURS benchmark, Voxtral Mini Transcribe V2 posts a word error rate of roughly 4 percent across the top languages. According to Mistral's tests, that beats GPT 4o mini Transcribe, Gemini 2.5 Flash, AssemblyAI Universal, and Deepgram Nova. The realtime version stays within 1 to 2 percent of that quality even at low delay settings.
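Word error rate, the metric behind those FLEURS numbers, is the word-level edit distance between hypothesis and reference divided by the reference length. A self-contained sketch of the standard computation:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference words,
    via a standard Levenshtein dynamic program over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[-1][-1] / len(ref)
```

A roughly 4 percent WER means about one error per 25 reference words.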
Pricing starts at $0.003 per minute for the batch model, which the company says is one fifth the cost of comparable services while running roughly 3x faster than ElevenLabs Scribe v2. The realtime API sits at $0.006 per minute. Coverage in VentureBeat and The Decoder highlights the strong price performance balance.
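Since pricing is per minute, cost scales linearly with audio length; a quick check of what an hour costs under each model (rates taken from the article):

```python
# Per-minute rates from the announcement.
BATCH_PER_MIN = 0.003     # Voxtral Mini Transcribe V2 (batch)
REALTIME_PER_MIN = 0.006  # Voxtral Realtime API

def transcription_cost(minutes: float, realtime: bool = False) -> float:
    """Dollar cost for a given number of audio minutes."""
    rate = REALTIME_PER_MIN if realtime else BATCH_PER_MIN
    return minutes * rate

# One hour of audio:
#   batch:    60 * 0.003 = $0.18
#   realtime: 60 * 0.006 = $0.36
```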
Developers can test everything in the new Mistral Studio audio playground, which accepts multiple file formats up to 1GB each and offers options for diarization and timestamps.
Use cases and deployment flexibility
Teams handling meetings get clean transcripts with speaker attribution for follow up notes or compliance checks. Voice agents benefit from the low latency, while contact centers can feed live text into sentiment tools or CRM systems. Media producers use it for fast subtitles across languages.
For on premise or private cloud setups, the open weights version supports GDPR and HIPAA style requirements because audio never leaves the device. Broader exploration of similar capabilities appears in guides to AI speech to text tools, dedicated real time transcription platforms, or options like speaker diarization solutions.
The release continues Mistral's emphasis on efficient models that deploy anywhere, fitting both cloud APIs and local hardware. It gives developers fresh choices when accuracy, speed, and privacy all matter at once.
How will lower cost live transcription change voice first applications in the coming months?




