Microsoft shivs OpenAI with three new AI models for speech and images

Summary

Microsoft has released public preview versions of three in-house machine‑learning models: MAI-Transcribe-1 (speech recognition), MAI-Voice-1 (speech synthesis) and MAI-Image-2 (text-to-image). Microsoft claims MAI-Transcribe-1 delivers enterprise-grade accuracy across 25 languages at about 50% lower GPU cost than leading alternatives; MAI-Voice-1 can produce 60 seconds of audio in under a second on a single GPU; MAI-Image-2 targets high-quality image generation. The models are available to developers via Microsoft Foundry (formerly Azure AI Studio) and already power Microsoft products such as Copilot, Bing, PowerPoint and Azure Speech.

Key Points

  • MAI-Transcribe-1: speech recognition across 25 languages with claimed ~50% lower GPU cost versus competitors.
  • MAI-Voice-1: rapid speech generation. Microsoft says it produces 60 seconds of audio in under a second on a single GPU.
  • MAI-Image-2: text-to-image model aimed at high-quality image generation, adding competitive pressure in that market.
  • All three models are available to developers through Microsoft Foundry (formerly Azure AI Studio).
  • Microsoft already uses the models in its own products: Copilot (Audio Expressions and Voice Mode transcription), Bing, PowerPoint and Azure Speech.
  • The move indicates Microsoft is hedging its OpenAI exposure and competing directly in multimodal AI.
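
To put the two headline numbers in concrete terms, here is a rough back-of-envelope sketch. It takes Microsoft's claims at face value: "60 seconds of audio in under a second" implies a real-time factor of at least 60x, and "about 50% lower GPU cost" halves whatever your baseline is. The function names and the 1,000-unit baseline spend are made up for illustration; they do not come from Microsoft or the source article.

```python
def real_time_factor(audio_seconds: float, compute_seconds: float) -> float:
    """Seconds of audio produced per second of compute."""
    if compute_seconds <= 0:
        raise ValueError("compute_seconds must be positive")
    return audio_seconds / compute_seconds


def cost_after_reduction(baseline_cost: float, reduction: float) -> float:
    """Cost remaining after a fractional reduction (0.5 means 50% lower)."""
    if not 0 <= reduction <= 1:
        raise ValueError("reduction must be between 0 and 1")
    return baseline_cost * (1 - reduction)


# Claimed upper bound on compute time: 1 second for 60 seconds of audio,
# so the real figure is at least this factor.
min_rtf = real_time_factor(audio_seconds=60.0, compute_seconds=1.0)

# Hypothetical baseline spend, purely illustrative (not from the article).
HYPOTHETICAL_BASELINE_COST = 1000.0
reduced_cost = cost_after_reduction(HYPOTHETICAL_BASELINE_COST, reduction=0.5)

print(f"Real-time factor: at least {min_rtf:.0f}x")
print(f"Cost at ~50% lower: {reduced_cost:.0f} (baseline {HYPOTHETICAL_BASELINE_COST:.0f})")
```

These are vendor claims, not independent benchmarks, so treat the output as a starting point for your own measurements rather than a forecast.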

Context and relevance

This is a strategic pivot: Microsoft is building and shipping core multimodal models itself rather than acting only as an investor in OpenAI and an integrator of its models. For organisations working on voice interfaces, captioning, media subtitling, customer support agents, or enterprise AI, these models could affect cost, performance and vendor choice. The release also intensifies competition among cloud AI providers and may influence pricing, licensing and where developers host production models.

Why should I read this?

Short and blunt: Microsoft just told the market it can run its own speech and image models, which changes the vendor map and could save (or cost) you money depending on where you build. If you choose speech or image tooling for products or services, this is worth knowing. We skimmed the detail so you don’t have to.

Source

Source: https://go.theregister.com/feed/www.theregister.com/2026/04/02/microsoft_models_homegrown_ai_models/