Audio engineers, podcasters and voice developers

From open-source speech recognition to real-time voice agents.

You are an audio engineer, podcaster, or voice application developer and want to map the trajectory of AI in spoken audio. This path starts with the pre-Whisper self-supervised models (wav2vec 2.0, HuBERT), covers the Whisper breakthrough as a free, general-purpose transcriber, then moves through real-time conversational voice (Moshi, the OpenAI Realtime API) to the latest multilingual transcription models such as Voxtral and next-generation voice synthesizers such as Sesame.

01

Why it matters to you

The first reference self-supervised (SSL) speech model: it shows that accurate speech recognition is achievable with very little labeled data.

June 20, 2020 | High | Voice & Audio

wav2vec 2.0: Facebook AI's "BERT for speech"

Facebook AI publishes wav2vec 2.0, a self-supervised model that learns representations from raw audio. It sets a new state of the art on LibriSpeech and still reaches usable accuracy with as little as 10 minutes of labeled data.
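
To make "learns representations from raw audio" concrete, here is a minimal sketch of a contrastive objective of the kind wav2vec 2.0 trains with: for a masked time step, the model's context vector should be closer to the true target than to distractors taken from other time steps. The function names and the toy two-dimensional vectors are illustrative assumptions, not the authors' code; a real model works on learned, high-dimensional representations.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_loss(context, target, distractors, temperature=0.1):
    """Negative log-probability of picking the true target among all candidates.

    context:     model output at a masked time step
    target:      the true representation for that step
    distractors: representations sampled from other steps
    temperature: illustrative default that sharpens the softmax
    """
    candidates = [target] + list(distractors)
    logits = [cosine(context, q) / temperature for q in candidates]
    # Log-sum-exp with max subtraction for numerical stability.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_denominator - logits[0]


if __name__ == "__main__":
    distractors = [[0.0, 1.0], [-1.0, 0.0]]
    # Context aligned with the target gives a loss close to zero.
    print(contrastive_loss([1.0, 0.0], [1.0, 0.0], distractors))
    # Context aligned with a distractor gives a large loss.
    print(contrastive_loss([0.0, 1.0], [1.0, 0.0], distractors))
```

No transcript appears anywhere in this objective, which is why labeled data is only needed for the final fine-tuning step.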

02

Why it matters to you

HuBERT improves robustness in noisy environments: it becomes the backbone for many TTS and voice cloning systems that follow.

September 9, 2021 | Medium | Voice & Audio

HuBERT: Meta brings self-supervised to speech, foreshadows Whisper

Meta AI publishes HuBERT, a self-supervised audio model based on masked prediction of discrete cluster targets. It is a conceptual precursor to w2v-BERT and later audio-multimodal models, and part of the research line that precedes Whisper.
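
The phrase "masked prediction of discrete clusters" can be sketched in three steps: assign each audio frame to its nearest cluster, hide spans of frames, and score the model only on how well it predicts the cluster IDs of the hidden frames. The helpers below are a simplified illustration under stated assumptions (one-dimensional toy features, centroids given in advance, probabilities supplied by hand); none of the names come from the HuBERT code base.

```python
import math
import random


def nearest_centroid(frame, centroids):
    """Return the index of the centroid closest to one feature frame."""
    def squared_distance(k):
        return sum((f - c) ** 2 for f, c in zip(frame, centroids[k]))
    return min(range(len(centroids)), key=squared_distance)


def pick_masked_positions(num_frames, span_len, num_spans, rng):
    """Choose frame indices to hide, as (possibly overlapping) contiguous spans."""
    masked = set()
    for _ in range(num_spans):
        start = rng.randrange(num_frames - span_len + 1)
        masked.update(range(start, start + span_len))
    return sorted(masked)


def masked_prediction_loss(predicted_probs, cluster_ids, masked_positions):
    """Average cross-entropy of the true cluster ID, over masked frames only."""
    losses = [-math.log(predicted_probs[t][cluster_ids[t]]) for t in masked_positions]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    frames = [[0.0], [0.1], [1.0], [0.9]]   # toy features, one value per frame
    centroids = [[0.0], [1.0]]              # assumed to come from offline clustering
    cluster_ids = [nearest_centroid(f, centroids) for f in frames]
    masked = pick_masked_positions(len(frames), span_len=2, num_spans=1,
                                   rng=random.Random(0))
    guess = [[0.5, 0.5]] * len(frames)      # an untrained model guessing uniformly
    print(cluster_ids, masked, masked_prediction_loss(guess, cluster_ids, masked))
```

Because the targets are discrete IDs rather than raw audio, the training signal resembles BERT-style token prediction, which is what makes these representations convenient for downstream speech systems.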

03

Why it matters to you

Open-source multilingual transcription at professional quality: it drives the marginal cost of subtitling, podcast transcription and accessibility toward zero.

September 21, 2022 | High | Voice & Audio

Whisper open source: audio transcription becomes a commodity

OpenAI releases Whisper under the MIT license: a speech-to-text model trained on 680,000 hours of multilingual audio that delivers near commercial-grade quality and runs locally.
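
For the subtitling use case, a small sketch of turning Whisper output into an SRT file may help. It assumes the `openai-whisper` Python package, whose `transcribe` result contains a `segments` list with `start`, `end` (in seconds) and `text` fields; the helper names and the file name `episode.mp3` are hypothetical. The conversion itself uses only the standard library.

```python
def format_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments):
    """Convert a list of {'start', 'end', 'text'} dicts into SRT text."""
    blocks = []
    for index, segment in enumerate(segments, start=1):
        start = format_timestamp(segment["start"])
        end = format_timestamp(segment["end"])
        text = segment["text"].strip()
        blocks.append(f"{index}\n{start} --> {end}\n{text}\n")
    # SRT separates cues with a blank line.
    return "\n".join(blocks)


# Usage with the real model (requires `pip install openai-whisper` and ffmpeg):
#
#   import whisper
#   model = whisper.load_model("base")
#   result = model.transcribe("episode.mp3")   # hypothetical file name
#   with open("episode.srt", "w", encoding="utf-8") as handle:
#       handle.write(segments_to_srt(result["segments"]))
```

Keeping the formatting separate from the model call means the same helper works with any transcriber that returns timed segments.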

04

Why it matters to you

The first full-duplex conversational model: it speaks and listens simultaneously, paving the way for voice agents that interrupt and react naturally.

July 3, 2024 | High | Voice & Audio

Moshi: Kyutai's first open-source full-duplex voice assistant

French non-profit lab Kyutai unveils Moshi, a full-duplex voice assistant with ~200 ms latency, built on a single multimodal model that handles input and output audio simultaneously.
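
What "full-duplex" changes in practice can be shown with a toy loop: on every audio frame the system both reads one incoming frame and writes one outgoing frame, instead of waiting for the user's turn to end. This is only a sketch of the control flow, not Moshi's architecture; `run_full_duplex`, `toy_respond`, the string frames and the 80 ms default frame length are all illustrative assumptions.

```python
def run_full_duplex(mic_frames, respond, frame_ms=80):
    """Step a responder once per frame, listening and speaking on every tick.

    mic_frames: iterable of incoming user frames (here just labels)
    respond:    hypothetical callable (history, heard) -> outgoing frame
    frame_ms:   assumed frame duration, used only to timestamp the timeline
    """
    history = []    # everything heard and said so far, frame by frame
    timeline = []
    for tick, heard in enumerate(mic_frames):
        said = respond(history, heard)
        history.append((heard, said))
        timeline.append((tick * frame_ms, heard, said))
    return timeline


def toy_respond(history, heard):
    """Toy policy: talk while the user is silent, yield as soon as they speak."""
    return "silence" if heard == "speech" else "speech"


if __name__ == "__main__":
    mic = ["speech", "silence", "silence", "speech"]
    for time_ms, heard, said in run_full_duplex(mic, toy_respond):
        print(f"{time_ms:4d} ms  user={heard:8s} agent={said}")
```

Because the responder sees the incoming frame at every tick, an interruption is noticed within one frame rather than at the end of a turn, which is the property that makes barge-in feel natural.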

05

Why it matters to you

The OpenAI Realtime API reaches general availability: developers can integrate low-latency bidirectional voice into any application.

April 9, 2025 | High | Voice & Audio

OpenAI Realtime API GA: production-ready voice-to-voice over WebRTC

OpenAI promotes the Realtime API to GA: low-latency voice-in/voice-out (~300 ms), tool (function) calling, and native WebRTC. It opens the era of production voice apps with a single end-to-end API.

06

Why it matters to you

Voxtral brings next-gen multilingual transcription to an open-weight model: it benchmarks above Whisper on European languages and code-switched audio.

February 4, 2026 | Medium | Voice & Audio

Mistral Voxtral Transcribe 2: open-source speech-to-text that runs on a laptop

Mistral releases Voxtral Transcribe 2: two open-source STT models (Batch and Realtime, 4B params) with latency configurable down to 200 ms, released under Apache 2.0 and covering 13 languages.

07

Why it matters to you

Sesame Maya introduces paralinguistic presence (hesitations, rhythm, emotion) into speech synthesis: the line between human and AI voice definitively blurs.

July 21, 2025 | High | Voice & Audio

Sesame Maya & Miles: AI voices that 'think aloud' cross the uncanny valley

Sesame (founded by former Oculus/Meta engineers) ships Maya and Miles, conversational voices with prosody, hesitations, and breaths so natural they trigger the 'feels like a real person' effect. The base CSM-1B model is open under Apache 2.0.
