Reading path
Audio engineers, podcasters and voice developers
From open-source speech recognition to real-time voice agents.
You are an audio engineer, podcaster, or voice application developer and want to map the trajectory of AI in spoken audio. This path starts with the self-supervised models that preceded Whisper (wav2vec 2.0, HuBERT), covers the Whisper breakthrough as a free, general-purpose transcriber, then moves through real-time conversational voice (Moshi, OpenAI Realtime) to the latest multilingual transcription models (Voxtral) and next-generation conversational voice synthesis (Sesame).
- 01
Why it matters to you
The reference self-supervised (SSL) speech model: it shows that accurate speech recognition is achievable with very little labeled data.
High | Voice & Audio
wav2vec 2.0: Facebook AI's "BERT for speech"
Facebook AI publishes wav2vec 2.0, a self-supervised model that learns representations from raw audio. It sets a new state of the art on LibriSpeech when fine-tuned on the full labeled set, and still reaches 4.8/8.2 WER (test-clean/test-other) with as little as 10 minutes of labeled data.
- 02
Why it matters to you
HuBERT improves robustness in noisy environments: it becomes the backbone for many TTS and voice cloning systems that follow.
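Medium | Voice & Audio
HuBERT: Meta brings self-supervised learning to speech, foreshadows Whisper
Meta AI publishes HuBERT, a self-supervised audio model based on masked prediction of discrete cluster labels. It is a conceptual base for w2v-BERT and later audio-multimodal models, and a precursor to large-scale speech models such as Whisper (which is itself trained with weak supervision rather than masked prediction).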
- 03
Why it matters to you
Open-source multilingual transcription at professional quality: it drives the cost of subtitling, podcast transcription and accessibility toward zero.
High | Voice & Audio
Whisper open source: audio transcription becomes a commodity
OpenAI releases Whisper under the MIT license: a speech-to-text model trained on 680,000 hours of multilingual audio, with near commercial-grade quality, that runs locally.
- 04
Why it matters to you
The first full-duplex conversational model: speaks and listens simultaneously, paving the way for voice agents that interrupt and react naturally.
High | Voice & Audio
Moshi: Kyutai's first open-source full-duplex voice assistant
French non-profit lab Kyutai unveils Moshi, a full-duplex voice assistant with ~200 ms latency, based on a single multimodal model that handles input and output audio simultaneously.
- 05
Why it matters to you
OpenAI Realtime API in general availability: developers can integrate low-latency bidirectional voice into any application.
High | Voice & Audio
OpenAI Realtime API GA: production-ready voice-to-voice over WebRTC
OpenAI promotes the Realtime API to GA: low-latency voice-in/voice-out (~300 ms), function (tool) calling, and native WebRTC. It opens the production voice-app era with a single end-to-end API.
- 06
Why it matters to you
Voxtral brings next-gen multilingual transcription to an open-weight model: it benchmarks above Whisper on European languages and code-switched audio.
Medium | Voice & Audio
Mistral Voxtral Transcribe 2: open-source speech-to-text that runs on a laptop
Mistral releases Voxtral Transcribe 2: two open-source STT models (Batch + Realtime, 4B params) with latency configurable down to 200 ms, Apache 2.0, 13 languages.
- 07
Why it matters to you
Sesame Maya introduces paralinguistic presence (hesitations, rhythm, emotion) into speech synthesis: the line between human and AI voice definitively blurs.
High | Voice & Audio
Sesame Maya & Miles: AI voices that 'think aloud' cross the uncanny valley
Sesame (founded by former Oculus/Meta engineers) ships Maya and Miles, conversational voices with prosody, hesitations, and breaths so natural they trigger the 'feels like a real person' effect. The base CSM-1B model is open under Apache 2.0.
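
For readers who want to try step 03 on their own recordings, the sketch below shows one way to turn a local Whisper transcription into SRT subtitles. It is a minimal illustration, not an official recipe. It assumes the `openai-whisper` Python package and `ffmpeg` are installed. The helper names (`srt_timestamp`, `segments_to_srt`, `transcribe_to_srt`) are hypothetical and chosen for this example. The only Whisper calls used are `whisper.load_model` and `model.transcribe`, whose result contains a `segments` list with `start`, `end` and `text` fields.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Convert Whisper-style segments (dicts with start, end, text) to SRT text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = srt_timestamp(seg["start"])
        end = srt_timestamp(seg["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    # A blank line separates consecutive subtitle blocks.
    return "\n".join(blocks)


def transcribe_to_srt(audio_path: str, model_name: str = "base") -> str:
    """Transcribe a local audio file with Whisper and return SRT subtitles.

    Assumption: the openai-whisper package is installed (imported lazily here
    so the formatting helpers above work without it).
    """
    import whisper

    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path)
    return segments_to_srt(result["segments"])
```

Calling `transcribe_to_srt("episode.mp3")` (a placeholder file name) would return the subtitle text, which can then be written to a `.srt` file. Larger model names trade speed for accuracy.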