AImpact

Reading path

Audio engineers, podcasters and voice developers

From open-source speech recognition to real-time voice agents.

You are an audio engineer, podcaster, or voice application developer and want to map the trajectory of AI in spoken audio. This path starts with the pre-Whisper self-supervised models (wav2vec 2.0, HuBERT), covers the Whisper breakthrough as a free, general-purpose transcriber, then climbs through real-time conversational voice (Moshi, OpenAI Realtime) up to the latest multilingual transcription models such as Voxtral and next-generation voice synthesizers such as Sesame.

  1. 01

    Why it matters to you

    The first reference self-supervised learning (SSL) model for speech: it shows that accurate speech recognition is achievable with very little labeled data.

    High Voice & Audio

    wav2vec 2.0: Facebook AI's "BERT for speech"

    Facebook AI publishes wav2vec 2.0, a self-supervised model that learns representations from raw audio. It sets a new state of the art on LibriSpeech and still reaches usable accuracy with as little as 10 minutes of labeled data.

  2. 02

    Why it matters to you

    HuBERT improves robustness in noisy environments: it becomes the backbone for many TTS and voice cloning systems that follow.

    Medium Voice & Audio

    HuBERT: Meta brings self-supervised to speech, foreshadows Whisper

    Meta AI publishes HuBERT, a self-supervised audio model trained by masked prediction of discrete cluster labels, an approach later built on by w2v-BERT and audio-multimodal models.

  3. 03

    Why it matters to you

    Open-source multilingual transcription at professional quality: it drives the cost of subtitling, podcast transcription and accessibility toward zero.

    High Voice & Audio

    Whisper open source: audio transcription becomes a commodity

    OpenAI releases Whisper under the MIT license: a speech-to-text model trained on 680,000 hours of multilingual audio that delivers near commercial-grade quality and runs locally.

  4. 04

    Why it matters to you

    The first full-duplex conversational model: it speaks and listens simultaneously, paving the way for voice agents that interrupt and react naturally.

    High Voice & Audio

    Moshi: Kyutai's first open-source full-duplex voice assistant

    French non-profit lab Kyutai unveils Moshi, a full-duplex voice assistant with ~200ms latency based on a single multimodal model handling simultaneous input and output audio.

  5. 05

    Why it matters to you

    OpenAI Realtime API in general availability: developers can integrate low-latency bidirectional voice into any application.

    High Voice & Audio

    OpenAI Realtime API GA: production-ready voice-to-voice over WebRTC

    OpenAI promotes the Realtime API to GA: low-latency voice-in/voice-out (~300ms), tool (function) calling, and native WebRTC support. It opens the production voice-app era with a single end-to-end API.

  6. 06

    Why it matters to you

    Voxtral brings next-gen multilingual transcription to an open-weight model: it benchmarks above Whisper on European languages and code-switched audio.

    Medium Voice & Audio

    Mistral Voxtral Transcribe 2: open-source speech-to-text that runs on a laptop

    Mistral releases Voxtral Transcribe 2: two open-source STT models (Batch + Realtime, 4B params) with latency configurable down to 200ms, Apache 2.0, 13 languages.

  7. 07

    Why it matters to you

    Sesame Maya introduces paralinguistic presence (hesitations, rhythm, emotion) into speech synthesis: the line between human and AI voice definitively blurs.

    High Voice & Audio

    Sesame Maya & Miles: AI voices that 'think aloud' cross the uncanny valley

    Sesame (founded by former Oculus/Meta engineers) ships Maya and Miles, conversational voices with prosody, hesitations, and breaths so natural they trigger the 'feels like a real person' effect. The base CSM-1B model is open under Apache 2.0.
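
Entry 02 describes HuBERT as "masked prediction of discrete clusters". For readers who want to see what that phrase means mechanically, here is a toy sketch in plain Python. It is not HuBERT's actual pipeline: the function names, the fixed centroids, and the one-dimensional "features" are all assumptions made for illustration. In the real system the cluster labels come from k-means over acoustic features and a Transformer predicts them; this sketch only shows how the training targets for masked frames could be assembled.

```python
import random


def assign_clusters(frames, centroids):
    """Map each feature frame to the index of its nearest centroid.

    The returned integers play the role of the discrete cluster labels.
    (Hypothetical helper: centroids are given here, not learned.)
    """
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(centroids)), key=lambda k: sqdist(frame, centroids[k]))
        for frame in frames
    ]


def mask_spans(num_frames, span_len, num_spans, rng):
    """Pick random span starts and return the sorted masked frame indices."""
    masked = set()
    for _ in range(num_spans):
        start = rng.randrange(0, num_frames - span_len + 1)
        masked.update(range(start, start + span_len))
    return sorted(masked)


def masked_targets(frames, centroids, span_len=2, num_spans=2, seed=0):
    """Return {frame_index: cluster_label} for the masked frames only.

    A model would receive the frames with these positions hidden and be
    trained to predict the cluster label at each hidden position.
    """
    rng = random.Random(seed)
    labels = assign_clusters(frames, centroids)
    masked = mask_spans(len(frames), span_len, num_spans, rng)
    return {i: labels[i] for i in masked}


if __name__ == "__main__":
    # Six toy "frames" with one feature each, and two assumed centroids.
    frames = [[0.0], [0.1], [1.0], [0.9], [0.05], [1.1]]
    centroids = [[0.0], [1.0]]
    print(assign_clusters(frames, centroids))  # → [0, 0, 1, 1, 0, 1]
    print(masked_targets(frames, centroids))   # masked positions depend on the seed
```

The point of the design is that the prediction targets are discrete labels derived from the audio itself, so no human transcription is needed at this stage.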