Realtime voice AI: sub-second latency and multilingual support become the norm
Realtime voice APIs from OpenAI, Google and ElevenLabs converge on sub-500ms latency, fluent multilingual speech, and natural prosody. The phone becomes a practical channel for AI agents.
Mistral releases Voxtral Transcribe 2: two open-source STT models (Batch + Realtime, 4B params) with latency configurable down to 200ms, Apache 2.0, 13 languages.
Sesame (founded by former Oculus/Meta engineers) ships Maya and Miles, conversational voices whose prosody, hesitations, and breaths are natural enough to trigger the 'feels like a real person' effect. The base CSM-1B model is open under Apache 2.0.
OpenAI promotes the Realtime API to GA: low-latency voice-in/voice-out (~300ms), function calling, and native WebRTC. A single end-to-end API opens the production voice-app era.
F5-TTS uses flow matching with a simplified Diffusion Transformer (DiT) architecture for zero-shot real-time voice cloning without fine-tuning, Apache 2.0, competitive latency on consumer GPU.
Cartesia launches Sonic, a TTS with ultra-low 50ms latency, token-by-token streaming, voice cloning without fine-tuning, designed specifically for AI voice agents in production environments.
Dia by Nari Labs is the first open-source TTS to generate natural multi-speaker dialogues with non-verbal cues like laughter, breathing pauses and emotional emphasis, matching ElevenLabs dialogue quality, under Apache 2.0.
ElevenLabs launches Voice Design: describe a voice in natural language and get a unique synthesized voice in seconds, no source audio or cloning needed.
Kokoro TTS achieves quality comparable to systems 10x its size with only 82M parameters, sub-1-second inference on CPU, Apache 2.0, ideal for edge devices.
Suno releases v4: AI music generation with up to 4-minute tracks, improved quality over v3, more natural vocals, and support for stem separation (splitting vocals and instruments).
Fish Speech 1.4 clones voices from 10s of audio, supports 8 languages, runs real-time on CPU, and offers a serious free alternative to ElevenLabs for developers.
Whisper Large v3 Turbo cuts Large v3's decoder from 32 layers to 4 (about 1.55B to 809M total parameters), achieving roughly 8x higher speed with less than 1% WER increase, making high-quality ASR accessible on consumer hardware.
Parler TTS generates voices described in natural language — 'slow, low male voice with echo' — trained on 45k hours, Apache 2.0, first fully controllable open source TTS.
Hume AI launches EVI 2, the first AI voice interface that adapts tone and rhythm based on the detected emotional state of the interlocutor, with API available for developers.
CosyVoice brings production-quality multilingual zero-shot voice cloning to Chinese open source: 3 seconds of reference audio to clone a voice in Chinese, English, Japanese, Korean and Cantonese, with LLM + flow matching architecture.
ChatGPT gets an end-to-end voice mode without separate STT+TTS: 320ms latency, natural emotions, interruptible. First truly natural AI conversation.
Suno updates to v3 with better lyrics-melody coherence, extension up to 4 minutes, and audio upload to continue existing tracks — consolidating its position in the AI music market.
French non-profit lab Kyutai unveils Moshi, a full-duplex voice assistant with ~200ms latency based on a single multimodal model handling simultaneous input and output audio.
Udio launches its music generation platform with convincing AI vocals from text prompts, professional production quality, and immediate viral growth on Twitter.
Stable Audio Open is an open-weight model for generating music and sound effects from text prompts, trained on Creative Commons-licensed audio (CC0, CC BY, CC Sampling+) and released under the Stability AI Community License, based on latent diffusion with timing conditioning.
Stability AI launches Stable Audio 2.0 with stereo audio generation up to 3 minutes, structured tracks with intro, development and outro, and 44.1kHz quality, surpassing previous version limits.
MeloTTS is the first production-quality multilingual TTS to run in real-time on CPU, weighing just 50MB and supporting English, Chinese, Japanese, Korean, Spanish and French.
StyleTTS2 uses style diffusion and adversarial training to generate human-level natural voices on LJSpeech, open source, surpassing Voicebox on intelligibility.
OpenAI launches its TTS API with 6 voices, pricing at $0.015 per 1000 characters, low latency streaming, and direct integration into the ChatGPT and Assistants ecosystem.
Google makes MusicLM publicly available via Google Labs: music generation from a text description in a specified style, the first consumer music AI integration from a big tech company.
Whisper Large v3 reduces error rates on low-resource languages, improves timestamp accuracy and adds new language support, remaining the most widely deployed open-source ASR model.
AudioPaLM fuses PaLM-2 with an audio tokenizer to create an LLM that natively processes audio and text tokens, enabling speech translation while preserving speaker identity.
Meta releases AudioCraft, an open source suite including MusicGen for generating structured music and AudioGen for ambient sounds, both controllable via text description.
SeamlessM4T is the first multimodal system to handle speech-to-text, text-to-speech, and speech-to-speech across 100+ languages in a single model, powering Meta's real-time translation features.
Voicebox uses flow matching with masked training to synthesize, edit, and transfer vocal styles across 6 languages, with no explicit cloning or fine-tuning.
Suno AI releases Bark on HuggingFace: an open source TTS model capable of generating paralinguistics — laughter, sighs, sound effects, music — directly from text prompts.
SoundStorm applies MaskGIT-style parallel decoding to SoundStream codec tokens, generating audio in parallel rather than token-by-token: 30s of dialogue in 0.5s on a TPU-v4, preserving speaker consistency.
XTTS brings multilingual zero-shot voice cloning to open source: just a 6-second audio sample to replicate a voice across 17 different languages, released under the Coqui Public Model License.
ElevenLabs exits public beta with 1-minute voice cloning, 29 languages, and prosodically natural TTS, establishing itself as the reference for creators and audiobooks.
VALL-E clones any voice with just 3 seconds of reference audio, no fine-tuning needed, using in-context learning on EnCodec tokens. First zero-shot TTS at naturalistic quality.
EnCodec compresses 24kHz mono audio to just 1.5–12 kbps at quality surpassing Opus, becoming a standard neural codec and tokenizer for modern neural TTS.
OpenAI releases Whisper under MIT license: a speech-to-text model trained on 680,000 hours of multilingual audio, near commercial-grade quality, runs locally.
AudioLM generates long-range coherent audio using two tiers of tokens — semantic and acoustic — with no text or score conditioning.
SoundStream brings residual vector quantization to neural audio coding, compressing audio at 3kbps with quality surpassing Opus at 12kbps and laying the architectural foundation for the neural codecs used in audio LLMs.
James Betker releases Tortoise TTS, an open source model with few-second voice cloning and human-like vocal quality — the first real breakthrough in accessible TTS.
NaturalSpeech is the first TTS system to achieve a MOS statistically indistinguishable from recorded human speech on the LJSpeech benchmark, marking a historic milestone for speech synthesis.
Coqui TTS is an open source Python library for quality text-to-speech, forked from Mozilla TTS, supporting over 1100 languages and adopted by the HuggingFace community.
Meta AI publishes HuBERT, a self-supervised audio model based on masked prediction of discrete clusters — conceptual base for Whisper, w2v-BERT and audio-multimodal models.
VITS unifies the acoustic model and vocoder into a single end-to-end model, achieving quality surpassing Tacotron 2 with faster inference.
Facebook AI publishes wav2vec 2.0, a self-supervised model that learns representations from raw audio, reaches SOTA on LibriSpeech, and stays competitive with as little as 10 minutes of labeled data.
OpenAI releases Jukebox, a generative model that produces songs with vocals as raw audio, conditioned on artist, genre, and lyrics, built on a hierarchical VQ-VAE and autoregressive transformers.
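Several entries in this list (SoundStream, EnCodec, and the token-based models built on them) rely on residual vector quantization. The toy sketch below illustrates the general idea only and is not code from any of those projects: the random codebooks, the sizes `DIM`, `ENTRIES`, `STAGES`, and all function names are assumptions made for illustration, whereas real codecs learn their codebooks jointly with an encoder and decoder. Each stage quantizes whatever residual the previous stage left behind, so a frame is represented by one small index per stage.

```python
import random

def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared L2 distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )

def rvq_encode(vec, codebooks):
    """Quantize vec stage by stage; each stage encodes the remaining residual."""
    residual = list(vec)
    indices = []
    for cb in codebooks:
        idx = nearest(cb, residual)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, cb[idx])]
    return indices

def rvq_decode(indices, codebooks):
    """Reconstruct by summing the selected entry of every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, cb in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, cb[idx])]
    return out

def sq_error(a, b):
    """Squared L2 distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Hypothetical toy setup: random codebooks whose scale halves at each stage.
# Entry 0 of every codebook is the zero vector, so a stage can never make
# the reconstruction worse than the previous stage left it.
rng = random.Random(0)
DIM, ENTRIES, STAGES = 4, 16, 3
codebooks = [
    [[0.0] * DIM]
    + [[rng.uniform(-1, 1) / (2 ** s) for _ in range(DIM)] for _ in range(ENTRIES - 1)]
    for s in range(STAGES)
]

frame = [rng.uniform(-1, 1) for _ in range(DIM)]
codes = rvq_encode(frame, codebooks)   # one index per stage
recon = rvq_decode(codes, codebooks)   # approximate reconstruction of frame
```

With 16 entries per codebook, each stage costs 4 bits, so this toy frame is stored in 12 bits. Decoding only the first few codes, for example `rvq_decode(codes[:1], codebooks[:1])`, gives a coarser reconstruction at a lower bitrate.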