Voice & Audio

51 entries

May 18, 2026 Medium

Realtime voice AI: sub-second latency and multilingual support become the norm

Realtime voice APIs from OpenAI, Google and ElevenLabs converge on sub-500 ms latency, fluent multilingual speech, and natural prosody. The phone becomes a practical channel for agents.

Voice & Audio: Voice, Realtime, Speech

February 15, 2026 High

ElevenLabs launches Studio Enterprise: voice cloning with consent verification and 200+ languages

ElevenLabs launches Studio Enterprise: voice cloning from 30 seconds of audio with consent verification, a dubbing API with lip-sync, a real-time voice agent SDK, and GDPR-compliant EU hosting. Supports 200+ languages.

Voice & Audio

February 4, 2026 Medium

Mistral Voxtral Transcribe 2: open-source speech-to-text that runs on a laptop

Mistral releases Voxtral Transcribe 2: two open-source STT models (Batch + Realtime, 4B params) with latency configurable down to 200ms, Apache 2.0, 13 languages.

Voice & Audio: Mistral, Voxtral, ASR

January 25, 2026 Medium

Whisper v3 Turbo: real-time local transcription on consumer GPU

Whisper v3 Turbo reaches widespread adoption: 8x faster than v3-large at near-identical accuracy, it runs in real time on consumer GPUs. Integrated in Ollama and LM Studio, it enables local transcription pipelines for businesses.

Voice & Audio

October 1, 2025 High

OpenAI Realtime API Goes Generally Available

WebSocket API enabling production-grade voice agents with 300ms latency, interruption handling, and function calling in a single text+audio session.

Voice & Audio

July 22, 2025 High

Udio v3 & Suno v4: Professional-Grade AI Music Generation

Udio v3 and Suno v4 are released in the same week, with vocal quality indistinguishable from human singers on produced tracks and full song structure from a single prompt. The music industry's legal battle intensifies.

Voice & Audio

July 21, 2025 High

Sesame Maya & Miles: AI voices that 'think aloud' cross the uncanny valley

Sesame (founded by former Oculus/Meta engineers) ships Maya and Miles, conversational voices with prosody, hesitations, and breaths so natural they trigger the 'feels like a real person' effect. The base CSM-1B model is open under Apache 2.0.

Voice & Audio: Sesame, Conversational Voice, CSM

June 17, 2025 High

OpenAI Advanced Voice Mode 2.0: Emotional Range & Memory

OpenAI upgrades Advanced Voice Mode with custom voice personas, empathy/humor/frustration detection, memory across voice conversations, and background noise cancellation.

Voice & Audio

April 9, 2025 High

OpenAI Realtime API GA: production-ready voice-to-voice over WebRTC

OpenAI promotes the Realtime API to GA: low-latency voice-in/voice-out (~300 ms), function calling, and native WebRTC. Opens the production voice-app era with a single end-to-end API.

Voice & Audio: OpenAI, Realtime API, Voice

March 5, 2025 Medium

F5-TTS: real-time voice cloning without fine-tuning using flow matching and a DiT architecture

F5-TTS combines flow matching with a simplified Diffusion Transformer (DiT) architecture for zero-shot real-time voice cloning without fine-tuning, with MIT-licensed code and competitive latency on a consumer GPU.

Voice & Audio: F5-TTS, Flow Matching, Voice Cloning

February 12, 2025 High

Cartesia Sonic: 50ms TTS for voice agents in production

Cartesia launches Sonic, a TTS model with ultra-low 50 ms latency, token-by-token streaming, and voice cloning without fine-tuning, designed specifically for AI voice agents in production environments.

Voice & Audio: Cartesia, Sonic, TTS

February 10, 2025 High

Dia 1.6B: open-source dialogic TTS with laughter, breathing and human naturalness

Dia by Nari Labs is the first open-source TTS to generate natural dialogue with non-verbal cues like laughter, breathing pauses and emotional emphasis, matching ElevenLabs quality on multi-speaker conversations under Apache 2.0.

Voice & Audio: Dia TTS, dialogue, laughter

January 28, 2025 Medium

ElevenLabs Voice Design: generate a unique voice from text description in seconds

ElevenLabs launches Voice Design: describe a voice in natural language and get a unique synthesized voice in seconds, no source audio or cloning needed.

Voice & Audio: ElevenLabs, Voice Design, Text-to-Voice

January 15, 2025 Medium

Kokoro TTS v0.19: professional TTS quality with just 82 million parameters

Kokoro TTS achieves quality comparable to systems 10x its size with only 82M parameters, sub-1-second inference on CPU, Apache 2.0, ideal for edge devices.

Voice & Audio: Kokoro TTS, Edge TTS, Open Source

November 22, 2024 Medium

Suno v4: AI music generation reaches studio quality for the general public

Suno releases v4: AI music generation with up to 4-minute tracks, improved quality over v3, more natural vocals, and support for stem separation (splitting vocals and instruments).

Voice & Audio: Suno, Music Generation, Audio

November 20, 2024 Medium

Fish Speech 1.4: open source TTS with voice cloning from 10 seconds and 8 languages

Fish Speech 1.4 clones voices from 10s of audio, supports 8 languages, runs real-time on CPU, and offers a serious free alternative to ElevenLabs for developers.

Voice & Audio: Fish Speech, TTS, Voice Cloning

November 15, 2024 Medium

Whisper Large v3 Turbo: 8x faster ASR with less than 1% quality degradation

Whisper Large v3 Turbo cuts Large v3's decoder from 32 layers to 4, achieving 8x higher speed with less than 1% WER increase and making high-quality ASR accessible on consumer hardware.

Voice & Audio: Whisper Turbo, ASR, speed

November 2, 2024 Medium

Parler TTS: HuggingFace releases the first text-controllable open source TTS

Parler TTS generates voices described in natural language — 'slow, low male voice with echo' — trained on 45k hours, Apache 2.0, first fully controllable open source TTS.

Voice & Audio: Parler TTS, HuggingFace, Controllable TTS

September 5, 2024 High

Hume AI EVI 2: the first voice AI with adaptive emotional intelligence

Hume AI launches EVI 2, the first AI voice interface that adapts tone and rhythm based on the detected emotional state of the interlocutor, with API available for developers.

Voice & Audio: Hume AI, EVI, Emotional Intelligence

August 22, 2024 Medium

CosyVoice: Alibaba DAMO's multilingual zero-shot voice cloning

CosyVoice brings production-quality multilingual zero-shot voice cloning to Chinese open source: 3 seconds of reference audio to clone a voice in Chinese, English, Japanese, Korean and Cantonese, with LLM + flow matching architecture.

Voice & Audio: CosyVoice, Alibaba, voice cloning

July 28, 2024 High

OpenAI Advanced Voice Mode: ChatGPT speaks in real time with natural emotions

ChatGPT gets an end-to-end voice mode without separate STT+TTS: 320ms latency, natural emotions, interruptible. First truly natural AI conversation.

Voice & Audio: OpenAI, Advanced Voice Mode, ChatGPT

July 24, 2024 Medium

Suno v3: longer songs, better coherence, and audio upload

Suno updates to v3 with better lyrics-melody coherence, extension up to 4 minutes, and audio upload to continue existing tracks — consolidating its position in the AI music market.

Voice & Audio: Suno, Music Generation, AI Music

July 3, 2024 High

Moshi: Kyutai's first open-source full-duplex voice assistant

French non-profit lab Kyutai unveils Moshi, a full-duplex voice assistant with ~200ms latency based on a single multimodal model handling simultaneous input and output audio.

Voice & Audio: Kyutai, Moshi, Voice

April 10, 2024 High

Udio: professional-quality AI vocal music goes viral

Udio launches its music generation platform with convincing AI vocals from text prompts, professional production quality, and immediate viral growth on Twitter.

Voice & Audio: Udio, Music Generation, AI Music

March 28, 2024 Medium

Stable Audio Open: first open-weight model for music generation

Stable Audio Open is the first open-weight model for generating music and sound effects from text prompts, with a CC-BY license enabling commercial use, based on latent diffusion with timing conditioning.

Voice & Audio: Stable Audio, music generation, open source

February 29, 2024 Medium

Stable Audio 2.0: stereo music up to 3 minutes with structure control

Stability AI launches Stable Audio 2.0 with stereo audio generation up to 3 minutes, explicit control over intro/outro/instruments, and 44.1 kHz quality, surpassing the previous version's limits.

Voice & Audio: Stability AI, Stable Audio, Music Generation

January 12, 2024 Medium

MeloTTS: real-time multilingual TTS on CPU at 50MB

MeloTTS is the first production-quality multilingual TTS to run in real-time on CPU, weighing just 50MB and supporting English, Chinese, Japanese, Korean, Spanish and French.

Voice & Audio: MeloTTS, multilingual, real-time

December 15, 2023 Medium

StyleTTS2: open source TTS with style diffusion outperforms Voicebox on intelligibility

StyleTTS2 uses style diffusion and adversarial training to generate human-level natural voices on LJSpeech, open source, surpassing Voicebox on intelligibility.

Voice & Audio: StyleTTS2, TTS, Style Diffusion

November 21, 2023 High

OpenAI launches TTS API: six voices, streaming and aggressive pricing

OpenAI launches its TTS API with 6 voices, pricing at $0.015 per 1,000 characters, low-latency streaming, and direct integration into the ChatGPT and Assistants ecosystem.

Voice & Audio: OpenAI, TTS, API
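
As a rough worked example of the per-character pricing quoted in this entry, the sketch below estimates what a given text would cost to synthesize. Only the rate of $0.015 per 1,000 characters comes from the entry; the function name `tts_cost_usd`, the sample length, and the assumption of simple linear billing are illustrative, and actual billing rules may differ.

```python
from decimal import Decimal

# Rate quoted in the entry: $0.015 per 1,000 characters.
RATE_PER_1K_CHARS = Decimal("0.015")


def tts_cost_usd(num_chars: int) -> Decimal:
    """Estimate synthesis cost, assuming simple linear per-character billing."""
    if num_chars < 0:
        raise ValueError("num_chars must be non-negative")
    return RATE_PER_1K_CHARS * Decimal(num_chars) / Decimal(1000)


if __name__ == "__main__":
    # Hypothetical 100,000-character script (roughly a short audiobook chapter).
    print(tts_cost_usd(100_000))
```

Under these assumptions, a 100,000-character script comes to about $1.50.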

November 16, 2023 Medium

Google MusicLM: generating music from text goes public

Google makes MusicLM publicly available via Google Labs: musical generation from text description in a specific style, the first consumer music AI integration from a big tech company.

Voice & Audio: Google, MusicLM, Music Generation

October 26, 2023 Medium

Whisper Large v3: improved multilingual ASR trained on 5 million hours

Whisper Large v3 reduces error rates on low-resource languages, improves timestamp accuracy and adds new language support, remaining the most widely deployed open-source ASR model.

Voice & Audio: Whisper Large v3, ASR, speech recognition

September 28, 2023 High

AudioPaLM: the first LLM that processes and generates audio alongside text

AudioPaLM fuses PaLM-2 with an audio tokenizer to create an LLM that natively processes audio and text tokens, enabling speech translation while preserving speaker identity.

Voice & Audio: AudioPaLM, Google, audio LLM

September 1, 2023 High

Meta AudioCraft: open source suite for music and audio from text

Meta releases AudioCraft, an open source suite including MusicGen for generating structured music and AudioGen for ambient sounds, both controllable via text description.

Voice & Audio: Meta, AudioCraft, MusicGen

July 17, 2023 High

SeamlessM4T: Meta's universal speech translation model for 100+ languages

SeamlessM4T is the first multimodal system to handle speech-to-text, text-to-speech, and speech-to-speech across 100+ languages in a single model, powering Meta's real-time translation features.

Voice & Audio: SeamlessM4T, Meta, speech translation

June 16, 2023 High

Voicebox: Meta brings flow matching to TTS with audio editing and 6 languages

Voicebox uses flow matching with masked training to synthesize, edit, and transfer vocal styles across 6 languages, with no explicit cloning or fine-tuning.

Voice & Audio: Voicebox, TTS, Flow Matching

June 12, 2023 Medium

Bark: open source TTS with laughter, sighs, and music from text

Suno AI releases Bark on HuggingFace: an open source TTS model capable of generating paralinguistic sounds (laughter, sighs, sound effects, music) directly from text prompts.

Voice & Audio: Bark, Suno AI, TTS

May 18, 2023 High

SoundStorm: Google generates 30 seconds of natural dialogue in half a second

SoundStorm applies MaskGIT-style decoding to SoundStream codec tokens to generate audio in parallel rather than token by token: 30 s of dialogue in 0.5 s, preserving speaker consistency.

Voice & Audio: SoundStorm, Audio Generation, Google

January 27, 2023 High

XTTS: Coqui AI's open-source multilingual zero-shot voice cloning

XTTS brings multilingual zero-shot voice cloning to openly available models: just a 6-second audio sample to replicate a voice across 17 different languages, released under the Coqui Public Model License.

Voice & Audio: XTTS, Coqui, multilingual

January 26, 2023 High

ElevenLabs exits beta: AI voice becomes the creator standard

ElevenLabs exits public beta with 1-minute voice cloning, 29 languages, and prosodically natural TTS, establishing itself as the reference for creators and audiobooks.

Voice & Audio: ElevenLabs, Voice Cloning, TTS

January 5, 2023 Landmark

VALL-E: Microsoft clones a voice from 3 seconds of audio using in-context learning

VALL-E clones any voice with just 3 seconds of reference audio, no fine-tuning needed, using in-context learning on EnCodec tokens. First zero-shot TTS at naturalistic quality.

Voice & Audio: VALL-E, TTS, Voice Cloning

October 24, 2022 High

EnCodec: Meta AI compresses audio with neural networks and beats Opus

EnCodec compresses 24 kHz audio to just 1.5–12 kbps at quality surpassing Opus, becoming the standard neural codec and audio tokenizer for modern neural TTS.

Voice & Audio: EnCodec, Neural Codec, Audio Compression

September 21, 2022 High

Whisper open source: audio transcription becomes a commodity

OpenAI releases Whisper under MIT license: a speech-to-text model trained on 680,000 hours of multilingual audio, near commercial-grade quality, runs locally.

Voice & Audio: OpenAI, Whisper, ASR

September 12, 2022 High

AudioLM: Google teaches a language model to listen and continue audio

AudioLM generates long-range coherent audio using two tiers of tokens — semantic and acoustic — with no text or score conditioning.

Voice & Audio: AudioLM, Language Model, Audio Generation

June 17, 2022 High

SoundStream: Google's first real-time neural audio codec

SoundStream introduces Residual Vector Quantization to compress audio at 3 kbps with quality surpassing Opus at 12 kbps, founding the architecture of all modern neural codecs used in audio LLMs.

Voice & Audio: SoundStream, neural codec, RVQ
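
To make the residual vector quantization idea in this entry concrete, here is a minimal pure-Python sketch. It is not SoundStream's implementation: real codecs learn their codebooks on encoder features, whereas this demo uses tiny hand-built 2-D codebooks so the result is deterministic. All function names are hypothetical, and the bitrate settings (75 frames per second, 4 stages, 1024-entry codebooks) are assumed purely for illustration.

```python
import math


def make_codebooks(num_stages: int):
    """Hand-built 2-D codebooks: stage k holds the 4 corners (+/-s, +/-s), s = 0.5 ** (k + 1).

    Real codecs learn their codebooks; fixed ones keep this demo deterministic.
    """
    books = []
    for k in range(num_stages):
        s = 0.5 ** (k + 1)
        books.append([(sx * s, sy * s) for sx in (-1, 1) for sy in (-1, 1)])
    return books


def rvq_encode(x, codebooks):
    """Quantize x stage by stage; each stage encodes the residual left by the previous ones."""
    residual = list(x)
    codes = []
    for book in codebooks:
        # Pick the codeword nearest (squared Euclidean distance) to the current residual.
        idx = min(
            range(len(book)),
            key=lambda i: sum((r - c) ** 2 for r, c in zip(residual, book[i])),
        )
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, book[idx])]
    return codes


def rvq_decode(codes, codebooks):
    """Reconstruct by summing the selected codeword from every stage."""
    out = [0.0, 0.0]
    for idx, book in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, book[idx])]
    return out


def bitrate_bps(frames_per_sec, num_stages, codebook_size):
    """Bits per second = frames/s * stages * bits per code index."""
    return frames_per_sec * num_stages * math.log2(codebook_size)


if __name__ == "__main__":
    books = make_codebooks(3)
    codes = rvq_encode((0.3, -0.7), books)
    print(codes, rvq_decode(codes, books))
    # Assumed illustrative settings, not taken from the entry.
    print(bitrate_bps(75, 4, 1024))
```

Each extra stage adds a fixed number of bits per frame and refines the previous approximation, which is why a codec built this way can trade bitrate for quality simply by using more or fewer stages. With the assumed settings, four stages work out to 3,000 bits per second.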

June 6, 2022 Medium

Tortoise TTS: convincing voice cloning from a few seconds of audio

James Betker releases Tortoise TTS, an open source model with few-second voice cloning and human-like vocal quality — the first real breakthrough in accessible TTS.

Voice & Audio: TTS, Voice Cloning, Open Source

April 20, 2022 High

NaturalSpeech: Microsoft achieves human parity on LJSpeech benchmark

NaturalSpeech is the first TTS system to achieve a MOS statistically indistinguishable from recorded human speech on the LJSpeech benchmark, marking a historic milestone for speech synthesis.

Voice & Audio: NaturalSpeech, Microsoft, human parity

January 27, 2022 Medium

Coqui TTS: open source speech synthesis for everyone

Coqui TTS is an open source Python library for quality text-to-speech, forked from Mozilla TTS, supporting over 1100 languages and adopted by the HuggingFace community.

Voice & Audio: Coqui, TTS, Open Source

September 9, 2021 Medium

HuBERT: Meta brings self-supervised learning to speech, foreshadows Whisper

Meta AI publishes HuBERT, a self-supervised audio model based on masked prediction of discrete clusters — conceptual base for Whisper, w2v-BERT and audio-multimodal models.

Voice & Audio: Facebook, Meta, AV-HuBERT

June 15, 2021 High

VITS: end-to-end TTS with variational autoencoder

VITS unifies the acoustic model and vocoder into a single end-to-end model, achieving quality surpassing Tacotron 2 with faster inference.

Voice & Audio: VITS, TTS, end-to-end

June 20, 2020 High

wav2vec 2.0: Facebook AI's "BERT for speech"

Facebook AI publishes wav2vec 2.0, a self-supervised model that learns representations from raw audio and reaches SOTA on LibriSpeech with as little as 10 minutes of labeled data.

Voice & Audio: Facebook AI, wav2vec 2.0, Speech Recognition

April 30, 2020 Medium

OpenAI Jukebox: generating whole songs with vocals

OpenAI releases Jukebox, a generative model that produces raw songs (audio + vocals + lyrics) conditioned on artist and genre, built on a stack of VQ-VAE and autoregressive transformers.

Voice & Audio: OpenAI, Jukebox, Music Generation