AImpact
Medium Voice & Audio · 1 min read

HuBERT: Meta brings self-supervised learning to speech, foreshadows Whisper

In one sentence: Meta AI publishes HuBERT, a self-supervised audio model based on masked prediction of discrete clusters, and a conceptual precursor to Whisper, w2v-BERT and audio-multimodal models.


While the world watches text-generating models, Meta works on something else: teaching a model to understand audio without labels. It's called HuBERT.

The idea: take many hours of speech with no transcripts, cut it into short chunks, hide some of them, and have the model guess what was there (more precisely, which cluster of similar sounds each hidden chunk belongs to). After enough training, the model learns an internal representation of speech that can be reused for recognition, generation, and translation.
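
The mask-and-guess recipe can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not Meta's implementation: the function names (`assign_clusters`, `mask_spans`, `build_training_example`), the 2-D frames, the fixed centroids, and the zero-vector mask are all hypothetical choices made for the example, and the default span settings are illustrative. The real model derives its cluster targets with k-means over acoustic features and uses a Transformer to predict them.

```python
import random


def assign_clusters(frames, centroids):
    """Map each frame (a list of floats) to the index of its nearest centroid."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(centroids)), key=lambda k: sq_dist(frame, centroids[k]))
        for frame in frames
    ]


def mask_spans(num_frames, start_prob, span_len, rng):
    """Pick random span starts and mask span_len consecutive frames from each."""
    masked = [False] * num_frames
    for start in range(num_frames):
        if rng.random() < start_prob:
            for i in range(start, min(start + span_len, num_frames)):
                masked[i] = True
    return masked


def build_training_example(frames, centroids, start_prob=0.08, span_len=10, seed=0):
    """Return (masked inputs, {masked position: cluster id to predict})."""
    rng = random.Random(seed)
    targets = assign_clusters(frames, centroids)
    masked = mask_spans(len(frames), start_prob, span_len, rng)
    dim = len(frames[0])
    # Assumption: hidden frames are replaced by zeros; a real model would
    # typically use a learned mask embedding instead.
    inputs = [[0.0] * dim if m else list(f) for f, m in zip(frames, masked)]
    # In this sketch only the hidden positions carry prediction targets.
    masked_targets = {i: targets[i] for i, m in enumerate(masked) if m}
    return inputs, masked_targets


if __name__ == "__main__":
    frames = [[0.1, 0.0], [0.9, 1.0], [0.2, 0.1], [1.1, 0.8]]
    centroids = [[0.0, 0.0], [1.0, 1.0]]
    print(assign_clusters(frames, centroids))  # [0, 1, 0, 1]
    inputs, to_predict = build_training_example(
        frames, centroids, start_prob=0.5, span_len=2, seed=1
    )
    print(inputs, to_predict)
```

No transcript appears anywhere in the sketch: the cluster ids stand in for labels, which is what lets training run on raw, untranscribed speech.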

It is the same pattern as BERT, applied to audio. HuBERT isn't a consumer product, but it establishes the approach that paves the way for OpenAI's Whisper a year later and for the wave of multimodal audio models.

Companies

Meta

Tools

AV-HuBERT

Tags

Facebook, Meta, AV-HuBERT, Speech, Self-supervised

Sources