HuBERT: Meta brings self-supervised learning to speech, foreshadows Whisper

In one sentence: Meta AI publishes HuBERT, a self-supervised audio model based on masked prediction of discrete clusters, the conceptual base for Whisper, w2v-BERT and audio-multimodal models.



While the world watches text-generating models, Meta works on something else: teaching a model to understand audio without labels. It's called HuBERT.

The idea: take hours and hours of speech (no transcripts), cut it into short frames, group similar-sounding frames into clusters, mask some stretches, and have the model guess which cluster each hidden frame belongs to. After lots of training, the model learns an internal speech representation that can be reused for recognition, generation, and translation.
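To make the mask-and-guess idea concrete, here is a toy sketch in plain Python. It is not Meta's implementation: the function names (`assign_clusters`, `sample_span_mask`, `masked_cross_entropy`), the 2-D "frames", and the hand-picked centroids are all assumptions made for illustration, and a list of uniform scores stands in for the real neural network. What it does show is the shape of the training signal: every frame gets a discrete cluster label, some spans are hidden, and the loss is computed only on the hidden frames.

```python
import math
import random

def assign_clusters(frames, centroids):
    """Label each frame with the index of its nearest centroid (its discrete target)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(centroids)), key=lambda k: dist2(f, centroids[k]))
            for f in frames]

def sample_span_mask(num_frames, span_len, num_spans, rng):
    """Pick random spans of consecutive frames to hide (spans may overlap)."""
    mask = [False] * num_frames
    for _ in range(num_spans):
        start = rng.randrange(0, num_frames - span_len + 1)
        for t in range(start, start + span_len):
            mask[t] = True
    return mask

def masked_cross_entropy(logits, targets, mask):
    """Average cross-entropy over the masked frames only."""
    total, count = 0.0, 0
    for row, target, hidden in zip(logits, targets, mask):
        if not hidden:
            continue  # visible frames do not contribute to this loss
        peak = max(row)
        log_z = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_z - row[target]
        count += 1
    return total / count if count else 0.0

# Toy data: 20 "audio frames" as 2-D points, 3 hypothetical cluster centroids.
rng = random.Random(0)
frames = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(20)]
centroids = [[-1.0, -1.0], [1.0, 1.0], [0.0, 0.0]]

targets = assign_clusters(frames, centroids)
mask = sample_span_mask(len(frames), span_len=4, num_spans=2, rng=rng)

# Stand-in for the model: an untrained guesser that scores every cluster equally.
logits = [[0.0] * len(centroids) for _ in frames]
loss = masked_cross_entropy(logits, targets, mask)  # equals log(3) for uniform scores
```

Training would push that loss down from the uniform-guess level by making the model's scores for hidden frames match their cluster labels, which it can only do by using the surrounding, unmasked audio as context.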

It's the same pattern as BERT, applied to audio: hide part of the input and train the model to fill it in. HuBERT isn't a consumer product, but it nails down the idea that leads to OpenAI's Whisper a year later and the explosion of multimodal audio models.