wav2vec 2.0: Facebook AI's "BERT for speech"

In one sentence: Facebook AI publishes wav2vec 2.0, a self-supervised model that learns representations from raw audio, reaches state-of-the-art results on LibriSpeech, and still transcribes usefully with as little as 10 minutes of labeled data.

Training a good speech-recognition system used to require thousands of hours of speech already transcribed by humans: slow and expensive work, especially for less common languages.

Facebook proposes a BERT-like idea applied to speech: first let the model listen to many hours of unlabeled audio, hiding chunks and asking it to guess what is missing. Then fine-tune it on a small amount of labeled audio, from a few hours down to a few minutes, and it learns to transcribe accurately.
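
The "hide chunks and guess what's missing" step can be illustrated with a toy sketch. It is not Facebook's implementation: the function names, the mask probability, the span length, and the temperature are illustrative assumptions, and the real model works on learned audio features rather than hand-written vectors. The sketch hides random spans of frames, then scores a guess by how well it picks the true hidden frame out of a set of distractors (a contrastive loss, the kind of objective the wav2vec 2.0 paper describes).

```python
import math
import random


def sample_mask(num_frames, start_prob, span_len, rng):
    """Mark frames as hidden: each frame may start a span of span_len hidden frames.

    start_prob and span_len are illustrative knobs, not values from the article.
    Spans are allowed to overlap and are clipped at the end of the sequence.
    """
    masked = [False] * num_frames
    for start in range(num_frames):
        if rng.random() < start_prob:
            for t in range(start, min(start + span_len, num_frames)):
                masked[t] = True
    return masked


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def contrastive_loss(prediction, target, distractors, temperature=0.1):
    """Cross-entropy of choosing the true target among the distractors.

    The loss is small when the prediction is closer to the true hidden frame
    than to any distractor, and large when a distractor looks more similar.
    """
    candidates = [target] + distractors
    logits = [cosine(prediction, c) / temperature for c in candidates]
    peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[0]


if __name__ == "__main__":
    rng = random.Random(0)
    mask = sample_mask(num_frames=50, start_prob=0.1, span_len=5, rng=rng)
    print("hidden frames:", sum(mask), "of", len(mask))

    target = [1.0, 0.0]                      # the frame that was hidden
    distractors = [[0.0, 1.0], [-1.0, 0.0]]  # other frames used as wrong answers
    print("good guess loss:", contrastive_loss([1.0, 0.0], target, distractors))
    print("bad guess loss: ", contrastive_loss([0.0, 1.0], target, distractors))
```

During pretraining, a loss like this would be computed only at the hidden positions, so the model is rewarded for inferring missing audio from its context without ever seeing a transcript.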

With this trick, ten minutes of labeled audio is enough to build a working system. That opens speech recognition up to hundreds of minority languages and to vertical use cases (calls, dialects, noisy environments) where labeled data is scarce.