AImpact
High Voice & Audio · 1 min read

wav2vec 2.0: Facebook AI's "BERT for speech"

In one sentence: Facebook AI publishes wav2vec 2.0, a self-supervised model that learns representations from raw audio, reaches state-of-the-art results on LibriSpeech, and still produces a usable recognizer with as little as 10 minutes of labeled data.

Verified Official source

Training a good speech-recognition system used to require thousands of hours of speech already transcribed by humans: slow and expensive work, especially for less common languages.

Facebook proposes a BERT-like idea, applied to speech: first let the model listen to many hours of unlabeled audio, hiding chunks and asking it to guess what's missing. Then show it a small amount of labeled audio, and it learns to transcribe very accurately.
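The "hide chunks and guess what's missing" step can be sketched in a few lines of Python. This is a minimal illustration, not Facebook's implementation: the real model masks learned latent features produced by a convolutional encoder and scores quantized targets with a Transformer, whereas here plain lists of floats stand in for those vectors. The function names `sample_span_mask` and `contrastive_loss` are hypothetical, and the default values (start probability 0.065, span length 10, temperature 0.1) are taken as assumptions based on the settings reported for wav2vec 2.0.

```python
import math
import random


def sample_span_mask(num_frames, start_prob=0.065, span_len=10, rng=None):
    """Mark which frames are hidden: each frame may start a masked span."""
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is reproducible
    mask = [False] * num_frames
    for start in range(num_frames):
        if rng.random() < start_prob:
            # Hide span_len consecutive frames; spans may overlap.
            for t in range(start, min(start + span_len, num_frames)):
                mask[t] = True
    return mask


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def contrastive_loss(context, target, distractors, temperature=0.1):
    """Negative log-probability of picking the true target among distractors."""
    logits = [cosine(context, target) / temperature]
    logits += [cosine(context, d) / temperature for d in distractors]
    peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[0]


# Usage: mask a 200-frame utterance, then compare a good and a bad guess.
mask = sample_span_mask(200)
good = contrastive_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
bad = contrastive_loss([1.0, 0.0], [-1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

The loss is small when the model's output at a hidden position is closest to the true content of that position and large when a distractor looks like a better match, which is what pushes the model to learn useful representations without any transcripts.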

With this trick, ten minutes of labeled audio is enough to build a working system. This opens speech recognition up to hundreds of minority languages and vertical use cases (calls, dialects, noisy environments) where labeled data is scarce.

Companies

Meta, Facebook AI Research

Tools

wav2vec 2.0

Tags

Facebook AI, wav2vec 2.0, Speech Recognition, Self-supervised, ASR

Sources