wav2vec 2.0: Facebook AI's "BERT for speech"
In one sentence
Facebook AI releases wav2vec 2.0, a self-supervised model that learns speech representations from raw audio, achieves state-of-the-art results on LibriSpeech, and remains usable with as little as 10 minutes of labeled data.
Training a good speech-recognition system used to require thousands of hours of human-transcribed speech, which is slow and expensive to produce, especially for less common languages.
Facebook proposes a BERT-like idea applied to speech. First, the model listens to many hours of unlabeled audio in which chunks are hidden, and it has to guess what is missing. It is then fine-tuned on a small amount of labeled audio and learns to transcribe very accurately.
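The "hide chunks and guess what's missing" step can be sketched in a few lines of plain Python. This is a toy illustration, not Facebook's implementation: the function names are hypothetical, the vectors are made up, and the defaults (span start probability 0.065, span length 10, temperature 0.1) follow the values reported in the wav2vec 2.0 paper but should be treated as assumptions here. The loss is low when the model's guess for a hidden frame is closer to the true frame than to the distractor frames.

```python
import math
import random


def sample_span_mask(num_frames, start_prob=0.065, span_len=10, seed=0):
    """Return a list of booleans marking which audio frames are hidden.

    Each frame is picked as a span start with probability `start_prob`;
    the `span_len` frames beginning at each start are masked.
    Spans may overlap. Defaults are assumed, not authoritative.
    """
    rng = random.Random(seed)
    mask = [False] * num_frames
    for start in range(num_frames):
        if rng.random() < start_prob:
            for t in range(start, min(start + span_len, num_frames)):
                mask[t] = True
    return mask


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def contrastive_loss(context, target, distractors, temperature=0.1):
    """Negative log-probability of picking `target` among all candidates.

    `context` is the model's guess for a hidden frame, `target` is the
    true representation of that frame, and `distractors` are frames
    taken from other positions.
    """
    candidates = [target] + distractors
    logits = [cosine(context, c) / temperature for c in candidates]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    max_logit = max(logits)
    log_sum = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum - logits[0]


# Hide spans in a 200-frame utterance.
mask = sample_span_mask(200)

# Toy 2-D vectors: a good guess points at the target, a bad one at a distractor.
target = [1.0, 0.0]
distractors = [[0.0, 1.0], [-1.0, 0.0]]
loss_good = contrastive_loss([1.0, 0.0], target, distractors)
loss_bad = contrastive_loss([0.0, 1.0], target, distractors)
```

In this toy setup `loss_good` comes out far smaller than `loss_bad`, which is the signal that pushes the model toward representations that identify the hidden audio.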
With this trick, ten minutes of labeled audio is enough to build a working system. That opens speech recognition up to hundreds of minority languages and vertical use cases (phone calls, dialects, noisy environments) where labeled data is scarce.
Companies
Meta, Facebook AI Research
Tools
wav2vec 2.0
Tags
Sources