NaturalSpeech: Microsoft achieves human parity on LJSpeech benchmark
In one sentence: NaturalSpeech is the first TTS system to achieve a MOS statistically indistinguishable from recorded human speech on the LJSpeech benchmark, marking a historic milestone for speech synthesis.
How do you measure whether an artificial voice sounds like a human one? For decades, researchers have used the MOS (Mean Opinion Score): real people listen to audio samples and rate them on a scale from 1 to 5. A human voice recorded in a studio typically scores around 4.44 on LJSpeech.
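To make the metric concrete, here is a minimal sketch of how a MOS and an approximate 95% confidence interval could be computed from raw listener ratings. The function name, the normal-approximation interval, and the sample ratings are illustrative assumptions, not details taken from the NaturalSpeech evaluation.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (MOS, half-width of an approximate 95% confidence interval)."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie on the 1-5 scale")
    mos = statistics.mean(ratings)
    # Normal approximation: 1.96 standard errors on each side of the mean.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Hypothetical ratings from eight listeners for a single system.
ratings = [5, 4, 5, 4, 4, 5, 4, 5]
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

With only a handful of listeners the interval is wide, which is why differences of a few hundredths of a point between systems need many ratings to be meaningful.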
Until 2022, the best synthetic voices hovered around 4.35-4.40: good, but still perceptibly artificial to careful ears. Microsoft's NaturalSpeech closes that gap, achieving a MOS of 4.44, statistically indistinguishable from the reference human voice.
The research goes beyond the result: it proposes a rigorous framework for defining when a TTS system has achieved "human parity", namely when listener scores show no statistically significant difference from human recordings, not merely a close average.
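The idea of parity as "no statistically significant difference" can be sketched with a paired significance test. The summary above does not say which test the authors use, so this example swaps in a simple paired sign-flip permutation test built on the standard library; the function name, the 0.05 threshold, and all scores are hypothetical.

```python
import random


def paired_permutation_p_value(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Two-sided p-value for the hypothesis that the mean paired difference is zero."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be rated on the same utterances")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the sign of each paired difference is arbitrary.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)


# Hypothetical per-utterance mean ratings for the same eight sentences.
human = [4.5, 4.4, 4.6, 4.3, 4.5, 4.4, 4.6, 4.5]
synth_close = [4.4, 4.5, 4.6, 4.4, 4.4, 4.5, 4.5, 4.5]
synth_worse = [h - 0.3 for h in human]

p_close = paired_permutation_p_value(human, synth_close)
p_worse = paired_permutation_p_value(human, synth_worse)
print(f"close system: p = {p_close:.3f}")
print(f"worse system: p = {p_worse:.3f}")
```

A p-value above the chosen threshold means the test could not distinguish the system from the recordings on this data; a small test set can hide a real difference, so the number of utterances and listeners matters as much as the p-value itself.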
Technically, the system uses a variational autoencoder with a new phoneme alignment module that abandons fixed, non-differentiable duration predictors in favor of a fully differentiable approach. This lets gradients flow through the entire pipeline, so all components are optimized jointly rather than in separate stages.
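To illustrate what a differentiable alignment means in general, the sketch below turns predicted phoneme durations into soft frame-to-phoneme weights with a softmax over squared distance to each phoneme's center. This is a generic soft-upsampling technique chosen for illustration, not NaturalSpeech's actual module; the function name, the temperature parameter, and the durations are assumptions.

```python
import math


def soft_alignment(durations, n_frames, temperature=1.0):
    """Return an n_frames x n_phonemes matrix of soft alignment weights."""
    # Center of each phoneme on the frame axis, from cumulative durations.
    centers = []
    end = 0.0
    for d in durations:
        centers.append(end + d / 2.0)
        end += d
    weights = []
    for t in range(n_frames):
        frame_pos = t + 0.5
        # Softmax over negative squared distance: a smooth function of the durations.
        logits = [-((frame_pos - c) ** 2) / temperature for c in centers]
        top = max(logits)
        exps = [math.exp(v - top) for v in logits]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Hypothetical predicted durations (in frames) for three phonemes.
durations = [2.0, 3.5, 1.5]
align = soft_alignment(durations, n_frames=7)
for row in align:
    print(" ".join(f"{w:.2f}" for w in row))
```

Because every weight is a smooth function of the durations, the same computation written in an autodiff framework would pass gradients from the audio loss back to the duration predictor, which a hard rounding-and-repeat step would block.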
The practical message: for languages with abundant high-quality data, the single-speaker TTS problem can be considered substantially solved. Attention now shifts to multilingual synthesis, zero-shot voice cloning, and character voices.
Companies
Microsoft
Sources