NaturalSpeech: Microsoft achieves human parity on LJSpeech benchmark

How do you measure whether an artificial voice sounds like a human one? For decades, researchers have used the MOS (Mean Opinion Score): real people listen to audio samples and rate them on a scale from 1 to 5. A human voice recorded in a studio typically scores around 4.44 on LJSpeech.
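
As a minimal sketch of the arithmetic behind a MOS figure, the snippet below averages a set of 1-to-5 ratings and attaches an approximate 95% confidence interval. The ratings are made up for illustration, and the function name `mean_opinion_score` and the normal-approximation interval are assumptions, not a description of any specific evaluation protocol.

```python
import math
import statistics

def mean_opinion_score(ratings):
    """Return (MOS, 95% confidence half-width) for a list of 1-5 ratings."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie on the 1-5 scale")
    mos = statistics.mean(ratings)
    # Normal approximation: 1.96 standard errors on either side of the mean.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width

ratings = [5, 4, 5, 4, 4, 5, 3, 5, 4, 5]  # made-up listener ratings
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

With only a handful of listeners the interval is wide, which is why differences of a few hundredths of a point between systems need many ratings before they mean anything.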

Until 2022, the best synthetic voices hovered around 4.35-4.40: good, but still perceptibly artificial to careful ears. Microsoft's NaturalSpeech closes that gap, achieving a MOS of 4.44, statistically indistinguishable from the human reference recordings.

The research goes beyond the headline result: it proposes a rigorous framework for deciding when a TTS system has reached "human parity", meaning no statistically significant difference between listener ratings of synthetic and human speech, not just a close average score.
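
To make the idea of a parity test concrete, here is a hedged sketch that compares paired ratings of human and synthetic versions of the same sentences. A Wilcoxon signed-rank test is a common choice for paired ratings like these; this sketch swaps in a simpler sign-flip permutation test so it needs only the standard library. The function name, the ratings, and the 0.05 threshold are illustrative assumptions.

```python
import random

def paired_permutation_pvalue(human, synthetic, n_resamples=10_000, seed=0):
    """Two-sided p-value for the mean paired difference under random sign flips."""
    if len(human) != len(synthetic):
        raise ValueError("ratings must be paired")
    diffs = [h - s for h, s in zip(human, synthetic)]
    observed = abs(sum(diffs) / len(diffs))
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the sign of each paired difference is arbitrary.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped) / len(flipped)) >= observed - 1e-12:
            extreme += 1
    # Add-one correction keeps the estimate strictly positive.
    return (extreme + 1) / (n_resamples + 1)

human     = [5, 4, 5, 4, 4, 5, 3, 5, 4, 5]  # made-up ratings
synthetic = [5, 4, 4, 5, 4, 5, 3, 5, 4, 5]  # made-up ratings
p = paired_permutation_pvalue(human, synthetic)
print("no significant difference at 0.05" if p > 0.05 else "significant difference")
```

A large p-value only means the test failed to detect a difference; with few listeners that is weak evidence, so a parity claim also depends on having enough ratings.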

Technically, the system uses a variational autoencoder (VAE) with a new phoneme alignment module that abandons fixed duration predictors in favor of a fully differentiable approach. This lets gradients flow through the entire pipeline without interruption, so the whole model can be optimized end to end.
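
To illustrate why differentiability matters here, the sketch below contrasts with the usual "repeat each phoneme for a rounded number of frames" step, whose rounding blocks gradients. It uses Gaussian soft upsampling, a generic technique chosen for illustration and not NaturalSpeech's own alignment module: every frame is a smooth weighted mix of phoneme vectors, so the output varies continuously with the durations. The function name `soft_upsample`, the `sigma` parameter, and the toy inputs are assumptions.

```python
import math

def soft_upsample(phonemes, durations, n_frames, sigma=1.0):
    """Expand phoneme vectors to n_frames frames using smooth attention weights."""
    # Centre of each phoneme on the frame axis, from cumulative durations.
    centers, end = [], 0.0
    for d in durations:
        centers.append(end + d / 2.0)
        end += d
    dim = len(phonemes[0])
    frames = []
    for t in range(n_frames):
        # Gaussian log-score of frame t (taken at its midpoint) for every phoneme.
        logits = [-((t + 0.5 - c) ** 2) / (2 * sigma ** 2) for c in centers]
        peak = max(logits)  # subtract the max for a numerically stable softmax
        scores = [math.exp(l - peak) for l in logits]
        total = sum(scores)
        weights = [s / total for s in scores]
        frames.append([sum(w * p[k] for w, p in zip(weights, phonemes))
                       for k in range(dim)])
    return frames

phonemes = [[1.0, 0.0], [0.0, 1.0]]  # two toy phoneme vectors
durations = [2.0, 3.0]               # real-valued durations, in frames
frames = soft_upsample(phonemes, durations, n_frames=5)
for row in frames:
    print([round(v, 3) for v in row])
```

Written in plain Python this only shows the arithmetic; the same operations expressed in an automatic-differentiation framework would let a training loss on the output frames update the durations directly.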

The practical message: for languages with abundant high-quality data, the single-speaker TTS problem can be considered substantially solved. Attention now shifts to multilingual synthesis, zero-shot voice cloning, and character voices.