ELECTRA: more efficient NLP pre-training than BERT

In one sentence: Clark, Luong, Le, and Manning publish ELECTRA at ICLR 2020. Instead of masked language modeling, it trains the model to detect tokens replaced by a small generator, matching RoBERTa and XLNet with less than a quarter of their compute.

Building a good language model like BERT takes a very long training run that burns a lot of electricity. Researchers at Stanford and Google propose a clever trick to cut that cost.

Instead of hiding some words and asking the model to guess them (BERT's approach), a small generator model replaces some words with plausible alternatives, and the main model (the discriminator) has to decide, for every word, whether it is the original or a swap.
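
To make the setup concrete, here is a minimal Python sketch of how one training example for replaced-token detection could be built. It is an illustration, not the authors' code: the function name `corrupt`, the toy vocabulary, and the use of uniform random sampling in place of a trained generator are all assumptions. One detail follows the paper: if the sampled token happens to equal the original, the position is labeled as original.

```python
import random

def corrupt(tokens, vocab, mask_rate=0.15, seed=0):
    """Toy stand-in for the generator (hypothetical helper).

    Picks about mask_rate of the positions and swaps in a token sampled
    uniformly from vocab. The real generator is a small masked language
    model, so its samples are far more plausible than these.
    """
    rng = random.Random(seed)
    n_replace = max(1, round(len(tokens) * mask_rate))
    positions = rng.sample(range(len(tokens)), n_replace)

    corrupted = list(tokens)
    for i in positions:
        corrupted[i] = rng.choice(vocab)

    # One label per token: 1 = replaced, 0 = original.
    # A sample that matches the original token counts as original.
    labels = [int(c != t) for c, t in zip(corrupted, tokens)]
    return corrupted, labels, positions

if __name__ == "__main__":
    tokens = "the chef cooked the meal".split()
    vocab = ["the", "chef", "cooked", "ate", "meal", "a"]
    corrupted, labels, positions = corrupt(tokens, vocab)
    print(corrupted)
    print(labels)
```

The discriminator is trained to predict `labels` from `corrupted`, so every position in the sentence carries a target, not only the ones that were swapped.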

It sounds like a detail, but the training signal becomes much denser: the model learns from every token, not just the roughly 15% that are masked. Result: quality comparable to RoBERTa and XLNet, with less than a quarter of their compute.
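
The difference in density can be read off the two loss functions. The following is a sketch in the notation of the ELECTRA paper, where $m$ is the set of masked positions, $n$ is the sequence length, $p_G$ is the generator's output distribution, and $D(\cdot, t)$ is the discriminator's predicted probability that position $t$ holds the original token:

```latex
\begin{align*}
\mathcal{L}_{\text{MLM}} &= \mathbb{E}\Big[\sum_{i \in m} -\log p_G\big(x_i \mid x^{\text{masked}}\big)\Big] \\
\mathcal{L}_{\text{Disc}} &= \mathbb{E}\Big[\sum_{t=1}^{n}
  -\mathbb{1}\big(x^{\text{corrupt}}_t = x_t\big)\log D\big(x^{\text{corrupt}}, t\big)
  -\mathbb{1}\big(x^{\text{corrupt}}_t \neq x_t\big)\log\big(1 - D(x^{\text{corrupt}}, t)\big)\Big]
\end{align*}
```

The first sum runs only over the masked positions, about 15% of the sequence, while the second runs over all $n$ positions.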