Big Bird at NeurIPS 2020: sparse attention for sequences up to 4096 tokens

In one sentence: at NeurIPS 2020, Google Research presents Big Bird, a transformer whose sparse attention (local + global + random) scales linearly with sequence length, reaches state-of-the-art results on long-document QA and summarization, and comes with a proof of Turing-completeness.



After Longformer and Reformer, Google joins the race for models that read very long texts. Their model is Big Bird. It mixes three strategies: each word looks at its nearby words (like Longformer), a few "special" words look at every word and are seen by every word, and each word also looks at a few words chosen at random.
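To make the three patterns concrete, here is a minimal sketch that builds a boolean attention mask combining them. It is an illustration only, not the authors' implementation: the function name `sparse_attention_mask` and the parameters `window`, `num_global`, and `num_random` are hypothetical, the mask works on single tokens, and the global tokens are assumed to be the first few positions.

```python
import numpy as np

def sparse_attention_mask(seq_len, window=3, num_global=2, num_random=2, seed=0):
    """Return a (seq_len, seq_len) boolean mask.

    mask[i, j] is True when token i is allowed to attend to token j.
    All names and default values here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    idx = np.arange(seq_len)

    # Local: each token sees neighbours at most `window` positions away.
    mask = np.abs(idx[:, None] - idx[None, :]) <= window

    # Global: the first `num_global` tokens attend to every token,
    # and every token attends to them.
    mask[:num_global, :] = True
    mask[:, :num_global] = True

    # Random: each token also attends to `num_random` randomly chosen tokens.
    for i in range(seq_len):
        picks = rng.choice(seq_len, size=num_random, replace=False)
        mask[i, picks] = True

    return mask

if __name__ == "__main__":
    mask = sparse_attention_mask(16)
    # Print the pattern: '#' marks an allowed (query, key) pair.
    for row in mask:
        print("".join("#" if allowed else "." for allowed in row))
```

Printing the mask for a short sequence shows a diagonal band (local), a full first few rows and columns (global), and scattered extra entries (random). Apart from the global rows, each row has a bounded number of allowed entries no matter how long the sequence is.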

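The mix sounds odd, but it holds up mathematically: the authors show that with this sparse attention the model can still approximate any sequence-to-sequence function, as a dense transformer can, while its memory use grows linearly rather than quadratically with sequence length.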
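A rough count shows where the memory saving comes from. The sketch below compares the number of (query, key) pairs in dense attention with an upper bound for a local + global + random pattern. The function name `attention_entries` and the default sizes are hypothetical values chosen for illustration, not figures from the paper.

```python
def attention_entries(seq_len, window=3, num_global=2, num_random=3):
    """Return (dense, sparse) counts of attended (query, key) pairs.

    `sparse` is an upper bound: overlaps between the local, global and
    random picks are counted twice. All defaults are illustrative assumptions.
    """
    dense = seq_len * seq_len

    # An ordinary token attends to its local window, the global tokens,
    # and a few random tokens: a constant number, independent of seq_len.
    per_token = (2 * window + 1) + num_global + num_random

    # Global tokens attend to the whole sequence.
    sparse = num_global * seq_len + (seq_len - num_global) * per_token
    return dense, sparse

if __name__ == "__main__":
    for n in (512, 1024, 2048, 4096):
        dense, sparse = attention_entries(n)
        print(f"n={n:5d}  dense={dense:10d}  sparse<={sparse:7d}")
```

Doubling the sequence length quadruples the dense count but only about doubles the sparse bound, which is what linear scaling means here.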

Big Bird is applied to real problems, such as question answering over scientific articles and long-document summarization, where it beats the previous best results.