Longformer: sliding-window attention for long documents

In one sentence: The Allen Institute for AI releases Longformer, a transformer that combines local sliding-window attention with global attention on a few selected tokens, so that cost scales linearly with sequence length; it handles inputs of up to 4,096 tokens and beats RoBERTa on long-document tasks.

Models like BERT can read at most 512 tokens, or a few hundred words, at a time. If you want them to read an article, a contract, or a whole PDF, you have to split the text into chunks, and the model loses the context that spans them.

The Allen Institute for AI presents Longformer, a variant that changes how the model attends to its input. Instead of comparing every token with every other token, it compares each token only with its neighbors (a sliding window), while a few designated tokens with global attention act as "key points" that see, and are seen by, the whole text.
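
The attention pattern described above can be sketched as a boolean mask. This is a minimal illustration, not Longformer's actual implementation: the function name `longformer_style_mask` and its parameters are hypothetical, and the sketch builds a dense mask for readability, whereas a real implementation would avoid materializing the full matrix.

```python
import numpy as np

def longformer_style_mask(seq_len, window, global_positions=()):
    """Return a boolean mask where mask[i, j] is True if token i may attend to token j.

    Hypothetical helper for illustration only (not part of any library).
    """
    half = window // 2
    idx = np.arange(seq_len)
    # Local band: each token sees neighbors at most `half` positions away on either side.
    mask = np.abs(idx[:, None] - idx[None, :]) <= half
    # Global tokens attend to every position, and every position attends to them.
    for g in global_positions:
        mask[g, :] = True
        mask[:, g] = True
    return mask

# Toy example: 8 tokens, a window of 2 (one neighbor per side), token 0 is global.
mask = longformer_style_mask(seq_len=8, window=2, global_positions=[0])
print(mask.astype(int))
print("attended pairs:", int(mask.sum()), "of", mask.size)
```

With a fixed window, the number of attended pairs in the band grows in proportion to the sequence length rather than its square, which is where the linear scaling comes from.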

Result: a BERT-style model can now read documents of up to 4,096 tokens, eight times BERT's 512-token limit, without giving up accuracy. It is one of the first practical models for question answering, summarization, and classification on full-length documents.