Big Bird: Transformers for Longer Sequences


Industrial Applications: Big Bird is a new approach to sequence modeling problems such as:

  • Summarization
  • Translation
  • Q&A
  • Textual vector representation
  • Sentiment Analysis

Problem statement: Why Big Bird, when we already have BERT, ELMo and their variants? Current Transformer-based models such as BERT use a self-attention architecture and have been among the most successful deep learning models for NLP problems. Unfortunately, one of their core limitations is the quadratic dependency (mainly in terms of memory) on the sequence length, which comes from their full attention mechanism.

In simple words, self-attention performs a series of word-level comparisons over all n² pairs of words. In other words, each word in an input sentence is compared with every word in the same sentence, itself included (hence self-attention). For example, in image1 the word it_ is compared with itself, because_, street_, and the other words.

Doing many of these comparisons in parallel and combining the results works well for short sequences. However, if you want to apply a Transformer to longer sequences, such as the one illustrated in the scientific paper, you start facing challenges such as a high memory requirement, because all of these comparisons are held in memory (O(n²) complexity). This quadratic complexity is the major problem.

image1: Self-attention graph for the word it_. You can also try it in your own notebook.

The maximum input size is around 512 tokens, which means the model cannot be used for larger inputs or for tasks like summarizing long documents.
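To make the quadratic growth concrete, here is a rough back-of-the-envelope sketch. It assumes one attention head, one layer, and 4-byte (float32) scores; the function name and these defaults are illustrative and not taken from any library.

```python
def full_attention_cost(n, bytes_per_score=4):
    """Return (number of pairwise scores, bytes needed to store them) for one head."""
    scores = n * n  # every token is compared with every token, itself included
    return scores, scores * bytes_per_score

for n in (512, 4096):
    scores, size = full_attention_cost(n)
    print(f"n={n}: {scores:,} scores, {size / 2**20:.0f} MiB per head")
```

Making the sequence 8 times longer multiplies the number of scores by 64, which is why simply raising the 512-token limit does not scale.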

Solutions: There have been multiple attempts to simplify this calculation for longer texts.

  • Reducing full attention: Child et al. reduce the complexity to O(N√N), Kitaev et al. to O(N log N)
  • Working around the length limit with a sliding window: SpanBERT, ORQA, REALM, RAG, etc.
  • Longformer (from the Allen Institute for AI) and Extended Transformer Construction (ETC)
  • Understanding self-attention: 1. expressivity (Yun et al.), 2. Turing completeness (Pérez et al.)
  • Linformer from Facebook AI
  • Big Bird from Google

Linformer, from Facebook AI, showed that attention matrices are effectively low rank and can be approximated by smaller ones; in that model this is achieved with linear projection layers.
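The shape bookkeeping behind this low-rank idea can be sketched in plain Python. This is only an illustration under assumptions: the sizes n, d, k are made up, and the random matrix E merely stands in for a learned projection; it is not Linformer's actual code.

```python
import random

def matmul(a, b):
    """Plain-Python matrix product of a (p x q) and b (q x s)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

rng = random.Random(0)
n, d, k = 16, 4, 3  # sequence length, head dimension, projected length (illustrative)

Q = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
K = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
E = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(k)]  # stand-in for a learned projection

K_proj = matmul(E, K)                  # k x d instead of n x d
scores = matmul(Q, transpose(K_proj))  # n x k instead of n x n
print(len(scores), len(scores[0]))     # 16 3
```

With k fixed, the score matrix grows linearly in n rather than quadratically.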

Longformer used a combination of window attention (attending to the close neighbors of each token) and global attention (having a few global tokens that attend to every other token in the sequence).

Big Bird: Google’s answer to the same problem:

  • It reduces the computational complexity by adding random attention to the window attention and global attention used in Longformer, so that only O(n) inner products are computed. This converts the quadratic dependency into a linear one.
  • It uses a sparse attention mechanism whose memory scales linearly with the sequence length without compromising the expressiveness of a full attention mechanism.
  • BigBird preserves the properties of quadratic, full attention models. This has enabled the model to process text sequences up to eight times longer than was previously possible with full attention on similar hardware.


BigBird uses a sparse attention mechanism, which means each token attends only to a small subset of the other tokens, unlike BERT, where every token attends to every token in the input.

Big Bird Attention Mechanism

Matrix A (the attention matrix) is a binary-valued n×n matrix where A(i,j)=1 if query i attends to key j and 0 otherwise. When A is all 1s, it is the traditional full attention mechanism: every token attends to every other token, so the memory requirement is quadratic.

1. Random Attention:

Each query attends to r randomly chosen keys. Mathematically, A(i,⋅)=1 for r randomly chosen keys. The complexity is O(r*n), which is linear in n.
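A minimal sketch of such a random attention mask follows. It works token by token for readability and uses a fixed seed; the function name is hypothetical, and practical implementations typically operate on blocks of tokens rather than single tokens.

```python
import random

def random_attention_mask(n, r, seed=0):
    """Binary n x n mask in which each query row attends to r randomly chosen keys."""
    rng = random.Random(seed)
    mask = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in rng.sample(range(n), r):  # r distinct key positions for query i
            mask[i][j] = 1
    return mask

mask = random_attention_mask(n=8, r=2)
print(sum(map(sum, mask)))  # r * n = 16 nonzero entries
```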

2. (Sliding) Window Attention:

There is a great deal of locality of reference in NLP data: information about a token can be derived from its neighboring tokens. To exploit this, BigBird uses sliding window attention of width w.

The query at location i attends to the keys from i−w/2 to i+w/2. Mathematically, A(i,i−w/2:i+w/2)=1. The complexity is O(w*n), which is also linear in n.
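The window pattern can be sketched the same way. One assumption here is that the window is simply clipped at the two ends of the sequence; the function name is illustrative.

```python
def window_attention_mask(n, w):
    """Binary n x n mask where query i attends to keys i - w//2 .. i + w//2."""
    half = w // 2
    mask = [[0] * n for _ in range(n)]
    for i in range(n):
        # Clip the window so it never runs past the sequence boundaries.
        for j in range(max(0, i - half), min(n, i + half + 1)):
            mask[i][j] = 1
    return mask

mask = window_attention_mask(n=8, w=2)
print(mask[3])              # [0, 0, 1, 1, 1, 0, 0, 0]
print(sum(map(sum, mask)))  # 22, below the (w + 1) * n = 24 upper bound
```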

3. Global Attention:

Global tokens are tokens that attend to all tokens in the sequence and to whom all tokens attend. BigBird utilizes this global token notion in two ways:

BIGBIRD-ITC (Internal Transformer Construction): Make some existing tokens “global” and make them attend over the entire input sequence.

BIGBIRD-ETC (Extended Transformer Construction): Add g additional “global” tokens (e.g. CLS) that attend to all existing tokens. This extends the columns and rows of the matrix A by g rows/columns.
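Both variants can be illustrated with one small helper. This is a sketch under assumptions: the function names are made up, and placing the extra ETC tokens at the front of the sequence is a choice made here for simplicity.

```python
def global_attention_mask(n, global_positions):
    """ITC style: the given positions attend to everything and are attended to by everything."""
    g = set(global_positions)
    return [[1 if (i in g or j in g) else 0 for j in range(n)] for i in range(n)]

def etc_global_mask(n, g):
    """ETC style: add g extra global tokens, giving an (n + g) x (n + g) mask."""
    return global_attention_mask(n + g, range(g))

itc = global_attention_mask(6, [0])  # make the existing first token global
etc = etc_global_mask(6, 2)          # add two new global tokens
print(len(itc), len(etc))            # 6 8
print(itc[3])                        # [1, 0, 0, 0, 0, 0]
```

In the ITC mask the matrix keeps its n×n size, while the ETC mask grows by g rows and g columns.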


  • Using BigBird and its sparse attention mechanism, the researchers reduced the O(n²) complexity of BERT to just O(n).
  • This means that the input sequence length, previously limited to 512 tokens, can now be increased to 4096 tokens (8 * 512).
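A quick count shows the linear scaling. The values r=3, w=3, g=2 below are purely illustrative, and the sparse count is an upper bound because overlapping random, window, and global entries are counted more than once.

```python
def sparse_attention_entries(n, r, w, g):
    """Upper bound on nonzero entries of A for random + window + global attention."""
    per_query = r + (w + 1) + g  # keys each query may attend to
    global_rows = g * n          # each global token also attends to all n keys
    return n * per_query + global_rows

def full_attention_entries(n):
    return n * n

for n in (512, 4096):
    print(n, sparse_attention_entries(n, r=3, w=3, g=2), full_attention_entries(n))
```

Going from 512 to 4096 tokens multiplies the sparse count by 8 but the full count by 64.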

Results & Comparisons:

Sparse attention allows the model to attend to sequences 8 times longer. It is possible to use gradient checkpointing to handle sequences more than 8x longer. Below are results from NLP tasks.

1. Pretraining & MLM
Predict a random subset of masked-out tokens. Metric: bits-per-character (BPC) on the MLM task.

Bits-per-character (BPC) is a metric often reported for recent language models. It measures exactly the quantity it is named after: the average number of bits needed to encode one character. This goes back to Shannon's account of the entropy of a language.
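As a small worked example, BPC can be obtained from a model's total negative log-likelihood. The sketch assumes the loss is measured in nats (natural logarithm), as is common in deep learning frameworks; the function name is hypothetical.

```python
import math

def bits_per_character(total_nll_nats, num_characters):
    """Convert a summed negative log-likelihood in nats into average bits per character."""
    return total_nll_nats / (num_characters * math.log(2))

# A model that assigns probability 0.5 to each of 10 characters needs 1 bit per character.
nll = 10 * -math.log(0.5)
print(round(bits_per_character(nll, 10), 6))  # 1.0
```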

2. Encoder Only Tasks: Question Answering

Fine-tuning results on the test set for QA tasks. The test results (F1 for HotpotQA, Natural Questions, TriviaQA, and accuracy for WikiHop) were taken from the respective leaderboards. For each task, the top-3 leaders were picked, not including BIGBIRD-ETC. For Natural Questions Long Answer (LA), TriviaQA, and WikiHop, the paper reports BIGBIRD-ETC as the new state of the art. On HotpotQA, BIGBIRD-ETC is third on the leaderboard by F1 and second by Exact Match (EM).

3. Document Classification: Improves the state of the art by 5 percentage points

4. Encoder-Decoder Tasks

Taking a Pegasus checkpoint and converting it to sparse attention

The Pegasus pretraining task is intentionally similar to summarization: important sentences are removed/masked from an input document and are generated together as one output sequence from the remaining sentences, similar to an extractive summary. ROUGE, or Recall-Oriented Understudy for Gisting Evaluation, is a set of metrics and a software package for evaluating automatic summarization and machine translation in natural language processing.
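To give a feel for what ROUGE measures, here is a simplified sketch of ROUGE-1 recall (unigram overlap). It assumes lowercased whitespace tokenization and is not the official ROUGE package, which also handles details such as stemming and further variants.

```python
from collections import Counter

def rouge1_recall(reference, candidate):
    """Fraction of reference unigrams that also appear in the candidate summary."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    # Clip each token's match count by how often it occurs in the candidate.
    overlap = sum(min(count, cand[tok]) for tok, count in ref.items())
    return overlap / sum(ref.values()) if ref else 0.0

score = rouge1_recall("the cat sat on the mat", "the cat lay on the mat")
print(round(score, 3))  # 0.833 (5 of 6 reference words recovered)
```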



Applying ML at Scale