Big Bird: Transformers for Longer Sequences

Mar 8, 2021

Industrial Applications: Big Bird is a new approach to sequence modeling problems such as:

Summarization
Translation
Q&A
Textual vector representation
Sentiment Analysis

Problem statement: Why Big Bird, when we have BERT, ELMo and their variants? Current Transformer-based models such as BERT use a self-attention architecture and have been among the most successful deep learning models for NLP problems. Unfortunately, one of their core limitations is the quadratic dependency (mainly in terms of memory) on the sequence length due to their full attention mechanism.

In simple words, self-attention performs a series of word-level comparisons: each word in an input sentence is compared with every word in the same sentence (hence self-attention), which gives n² comparisons for n words. For example, in image1 the word it_ is compared with itself, because_, street_, and the other words.

Running many of these comparisons in parallel and combining the results works well for short sequences. However, if you apply a transformer to longer sequences, such as the ones illustrated in the scientific paper, you start facing challenges such as high memory requirements, because the number of comparisons held in memory grows as O(n²). This quadratic complexity is the major problem.
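To make the quadratic growth concrete, here is a minimal sketch that counts the query-key scores full self-attention produces for a single attention head. The function names and the 4-bytes-per-score (float32) figure are assumptions for illustration, not values taken from the paper.

```python
def full_attention_scores(n_tokens: int) -> int:
    """Number of query-key scores for one head under full attention."""
    return n_tokens * n_tokens


def score_matrix_bytes(n_tokens: int, bytes_per_score: int = 4) -> int:
    """Memory for one score matrix, assuming float32 scores (an assumption)."""
    return full_attention_scores(n_tokens) * bytes_per_score


for n in (512, 4096):
    mib = score_matrix_bytes(n) / 2**20
    print(f"n={n}: {full_attention_scores(n)} scores, {mib:.0f} MiB per head per layer")
```

Under these assumptions, an input that is 8 times longer needs 64 times as many scores: 1 MiB per head per layer at 512 tokens grows to 64 MiB at 4096 tokens.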

image1: Self-attention graph for the word it_. You can also try this in your own notebook.

For BERT, the maximum input size is around 512 tokens, which means the model cannot be used directly for larger inputs or for tasks like large document summarization.

Solutions: There have been multiple attempts to simplify this calculation for longer texts.

Reducing full attention: Child et al. reduce the complexity to O(N√N), Kitaev et al. to O(N log(N))
Longer length work-around using sliding window: SpanBERT, ORQA, REALM, RAG, etc.
Longformer (from the Allen Institute for AI) and Extended Transformer Construction (ETC)
Understanding self-attention: (1) expressivity, Yun et al.; (2) Turing completeness, Perez et al.
Linformer from Facebook AI
Big Bird from Google

Linformer, from Facebook AI, showed that attention matrices are effectively low rank and can be approximated by smaller ones, which that model achieves by using linear projection layers.

Longformer used a combination of window attention (attending to the close neighbors of each token) and global attention (having a few global tokens that attend to every other token in the sequence).

Big Bird: Google’s answer to the same problem:

It achieves the reduction in matrix dimension and computational complexity by calculating random attention in addition to the window attention and global attention used in Longformer, thus making only O(n) inner products and converting the quadratic dependency to a linear one.
It uses a sparse attention mechanism, whose memory use scales linearly with the sequence length without compromising the expressiveness of a full attention mechanism.
BigBird preserves the properties of quadratic, full attention models. This lets the model process text sequences eight times longer than comparable full-attention transformer models.

Architecture:

BigBird uses a sparse attention mechanism, which means each token attends to only a selected subset of the other tokens, unlike BERT, where every token attends to every other token in the input.

Matrix A (the attention matrix) is a binary-valued n×n matrix where A(i,j)=1 if query i attends to key j and A(i,j)=0 otherwise. When A is all 1s, it is the traditional full attention mechanism. Since every token then attends to every other token, the memory requirement is quadratic.
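As a rough illustration of how one row of A restricts attention, the sketch below applies a softmax only over the keys that the row allows. The article defines A but does not show how it is applied, so the masked softmax here is an assumption based on common practice, and the function name is hypothetical.

```python
import math


def masked_attention_weights(scores, mask_row):
    """Softmax over the keys a query is allowed to attend to.

    scores   : raw query-key scores for one query (list of floats)
    mask_row : row A(i, .) of the binary attention matrix (list of 0/1)
    """
    allowed = [s for s, m in zip(scores, mask_row) if m]
    if not allowed:
        raise ValueError("query must attend to at least one key")
    peak = max(allowed)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) if m else 0.0 for s, m in zip(scores, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]


scores = [0.0, 0.0, 0.0]
print(masked_attention_weights(scores, [1, 1, 1]))  # full attention: all keys share the weight
print(masked_attention_weights(scores, [1, 0, 1]))  # sparse row: key 1 gets zero weight
```

Keys with A(i,j)=0 receive zero weight, so their scores never need to be computed or stored, which is where the memory saving comes from.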

1. Random Attention:

Each query attends over r randomly chosen keys. Mathematically, A(i,⋅)=1 for r randomly chosen keys. The complexity is O(r*n), which is linear in n.
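A minimal sketch of the random component, assuming the mask is stored as plain Python lists and using a fixed seed for reproducibility. The function name `random_attention_mask` and the example sizes are hypothetical.

```python
import random


def random_attention_mask(n, r, seed=0):
    """Binary n x n mask where each query (row) attends to r random keys."""
    rng = random.Random(seed)
    mask = [[0] * n for _ in range(n)]
    for i in range(n):
        # sample r distinct key positions for query i (requires r <= n)
        for j in rng.sample(range(n), r):
            mask[i][j] = 1
    return mask


mask = random_attention_mask(n=8, r=2)
nonzeros = sum(sum(row) for row in mask)
print(nonzeros)  # equals r * n, so it grows linearly with n
```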

2. (Sliding) Window Attention:

There is a great deal of locality of reference in NLP data, meaning that information about a token can be derived from its neighboring tokens. To exploit this, BigBird uses sliding window attention of width w.

The query at location i attends to the keys from i−w/2 to i+w/2. Mathematically, A(i,i−w/2:i+w/2)=1. The complexity is O(w*n), which is also linear in n.
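The window rule can be sketched as follows. Treating both ends of the range as inclusive and clipping the window at the sequence boundaries are assumptions made for this example, and the function name is hypothetical.

```python
def window_attention_mask(n, w):
    """Binary n x n mask where query i attends to keys i - w//2 .. i + w//2."""
    half = w // 2
    mask = [[0] * n for _ in range(n)]
    for i in range(n):
        lo = max(0, i - half)      # clip at the start of the sequence (assumption)
        hi = min(n - 1, i + half)  # clip at the end of the sequence (assumption)
        for j in range(lo, hi + 1):
            mask[i][j] = 1
    return mask


mask = window_attention_mask(n=8, w=2)
for row in mask:
    print(row)
```

Each row holds at most w + 1 ones, so the number of attended pairs is bounded by (w + 1) * n.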

3. Global Attention:

Global tokens are tokens that attend to all tokens in the sequence and to whom all tokens attend. BigBird utilizes this global token notion in two ways:

BIGBIRD-ITC (Internal Transformer Construction): Make some existing tokens “global” and make them attend over the entire input sequence.

BIGBIRD-ETC (Extended Transformer Construction): Add g additional “global” tokens (e.g. CLS) that attend to all existing tokens. This extends the matrix A by g rows and g columns.
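The two constructions can be sketched as masks. The function names are hypothetical, and placing the extra ETC tokens at the front of the sequence (as is usual for CLS) is an assumption for this example.

```python
def itc_global_mask(n, global_idx):
    """ITC: mark some existing token positions as global."""
    mask = [[0] * n for _ in range(n)]
    for g in global_idx:
        for j in range(n):
            mask[g][j] = 1  # the global token attends to every token
            mask[j][g] = 1  # every token attends to the global token
    return mask


def etc_global_mask(n, g):
    """ETC: add g extra global tokens, growing the mask to (n + g) x (n + g)."""
    # assumption: the extra tokens occupy the first g positions
    return itc_global_mask(n + g, range(g))


for row in itc_global_mask(4, [0]):
    print(row)
print(len(etc_global_mask(4, 1)))  # 5 rows: 4 original tokens plus 1 added token
```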

Impact:

Using BigBird and its sparse attention mechanism, the team of researchers decreased the attention complexity from O(n²) (as in BERT) to just O(n).
This means that the input sequence length, previously limited to 512 tokens, is increased to 4096 tokens (8 * 512).
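To see why the combined pattern stays linear, the sketch below compares the n² pairs of full attention with a simple upper bound on the pairs kept by random, window, and global attention together. The values chosen for r, w, and g are hypothetical, picked only for illustration, and the function names are not from the paper.

```python
def full_attention_pairs(n):
    """Query-key pairs scored by full attention."""
    return n * n


def sparse_attention_pairs_bound(n, r, w, g):
    """Upper bound on query-key pairs kept by random + window + global attention."""
    per_row = r + (w + 1) + g  # keys each query may attend to
    global_rows = g * n        # each global token attends to all n keys
    return n * per_row + global_rows


R, W, G = 64, 64, 32  # hypothetical settings, not values from the paper
for n in (512, 1024, 2048, 4096):
    full = full_attention_pairs(n)
    sparse = sparse_attention_pairs_bound(n, R, W, G)
    print(f"n={n}: full={full}, sparse<={sparse}, ratio={full / sparse:.1f}")
```

Doubling n doubles the sparse bound but quadruples the full-attention count, so the gap widens as sequences get longer.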

Results & Comparisons:

Sparse attention allows the model to attend to sequences 8 times longer. Gradient checkpointing can be used to handle sequences more than 8x longer. Below are results from NLP tasks.

1. Pretraining & MLM

**Predict a random subset of masked-out tokens (MLM); metric: bits per character (BPC)**

Bits per character (BPC) is another metric often reported for recent language models. It measures exactly the quantity it is named after: the average number of bits needed to encode one character. This leads back to Shannon’s explanation of the entropy of a language.
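As a small worked example of that definition, the sketch below averages −log2 of the probability the model assigned to each character that actually occurred. The function names and the sample probabilities are hypothetical. The second helper assumes a cross-entropy loss reported in nats (natural logarithm), which is a common convention but not something the article states.

```python
import math


def bits_per_character(char_probs):
    """Average bits per character, given the probability the model
    assigned to each character that actually occurred."""
    return -sum(math.log2(p) for p in char_probs) / len(char_probs)


def nats_to_bits(loss_nats):
    """Convert an average cross-entropy loss from nats to bits."""
    return loss_nats / math.log(2)


print(bits_per_character([0.5, 0.5, 0.5, 0.5]))  # 1.0: one bit per character
print(bits_per_character([0.25, 0.25]))          # 2.0: two bits per character
```

A lower BPC means the model assigns higher probability to the correct characters, so lower is better.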

2. Encoder Only Tasks: Question Answering