Hello and welcome to a deep-dive on one of the more efficient variants of the Transformer network. In this article, we will discuss Longformer, which overcomes one of the famous pitfalls of transformers: the inability to process long sequences, because attention scales quadratically with sequence length. The Longformer is a vanilla transformer with a modified attention mechanism that combines local (windowed) self-attention with global attention.

Time and Memory required for different sequence lengths. Pay attention to the Blue (Regular self attention) and Green (Longformer vectorized) lines in the graph. Image Credits — Longformer paper


We know from our previous article on the Transformer (Attention is All You Need) that the architecture does not scale to very long sequences. Much of the community has worked on mitigating this, and there have been various improvements in the form of Transformer-XL, Adaptive Span Transformer, Reformer, Compressive Transformer, etc. We will revisit these architectures in future blog posts, as each of them presents novel methodologies for improving the performance and efficiency of Transformers (we already have an article on Transformer-XL).

Summary of work on adapting Transformers for long documents. ltr means left-to-right. Image Credits — Longformer paper

Longformer's attention mechanism scales linearly rather than quadratically with sequence length, which is already a big win in training time and memory on long sequences. This works very well for document classification and question answering, tasks in which we need to model long sequences. The self-attention setup in Longformer is very interesting and draws parallels to another breakthrough (and one of the most fundamental) architectures in Computer Vision: the Convolutional Neural Network.
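To make the savings concrete, here is a back-of-the-envelope comparison of attention score entries per layer. The sequence length and window size below are illustrative values, not the paper's exact configuration:

```python
# Rough count of attention score entries per layer, for a
# hypothetical sequence length n = 4096 and window size w = 512.
n = 4096
w = 512

full_attention = n * n   # quadratic: every token attends to every other token
sliding_window = n * w   # linear in n: each token attends to ~w neighbours

print(full_attention)                    # 16777216
print(sliding_window)                    # 2097152
print(full_attention // sliding_window)  # 8x fewer entries
```

Doubling n doubles the sliding-window count but quadruples the full-attention count, which is exactly the gap that widens on long documents.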

Sounds exciting, doesn’t it? Let’s jump straight into the architecture and its working mechanism.

Note: Please make sure you have a conceptual understanding of Transformer architecture before proceeding.


The self-attention in Longformer is a sliding window, which captures context effectively over a stack of layers. If we assume a window length w, each token attends to w/2 tokens on either side of the current token. The complexity of this is O(n·w), which scales linearly with the sequence length n. The overall receptive field of the self-attention at the top of the network is l × w, with l being the number of layers. This window sliding over the sequence is very similar to a convolutional kernel sliding over an image, which is what makes this kind of attention CNN-inspired!
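The sliding window pattern is easiest to see as a boolean attention mask. Here is a minimal NumPy sketch (the sequence length and window size are toy values chosen for readability):

```python
import numpy as np

def sliding_window_mask(n, w):
    """Boolean mask where mask[i, j] is True if token i may attend to
    token j, i.e. j lies within w/2 positions on either side of i."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= w // 2

mask = sliding_window_mask(8, 4)
# Each row has at most w + 1 = 5 True entries (fewer at the edges), so
# the number of attended pairs grows as O(n * w) instead of O(n^2).
print(mask.sum(axis=1))  # [3 4 5 5 5 5 4 3]
```

In practice the masked-out scores would be set to -inf before the softmax, but the mask alone is enough to show the banded structure.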

Comparison of Full self-attention and sliding window attention. Image Credits — Longformer paper

As we can see in the image above, sliding window attention has a limited context in a single layer, but that context grows as we move higher up the network, forming a tree-like hierarchy of receptive fields.

Similarly, there is dilated sliding window attention, which skips consecutive tokens within the window. This makes the attention matrix even sparser and widens the receptive field with no real increase in computational cost (similar to dilated CNNs). There is one more trick in Longformer’s bag: global attention at selected positions for task-specific modelling (as with BERT’s special tokens). The authors apply this global attention to the [CLS] token for classification tasks and to all question tokens for QA tasks.

A token with a global attention attends to all tokens across the sequence, and all tokens in the sequence attend to it.

Dilated sliding window attention and Global + sliding window attention together. Image Credits — Longformer paper
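The dilated window and global attention patterns can also be sketched as boolean masks. A minimal NumPy sketch follows; the sizes, the dilation factor, and the choice of position 0 as the global ([CLS]-style) token are illustrative:

```python
import numpy as np

def dilated_window_mask(n, w, d):
    """Sliding window with dilation d: token i attends to tokens at
    offsets 0, +/-d, +/-2d, ... up to (w//2)*d, skipping d-1 tokens
    between each attended position."""
    idx = np.arange(n)
    offset = idx[None, :] - idx[:, None]
    return (np.abs(offset) <= (w // 2) * d) & (offset % d == 0)

def add_global_attention(mask, global_positions):
    """Tokens at global_positions attend to every token, and every
    token attends to them (symmetric, as in Longformer)."""
    mask = mask.copy()
    mask[global_positions, :] = True
    mask[:, global_positions] = True
    return mask

local = dilated_window_mask(8, 4, d=2)
full = add_global_attention(local, [0])  # a [CLS]-style token at position 0
```

Note how the dilated mask keeps the same number of attended positions per token as the plain window while covering roughly d times more context.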

The authors also introduced the Longformer-Encoder-Decoder (LED), which uses the local + global attention in the encoder stack and full attention in the decoder stack. This makes the architecture suitable for both encoder-only tasks (a majority of NLP tasks) and encoder-decoder tasks such as summarization. LED retains the linear scaling on longer sequences for tasks like summarization, which resulted in improved performance.

The sliding window comes with a major implementation drawback: the banded attention computation it requires is not supported efficiently by existing deep learning frameworks, so a naive implementation loses the parallelism that made transformers attractive over improved RNNs in the first place. The authors therefore implemented a custom CUDA kernel; at the time of writing, there is no native implementation of this banded pattern in either PyTorch or TensorFlow.


For the autoregressive language modelling task, Longformer uses dilated sliding window attention with different window sizes across layers (smaller at the lower layers, larger at the higher layers). The focus was on character-level language modelling (the text8 and enwik8 datasets). Training was staged, with the sequence length growing from 2,048 in the first phase to 23,040 in the final phase, and evaluation used a sequence length of 32,256.

Performance of the smaller version models on the text8 and enwik8 dataset. Longformer achieves the best performance against a host of other transformer models. Image Credits — Longformer paper

The Longformer was trained starting from RoBERTa’s checkpoint, effectively continuing RoBERTa’s pre-training. This also shows that Longformer’s attention pattern is plug-and-play: it can be dropped into any other pre-trained transformer! The position embeddings were extended from RoBERTa’s default 512 to 4,096 by copying the pre-trained embeddings multiple times. Head over to the Ablation Studies and Tasks sections in the paper for more specific details on training and the hyperparameter setup.
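The position-embedding extension is simple enough to sketch. The embedding dimension below is a toy value, and the table is random stand-in data rather than real RoBERTa weights:

```python
import numpy as np

# Stand-in for a RoBERTa-style position embedding table: 512 positions x d_model.
d_model = 16
roberta_pos = np.random.randn(512, d_model)

# Longformer initializes its 4096-position table by copying the 512
# pretrained embeddings eight times, rather than starting from random
# weights -- cheap to do and far better than a random initialization.
longformer_pos = np.tile(roberta_pos, (4096 // 512, 1))

print(longformer_pos.shape)  # (4096, 16)
```

After this copy, the extended table is fine-tuned along with the rest of the model, so the repeated blocks quickly specialize to their new positions.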

The image below summarizes the results of Longformer and compares it against RoBERTa across various tasks and datasets.

Summary of fine-tuning results on QA, co-reference resolution, and document classification. Image Credits — Longformer paper


That was a really good architecture, wasn’t it? Longformer presents an innovative way to overcome the scalability problem of Transformers through a combination of attention patterns. Applying it to custom datasets is made easier by the ability to start from any existing pre-trained transformer’s weights and fine-tune for a few epochs with a scaled-up sequence length. The model achieved SOTA results on character-level language modelling and outperformed RoBERTa on document-level tasks, while the LED variant ensures excellent performance on seq-to-seq tasks too. Try the Longformer on your custom datasets using the Hugging Face implementation, which is very easy to use. Until next time, share your thoughts on the Longformer and try to come up with ideas on how to make it even more efficient.


  1. Longformer: https://arxiv.org/pdf/2004.05150.pdf
  2. Transformer (Attention is All You Need): https://arxiv.org/pdf/1706.03762.pdf
  3. RoBERTa: https://arxiv.org/pdf/1907.11692.pdf
  4. Source Code: https://github.com/allenai/longformer
  5. Hugging Face implementation: https://huggingface.co/transformers/model_doc/longformer.html


Pranav Raikote
