Hello and welcome to yet another article on Transformers. You might be wondering how many more there will be, and the answer is not many, since we have already covered several major architectures derived from the original Transformer. The reason for studying these architectures is that we learn the art of research: how improvements are made, and how the flaws of a particular architecture are mitigated. The Transformer has been a breakthrough architecture that has fared excellently in both NLP and Computer Vision, and learning about such architectures always pays off in the long run. I promise a hands-on exercise in the next article, in which we will pick a dataset, run various architectures and compare their performance. For now, let's get on with the Linformer.
As we all know, the main bottleneck in terms of time and computation is the self-attention mechanism, in which every token attends to all tokens from the previous layer, resulting in O(n²) computation for a sequence of length n. As the sequence length increases, the cost grows quadratically, as explained in the recent papers we have discussed. Using an approximation and projection technique, the Linformer aims to make this cost scale linearly with sequence length rather than quadratically; the quadratic cost means more GPUs running for longer and, eventually, a greater carbon footprint. Let's look at the architecture of the Linformer and understand how it reduces the self-attention complexity from O(n²) to O(n).
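To make the quadratic bottleneck concrete, here is a minimal NumPy sketch of vanilla scaled dot-product attention (single head, no learned projections). The key point is the intermediate score matrix, which is n × n regardless of the model dimension d:

```python
import numpy as np

def self_attention(Q, K, V):
    """Vanilla scaled dot-product attention. The (n, n) score matrix
    is the quadratic bottleneck discussed above."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # shape (n, n): O(n^2) time and memory
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ V                                   # shape (n, d)

n, d = 512, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = self_attention(Q, K, V)
print(out.shape)  # (512, 64); the intermediate score matrix was (512, 512)
```

Doubling n to 1024 quadruples the size of the score matrix, which is exactly the scaling behaviour the Linformer sets out to fix.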
In the diagram above, we can see that the matrices are divided into multiple smaller ones. This technique is somewhat analogous to Singular Value Decomposition (SVD), where we approximate a matrix using lower-rank matrices. Consider the attention matrix P, analyzed across the layers and the attention heads. If we plot the normalized cumulative singular values, we observe that the spectrum is concentrated at the left end, and more so in the higher layers, meaning most of the information in P lies in its largest singular values and P is effectively of low rank.
This shows that the SVD theorem applies here: we can decompose P and reconstruct its values with a lower-rank matrix P(low) without losing much information. But this approximation adds extra computation, because the SVD would have to be applied to each self-attention matrix, so the authors proposed another low-rank approximation that avoids these additional calculations.
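The low-rank observation is easy to verify numerically. The sketch below builds a matrix that is (by construction) near-low-rank plus small noise, standing in for the observed spectrum of P, and checks that a rank-k SVD reconstruction loses almost nothing:

```python
import numpy as np

rng = np.random.default_rng(42)
n, r = 256, 16
# A near-low-rank matrix plus small noise, mimicking the observed
# skewed spectrum of the attention matrix P.
P = rng.standard_normal((n, r)) @ rng.standard_normal((r, n)) \
    + 0.01 * rng.standard_normal((n, n))

# Truncated SVD: keep only the k largest singular values.
U, s, Vt = np.linalg.svd(P, full_matrices=False)
k = 16
P_low = (U[:, :k] * s[:k]) @ Vt[:k, :]   # rank-k reconstruction of P

rel_err = np.linalg.norm(P - P_low) / np.linalg.norm(P)
print(f"relative error of rank-{k} approximation: {rel_err:.4f}")
```

The relative Frobenius error comes out tiny, illustrating why a low-rank replacement for P is viable; the catch, as noted above, is that computing an SVD per attention matrix per layer is itself expensive.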
In the proposed scheme, two linear projection matrices E and F are added when computing the keys and values. The original (n × d)-dimensional key and value layers KW^K and VW^V are projected into (k × d)-dimensional matrices E·KW^K and F·VW^V. The scaled dot-product attention is then computed using the formula below:

P̄ = softmax( (QW^Q)(E·KW^K)^T / √d_k )
where Q, K and V are the query, key and value matrices. On top of this, the context embeddings are computed as P̄·(F·VW^V). The complexity of this operation is O(nk), where k is the projected dimension. If we choose k << n, the memory and compute overhead is significantly reduced.
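The projected attention can be sketched as follows. This is a simplified single-head version: in the paper E and F are learned (k × n) matrices, so the random projections here are only stand-ins, and the per-head weight matrices W^Q, W^K, W^V are omitted for brevity:

```python
import numpy as np

def linformer_attention(Q, K, V, E, F):
    """Linformer-style attention: E and F (k, n) project the keys and
    values down to length k, so the attention map is only (n, k)."""
    d_k = Q.shape[-1]
    K_proj = E @ K                             # (k, d) projected keys
    V_proj = F @ V                             # (k, d) projected values
    scores = Q @ K_proj.T / np.sqrt(d_k)       # (n, k) instead of (n, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V_proj                    # (n, d) context embeddings

n, d, k = 1024, 64, 128
rng = np.random.default_rng(1)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
# Random stand-ins for the learned projection matrices E and F.
E = rng.standard_normal((k, n)) / np.sqrt(k)
F = rng.standard_normal((k, n)) / np.sqrt(k)
out = linformer_attention(Q, K, V, E, F)
print(out.shape)  # (1024, 64), with an (1024, 128) attention map
```

With k fixed, the attention map grows only linearly in n, which is the O(nk) cost stated above.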
This results in roughly constant per-token computation time over longer sequence lengths, whereas the vanilla Transformer slows down considerably as sequences grow.
The Linformer is compared against RoBERTa, which is also derived from the vanilla Transformer. The Linformer's performance is on par with the vanilla Transformer at k = 128, n = 512 and at k = 256, n = 1024. The authors also evaluated the effect of increasing the sequence length with k kept constant at 256. The graph below illustrates that even as n increases, the perplexity of the models converges to roughly the same point, supporting the claim that the new low-rank approximation runs in linear time without hurting quality.
The Linformer shows a linear increase in cost as the sequence length grows from n = 512 to n = 4096 (shown in the results section), by employing the newly proposed low-rank matrix approximation technique that draws inspiration from SVD. Its performance stays on par with the Transformer while drastically improving inference time and memory at scale, making it up to 20x faster and 60x less memory intensive at a sequence length of 65536!
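The scaling claim is simple arithmetic. Holding k = 256 fixed over the same range of sequence lengths as above, the size of the attention map grows linearly for the Linformer but quadratically for full attention:

```python
# Attention-map sizes: full self-attention is n*n, Linformer is n*k.
k = 256
for n in (512, 1024, 2048, 4096):
    full = n * n
    lin = n * k
    print(f"n={n:5d}: full={full:>10,d}  linformer={lin:>9,d}  ratio={full // lin}x")
```

The ratio is simply n/k, so the advantage keeps growing with sequence length, which is consistent with the large speed and memory gains reported at very long sequences.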
I would recommend reading the research paper for the intricate details of the experiments and a few theorems on low-rank approximation. Do try the Linformer (pretrained on longer sequences) on your downstream tasks and comment below on how it performed. Also, think about adding value to the existing research and see whether you can improve this architecture further!
- Transformer (Attention is All You Need): https://arxiv.org/pdf/1706.03762.pdf
- Linformer: https://arxiv.org/pdf/2006.04768.pdf
- Linformer Implementation (PyTorch): https://github.com/lucidrains/linformer