Hello and welcome to yet another interesting article related to Transformers. You might be wondering how many more there are, and the answer is: not many, since we have already covered a few major improved architectures derived from the Transformer. The reason for learning these architectures is that we learn the art of research and understand how improvements are made, or how the flaws of a particular architecture are mitigated. The Transformer has been a breakthrough architecture that has fared excellently in both NLP and Computer Vision, and learning about these kinds of architectures is always beneficial in the long run. I promise a hands-on exercise in our next article, wherein we will use various architectures, pick a dataset and observe the performance. For now, let's get on with Linformer.

### Background

As we all know, the main bottleneck in terms of time and computation is the self-attention mechanism, in which each token attends to every token from the previous layer, resulting in O(n²) computation for a sequence of length n. As the sequence length increases, the cost grows quadratically, as explained in the recent papers we have discussed. Using an approximation and projection technique, the Linformer aims to make this scaling linear in sequence length rather than quadratic, since quadratic scaling means using many more GPUs for a longer time and eventually leaving a greater carbon footprint. Let's look at the architecture design of the Linformer and understand how it manages to reduce the self-attention complexity from O(n²) to O(n).
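To make the quadratic cost concrete, here is a minimal NumPy sketch of vanilla scaled dot-product attention (the function and variable names are mine, not from the paper). The score matrix is explicitly (n × n), which is exactly the term that blows up as the sequence grows:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def vanilla_attention(Q, K, V):
    """Scaled dot-product attention; materializes an (n, n) score matrix."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)   # (n, n): O(n^2) time and memory
    return softmax(scores) @ V      # (n, d) output

n, d = 512, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = vanilla_attention(Q, K, V)
print(out.shape)  # (512, 64)
```

Doubling n quadruples the size of `scores`, which is why long sequences are so expensive for the vanilla Transformer.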

### Architecture

In the above diagram, we can see the matrices being divided into multiple smaller ones. This technique is somewhat analogous to Singular Value Decomposition (SVD), an approximation method in which we approximate a matrix using lower-rank matrices. Consider the attention matrix P, decomposed across the layers and the attention heads. If we plot the normalized cumulative singular values, we observe that the spectrum is skewed: in the higher layers especially, most of the information is concentrated in the largest singular values, meaning P is effectively of low rank.

[Figure: Spectrum analysis of the self-attention matrix. The Y-axis represents the cumulative normalized singular values of the matrix P and the X-axis represents the eigenvalue index. Image Credits — Linformer paper]

This shows that the SVD theorem applies here: we can approximate the matrix P with a lower-rank matrix P(low) without losing much information. However, this approximation adds extra computation, because SVD would have to be applied to every self-attention matrix. Hence the authors proposed a different low-rank approximation that avoids these additional calculations.
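The low-rank idea can be sketched with a small NumPy example (the matrix here is synthetic, not an actual attention matrix): we build a nearly low-rank matrix, truncate its SVD to the top-k singular values, and check that the reconstruction error is tiny.

```python
import numpy as np

rng = np.random.default_rng(0)

# A deliberately near-low-rank matrix: product of two thin factors plus noise.
n, r = 256, 16
P = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
P += 0.01 * rng.standard_normal((n, n))

# Truncated SVD: keep only the top-k singular triplets.
U, s, Vt = np.linalg.svd(P, full_matrices=False)
k = 16
P_low = (U[:, :k] * s[:k]) @ Vt[:k]  # rank-k approximation of P

rel_err = np.linalg.norm(P - P_low) / np.linalg.norm(P)
print(rel_err)  # small: most of the energy sits in the top singular values
```

The catch, as noted above, is that computing an SVD for every attention matrix at every layer is itself expensive, which motivates learning the projections instead.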

In this proposed scheme, two linear projection matrices E and F are added when computing the keys and values. The original (n × d)-dimensional key and value matrices KW and VW are projected into (k × d)-dimensional matrices E·KW and F·VW. The scaled dot-product attention is then computed using the formula below (reconstructed from the Linformer paper):

P = softmax( (QW) · (E·KW)ᵀ / √d )

where Q, K and V are the query, key and value matrices. On top of this, the context embeddings are computed as P·(F·VW). The complexity of the above operation is O(nk), where k is the projected dimension. If we choose k ≪ n, the memory overhead is significantly reduced.
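The projected attention can be sketched directly in NumPy. This is a simplified single-head version under my own naming; in the paper E and F are learned parameters, whereas here they are random matrices used purely for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def linformer_attention(Q, K, V, E, F):
    """Low-rank attention: E and F project the sequence axis from n to k."""
    d = Q.shape[-1]
    K_proj = E @ K                            # (k, d) projected keys
    V_proj = F @ V                            # (k, d) projected values
    P = softmax(Q @ K_proj.T / np.sqrt(d))    # (n, k): O(nk) instead of O(n^2)
    return P @ V_proj                         # (n, d) context embeddings

n, d, k = 512, 64, 128
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
E, F = (rng.standard_normal((k, n)) / np.sqrt(n) for _ in range(2))
print(linformer_attention(Q, K, V, E, F).shape)  # (512, 64)
```

Note that the intermediate matrix P is (n × k) rather than (n × n), which is exactly where the O(nk) complexity comes from.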

This results in a nearly constant computation speed over longer sequence lengths (for a fixed k), compared to vanilla Transformers, which slow down considerably as sequences grow.
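A quick back-of-the-envelope comparison (my own illustration, with k fixed at 256 as in the paper's experiments) shows how the size of the attention score matrix scales in the two cases:

```python
# Entries stored in the attention score matrix as n grows, with k fixed.
k = 256
for n in (512, 1024, 2048, 4096):
    vanilla = n * n        # quadratic in n
    linformer = n * k      # linear in n
    print(f"n={n}: vanilla={vanilla}, linformer={linformer}, "
          f"ratio={vanilla // linformer}x")
```

The vanilla/Linformer ratio is simply n/k, so every doubling of the sequence length doubles Linformer's advantage.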

### Results

The Linformer is compared against the RoBERTa model, which is also derived from the vanilla Transformer. The Linformer's performance is on par with the vanilla Transformer at k = 128, n = 512 and at k = 256, n = 1024. The authors also evaluated the effect of increasing the sequence length with k kept constant at 256. The graph in the paper illustrates that even as n increases, the model's perplexity converges to roughly the same point, supporting the claim that the new low-rank approximation delivers linear-time complexity without sacrificing quality.