Hello and welcome back to an article where we are going to discuss an architecture that drew mixed reactions. Some called it brilliant, and others said, “Nah, this ain’t no transformer!” The architecture I’m talking about is the Fastformer, from the paper “Fastformer: Additive Attention Can Be All You Need”. As we all know by now, Transformers scale poorly with sequence length, and we have seen a plethora of architectures that claim to mitigate this in their own ways.
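The core idea behind additive attention is that, instead of comparing every pair of tokens, the model scores each token against a single learned query vector and pools the sequence into one summary, which takes linear rather than quadratic time. Here is a minimal NumPy sketch of that pooling step; the weight vector `w`, the shapes, and the scaling are illustrative assumptions, not the paper’s exact parametrization:

```python
import numpy as np

def additive_attention_pool(x, w):
    """Summarize a sequence x of shape (n, d) into one vector of shape (d,).

    Each token gets a scalar score from a learned vector w (d,);
    a softmax over the n scores gives attention weights, and the
    summary is the weighted sum of tokens -- O(n * d), not O(n^2).
    """
    scores = x @ w / np.sqrt(x.shape[-1])   # (n,) one score per token
    alpha = np.exp(scores - scores.max())   # numerically stable softmax
    alpha /= alpha.sum()
    return alpha @ x                        # (d,) global summary vector

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))   # 6 tokens, hidden dim 4
w = rng.normal(size=4)        # hypothetical learned query vector
pooled = additive_attention_pool(x, w)
print(pooled.shape)  # (4,)
```

Because the softmax weights are positive and sum to one, the summary is a convex combination of the token vectors, which is exactly why the whole thing stays linear in sequence length.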
As promised, we are doing a hands-on exercise in this blog and taking a much-needed break from studying architectures! In this article, we shall classify text, specifically IMDB movie reviews, using a pre-trained Transformer.
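Before diving in, here is roughly what the end result looks like. This is a hedged sketch, assuming the Hugging Face `transformers` library is installed and can download a default sentiment-analysis checkpoint; any model fine-tuned on IMDB-style reviews would work just as well:

```python
from transformers import pipeline

# Assumption: `transformers` is installed and the default
# sentiment-analysis checkpoint can be downloaded.
classifier = pipeline("sentiment-analysis")

reviews = [
    "A masterpiece -- the pacing and performances are flawless.",
    "Two hours of my life I will never get back.",
]
results = classifier(reviews)
for review, result in zip(reviews, results):
    print(result["label"], round(result["score"], 3), "-", review[:40])
```

The pipeline handles tokenization, batching, and the forward pass for us; in the exercise we will unpack those steps rather than treating them as a black box.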
The Transformer has been a breakthrough architecture that has excelled in both NLP and Computer Vision, and learning about architectures like these is always beneficial in the long run. I promise a hands-on exercise in our next article, wherein we will pick a dataset, try out various architectures, and compare their performance. For now, let’s get on with Linformer.
This particular architecture has a lower memory requirement than the vanilla Transformer and, like Transformer-XL, models longer sequences efficiently. The image below depicts how the memory is compressed. We can also draw a parallel to the human brain: we remember so much because we compress and store information very intelligently. This sure seems interesting, doesn’t it?
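One simple way to picture the compression step: once past activations fall out of the recent window, groups of adjacent memory vectors can be squashed into a single summary vector. Mean pooling is just one of several possible compression functions, and the shapes and rate below are illustrative assumptions, not the architecture’s actual configuration:

```python
import numpy as np

def compress_memory(mem, rate):
    """Compress a memory of past activations (n, d) by mean-pooling
    every `rate` consecutive vectors into one, giving ~n/rate summaries.

    Mean pooling is only one illustrative choice of compression function.
    """
    n, d = mem.shape
    usable = (n // rate) * rate                  # drop a ragged tail, if any
    return mem[:usable].reshape(-1, rate, d).mean(axis=1)

mem = np.arange(12, dtype=float).reshape(6, 2)   # 6 past states, dim 2
compressed = compress_memory(mem, rate=3)
print(compressed.shape)  # (2, 2)
```

Compressing at rate 3 here turns six past states into two summaries, which is the intuition behind storing long histories cheaply.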
In this article, we will discuss the Longformer, which overcomes one of the famous pitfalls of Transformers: their inability to process long sequences, since self-attention scales quadratically with sequence length. The Longformer is a vanilla Transformer with a modified attention mechanism that combines windowed local self-attention with task-specific global attention.
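The combined pattern is easy to visualize as a boolean attention mask: each token attends to its neighbours within a fixed window, while a few designated tokens attend globally in both directions. A small NumPy illustration follows; the window size and global positions are made-up parameters for the sketch, not the paper’s defaults:

```python
import numpy as np

def longformer_mask(n, window, global_idx):
    """Boolean (n, n) mask: True where attention is allowed.

    Local part: each token sees neighbours within `window` steps.
    Global part: tokens in `global_idx` see, and are seen by, everyone.
    Allowed entries grow as O(n * window) rather than O(n^2).
    """
    i = np.arange(n)
    mask = np.abs(i[:, None] - i[None, :]) <= window  # sliding window
    mask[global_idx, :] = True   # global tokens attend everywhere
    mask[:, global_idx] = True   # everyone attends to global tokens
    return mask

mask = longformer_mask(n=8, window=1, global_idx=[0])
print(mask.astype(int))
```

In practice such a mask is applied by setting disallowed attention scores to a large negative value before the softmax, so those pairs receive effectively zero weight.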