NLP Tutorials — Part 23: Fastformer: Additive Attention Can Be All You Need

Hello and welcome back. In this article we are going to discuss an architecture that drew mixed reactions. Some called it brilliant and others said, “Nah, this ain’t no transformer!” The architecture I’m talking about is the Fastformer, from the paper Fastformer: Additive Attention Can Be All You Need. As we all know by now, Transformers become quite inefficient as they scale up, and we have seen a plethora of architectures that claim to mitigate this in their own ways.

NLP Tutorials — Part 21: Linformer: Self-attention with Linear Complexity 

The Transformer has been a breakthrough architecture that has fared excellently in both NLP and computer vision, and learning about these kinds of architectures is always beneficial in the long run. I promise a hands-on exercise in our next article, in which we will pick a dataset, try various architectures and compare their performance. For now, let’s get on with Linformer.

NLP Tutorials — Part 20: Compressive Transformer

This particular architecture has a lower memory requirement than the vanilla Transformer and is similar to Transformer-XL, which models longer sequences efficiently. The image below depicts how the memory is compressed. We can also draw some parallels to the human brain: we have a brilliant memory because we compress and store information very intelligently. This sure seems interesting, doesn’t it?

NLP Tutorials — Part 19: Longformer: Long Document Transformer

In this article, we will be discussing Longformer, which overcomes one of the famous pitfalls of transformers: the inability to process long sequences, because self-attention scales quadratically with the sequence length. The Longformer is a vanilla transformer with a change in the attention mechanism, which becomes a combination of local self-attention and global attention.
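
To make the "local plus global" idea a little more concrete, here is a minimal NumPy sketch of what such an attention pattern could look like. It is only an illustration, not the Longformer implementation: the function names, the `window` size and the choice of `global_idx` are all assumptions made for this example. Note also that it builds a dense mask, so it shows the pattern but does not deliver the memory savings that a real banded implementation would.

```python
import numpy as np

def longformer_style_mask(seq_len, window, global_idx=()):
    """Boolean mask where mask[i, j] is True if token i may attend to token j.

    `window` and `global_idx` are hypothetical parameters for this sketch.
    """
    positions = np.arange(seq_len)
    # Local attention: each token sees neighbours within window // 2 on each side.
    mask = np.abs(positions[:, None] - positions[None, :]) <= window // 2
    # Global attention: chosen tokens attend everywhere and are seen by everyone.
    for g in global_idx:
        mask[g, :] = True
        mask[:, g] = True
    return mask

def masked_attention(q, k, v, mask):
    """Scaled dot-product attention restricted to the positions allowed by mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)  # disallowed pairs get zero weight
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Usage: 8 tokens, a window of 2, and token 0 treated as a global token.
mask = longformer_style_mask(8, window=2, global_idx=(0,))
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(8, 4)) for _ in range(3))
out = masked_attention(q, k, v, mask)
print(mask.astype(int))
print(out.shape)
```

With a fixed window, the number of allowed pairs in the local band grows linearly with the sequence length instead of quadratically, which is the intuition behind the efficiency gain.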