Welcome back to yet another interesting improvement of the Transformer (Attention is All You Need) architecture: the Compressive Transformer. This architecture has a lower memory requirement than the vanilla Transformer and, like the Transformer-XL, models longer sequences efficiently. The image below depicts how the memory is compressed. We can also draw a parallel to the human brain: our memory works so well partly because we compress and store information very intelligently. This sure seems interesting, doesn't it?

Background
When it comes to overcoming a major drawback of Transformers, namely modelling longer sequences efficiently (without a quadratic increase in computation), Transformer-XL showed one way of doing it. The Compressive Transformer arguably does it better: instead of simply discarding memories after a certain window length, as the XL does, it discards non-relevant memories and compresses the relevant blocks into a much coarser memory that can be attended over a much longer time span.
Architecture
To begin with, there are two memory banks in each layer: the regular memory and the compressive (compressed) memory. Consider a window size of ns; as the model processes the sequence, the hidden activations are pushed into the memory via a First-In-First-Out (FIFO) mechanism (we can think of this memory as a cache).
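To make the FIFO mechanism concrete, here is a minimal PyTorch sketch of the push-and-evict step for the regular memory. The names (push_to_memory, n_m) and the (time, batch, hidden) layout are illustrative assumptions, not the paper's code.

```python
import torch

def push_to_memory(memory, new_acts, n_m):
    """FIFO update of the regular memory: append the newest hidden
    activations (detached, since memories are not backpropagated through)
    and return the updated memory plus the evicted oldest block."""
    memory = torch.cat([memory, new_acts.detach()], dim=0)  # (time, batch, d)
    evicted = memory[:-n_m] if memory.size(0) > n_m else memory[:0]
    return memory[-n_m:], evicted
```

In the Transformer-XL the evicted block would simply be dropped; the Compressive Transformer instead hands it to the compression step described next.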
The Compressive Transformer is an extension of the vanilla Transformer in which old memories are mapped to a smaller set of compressed representations.
Unlike the Transformer-XL, where, given a certain sequence length and window size, the older memories are discarded over time, the Compressive Transformer compresses those older memories with a compression function and stores them in a secondary FIFO memory.
The compression follows the formula above: a compression function maps the ns oldest memories, each with hidden activation size d, down to ns/c compressed memories, where c is the compression rate. In the above image, c is 3, ns is also 3, and the memory size nm is set to 6.
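Continuing the sketch above, the evicted block is not thrown away but compressed and appended to the secondary FIFO. Again, update_compressed_memory, n_cm, and compress_fn are hypothetical names used only for illustration.

```python
import torch

def update_compressed_memory(cmem, evicted, n_cm, compress_fn):
    """Compress the evicted old memories by a factor c (ns activations
    become roughly ns // c) and append them to the compressed memory,
    which is itself a bounded FIFO of size n_cm."""
    if evicted.size(0) == 0:
        return cmem
    compressed = compress_fn(evicted)              # (ns // c, batch, d)
    return torch.cat([cmem, compressed], dim=0)[-n_cm:]
```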
Several approaches were tried before settling on a compression function: max pooling, 1D convolutions, dilated convolutions, and a "most-used" scheme in which memories are sorted by their attention values and only the top ones are preserved.
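As one concrete example, a strided 1D convolution can serve as the compression function; the sketch below is an assumption of how it might look in PyTorch, not the authors' implementation. It could be plugged in as the compress_fn from the previous sketch.

```python
import torch
import torch.nn as nn

class ConvCompression(nn.Module):
    """Candidate compression function: a Conv1d with kernel size and
    stride equal to the compression rate c, so every c old memories are
    mixed down into a single compressed memory of the same hidden size d."""
    def __init__(self, d, c):
        super().__init__()
        self.conv = nn.Conv1d(d, d, kernel_size=c, stride=c)

    def forward(self, old_mem):           # old_mem: (ns, batch, d)
        x = old_mem.permute(1, 2, 0)      # -> (batch, d, ns) for Conv1d
        x = self.conv(x)                  # -> (batch, d, ns // c)
        return x.permute(2, 0, 1)         # -> (ns // c, batch, d)
```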
A small neural network was first employed with an auto-encoding loss to reconstruct the original memories from the compressed memory, but this objective attempts to retain all of the information in memory. Instead, the authors used an attention-reconstruction loss, a lossy objective that preserves only what the network actually attends to and lets the rest be discarded. Hence, the compression network optimizes the compression objective while the transformer optimizes the task objective. With this setup, the temporal range of the CT is 2–3 times greater than that of the Transformer-XL at the same computational cost.
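Here is a rough sketch of the attention-reconstruction idea, under the assumption that attn_fn is a handle to the layer's (frozen) attention: the content retrieved from the compressed memories should match what would have been retrieved from the original ones, and gradients are stopped so that only the compression network learns from this loss.

```python
import torch
import torch.nn.functional as F

def attention_reconstruction_loss(h, old_mem, compressed_mem, attn_fn):
    """Lossy auxiliary objective: compare what the current hidden states
    would attend to in the raw old memories versus in their compressed
    counterparts. Detaching h and old_mem stops gradients from this loss
    reaching the main transformer, so it only trains the compression."""
    target = attn_fn(h.detach(), old_mem.detach())
    recon = attn_fn(h.detach(), compressed_mem)
    return F.mse_loss(recon, target)
```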
Results
The model was benchmarked on a newly introduced language-modelling benchmark called PG-19, a dataset built around long-range dependencies (books, novels, etc.). Selected Project Gutenberg books were consolidated into a dataset of more than 25,000 books adding up to 11GB in size. The model was also benchmarked on the enwik8 dataset. The image below illustrates the performance of the Compressive Transformer.

As we can see, the Compressive Transformer fares better than the other architectures. CTs were optimized with the Adam optimizer and a linear warm-up learning-rate schedule. Gradients were also clipped to a norm of 0.1, which helped optimization and led to a better model fit.
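For reference, here is what such a training setup could look like in PyTorch; the model, data_loader, and all hyper-parameter values other than the 0.1 clipping norm are placeholders, not the paper's exact settings.

```python
import torch

def train(model, data_loader, base_lr=3e-4, warmup_steps=4000, clip_norm=0.1):
    """Adam with a linear learning-rate warm-up and gradient-norm clipping.
    `model` is assumed to return the language-modelling loss directly."""
    optimizer = torch.optim.Adam(model.parameters(), lr=base_lr)
    for step, batch in enumerate(data_loader, start=1):
        lr = base_lr * min(1.0, step / warmup_steps)   # linear warm-up
        for group in optimizer.param_groups:
            group["lr"] = lr
        loss = model(batch)
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip_norm)
        optimizer.step()
```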
The authors also examined the network's average attention over the current sequence, the memory, and the compressed memory. The data showed that a meaningful amount of attention still falls on the older activations stored in the compressed memory, the very memories that would otherwise have been discarded. This goes against the practice of simply dropping old memories and strengthens the case that the compressive approach genuinely helps with longer-sequence modelling.
Conclusion
The Compressive Transformer extends the temporal range to 2–3x that of the Transformer-XL at no additional computational cost. A separate network handles the compression objective, complementing the overall performance of the model as a whole. The paper also introduced a new language-modelling benchmark, PG-19. And all of this is not limited to text: the approach has been shown to perform well in other modalities such as speech, vision, and reinforcement learning too, which is outstanding.
So, there is a lot to take away from this architecture and the various experiments conducted. Do read the research paper for the intricate details and experiments. I will see you in the next article in this NLP Tutorials series, where we are now firmly at an intermediate level in our journey to NLP mastery!
References
- Transformer (Attention is All You Need): https://arxiv.org/pdf/1706.03762.pdf
- Compressive Transformer: https://arxiv.org/pdf/1911.05507.pdf
- PyTorch Implementation: https://github.com/lucidrains/compressive-transformer-pytorch
Author
Pranav Raikote