Welcome back to yet another interesting article in our NLP Tutorials series. In this article we will be talking about Transformer-XL which outperformed the Vanilla Transformer (Attention is All You Need) in accuracy metrics and handling long-term context dependencies which we often see in real world tasks.
Transformer-XL was able to learn dependency 80% longer than RNNs and 450% longer than Vanilla Transformer.
You heard it right, a whopping 450%! Transformer-XL is also up to 1,874 times faster than the Vanilla Transformer during evaluation. These are bold claims. Let's dig deep into the architecture and understand the mechanism by which it achieves SOTA performance.
The Vanilla Transformer was a breakthrough and a very fundamental architecture which became the base for all newer SOTA architectures. The authors of Transformer-XL touch upon a major drawback of the Vanilla Transformer: it can't handle long-term dependency. This is mainly due to its design, which processes text in fixed-length segments/blocks with no flow of information across them. This design also causes the Context Fragmentation problem: the model struggles to predict the first symbols/tokens of a segment because the context was chopped without respect to sentence or semantic boundaries. Transformer-XL addresses both problems.
These issues were overcome by two main concepts called Recurrence mechanism and a novel Relative Positional Encoding. Let’s look into these concepts in detail now.
To address the limitations of fixed-length context, the authors introduced a recurrence-like mechanism involving a cache which holds the previous segment's hidden-state sequence. Note that the segment length is still fixed; the novelty lies in the introduction and usage of the cache concept (similar to RNNs). This allows the network to look into the cache and get context from previous segments, and hence overcome the long-term context dependency and context fragmentation problems.
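The cached-segment idea can be sketched as a single attention layer in which queries come only from the current segment, while keys and values range over the concatenation of the cached previous segment and the current one. This is a minimal numpy sketch of that idea only; the real model uses multi-head attention, a causal mask, stops gradients through the cache, and combines this with relative positional encodings (all omitted here for brevity, and the weight names are my own):

```python
import numpy as np

def attend_with_memory(h_prev, h_curr, W_q, W_k, W_v):
    """One attention layer with segment-level recurrence (sketch).

    h_prev : cached hidden states of the previous segment (the "memory";
             treated as a constant, i.e. no gradient flows into it).
    h_curr : hidden states of the current segment.
    """
    # Keys/values see [memory; current]; queries come only from current.
    h_ext = np.concatenate([h_prev, h_curr], axis=0)   # (m + n, d)
    q = h_curr @ W_q                                   # (n, d)
    k = h_ext @ W_k                                    # (m + n, d)
    v = h_ext @ W_v                                    # (m + n, d)
    scores = q @ k.T / np.sqrt(k.shape[1])             # (n, m + n)
    # Softmax over the extended context (causal mask omitted).
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v                                 # (n, d)

d, m, n = 8, 4, 4
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
mem = rng.normal(size=(m, d))   # cached previous-segment states
seg = rng.normal(size=(n, d))   # current segment
out = attend_with_memory(mem, seg, W_q, W_k, W_v)
print(out.shape)  # (4, 4) tokens attended over 8 positions -> (4, 8) output
```

Each of the 4 current tokens attends over all 8 positions (4 cached + 4 current), which is exactly how the effective context grows beyond a single segment.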
With this recurrence mechanism applied to every two consecutive segments of a corpus, it essentially creates a segment-level recurrence in the hidden states. As a result, the effective context being utilized can go way beyond just two segments.
It is a simple and effective solution (caching a sequence of hidden states instead of just the previous one), and we get the best of both worlds of RNNs and Transformers. There is also a speed advantage: within each segment, the network processes all tokens at once (while reusing the previous segments' context) rather than token by token. During inference, retrieving the pre-computed, cached segment representations is much faster than re-attending to the previous tokens one by one as the Vanilla Transformer does (up to 1,874 times faster).
We can see in the above illustration how in the training phase in any given segment the network is able to attend to previous segments. The recurrent hidden state context dependencies are shown by the green lines. In the evaluation phase, we can see how the context is extended.
This poses a problem, however, because positional encodings get duplicated when the network handles them segment-wise. The first token of the first segment and the first token of the second segment would receive the same positional encoding, which would be disastrous for the model's sense of order. This is also known as Temporal Confusion. Let's now look into the second novelty: Relative Positional Encoding.
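The duplication is easy to see concretely. A minimal sketch, assuming the standard sinusoidal encoding from the original Transformer paper, computed per segment:

```python
import numpy as np

def sinusoidal_encoding(seg_len, d):
    """Standard sinusoidal positional encoding for positions 0..seg_len-1."""
    pos = np.arange(seg_len)[:, None]          # (seg_len, 1)
    i = np.arange(d // 2)[None, :]             # (1, d/2)
    angles = pos / (10000 ** (2 * i / d))
    enc = np.zeros((seg_len, d))
    enc[:, 0::2] = np.sin(angles)
    enc[:, 1::2] = np.cos(angles)
    return enc

# Encodings computed independently for two consecutive segments:
seg1 = sinusoidal_encoding(4, 8)   # segment 1, positions 0..3
seg2 = sinusoidal_encoding(4, 8)   # segment 2, positions 0..3 again!
print(np.allclose(seg1, seg2))     # True
```

Both segments get identical encodings, so when the model attends across the cached segment boundary it has no way to tell a token at position 0 of segment 1 apart from a token at position 0 of segment 2.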
Relative Positional Encoding
To solve Temporal Confusion, the authors employed a relative positional encoding technique which is based on relative distance between tokens and not their absolute position. Also, this positional encoding is part of each attention layer as opposed to having it only before the first attention layer.
Positional encoding acts like an injected bias which tells the network where to attend. In relative positional encoding, this bias (the positional information bias) is added to the attention score at each layer. When the network iterates over a segment, knowing only the relativity of the tokens (the relative distance between each key vector and the query) is sufficient; no unique absolute encoding is required. This is achieved by computing the relative distance between each key vector and the query vector and adding it to the attention score. The query vector can then distinguish the representations of tokens at different offsets, which makes the state-reuse and caching mechanism work. We don't lose temporal information either, since absolute positions can be recovered recursively from the relative distances.
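In the paper, this works out to an attention score with four terms: content-content, content-position, a global content bias, and a global position bias, where the absolute-position query terms are replaced by two learnable vectors (called u and v in the paper). A minimal per-pair sketch of that decomposition, with randomly initialized stand-ins for the learned parameters:

```python
import numpy as np

def rel_attn_score(q_i, k_j, r_rel, W_kR, u, v):
    """Transformer-XL relative attention score between query i and key j (sketch).

    r_rel : sinusoidal encoding of the relative distance i - j
    W_kR  : projection applied to the relative positional encoding
    u, v  : learnable global biases replacing absolute-position query terms
    """
    pos_key = W_kR @ r_rel
    term_a = q_i @ k_j      # (a) content-content
    term_b = q_i @ pos_key  # (b) content-dependent positional bias
    term_c = u @ k_j        # (c) global content bias
    term_d = v @ pos_key    # (d) global positional bias
    return term_a + term_b + term_c + term_d

d_model = 8
rng = np.random.default_rng(1)
q_i, k_j, r_rel, u, v = (rng.normal(size=d_model) for _ in range(5))
W_kR = rng.normal(size=(d_model, d_model))
score = rel_attn_score(q_i, k_j, r_rel, W_kR, u, v)
```

Because the score depends on `r_rel` (the distance i - j) rather than on the absolute positions of i and j, the same cached key from a previous segment produces a sensible score no matter which segment the query lives in.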
As the authors put it: "Our formulation uses fixed embeddings with learnable transformations instead of learnable embeddings, and thus is more generalizable to longer sequences at test time."
The performance of Transformer-XL was compared against a plethora of models, and the results are shown below.
As we can see, Transformer-XL outperforms every other architecture on almost every dataset and task. Do read the paper for more insight into the results, and don't miss the Ablation Study section (it is particularly interesting).
Transformer-XL gives SOTA results for language modelling on many datasets, with two novel introductions that overcome the problems faced by the Vanilla Transformer. That was an exciting architecture to study, wasn't it? We are now progressing from the Beginner to the Intermediate NLP stage, having understood a host of architectures from RNNs to Transformer-XL. It doesn't stop here: there are really good architectures built on top of Transformers, either mitigating their disadvantages or extending them to achieve SOTA scores. Stay tuned for our next article, where we will consider yet another brilliant architecture in our quest for NLP supremacy.
- Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context: https://arxiv.org/pdf/1901.02860.pdf
- Original Implementation of Transformer-XL: https://github.com/kimiyoung/transformer-xl/