NLP Tutorials — Part 11: Transformers

Hello all and welcome back to yet another interesting concept, one that has time and again proven to be among the best methods for solving major NLP problems, with State-of-the-Art accuracy that approaches human performance on some tasks! That architecture is known as the “Transformer”. The important gain of the Transformer was to enable parallelization, which wasn’t on offer in the previous model we saw, “Seq2Seq”. In this blog, we shall navigate through the Transformer architecture in detail and understand why it is the breakthrough architecture of recent years.

Background

The base for Transformers is the all-important Attention mechanism, a concept that enables the neural network to learn which words or phrases to pay attention to and to give more weight to those. This makes it possible to model longer sequences with parallelization. Here are a few really good articles to get an overview of the Self-Attention or Attention mechanism:

Once you are ready with the Attention mechanism, we shall delve into the architecture of Transformers.

Note: Make sure you have read and thoroughly understood the attention mechanism before proceeding further.

Architecture

The Transformer architecture contains a set of Encoder and Decoder blocks. Each block is a stack of layers: self-attention layers, encoder-decoder attention layers (in the decoder only) and feed-forward layers, which together capture and learn the structure of the input. Residual connections followed by Layer Normalization are used around each sub-layer within an encoder or decoder block. More on Layer Normalization in this article. The diagram illustrated below depicts a very high level architecture design of a Transformer.

Let’s get into the Encoder block first and understand the various components of the encoder step by step.

Encoder

First things first, we have the inputs in the form of sequences: phrases, sentences or paragraphs. The input sentence is split into tokens, and each token is mapped to its index in the vocabulary (dictionary) built from the corpus. Each index is then looked up in an embedding matrix to get a vector. The embeddings capture linguistic features of the tokens, so the sentence is represented in vector form while retaining its semantics. These embeddings are learned during the training of the Transformer, and words that appear in similar contexts end up with similar embeddings. For example, Sports and Exercise will have somewhat similar embeddings since the two are related. The learned embeddings are the ones used at inference time. Every token embedding has the same fixed dimension d, so a sequence of n tokens becomes an n x d matrix. Typical values of d are 512, 768, 1024 and so on.
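As a rough illustration of the index lookup described above, here is a minimal NumPy sketch. The toy vocabulary, the whitespace tokenizer, the `embed` helper and the random embedding table are all assumptions made for the example; a real model learns the table during training and uses a proper tokenizer.

```python
import numpy as np

# Hypothetical toy vocabulary; a real one has tens of thousands of entries.
vocab = {"<pad>": 0, "you": 1, "are": 2, "playing": 3, "fifa": 4}
d_model = 8  # embedding dimension d (512 in the original paper)

rng = np.random.default_rng(0)
# One d_model-sized row per vocabulary entry. Random here, learned in practice.
embedding_table = rng.normal(size=(len(vocab), d_model))

def embed(sentence):
    """Map a whitespace-tokenized sentence to a (seq_len, d_model) matrix."""
    token_ids = [vocab[word] for word in sentence.lower().split()]
    return embedding_table[token_ids]

x = embed("You are playing FIFA")
print(x.shape)  # (4, 8)
```

Each row of `x` is the embedding of one token, so a sentence of 4 tokens becomes a 4 x d matrix.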

The next step is Positional Encoding, which is a clever and sophisticated aspect of the Transformer. Position of words in a sentence is very important and a change of position of a single word can alter the meaning of the sentence.

Sentence 1: Even though she did not win the award, she was satisfied.

Sentence 2: Even though she did win the award, she was not satisfied.

In the above sentences, only the position of the word “not” is changed and the rest is identical, but still we see a drastic change in the meaning. In traditional RNNs and LSTMs, the input was fed in sequentially, token by token, which retained the token/word position but prevented parallelization across the tokens of a sequence. That’s not the case in Transformers. You might be wondering: without feeding the tokens in sequentially, how will the network know the order of the tokens in a sentence? The answer is the Positional Encoding. This is basically a method of attaching a position to each input embedding, indicating where that embedding sits in the sentence.

The authors employed a sine-cosine based formula with different frequencies, whose values are added to each word embedding. The formula is constructed so that every position gets a distinct encoding vector.

Positional Embedding formula. Image Credits — Attention is All You Need paper
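Written out, the formula from the paper takes the following form. This is a LaTeX transcription, with d standing for the embedding dimension (called d_model in the paper):

```latex
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)
```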

Here, pos is the position of the word in the sentence, d is the input embedding dimension and i indexes the pairs of embedding dimensions (dimensions 2i and 2i+1 share one frequency). d is fixed, whereas pos and i vary. Since the sin and cos functions are periodic, a single frequency would produce the same value for multiple positions; the way to counter this is to use a different frequency for each value of i. Taken together across all dimensions, the positional encoding of each token position is unique.

For every odd index of the position vector the cosine function is used, and for every even index the sine function is used. Illustrated below is a rough sketch of positional embeddings for a given sentence. The vector representations for each token are unique and of fixed length (in this image it’s 128, but in practice the number is usually 512 or 768).
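The sine/cosine scheme can be sketched in a few lines of NumPy. This is a minimal version that assumes an even embedding dimension; the function name `positional_encoding` is chosen for the example.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) matrix of sinusoidal position encodings.

    Assumes d_model is even.
    """
    positions = np.arange(seq_len)[:, None]           # (seq_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]         # values 2i: 0, 2, 4, ...
    angles = positions / np.power(10000.0, two_i / d_model)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions use cosine
    return pe

pe = positional_encoding(seq_len=50, d_model=128)
print(pe.shape)  # (50, 128)
```

The resulting matrix is simply added to the input embedding matrix of the same shape.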

Positional Embedding Representation. Image Credits — Taken from Tamoghna Saha’s lecture

So, now we have the final input to the encoder block: Input Embedding + Positional Encoding, which we call the Embedding with Time Signal.

Next up is the Multi-Head Self Attention mechanism. What is self-attention, you might ask? Self-attention is a mechanism that relates each word in the input to every other word in that sentence. It allows the model to learn the hidden connections and meaning among the words in a sentence.

Multi-head self-attention is often explained by analogy with a search on a site such as YouTube, where a query is matched against keys and the values belonging to the best-matching keys are retrieved. Three weight matrices are initialized, one each for Query (Q), Key (K) and Value (V). The final input embeddings are multiplied by these matrices to obtain the query, key and value vectors of every token. The next step is to compare every query with every key and measure how similar they are. This is done with a dot product between the query and key vectors (a single matrix multiplication of Q with the transpose of K handles all pairs at once). The resulting matrix is called the Score Matrix. We are comparing each word with the rest of the words and generating scores which indicate how important the other words are with respect to the candidate word.

Once we have this score, it is divided by the square root of the dimension of the key vectors (to stabilize the gradients and keep training smooth). This value is then passed through a softmax function to normalize the scores. Finally, the softmax output is multiplied with the Value vectors, which gives the final self-attention output. Expressed mathematically, these steps are Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, where d_k is the dimension of the key vectors.
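The steps of a single attention head can be sketched in NumPy as follows. The projection matrices `w_q`, `w_k` and `w_v` are random stand-ins for learned weights, and the function names are chosen for this example only.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V for one head; inputs are (seq_len, d_k)."""
    d_k = k.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)     # score matrix, (seq_len, seq_len)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8
x = rng.normal(size=(seq_len, d_model))  # embeddings + positional encoding
# Random stand-ins for the learned Q, K and V projection matrices.
w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

output, weights = scaled_dot_product_attention(x @ w_q, x @ w_k, x @ w_v)
print(output.shape, weights.shape)  # (4, 8) (4, 4)
```

Row r of `weights` holds the attention that token r pays to every token in the sentence.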

For a sentence such as “You are playing FIFA”, after training the score at the intersection of “you” and “playing” will be higher than the score between “you” and any other word in that sentence. This is the magic of the self-attention mechanism, and when extended with multiple heads it becomes much more robust and powerful at modelling the data. In the picture below, see the scores for the word pairs which are related to each other in context (they have higher attention scores).

Attention score matrices before and after training. Image by Author.

This is one attention head. To increase the capability of the model, the authors of the paper employed 8 attention heads (Multi-head Self Attention) and then concatenated the outputs.

Multi-head attention. Image Credits — Attention is All You Need paper

The advantage is that the model can give more emphasis to other words too; with a single head, the attention of a word tends to be dominated by the word itself. Each head can learn something different, and the results are then aggregated. Each of the 8 heads has its own set of Q, K and V weight matrices, so the model learns 8 different projections of the input. The 8 head outputs are concatenated and multiplied with another weight matrix to bring the result back to the embedding dimension. After the multi-head attention come the Residual Connections and Layer Normalization. The residual connections help keep the gradients from vanishing, and layer normalization stabilizes and speeds up training.
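Splitting into heads, attending per head, concatenating and projecting can be sketched as below. This is a simplified NumPy version that slices one large projection into heads (a common implementation choice, assumed here); all weights are random stand-ins and `multi_head_self_attention` is a name chosen for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """x: (seq_len, d_model); all weight matrices: (d_model, d_model)."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads  # assumes d_model is divisible by num_heads

    def split_heads(m):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return m.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    heads = softmax(scores, axis=-1) @ v                 # (heads, seq, d_head)
    # Concatenate the heads back to (seq_len, d_model), then project.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 4, 64, 8
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))

out = multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(out.shape)  # (4, 64)
```

The output has the same shape as the input, which is what allows the residual connection to add the two together.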

The attention outputs are passed on to a position-wise Feed-forward layer: two linear transformations with a ReLU activation in between, applied to every position independently. This layer transforms the attention output into the form expected by the next encoder block.
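A minimal sketch of the feed-forward sub-layer wrapped in a residual connection and layer normalization is shown below. The weights are random stand-ins, the sizes are toy values, and the learned gain and bias of layer normalization are left out for brevity.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position to zero mean and unit variance (no learned gain/bias)."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def feed_forward(x, w1, b1, w2, b2):
    """Two linear layers with a ReLU in between, applied to every position."""
    hidden = np.maximum(0.0, x @ w1 + b1)  # ReLU
    return hidden @ w2 + b2

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 4, 8, 32  # toy sizes; the paper uses 512 and 2048
attn_out = rng.normal(size=(seq_len, d_model))  # stand-in for the attention output
w1, b1 = rng.normal(size=(d_model, d_ff)) * 0.1, np.zeros(d_ff)
w2, b2 = rng.normal(size=(d_ff, d_model)) * 0.1, np.zeros(d_model)

# Residual connection (input + sub-layer output) followed by layer normalization.
y = layer_norm(attn_out + feed_forward(attn_out, w1, b1, w2, b2))
print(y.shape)  # (4, 8)
```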

All this put together forms one encoder block, and we can stack N encoder blocks (N = 6 in the original paper) to increase the performance of the model.

Decoder

The decoder input is the target (output) sentence with the special tokens <start> and <end> added at the start and end respectively. During training, the translated sentence is fed in shifted right by one token, so that the model, working auto-regressively, attends only to the previous tokens when generating the next token. The positional encoding is the same as what we saw in the encoder block.

Coming to the Masked Multi-Head Attention, which is the first Attention sub-block: it is very similar to the normal attention but with a mask. This mask prevents the model from looking ahead at future tokens during training. At a given time step, the candidate token should be able to attend to itself and all the previous words only, not the future words.

Masking of the upper triangle values in the matrix. Image by Author.

As we see in the above matrix, the upper triangle values are masked. To run through an example, consider the second row: the word under consideration is “are”, and it can attend to itself and the previous word “You”. Similarly, in the third row, the word “playing” can attend to “playing”, “are” and “You”, but not to the next word “FIFA”. This is a simplified explanation (the special <start> and <end> tokens are not represented) of masking the attention scores so that the model can attend to previous tokens and try to generate the next word.

The attention mechanism is the same as Encoder except an extra mask operation. Image Credits — Attention is All You Need paper

The X values depicted in the above image are filled with negative infinity (Look-ahead mask). When we apply the softmax function, the upper triangle values which are filled with negative infinity become 0s, and this is how the masking is implemented. Again, this masked self-attention has multiple heads whose outputs are concatenated and multiplied with a weight matrix, similar to the encoder.
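The look-ahead mask can be reproduced with a few lines of NumPy. The score matrix here is random and only stands in for the real query-key scores of the four example tokens.

```python
import numpy as np

def softmax(x, axis=-1):
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

tokens = ["You", "are", "playing", "FIFA"]
n = len(tokens)
rng = np.random.default_rng(0)
scores = rng.normal(size=(n, n))  # stand-in for Q K^T / sqrt(d_k)

# True strictly above the diagonal marks the future positions.
look_ahead = np.triu(np.ones((n, n), dtype=bool), k=1)
masked_scores = np.where(look_ahead, -np.inf, scores)

# exp(-inf) is 0, so the masked positions get zero attention weight.
weights = softmax(masked_scores, axis=-1)
print(np.round(weights, 2))
```

The first row puts all of its weight on “You”, and every row still sums to 1 because the softmax is taken over the unmasked positions only.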

These outputs are passed on to the next Attention sub-block, the Encoder-Decoder Attention, which takes input from the Encoder block (the latent representation vectors) and from the previously discussed Masked Multi-Head Attention. The queries come from the masked attention output of the decoder, while the keys and values come from the encoder output. In this way the model decides which parts of the encoder output are relevant to focus on for each decoder position, and the decoder is able to predict the next word by looking at the encoder output while self-attending to its own previous output. This is where it all comes together and input and output are matched for relevance.
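A single-head sketch of encoder-decoder attention is given below, assuming random stand-in weights and toy sizes; `cross_attention` is a name chosen for the example. Note that no look-ahead mask is needed here, because the whole source sentence is already known.

```python
import numpy as np

def softmax(x, axis=-1):
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def cross_attention(decoder_states, encoder_states, w_q, w_k, w_v):
    """Queries come from the decoder; keys and values come from the encoder."""
    q = decoder_states @ w_q
    k = encoder_states @ w_k
    v = encoder_states @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])  # (tgt_len, src_len)
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

rng = np.random.default_rng(0)
src_len, tgt_len, d_model = 5, 3, 8
encoder_states = rng.normal(size=(src_len, d_model))  # encoder output
decoder_states = rng.normal(size=(tgt_len, d_model))  # masked attention output
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

context, weights = cross_attention(decoder_states, encoder_states, w_q, w_k, w_v)
print(context.shape, weights.shape)  # (3, 8) (3, 5)
```

Each row of `weights` shows how much one decoder position attends to each of the source tokens.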

Next steps of Residual connections, Layer Normalization and the Feed-forward layer are exactly the same as the Encoder block.

Coming to the last parts of the Transformer architecture, we have a Linear layer followed by a softmax layer. The number of output logits is equal to the number of unique words in the vocabulary built from the training data. If we have 30000 words in the vocabulary, we will have 30000 output neurons. The softmax layer outputs the probability of each word being the next word. The index of the neuron with the highest probability is looked up in the vocabulary, and that gives us the generated word.
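The final projection and lookup can be sketched as follows, with a hypothetical six-word vocabulary and random weights. Picking the highest-probability word (greedy decoding) is the simplest choice and is the one described above.

```python
import numpy as np

def softmax(x):
    exp = np.exp(x - x.max())
    return exp / exp.sum()

# Hypothetical toy vocabulary; index -> word.
vocab = ["<start>", "<end>", "you", "are", "playing", "fifa"]
d_model = 8

rng = np.random.default_rng(0)
decoder_state = rng.normal(size=d_model)        # decoder output at the last position
w_out = rng.normal(size=(d_model, len(vocab)))  # linear layer: one column per word
b_out = np.zeros(len(vocab))

logits = decoder_state @ w_out + b_out  # one logit per vocabulary entry
probs = softmax(logits)                 # probability of each word being next
next_word = vocab[int(np.argmax(probs))]
print(next_word)
```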

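The model’s loss function is a simple Categorical Cross Entropy (the authors also applied label smoothing), optimized with the Adam optimizer. The authors trained the model with a learning rate that increases linearly during a warmup phase and then decays in proportion to the inverse square root of the step number. For more details on training please have a look at the Attention is All You Need paper (Link to the paper in References section below).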
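The learning rate schedule from the paper can be written as a small function. The defaults below (d_model of 512 and 4000 warmup steps) are the values reported in the paper; the function name is chosen for this sketch.

```python
def transformer_learning_rate(step, d_model=512, warmup_steps=4000):
    """Linear warmup followed by inverse-square-root decay."""
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for step in (100, 4000, 100000):
    print(step, transformer_learning_rate(step))
```

The rate rises until `warmup_steps` is reached, peaks there, and falls off slowly afterwards.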

Advantages and Disadvantages

Advantages

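Processes all tokens of a sequence in parallel, so training makes good use of GPUs and can be distributed across multiple GPUs
Outperforms earlier recurrent architectures such as LSTMs, Bi-LSTMs and GRUs on many NLP tasks
Retains the contextual information very well, even across long distances in a sequence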

Disadvantages

The computational power required is very high, as a lot of calculations happen across the network
The computation of self-attention grows quadratically with the sequence length, which makes it hard to scale the network to longer sequences

Conclusion

The Transformer is a complex architecture filled with many concepts, sub-models and stacked layers, which might be overwhelming at first, but give the article a read more than twice and I’m sure you will understand the magic Transformers are able to do! This architecture gave rise to many phenomenal neural nets which have improved over time since the original research was published in 2017. Even today tremendous work is happening in this segment to improve and build on the Transformer paradigm. I hope you were able to absorb all the concepts and understand this architecture, as we are going big in our future articles: GPT, BERT and BigBird, which are significant architectures in the NLP world.

Don’t forget to have a look at the resources in the References section (I have collected some cool things for you to explore, learn, and also train a Transformer!).

References

Transformers (Attention is All You Need): https://arxiv.org/pdf/1706.03762.pdf
Attention is All You Need talk by Lukasz Kaiser: https://www.youtube.com/watch?v=rBCqOTEfxvg
Tensor2Tensor library: https://ai.googleblog.com/2017/06/accelerating-deep-learning-research.html
Language Translation using Transformers: https://www.tensorflow.org/text/tutorials/transformer

Author

Pranav Raikote


Published by Nihal Kashinath
