A warm welcome to another article in the NLP Tutorials series. In this article we will try to understand the Encoder-Decoder architecture, which forms the basis for later advances such as the Attention mechanism, Transformers, GPT, and BERT. It is widely used in machine translation tasks. The encoder encodes the input into a fixed-length internal representation, which the decoder then uses to output words in another form or language. Nowadays we also see multilingual translation performed by a single model, e.g. text translation from English to French, Spanish, and German using one model! Since the input and output are both sequences (here, text), this architecture is popularly known as Seq2Seq.
The Encoder-Decoder blocks are built from RNNs (vanilla RNN, LSTM, or GRU) which are connected sequentially as shown below. In this article, we shall assume LSTMs are the underlying architecture within the Encoder and Decoder blocks.
Sequence tasks differ in the shape of their inputs and outputs:
- One-to-One: fixed size input and fixed size output (Image Classification)
- One-to-Many: single input and sequence output (Image Captioning)
- Many-to-One: sequence input and single output (Sentiment Analysis)
- Many-to-Many: sequence input and sequence output (Machine Translation)
- Many-to-Many: synced sequence input and output (Video Classification)
Seq2Seq was originally developed for Machine Translation, so we will use that as the problem to solve. Machine Translation means converting a sequence in one language into a sequence in another, for example translating an English sentence into German. The output is generated one token at a time, but it is not a word-for-word substitution, since the source and target sentences can differ in length and word order. Now that we have context about the task, let’s dive into the architecture and working principles of the Encoder and Decoder blocks.
The encoder block is an LSTM which reads the input sequence token by token and converts it into a vector form (internal hidden representation). The per-step outputs (the lines coming out of the top of each LSTM cell) are discarded; only the final hidden and cell states are collected at the end of the sequence. These final states form the context vector, the hidden representation of the input, and that is what matters to us. The input to the encoder is the source text of the translation. In our example, it is an English sentence.
The decoder block, which is also an LSTM, takes the final hidden and cell states from the encoder as its starting point (the initial states of the decoder are the final states of the encoder). During training, the input to the decoder is the reference translation, i.e. the German sentence which is the real-world translation of the English sentence.
At each time-step the decoder outputs one token of the translation: the first token at time-step t1, the second at time-step t2, and so on. One thing to observe is that the output of t1 is provided as the input to t2; in general, the output of each time-step is fed as the input to the next time-step. Decoding stops once the decoder produces a special end-of-sequence token such as <EOS> or <END>. In the next section we will discuss the overall working principle, training, and testing of Seq2Seq, which will explain further how the Encoder and Decoder work in tandem.
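To make the idea of "discard the per-step outputs, keep the final states" concrete, here is a minimal sketch of an encoder in plain Python. It is not the implementation from the original paper: the helper names (`lstm_step`, `init_params`, `encode`), the random untrained weights, and the tiny sizes are all assumptions chosen only for illustration.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM time step. W, U, b hold the parameters of the four gates
    (input i, forget f, output o, candidate g), keyed by gate name."""
    def gate(name, act):
        return [
            act(sum(W[name][j][k] * x[k] for k in range(len(x)))
                + sum(U[name][j][k] * h[k] for k in range(len(h)))
                + b[name][j])
            for j in range(len(h))
        ]
    i, f, o = gate("i", sigmoid), gate("f", sigmoid), gate("o", sigmoid)
    g = gate("g", math.tanh)
    c_new = [f[j] * c[j] + i[j] * g[j] for j in range(len(h))]
    h_new = [o[j] * math.tanh(c_new[j]) for j in range(len(h))]
    return h_new, c_new

def init_params(input_size, hidden_size, seed=0):
    """Random, untrained weights (hypothetical values, for illustration only)."""
    rng = random.Random(seed)
    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    W = {n: mat(hidden_size, input_size) for n in "ifog"}
    U = {n: mat(hidden_size, hidden_size) for n in "ifog"}
    b = {n: [0.0] * hidden_size for n in "ifog"}
    return W, U, b

def encode(inputs, params, hidden_size):
    """Run the LSTM over the whole input and return only the final states."""
    h = [0.0] * hidden_size
    c = [0.0] * hidden_size
    for x in inputs:
        # The per-step output is overwritten, never collected.
        h, c = lstm_step(x, h, c, *params)
    return h, c

vocab_size, hidden_size = 5, 3
params = init_params(vocab_size, hidden_size)
# Five one-hot vectors standing in for "hello how are you doing".
sentence = [[1.0 if k == t else 0.0 for k in range(vocab_size)] for t in range(5)]
h_final, c_final = encode(sentence, params, hidden_size)
```

Whatever the length of the input sentence, `h_final` and `c_final` always have `hidden_size` entries, which is exactly the fixed-length representation handed over to the decoder.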
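The feed-the-output-back-in loop can be sketched as follows. This is only an illustration of the control flow: `toy_step` is a hypothetical lookup table standing in for a trained decoder LSTM, and `max_len` is an assumed safety limit so the loop always terminates.

```python
def greedy_decode(step_fn, state, start_token="<START>", end_token="<END>", max_len=20):
    """Generate tokens one at a time, feeding each output back in as the next input."""
    tokens = []
    token = start_token
    for _ in range(max_len):
        token, state = step_fn(token, state)
        if token == end_token:
            break  # the end-of-sequence token tells the decoder to stop
        tokens.append(token)
    return tokens

# Hypothetical stand-in for a trained decoder: maps the previous token to the next one.
TOY_NEXT = {
    "<START>": "hallo", "hallo": "wie", "wie": "geht",
    "geht": "es", "es": "dir", "dir": "<END>",
}

def toy_step(prev_token, state):
    return TOY_NEXT[prev_token], state

print(greedy_decode(toy_step, state=None))
# → ['hallo', 'wie', 'geht', 'es', 'dir']
```

In a real model, `step_fn` would run one LSTM step plus a softmax, and `state` would start out as the encoder's final hidden and cell states.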
English Sentence: hello how are you doing
German Translation: hallo wie geht es dir
The first thing to do is to add the <START> and <END> tokens to the sequences and vectorize them. The vectorization used here is one-hot encoding, applied to both the input and the output data.
The encoder behaves identically in both the training and testing phases: it accepts the input sequence token by token and updates the hidden and cell states at each step. At the end of the sequence, it passes the final hidden and cell states on to the decoder. The decoder, in turn, is trained to generate the translated tokens sequentially, given the translated token from the previous time-step. In the decoder block diagram, for the input token <START> it is trained to generate the output “hallo”. The “hallo” token (taken from the reference translation) is then fed as the input to the next LSTM cell, which is trained to output “wie”, and so on. In this way the network learns to predict the next translated token given the previous translated token and the state carried over from earlier steps.
A softmax layer is added on top of the LSTM outputs in the decoder to calculate the probability distribution over the entire vocabulary of the output sequence data. The token with the maximum probability is output as the predicted translated token. The errors are calculated at every time-step and the network minimizes the loss by backpropagation through time, just like a conventional LSTM.
During testing, the decoder still passes the previously generated token on to the next time-step, but the reference translation is of course not available as input. The network has to rely on the context vector, its own prediction from the previous time-step, and the weights learned during training.
The performance of Seq2Seq depends on the quality of the training data, the embedding dimensions, and the number of training steps. For our example hello how are you doing, the network may correctly output the hallo and wie tokens in the first two time-steps, but then predict wie again instead of geht. Even though the predicted word is wrong at the 3rd time-step, it is passed on to the next step, and the following predictions are conditioned on the incorrect one. We may get the final output hallo wie wie es dir. If this is the case, we need to feed in more data and train for more epochs to get good translations. One more improvement is to use embedding layers for both the encoder and decoder: a one-hot vector is as long as the vocabulary, which becomes unwieldy for large vocabularies, while embeddings are compact and are among the best text-representation techniques out there.
The final architecture of Seq2Seq, after combining all the blocks, is given below.
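As a small illustration of this preprocessing step, the sketch below builds a vocabulary for the German sentence, adds the <START> and <END> tokens, and one-hot encodes the result. The function names and the whitespace tokenization are assumptions made for the example; a real pipeline would use a proper tokenizer.

```python
def build_vocab(sentences, specials=("<START>", "<END>")):
    """Map every token to an integer index, with the special tokens first."""
    vocab = {tok: i for i, tok in enumerate(specials)}
    for sentence in sentences:
        for tok in sentence.split():
            vocab.setdefault(tok, len(vocab))
    return vocab

def one_hot_sequence(sentence, vocab, add_markers=True):
    """Return one row per token; each row has a single 1 at the token's index."""
    tokens = sentence.split()
    if add_markers:
        tokens = ["<START>"] + tokens + ["<END>"]
    rows = []
    for tok in tokens:
        row = [0] * len(vocab)
        row[vocab[tok]] = 1
        rows.append(row)
    return rows

german = "hallo wie geht es dir"
vocab = build_vocab([german])
encoded = one_hot_sequence(german, vocab)

print(len(vocab))    # → 7 (5 words + 2 special tokens)
print(encoded[0])    # → [1, 0, 0, 0, 0, 0, 0]  (the <START> token)
```

Each row is as long as the vocabulary, so this representation grows quickly as the vocabulary gets larger.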
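Feeding the reference token from the previous step into the decoder during training is commonly called teacher forcing. A minimal sketch of how the decoder inputs and targets line up, assuming simple whitespace tokenization and a hypothetical helper name:

```python
def teacher_forcing_pairs(target_sentence):
    """Pair each decoder input token with the token the decoder should predict next."""
    tokens = ["<START>"] + target_sentence.split() + ["<END>"]
    decoder_inputs = tokens[:-1]    # everything except the final <END>
    decoder_targets = tokens[1:]    # the same sequence shifted left by one
    return list(zip(decoder_inputs, decoder_targets))

for inp, tgt in teacher_forcing_pairs("hallo wie geht es dir"):
    print(f"{inp:>8} -> {tgt}")
```

The decoder targets are simply the decoder inputs shifted by one position, so at every time-step the network is asked "given this token and what came before, what is the next token?".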
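Here is a small sketch of what the softmax layer and the per-time-step error look like numerically. The logits are made-up numbers, not the output of a trained model, and cross-entropy is assumed as the loss, which is the usual choice for this setup.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution over the vocabulary."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target_index):
    """Loss for one time-step: negative log-probability of the correct token."""
    return -math.log(probs[target_index])

def sequence_loss(step_logits, target_indices):
    """Total loss for a sentence: sum of the per-time-step losses."""
    return sum(cross_entropy(softmax(logits), t)
               for logits, t in zip(step_logits, target_indices))

vocab = ["<START>", "<END>", "hallo", "wie", "geht", "es", "dir"]
logits = [0.1, 0.0, 2.5, 0.3, 0.2, 0.1, 0.0]   # hypothetical decoder scores at t1
probs = softmax(logits)
predicted = vocab[probs.index(max(probs))]      # token with the maximum probability
loss = cross_entropy(probs, vocab.index("hallo"))
print(predicted)   # → hallo
```

The loss is small when the model puts high probability on the correct token and grows as that probability shrinks; backpropagation through time then adjusts the weights to reduce the summed loss.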
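To see why embeddings are a drop-in replacement for one-hot inputs, note that multiplying a one-hot vector by a weight matrix just selects one row of that matrix. The sketch below checks this with a random table; the names, sizes, and values are assumptions for illustration, and in a real model the table would be learned during training.

```python
import random

def make_embedding(vocab_size, dim, seed=0):
    """A random embedding table with one dense row per vocabulary entry."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(vocab_size)]

def embed(token_id, table):
    """Embedding lookup: simply pick the row for this token."""
    return table[token_id]

def one_hot_matmul(token_id, table):
    """The same result computed the long way: one-hot vector times the table."""
    vocab_size, dim = len(table), len(table[0])
    one_hot = [1.0 if i == token_id else 0.0 for i in range(vocab_size)]
    return [sum(one_hot[i] * table[i][d] for i in range(vocab_size))
            for d in range(dim)]

table = make_embedding(vocab_size=7, dim=4)
lookup = embed(3, table)
product = one_hot_matmul(3, table)
same = all(abs(a - b) < 1e-12 for a, b in zip(lookup, product))
print(same)   # → True
```

The lookup gives the network a short dense vector (here 4 numbers) instead of a sparse vector as long as the vocabulary, without changing what the first layer can compute.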
Advantages & Disadvantages
Advantages:
- State-of-the-art performance on sequence modelling tasks at the time it was introduced
- End-to-end trainable system, with no need to optimize or fine-tune the encoder and decoder blocks separately
Disadvantages:
- Limited sequence lengths: the whole input has to be squeezed into a fixed-length context vector, and increasing the sequence length led to noisier output
- Requirements for data and compute resources are pretty huge. The original Seq2Seq architecture took around 10 days to train on multiple GPUs to reach a BLEU score of 34.81, a strong result for a purely neural translation system at the time (BLEU, Bilingual Evaluation Understudy, is a similarity score computed between machine translations and human reference translations)
- Harder to control without specifying rules for the task. Some tasks might need specific grammatical rules for better sequence modelling and translations
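To give a feel for what a BLEU score measures, here is a simplified sentence-level sketch: clipped n-gram precision combined with a brevity penalty. It is an assumption-laden toy (whitespace tokens, a single reference, n-grams only up to `max_n=2`, no smoothing), not the corpus-level BLEU used to produce the numbers reported in the paper.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(candidate, reference, max_n=2):
    """Simplified BLEU between one candidate and one reference, in the range 0 to 1."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clip each n-gram's count by how often it appears in the reference.
        overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # Penalize candidates that are shorter than the reference.
    brevity_penalty = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)

reference = "hallo wie geht es dir"
score = simple_bleu("hallo wie wie es dir", reference)
print(round(score, 3))   # → 0.632
```

For the faulty translation hallo wie wie es dir, 4 of 5 unigrams and 2 of 4 bigrams match the reference, giving roughly 0.63, while a perfect match scores 1.0. Published scores such as 34.81 are usually reported on a 0 to 100 scale.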
Here we are at the end of our walkthrough of Seq2Seq (Encoder-Decoder) architectures. I recommend reading the original paper, Sequence to Sequence Learning with Neural Networks. Here is a video of the presentation of this paper. Watching the video after going through the blog will enhance your understanding of Seq2Seq. Also, try your hand at implementing a Seq2Seq model using Google Colab’s powerful GPUs for free!
One more interesting sequence modelling task is image caption generation. Given an image, the model should output a sequence which describes the image. How cool is that! Have a glance at that too. It is a very interesting problem at the intersection of Computer Vision and NLP.
Seq2Seq was, for a time, simply the best approach in many sequence tasks, including Machine Translation. It does have some pitfalls, which were looked into and eventually gave birth to the Attention mechanism, which took the world by storm a few years ago and continues to be involved in major NLP and Computer Vision architectures. On to that in our next article!
- Sequence to Sequence learning with Neural Networks: https://arxiv.org/pdf/1409.3215
- Deep Visual-Semantic Alignments for Generating Image Descriptions: https://arxiv.org/pdf/1412.2306.pdf (Image Caption Generator)
- Stanford NLP’s Sequence-to-Sequence PPT Slides: https://nlp.stanford.edu/~johnhew/public/14-seq2seq.pdf