Hello! Welcome back to the follow-up article on LSTMs. In this article we shall discuss two more architectures that are very similar to LSTMs: Bi-LSTMs and GRUs (Gated Recurrent Units). As we saw in the previous article, the LSTM solved most of the problems of vanilla RNNs and, given good data, handled several important NLP problems well. Bi-LSTMs and GRUs can be treated as architectures that evolved from the LSTM: the core idea remains the same, with a few improvements here and there.
Bi-LSTM expands to Bidirectional LSTM. Straightaway, the intuition suggests something related to a double-direction LSTM. Is it an LSTM trained forward and backward? The precise answer is: two LSTMs taking sequential input in the forward and backward directions. The two networks are identical in architecture and share the same hyperparameters while training, though each learns its own weights. The only difference is that one network takes input from the beginning of a sentence and moves forward, while the other feeds in data from the end and moves backward. You might be thinking: is this any good? Before we answer that question, let's take a look at the architecture illustrated below.
Architecture & Working Principle
The answer to your question "Is this any good?" is yes! It performs much better than a vanilla LSTM. Bi-LSTMs are able to model text in a better way: they increase the information available for a particular sentence or corpus by modelling it both forwards and backwards. The example sentence in the image, "Heart is not enlarged", is too short to explain the real workings, so we shall consider this example: "The pizza with paneer topping is very tasty".
At some intermediate step, the forward-direction LSTM will have seen "The pizza" while the backward-direction LSTM will have gone through the other part of the sentence, "paneer topping is very tasty". If you consider this at a single pass, the information available to learn from is definitely more than with a single LSTM. The increase in context results in faster training and better results. The backward-direction LSTM locks in context from the future, rather like filling in the blanks in the best possible way. And with the information of two hidden states (one from each LSTM network), we are able to preserve contextual information from both the past and the future. In certain situations the backward context is very important.
“Teddy bears are beautiful toys”
“Teddy Roosevelt, the president of United States”
What word comes after "Teddy"? Very difficult to predict if you have context only from the forward direction. (This example was used by Andrew Ng in his course.)
In terms of formulae and computation, little changes apart from adding a second set of weight matrices and the corresponding calculation steps for the backward network. At any given unit, the output is a combination (typically a concatenation) of the forward-direction and backward-direction hidden states. Backpropagation, however, happens twice: once through the forward network and once through the backward network, independently.
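As a rough sketch of the combination step (not code from this article), here is a minimal NumPy LSTM cell run over a toy sequence in both directions, with the two hidden states concatenated at every time step. The sizes T, D and H and the small random weights are placeholder assumptions, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def lstm_cell(x, h, c, W, U, b):
    """One LSTM step; W, U, b pack the input, forget, output and candidate gates."""
    H = h.shape[0]
    z = W @ x + U @ h + b                      # all four gate pre-activations at once
    i = 1 / (1 + np.exp(-z[:H]))               # input gate
    f = 1 / (1 + np.exp(-z[H:2*H]))            # forget gate
    o = 1 / (1 + np.exp(-z[2*H:3*H]))          # output gate
    g = np.tanh(z[3*H:])                       # candidate cell content
    c = f * c + i * g
    h = o * np.tanh(c)
    return h, c

def bilstm(seq, params_fwd, params_bwd, hidden):
    """Run two independent LSTMs over the sequence, one per direction,
    then concatenate their hidden states at every time step."""
    def run(xs, params):
        h, c, outs = np.zeros(hidden), np.zeros(hidden), []
        for x in xs:
            h, c = lstm_cell(x, h, c, *params)
            outs.append(h)
        return outs
    fwd = run(seq, params_fwd)                 # t = 0 .. T-1
    bwd = run(seq[::-1], params_bwd)[::-1]     # reversed input, outputs re-aligned
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]

# toy sequence: 5 steps of 3 features; weights are random placeholders
T, D, H = 5, 3, 4
make = lambda: (rng.normal(0, 0.1, (4*H, D)), rng.normal(0, 0.1, (4*H, H)), np.zeros(4*H))
seq = [rng.normal(size=D) for _ in range(T)]
out = bilstm(seq, make(), make(), hidden=H)
print(len(out), out[0].shape)   # one 2*H-dimensional vector per time step
```

Note that each output vector has twice the hidden size, which is exactly the "doubled context" described above.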
Bi-LSTMs achieved state-of-the-art performance in part-of-speech tagging and named entity recognition. They have applications in word classification and are excellent at time-series modelling. On the other hand, since there are two networks involved, the computation required is double that of a single LSTM network and training takes much longer, but the performance is state-of-the-art. There are a few areas where Bi-LSTMs can't be applied. One example is real-time speech recognition or translation, as a Bi-LSTM requires the entire sequence before it can do any task. This line of work was eventually superseded by BERT, a much more powerful and complex architecture than Bi-LSTMs. (I'm already excited about BERT! But we should be patient and understand one or two more architectures before we are at BERT level.)
There is another architecture which is not so popular now but gained traction in the early days of LSTMs and Bi-LSTMs. Nonetheless, we shall have an overview of that as well. (Leaving no stone unturned in our quest to master NLP!) Right then, on to GRUs.
Gated Recurrent Units
GRUs are very similar to LSTMs. A GRU works by letting information pass, or not, via gates. Instead of the three gates in an LSTM, there are only two here, the update gate and the reset gate, which makes it outright faster because of less computation per iteration. Let's get into the architecture and working principle quickly.
Architecture & Working Principle
If we compare it to the LSTM architecture (refer to the previous article), the forget and input gates are combined to form a unified update gate. Also, there is no separate cell state line running through the cell; instead we deal with the data using only the hidden state and the input information. Let us understand it step by step.
First, the update gate takes in the previous hidden state, combines it with the current input, and multiplies them by its own weight matrices, giving z(t) = sigmoid(Wz·x(t) + Uz·h(t-1)). As usual, the update gate deals with the question "how much of the past information is to be passed along to the next unit?". It has the ability to retain the entire information from the previous state h(t-1), which helps counter the vanishing gradient problem.
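A minimal sketch of this gate computation in NumPy; the sizes D and H and the random weights are toy assumptions, not values from the article:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# hypothetical sizes: 3 input features, 4 hidden units
rng = np.random.default_rng(1)
D, H = 3, 4
W_z = rng.normal(0, 0.1, (H, D))   # weight on the current input x(t)
U_z = rng.normal(0, 0.1, (H, H))   # weight on the previous hidden state h(t-1)

x_t = rng.normal(size=D)
h_prev = rng.normal(size=H)

# z(t) = sigmoid(Wz x(t) + Uz h(t-1)); each entry lies strictly between 0 and 1
z_t = sigmoid(W_z @ x_t + U_z @ h_prev)
print(z_t)
```

The sigmoid keeps every gate value in (0, 1), so z(t) acts as a per-unit dial between keeping old information and admitting new information.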
The reset gate deals with "how much to forget?". Its formula is very similar to the update gate's, r(t) = sigmoid(Wr·x(t) + Ur·h(t-1)), but it uses its own weight matrices. We will come back to the reset gate shortly.
The latest memory content in the current unit is calculated as h̃(t) = tanh(W·x(t) + r(t) ⊙ U·h(t-1)), which in turn depends on the past hidden state and the reset gate weight matrices. The reset gate output r(t) is multiplied element-wise with the transformed h(t-1) first; this determines how much information from previous steps is removed. The input x(t) is multiplied by a weight matrix W, then both results are combined and squashed into the range [-1, 1] using a tanh function.
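These two steps, the reset gate and the candidate memory content, can be sketched as follows; again the sizes and random weights are toy assumptions for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

rng = np.random.default_rng(2)
D, H = 3, 4                       # toy sizes, not from the article
W_r, U_r = rng.normal(0, 0.1, (H, D)), rng.normal(0, 0.1, (H, H))
W_h, U_h = rng.normal(0, 0.1, (H, D)), rng.normal(0, 0.1, (H, H))

x_t = rng.normal(size=D)
h_prev = rng.normal(size=H)

# reset gate: how much of h(t-1) to drop when forming the candidate
r_t = sigmoid(W_r @ x_t + U_r @ h_prev)

# candidate memory content: tanh squashes the result into [-1, 1]
h_cand = np.tanh(W_h @ x_t + r_t * (U_h @ h_prev))
print(h_cand)
```

When r(t) is close to 0 for a unit, the candidate state for that unit is computed almost entirely from the current input, effectively forgetting the past.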
Finally the output state h(t) is calculated; this is the information at the current step, and it is passed along to the next units in the network. "What information to pass along" is decided by the update gate. An element-wise multiplication is performed between h(t-1) and (1 - z(t)) and added to another element-wise multiplication performed between z(t) and the candidate h̃(t): h(t) = (1 - z(t)) ⊙ h(t-1) + z(t) ⊙ h̃(t).
Consider an example where the most relevant information is at the beginning of the sentence. In this case, the z(t) value will be close to 0, so that (1 - z(t)) stays near 1 and the majority chunk of the previous information is carried forward. At the same time, z(t) being very close to 0 means not much information from the current candidate state is factored into the calculation. Isn't this brilliant? Yes it is.
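Putting the three formulas together, a full GRU step can be sketched as below (toy NumPy code, not from the article); the second half isolates the gating behaviour with hand-picked values to show how a small z(t) preserves the previous hidden state under the convention h(t) = (1 - z) ⊙ h(t-1) + z ⊙ h̃(t):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, Wh, Uh):
    """One full GRU update, combining the update gate, reset gate
    and candidate state into the new hidden state."""
    z = sigmoid(Wz @ x_t + Uz @ h_prev)            # update gate
    r = sigmoid(Wr @ x_t + Ur @ h_prev)            # reset gate
    h_cand = np.tanh(Wh @ x_t + r * (Uh @ h_prev)) # candidate content
    return (1 - z) * h_prev + z * h_cand           # interpolate old and new

# gating behaviour in isolation: a z near 0 keeps h(t) close to h(t-1),
# even when the candidate state points somewhere completely different
h_prev = np.array([0.5, -0.3, 0.8])
h_cand = np.array([-0.9, 0.9, -0.9])
z_small = np.full(3, 0.01)
h_t = (1 - z_small) * h_prev + z_small * h_cand
print(h_t)   # barely moves away from h_prev
```

Because the interpolation is element-wise, each hidden unit gets its own dial: some units can cling to old information while others update aggressively within the same step.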
Bi-LSTMs are superior to vanilla LSTMs and perform much better by adding a second, similar network running in parallel with the sequence fed in reverse order. The contextual information is doubled, so the model can factor in both past and future context at a given point in time. Coming to GRUs: by using just two gates and a few extra element-wise operations, they are able to store, remember or forget, and pass along vital information through the network units. Because of the fewer operations, a GRU is faster than an LSTM. Neither LSTMs nor GRUs are strictly superior, and they can often be used interchangeably; in fact, depending on the data and the task, all three architectures (LSTMs, GRUs and Bi-LSTMs) can give impressive results. In our next article, we shall look into another special architecture, Seq2Seq, which for the first time used an encoder-decoder setup. Exciting things are coming up in the next one, don't miss it at any cost. Put down your thoughts in the comments below whilst you try implementing RNNs for various NLP tasks.
P.S: Below are the resources for implementing the discussed architectures.
- GRU: https://arxiv.org/pdf/1406.1078v3.pdf
- Text Generation using RNNs: https://www.tensorflow.org/text/tutorials/text_generation
- Text Classification using RNNs: https://www.tensorflow.org/text/tutorials/text_classification_rnn
- Spam Filtering: https://hub.packtpub.com/implement-rnn-tensorflow-spam-prediction-tutorial/ (Slightly Advanced)