Hi and welcome back to another article in the NLP Tutorials series. So far, we have covered several important concepts, architectures and projects in the NLP domain, the latest being Recurrent Neural Networks (RNNs). It is time to move a step ahead and understand an advanced architecture that performs far better than RNNs: Long Short-Term Memory (LSTM) networks. Without wasting time, let us first go through a few disadvantages of RNNs and what did not work for them, which in turn will set the context for understanding LSTMs and how they solve the problems of RNNs.
Pitfalls of RNNs
RNNs fail on long sequences. The longer the sequence, the lower the chances of the network training effectively and fitting the data. This is mainly due to the vanishing gradient problem. In theory the architecture should be able to manage long sequences, provided the parameters are hand-picked and fine-tuned very carefully; in practice this limited RNNs to smaller problems, and they failed on more complex tasks like text generation. Enter LSTMs, with their ability to handle long-range dependencies. They introduce a few very good architectural improvements, and we shall inspect them in detail in the coming sections.
LSTM Architecture & Working Principle
One of the key components of an LSTM is the cell state. It can be thought of as a horizontal line that runs through all the cells, along which information is collected and flows from cell to cell. The core working principle is to manipulate this information using a few gates, which gives LSTMs the ability to add, remove, increase or decrease the information flow. A simple sigmoid gate whose output is multiplied pointwise with the existing cell state alters the information flow; this is how LSTMs modify the information very efficiently and make sure the core context is retained for that training example.
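The gating principle above can be sketched in a few lines of NumPy. The gate and state values here are illustrative numbers picked by hand, not learned parameters:

```python
import numpy as np

def sigmoid(z):
    # Squashes any real value into (0, 1): near 0 -> discard, near 1 -> retain.
    return 1.0 / (1.0 + np.exp(-z))

# A toy cell state and a gate pre-activation (illustrative values, not learned).
cell_state = np.array([0.8, -0.5, 1.2])
gate = sigmoid(np.array([4.0, -4.0, 0.0]))  # roughly [0.98, 0.02, 0.5]

# Pointwise multiplication scales each component of the state:
# components gated near 1 pass through, components gated near 0 are suppressed.
filtered_state = gate * cell_state
```

The first component passes through almost unchanged, the second is almost zeroed out, and the third is halved: this selective pass-through is exactly how a sigmoid gate controls the information flow.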
The diagram below illustrates the LSTM architecture. There are a total of four operations happening in each pass of the cell. Let us look at them step by step and understand how the cell-state information is altered.
The initial stage decides what information to discard from the cell state. The previous hidden state h(t-1) and the current input x(t) are concatenated and passed to the first sigmoid gate (the forget gate), which outputs f(t). Each component of f(t) is a number between 0 and 1, since it comes from a sigmoid. An output of 1 indicates “retain this information” and 0 indicates “discard this information”.
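A minimal NumPy sketch of the forget gate, using hypothetical, randomly initialised weights W_f and b_f (the article does not specify concrete values):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
hidden, inputs = 3, 2

# Hypothetical (randomly initialised) forget-gate parameters.
W_f = rng.normal(size=(hidden, hidden + inputs))
b_f = np.zeros(hidden)

h_prev = np.zeros(hidden)       # h(t-1), zeros at the first timestep
x_t = np.array([0.5, -1.0])     # x(t), a toy input vector

# f(t) = sigmoid(W_f . [h(t-1), x(t)] + b_f)
concat = np.concatenate([h_prev, x_t])
f_t = sigmoid(W_f @ concat + b_f)

# Every component of f_t lies in (0, 1): a soft retain/discard decision
# applied pointwise to the previous cell state C(t-1).
```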
Next up: what new information is to be stored in the cell state in this iteration? This is a combination of the second sigmoid gate and a tanh layer, producing i(t) and C̃(t) respectively. The sigmoid gate is called the input gate layer and decides which values to update. The tanh layer creates a vector of new candidate values, C̃(t), that could be added to the state.
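Continuing the sketch, the input gate and the candidate layer can be written the same way. The weights W_i, b_i, W_c and b_c are again hypothetical, randomly initialised stand-ins:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
hidden, inputs = 3, 2
# [h(t-1), x(t)] with a zero initial hidden state and a toy input.
concat = np.concatenate([np.zeros(hidden), np.array([0.5, -1.0])])

# Hypothetical parameters for the input gate (W_i, b_i)
# and the candidate layer (W_c, b_c).
W_i = rng.normal(size=(hidden, hidden + inputs))
b_i = np.zeros(hidden)
W_c = rng.normal(size=(hidden, hidden + inputs))
b_c = np.zeros(hidden)

i_t = sigmoid(W_i @ concat + b_i)      # which values to update, each in (0, 1)
c_tilde = np.tanh(W_c @ concat + b_c)  # candidate values, each in (-1, 1)
```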
Together, these produce the new cell state: C(t) = f(t) * C(t-1) + i(t) * C̃(t), where * denotes pointwise multiplication and C̃(t) is the candidate vector. The forget gate output f(t) is first multiplied pointwise with the old cell state C(t-1), and then the pointwise product of i(t) and C̃(t) is added to it. This third stage is the update layer. Now comes the final stage, where we compute the value output to the next cell.
The output is based on the cell state, the input and the parts of the cell state we choose to expose. o(t) is obtained from a sigmoid layer, o(t) = sigmoid(W(o) · [h(t-1), x(t)] + b(o)), which decides which parts of the cell state to output; this is a filtering operation. It is then multiplied with tanh(C(t)), the updated cell state squashed into the range (-1, 1), giving the new hidden state h(t) = o(t) * tanh(C(t)).
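All four stages can be tied together into one forward pass of a single LSTM cell. This is a minimal NumPy sketch with hypothetical, randomly initialised weights, no batching and no training; real implementations in deep learning frameworks follow the same equations:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(x_t, h_prev, c_prev, params):
    """One forward pass of a single LSTM cell."""
    W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o = params
    z = np.concatenate([h_prev, x_t])      # [h(t-1), x(t)]

    f_t = sigmoid(W_f @ z + b_f)           # forget gate
    i_t = sigmoid(W_i @ z + b_i)           # input gate
    c_tilde = np.tanh(W_c @ z + b_c)       # candidate values
    c_t = f_t * c_prev + i_t * c_tilde     # cell-state update
    o_t = sigmoid(W_o @ z + b_o)           # output gate (filter)
    h_t = o_t * np.tanh(c_t)               # new hidden state

    return h_t, c_t

rng = np.random.default_rng(42)
hidden, inputs = 4, 3

def init_layer():
    # Hypothetical small random weights and zero biases for one gate/layer.
    return rng.normal(scale=0.1, size=(hidden, hidden + inputs)), np.zeros(hidden)

params = [p for pair in (init_layer(), init_layer(), init_layer(), init_layer())
          for p in pair]

# Run the cell over a toy sequence of length 5.
h, c = np.zeros(hidden), np.zeros(hidden)
for x_t in rng.normal(size=(5, inputs)):
    h, c = lstm_cell_step(x_t, h, c, params)
```

Note that each component of h stays strictly inside (-1, 1), since it is a sigmoid output multiplied by a tanh output, while the cell state c is not bounded this way and can accumulate information across timesteps.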
Advantages & Disadvantages
Advantages:

- Largely eliminates the long-term dependency problem
- Excellent performance in text generation and language modelling
- Can perform speech and handwriting recognition

Disadvantages:

- Complex operations at each cell add to the overall parameter count, making LSTMs slow to train and heavy on memory bandwidth
- Traces of vanishing gradients may still be observed as data propagates from cell to cell, although this is minimized to a large extent compared to RNNs
- Prone to overfitting, and standard dropout is difficult to apply
LSTMs showed that they can handle longer dependencies and solved the vanishing gradient problem to a large extent. Their good performance on longer sequences opened up an avenue of problem-solving techniques for complex NLP tasks like speech recognition, language modelling and text generation. These projects seem interesting, don’t they? In the next post in the NLP Tutorials series, we shall take up an exciting problem statement and attempt to solve it using LSTMs. Until then, make sure you are ready for the next article by understanding and revising the concepts of RNNs and LSTMs!
- Long Short Term Memory: http://www.bioinf.jku.at/publications/older/2604.pdf