Hello and welcome back to the NLP Tutorials! In our previous article we discussed one of the popular word embedding techniques, Word2Vec. It was a revolutionary word representation technique that changed the face of solving NLP problems. Although Word2Vec was good, it still had a few drawbacks, which were largely overcome by GloVe word embeddings. GloVe stands for Global Vectors. The model is built around capturing co-occurrence statistics at the global (corpus-wide) level. Because it aggregates statistics over the entire corpus, the representation it builds is large and memory intensive, but it gives excellent results on a majority of NLP tasks. Let's quickly get into the details of GloVe embeddings.

Background

Before understanding GloVe, let us first look at Word2Vec's two main drawbacks:

  1. It captured only local context (a small window around each word); the overall corpus-level context was not modelled effectively.
  2. It cannot generate vectors for unseen words. Word2Vec struggles with these cases because its context is highly localized; if a global statistic were involved, the generated vectors might have been better suited to the NLP use case.

Now that we know the drawbacks of Word2Vec, let's understand GloVe embeddings (Global Vectors).

GloVe

GloVe captures both global and local statistics to generate the embeddings. GloVe is a count-based model: it learns vectors by performing dimensionality reduction on a matrix of co-occurrence counts. First, a large matrix of co-occurrence information is constructed, which records how frequently each word (the rows) is seen in each context (the columns).

The question now is: how do we measure the co-occurrence of two words? One popular co-occurrence measure is Pointwise Mutual Information (PMI).

PMI(w1, w2) = log( p(w1, w2) / (p(w1) * p(w2)) )

Here p(w) is the probability of a word occurring and p(w1, w2) is the joint probability of the two words occurring together. The higher the PMI, the stronger the association between the words.
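To make this concrete, here is a minimal sketch of computing PMI from raw counts on a tiny made-up corpus (the corpus, the sentence-level notion of "context", and the `pmi` helper are purely illustrative assumptions):

```python
import math
from collections import Counter
from itertools import combinations

# Toy corpus: each inner list is treated as one "context".
corpus = [
    ["ice", "is", "a", "solid"],
    ["steam", "is", "a", "gas"],
    ["ice", "melts", "into", "water"],
    ["steam", "condenses", "into", "water"],
]

word_counts = Counter(w for sent in corpus for w in sent)
pair_counts = Counter()
for sent in corpus:
    for w1, w2 in combinations(set(sent), 2):
        pair_counts[tuple(sorted((w1, w2)))] += 1

total_words = sum(word_counts.values())
total_pairs = sum(pair_counts.values())

def pmi(w1, w2):
    """PMI(w1, w2) = log( p(w1, w2) / (p(w1) * p(w2)) )."""
    p_w1 = word_counts[w1] / total_words
    p_w2 = word_counts[w2] / total_words
    p_joint = pair_counts[tuple(sorted((w1, w2)))] / total_pairs
    return math.log(p_joint / (p_w1 * p_w2))

print(pmi("ice", "water"), pmi("steam", "water"))
```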

GloVe embeddings improve on Word2Vec by virtue of one key concept: co-occurrence. The global statistics that GloVe captures come from quantifying the co-occurrence of pairs of words. A co-occurrence matrix X is constructed, where a cell X(i, j) tells us how often word i appears in the context of word j across the entire corpus. To understand this clearly, we need some example words and a matrix.
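Before looking at the paper's example, here is a rough sketch of how such a matrix can be built, assuming a toy two-sentence corpus and a symmetric window of size 2 (the actual GloVe implementation additionally down-weights distant context words by 1/distance, which is omitted here for simplicity):

```python
import numpy as np

# Toy corpus and an assumed symmetric context window of size 2.
corpus = [
    "ice is a solid and water comes from ice",
    "steam is a gas and water comes from steam",
]
window = 2

vocab = sorted({w for line in corpus for w in line.split()})
idx = {w: i for i, w in enumerate(vocab)}

# X[i, j] = how often word j appears within `window` tokens of word i.
X = np.zeros((len(vocab), len(vocab)))
for line in corpus:
    tokens = line.split()
    for pos, w in enumerate(tokens):
        start = max(0, pos - window)
        end = min(len(tokens), pos + window + 1)
        for ctx_pos in range(start, end):
            if ctx_pos != pos:
                X[idx[w], idx[tokens[ctx_pos]]] += 1

print(vocab)
print(X[idx["ice"]])
```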

Co-occurrence probabilities for the words ice and steam from a 6-billion-token corpus. Image Credits: GloVe paper

In the above table, P(k|w) is the probability that the word k appears in the context of word w. The value of P(water|ice) is relatively high, which means water and ice occur together often; the same goes for P(water|steam). But the real signal is in the ratio of the two probabilities: P(solid|ice)/P(solid|steam) is much larger than one, which indicates that solid co-occurs strongly with ice but not with steam, while P(gas|ice)/P(gas|steam) is much smaller than one, indicating the opposite. For a word related to both ice and steam, such as water, or related to neither, such as fashion, the ratio comes out close to one.

The key insight is that the best starting point for learning word embeddings is the ratios of co-occurrence probabilities rather than the probabilities themselves.
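The snippet below illustrates the idea with made-up probabilities that only mimic the pattern of the paper's ice/steam table (they are not the paper's actual values): the ratio is large for solid, tiny for gas, and close to one for water and fashion.

```python
# Illustrative (made-up) co-occurrence probabilities in the spirit of the
# ice/steam table; they are NOT the numbers reported in the GloVe paper.
p_given_ice   = {"solid": 2.0e-4, "gas": 7.0e-5, "water": 3.0e-3, "fashion": 1.8e-5}
p_given_steam = {"solid": 2.0e-5, "gas": 8.0e-4, "water": 2.5e-3, "fashion": 1.9e-5}

for k in p_given_ice:
    ratio = p_given_ice[k] / p_given_steam[k]
    print(f"{k:8s} P(k|ice)/P(k|steam) = {ratio:.2f}")
# solid -> large, gas -> small, water and fashion -> close to 1
```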

Why GloVe is better than Word2Vec

Rather than using a window to define local context, GloVe constructs an explicit word-context (word co-occurrence) matrix using statistics gathered across the whole text corpus. The result is a learning model that generally yields better word embeddings.

Training

Whereas Word2Vec learns by maximizing the probability of a context word occurring given a centre word, GloVe is trained as a regression on the co-occurrence statistics themselves. The objective is to learn vectors such that the dot product of two word vectors (plus bias terms) equals the log of the words' co-occurrence count. Since log(a/b) = log(a) - log(b), ratios of co-occurrence probabilities turn into vector differences in the word vector space. Because these vector differences are based on a strong corpus-wide statistic, co-occurrence, the resulting embeddings are powerful and perform excellently compared to other embeddings like Word2Vec.
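As a minimal sketch, a single term of this objective for one word pair (i, j) could look like the function below; the vectors, biases, and the co-occurrence count of 42 are arbitrary placeholders, and the weighting discussed next is left out:

```python
import numpy as np

# One (unweighted) term of the GloVe objective for a single pair (i, j):
# the model wants  w_i . w~_j + b_i + b~_j  to be close to  log X_ij.
def pair_loss(w_i, w_j_tilde, b_i, b_j_tilde, x_ij):
    return (w_i @ w_j_tilde + b_i + b_j_tilde - np.log(x_ij)) ** 2

# Quick check with random 50-dimensional vectors (the dimension is arbitrary).
rng = np.random.default_rng(0)
w_i, w_j = rng.normal(size=50), rng.normal(size=50)
print(pair_loss(w_i, w_j, 0.0, 0.0, x_ij=42.0))
```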

To deal with noisy, rare co-occurrences, the model is a weighted least-squares regression. The weighting function is a clipped power law with exponent alpha = 3/4 (capped at 1 once the count reaches a threshold x_max, set to 100 in the paper), which assigns smaller weights to word pairs with low co-occurrence counts.
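Here is a rough NumPy sketch of that weighted objective, using the paper's values x_max = 100 and alpha = 3/4; the toy matrix, the vector sizes, and the `glove_loss` helper are only illustrative, and a real implementation would optimize the vectors (the original uses AdaGrad) rather than just evaluating the loss once:

```python
import numpy as np

X_MAX, ALPHA = 100.0, 0.75  # values used in the GloVe paper

def weight(x):
    """Clipped power-law weighting: small counts get small weights,
    and the weight is capped at 1 for counts at or above X_MAX."""
    return np.minimum((x / X_MAX) ** ALPHA, 1.0)

def glove_loss(W, W_tilde, b, b_tilde, X):
    """Weighted least-squares objective summed over non-zero co-occurrences."""
    i, j = np.nonzero(X)
    err = (W[i] * W_tilde[j]).sum(axis=1) + b[i] + b_tilde[j] - np.log(X[i, j])
    return np.sum(weight(X[i, j]) * err ** 2)

# Tiny random example: 5 words, 10-dimensional vectors (sizes are arbitrary).
rng = np.random.default_rng(0)
X = rng.integers(0, 50, size=(5, 5)).astype(float)
W, W_tilde = rng.normal(0, 0.1, (5, 10)), rng.normal(0, 0.1, (5, 10))
b, b_tilde = np.zeros(5), np.zeros(5)
print(glove_loss(W, W_tilde, b, b_tilde, X))
```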

Results

The GloVe model was trained on five different corpora: 2010 Wikipedia (1 billion tokens), 2014 Wikipedia (1.6 billion tokens), Gigaword 5 (4.3 billion tokens), a combination of Gigaword 5 and 2014 Wikipedia (6 billion tokens), and finally a 42-billion-token Common Crawl dataset. The data was lowercased and then tokenized with the Stanford tokenizer.

GloVe achieved 75% accuracy on the word analogy task, which was very good, and it performed well on word similarity and named entity recognition tasks too. Illustrated below is the comparison of GloVe vs CBOW and GloVe vs Skip-gram. As we can see, the overall accuracy increases with training time.

Overall accuracy on the word analogy task as a function of training time. Image Credits: GloVe paper

Advantages

  1. Faster training
  2. Highly scalable to massive corpora
  3. Good performance on small datasets too (custom trainings)

Disadvantages

  1. Memory: the co-occurrence matrix and vectors use a lot of memory, and it takes time to load and store the vector values in RAM.
  2. Quite sensitive to the initial learning rate.

Conclusion

GloVe models both global and local statistics effectively and comes out leaps and bounds ahead in capturing semantic relationships, which results in second-to-none performance on various NLP tasks. Although GloVe and Word2Vec are closely matched on a few tasks, GloVe takes the upper hand thanks to its use of global statistics.

This is how the authors of GloVe concluded their paper:

We construct a model that utilizes this main benefit of count data while simultaneously capturing the meaningful linear substructures prevalent in recent log-bilinear prediction-based methods like word2vec. The result, GloVe, is a new global log-bilinear regression model for the unsupervised learning of word representations that outperforms other models on word analogy, word similarity, and named entity recognition tasks.

There are a lot of tricks used to solve a few prominent problems in modelling with the ratios of co-occurrence probabilities; they are detailed very well in the GloVe paper. I would recommend a thorough read to get into the core math of the statistical modelling. And now we are at the end of yet another comprehensive review of an NLP concept. This time it was all about GloVe!

Share your thoughts on GloVe in the comments below and try your hand at implementing a GloVe model for your own custom dataset. I will leave some links below which can help you train an embedding for your dataset, along with a small loading sketch after the links.

  1. Robust GloVe with SpaCy
  2. Guide to Train GloVe
  3. Keras guide to implementing GloVe
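As a quick starting point, here is a minimal sketch for loading one of the pretrained GloVe text files released by Stanford and comparing a couple of words with cosine similarity; the file name glove.6B.100d.txt is just an example, so point the path at whichever file you download:

```python
import numpy as np

def load_glove(path):
    """Parse a GloVe text file: each line is a word followed by its
    vector components, separated by single spaces."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            embeddings[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return embeddings

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

vectors = load_glove("glove.6B.100d.txt")  # example path, change as needed
print(cosine(vectors["ice"], vectors["water"]))    # related words -> higher
print(cosine(vectors["ice"], vectors["fashion"]))  # unrelated words -> lower
```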

References

  1. GloVe: https://nlp.stanford.edu/pubs/glove.pdf

Author

Pranav Raikote
