Welcome back to the NLP Tutorials! Hope y’all had a good time reading my previous articles and were able to learn and make progress in your journey to NLP proficiency! In our previous post we looked at a project — Document Similarity using two vectorizers — CountVectorizer & Tf-Idf Vectorizer. I hope you tried your hand at Document Similarity with various other techniques and datasets. In this article we shall dive deep into the world of Text Embeddings, which are more advanced and sophisticated ways of representing text in vector form. There are many Word Embeddings out there, but in this article we shall have an overview of Word2Vec, one of the earliest and most famous Word Embeddings developed and published by Google. Let’s get started then!

Example visualization of vectors that are semantically close to each other (King and Queen, Man and Woman, etc.). These groupings hold semantic information that lets a word be represented far better than with Tf-Idf or CountVectorizer

Word Embeddings

Word Embeddings are a way of representing text in vector form. They are learned representations that convert natural language text into vectors which capture the meaning of words in a sentence quite well. Words are placed in an n-dimensional space where similar words lie close to each other, for example King and Queen, or Sport and Football. Keeping related words close helps capture and retain more information about each word.

Why Word Embeddings?

The reason we need Word Embeddings is simple: they are dense representations. The previous techniques, CountVectorizer and Tf-Idf Vectorizer, are simple encoding schemes that produce sparse matrices whose dimensionality grows with the vocabulary and which hold little contextual information. Word Embeddings instead map every word to a dense, fixed-size vector in which similar words are grouped together, which makes them far more efficient and expressive for a large vocabulary.
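
To make the sparse-versus-dense contrast concrete, here is a minimal sketch using scikit-learn and gensim; the toy corpus and the hyper-parameters are purely illustrative assumptions, not values from this article.

```python
from sklearn.feature_extraction.text import CountVectorizer
from gensim.models import Word2Vec

# Toy corpus, invented purely for illustration
corpus = [
    "the sky is blue and beautiful",
    "the sun is bright and yellow",
    "football is a beautiful sport",
]

# Sparse representation: one column per vocabulary term, mostly zeros
count_matrix = CountVectorizer().fit_transform(corpus)
print(count_matrix.shape)      # (3, n_vocabulary_terms)
print(count_matrix.toarray())  # rows are mostly zeros

# Dense representation: every word maps to a fixed-size real-valued vector
tokenised = [doc.split() for doc in corpus]
model = Word2Vec(tokenised, vector_size=50, window=2, min_count=1, epochs=50)
print(model.wv["beautiful"].shape)  # (50,) regardless of vocabulary size
```

Notice that the dense vector stays 50-dimensional no matter how large the vocabulary grows, while the count matrix keeps adding (mostly empty) columns.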

Word2Vec

Word2Vec is a neural-network approach to creating efficient and semantically strong word representations. It is a shallow two-layer neural network trained to reconstruct the linguistic context of words, and the objective is set up so that words appearing in similar contexts end up with similar embeddings. From a mathematical perspective, the angle between such vectors should be close to 0 degrees, i.e. a cosine similarity close to 1, indicating that the vectors sit very close to each other in the vector space. For example, the nearest neighbours of the word frog would be frogs, toads and Litoria. The Word2Vec model captures both syntactic and semantic similarities between words. One of the well-known examples of vector arithmetic on trained Word2Vec vectors is "Man" − "Woman" + "Queen" ≈ "King".
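
As a quick sanity check of this kind of vector arithmetic, the sketch below uses gensim's downloader with the publicly available Google News vectors. Treat the model identifier and the large first-time download as assumptions of this example rather than anything prescribed in the article.

```python
import gensim.downloader as api

# Pretrained Google News vectors (~1.6 GB download on first use);
# "word2vec-google-news-300" is the identifier used by gensim-data.
wv = api.load("word2vec-google-news-300")

# Vector arithmetic: man - woman + queen should surface "king" near the top
print(wv.most_similar(positive=["man", "queen"], negative=["woman"], topn=3))

# Cosine similarity is high (angle near 0) for semantically related words
print(wv.similarity("frog", "toad"))
print(wv.similarity("frog", "keyboard"))
```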

There are two variants of Word2Vec: the CBOW (Continuous Bag of Words) model and the Skip-gram model. Both are shallow networks that learn a mapping between a word and its surrounding context words, and the weights learned for this mapping become the Word Embedding itself. Let's discuss these two models in more detail.

CBOW Model

The CBOW model predicts the probability of a target word given the surrounding words in a phrase or sentence. Let me explain it with an example. Consider this sentence: The sky is blue and beautiful. Given the surrounding words such as sky and beautiful as the context, the network tries to predict blue as the target word.

This is repeated multiple times with a certain window size (the context window) to cover every position in the sentence. Here, window size refers to the number of surrounding words considered on either side of the target word for training and prediction. With a window size of two, one training step uses the context words [sky, is, and, beautiful] to predict the target word [blue]; the next step centres on the following word, and so on. The network is trained by sliding this window across the corpus to obtain the final weights. Usually a window size of 8–10 is preferred, and a vector size of 300 is common. This weight matrix itself becomes the Word2Vec Embedding and can later be used as a look-up table to convert any given word into vector form.
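
To make the sliding window concrete, here is a small pure-Python sketch, a toy illustration of how (context words, target word) pairs for CBOW can be generated from the example sentence, not how gensim implements it internally.

```python
def cbow_pairs(tokens, window):
    """Yield (context_words, target_word) pairs for a CBOW-style model."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]   # words before the target
        right = tokens[i + 1:i + 1 + window]  # words after the target
        context = left + right
        if context:
            yield context, target

sentence = "the sky is blue and beautiful".split()
for context, target in cbow_pairs(sentence, window=2):
    print(context, "->", target)
# e.g. ['sky', 'is', 'and', 'beautiful'] -> blue
```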

Depiction of CBOW Architecture. Image Source — Word2Vec paper

Skip-gram Model

This model does exactly the opposite of the CBOW model: given a target word, it tries to predict the surrounding context words. Taking the same example as above, The sky is blue and beautiful, the target word could be [blue] and the words it predicts would be its neighbours, such as [sky] and [beautiful]. As we can see in the picture below, the architecture is reversed, because a single input word is used to predict multiple surrounding words.
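
In gensim, switching between the two variants is a single flag. The sketch below trains both on the same sentences; the toy corpus and hyper-parameters are illustrative assumptions.

```python
from gensim.models import Word2Vec

sentences = [
    "the sky is blue and beautiful".split(),
    "the sun in the sky is bright".split(),
    "a beautiful blue sky after the rain".split(),
]

# sg=0 -> CBOW (predict the target from its context), sg=1 -> Skip-gram
cbow = Word2Vec(sentences, sg=0, vector_size=100, window=3, min_count=1, epochs=100)
skipgram = Word2Vec(sentences, sg=1, vector_size=100, window=3, min_count=1, epochs=100)

print(cbow.wv.most_similar("sky", topn=3))
print(skipgram.wv.most_similar("sky", topn=3))
```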

Depiction of Skip-gram Architecture. Image Source — Word2Vec paper

The table below shows how well Word2Vec is able to represent word relationships.

Example of word-pair relationships with a vector size of 300 (Skip-gram model)

A word of advice here: the skip-gram model generally produces better-quality vectors, especially for rare words, but it is slower to train, whereas CBOW is several times faster to train and works well for frequent words.

Architecture & Training

We have already seen the two networks used by CBOW and Skip-gram, but let's look at the general architecture and training procedure. Word2Vec is quite similar to an auto-encoder, except that instead of compressing and then decompressing the input, an output layer is attached that gives the probabilities of the target words. The input to the network is a one-hot encoded word (or set of words), the second layer is a dense (fully-connected) layer whose weight matrix holds the embedding, and the output layer is a plain softmax over the vocabulary. What we are actually interested in is the hidden-layer weight matrix, so we simply discard the output layer once training is done.
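
The sketch below is one possible PyTorch rendering of this idea for the skip-gram case: an embedding look-up standing in for the one-hot input, a hidden layer that holds the embedding matrix, and an output layer scored with a softmax (applied inside CrossEntropyLoss). It is a minimal illustration under those assumptions, not the optimised implementation the authors released.

```python
import torch
import torch.nn as nn

class SkipGram(nn.Module):
    def __init__(self, vocab_size, embedding_dim=300):
        super().__init__()
        # Hidden layer: the rows of this matrix are the word vectors we keep
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        # Output layer: scores over the vocabulary (softmax applied in the loss)
        self.output = nn.Linear(embedding_dim, vocab_size)

    def forward(self, target_ids):
        return self.output(self.embeddings(target_ids))

vocab_size = 10  # toy vocabulary for illustration
model = SkipGram(vocab_size, embedding_dim=8)
loss_fn = nn.CrossEntropyLoss()

# One (target, context) training pair, expressed as word indices
target = torch.tensor([3])
context = torch.tensor([5])
loss = loss_fn(model(target), context)
loss.backward()

# After training, the embedding matrix is the look-up table we keep
word_vectors = model.embeddings.weight.detach()
print(word_vectors.shape)  # (vocab_size, embedding_dim)
```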

Coming to the training aspect: the network is only two layers deep but very wide, so computing and normalising the softmax over the full vocabulary is expensive. The authors reduced this cost by using hierarchical softmax, which organises the vocabulary in a Huffman tree so that a prediction requires only about log2(V) binary decisions instead of a normalisation over all V words.
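
To build intuition for why the Huffman tree helps, the toy sketch below (word frequencies invented for illustration) constructs Huffman codes with heapq. Frequent words receive short codes, so the model only has to make a handful of binary decisions for them, rather than normalising over the whole vocabulary.

```python
import heapq

def huffman_codes(freqs):
    """Return {word: binary_code}; frequent words get shorter codes."""
    # Each heap entry: [total frequency, tie-breaker, {word: code_so_far}]
    heap = [[f, i, {w: ""}] for i, (w, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)
        hi = heapq.heappop(heap)
        merged = {w: "0" + code for w, code in lo[2].items()}
        merged.update({w: "1" + code for w, code in hi[2].items()})
        heapq.heappush(heap, [lo[0] + hi[0], counter, merged])
        counter += 1
    return heap[0][2]

# Toy frequencies, invented purely for illustration
freqs = {"the": 5000, "sky": 300, "blue": 250, "toad": 5, "litoria": 2}
for word, code in sorted(huffman_codes(freqs).items(), key=lambda x: len(x[1])):
    print(f"{word:>8} -> {code}")  # 'the' gets the shortest code
```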

Some words that occur very frequently in the corpus contribute very little information to the embedding; these are the usual NLP stopwords like 'a', 'an', 'the', etc. Such words are subsampled to accelerate the training process. We can also detect frequent bigrams and trigrams and treat them as single tokens, so that the model learns vectors for phrases as well as words, which strengthens phrase-level representation.
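
gensim ships a Phrases utility for detecting such frequent word pairs. The sketch below shows how a bigram like new_york can be folded into a single token before training; the toy sentences, min_count and threshold values are illustrative assumptions.

```python
from gensim.models.phrases import Phrases, Phraser

sentences = [
    "i visited new york last summer".split(),
    "new york is a busy city".split(),
    "the new york marathon is famous".split(),
]

# Learn which word pairs co-occur often enough to be treated as one token
bigram = Phraser(Phrases(sentences, min_count=1, threshold=1))
print(bigram["i love new york".split()])
# e.g. ['i', 'love', 'new_york'] -- the detected phrase becomes a single token

# The transformed sentences can then be fed to Word2Vec as usual
phrase_sentences = [bigram[s] for s in sentences]
```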

Given below is a snapshot of the most similar words for a list of input words, from a Word2Vec model trained on just 50 sentences of the Medium Articles dataset. The results are already pretty good, and with the entire dataset the model would perform even better.

Top-n similar words for a given context word. Word2Vec trained on Medium Articles dataset

Conclusion

Word Embeddings are very powerful and are the most widely used text representations for modelling tasks; they can make a huge difference in a variety of NLP problems. We had quite a detailed look at Word2Vec, but I still recommend a complete read of the Word2Vec paper to get a feel for how the architecture was developed and for the earlier work in the Word Embedding space. Word2Vec formed the base for later, more powerful word embeddings like GloVe and fastText. Share your thoughts on Word2Vec in the comments below and try your hand at training a Word2Vec model on your own custom dataset. I will leave some links below which can help you in implementing a Word2Vec.

  1. Natural Language Processing Classification Using Deep Learning And Word2Vec 
  2. Word2Vec Keras
  3. Word2Vec TensorFlow
  4. Word2Vec PyTorch

References

  1. Word2Vec: https://arxiv.org/pdf/1301.3781.pdf

Author

Pranav Raikote
