Hello and welcome back to the NLP Tutorials Series! Today we move forward on the road to becoming proficient in NLP and delve into Text Representation and Word Embeddings. Put simply, Text Representation is a way of converting text in its natural form into vectors of numbers, which is the only form machines can work with. It is the second step in an NLP pipeline, after Text Pre-processing. Let’s get started with a sample corpus, pre-process it, and keep it ready for Text Representation.
The various methods of Text Representation included in this article are:
- Bag of Words Model (CountVectorizer)
- Bag of n-Words Model (n-grams)
- Tf-Idf Model
- Word2Vec Embedding
- GloVe Embedding
- FastText Embedding
We are going to install the NLTK and spaCy packages, and download a few models and data files on top of the base packages.
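    # spaCy model names below are from spaCy 2.x; spaCy 3.x ships its vectors in en_core_web_lg instead
    !python -m spacy download en
    !python -m spacy download en_vectors_web_lg
    !pip install nltk

    import nltk
    nltk.download('punkt')
    nltk.download('wordnet')
    nltk.download('stopwords')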
Reading Data and Pre-processing Function
For this article, we are using the Medium Articles’ data. This dataset has a collection of randomly picked articles from a few popular publications like Towards Data Science, The Startup, UX Collective, Data Driven Investor etc. Download the dataset from this link.
    import pandas as pd

    data = pd.read_csv('medium_data.csv')
    data.head()
Now it’s time to process our data. We shall borrow some bits from the previous article and wrap them in a function that we can call once to process all our data.
    import re
    import unicodedata

    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer

    STOPWORDS = set(stopwords.words('english'))
    lemmatizer = WordNetLemmatizer()

    def process_text(text):
        # Strip accents and drop any remaining non-ASCII characters
        text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
        # Keep only letters and whitespace (this also removes punctuation and digits)
        text = re.sub(r'[^a-zA-Z\s]', '', text)
        text = text.lower()
        # Remove stop words
        text = " ".join(word for word in text.split() if word not in STOPWORDS)
        # Lemmatize each remaining word
        text = " ".join(lemmatizer.lemmatize(word) for word in text.split())
        return text
We will use the article titles as our corpus, so let’s quickly select only the title column from our dataframe.
    data_sentences = data['title'].to_list()
Defining and Processing the Corpus
Our next step is to define the corpus. Here, let’s take a subset of the titles we selected earlier and apply the processing function.
    import numpy as np

    corpus = data_sentences[:50]
    # print(corpus)

    # np.vectorize lets us apply process_text to every title in one call
    process_corpus = np.vectorize(process_text)
    processed_corpus = process_corpus(corpus)
    # print(processed_corpus)
At this point we have the corpus ready for the all-important step: converting text to number/vector formats for further applications.
Bag of Words Model: CountVectorizer
The bag-of-words model is a way of representing text data when modeling text with machine learning algorithms. Machine learning algorithms cannot work with raw text directly; the text must first be converted into well-defined, fixed-length vectors of numbers.
A bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things:
- A vocabulary of known words.
- A measure of the presence of known words.
It is called a bag-of-words because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document. The complexity comes both in deciding how to design the vocabulary of known words (or tokens) and how to score the presence of known words.
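To make the two ingredients concrete, here is a minimal pure-Python sketch of a bag-of-words. The two toy documents and the helper name `to_count_vector` are made up for illustration; they are not part of the Medium dataset or of any library.

```python
from collections import Counter

# Toy documents (made up for illustration, not taken from the Medium dataset)
docs = ["data science is fun", "science of data and data tools"]

# 1. A vocabulary of known words: every distinct token, sorted for a stable column order
vocabulary = sorted({word for doc in docs for word in doc.split()})

# 2. A measure of presence: how often each vocabulary word occurs in a document
def to_count_vector(doc, vocabulary):
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

bow_matrix = [to_count_vector(doc, vocabulary) for doc in docs]

print(vocabulary)
for row in bow_matrix:
    print(row)
```

Every document becomes a vector of the same length (the vocabulary size), and word order plays no role in the result.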
    from sklearn.feature_extraction.text import CountVectorizer

    count_vect = CountVectorizer(min_df=0., max_df=1.)
    matrix = count_vect.fit_transform(processed_corpus)
    # fit_transform returns a sparse matrix; convert it to a dense array for display
    matrix = matrix.toarray()
    matrix
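We can also get the vocabulary details:
    # get_feature_names_out() replaces get_feature_names(), which newer scikit-learn versions no longer provide
    vocabulary = count_vect.get_feature_names_out()
    pd.DataFrame(matrix, columns=vocabulary)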
Bag of Words Model: n-grams
Quickly moving on to the n-gram model: here the vocabulary is built from groups of words rather than single words, which lets the bag-of-words capture a little more meaning from the document. Each word or token is called a gram, and a vocabulary of two-word pairs gives a bigram model.
N-grams let us take into account phrases or collections of words that occur in a sequence. An n-gram is a collection of n word tokens from a text document such that the tokens are contiguous and occur in order. Bi-grams are n-grams of order 2 (two words), tri-grams are n-grams of order 3 (three words), and so on.
Example text: It was the best of times. Its bigrams are “it was”, “was the”, “the best”, “best of” and “of times”.
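A small sketch of how n-grams can be generated from the example text for any order n. The helper name `ngrams` is an illustrative choice, not a library function.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token sequences, joined into strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "It was the best of times".lower().split()

bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```

A text with k tokens yields k - n + 1 n-grams, so the six-word example gives five bigrams and four trigrams.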
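    # ngram_range=(2, 2) keeps only bigrams
    count_vect_n_gram = CountVectorizer(ngram_range=(2, 2))
    matrix = count_vect_n_gram.fit_transform(processed_corpus)
    matrix = matrix.toarray()
    matrix

    # get_feature_names_out() replaces get_feature_names(), which newer scikit-learn versions no longer provide
    vocabulary = count_vect_n_gram.get_feature_names_out()
    pd.DataFrame(matrix, columns=vocabulary)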
A quick thought here: we may want to consider the BoW (Bag of Words) model rather than Word Embeddings in the following situations:
- Building a baseline model. With scikit-learn, it takes just a few lines of code to build a model. Later on, we can use Deep Learning to improve on it.
- If your dataset is small and the context is domain-specific, BoW may work better than Word Embeddings. When the vocabulary is very domain-specific, many words may have no corresponding vector in pre-trained word embedding models (GloVe, fastText etc.).
Tf-Idf Model
Moving on to the most popular of the statistical Text Representation methods: Tf-Idf. Tf-Idf stands for Term Frequency-Inverse Document Frequency, and it combines two metrics in its computation: term frequency (tf) and inverse document frequency (idf), both explained below. The technique was developed for ranking search engine query results, and it is now an indispensable model in information retrieval and NLP.
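Mathematically, we can define Tf-Idf as tf-idf = tf × idf, which can be expanded as follows:
$$\mathrm{tfidf}(w, D) = \mathrm{tf}(w, D) \times \mathrm{idf}(w, D) = \mathrm{tf}(w, D) \times \log\left(\frac{C}{\mathrm{df}(w)}\right)$$
Here, tfidf(w, D) is the TF-IDF score for word w in document D.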
- The term tf(w, D) represents the term frequency of the word w in document D, which can be obtained from the Bag of Words model. Term frequency is simply the number of times a particular term occurs in a document.
- The term idf(w, D) is the inverse document frequency of the word w, computed as the log of the total number of documents in the corpus, C, divided by the document frequency of w. Document frequency, df(w), is the number of documents in the corpus that contain w. Inverse document frequency measures how informative a word is: it is low for words that occur in most documents and high for words that occur rarely.
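The definitions above can be worked through by hand on a tiny corpus. This is only a sketch of the plain formula: the toy corpus and the helper names `tf`, `df`, `idf` and `tfidf` are made up for illustration. scikit-learn's TfidfVectorizer smooths the idf and L2-normalises each row by default, so its numbers will differ from these.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration); each document is a list of tokens
corpus = [
    "data science is fun".split(),
    "data tools for science".split(),
    "fun with python".split(),
]

def tf(word, doc):
    # Term frequency: raw count of the word in one document
    return Counter(doc)[word]

def df(word, corpus):
    # Document frequency: number of documents that contain the word
    return sum(1 for doc in corpus if word in doc)

def idf(word, corpus):
    # log(C / df(w)); only defined for words that occur in at least one document
    return math.log(len(corpus) / df(word, corpus))

def tfidf(word, doc, corpus):
    return tf(word, doc) * idf(word, corpus)

# "data" appears in 2 of 3 documents, "python" in only 1, so "python" scores higher
print(tfidf("data", corpus[0], corpus))
print(tfidf("python", corpus[2], corpus))
```

A word that is absent from a document gets a score of 0 there because its term frequency is 0, regardless of how rare it is in the corpus.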
    from sklearn.feature_extraction.text import TfidfVectorizer

    tf_idf = TfidfVectorizer(min_df=0., max_df=1., use_idf=True)
    tf_idf_matrix = tf_idf.fit_transform(processed_corpus)
    tf_idf_matrix = tf_idf_matrix.toarray()
    tf_idf_matrix
So far we have seen how text can be represented statistically, and these techniques can give decent results on many datasets. However, they have a clear disadvantage: semantics and context are not captured at all, i.e. the meaning of the text is not modeled effectively by the above methods.
To push the limits and leverage massive deep learning models, we need a better, stronger and more robust representation of the corpus. These representations are called Embeddings. They are typically trained on massive datasets, and they capture the context of a word by taking into consideration a few other words around it. We will look into the 3 most prominent Word Embeddings: Word2Vec, GloVe and FastText.
Word2Vec Embedding
First up is the popular Word2Vec! It was created at Google in 2013 to generate high-quality, distributed, continuous dense vector representations of words that capture contextual and semantic similarity. Essentially, these are unsupervised models that take in massive textual corpora, create a vocabulary of possible words and generate a dense word embedding for each word in the vector space representing that vocabulary.
We can specify the size of the word embedding vectors, and the total number of vectors is essentially the size of the vocabulary. This makes the dimensionality of this dense vector space much lower than that of the high-dimensional sparse vector space built using traditional Bag of Words models.
There are two different model architecture types in Word2Vec. They are:
- CBOW (Continuous Bag of Words) Model
- Skip Gram Model
CBOW Model
The CBOW model architecture tries to predict the current target word (the center word) based on the source context words (surrounding words).
For example, let us take the sentence “The sky is blue and beautiful.” Here the context word can be [sky] and the target word can be [blue]; the number of context words is controlled by the window size.
We can also have multiple context words for a single target word. For example, [blue, beautiful] can be the context words for the target word [sky]. In this configuration, the window size is 2.
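As a rough sketch of what the CBOW training data looks like, the snippet below builds (context words, target word) pairs. It assumes the sentence has already been reduced to [sky, blue, beautiful] by stop-word removal, and the helper name `cbow_pairs` is hypothetical. It only prepares the pairs and does not train a model.

```python
def cbow_pairs(tokens, window):
    """Build (context_words, target_word) pairs for a CBOW-style model."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]      # up to `window` words before the target
        right = tokens[i + 1:i + 1 + window]     # up to `window` words after the target
        pairs.append((left + right, target))
    return pairs

# "The sky is blue and beautiful." after stop-word removal (an assumption for this sketch)
tokens = ["sky", "blue", "beautiful"]
pairs = cbow_pairs(tokens, window=2)

for context, target in pairs:
    print(context, "->", target)
```

With a window size of 2, the target word [sky] gets the context words [blue, beautiful], as in the example above.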
Skip Gram Model
The Skip-gram model architecture tries to achieve the reverse of what the CBOW model does: it predicts the source context words (surrounding words) given a target word (the center word).
For example, for the same sentence used in the CBOW model, given a target word [blue], it will try to predict the context word [sky].
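For Skip-gram, the training pairs are flipped: each target word is paired with every word in its window. The sketch below again assumes the stop-word-free tokens [sky, blue, beautiful], and `skipgram_pairs` is a hypothetical helper name. It only prepares the pairs and does not train a model.

```python
def skipgram_pairs(tokens, window):
    """Build (target_word, context_word) pairs for a Skip-gram-style model."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

# "The sky is blue and beautiful." after stop-word removal (an assumption for this sketch)
tokens = ["sky", "blue", "beautiful"]
pairs = skipgram_pairs(tokens, window=1)

for target, context in pairs:
    print(target, "->", context)
```

The pair (blue, sky) corresponds to the example above: given the target word [blue], the model is asked to predict the context word [sky].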
Moving on to the implementation, we shall use the gensim library, which has a very efficient and scalable implementation of Word2Vec.
    from gensim.models import word2vec

    tokenized_corpus = [nltk.word_tokenize(doc) for doc in processed_corpus]

    # Parameters for the Word2Vec model
    feature_size = 15     # word vector dimensionality
    window_context = 20   # context window size
    min_word_count = 1    # minimum word count
    sample = 1e-3         # downsample setting for frequent words
    skg = 1               # 1 = skip-gram; if not specified, the configuration is CBOW

    # gensim 4.x argument names; gensim 3.x called these size and iter
    w2v_model = word2vec.Word2Vec(tokenized_corpus, vector_size=feature_size,
                                  window=window_context, min_count=min_word_count,
                                  sg=skg, sample=sample, epochs=5000)
    w2v_model
Visualizing the data points
    import matplotlib.pyplot as plt
    %matplotlib inline

    # Visualize embeddings
    from sklearn.manifold import TSNE

    words = w2v_model.wv.index_to_key  # called index2word in gensim 3.x
    wvs = w2v_model.wv[words]

    # Project the embeddings to 2D (newer scikit-learn releases rename n_iter to max_iter)
    tsne = TSNE(n_components=2, random_state=42, n_iter=5000, perplexity=5)
    np.set_printoptions(suppress=True)
    T = tsne.fit_transform(wvs)
    labels = words

    plt.figure(figsize=(12, 6))
    plt.scatter(T[:, 0], T[:, 1], c='orange', edgecolors='r')
    for label, x, y in zip(labels, T[:, 0], T[:, 1]):
        plt.annotate(label, xy=(x+1, y+1), xytext=(0, 0), textcoords='offset points')

    print('Embedding')
    print(w2v_model.wv['ai'])
    print('Embedding Shape')
    print(w2v_model.wv['ai'].shape)
Visualizing the matrix
    vec_df = pd.DataFrame(wvs, index=words)
    vec_df

    import numpy as np
    from sklearn.metrics.pairwise import cosine_similarity

    similarity_matrix = cosine_similarity(vec_df.values)
    similarity_df = pd.DataFrame(similarity_matrix, index=words, columns=words)
    similarity_df
We can get the most similar words and map them to the main reference word:
    feature_names = np.array(words)
    # For each word, skip position 0 (the word itself) and keep the next three most similar words
    similarity_df.apply(lambda row: feature_names[np.argsort(-row.values)[1:4]], axis=1)
GloVe Embedding
GloVe stands for Global Vectors, a model used to obtain dense word vectors similar to Word2Vec. However, the technique is different: training is performed on an aggregated global word-word co-occurrence matrix, giving us a vector space with meaningful sub-structures. The method was developed at Stanford and is one of the most widely used word embeddings for NLP tasks. The team also published a paper, which is an excellent document for understanding GloVe in detail.
The basic methodology of the GloVe model is to first create a huge word-context co-occurrence matrix consisting of (word, context) pairs such that each element in this matrix represents how often a word occurs with the context (which can be a sequence of words). The idea then is to apply matrix factorization to approximate this matrix.
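To illustrate only the first step, here is a minimal sketch that counts word-context co-occurrences within a fixed window. The two toy sentences and the helper name `cooccurrence_counts` are made up for illustration. The GloVe paper additionally down-weights pairs that are further apart and then fits vectors to these counts, and both of those steps are left out here.

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window):
    """Count how often each (word, context_word) pair appears within `window` tokens."""
    counts = defaultdict(int)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            stop = min(len(tokens), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

# Toy tokenized sentences (made up for illustration)
sentences = [["sky", "blue", "beautiful"], ["blue", "sky", "today"]]
counts = cooccurrence_counts(sentences, window=1)

for (word, context), n in sorted(counts.items()):
    print(word, context, n)
```

Because the window is symmetric, the resulting counts are symmetric too: (sky, blue) and (blue, sky) are both counted twice across the two sentences.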
The spaCy framework comes with capabilities to leverage GloVe embeddings based on different language models. You can also get pre-trained word vectors and load them up as needed using gensim or spaCy. Here we get the standard 300-dimensional GloVe word vectors using spaCy.
    import spacy

    # en_vectors_web_lg is a spaCy 2.x package name
    nlp = spacy.load('en_vectors_web_lg')
    total_vectors = len(nlp.vocab.vectors)
    print('Total word vectors:', total_vectors)

    unique_words = list(set([word for sublist in tokenized_corpus for word in sublist]))
    word_glove_vectors = np.array([nlp(word).vector for word in unique_words])
    vec_df = pd.DataFrame(word_glove_vectors, index=unique_words)
    vec_df
Visualizing the data points
    tsne = TSNE(n_components=2, random_state=42, n_iter=5000, perplexity=3)
    np.set_printoptions(suppress=True)
    T = tsne.fit_transform(word_glove_vectors)
    labels = unique_words

    plt.figure(figsize=(12, 6))
    plt.scatter(T[:, 0], T[:, 1], c='red', edgecolors='r')
    for label, x, y in zip(labels, T[:, 0], T[:, 1]):
        plt.annotate(label, xy=(x+1, y+1), xytext=(0, 0), textcoords='offset points')

    import numpy as np
    from sklearn.metrics.pairwise import cosine_similarity

    similarity_matrix = cosine_similarity(vec_df.values)
    similarity_df = pd.DataFrame(similarity_matrix, index=unique_words, columns=unique_words)
    similarity_df

    feature_names = np.array(unique_words)
    similarity_df.apply(lambda row: feature_names[np.argsort(-row.values)[1:4]], axis=1)
FastText Embedding
The last method in this article, the FastText model, was first introduced by Facebook in 2016 as an extension and supposed improvement of the vanilla Word2Vec model. It is based on the paper titled ‘Enriching Word Vectors with Subword Information’ by Bojanowski et al. (with Mikolov as a co-author), which is an excellent read for an in-depth understanding of how the model works. Overall, FastText is a framework for learning word representations and for performing robust, fast and accurate text classification. The framework is open-sourced by Facebook on GitHub and claims to offer the following:
- Recent state-of-the-art English word vectors.
- Word vectors for 157 languages trained on Wikipedia and Crawl.
- Models for language identification and various supervised tasks.
The Word2Vec model typically ignores the morphological structure of each word and considers a word as a single entity. The FastText model considers each word as a Bag of Character n-grams. This is also called a subword model in the paper.
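A minimal sketch of the subword idea: the word is wrapped in boundary markers and split into character n-grams. The helper name `char_ngrams` is hypothetical, and the default range of 3 to 6 characters follows the setup described in the paper. In the paper, the full wrapped word is also kept as one extra unit, which this sketch does not add.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

# Character 3-grams only, for the word "where"
trigrams = char_ngrams("where", min_n=3, max_n=3)
print(trigrams)

# All n-grams from 3 to 6 characters
print(char_ngrams("where"))
```

Because a word's vector is built from its n-gram vectors, words that share subwords (such as "where" and "here") share parameters, and an unseen word can still be given a vector from its n-grams.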
    from gensim.models.fasttext import FastText

    # Various parameters
    feature_size = 15
    window_context = 20
    min_word_count = 1
    sample = 1e-3   # downsample setting for frequent words
    sg = 1          # 1 = skip-gram

    # gensim 4.x argument names; gensim 3.x called these size and iter
    ft_model = FastText(tokenized_corpus, vector_size=feature_size, window=window_context,
                        min_count=min_word_count, sg=sg, sample=sample, epochs=5000)
    ft_model
Visualizing the data points
    from sklearn.manifold import TSNE

    words = ft_model.wv.index_to_key  # called index2word in gensim 3.x
    wvs = ft_model.wv[words]

    tsne = TSNE(n_components=2, random_state=42, n_iter=5000, perplexity=5)
    np.set_printoptions(suppress=True)
    T = tsne.fit_transform(wvs)
    labels = words

    plt.figure(figsize=(12, 6))
    plt.scatter(T[:, 0], T[:, 1], c='green', edgecolors='k')
    for label, x, y in zip(labels, T[:, 0], T[:, 1]):
        plt.annotate(label, xy=(x+1, y+1), xytext=(0, 0), textcoords='offset points')
As a small assignment, try to obtain the Similarity Matrix and the set of similar words for a given reference word.
Given here is the code containing all the methods in a single python file.
That was quite a long ride, but we covered some of the most widely used methods for Text Representation! Text Representation is very important: we need to model the corpus in a vectorized form that captures and holds the meaning of the sentence or paragraph. This step is crucial, as it strongly influences the performance of anything you build (NLP projects/solutions) using these embeddings/representations.
I will leave you with the knowledge and the code, which are all yours to explore and experiment with. Pick a dataset of your choice, try out the various methods and get the best embeddings out of the corpus, because we are building something in the next article! Yes, you heard it right: time to build something from what we have learnt in our journey so far. Put down your thoughts in the comments below, share a link to your work and experiments, and tell us about your experience!