Hello again, and welcome back to the NLP Tutorials series, this time for an article on Text Classification. In our previous posts we took a detailed look at the fundamental text representations (CountVectorizer and Tf-Idf Vectorizer) and at the two most prominent word embeddings (Word2Vec and GloVe). In this article we will put that knowledge to the test: we will build text classification models using these techniques and analyse the results.

Text Classification

Text classification is a classic NLP problem and one of the basic stepping stones in the field. We shall use movie reviews labelled positive or negative as the dataset. This can also be called a very basic form of Sentiment Analysis, which is itself a kind of Text Classification. We will represent our text in 3 different ways and fit (train) a few ML/DL models.

Note: Don’t worry if you are not that proficient in ML/DL; I will share some nice resources for getting the basics right.

Defining Data and Preprocessing Function

We shall use the IMDB Movie Reviews Dataset for text classification, as it is a balanced dataset and does not need major data cleaning. It is one of the popular datasets for text classification. The dataset is available on Kaggle here. I have made a minor change (one-hot encoded the labels so that we can jump straight into basic pre-processing and then the representations), and you can download my version directly from this link.

import pandas as pd
data = pd.read_csv('IMDB_Reviews.csv')
Viewing the data

As we can see, it is already one-hot encoded and ready to use for classification, but we will first preprocess the text to improve the quality of the corpus and normalize it across the dataset. Given below is the function we are going to use for text pre-processing. (We used the same function in our Document Similarity article.)

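import re
import string

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Requires the NLTK 'stopwords' and 'wordnet' corpora (fetch them once with nltk.download)
STOPWORDS = stopwords.words('english')
lemmatizer = WordNetLemmatizer()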

def process_text(text):
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    text = text.lower().strip()
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = " ".join([word for word in str(text).split() if word not in STOPWORDS])
    text = " ".join([lemmatizer.lemmatize(word) for word in text.split()])

    return text

We apply this function to the entire review column in the dataframe:

data['Processed_Review'] = data['review'].map(process_text)
Processed data

Time to split the data into train and test sets. A good thing about sklearn’s train_test_split() function is that it shuffles the dataset and then splits it according to the test_size parameter. We can see the code below:

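import numpy as np
from sklearn.model_selection import train_test_split

# Text data (the column created by process_text above)
X = np.array(data['Processed_Review'])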

# Labels
#Y = np.array(data[['negative', 'positive']])
Y = np.array(data.sentiment.map({'positive':1,'negative':0}))

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.20, random_state = 42)

print('Shape of Training Data')
print(X_train.shape, Y_train.shape)
print('Shape of Testing Data')
print(X_test.shape, Y_test.shape)
Data in variable “X”. Processed corpus in numpy array format
Data in variable “Y”
Shape of training and testing data splits. We have allocated 20% of the total data for testing.

Text Representation: Count Vectorizer

from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer()

# Learn the vocabulary on the training split only, then reuse it for the test split
X_train_count_vector = count_vectorizer.fit_transform(X_train)
X_test_count_vector  = count_vectorizer.transform(X_test)
X_train_count_vector looks like this
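
A vectorizer has to learn its vocabulary before it can transform any text, and that learning should only ever see the training split. Here is a minimal pure-Python sketch of the idea. It is not how CountVectorizer is implemented; the helper names fit_vocabulary and transform and the toy sentences are made up for illustration, and the tokenization is a plain split() on whitespace.

```python
from collections import Counter

def fit_vocabulary(documents):
    """Learn a word -> column index mapping from the training documents only."""
    vocabulary = sorted({word for doc in documents for word in doc.split()})
    return {word: index for index, word in enumerate(vocabulary)}

def transform(documents, vocabulary):
    """Turn each document into a row of word counts; unseen words are ignored."""
    rows = []
    for doc in documents:
        counts = Counter(doc.split())
        rows.append([counts.get(word, 0) for word in vocabulary])
    return rows

# Hypothetical toy corpus
train_docs = ["good movie", "bad movie bad plot"]
test_docs = ["good plot twist"]

vocab = fit_vocabulary(train_docs)         # columns: bad, good, movie, plot
train_rows = transform(train_docs, vocab)  # [[0, 1, 1, 0], [2, 0, 1, 1]]
test_rows = transform(test_docs, vocab)    # [[0, 1, 0, 1]], "twist" is dropped

print(vocab)
print(train_rows)
print(test_rows)
```

Note how "twist" simply disappears from the test row: the test split is described only in terms of words the model saw during training, which is exactly what we want.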

Text Representation: Tf-Idf Vectorizer

from sklearn.feature_extraction.text import TfidfVectorizer

tf_idf_vectorizer = TfidfVectorizer(ngram_range=(1, 3))

# Fit the vocabulary and idf weights on the training split only
X_train_tfidf_vector = tf_idf_vectorizer.fit_transform(X_train)
X_test_tfidf_vector  = tf_idf_vectorizer.transform(X_test)
X_train_tfidf_vector looks like this

With the basic Text Representation done, we shall train the basic ML models — Logistic Regression and Naive Bayes on both sets of data and see the results.

Count Vectorizer data — Logistic Regression

from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression()
classifier.fit(X_train_count_vector, Y_train)

score = classifier.score(X_test_count_vector, Y_test)
print("Accuracy:", score)

Accuracy: 0.8864

Count Vectorizer data — Naive Bayes

from sklearn.naive_bayes import MultinomialNB

classifier = MultinomialNB()
classifier.fit(X_train_count_vector, Y_train)

score = classifier.score(X_test_count_vector, Y_test)
print("Accuracy:", score)

Accuracy: 0.8617

Tf-Idf data — Logistic Regression

from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression()
classifier.fit(X_train_tfidf_vector, Y_train)

score = classifier.score(X_test_tfidf_vector, Y_test)
print("Accuracy:", score)

Accuracy: 0.8876

Tf-Idf data — Naive Bayes

from sklearn.naive_bayes import MultinomialNB

classifier = MultinomialNB()
classifier.fit(X_train_tfidf_vector, Y_train)

score = classifier.score(X_test_tfidf_vector, Y_test)
print("Accuracy:", score)

Accuracy: 0.8921

Pretty good results from both Count Vectorized and Tf-Idf Vectorized methods. Tf-Idf seems to slightly edge the former!
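
The four experiments above all follow the same recipe: vectorize, fit, score. As a rough sketch, they can be run in one loop with scikit-learn's make_pipeline. The tiny dataset below is made up so that the snippet runs on its own (swap in X_train, Y_train, X_test and Y_test from the article), and the vectorizers use default settings rather than the ngram_range used earlier, so the printed numbers say nothing about IMDB.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the processed reviews and their labels
train_texts = ["great film loved it", "wonderful acting great story",
               "terrible plot boring film", "awful acting hated it"]
train_labels = [1, 1, 0, 0]
test_texts = ["loved the wonderful story", "boring and awful"]
test_labels = [1, 0]

vectorizers = {"count": CountVectorizer, "tfidf": TfidfVectorizer}
classifiers = {"logreg": LogisticRegression, "nb": MultinomialNB}

results = {}
for vec_name, vec_cls in vectorizers.items():
    for clf_name, clf_cls in classifiers.items():
        # The pipeline fits the vectorizer on the training texts only,
        # then reuses that vocabulary when scoring the test texts
        pipeline = make_pipeline(vec_cls(), clf_cls())
        pipeline.fit(train_texts, train_labels)
        results[(vec_name, clf_name)] = pipeline.score(test_texts, test_labels)

for combo, accuracy in results.items():
    print(combo, accuracy)
```

A pipeline also makes it hard to forget the fit step or to accidentally fit on the test split, since both happen inside fit() and score().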

Higher Level Embedding: GloVe

Find the detailed explanation on GloVe here (NLP Tutorials: Part 5 — GloVe). We have to load the GloVe pre-trained embedding and initialize the matrix with the tokenizer we have used to tokenize the corpora. Then we are ready to use the GloVe embedding for classification.

We will do this iteratively: 

  1. Baseline Deep Learning model with Keras Embedding layer
  2. Deep Learning model with GloVe matrix in the Embedding layer
  3. Deep Learning model with GloVe matrix in the Embedding layer which can be trained i.e. the matrix will be altered in the training process

Now, we need to initialize a few things to use GloVe and Deep Neural Networks for Text Classification. We need to tokenize the corpus and set some hyperparameters. First things first, we need the labels in one-hot encoded form for the neural network. Let’s get started on the code then.

X = np.array(data['Processed_Review'])
Y = np.array(data[['positive', 'negative']])

Once we have the data ready, let’s tokenize using tensorflow tokenizer and then split the data into train and test sets.

# Word-vector Size
embed_size = 300

# Unique Words
max_features = 10000

# No of words per document
max_len = 250

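from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words = max_features)
# Build the vocabulary before converting the reviews to integer sequences
tokenizer.fit_on_texts(X)

list_tokenized_train = tokenizer.texts_to_sequences(X)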
X = pad_sequences(list_tokenized_train, maxlen = max_len)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.20, random_state = 42)

print('Shape of Training Data')
print(X_train.shape, Y_train.shape)
print('Shape of Testing Data')
print(X_test.shape, Y_test.shape)
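
If you are wondering what the tokenizer and pad_sequences actually do to a review, here is a small pure-Python sketch of the idea. It only mimics the behaviour and is not the Keras implementation: the function names are made up, ties between equally frequent words are broken alphabetically (an assumption of this sketch), and padding and truncation both happen at the front, which is the default for pad_sequences.

```python
from collections import Counter

def build_word_index(texts):
    """Assign 1-based ids by descending frequency; id 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.split())
    ordered = sorted(counts, key=lambda word: (-counts[word], word))
    return {word: rank for rank, word in enumerate(ordered, start=1)}

def to_sequences(texts, word_index, num_words):
    """Replace words with ids, dropping words whose id is num_words or higher."""
    return [[word_index[w] for w in text.split()
             if w in word_index and word_index[w] < num_words]
            for text in texts]

def pad(sequences, maxlen):
    """Left-pad with zeros and keep only the last maxlen ids of longer sequences."""
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]
        padded.append([0] * (maxlen - len(seq)) + seq)
    return padded

# Hypothetical toy corpus
texts = ["good movie good plot twist", "bad movie"]

word_index = build_word_index(texts)  # good=1, movie=2, bad=3, plot=4, twist=5
sequences = to_sequences(texts, word_index, num_words=5)  # "twist" (id 5) is dropped
padded = pad(sequences, maxlen=3)

print(sequences)  # [[1, 2, 1, 4], [3, 2]]
print(padded)     # [[2, 1, 4], [0, 3, 2]]
```

The first review is too long and loses its first word, while the second is too short and gets a leading zero. In the article, max_features plays the role of num_words and max_len the role of maxlen.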

Time to load the GloVe pre-trained embeddings. You can find the file here. We are going to use the glove.6B.300d.txt file. 6B means the vectors were trained on a corpus of 6 billion tokens, and 300d means each embedding vector has 300 dimensions, i.e. shape (1, 300).

GLOVE_FILE = 'glove.6B.300d.txt'

def get_coefs(word, *arr):
    return word, np.asarray(arr, dtype = 'float32')

embeddings_index = dict(get_coefs(*o.strip().split()) for o in open(GLOVE_FILE, encoding = 'utf-8'))

# Stack the vectors into one array; recent NumPy versions need a list here, not a dict view
all_embs = np.stack(list(embeddings_index.values()))
emb_mean, emb_std = all_embs.mean(), all_embs.std()

word_index = tokenizer.word_index
nb_words = min(max_features, len(word_index))

embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))

We will look up the words in the word_index and get the embedding_vectors from the GloVe matrix and create an embedding_matrix for the data we have. Basically the pre-trained embeddings are like a lookup table and we sort of create a matrix out of it for the data/words we have.

for word, i in word_index.items():
    if i >= max_features:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
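
To see the lookup-table idea at a small scale, here is a sketch that runs without downloading anything. The three-word "GloVe file", the 4-d vectors and the word_index are all made up; only the line format ("word v1 v2 ... vd") and the loop mirror what we do with glove.6B.300d.txt.

```python
import io
import numpy as np

# Hypothetical three-word "GloVe file" with 4-d vectors
fake_glove = io.StringIO(
    "movie 0.1 0.2 0.3 0.4\n"
    "good 0.5 0.6 0.7 0.8\n"
    "bad -0.5 -0.6 -0.7 -0.8\n"
)

embeddings_index = {}
for line in fake_glove:
    word, *values = line.split()
    embeddings_index[word] = np.asarray(values, dtype="float32")

# Tokenizer-style word index (1-based); "zzyzx" has no pre-trained vector
word_index = {"movie": 1, "good": 2, "zzyzx": 3, "bad": 4}
max_features, embed_size = 4, 4

# Start from small random values so that words without a vector are not all zeros
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(0.0, 0.1, (max_features, embed_size)).astype("float32")
initial_row_3 = embedding_matrix[3].copy()

for word, i in word_index.items():
    if i >= max_features:
        continue  # outside the vocabulary cap, so it has no row
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector

print(embedding_matrix)
```

Rows 1 and 2 now hold the pre-trained vectors for "movie" and "good", row 3 keeps its random initialization because "zzyzx" is not in the lookup table, and "bad" is skipped because its index falls outside max_features.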

Baseline DL Model: Keras Embedding

The most basic model uses just an Embedding layer that learns the word vectors on its own during training. We are not using GloVe here.

from tensorflow.keras.models import Sequential
from tensorflow.keras import layers

model = Sequential()
model.add(layers.Embedding(input_dim = max_features,
                           output_dim = embed_size,
                           input_length = max_len))
# Assumed step: the listing has no layer that collapses the sequence axis,
# so pool over it to get one vector per review before the Dense layers
model.add(layers.GlobalMaxPool1D())
model.add(layers.Dense(10, activation='relu'))
model.add(layers.Dense(2, activation='softmax'))
model.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
Model Summary — Default Keras Embedding

Model hyperparameters are defined below:

epochs = 10
batch_size = 64

from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping, ReduceLROnPlateau

checkpoint = ModelCheckpoint('Models/Text-Classify.h5', save_best_only = True, monitor = 'val_loss', mode = 'min', verbose = 1)

early_stop = EarlyStopping(monitor = 'val_loss', min_delta = 0.001, patience = 5, mode = 'min', verbose = 1, restore_best_weights = True)

reduce_lr = ReduceLROnPlateau(monitor = 'val_loss', factor = 0.3, patience = 3, min_delta = 0.001, mode = 'min', verbose = 1)

history = model.fit(X_train, Y_train, epochs = epochs, batch_size = batch_size, validation_split = 0.15, callbacks = [checkpoint, early_stop, reduce_lr])
Model trained well!

We reached an accuracy of 0.87 on the validation dataset! Not bad. We set the training parameter to minimize the val_loss and stop training after a patience of 5 epochs if the val_loss is not decreasing. 
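
The patience logic is easy to get wrong in your head, so here is a small pure-Python sketch of it. This is only an illustration of patience-based stopping with a min_delta threshold, not the Keras source code, and the validation losses are invented.

```python
def early_stopping_epoch(val_losses, patience=5, min_delta=0.001):
    """Return (stop_epoch, best_epoch) as 0-based epoch indices."""
    best_loss = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # A real improvement: remember it and reset the counter
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            # No improvement (or one smaller than min_delta)
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# Hypothetical validation losses, one per epoch
losses = [0.60, 0.45, 0.40, 0.41, 0.3995, 0.42, 0.43, 0.44, 0.45, 0.46]
stop_epoch, best_epoch = early_stopping_epoch(losses)
print(stop_epoch, best_epoch)  # 7 2
```

The loss of 0.3995 at epoch index 4 is lower than the best so far, but by less than min_delta, so it does not count as an improvement. Training stops at epoch index 7, after five epochs without real improvement, and restore_best_weights = True would then bring back the weights from epoch index 2.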

One thing to note here: we are using the validation_split parameter in model.fit(), which automatically holds out that fraction of the training data as validation data. We have kept X_test and Y_test as the final test data, so after training the model you can evaluate it on this data.

loss, acc = model.evaluate(X_test, Y_test)
print('Final Validation Accuracy', acc*100)
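
It helps to know which samples validation_split holds out: Keras takes them from the end of the arrays you pass to fit(), before any shuffling. A rough pure-Python sketch of that slicing, with a made-up helper name and a list of 20 integers standing in for X_train:

```python
import math

def split_for_validation(samples, validation_split):
    """Hold out the trailing fraction of the samples, without shuffling first."""
    split_at = int(math.floor(len(samples) * (1.0 - validation_split)))
    return samples[:split_at], samples[split_at:]

train_pool = list(range(20))  # stands in for X_train
fit_part, val_part = split_for_validation(train_pool, validation_split=0.15)

print(len(fit_part), len(val_part))  # 17 3
print(val_part)                      # [17, 18, 19]
```

So with validation_split = 0.15 the model trains on the first 85% of X_train and reports val_loss on the last 15%, while X_test is never touched during training. This works nicely here because train_test_split has already shuffled the data.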

Finally plotting the accuracy & loss graphs.

import matplotlib.pyplot as plt

plt.plot(history.history["loss"], label = "Training Loss")
plt.plot(history.history["val_loss"], label = "Validation Loss")
plt.plot(history.history["accuracy"], label = "Training Accuracy")
plt.plot(history.history["val_accuracy"], label = "Validation Accuracy")
plt.legend()
plt.show()
Loss-Accuracy Graph

DL Model: LSTM + GloVe Embedding

Now, in order to learn the sequences in an effective manner, we need to upgrade our model from a simple densely connected network to an LSTM layered network. More on LSTM here. (Don’t worry about these terms; in our next article we will go through all these architectures in detail.)

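from tensorflow.keras.layers import Input, Embedding, LSTM, GlobalMaxPool1D, Dense, Dropout
from tensorflow.keras.models import Model

inp = Input(shape = (max_len,))

x = Embedding(max_features, embed_size, weights = [embedding_matrix], trainable = False)(inp)
x = LSTM(256, return_sequences = True, recurrent_dropout = 0.2)(x)
x = GlobalMaxPool1D()(x)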
x = Dense(100, activation = 'relu')(x)
x = Dropout(0.2)(x)
x = Dense(2, activation = 'sigmoid')(x)

model = Model(inputs = inp, outputs = x)
model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])
The parameters have increased, so will the training time.

We pass the embedding_matrix through the weights parameter of the Embedding layer, so we are now using pre-trained embeddings. Note how we have set the trainable parameter to False. (In our final iteration, we can set trainable = True and get slightly better results.) Setting trainable = True works best when you have a really custom dataset, where the network can slowly adapt the embeddings as well and optimize better.

This is how we use the GloVe pre-trained embeddings for Text Classification. Rest of the steps are similar to the Baseline model training. I won’t repeat the code for training and viewing the accuracy & loss graphs. We shall see how it trained.

Training process
Loss-Accuracy Graph

As we see, more or less it is the same, but it took a longer time to train. LSTMs are quite computationally intensive 😛 

Note: I’m showing all the possible iterations you can go through for a text classification problem. If one of the ML models or the basic Keras Embedding model already gives you the results you need, you can stop at that point. Iteratively scaling up the architecture and using pre-trained embeddings is for when we are not able to reach our desired results. The best way to train heavier models is to use Google Colab. With large training sets, the hardware accelerators available there make training the heavier models much faster than on a CPU alone.

Given here is the complete code in a single python script.


Phew, that was a long tutorial! I hope you were able to learn how to build an effective text classification model with a variety of approaches and word representations. If you were able to implement this correctly, you can work with the major datasets out there and successfully build a text classification model. There will be challenges, but what we discussed in this article gives a roadmap of steps and approaches to try on the way to the final solution. You will find many datasets on Kaggle; make sure to practice on 1 or 2 of them before moving on to the next problem statement/concept.

In our next article, we will do a deep dive into the various architectures tailored for text (which can also be called sequences) like RNNs, LSTMs and GRUs. What a journey it has been! We have ascended the ranks and reached the Deep Learning architectures. Hang on tight, we are very near to the Transformer architecture 😛 which is a buzzword in NLP and now in Computer Vision also! Seems interesting, right? Well, all you have to do is wait for the next article, keep practicing the previous implementations and revising the concepts, and you will be proficient in NLP in no time. Adios.


11 Most Common Machine Learning Algorithms Explained in a Nutshell

A summary of common machine learning algorithms. (towardsdatascience.com)

9 Key Machine Learning Algorithms Explained in Plain English

Machine learning is changing the world. Google uses machine learning to suggest search results to users. Netflix uses… (www.freecodecamp.org)


Pranav Raikote
