Hello again, glad to welcome you back to this article on Text Classification in the NLP Tutorials series. In our previous posts we took a detailed look at the fundamental text representations, CountVectorizer and Tf-Idf Vectorizer, as well as the two most prominent word embeddings, Word2Vec and GloVe. In this article we will put that knowledge to work: we will build a text classification model using all of these techniques and analyse the results.
Text Classification
Text classification is the classic NLP problem and one of the basic stepping stones in the field. We shall use movie reviews labelled positive or negative as the dataset. You could also call this basic Sentiment Analysis, which is itself a form of text classification. We will represent our text in 3 different ways and fit (train) a few ML/DL models on each representation.
Note: Don’t worry if you are not that proficient in ML/DL; I will share some nice resources for getting the basics right.
Defining Data and Preprocessing Function
We shall use the IMDB Movie Reviews Dataset for text classification: it is balanced, and it does not require major data cleaning. It is one of the most popular datasets for text classification. The dataset is available on Kaggle here. I have made one minor change (performed one-hot encoding of the sentiment labels so that we can jump straight into basic pre-processing and then the representations), and you can download my version directly from this link.
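If you prefer to start from the original Kaggle CSV, here is a minimal sketch of that one-hot encoding step, assuming the file has the usual review and sentiment columns (the file name here is an assumption):
import pandas as pd

# Hypothetical recreation of the one-hot step on the original Kaggle CSV,
# assuming 'review' and 'sentiment' (positive/negative) columns exist.
raw = pd.read_csv('IMDB Dataset.csv')
one_hot = pd.get_dummies(raw['sentiment'])   # adds 'negative' and 'positive' columns
data = pd.concat([raw, one_hot], axis=1)
data.to_csv('IMDB_Reviews.csv', index=False)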
import pandas as pd
data = pd.read_csv('IMDB_Reviews.csv')
data.head(10)

As we can see, the labels are already one-hot encoded and ready for classification, but we will still preprocess the text to improve the quality of the corpus and normalize it across the dataset. Given below is the function we are going to use for text pre-processing. (We used the same function in our Document Similarity article.)
import re
import string
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

STOPWORDS = stopwords.words('english')
lemmatizer = WordNetLemmatizer()

def process_text(text):
    text = re.sub(r'[^a-zA-Z\s]', '', text)                            # keep only letters and whitespace
    text = text.lower().strip()                                        # lowercase and trim
    text = text.translate(str.maketrans('', '', string.punctuation))   # strip any remaining punctuation
    text = " ".join([word for word in text.split() if word not in STOPWORDS])   # drop stopwords
    text = " ".join([lemmatizer.lemmatize(word) for word in text.split()])      # lemmatize each word
    return text
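As a quick sanity check, here is what the function does to a made-up review (the sample text is just an illustration):
sample = "This movie was ABSOLUTELY amazing!!! The actors were brilliant."
print(process_text(sample))
# movie absolutely amazing actor brilliant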
We apply this function to the entire review column in the dataframe:
data['Processed_Review'] = data['review'].map(process_text)

Time to split the data into train and test sets. A good thing about using sklearn’s train_test_split() function is that it shuffles the dataset before splitting according to the test_size parameter. We can see the code below:
import numpy as np
from sklearn.model_selection import train_test_split

# Text data
X = np.array(data['Processed_Review'])
# Labels: map the sentiment strings to 1/0
Y = np.array(data.sentiment.map({'positive': 1, 'negative': 0}))

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.20, random_state = 42)
print('Shape of Training Data')
print(X_train.shape, Y_train.shape)
print('Shape of Testing Data')
print(X_test.shape, Y_test.shape)

Text Representation: Count Vectorizer
from sklearn.feature_extraction.text import CountVectorizer
count_vectorizer = CountVectorizer()
count_vectorizer.fit(X_train)
X_train_count_vector = count_vectorizer.transform(X_train)
X_test_count_vector = count_vectorizer.transform(X_test)
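A quick way to sanity-check the fitted vectorizer is to look at the learned vocabulary size and the shape of the resulting document-term matrix (the exact numbers will depend on your preprocessing):
print(len(count_vectorizer.vocabulary_))   # number of unique tokens learned from X_train
print(X_train_count_vector.shape)          # (n_train_documents, vocabulary_size), a sparse matrix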

Text Representation: Tf-Idf Vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
tf_idf_vectorizer = TfidfVectorizer(ngram_range=(1, 3))
tf_idf_vectorizer.fit(X_train)
X_train_tfidf_vector = tf_idf_vectorizer.transform(X_train)
X_test_tfidf_vector = tf_idf_vectorizer.transform(X_test)
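Because we set ngram_range=(1, 3), the Tf-Idf vocabulary contains unigrams, bigrams and trigrams, so the feature space is far larger than the CountVectorizer's unigram-only vocabulary. You can confirm this the same way:
print(len(tf_idf_vectorizer.vocabulary_))  # unigrams + bigrams + trigrams
print(X_train_tfidf_vector.shape)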

With the basic text representations done, we shall train two basic ML models, Logistic Regression and Naive Bayes, on both sets of features and see the results.
Count Vectorizer data — Logistic Regression
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
classifier.fit(X_train_count_vector, Y_train)
score = classifier.score(X_test_count_vector, Y_test)
print("Accuracy:", score)
Accuracy: 0.8864
Count Vectorizer data — Naive Bayes
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()
classifier.fit(X_train_count_vector, Y_train)
score = classifier.score(X_test_count_vector, Y_test)
print("Accuracy:", score)
Accuracy: 0.8617
Tf-Idf data — Logistic Regression
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
classifier.fit(X_train_tfidf_vector, Y_train)
score = classifier.score(X_test_tfidf_vector, Y_test)
print("Accuracy:", score)
Accuracy: 0.8876
Tf-Idf data — Naive Bayes
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()
classifier.fit(X_train_tfidf_vector, Y_train)
score = classifier.score(X_test_tfidf_vector, Y_test)
print("Accuracy:", score)
Accuracy: 0.8921
Pretty good results from both the Count Vectorized and Tf-Idf Vectorized features. Tf-Idf seems to edge out the former slightly, with Naive Bayes on Tf-Idf features scoring highest at 0.8921!
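Accuracy alone can hide class-level behaviour, so it is worth printing a fuller report for whichever model you settle on. A sketch using sklearn's built-in metrics, shown here for the last fitted classifier (the Tf-Idf Naive Bayes model):
from sklearn.metrics import classification_report, confusion_matrix

Y_pred = classifier.predict(X_test_tfidf_vector)
print(confusion_matrix(Y_test, Y_pred))   # rows: true class, columns: predicted class
print(classification_report(Y_test, Y_pred, target_names=['negative', 'positive']))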
Higher Level Embedding: GloVe
Find the detailed explanation on GloVe here (NLP Tutorials: Part 5 — GloVe). We have to load the pre-trained GloVe embeddings and build an embedding matrix using the tokenizer with which we tokenized the corpus. Then we are ready to use the GloVe embeddings for classification.
We will do this iteratively:
- Baseline Deep Learning model with Keras Embedding layer
- Deep Learning model with GloVe matrix in the Embedding layer
- Deep Learning model with the GloVe matrix in the Embedding layer set as trainable, i.e. the matrix will be updated during training
Now, we need to initialize a few things to facilitate usage of GloVe and deep neural networks for text classification. We need to tokenize the corpus and set some hyperparameters. First things first, we need the labels in one-hot encoded form for the neural network. Let’s get started on the code then.
X = np.array(data['Processed_Review'])
Y = np.array(data[['positive', 'negative']])
Once we have the data ready, let’s tokenize it using the Keras Tokenizer and then split the data into train and test sets.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Word-vector size
embed_size = 300
# Maximum number of unique words to keep
max_features = 10000
# Maximum number of words per document
max_len = 250

tokenizer = Tokenizer(num_words = max_features)
tokenizer.fit_on_texts(list(X))
list_tokenized_train = tokenizer.texts_to_sequences(X)
X = pad_sequences(list_tokenized_train, maxlen = max_len)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.20, random_state = 42)
print('Shape of Training Data')
print(X_train.shape, Y_train.shape)
print('Shape of Testing Data')
print(X_test.shape, Y_test.shape)
Time to load the GloVe pre-trained embeddings. You can find the file here. We are going to use the glove.6B.300d.txt file: 6B means the vectors were trained on 6 billion tokens, and 300d means each embedding is a 300-dimensional vector (1, 300).
GLOVE_FILE = 'glove.6B.300d.txt'

def get_coefs(word, *arr):
    return word, np.asarray(arr, dtype = 'float32')

# Build a dict mapping each word to its 300-d GloVe vector
embeddings_index = dict(get_coefs(*o.strip().split()) for o in open(GLOVE_FILE, encoding = 'utf-8'))

# Mean and std of all embedding values, used to initialize words not found in GloVe
all_embs = np.hstack(list(embeddings_index.values()))
emb_mean, emb_std = all_embs.mean(), all_embs.std()

word_index = tokenizer.word_index
nb_words = min(max_features, len(word_index))
embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))
We will look up each word in the word_index, fetch its embedding vector from GloVe, and fill in an embedding_matrix for our own vocabulary. Essentially, the pre-trained embeddings act as a lookup table, and we build a matrix out of it for the words in our data; words not found in GloVe keep their random initialization.
for word, i in word_index.items():
    if i >= max_features:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
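It can be useful to check how much of your vocabulary was actually found in GloVe; a large number of misses usually means your preprocessing and the embedding vocabulary disagree. A small sketch:
hits = sum(1 for word, i in word_index.items()
           if i < max_features and word in embeddings_index)
print(f'{hits} of {nb_words} words covered by GloVe vectors')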
Baseline DL Model: Keras Embedding
This is the most basic model: just a Keras Embedding layer that tries to learn representations on its own. We are not using GloVe here.
from tensorflow.keras.models import Sequential
from tensorflow.keras import layers

model = Sequential()
model.add(layers.Embedding(input_dim = max_features,
                           output_dim = embed_size,
                           input_length = max_len))
model.add(layers.Flatten())
model.add(layers.Dense(10, activation='relu'))
model.add(layers.Dense(2, activation='softmax'))
# categorical_crossentropy matches the 2-unit softmax output and one-hot labels
model.compile(optimizer = 'adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()
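As a sanity check on the summary, the Embedding layer alone should account for max_features × embed_size = 10,000 × 300 = 3,000,000 parameters, and the first Dense layer for (250 × 300) × 10 + 10 = 750,010, since Flatten produces a 250 × 300 = 75,000-dimensional vector.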

Training hyperparameters and callbacks are defined below:
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping, ReduceLROnPlateau

epochs = 10
batch_size = 64

checkpoint = ModelCheckpoint('Models/Text-Classify.h5', save_best_only = True, monitor = 'val_loss', mode = 'min', verbose = 1)
early_stop = EarlyStopping(monitor = 'val_loss', min_delta = 0.001, patience = 5, mode = 'min', verbose = 1, restore_best_weights = True)
reduce_lr = ReduceLROnPlateau(monitor = 'val_loss', factor = 0.3, patience = 3, min_delta = 0.001, mode = 'min', verbose = 1)

history = model.fit(X_train, Y_train, epochs = epochs, batch_size = batch_size, validation_split = 0.15, callbacks = [checkpoint, early_stop, reduce_lr])

We reached an accuracy of 0.87 on the validation set! Not bad. The callbacks are set to minimize val_loss and to stop training if val_loss has not decreased for 5 epochs (the patience).
One thing to note here: the validation_split parameter in model.fit() automatically holds out that fraction of the training data as validation data. We have kept X_test and Y_test aside as a final held-out set; after training the model, you can evaluate it on this data.
loss, acc = model.evaluate(X_test, Y_test)
print('Final Validation Accuracy', acc*100)
Finally, we plot the accuracy and loss curves.
plt.plot(history.history["loss"], label = "Training Loss")
plt.plot(history.history["val_loss"], label = "Validation Loss")
plt.plot(history.history["accuracy"], label = "Training Accuracy")
plt.plot(history.history["val_accuracy"], label = "Validation Accuracy")
plt.legend()
plt.grid()
plt.show()

DL Model: LSTM + GloVe Embedding
Now, in order to learn the sequential structure of the text effectively, we need to upgrade our model from a simple densely connected network to an LSTM-based network. More on LSTMs here. (Don’t worry about these terms; in our next article we will cover all of these architectures in detail.)
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Embedding, LSTM, GlobalMaxPool1D, Dense, Dropout

inp = Input(shape = (max_len,))
x = Embedding(max_features, embed_size, weights = [embedding_matrix], trainable = False)(inp)
x = LSTM(256, return_sequences = True, recurrent_dropout = 0.2)(x)
x = GlobalMaxPool1D()(x)
x = Dense(100, activation = 'relu')(x)
x = Dropout(0.2)(x)
x = Dense(2, activation = 'sigmoid')(x)

model = Model(inputs = inp, outputs = x)
model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])
model.summary()
We pass the embedding_matrix via the weights parameter of the Embedding layer, so now we are using pre-trained embeddings. Note how we set the trainable parameter to False. (In our final iteration, we can set trainable = True and usually get slightly better results.) Setting trainable = True works best when you have a very domain-specific dataset, where letting the network slowly adjust the embeddings enables better optimization of the network as a whole.
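One common way to do this safely is a two-phase schedule: train with the embeddings frozen first, then unfreeze them and continue with a lower learning rate so the pre-trained vectors are not destroyed early on. A minimal sketch (the learning rate and epoch counts here are illustrative assumptions, not tuned values):
from tensorflow.keras.optimizers import Adam

# Phase 1: train with the GloVe weights frozen (as above)
model.fit(X_train, Y_train, epochs = 5, batch_size = 64, validation_split = 0.15)

# Phase 2: unfreeze the Embedding layer and fine-tune with a lower learning rate
model.layers[1].trainable = True   # index 1 is the Embedding layer in this model
model.compile(loss = 'binary_crossentropy', optimizer = Adam(learning_rate = 1e-4), metrics = ['accuracy'])
model.fit(X_train, Y_train, epochs = 3, batch_size = 64, validation_split = 0.15)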
This is how we use the GloVe pre-trained embeddings for text classification. The rest of the steps are the same as for the baseline model, so I won’t repeat the code for training and plotting the accuracy and loss graphs. We shall see how it trained.


As we can see, the accuracy is more or less the same, but training took much longer. LSTMs are quite computationally intensive 😛
Note: I’m showing all the possible iterations you can go through for a text classification problem. You might already achieve the desired results with one of the ML models or the basic Keras Embedding model; you can stop at that point. Scaling the architecture iteratively and bringing in pre-trained embeddings is for when the simpler approaches don’t reach the results you need. The best way to train the heavier models is Google Colab: on large training sets, the extra compute makes the training process much faster.
Given here is the complete code in a single Python script.
Conclusion
Phew, that was a long tutorial! I hope you learned how to build an effective text classification model using a variety of approaches and word representations. If you can implement all of this correctly, you can take on most of the major datasets out there and successfully build a text classification model. There will be challenges, but what we discussed in this article gives you a roadmap of steps and approaches to try on the way to a final solution. You will find many datasets on Kaggle; make sure to practice on one or two of them before moving on to the next problem statement/concept.
In our next article, we will do a deep dive into the various architectures tailored for text (which can also be treated as sequences): RNNs, LSTMs and GRUs. What a journey it has been; we have ascended the ranks and reached the deep learning architectures. Hang on tight, we are very near to the Transformer architecture 😛 which is a buzzword in NLP and now in Computer Vision too! Seems interesting, right? All you have to do is wait for the next article, keep practicing the previous implementations and keep revising the concepts, and you will be proficient in NLP in no time. Adios.
Resources
11 Most Common Machine Learning Algorithms Explained in a Nutshell (towardsdatascience.com)
9 Key Machine Learning Algorithms Explained in Plain English
Author
Pranav Raikote