Welcome back to the NLP Tutorials! In our previous posts we took a detailed look at Text Representation & Word Embeddings, which are ways to convert text into vector form. A corpus in vector form is easy to store and access, and can be used downstream to solve the NLP problem at hand. In this article, we shall try our hand at a small NLP problem: Document Similarity, also called Text Similarity. Without wasting much time, let’s quickly get started!

Document Similarity

Document similarity is a measure of how similar two documents or groups of texts are to each other. In this article we treat movie descriptions as documents, convert them into vectors and compare the vectors with a similarity measure. Whatever similarity measure we pick, the emphasis is on the vectors/features, that is, on how well we can represent the documents. We shall create 2 different text representations: CountVectorizer & Tf-Idf Vectorizer.

Defining Data and Preprocessing Function

We shall use the TMDB Movie Dataset, as it has attributes like tagline, overview (a short description of the movie) and keywords, which are perfect for a Document Similarity implementation. Download the dataset here.

import pandas as pd

data = pd.read_csv('tmdb_5000_movies.csv')

# Let's also take care of the null values present in the data
data.fillna('', inplace = True)
Viewing the data

For us, the important columns are original_title and overview. Moving on to the preprocessing, we have two approaches –  the first is to join title and overview and then apply the preprocessing, and the second approach is to consider only the overview. In this article we will consider only the overview data in our NLP pipeline for document similarity. Given below is the preprocessing function.

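import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Requires the NLTK 'stopwords' and 'wordnet' corpora (see nltk.download)
STOPWORDS = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def process_text(text):
    # Keep only letters and whitespace
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    text = text.lower().strip()
    # Drop stopwords, then lemmatize what remains
    text = " ".join([word for word in text.split() if word not in STOPWORDS])
    text = " ".join([lemmatizer.lemmatize(word) for word in text.split()])
    return text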
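
To see what the cleaning steps do to a single sentence, here is a minimal sketch that runs without NLTK. The tiny stopword set, the function name toy_process_text and the sample sentence are assumptions made up for illustration, and lemmatization is left out, so the output will differ from what process_text produces on the real data.

```python
import re

# Hypothetical mini stopword list, standing in for NLTK's English list
TOY_STOPWORDS = {"a", "an", "the", "is", "of", "and", "to", "in"}

def toy_process_text(text):
    # Remove everything except letters and whitespace (digits and punctuation go)
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Lowercase and trim surrounding whitespace
    text = text.lower().strip()
    # split() with no argument also collapses repeated spaces
    return " ".join(word for word in text.split() if word not in TOY_STOPWORDS)

sample = "In 2012, the Avengers saved New York!"
cleaned = toy_process_text(sample)
print(cleaned)  # avengers saved new york
```

Note that the regex removes digits as well as punctuation, so a year like 2012 disappears from the overview entirely.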

We can apply the preprocessing function to an entire column in just a single line of code:

data['processed_overview'] = data['overview'].map(process_text)

# Also, we shall keep only the 4 columns needed for our problem statement
data = data[['title', 'overview', 'processed_overview', 'tagline']]
Processed Data
Now it’s time to generate the Document Representations. For a detailed explanation on the Text Representation methods head over to the previous article, NLP Tutorials — Part 2: Text Representation & Word Embeddings


# First let us get the processed data 
data_list = data['processed_overview'].to_list()

from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer(min_df = 0., max_df = 1.)

count_vect_matrix = count_vect.fit_transform(data_list)

print(count_vect_matrix.shape)
# Output - (4803, 20449)

Now let’s create a similarity matrix using sklearn’s cosine_similarity.

from sklearn.metrics.pairwise import cosine_similarity

count_doc_sim = cosine_similarity(count_vect_matrix)

# Let us create a dataframe out of this matrix for easy retrieval of data
count_doc_sim_df = pd.DataFrame(count_doc_sim)
Similarity Matrix generated for the CountVectorizer features
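
To make the two steps concrete, the sketch below builds count vectors and a cosine similarity matrix by hand for three toy documents, using only the standard library. The documents and helper names are made up for illustration. It splits on whitespace, which is simpler than CountVectorizer's default tokenizer, so treat it as an approximation of the idea rather than a reimplementation of sklearn.

```python
import math
from collections import Counter

# Toy corpus (hypothetical), already lowercased and cleaned
docs = [
    "hero saves city",
    "hero saves world",
    "chef opens restaurant",
]

# Vocabulary: every distinct word in the corpus, in sorted order
vocab = sorted({word for doc in docs for word in doc.split()})

def count_vector(doc):
    # One dimension per vocabulary word, holding that word's count in the doc
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

def cosine(u, v):
    # Dot product divided by the product of the vector lengths
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_u * norm_v)

vectors = [count_vector(doc) for doc in docs]
sim = [[cosine(u, v) for v in vectors] for u in vectors]

print(round(sim[0][1], 3))  # 0.667, two of three words shared
print(sim[0][2])            # 0.0, no words shared
```

The first two documents share two of their three words and score 2/3, while the third shares nothing and scores 0. The zero-vector guard matters for the real data too: overviews that were null and filled with '' produce all-zero rows.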

Now it’s time to write some code for getting data from this similarity matrix!

Getting index given a movie title

movies = data['title'].to_list()

movie_idx = movies.index("Captain America: Civil War")
# Output - 26

Getting the specific row from the dataframe

movie_similarities = count_doc_sim_df.iloc[movie_idx].values

# Output - array([0.04564355, 0.04428074, 0.03450328, ..., 0.02635231, 0. , 0. ])

Getting the similar movies’ indices (Top 5)

import numpy as np

# Sort by descending similarity; position 0 is the movie itself, so skip it
similar_movie_idxs = np.argsort(-movie_similarities)[1:6]

# Output - array([  85, 4489,  653, 2433, 4009])
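
The [1:6] slice relies on the query movie ranking first in its own row. That usually holds, but it can fail if another movie has an identical overview, or if the query's overview is empty and its whole row is 0. A minimal sketch of a variant that excludes the query index explicitly is given below; the function name top_k_similar and the sample scores are hypothetical, and plain Python is used instead of NumPy to keep it self-contained.

```python
def top_k_similar(similarities, query_idx, k=5):
    # Rank all indices by descending similarity (ties keep their original order)
    ranked = sorted(range(len(similarities)),
                    key=lambda i: similarities[i],
                    reverse=True)
    # Drop the query itself by index rather than by position, then keep k
    return [i for i in ranked if i != query_idx][:k]

# Hypothetical similarity row for the movie at index 1
sims = [0.10, 1.00, 0.45, 0.00, 0.45, 0.80]
print(top_k_similar(sims, query_idx=1, k=3))  # [5, 2, 4]
```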

Getting the movie titles from the obtained indices

similar_movies = []
for i in similar_movie_idxs:
    similar_movies.append(movies[i])

print(similar_movies)
# Output - ['Captain America: The Winter Soldier',  'Escape from Tomorrow',  'This Means War',  'Superman IV: The Quest for Peace',  "2016: Obama's America"]

Let us package these lines of code into a single function for reuse.

def get_similar_document(movie_title, similarity_matrix):
    index = movies.index(movie_title)
    sim = similarity_matrix.iloc[index].values
    sim_index = np.argsort(-sim)[1:6]
    similar_movies = []
    for i in sim_index:
        similar_movies.append(movies[i])
    return similar_movies

# Now it will be easy to get the similar_docs given a title and the similarity matrix
get_similar_document("Captain America: Civil War", count_doc_sim_df)
# Output - ['Captain America: The Winter Soldier',  'Escape from Tomorrow',  'This Means War',  'Superman IV: The Quest for Peace',  "2016: Obama's America"]

Encouraging results, but we can improve them by using an advanced vectorizer — Tf-Idf Vectorizer.

Tf-Idf Vectorizer

from sklearn.feature_extraction.text import TfidfVectorizer

tf_idf = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
tfidf_matrix = tf_idf.fit_transform(data_list)
print(tfidf_matrix.shape)
# Output - (4803, 19082)
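
For intuition on why Tf-Idf can rank documents differently from raw counts, the sketch below computes Tf-Idf weights by hand for three toy documents. It follows the smoothed formula scikit-learn documents as its default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization. It is simplified to unigrams with whitespace splitting, so it does not reproduce the ngram_range=(1, 2) and min_df=2 settings used above, and the corpus and helper names are made up for illustration.

```python
import math
from collections import Counter

# Toy corpus (hypothetical); "hero" appears in every document
docs = [
    "hero saves city",
    "hero saves world",
    "hero opens restaurant",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)
vocab = sorted({word for tokens in tokenized for word in tokens})

# Document frequency: in how many documents each word appears
df = Counter(word for tokens in tokenized for word in set(tokens))

def idf(term):
    # Smoothed inverse document frequency
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf_vector(tokens):
    counts = Counter(tokens)
    weights = [counts[term] * idf(term) for term in vocab]
    # L2-normalize so every non-empty document has unit length
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights] if norm else weights

for term in ("hero", "saves", "city"):
    print(term, round(idf(term), 3))
```

A word that occurs in every document ("hero") gets the lowest weight, while a word unique to one document ("city") gets the highest. With raw counts, both would contribute equally to the similarity score.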

from sklearn.metrics.pairwise import cosine_similarity

tf_doc_sim = cosine_similarity(tfidf_matrix)
tf_doc_sim_df = pd.DataFrame(tf_doc_sim)
Similarity Matrix generated for the Tf-Idf features
get_similar_document("Captain America: Civil War", tf_doc_sim_df)

# Output - ['Captain America: The Winter Soldier',  'X-Men: The Last Stand',  'This Means War',  'Iron Man 2',  'Escape from Tomorrow']

These results aren’t bad, but they aren’t great either; they are only slightly better than the CountVectorizer approach. What if we consider only the title for getting the similar docs? Maybe the overview attribute has a lot of noise which is hindering performance. Did we preprocess it too much? Can we tune the hyperparameters of CountVectorizer & Tf-Idf Vectorizer? All of these are possible directions for improvement. Given here is the complete code in a single Python script.


We had a brief overview of Document Similarity and learnt how to put together a decent Similar Document Fetcher, or a similar movie fetcher in our example! Document Similarity is essentially a method to get similar documents from a given set of documents. We extended that to a movie dataset, where we get the similar movies from a database of movies. Can we call our implementation a Movie Recommender? Sure, we can.

Well, this is just the beginning, and there is a lot of scope for improving this initial prototype. We can improve the preprocessing and the text representation, for example by using advanced Word Embeddings for a better representation. (Try with the data shared in the previous article of the NLP Tutorial series.) Head over to the previous blogs for more info on Text Processing & Text Representations, experiment with various settings and build a better Document Similarity system.

Meanwhile, think about where we can effectively use this system. It surely has quite a few good applications. Put your ideas in the comments below and sit tight for the next blog post on Word Embeddings in detail, where we revisit the two most popular Word Embeddings: Word2Vec & GloVe.


Pranav Raikote
