Hello and welcome back to the NLP Tutorials blog post series. This time we cover an interesting topic from the unsupervised learning side of NLP: Topic Modelling. Topic Modelling is a process in which an algorithm automatically detects the topics occurring in a given text extract or document. It separates the topics in a vector space and exposes them as ‘topic clusters’, so that we can view the terms contributing to each topic. One major application is a system that finds the latest trending or emerging topics in a big corpus.
The formal definition goes like this: Topic Modelling is a process to automatically identify the topics present in a text object and to derive the hidden patterns exhibited by a text corpus. This is very useful in document clustering, organization of big data, information retrieval systems and, to some extent, feature selection. We can also use topic modelling for unsupervised classification/segregation, provided the data is reasonably concise and clean. There are a few popular algorithms for Topic Modelling: Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), Probabilistic Latent Semantic Analysis (PLSA), etc. We shall study the LDA algorithm and attempt to implement it on a small dataset.
Latent Dirichlet Allocation
The LDA algorithm works with two pieces of information, as shown below:
- Words belonging to a document (that we already know)
- Words belonging to a topic or the Probability of a word belonging to a topic
The actual algorithm works as follows:
- Randomly assign each word to one of the k topics. The assumption made here is that we know the number of topics beforehand.
- For each word w, compute:
  - P(topic t | doc d), which is the proportion of words in document d that are currently assigned to topic t;
  - P(word w | topic t), which is the proportion of assignments to topic t, over all docs d, that come from word w.
- Update P(word w with topic t) = P(topic t | doc d) * P(word w | topic t), and reassign w to a topic according to this probability.
The first step measures how many words in a given document currently belong to the topic. If a large share of the document's words belong to the topic, then the probability of the current word belonging to that topic is also high. The second step measures how strongly the word is tied to the topic across all documents, that is, how much of the topic is accounted for by this word. If a word has a high probability of being in a topic, then documents containing that word will also be more strongly associated with that topic.
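The steps above can be sketched in a few lines of plain Python. This is only a toy illustration of the simplified description given here: it uses no Dirichlet priors and does not exclude the current word from the counts, so it is not how scikit-learn or other libraries implement LDA. The function name simple_lda and the toy documents are made up for this example.

```python
import random
from collections import Counter


def simple_lda(docs, k, n_iter=20, seed=0):
    """Toy sampler for the simplified steps described in the text (hypothetical helper)."""
    rng = random.Random(seed)
    # Step 1: randomly assign every word occurrence to one of the k topics
    assignments = [[rng.randrange(k) for _ in doc] for doc in docs]

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Count topic assignments over the whole corpus
                topic_totals = Counter()
                word_in_topic = Counter()
                for dd, other in enumerate(docs):
                    for j, w in enumerate(other):
                        t = assignments[dd][j]
                        topic_totals[t] += 1
                        if w == word:
                            word_in_topic[t] += 1

                doc_topics = Counter(assignments[d])
                weights = []
                for t in range(k):
                    # P(topic t | doc d): share of this document's words assigned to t
                    p_topic_given_doc = doc_topics[t] / len(doc)
                    # P(word w | topic t): share of topic t's assignments that are word w
                    p_word_given_topic = (
                        word_in_topic[t] / topic_totals[t] if topic_totals[t] else 0.0
                    )
                    weights.append(p_topic_given_doc * p_word_given_topic)

                # Step 3: reassign the word, sampling in proportion to the product
                assignments[d][i] = rng.choices(range(k), weights=weights)[0]
    return assignments


toy_docs = [
    ["tree", "mountain", "field", "tree"],
    ["car", "building", "road", "car"],
    ["tree", "building", "car"],
]
topic_assignments = simple_lda(toy_docs, k=2)
for doc, topics in zip(toy_docs, topic_assignments):
    print(list(zip(doc, topics)))
```

The topic numbers themselves carry no meaning; what matters is which words end up sharing a topic.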
Let’s have a look at this concept with a simple example.
Let the topics be Nature and City, and the sentence be “Tree is in front of the building and behind the car”. Let’s assume that this sentence currently falls under the Nature topic. The P(topic | doc) value for Nature will be low, since building and car are usually found in the city. On the contrary, P(word | topic) for tree under Nature will be very high, because tree usually co-occurs with words like mountains, fields, beach, etc., which also fall under the Nature topic. In this sentence, however, tree appears alongside building and car, so its association tilts slightly towards the City topic. If we consider the counts, tree is a single word for the Nature topic while building and car are two words for the City topic. This results in an average probability for Nature and a high probability for City, so the example sentence ultimately gets assigned the topic City.
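The same example can be worked through with numbers. In the sketch below, the word-to-topic assignments and the P(word | topic) values are invented purely for illustration; they are not estimated from any corpus.

```python
from collections import Counter

# Hypothetical current assignments for the content words of the example sentence
assignments = {"tree": "nature", "building": "city", "car": "city"}

# P(topic | doc): share of the document's words currently assigned to each topic
counts = Counter(assignments.values())
p_topic_given_doc = {t: c / len(assignments) for t, c in counts.items()}

# P(word | topic) for "tree": made-up illustrative values
p_tree_given_topic = {"nature": 0.20, "city": 0.01}

# Score for assigning "tree" to each topic: product of the two factors
tree_scores = {t: p_topic_given_doc[t] * p_tree_given_topic[t] for t in p_topic_given_doc}

# The document as a whole leans towards the topic holding most of its words
dominant_topic = max(p_topic_given_doc, key=p_topic_given_doc.get)

print(p_topic_given_doc)
print(tree_scores)
print(dominant_topic)
```

With these assumed numbers, tree on its own still scores higher for Nature, but two of the three words sit in City, so the sentence as a whole leans towards City.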
Applying LDA on a dataset
Let’s try to implement a basic LDA and observe how good the topics are, given a good enough dataset with a variety of topics. Download the dataset here. It is a collection of articles published on the Medium platform.
# This library is very good for visualizing an LDA output in an interactive way
!pip install pyLDAvis

import pandas as pd

df = pd.read_csv('medium.csv')
df.head(10)
We need to clean and preprocess our data before building the representation and training LDA. The representation technique used will be a TF-IDF vectorizer.
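import nltk

# One-time downloads of the tokenizer model and the stopword list
# (newer NLTK releases may ask for 'punkt_tab' instead of 'punkt')
nltk.download('punkt')
nltk.download('stopwords')

def preprocess(df):
    stopwords = set(nltk.corpus.stopwords.words('english'))
    df['title_process'] = df['title'].astype(str)
    df['title_process'] = df['title_process'].apply(lambda x: x.lower())
    # Remove URLs before tokenizing, and keep the result (the original call discarded it)
    df['title_process'] = df['title_process'].str.replace(r'https?://.*[\r\n]*', '', regex=True)
    df['title_process'] = df['title_process'].apply(lambda x: nltk.word_tokenize(x))
    df['title_process'] = df['title_process'].apply(lambda x: [item for item in x if item not in stopwords])
    df['title_process'] = df['title_process'].apply(lambda x: " ".join(x))
    # regex=True is needed on pandas >= 2.0, where str.replace matches literally by default
    df['title_process'] = df['title_process'].str.replace(r'\d+', '', regex=True)
    df['title_process'] = df['title_process'].str.replace(r'[^\w\s]', '', regex=True)
    return df

df_data_science = preprocess(df)
df_data_science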
Onto text representation.
from sklearn.feature_extraction.text import TfidfVectorizer

tf_idf = TfidfVectorizer()
# Vectorize the cleaned titles produced by preprocess(), not the raw "title" column
doc_term_matrix = tf_idf.fit_transform(df_data_science["title_process"].values)
doc_term_matrix
We will get an output like this (the exact shape depends on the text that is vectorized): <6508x7770 sparse matrix of type '<class 'numpy.float64'>' with 49592 stored elements in Compressed Sparse Row format>
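To make that output easier to read, here is a small sketch on three made-up titles (not taken from the Medium dataset). Rows of the matrix are documents, columns are vocabulary terms, and only the non-zero TF-IDF weights are stored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical titles, used only to illustrate the shape of the output
toy_titles = [
    "topic models find topics",
    "neural networks learn features",
    "topics and features",
]

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_titles)

print(toy_matrix.shape)  # (number of documents, number of vocabulary terms)
print(toy_matrix.nnz)    # number of stored non-zero weights
print(sorted(toy_vectorizer.vocabulary_))  # the learned vocabulary
```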
Now, we are ready to train an LDA model for this data.
from sklearn.decomposition import LatentDirichletAllocation

# LDA requires us to specify the number of topics, so that is a hyperparameter to tweak.
number_topics = 3
number_words = 25

LDA = LatentDirichletAllocation(n_components=number_topics, n_jobs=-1)
LDA.fit(doc_term_matrix)

def print_topics(model, vectorizer, n_top_words):
    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
    words = vectorizer.get_feature_names_out()
    for topic_idx, topic in enumerate(model.components_):
        print("\nTopic #%d:" % topic_idx)
        # argsort is ascending, so this slice picks the n_top_words highest-weighted terms
        print(" ".join([words[i] for i in topic.argsort()[:-n_top_words - 1:-1]]))

print("Topics:")
print_topics(LDA, tf_idf, number_words)
As we can see in the above image, the topics are quite good: we can loosely label Topic #0 as something related to front-end or UX design, Topic #1 as data science/AI/machine learning, and Topic #2 as self-help or motivational articles.
Given more data and better hyperparameter tuning, the results can surely be improved. The default LDA algorithm has two main hyperparameters:
- alpha — Mixture of topics allowed for any document
- beta — Distribution of words per topic
(Usually, both are set to <1)
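In scikit-learn, these two hyperparameters correspond to the doc_topic_prior (alpha) and topic_word_prior (beta) arguments of LatentDirichletAllocation. The sketch below sets both explicitly on a tiny made-up corpus; the values 0.1 and 0.01 are arbitrary starting points for illustration, not tuned recommendations.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical documents, used only to show where the priors are set
toy_docs = [
    "machine learning with python",
    "deep learning for images",
    "design better user interfaces",
    "user experience design tips",
]
toy_matrix = TfidfVectorizer().fit_transform(toy_docs)

lda_tuned = LatentDirichletAllocation(
    n_components=2,
    doc_topic_prior=0.1,    # alpha: smaller values favour fewer topics per document
    topic_word_prior=0.01,  # beta: smaller values favour fewer dominant words per topic
    random_state=0,
)
lda_tuned.fit(toy_matrix)

# The fitted model exposes the priors it actually used
print(lda_tuned.doc_topic_prior_, lda_tuned.topic_word_prior_)
```

When these arguments are left unset, scikit-learn falls back to 1 / n_components for both.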
Given here is the complete code in a single python script.
Hope you were able to follow this concept, which is very interesting and quite simple at the algorithmic level. Try building an LDA topic model for a custom dataset and see which underlying topics emerge, or whether you can build a trend identifier or emerging-topic detector. LDA is not restricted to the text domain; it has also been adapted to images, to extract the scene-specific context of an image. I will leave you with an observation I made while learning this concept: if the number of topics is less than the number of documents, LDA will act as a Dimensionality Reduction algorithm. Am I right or wrong? Put down in the comments below what you think.
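One way to explore this question yourself is to compare the shape of the data before and after LDA. The sketch below uses a tiny made-up corpus and assumes scikit-learn's fit_transform, which returns one row per document and one column per topic.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical documents, for illustration only
toy_docs = [
    "deep learning for image recognition",
    "machine learning with python",
    "design better user interfaces",
    "user experience and interface design",
    "habits that improve your life",
    "how to stay motivated every day",
]

X = TfidfVectorizer().fit_transform(toy_docs)
lda_small = LatentDirichletAllocation(n_components=3, random_state=0)
doc_topics = lda_small.fit_transform(X)

# Each document goes from one value per vocabulary term to one value per topic
print(X.shape, "->", doc_topics.shape)
# Each row is a distribution over topics, so it sums to 1
print(np.round(doc_topics.sum(axis=1), 6))
# Index of the highest-weighted topic for each document
print(doc_topics.argmax(axis=1))
```

Note that the reduction seen here is from the number of vocabulary terms to the number of topics for each document.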