Introduction
Welcome back to another article in the NLP Tutorials series! Continuing our quest towards mastery in NLP, we will be looking at an exciting application of NLP: text summarization. With data being generated at a massive scale, we often want a short overview of a document rather than its full length. This is where text summarization plays a key role, condensing a document into a concise form. It is a challenging problem because it depends on cognitive ability, language understanding and domain knowledge. In this article, we will briefly go over the types of text summarization and implement a basic model ourselves using the NLP concepts and libraries we have come across so far.
Text Summarization
There are two main types of text summarization:
- Extractive summarization
- Abstractive summarization
Extractive summarization is the more passive method: it scores every sentence by its importance within the text and builds the summary from the top n sentences. Put another way, sentences are extracted from the document and arranged so that they read coherently.
The algorithm first converts the text into an internal representation using basic embedding methods. A saliency metric then scores each sentence on how important it is within the given text. The best-scoring sentences are extracted and joined together to form the summary. Deep learning models have also been used to learn the sentence representations and the ranking jointly. Such a system is easy to build and the results are pretty decent. Because it only extracts sentences that already exist in the document, there is little need to worry about grammar; the sentences are already well formed. Abstractive summarization is much harder: it involves rephrasing sentences and generating new ones that stay aligned with the core information of the text.
Abstractive summarization aims to generate the summary text from contextual information using more advanced NLP. The system has to take care of both syntax and semantics while generating the summary. The sentences in the summary may or may not appear in the original document, which makes the problem harder to solve and to evaluate. We can draw a parallel to sequence-to-sequence modelling, where an encoder encodes the input and a decoder generates linguistically fluent text. A well-known model, the Neural Attention Model for Abstractive Sentence Summarization (NAMAS), was released in 2015 and used an attention mechanism for abstractive sentence summarization. Other implementations have since surfaced, built on RNNs, CNNs, reinforcement learning algorithms and so on.
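To make the extractive steps concrete (represent, score, select), here is a minimal sketch. It assumes a plain bag-of-words representation and uses cosine similarity to the whole document as the saliency metric; both choices, the function names and the toy sentences are illustrative assumptions, not the method used later in this article.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Represent text as a word -> count mapping (a very basic representation)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def extractive_summary(sentences, n=2):
    """Keep the n sentences most similar to the whole document, in original order."""
    doc_vec = bag_of_words(" ".join(sentences))
    ranked = sorted(sentences, key=lambda s: cosine(bag_of_words(s), doc_vec), reverse=True)
    chosen = set(ranked[:n])
    return [s for s in sentences if s in chosen]


# Hypothetical toy input, only for illustration
toy_sentences = [
    "Cats sleep most of the day.",
    "Cats hunt mice at night.",
    "Rain fell.",
]
print(extractive_summary(toy_sentences, n=2))
```

The off-topic third sentence shares no words with the rest of the text, so it receives the lowest saliency score and is left out.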
Now that we have a fair understanding of text summarization, its types and the challenges involved, let’s build a basic text summarization system.
Implementing a Text Summarizer using basic NLP tools
We start by defining our text document and the text preprocessing steps. The document is an extract from the Wikipedia article on Formula One.
import re
import nltk
import heapq
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
text = """ Formula One (also known as Formula 1 or F1) is the highest class of international auto racing for single-seater formula racing cars sanctioned by the Fédération Internationale de l'Automobile (FIA). The World Drivers' Championship, which became the FIA Formula One World Championship in 1981, has been one of the premier forms of racing around the world since its inaugural season in 1950. The word formula in the name refers to the set of rules to which all participants' cars must conform. A Formula One season consists of a series of races, known as Grands Prix, which take place worldwide on both purpose-built circuits and closed public roads.
The results of each race are evaluated using a points system to determine two annual World Championships: one for drivers, the other for constructors. Each driver must hold a valid Super Licence, the highest class of racing licence issued by the FIA. The races must run on tracks graded "1" (formerly "A"), the highest grade-rating issued by the FIA. Most events occur in rural locations on purpose-built tracks, but several events take place on city streets.
Formula One cars are the fastest regulated road-course racing cars in the world, owing to very high cornering speeds achieved through the generation of large amounts of aerodynamic downforce. The cars underwent major changes in 2017, allowing wider front and rear wings, and wider tyres, resulting in peak cornering forces near 6.5 lateral g and top speeds of around 350 km/h (215 mph). As of 2021, the hybrid engines are limited in performance to a maximum of 15,000 rpm; the cars are very dependent on electronics and aerodynamics, suspension and tyres. Traction control, launch control, and automatic shifting, plus other electronic driving aids, were first banned in 1994, reintroduced in 2001, and have more recently been banned since 2004 and 2008, respectively.
While Europe is the sport's traditional base, the championship operates globally, with 13 of the 23 races in the 2021 season taking place outside Europe. With the annual cost of running a mid-tier team – designing, building, and maintaining cars, pay, transport – being US$120 million, its financial and political battles are widely reported. Its high profile and popularity have created a major merchandising environment, which has resulted in large investments from sponsors and budgets (in the hundreds of millions for the constructors). On 23 January 2017, Liberty Media confirmed the completion of the acquisition of Delta Topco, the company that controls Formula One, from private-equity firm CVC Capital Partners for $8 billion."""
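# Work on a cleaned copy for word counting; `text` itself is left untouched
# so that the extracted sentences keep their original casing and digits.
clean_text = text.lower()
clean_text = re.sub(r'\[[0-9]*\]', ' ', clean_text)  # drop citation markers such as [1]
clean_text = re.sub(r'\d+', ' ', clean_text)         # drop numbers
clean_text = re.sub(r'\s+', ' ', clean_text)         # collapse repeated whitespace
Now we can split the text into sentences with NLTK’s sentence tokenizer and count how often each non-stopword occurs.
# One-time setup: nltk.download('punkt') and nltk.download('stopwords')
# (recent NLTK releases may ask for 'punkt_tab' instead of 'punkt').
sentences = nltk.sent_tokenize(text)
STOPWORDS = set(nltk.corpus.stopwords.words('english'))
words = {}
for w in nltk.word_tokenize(clean_text):
    if w not in STOPWORDS:
        if w not in words:
            words[w] = 1
        else:
            words[w] += 1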
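As an aside, the same counting can be written more compactly with collections.Counter from the standard library. The sketch below uses a hypothetical mini stopword list and plain whitespace tokenization purely to stay self-contained; it is an alternative way to count, not the code used above.

```python
from collections import Counter

# Hypothetical mini stopword list and tokens (assumptions for this sketch only)
stop = {"the", "is", "a", "of"}
tokens = "the car is the fastest car of the season".split()

# Counter builds the word -> count mapping in a single pass
word_counts = Counter(tok for tok in tokens if tok not in stop)
print(word_counts["car"])
```

A Counter behaves like a dictionary, so it could be used wherever a word-count dictionary is expected.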
With the word counts in place, we can score each sentence as the sum of the counts of the words it contains, skipping sentences of 35 or more words.
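sent = {}
for s in sentences:
    # Skip very long sentences so they do not dominate the summary
    if len(s.split(' ')) < 35:
        for w in nltk.word_tokenize(s.lower()):
            # Stopwords and other uncounted tokens are not in `words`
            if w in words:
                sent[s] = sent.get(s, 0) + words[w]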
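To see the arithmetic behind the scoring loop above, here is a small worked example. The word counts and sentences are hypothetical toy data, and the helper name score_sentence is an assumption introduced only for this sketch; it also uses whitespace splitting instead of NLTK to stay self-contained.

```python
# Hypothetical toy counts, not taken from the Formula One text
word_counts = {"race": 3, "car": 2, "fast": 1}
MAX_WORDS = 35  # same length cap as in the scoring loop


def score_sentence(sentence, counts, max_words=MAX_WORDS):
    """Sum the counts of known words; return None for over-long sentences."""
    tokens = sentence.lower().split()
    if len(tokens) >= max_words:
        return None
    return sum(counts.get(tok, 0) for tok in tokens)


print(score_sentence("The race car is fast", word_counts))  # 3 + 2 + 1 = 6
print(score_sentence("The car stopped", word_counts))       # 2
```

Because the score is a plain sum, longer sentences tend to score higher, which is one reason for capping the sentence length.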
Next, we load these scores into a DataFrame so the sentences can be ranked by score.
df = pd.DataFrame.from_dict(sent, orient = 'index')
df.rename(columns = {0: 'score'}, inplace = True)
df = df.sort_values(by = 'score', ascending = False)  # sort_values returns a new DataFrame
Finally, we pick the top-scoring sentences with the heapq library’s nlargest function and print them in their original order as the summary of the document.
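# Rank using the scores in `sent`; iterating over the DataFrame would yield column names, not sentences
top = heapq.nlargest(5, sent, key = sent.get)
for s in sentences:
    if s in top:
        print(s)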
## OUTPUT
A Formula One season consists of a series of races, known as Grands Prix, which take place worldwide on both purpose-built circuits and closed public roads. Each driver must hold a valid Super Licence, the highest class of racing licence issued by the FIA. Formula One cars are the fastest regulated road-course racing cars in the world, owing to very high cornering speeds achieved through the generation of large amounts of aerodynamic downforce. While Europe is the sport's traditional base, the championship operates globally, with 13 of the 23 races in the 2021 season taking place outside Europe.
The summary is quite good, and we can experiment with how the sentences are scored, for example with TF-IDF weighting, which may give more accurate results. Next, let’s leverage the power of pre-trained transformers and see how well they summarize the same document.
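Before moving on, here is a rough sketch of what TF-IDF sentence scoring could look like. It treats every sentence as its own "document" and scores a sentence by the mean TF-IDF weight of its distinct words; that formulation, the function names and the toy sentences are assumptions made for illustration, and other variants are equally valid.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase the sentence and keep alphabetic tokens only."""
    return re.findall(r"[a-z']+", sentence.lower())


def tfidf_sentence_scores(sentences):
    """Return one score per sentence: the mean TF-IDF weight of its distinct words."""
    tokenized = [tokenize(s) for s in sentences]
    n = len(tokenized)
    # Document frequency: in how many sentences does each word occur?
    doc_freq = Counter(w for toks in tokenized for w in set(toks))
    scores = []
    for toks in tokenized:
        if not toks:
            scores.append(0.0)
            continue
        tf = Counter(toks)
        total = sum((count / len(toks)) * math.log(n / doc_freq[w]) for w, count in tf.items())
        scores.append(total / len(tf))
    return scores


# Hypothetical toy input
toy = ["the car is fast", "the car is red", "the driver won the race"]
for sentence, score in zip(toy, tfidf_sentence_scores(toy)):
    print(round(score, 3), sentence)
```

Words that occur in every sentence (here "the") get an inverse document frequency of zero, so they stop inflating the score the way raw counts do.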
Text Summarizer using Hugging Face library
Using transformers for summarization is straightforward. With a few lines of code we can generate a very good summary, as shown below. Since no model is specified, the pipeline downloads a default pre-trained summarization model.
from transformers import pipeline
summarize = pipeline('summarization')
# Keep the result in its own variable instead of overwriting `text`
summary = summarize(text, min_length = 100, max_length = 300)
print(summary[0]['summary_text'])
## OUTPUT
Formula One (also known as Formula 1 or F1) is the highest class of international auto racing for single-seater formula racing cars sanctioned by the Fédération Internationale de l'Automobile (FIA) The World Drivers' Championship, which became the FIA Formula One World Championship in 1981, has been one of the premier forms of racing around the world since 1950 . A Formula One season consists of a series of races, known as Grands Prix, which take place worldwide on both purpose-built circuits and closed public roads
Quite an impressive summary, isn’t it?
Conclusion
Text summarization is an important application in the NLP domain that can save time, increase productivity and automate mundane summarization tasks. In this article we covered the two main types of text summarization and walked through two implementations: a word-frequency extractive summarizer and a pre-trained transformer pipeline. Try building a text summarizer yourself by exploring other methodologies and techniques.
Author
Pranav Raikote
References
- Neural Attention Model for Abstractive Sentence Summarization: https://arxiv.org/pdf/1509.00685.pdf
- Extractive summarization document: https://en.wikipedia.org/wiki/Formula_One