Hello and welcome to the NLP tutorials series. As promised, we are doing a hands-on exercise in this blog and taking a much-needed break from studying architectures! In this article, we will classify text, specifically IMDB movie reviews, using a pre-trained Transformer architecture.
Defining Data
Let’s get started by downloading the IMDB dataset and unzipping it (or download it from this Kaggle website). It contains a total of 50k reviews, each labelled as either positive or negative, split 50:50 between the train and test sets.
import pandas as pd
data = pd.read_csv('IMDB_Reviews.csv')
data = data.sample(frac = 1.).reset_index(drop = True)
data['one_hot'] = data['sentiment'].apply(lambda x: [0, 1] if x == 'positive' else [1, 0])
data.head(10)
We are one-hot encoding the label values and shuffling the rows in this step.
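As an optional sanity check, we can confirm the roughly 50:50 class balance before moving on (the column name assumes the CSV loaded above):
# Optional check: the two sentiment classes should be roughly balanced (~25k each)
print(data['sentiment'].value_counts())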

Transformer & Preprocessing Functions
We will use the pre-trained BERT model from the transformers library and fine-tune it for a few epochs on our dataset. To do so, we need to process our input data to match the format the BERT model was trained with: input ids, attention mask and labels. Let’s get started on preparing the data for training.
import numpy as np
from transformers import BertTokenizer

# Arrays to hold the token ids and attention masks (50k reviews, 512 tokens each)
X_ids = np.zeros((50000, 512))
X_mask = np.zeros((50000, 512))

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

for i, review in enumerate(data['review']):
    tokens = tokenizer.encode_plus(review, max_length = 512, truncation = True, padding = 'max_length', add_special_tokens = True, return_tensors = 'tf')
    X_ids[i, :] = tokens['input_ids']
    X_mask[i, :] = tokens['attention_mask']
labels = list(data.one_hot.values)
We first create arrays to hold the token ids and the attention masks, sized by the number of reviews and the sequence length (512). We also create a list of our one-hot encoded labels. Together, these form our dataset.
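If you are curious about what encode_plus actually returns, a small inspection like the one below can help (purely illustrative; the decoded tokens depend on the review text):
# Peek at the encoding of the first review (illustrative only)
sample = tokenizer.encode_plus(data['review'][0], max_length = 512, truncation = True, padding = 'max_length', add_special_tokens = True, return_tensors = 'tf')
print(sample['input_ids'].shape)       # (1, 512)
print(sample['attention_mask'].shape)  # (1, 512)
print(tokenizer.decode(sample['input_ids'][0][:20].numpy()))  # starts with the [CLS] special token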
We are not finished yet: we still have to create a tf.data.Dataset and map the values to the right dictionary keys so they can be fed to the model during training. Let’s also divide the dataset into batches of 8 (small enough to train on the free version of Google Colab).
import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices((X_ids, X_mask, labels))

# Map the tensors to the dictionary keys expected by the model
def map_f(input_ids, masks, labels):
    return {'input_ids' : input_ids, 'attention_mask' : masks}, labels

dataset = dataset.map(map_f)
dataset = dataset.batch(8, drop_remainder = True)

# 90% of the batches (of size 8) for training, the rest for testing
size = int((50000 / 8) * 0.9)
train_ds = dataset.take(size)
test_ds = dataset.skip(size)
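Optionally, we can pull a single batch to confirm that the mapping produced the dictionary structure the model expects (this is just a sanity check, not required for training):
# Inspect one batch: the features should be a dict of (8, 512) tensors, the labels (8, 2)
for features, batch_labels in train_ds.take(1):
    print(features['input_ids'].shape)       # (8, 512)
    print(features['attention_mask'].shape)  # (8, 512)
    print(batch_labels.shape)                # (8, 2)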
Training & Testing
We import the cased BERT model from the transformers library and use TensorFlow layers to create the input streams for the input ids and the attention mask. After this, we append two fully connected layers, with the number of output neurons in the final layer equal to the number of classes, 2 in our case: one for positive sentiment and one for negative sentiment.
from transformers import TFAutoModel
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.optimizers import Adam

bert = TFAutoModel.from_pretrained('bert-base-cased')

# Name the inputs so they match the dictionary keys used in map_f
inp_id = Input(shape = (512,), name = 'input_ids')
mask = Input(shape = (512,), name = 'attention_mask')

# [1] is the pooled [CLS] output of BERT
emb = bert.bert(inp_id, attention_mask = mask)[1]
fc_1 = Dense(512, activation = 'relu')(emb)
fc_2 = Dense(2, activation = 'softmax')(fc_1)

model = tf.keras.Model(inputs = [inp_id, mask], outputs = fc_2)

# Freeze the BERT layer so only the classification head is trained (set to True to fine-tune BERT as well)
model.layers[2].trainable = False
opt = Adam(learning_rate = 1e-5, decay = 1e-6)
model.compile(optimizer = opt, loss = 'categorical_crossentropy', metrics = ['accuracy'])
We have defined our TensorFlow model with two input streams, the input ids and the attention mask values. Both feed into the BERT layer, whose pre-trained weights we loaded using TFAutoModel. We shall train the model for 3 epochs and observe the results; if needed, we can train for a few more epochs depending on the accuracy required.
history = model.fit(train_ds, epochs = 3)
# It is going to take some time, so feel free to grab a cup of coffee, have a stretch and come back to really good results
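Once training finishes, we can also check how the model performs on the held-out batches we split off earlier; this is a quick sketch using Keras' evaluate, and the exact numbers will vary from run to run.
# Evaluate on the held-out split created earlier (exact numbers will vary)
loss, acc = model.evaluate(test_ds)
print('Test loss:', round(loss, 4), '| Test accuracy:', round(acc, 4))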
Since this is bundled as a TensorFlow model, it is very easy to save it using the model.save() function and reuse it whenever required. Now it’s time to create a function which we can use to test any text input of our choice (real-time predictions).
def predictions(text):
    tokens = tokenizer.encode_plus(text, max_length = 512, truncation = True, padding = 'max_length', add_special_tokens = True, return_tensors = 'tf')
    feed_dict = {'input_ids' : tf.cast(tokens['input_ids'], tf.float64), 'attention_mask' : tf.cast(tokens['attention_mask'], tf.float64)}
    preds = model.predict(feed_dict)[0]
    return preds
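Here is how the helper could be used; the review string below is just an illustration and the save path is an arbitrary name.
# Example usage (the review text and the save path below are placeholders)
probs = predictions('One of the best movies I have watched in years, absolutely loved it!')
print(probs)  # [negative, positive] probabilities, given the [1, 0] / [0, 1] encoding above
print('positive' if probs[1] > probs[0] else 'negative')

# Save the fine-tuned model for later reuse
model.save('bert_imdb_classifier')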
The complete code is available as a single Python script.
Results and Conclusion
As we can observe, the results are really good for just 3 epochs; that’s the power of pre-trained BERT! I hope you were able to follow all the steps and train your own BERT model. Play around with the various hyperparameters and squeeze out as much accuracy as you can. Using this template, it now becomes easy to use BERT for classification tasks of all types: binary, multi-class and multi-label.
Author
Pranav Raikote