Welcome to this article on BERT. Yes, you read that right! What a journey we have had, starting right from the basics all the way to BERT. We finally have the background required to understand one of the most capable models across a variety of NLP tasks like Text Classification, Question Answering and Named Entity Recognition, all with very little task-specific training. Bidirectional Encoder Representations from Transformers, or BERT, is a language model that is pre-trained on a huge unlabeled text corpus in a self-supervised fashion and then fine-tuned on custom labeled data to achieve state-of-the-art (SOTA) results. Without wasting much time, let’s jump straight into the technicalities of BERT.


Applying bidirectional training to a Transformer was the missing piece, and BERT did just that. Training an NLP model bidirectionally tends to yield better representations because the model can use the context on both sides of a word, instead of reading only left-to-right or only right-to-left. Let us see a small example which throws more light on the importance of bidirectional encoding and training.

Sentence 1: I went to the river bank

Sentence 2: I had been to the bank to withdraw some money

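Both sentences contain the word bank, but its meaning is completely different in each. In Sentence 2 the words that disambiguate it (withdraw, money) come after bank, so a purely left-to-right model cannot see them when it processes that position. If we want to predict a missing word in a sentence, both the left and the right context matter. Let’s now look into the architecture in detail.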


BERT’s pretraining and fine-tuning illustrated. Image Credits — BERT paper

BERT, as we saw earlier, is a bidirectional encoder which is used solely for learning representations (at least for the time being). We don’t need the decoder at all: BERT is an encoder-only architecture, made of a stack of Transformer encoder blocks (refer to the Transformers article). You can argue that the vanilla Transformer’s encoder also attends over the entire sentence, so what’s the difference? The difference lies in how the model is trained. The vanilla Transformer’s encoder is trained together with a left-to-right decoder for sequence-to-sequence tasks, whereas BERT pre-trains the encoder stack on its own with a novel objective (masked language modelling, covered below) which makes sure that every token’s representation depends on both its left and right context. The BERT Base model has 12 encoder blocks (about 110 million parameters) and the BERT Large model has 24 blocks. The larger model achieved SOTA results but is ridiculously large in terms of parameters: 340 million!
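To get a feel for where those parameter counts come from, here is a rough back-of-the-envelope sketch. The hidden sizes (768 and 1024), the 30,522-token WordPiece vocabulary, the 512 positions and the 4x feed-forward size are not stated in this article; they are taken from the paper and the released English uncased models, so treat them as assumptions of this example.

```python
def bert_param_count(num_layers, hidden, vocab_size=30522, max_positions=512, segments=2):
    """Rough parameter count (weights + biases) for a BERT-style encoder."""
    ffn = 4 * hidden          # assumed feed-forward inner size: 4x the hidden size
    layer_norm = 2 * hidden   # one scale and one shift vector

    # Token, position and segment embedding tables, followed by one LayerNorm.
    embeddings = (vocab_size + max_positions + segments) * hidden + layer_norm

    # Self-attention: query, key, value and output projections.
    attention = 4 * (hidden * hidden + hidden)
    # Feed-forward network: hidden -> ffn -> hidden.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    per_layer = attention + feed_forward + 2 * layer_norm

    # Dense "pooler" layer applied to the [CLS] output.
    pooler = hidden * hidden + hidden
    return embeddings + num_layers * per_layer + pooler


base = bert_param_count(num_layers=12, hidden=768)
large = bert_param_count(num_layers=24, hidden=1024)
print(f"BERT Base:  ~{base / 1e6:.0f}M parameters")
print(f"BERT Large: ~{large / 1e6:.0f}M parameters")
```

Under these assumptions the totals land close to the 110M and 340M figures reported for the two models, and most of the weight sits in the stacked encoder blocks rather than in the embeddings.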

Intuitively, it is reasonable to believe that a deep bidirectional model is strictly more powerful than either a left-to-right model or the shallow concatenation of a left-to-right and a right-to-left model. 

BERT Authors
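To make the left-to-right versus bidirectional contrast concrete, here is a small sketch using attention masks on Sentence 2 from earlier. It only illustrates which tokens a position is allowed to look at; the helper names are made up for this example and whitespace splitting stands in for real tokenization.

```python
import numpy as np


def causal_mask(n):
    """Left-to-right model: position i may attend only to positions <= i."""
    return np.tril(np.ones((n, n), dtype=int))


def bidirectional_mask(n):
    """BERT-style encoder: every position may attend to every position."""
    return np.ones((n, n), dtype=int)


def visible_tokens(mask, tokens, position):
    """Tokens that the given position is allowed to attend to."""
    return [tok for tok, allowed in zip(tokens, mask[position]) if allowed]


tokens = "I had been to the bank to withdraw some money".split()
n = len(tokens)
bank = tokens.index("bank")

print("left-to-right:", visible_tokens(causal_mask(n), tokens, bank))
print("bidirectional:", visible_tokens(bidirectional_mask(n), tokens, bank))
```

With the causal mask, the position of bank never sees withdraw or money; with the full mask it sees the whole sentence.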

BERT takes either a single sequence or a pair of sequences as input, which lets the same pre-trained model be fine-tuned on a variety of downstream tasks. In BERT’s lingo, a “sequence” may be a single sentence or two pieces of text packed together, for example <Question, Answer>. Let’s look closely at how the data is fed into the model; the rest is more or less simple (stock encoder block operations).

BERT’s input representation. Image Credits — BERT paper

The first token of every sequence is a special token, shown in the above diagram as [CLS] (the classification token). Its final hidden state is used as the aggregate representation of the whole sequence for classification tasks. We can see another special token, [SEP], which marks the end of a sequence and separates the two sequences in a pair. Another way the sequences are differentiated is a learned embedding, added to every token, indicating whether it belongs to sequence A or sequence B.

The Token Embeddings are learned from the training data for each token in the WordPiece vocabulary. Segment Embeddings are mainly employed to distinguish the two sequences (Ea and Eb in the above example). Finally, the Position Embeddings, which are also learned, indicate the position of each token in the sequence. The final Input Representation is the sum of these 3 embeddings. Now that we have learned a bit about the input side of BERT, let’s see how it is trained.
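The input construction described above can be sketched in a few lines of NumPy. This is a toy illustration, not BERT’s actual code: the vocabulary, the hidden size of 8 and the randomly initialised tables are assumptions made for the example (in the real model the tables are learned).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and sizes, chosen only for this example.
vocab = ["[PAD]", "[CLS]", "[SEP]", "[MASK]",
         "my", "dog", "is", "cute", "he", "likes", "play", "##ing"]
token_to_id = {tok: i for i, tok in enumerate(vocab)}
hidden, max_len = 8, 16

# Three embedding tables (random here, learned in the real model).
token_emb = rng.normal(size=(len(vocab), hidden))
segment_emb = rng.normal(size=(2, hidden))       # sequence A / sequence B
position_emb = rng.normal(size=(max_len, hidden))


def build_input(tokens_a, tokens_b):
    """Pack two sequences as [CLS] A [SEP] B [SEP] with matching segment ids."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids


def embed(tokens, segment_ids):
    """Input representation = token + segment + position embeddings."""
    token_ids = np.array([token_to_id[t] for t in tokens])
    positions = np.arange(len(tokens))
    return token_emb[token_ids] + segment_emb[np.array(segment_ids)] + position_emb[positions]


tokens, segment_ids = build_input(["my", "dog", "is", "cute"],
                                  ["he", "likes", "play", "##ing"])
x = embed(tokens, segment_ids)
print(tokens)
print(segment_ids)
print(x.shape)  # one vector of size `hidden` per input token
```

Each row of `x` is what the first encoder block would receive for the corresponding token.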


Masked Language Model (MLM)

The first task BERT is pre-trained on is the Masked Language Model, which is simply the task of predicting the missing words in an incomplete sentence. For this task, 15% of the tokens in each sequence are chosen at random and hidden behind a special token, [MASK]. This was a novel way to learn bidirectional context but posed a problem: [MASK] never appears during fine-tuning, which creates a mismatch between pre-training and fine-tuning. (Fine-tuning is the process where you start from the pre-trained BERT weights and continue training on your custom data for a specific NLP task.) To reduce this mismatch, the authors replace a chosen token with [MASK] only 80% of the time. For another 10% of the time it is replaced by a random token, and for the remaining 10% it is left unchanged.

Example Sentence: I love reading blogs on [MASK] Singularity

The model should output “Applied” once trained well with proper data. 
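The masking rule can be sketched as follows. This is a simplified illustration under a few assumptions: `mask_tokens` is a made-up helper, each token is selected independently with probability 0.15 (so 15% holds on average rather than exactly), and the tiny vocabulary is just for the demo.

```python
import random


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Return (masked_tokens, labels) following the 80/10/10 rule.

    labels[i] holds the original token where a prediction is required
    and None everywhere else.
    """
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        # Never select the special tokens; select others with prob. mask_prob.
        if tok in ("[CLS]", "[SEP]") or rng.random() >= mask_prob:
            continue
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            masked[i] = "[MASK]"           # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return masked, labels


sentence = ["[CLS]", "i", "love", "reading", "blogs", "on",
            "applied", "singularity", "[SEP]"]
toy_vocab = ["i", "love", "reading", "blogs", "on", "applied", "singularity"]
masked, labels = mask_tokens(sentence, toy_vocab, seed=0)
print(masked)
print(labels)
```

The loss is computed only at positions where `labels` is not None, so the model cannot tell which of the unmasked tokens it will be asked about and has to keep a good representation of every token.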

Next Sentence Prediction (NSP)

This task is all about learning whether, in a given pair of sequences, the second one actually follows the first. This allows the model to learn the relationship between sentences, not just between words. The input is a concatenation of two sentences, Sentence A and Sentence B. 50% of the time B is the sentence that really follows A in the corpus, and the other 50% of the time B is a random sentence from the corpus (a balanced data split). This is a classification task with two labels, IsNext and NotNext, trained with a cross-entropy loss. We have already seen the special tokens involved: the output at [CLS] is used to make the prediction and [SEP] separates the two sentences.
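Building the training pairs for this task could look roughly like the sketch below. `make_nsp_pairs` is a hypothetical helper; drawing the random sentence from a different document is an assumption made here so that a NotNext example is never accidentally the true next sentence, and it requires at least two documents.

```python
import random


def make_nsp_pairs(documents, seed=None):
    """documents: list of documents, each a list of sentences.

    Returns (sentence_a, sentence_b, label) triples with roughly half
    IsNext and half NotNext examples.
    """
    rng = random.Random(seed)
    examples = []
    for doc_idx, doc in enumerate(documents):
        for i in range(len(doc) - 1):
            sentence_a = doc[i]
            if rng.random() < 0.5:
                # Positive example: B really follows A.
                sentence_b, label = doc[i + 1], "IsNext"
            else:
                # Negative example: B is a random sentence from another document.
                others = [j for j in range(len(documents)) if j != doc_idx]
                sentence_b = rng.choice(documents[rng.choice(others)])
                label = "NotNext"
            examples.append((sentence_a, sentence_b, label))
    return examples


docs = [
    ["I went to the bank.", "I withdrew some money.", "Then I went home."],
    ["BERT is an encoder.", "It is pre-trained on two tasks.", "Then it is fine-tuned."],
]
for a, b, label in make_nsp_pairs(docs, seed=0):
    print(f"{label:8s} | {a} -> {b}")
```

Each pair would then be packed as [CLS] A [SEP] B [SEP] before being fed to the model.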

Now it’s time to leverage the SOTA performance of BERT via fine-tuning.


For a classification task, an extra output layer is added on top of the pre-trained BERT model, operating on the final [CLS] representation. Just feed your custom data in the same format that was used during pre-training and bam, you get excellent results! Coming to the Question Answering task, the model learns two extra vectors, a start vector and an end vector, which are scored against every token’s output to mark the beginning and end of the answer. Most hyperparameters stay the same as in pre-training; the main ones that change during fine-tuning are the batch size, the learning rate and the number of epochs.
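Both kinds of fine-tuning head are small enough to sketch with NumPy. The random `hidden_states` below merely stand in for BERT’s final-layer outputs, the sizes are toy values, and `best_span` is a made-up helper name; in practice the answer span would also be restricted to the passage tokens.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, hidden, num_labels = 10, 8, 2

# Stand-in for BERT's final hidden states: one vector per input token,
# with position 0 playing the role of [CLS].
hidden_states = rng.normal(size=(seq_len, hidden))


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


# Classification: a single extra linear layer on top of the [CLS] vector.
W = rng.normal(size=(num_labels, hidden))
class_probs = softmax(W @ hidden_states[0])

# Question Answering: two extra vectors, one for the start and one for the end.
start_vec = rng.normal(size=hidden)
end_vec = rng.normal(size=hidden)


def best_span(hidden_states, start_vec, end_vec):
    """Pick the (start, end) pair with the highest score, requiring end >= start."""
    start_scores = hidden_states @ start_vec
    end_scores = hidden_states @ end_vec
    # scores[i, j] = start score of token i + end score of token j
    scores = start_scores[:, None] + end_scores[None, :]
    valid = np.triu(np.ones_like(scores, dtype=bool))  # keep only j >= i
    scores = np.where(valid, scores, -np.inf)
    i, j = np.unravel_index(np.argmax(scores), scores.shape)
    return int(i), int(j)


start, end = best_span(hidden_states, start_vec, end_vec)
print("class probabilities:", class_probs)
print("answer span (token indices):", (start, end))
```

In both cases only a handful of new parameters are introduced, which is a large part of why fine-tuning is cheap compared with pre-training.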

Head over to the Appendix of the paper for more details on the experiments and hyperparameters.


Coming to the results, BERT Base and BERT Large were compared against a set of earlier systems (the pre-OpenAI SOTA, BiLSTM + ELMo + Attention, and OpenAI GPT) over a host of language understanding tasks grouped under the common name GLUE (General Language Understanding Evaluation). Both BERT models performed better than all of these systems on every task in the paper’s GLUE table.

GLUE results. Image Credits — BERT paper


BERT is no doubt a game changer in NLP and an important architecture of the modern NLP era. It is not that complex to understand, but the experiments and various ablations needed to get it right are really solid (Appendix). The authors have also open-sourced the code and pre-trained weights, enabling the entire community to use this legendary architecture. Fine-tuning is fast, and within a few epochs we can get strong accuracy on a wide range of NLP tasks and datasets. BERT can also be used as an Embedding Generator, producing embeddings for any given sequence which can then be fed into a basic Deep Learning model. Although BERT was better than all the models it was compared against, it comes at a cost: a huge number of parameters and a lot of compute for training. Variants such as DistilBERT, RoBERTa and ALBERT came into existence later. DistilBERT and ALBERT aim to get close to BERT’s performance with significantly fewer parameters, while RoBERTa keeps the architecture and improves results through a better pre-training recipe.

We have now covered some really cool architectures like Seq2Seq, Transformers, Transformer XL and now BERT. The road from here is very exciting, as research in this field is moving fast! Hang in there and we shall look into more exciting architectures in my next articles.


  1. BERT: https://arxiv.org/pdf/1810.04805.pdf


Pranav Raikote
