Hello and welcome back to another article in the NLP tutorials series! This time we explore a model that improves on the hugely popular NLP language model BERT. The Robustly Optimized BERT Pretraining Approach, or RoBERTa, outperforms BERT on standard benchmarks thanks to careful hyperparameter tuning and a much larger training dataset. The authors argued that BERT was significantly under-trained and that, given more data and better-tuned hyperparameters, it could reach its full potential. Let’s get started and see how the authors achieved this performance bump over conventional BERT.


The starting point is vanilla BERT, a bidirectional transformer pre-trained on the BookCorpus and English Wikipedia datasets (around 16 GB of unlabeled text). BERT was state of the art at the time of its release and outperformed earlier NLP models on tasks like Named Entity Recognition, Question Answering, Text Summarization, etc. It is pre-trained with two objectives: Masked Language Modelling and Next Sentence Prediction. Make sure to read our previous article to get a thorough understanding of BERT.

How is RoBERTa better than BERT? Is there any increase in computational parameters? Any novel techniques used? All questions answered in the next section.

Improvements over BERT

Vanilla BERT was trained on 16 GB of data, which seemed very large at the time but was still small for a model of BERT’s size. The authors added data from several more sources, bringing the total to about 160 GB, roughly ten times as much. A training set that much larger can be expected to improve accuracy on its own. The data came from four sources: BookCorpus + Wikipedia (16 GB) [the data BERT was trained on], CC-News (76 GB) [English news articles from Common Crawl], OpenWebText (38 GB) [an open-source recreation of the WebText corpus used to train GPT-2] and Stories (31 GB) [a subset of Common Crawl filtered to story-like text].

In addition to the extra training data, the authors introduced a new masking mechanism, Dynamic Masking, which generates a fresh masking pattern every time a sequence is fed to the model. The problem with the original masking technique in BERT is that the mask is applied only once, during data preprocessing, so the same pattern is used for a given sequence throughout training. As a baseline, the authors also tried a static masking variant that duplicates the training data n times (10 in the paper) to obtain n different mask patterns per sequence. Dynamic masking performed comparably to or slightly better than this static variant, without the need to duplicate the data, so the authors adopted it.
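The difference between static and dynamic masking can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: the helper names (`mask_tokens`, `static_masking`, `dynamic_masking`) are made up for this example, tokens are whitespace-split words, and every selected position is replaced by `[MASK]` (BERT's actual recipe also sometimes keeps the token or swaps in a random one).

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, rng=random):
    """Return a copy of `tokens` with about `mask_prob` of positions masked."""
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), n_to_mask))
    return [MASK if i in positions else tok for i, tok in enumerate(tokens)]

def static_masking(tokens, epochs, rng=random):
    """Mask once at preprocessing time and reuse the same pattern every epoch."""
    masked = mask_tokens(tokens, rng=rng)
    return [masked for _ in range(epochs)]

def dynamic_masking(tokens, epochs, rng=random):
    """Draw a new masking pattern each time the sequence is fed to the model."""
    return [mask_tokens(tokens, rng=rng) for _ in range(epochs)]

tokens = "the quick brown fox jumps over the lazy dog near the old river bank today".split()
static_views = static_masking(tokens, epochs=10, rng=random.Random(0))
dynamic_views = dynamic_masking(tokens, epochs=10, rng=random.Random(0))
```

With static masking the model sees one masked view of the sentence over and over, while dynamic masking exposes it to many different views of the same text.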

In RoBERTa, the Next Sentence Prediction (NSP) training objective was dropped, and the input format was changed to what the paper calls Full-Sentences and Doc-Sentences. In the Full-Sentences format, each input is packed with complete sentences sampled contiguously from one or more documents, up to a total length of 512 tokens. If a document ends before the 512-token limit is reached, sentences from the next document are appended, with a special separator token marking the document boundary. Doc-Sentences is built the same way, except that an input never crosses a document boundary. The authors observed that after removing the NSP loss, performance was on par with or slightly better than vanilla BERT on the downstream tasks they evaluated.
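As a rough sketch of the two input formats, the functions below pack pre-tokenized sentences greedily. The function names, the `SEP` placeholder string and the choice to truncate over-long sentences are assumptions made for this example, not the authors' implementation.

```python
SEP = "</s>"  # placeholder separator token, chosen for this sketch

def pack_full_sentences(documents, max_len=512):
    """Pack whole sentences into inputs of at most `max_len` tokens.

    `documents` is a list of documents, each a list of sentences, each a
    list of tokens. Inputs may cross document boundaries; a separator
    token is inserted wherever that happens.
    """
    sequences, current = [], []
    for doc_idx, doc in enumerate(documents):
        if doc_idx > 0 and current:
            if len(current) >= max_len:
                # No room left for the separator: start a new input instead.
                sequences.append(current)
                current = []
            else:
                current.append(SEP)  # mark the document boundary
        for sentence in doc:
            sentence = sentence[:max_len]  # simplification: truncate long sentences
            if len(current) + len(sentence) > max_len:
                sequences.append(current)
                current = []
            current.extend(sentence)
    if current:
        sequences.append(current)
    return sequences

def pack_doc_sentences(documents, max_len=512):
    """Same packing, but an input never crosses a document boundary."""
    sequences = []
    for doc in documents:
        sequences.extend(pack_full_sentences([doc], max_len))
    return sequences

docs = [
    [["a", "b", "c"], ["d", "e"]],       # document 1: two sentences
    [["f", "g"], ["h", "i", "j", "k"]],  # document 2: two sentences
]
full = pack_full_sentences(docs, max_len=8)
per_doc = pack_doc_sentences(docs, max_len=8)
```

With this toy data and a limit of 8 tokens, the first Full-Sentences input runs past the end of document 1 and continues into document 2 after the separator, while each Doc-Sentences input stays inside a single document.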

Moving on to one of the important hyperparameters, the batch size: vanilla BERT was trained with a batch size of 256 for a million steps. In RoBERTa, the batch size was increased up to 8K, and at that size about 31K steps cover the same amount of training data. Larger batches improved (that is, lowered) the model’s perplexity on the masked language modelling objective and improved accuracy on downstream tasks.
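A quick back-of-the-envelope check shows why these two settings are comparable: both process roughly the same number of training sequences. The helper names are made up for this example, and reading "8K" as 8,192 in the last line is an assumption.

```python
def sequences_seen(batch_size: int, steps: int) -> int:
    """Total number of training sequences processed."""
    return batch_size * steps

def equivalent_steps(base_batch: int, base_steps: int, new_batch: int) -> int:
    """Steps needed at `new_batch` to see as many sequences as the baseline."""
    return sequences_seen(base_batch, base_steps) // new_batch

bert_budget = sequences_seen(256, 1_000_000)    # 256,000,000 sequences
roberta_budget = sequences_seen(8_000, 31_000)  # 248,000,000 sequences

# Assumption: if "8K" means 8,192, the matching step count is exactly 31,250.
steps_at_8192 = equivalent_steps(256, 1_000_000, 8_192)
```

So the larger batch does not mean less training. It means the same data budget is consumed in far fewer, larger optimizer steps.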

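The original BERT used a character-level Byte-Pair Encoding (BPE) vocabulary of size 30K. RoBERTa instead uses a byte-level BPE vocabulary with about 50K subword units, which adds around 15M parameters to the base model (about 20M for the large variant). Byte-level BPE can encode any input text without resorting to "unknown" tokens. The paper notes that this encoding performed slightly worse on some tasks in early experiments, but the authors judged the universal encoding worth the trade-off.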
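The 15M figure follows from the size of the token embedding matrix, which has one row per vocabulary entry and one column per hidden dimension. The sketch below uses the rounded vocabulary sizes quoted above together with BERT's hidden sizes of 768 (base) and 1024 (large); the function names are made up for this example.

```python
def embedding_params(vocab_size: int, hidden_size: int) -> int:
    """Number of weights in a vocab_size x hidden_size embedding matrix."""
    return vocab_size * hidden_size

def extra_params(old_vocab: int, new_vocab: int, hidden_size: int) -> int:
    """Additional embedding weights when the vocabulary grows."""
    return embedding_params(new_vocab, hidden_size) - embedding_params(old_vocab, hidden_size)

extra_base = extra_params(30_000, 50_000, 768)    # 15,360,000, about 15M
extra_large = extra_params(30_000, 50_000, 1024)  # 20,480,000, about 20M
```

The transformer layers themselves are unchanged by the new vocabulary, so the increase comes entirely from the embeddings.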

All these techniques were instrumental for the better performance of RoBERTa.


Performance comparisons of RoBERTa and other models over various GLUE tasks. Image Credits — RoBERTa paper

As we can see in the above image, a single-task RoBERTa model outperforms BERT(large) and XLNet(large) on all tasks! In an ensemble setup, RoBERTa outperforms BERT and the other models on 4 tasks and also on the average across all tasks.


RoBERTa is an improved version of BERT that offers better performance on downstream NLP tasks. There is a small increase in the number of parameters, but the training time is 3–4 times that of BERT. This is its only major disadvantage. A few more models have emerged from BERT, such as DistilBERT and ALBERT, which are considerably smaller than BERT and so easier to deploy, with lower memory requirements. Stay tuned to the NLP Tutorials series for more articles on exciting and groundbreaking NLP architectures!


  1. RoBERTa: https://arxiv.org/pdf/1907.11692.pdf
  2. RoBERTa Official implementation: https://github.com/pytorch/fairseq/tree/master/examples/roberta
  3. BERT: https://arxiv.org/pdf/1810.04805.pdf


Pranav Raikote
