Hello and welcome back to the NLP Tutorials series, where we aim to take you through the ranks of NLP expertise all the way to expert level. If you follow all the articles in this tutorial series, you will no doubt gain valuable technical knowledge in the NLP domain. In our previous articles, we took an in-depth look at BERT and one of its accuracy-focused improvements. In this article we shall instead address a huge problem coming our way: the computational requirements for training these massive language models are getting out of hand.
Background
In recent years we have seen a lot of research and advancement in the NLP domain, especially language modelling architectures that are trained on massive datasets for days on end. While the accuracy is impeccable, it comes at a cost: difficulty in training and serving the models. Models and datasets are getting bigger and bigger, now at a scale of billions of parameters (24x larger than BERT, 5x larger than GPT-2). Training such a model requires data on the scale of hundreds of GBs (RoBERTa: 160 GB)! Once trained, deploying it in a scalable environment is very costly because multiple GPUs are needed just to load the model and respond to requests.

As we can see in the above graph, the number of parameters is increasing very rapidly. But amidst all the noise around big models, there is one wonder architecture that is quite small and offers performance equivalent to 97% of BERT's accuracy. How cool is that! DistilBERT, with just 66M parameters, reaches that level of accuracy while being 60% faster than BERT, which has made it very popular: Hugging Face (the popular Python library for NLP and Transformers) has reported more than 400,000 installs of DistilBERT! Time for a deep dive into DistilBERT.
DistilBERT
There are many techniques to reduce the number of parameters of a deep learning model. Quantization and weight pruning are the most prominent and widely used methods due to their effectiveness and ease of implementation. The authors instead used a method called distillation, which transfers the knowledge of a neural network to another, smaller network. DistilBERT is in fact short for "Distilled BERT". Your next question will be: how does this distillation work? The simple answer: a smaller model, called the student, is trained in such a way that it acquires the generalization capabilities of the bigger teacher model. The student model tries to reproduce the behavior and outputs of the teacher model. Let's see how this technique was applied to BERT, giving birth to the student model of BERT: DistilBERT.
The student architecture (DistilBERT) is very similar to BERT, with a few modifications:
- The number of layers is halved
- Token-type embeddings and the pooler are removed
The student architecture has the same number of neurons per layer as the teacher (only the number of layers is reduced), which enables initializing the student from the teacher's layers. Every other layer of the teacher is used to initialize the student network, which was fundamental in achieving convergence (successful training without stalling). The training data was unchanged from that of BERT. A rough sketch of this layer-wise initialization is shown below.
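As an illustration only (not the authors' actual training code), here is a minimal sketch of how one might initialize a 6-layer student from the 12-layer bert-base-uncased teacher by copying every other encoder layer with the Hugging Face transformers library. For simplicity it keeps the plain BertModel class (including its token-type embeddings and pooler), whereas the real DistilBERT architecture removes those components.

```python
import torch
from transformers import BertConfig, BertModel

# Load the 12-layer teacher (BERT-base).
teacher = BertModel.from_pretrained("bert-base-uncased")

# Build a 6-layer student with the same hidden size and attention heads,
# so the teacher's weights can be copied over directly.
student_config = BertConfig.from_pretrained("bert-base-uncased", num_hidden_layers=6)
student = BertModel(student_config)

# Copy the embeddings, then every other encoder layer from the teacher.
student.embeddings.load_state_dict(teacher.embeddings.state_dict())
for student_idx, teacher_idx in enumerate(range(0, 12, 2)):
    student.encoder.layer[student_idx].load_state_dict(
        teacher.encoder.layer[teacher_idx].state_dict()
    )
```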
Coming to the loss function used in this distilled training, the student is trained on the soft target probabilities of the teacher instead of the usual cross-entropy between predictions and ground-truth labels. The reasoning is that the teacher's full output distribution carries much richer information about how it generalizes than the hard labels alone, which helps the student reach the best generalization it can. The distillation loss L_ce is given below,

L_ce = Σ_i t_i × log(s_i)
where t_i is the probability estimated by the teacher and s_i is the probability estimated by the student. For the output activation, a softmax with temperature is used, which lets us control the smoothness of the output distribution:

p_i = exp(z_i / T) / Σ_j exp(z_j / T)
Here, T controls the smoothness of the distribution and z_i is the model's score (logit) for class i. At inference, T is set to 1 to recover the standard softmax. The training objective is a linear combination of the distillation loss and the supervised training loss. DistilBERT was trained for roughly 3.5 days on 8 Nvidia V100 16 GB GPUs; in contrast, the RoBERTa model required more than 1,000 Nvidia V100 32 GB GPUs to be trained in a single day.
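To make the objective concrete, here is a minimal PyTorch sketch of a temperature-softened distillation loss combined linearly with a supervised cross-entropy loss. The temperature T=2.0 and weight alpha=0.5 are illustrative values, not the paper's settings, and the full DistilBERT objective additionally uses a cosine embedding loss between the teacher's and student's hidden states.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: teacher and student distributions softened by temperature T.
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_soft_student = F.log_softmax(student_logits / T, dim=-1)

    # Distillation term: cross-entropy between teacher and student
    # distributions, averaged over the batch.
    l_distill = -(soft_teacher * log_soft_student).sum(dim=-1).mean()

    # Supervised term: standard cross-entropy against the hard labels.
    l_supervised = F.cross_entropy(student_logits, labels)

    # Linear combination of the two objectives.
    return alpha * l_distill + (1 - alpha) * l_supervised
```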
Results

As we can see in the above table, DistilBERT's score of 77.0 is very close to BERT's 79.5. DistilBERT has around 66M parameters, as opposed to 110M for BERT and 180M for ELMo. Thanks to the smaller parameter count, DistilBERT's inference time is roughly 410 ms compared to roughly 660 ms for BERT, while retaining close to 97% of its performance. Phenomenal, isn't it?
Conclusion
DistilBERT enables edge deployment of the model on smartphones for a variety of downstream NLP tasks. On device, it was found to be 71% faster than BERT, and the model size was just 207 MB (comparable to a mid-range gaming application)! The paper discusses a few more model compression techniques and research directions the authors explored, so I would strongly recommend reading it; it is short and very easy to understand once you have fully understood BERT. To conclude, DistilBERT is 40% smaller, 60% faster and retains up to 97% of the performance of BERT. Try your hand at using DistilBERT for a few popular NLP tasks (a quick example follows below) and you will be surprised at the speed and accuracy of the model. We will come back soon with another exciting article in the NLP Tutorials series; until then, share your thoughts on DistilBERT and share this article.
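For a quick start, assuming the transformers library is installed, you can try DistilBERT through the Hugging Face pipeline API with a DistilBERT checkpoint fine-tuned for sentiment analysis; the printed score below is only illustrative.

```python
from transformers import pipeline

# Sentiment analysis with a DistilBERT checkpoint fine-tuned on SST-2.
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

print(classifier("DistilBERT is fast and surprisingly accurate!"))
# e.g. [{'label': 'POSITIVE', 'score': 0.99}]
```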
References
- DistilBERT: https://arxiv.org/pdf/1910.01108.pdf
- DistilBERT huggingface: https://huggingface.co/transformers/model_doc/distilbert.html
- PyTorch implementation: https://analyticsindiamag.com/python-guide-to-huggingface-distilbert-smaller-faster-cheaper-distilled-bert/
- BERT: https://arxiv.org/pdf/1810.04805.pdf
Author
Pranav Raikote