Hello and welcome back to the NLP Tutorials blog series! We are finally discussing what was, at the time of its release, the biggest language model: GPT-3. GPT-3 is a massive model of 175 billion parameters, far larger than GPT-2, Google’s T5 and Microsoft’s Turing-NLG. Its main objective was to improve few-shot and zero-shot performance by scaling up both the training data and the parameter count. GPT-3 did not fail in this objective and blew away all other language models on a plethora of language modelling tasks. Let’s dive deep into the world of GPT-3.
Background
In the recent era of language modelling, the GPTs and BERT-style models performed exceptionally well, but they still shared one disadvantage: although pre-training was task-agnostic, task-specific fine-tuning was needed to achieve good performance on any desired task. Whenever a new task came up, a sizeable labelled dataset was required to fine-tune the pre-trained model, which was inefficient given the scarcity of labelled data and the extra compute needed on top of the massive time and money already spent on pre-training. It was also observed that the fine-tuned model’s generalization suffered, because the data distributions seen during pre-training and fine-tuning can differ.
Time to understand how GPT-3 overcame these problems, with the emphasis now on the architecture, dataset and training details.
Architecture
GPT-3 is fundamentally GPT-2 with modifications and add-ons. It keeps GPT-2’s modified initialization, pre-normalization and reversible tokenization. In addition, GPT-3 alternates dense and locally banded sparse attention patterns across the layers of its transformer blocks, similar to the Sparse Transformer. Eight versions of GPT-3 were trained, increasing in size from 125M parameters all the way to 175B parameters.
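To make the alternation concrete, here is a minimal NumPy sketch (not OpenAI’s actual implementation) that builds the two kinds of causal attention masks, a fully dense one and a locally banded sparse one, and assigns them to alternating layers. The layer count, sequence length and window size are illustrative, not GPT-3’s real values.

```python
import numpy as np

def dense_causal_mask(n):
    """Every token may attend to all previous tokens (and itself)."""
    return np.tril(np.ones((n, n), dtype=bool))

def local_causal_mask(n, window):
    """Each token may only attend to the `window` most recent tokens."""
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        mask[i, max(0, i - window + 1): i + 1] = True
    return mask

# Alternate dense and locally banded sparse attention across layers,
# in the spirit of the Sparse Transformer pattern GPT-3 borrows.
n_layers, seq_len, window = 8, 16, 4      # illustrative sizes, not GPT-3's
masks = [dense_causal_mask(seq_len) if layer % 2 == 0
         else local_causal_mask(seq_len, window)
         for layer in range(n_layers)]

print(masks[0].astype(int))   # dense causal pattern
print(masks[1].astype(int))   # banded local pattern
```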

The largest GPT-3 has 96 layers, each with 96 attention heads. The embedding size grows to 12288 from a mere 1600 in GPT-2, and the context length doubles from 1024 to 2048 tokens, which helps in modelling longer-range sequences.
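As a rough sanity check on the 175B figure, the usual back-of-the-envelope formula for a decoder-only transformer, roughly 12 × n_layers × d_model² weights in the attention and feed-forward blocks, lands close to the reported size. The formula and constants below are an approximation, not the paper’s exact accounting.

```python
# Back-of-the-envelope parameter count for GPT-3 (approximation only).
n_layers = 96          # transformer blocks
d_model  = 12288       # embedding / hidden size
n_vocab  = 50257       # BPE vocabulary size (same tokenizer family as GPT-2)
n_ctx    = 2048        # context length

# Attention (~4 * d_model^2) + feed-forward (~8 * d_model^2) per block
block_params = 12 * d_model ** 2
embedding_params = n_vocab * d_model + n_ctx * d_model   # token + position embeddings

total = n_layers * block_params + embedding_params
print(f"~{total / 1e9:.1f}B parameters")   # ~174.6B, close to the reported 175B
```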
A total of five datasets were used to train GPT-3: filtered Common Crawl, WebText2, Books1, Books2 and English Wikipedia. They are not sampled in proportion to their raw size; each is assigned a sampling weight during training so that the higher-quality corpora are seen more often.
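A minimal sketch of that weighted-sampling idea is below. The weights are the approximate mixture proportions reported in the paper, normalized here because they are rounded; the corpus names are just placeholder strings standing in for the actual data loaders.

```python
import random

# Approximate sampling weights reported in the GPT-3 paper (normalized here,
# since the published values are rounded). Corpora are placeholders.
mixture = {
    "CommonCrawl (filtered)": 0.60,
    "WebText2":               0.22,
    "Books1":                 0.08,
    "Books2":                 0.08,
    "Wikipedia":              0.03,
}
names, weights = zip(*mixture.items())
total = sum(weights)
weights = [w / total for w in weights]   # normalize the rounded weights

def sample_source(rng=random):
    """Pick which corpus the next training document is drawn from."""
    return rng.choices(names, weights=weights, k=1)[0]

print([sample_source() for _ in range(5)])
```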

The model was evaluated in the zero-shot, one-shot and few-shot settings, and contrasted with conventional fine-tuning. In the zero-shot setting the model sees only a natural-language description of the task, in one-shot it additionally sees a single demonstration, and in few-shot it sees a handful of demonstrations (typically 10 to 100), with no gradient updates in any of these settings. The prompt sketches below make the distinction concrete.
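Here is the difference written out as prompt text, modelled on the English-to-French translation example used in the paper; in every case the model is only asked to continue the text.

```python
# Prompts illustrating the three evaluation settings (no gradient updates).

zero_shot = (
    "Translate English to French:\n"
    "cheese =>"
)

one_shot = (
    "Translate English to French:\n"
    "sea otter => loutre de mer\n"
    "cheese =>"
)

few_shot = (
    "Translate English to French:\n"
    "sea otter => loutre de mer\n"
    "peppermint => menthe poivrée\n"
    "plush giraffe => girafe en peluche\n"
    "cheese =>"
)

print(few_shot)
```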
One more important concept discussed in the paper is in-context learning, where the model learns to do a downstream task simply by conditioning on a prompt consisting of input-output examples. The text input of a pre-trained language model acts as a form of task specification, i.e., the model is conditioned on a natural-language instruction and/or a few demonstrations of the task and is then expected to complete further instances simply by predicting what comes next. The paper also shows that larger models make increasingly effective use of this in-context information.
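GPT-3 itself is not available to download, so the sketch below uses GPT-2 via the Hugging Face transformers library purely as a stand-in to show the mechanics: the few-shot prompt is fed in as plain text and the model simply continues it, with no fine-tuning or gradient updates. Do not expect GPT-2’s completions to match GPT-3’s quality.

```python
# pip install transformers torch
from transformers import pipeline

# GPT-2 as a stand-in; GPT-3 is only accessible through OpenAI's hosted API.
generator = pipeline("text-generation", model="gpt2")

prompt = (
    "Translate English to French:\n"
    "sea otter => loutre de mer\n"
    "peppermint => menthe poivrée\n"
    "cheese =>"
)

# The model is conditioned on the prompt and just predicts what comes next;
# its weights are never updated for this task.
out = generator(prompt, max_new_tokens=10, do_sample=False)
print(out[0]["generated_text"])
```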

Results
GPT-3 can do well on specific tasks without any special fine-tuning, as opposed to the elaborate fine-tuning on a reasonably sized dataset that models like BERT need for good performance. GPT-3 performed well on a variety of tasks such as closed-book question answering, Winograd-style schema resolution and translation, though in the zero-shot setting it often came very close to, without beating, the state of the art. GPT-3 was very strong in the one-shot and few-shot settings.
GPT-3 also excelled at tasks like arithmetic addition, word unscrambling and news article generation, and showed phenomenal results in code generation and building web-page layouts. Microsoft has since struck a deal with OpenAI for an exclusive license of GPT-3, and we are already seeing impressive results in spreadsheet tasks.
Conclusion
As we have seen, GPT-3 is a very powerful language model which performs excellently on a variety of tasks. We have seen various instances of GPT-3 in real-life applications like spreadsheets, article generation for a given topic, and now GitHub’s Copilot and OpenAI’s Codex. Make sure to check them out. GPT-3 does have a few shortcomings: it sometimes fails to generate coherent long sequences and may repeat a short sequence again and again (possibly due to its unidirectional context), and its enormous size makes inference and hosting the model in the cloud expensive. All in all, it is an enormous model that is being used in more and more real-life applications.
We have now come to the end of this article. Stay tuned for future posts in the NLP Tutorials series, where we do deep dives on instrumental NLP models!
References
- GPT-3: https://arxiv.org/pdf/2005.14165.pdf
- GPT-2: https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
- GPT-3 Demo & Showcase: https://gpt3demo.com/
- OpenAI’s API: https://openai.com/blog/openai-api/
Author
Pranav Raikote