Hello and welcome back to the NLP Tutorials Blog series! In this article we will look at the successor to the GPT model: GPT-2. GPT-2 was trained with a very simple objective, predicting the next word, yet it learned to generate coherent essays and paragraphs. GPT-2 is a huge model of 1.5 billion parameters, with roughly 10x the parameters and 10x the training data of GPT-1, making it a scaled-up version of GPT. GPT-2 was so good that the authors initially withheld the full trained model over concerns about misuse of the AI. It famously became,
The AI that was too Dangerous to Release
While GPT was famous and topped performance leaderboards, it was soon knocked down by Google's BERT model. One disadvantage of GPT was that it was trained as a traditional language model: reading left-to-right context and predicting the next word. BERT instead implemented masked language modelling, i.e. filling in blanks within a sentence, which made it even more powerful thanks to bidirectional contextual representations.
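To make the left-to-right objective concrete, here is a toy sketch of autoregressive next-token generation. A hand-made probability table stands in for the real network, and greedy decoding picks the most likely next token at each step; this is an illustration of the idea, not GPT-2's actual decoding code.

```python
# Toy illustration of left-to-right (autoregressive) language modelling.
# A hypothetical next-token probability table stands in for the model.
NEXT_TOKEN = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "sat": {"down": 0.9, "up": 0.1},
    "down": {"<eos>": 1.0},
}

def greedy_generate(prompt, max_tokens=10):
    """Repeatedly pick the most probable next token given the last one."""
    tokens = prompt.split()
    for _ in range(max_tokens):
        dist = NEXT_TOKEN.get(tokens[-1])
        if dist is None:
            break
        best = max(dist, key=dist.get)
        if best == "<eos>":
            break
        tokens.append(best)
    return " ".join(tokens)

print(greedy_generate("the"))  # the cat sat down
```

GPT-2 works the same way in spirit, except the "table" is a 1.5B-parameter transformer conditioned on the entire preceding context, and sampling strategies (top-k, temperature) are usually used instead of pure greedy decoding.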
GPT-2 was released in 2019 and was very good at generating paragraphs that made a lot of sense. On the dark side, it could also generate convincing fake news and create havoc. Let's look into the architecture and a few interesting results of GPT-2.
The authors did not share the original GPT-2 model at launch and instead open-sourced a smaller version roughly the size of GPT. GPT-2 originally had around 1.5 billion parameters, almost 10 times larger than GPT-1 (117M) and over 4 times larger than BERT-large (340M). The full model used 48 layers and an embedding size of 1600, along with a byte-pair-encoded vocabulary of 50,257 tokens, a batch size of 512 and a context window of 1024 tokens. The authors trained 4 versions of GPT-2, with 117M, 345M, 762M and 1.5B parameters, named GPT-2 small, medium, large and extra-large respectively.
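These sizes can be sanity-checked with a back-of-envelope calculation. The sketch below uses the standard decoder-only transformer estimate of roughly 12·d² weights per layer (attention plus MLP) plus the token and position embeddings; the layer counts and embedding sizes follow the GPT-2 paper, while the formula itself is a rough approximation that ignores biases and layer norms.

```python
# Back-of-envelope parameter counts for the four GPT-2 sizes.
VOCAB = 50257   # byte-level BPE vocabulary
CTX = 1024      # context window (learned position embeddings)

SIZES = {            # name: (n_layer, d_model), per the GPT-2 paper
    "small":  (12,  768),
    "medium": (24, 1024),
    "large":  (36, 1280),
    "xl":     (48, 1600),
}

def approx_params(n_layer, d_model):
    """Rough transformer count: embeddings + ~12*d^2 weights per layer."""
    embeddings = (VOCAB + CTX) * d_model
    per_layer = 12 * d_model ** 2
    return embeddings + n_layer * per_layer

for name, (n_layer, d) in SIZES.items():
    print(f"GPT-2 {name}: ~{approx_params(n_layer, d) / 1e6:.0f}M parameters")
```

The estimate lands close to the published numbers for each variant, with the extra-large configuration coming out at roughly 1.5-1.6 billion parameters.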
The training dataset, WebText, was built by scraping the outbound links of Reddit posts that received at least 3 karma, using the up-votes as a human proxy for link quality. WebText is 40GB in size and contains more than 8 million documents. One more highlight of WebText is that all Wikipedia articles were removed, since many test sets used for evaluating language models contain a lot of Wikipedia text and would otherwise overlap with the training data. As we have seen, GPT-2 is more or less an expanded version of GPT-1 + more data + more GPUs for training.
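The collection heuristic can be sketched in a few lines. The filter below is a toy stand-in for OpenAI's actual pipeline, with made-up example data: keep links from posts with at least 3 karma, and drop anything from Wikipedia.

```python
# Toy sketch of the WebText collection heuristic (not OpenAI's real code).
posts = [  # hypothetical scraped Reddit outbound links
    {"url": "https://example.com/essay", "karma": 15},
    {"url": "https://en.wikipedia.org/wiki/Language_model", "karma": 120},
    {"url": "https://blog.example.org/story", "karma": 2},
]

def keep(post, min_karma=3):
    """Keep links with enough karma that are not Wikipedia pages."""
    return post["karma"] >= min_karma and "wikipedia.org" not in post["url"]

kept = [p["url"] for p in posts if keep(p)]
print(kept)  # ['https://example.com/essay']
```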
GPT-2 also leaned on the concept of task conditioning: instead of modelling only P(output | input) as GPT-1 did, the model effectively conditions on the task as well, P(output | input, task), so it is expected to produce different outputs for different tasks given the same input. This also throws light on zero-shot transfer, where the model is expected to understand the task without any task-specific training, purely from how the input is phrased. For example,
Hi, how are you French:
The model understands that the task is to translate the sentence into French. This is called zero-shot task transfer.
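In practice the "task" is just a formatting convention on the input text. The sketch below shows illustrative prompt templates in this spirit; the template names and exact strings here are assumptions for demonstration, not the paper's official prompts (though "TL;DR:" was indeed used for summarization in the GPT-2 paper).

```python
# Zero-shot task transfer via prompt formatting: the same model receives
# a different task purely from how the input is phrased.
def make_prompt(task, text):
    # Illustrative templates, not the paper's exact prompt set.
    templates = {
        "translate_fr": "{} French:",
        "summarize": "{} TL;DR:",
        "answer": "Q: {} A:",
    }
    return templates[task].format(text)

print(make_prompt("translate_fr", "Hi, how are you"))  # Hi, how are you French:
print(make_prompt("summarize", "A long article..."))   # A long article... TL;DR:
```

The model then simply continues the text after the cue, and its continuation is read off as the translation, summary, or answer.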
GPT-2 was evaluated on several datasets and achieved state-of-the-art results on 7 of the 8 language modelling benchmarks tested, all in the zero-shot setting. A few examples of how the model performed are given below.
Also, there is an interactive demo of GPT-2 where you can freely input various prompts and get some really cool results; check it out: https://demo.allennlp.org/next-token-lm
We shall cover GPT-3 in detail in the next article of the NLP Tutorials series. Until then, share your thoughts on this article and check out some cool implementation examples of GPT-2 (links in the references below).
- GPT-2 paper: https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
- GPT paper: https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
- Text Generator using GPT-2 (Python): https://www.analyticsvidhya.com/blog/2019/07/openai-gpt2-text-generator-python/
- Hugging Face pre-trained GPT-2 model: https://huggingface.co/transformers/model_doc/gpt2.html