Welcome back to yet another interesting article in the NLP Tutorials series, in which we are advancing our proficiency from beginner to expert in NLP. In this blog we will look at an architecture that took the industry by storm. That's right, it's GPT (Generative Pre-Training)! GPT was published by OpenAI in 2018 and achieved incredible state-of-the-art performance on the majority of popular NLP tasks. GPT is a way of training language models that falls under semi-supervised learning: the model is first trained on unlabeled text data and then fine-tuned on supervised (labelled) data for downstream NLP tasks. Let's dig deep and understand GPT in detail.
Background
At the time of its introduction, language models were evolving and there was a search for an architecture that could perform excellently on the major NLP tasks while also being scalable and task-agnostic. The core ideas behind GPT are the attention mechanism and unsupervised pre-training. The reason for unsupervised learning is simply the lack of massive, accurately labelled datasets: labelling requires a lot of time, effort and money, and we are talking about a scale of tens of thousands of gigabytes! In the recent past, unsupervised methods had proven successful in the form of Word2Vec and GloVe, and fine-tuning a model that had been pre-trained on a huge corpus of unlabeled data yielded excellent results. GPT is also known for its impressive zero-shot performance; zero-shot learning is the scenario in which the samples provided at test time were never observed during training.
Now that we have some background on which GPT was built, it's time to look into the architecture.
Architecture
Coming to the architecture, GPT uses the decoder stack of the Transformer architecture (BERT uses the encoder stack). The decoder is a unidirectional (causal) self-attention model that uses the context of the preceding tokens only. This makes GPT work very well for real-time text translation and word prompting, since it does not wait for the sequence to be completed in order to factor in context bidirectionally. The encoder, by contrast, is a bidirectional self-attention model that uses all of the tokens in a sequence, as observed in the BERT architecture.
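To make the distinction concrete, here is a tiny PyTorch sketch (not the actual GPT code) contrasting the decoder's causal attention mask with the encoder's full bidirectional attention; the sequence length and scores are arbitrary illustration values:

```python
# Toy illustration (not the actual GPT code) of unidirectional vs. bidirectional
# self-attention; the sequence length and scores here are arbitrary examples.
import torch

seq_len = 5
scores = torch.randn(seq_len, seq_len)  # stand-in for Q·K^T / sqrt(d) attention scores

# Decoder (GPT) style: a causal mask blocks attention to future positions.
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()
causal_weights = torch.softmax(scores.masked_fill(~causal_mask, float("-inf")), dim=-1)

# Encoder (BERT) style: every token attends to the whole sequence.
bidirectional_weights = torch.softmax(scores, dim=-1)

print(causal_weights)         # upper triangle is zero: each token sees only the past
print(bidirectional_weights)  # full matrix: each token sees past and future
```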

The model maximizes the likelihood of each token u appearing given its context window of the previous k tokens, where Θ stands for the overall parameters of the architecture. The objective is given below.
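In the paper's notation, with an unlabeled corpus of tokens U = {u_1, …, u_n}, this is the standard left-to-right language-modelling likelihood:

```latex
L_1(\mathcal{U}) = \sum_{i} \log P\left(u_i \mid u_{i-k}, \dots, u_{i-1}; \Theta\right)
```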
For the fine-tuning stage, the inputs are passed through all the layers of the pre-trained network, and the final activation is fed to an added linear output layer with its own parameters, which is used to predict the label y for a given input x.
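In the paper's notation, the final transformer block's activation h_l^m for an input sequence x^1, …, x^m is fed into the added linear layer with parameters W_y, and fine-tuning maximizes the label likelihood over the labelled dataset C:

```latex
P(y \mid x^1, \dots, x^m) = \mathrm{softmax}\left(h_l^m W_y\right), \qquad
L_2(\mathcal{C}) = \sum_{(x, y)} \log P(y \mid x^1, \dots, x^m)
```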
The authors of GPT also found that keeping the unsupervised language-modelling task as an auxiliary objective during fine-tuning helped in two ways (the combined objective is written out after the list):
- Improved generalization
- Accelerated convergence
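As given in the paper, the combined fine-tuning objective simply adds the language-modelling loss, weighted by a coefficient λ:

```latex
L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \cdot L_1(\mathcal{C})
```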
GPT was trained in an unsupervised manner on the BookCorpus dataset, which contains over 7,000 unique unpublished books from a variety of genres such as Fantasy, Adventure and Romance. The dataset worked well because it contains long stretches of contiguous text, from which GPT was able to model long-range contextual information.
A Byte Pair Encoding (BPE) vocabulary with 40,000 merges was used, along with the Adam optimizer (learning rate of 2.5e-4, i.e. 0.00025), for 100 epochs on mini-batches of 64 sequences of 512 tokens each. The embedding dimension was set to 768. The model was a 12-layer, decoder-only Transformer with 12 attention heads in each layer. To prevent overfitting, regularization was applied in the form of residual, embedding and attention dropout.
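As a sketch, these hyperparameters can be expressed with the Hugging Face `OpenAIGPTConfig` (assuming the `transformers` library; the library's default vocabulary size of 40,478 corresponds to the 40,000 BPE merges plus special tokens, and the 0.1 dropout rate is the paper's value):

```python
# A paper-sized GPT-1 configuration, sketched with Hugging Face transformers.
from transformers import OpenAIGPTConfig, OpenAIGPTLMHeadModel

config = OpenAIGPTConfig(
    n_positions=512,   # maximum sequence length
    n_embd=768,        # embedding / hidden dimension
    n_layer=12,        # 12 decoder blocks
    n_head=12,         # 12 attention heads per block
    embd_pdrop=0.1,    # embedding dropout
    attn_pdrop=0.1,    # attention dropout
    resid_pdrop=0.1,   # residual dropout
)

model = OpenAIGPTLMHeadModel(config)  # randomly initialised, GPT-1-sized model
print(f"{model.num_parameters():,} parameters")
```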
For fine-tuning, most of the hyperparameters were retained, with the exception of the learning rate and batch size. I recommend a detailed read of the paper for more insight into these aspects. Usually the model is able to fit the supervised task very quickly, in the order of only 3–4 epochs!
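Below is a minimal, hypothetical fine-tuning sketch with PyTorch and the Hugging Face `transformers` library: the pre-trained backbone's last hidden state is fed to a fresh linear head, in the spirit of the linear output layer described above. The toy sentences and labels are made up; the 6.25e-5 learning rate is the paper's fine-tuning value.

```python
# A stripped-down fine-tuning loop: pre-trained GPT backbone + new linear head.
import torch
from torch import nn
from transformers import OpenAIGPTModel, OpenAIGPTTokenizer

tokenizer = OpenAIGPTTokenizer.from_pretrained("openai-gpt")
backbone = OpenAIGPTModel.from_pretrained("openai-gpt")
classifier = nn.Linear(backbone.config.n_embd, 2)  # 2-class linear head (the W_y above)

optimizer = torch.optim.Adam(
    list(backbone.parameters()) + list(classifier.parameters()), lr=6.25e-5
)
loss_fn = nn.CrossEntropyLoss()

texts = ["the movie was wonderful", "a dull and tedious plot"]  # toy examples
labels = torch.tensor([1, 0])

backbone.train()
for epoch in range(3):  # the paper reports convergence in roughly 3 epochs
    for text, label in zip(texts, labels):
        enc = tokenizer(text, return_tensors="pt")
        hidden = backbone(**enc).last_hidden_state  # (1, seq_len, 768)
        logits = classifier(hidden[:, -1, :])       # last token's state -> linear head
        loss = loss_fn(logits, label.unsqueeze(0))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```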
Results

GPT outperformed the previous SOTA on 9 out of the 12 datasets studied. That is phenomenal! The evaluation covers a variety of datasets, such as SNLI (image captions), SciTail (science exams), QNLI (Wikipedia articles), MNLI (transcribed speech and government reports), RTE (news articles), MRPC (the Microsoft Research Paraphrase Corpus) and QQP (Quora question pairs), and a host of tasks like NLI (Natural Language Inference), question answering, commonsense reasoning, sentiment analysis and the GLUE benchmark.
The effect of pre-training was also evaluated: it turns out that without pre-training, performance decreases by ~14.8% across all tasks. Including the auxiliary language-modelling objective was found to benefit the larger datasets, but not the smaller ones.
Conclusion
GPT was, and is, the foundation of the later language models GPT-2 and GPT-3, the latter of which has produced mind-blowing results across a wide variety of tasks. More on these in future posts in the NLP Tutorials series. GPT proved that we can achieve very good performance on various NLP tasks with unsupervised pre-training followed by supervised fine-tuning. It is task-agnostic and generalizes well to the well-known tasks discussed in the Results section, and it also showed promising results in zero-shot learning.
Hope you were able to grasp the core fundamentals of GPT, which are essential before advancing to models like GPT-2 and GPT-3. Try some hands-on implementation with the Hugging Face libraries, which have made it much easier to use GPT and other NLP models and to fine-tune them on our own custom datasets.
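For instance, a minimal sketch with the `transformers` library loads the pretrained GPT weights and samples a short continuation (the prompt here is arbitrary):

```python
# Load the original GPT weights from the Hugging Face hub and generate text.
from transformers import OpenAIGPTLMHeadModel, OpenAIGPTTokenizer

tokenizer = OpenAIGPTTokenizer.from_pretrained("openai-gpt")
model = OpenAIGPTLMHeadModel.from_pretrained("openai-gpt")

input_ids = tokenizer("natural language processing is", return_tensors="pt").input_ids
output_ids = model.generate(input_ids, max_length=40, do_sample=True, top_k=50)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

See you in the next article 🙂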
References
- GPT paper: https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf
- Transformers (Attention is All You Need): https://arxiv.org/pdf/1706.03762
- Hugging Face pre-trained weights for GPT: https://huggingface.co/transformers/model_doc/gpt.html
Author
Pranav Raikote