This paper, "Language Models are Few-Shot Learners," is one of OpenAI's foundational papers, published in 2020.
TL;DR: The typical approach for developing a powerful Transformer-based large language model (LLM) for a specific domain involves two phases: pretraining and fine-tuning. The former refers to training on a large volume of data with self-supervised learning; the latter refers to adapting the pretrained model to the task at hand, e.g., sentiment analysis. While this approach is powerful, achieving SOTA results on many benchmarks, it still requires task-specific fine-tuning datasets with thousands of examples. In contrast, we humans adapt and learn from zero or only a few examples. This paper demonstrates that an LLM that is both large and pretrained on a large volume of data can learn a new task very quickly from a few examples (few-shot), possibly removing the need for fine-tuning for many tasks, if not all. It also demonstrates that increasing model size, in terms of parameter count, increases this learning capacity.
GPT is an acronym for "Generative Pre-trained Transformer". Before GPT-3 (this paper), there were previous versions, GPT and GPT-2. Despite the differences between the GPT family models, they are all based on the Transformer architecture introduced in "Attention Is All You Need".
In this paper, the authors emphasize that no gradient updates or fine-tuning are applied to the pretrained LLM: "For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model".
The paper clarifies meta-learning by defining an outer loop and an inner loop. The outer loop learns via backpropagation and SGD during pretraining, while the inner loop learns via the forward pass at inference time. The inner loop covers zero-shot, one-shot, and few-shot learning, collectively called in-context learning in the paper.
The GPT-3 paper in essence explores in-context learning at increasing parameter scales. Every task is evaluated under three conditions: zero-shot, one-shot, and few-shot. On most tasks, model performance improves with the addition of a natural language task description and with the number of examples placed in the model's context. There are also tasks on which few-shot performance struggles: natural language inference (NLI) tasks such as ANLI, and reading comprehension tasks such as RACE or QuAC.
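The three evaluation conditions differ only in how the prompt is assembled before a single forward pass. A minimal sketch of that assembly is below; the task description and the `=>` formatting are illustrative choices, not the paper's exact templates, and `build_prompt` is a hypothetical helper.

```python
def build_prompt(description, examples, query):
    """Assemble an in-context learning prompt: an optional natural
    language task description, K solved demonstrations, and the query.
    K = 0 gives zero-shot, K = 1 one-shot, K > 1 few-shot."""
    parts = [description] if description else []
    for source, target in examples:
        parts.append(f"{source} => {target}")
    parts.append(f"{query} =>")  # the model completes after "=>"
    return "\n".join(parts)

description = "Translate English to French."
demos = [("sea otter", "loutre de mer"), ("cheese", "fromage")]

zero_shot = build_prompt(description, [], "peppermint")
few_shot = build_prompt(description, demos, "peppermint")
print(few_shot)
```

No weights change between conditions; the only difference is how many demonstrations consume the context window.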
GPT-3 is pretrained on multiple datasets. Datasets like CommonCrawl may include samples from test and evaluation sets, so the authors developed a systematic tool to measure data contamination and quantify its effect on performance. However, they found a bug in their filtering approach after pretraining had finished, and they did not rerun pretraining because of the cost: "Unfortunately, a bug in the filtering caused us to ignore some overlaps, and due to the cost of training it was not feasible to retrain the model". Instead, they removed from the evaluation sets any example with a 13-gram overlap with the training data.
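The 13-gram overlap check can be sketched as follows. This is a simplified toy, assuming whitespace tokenization and an in-memory set; the paper's pipeline works on the tokenized corpus at a vastly larger scale, and all function names here are my own.

```python
N = 13  # n-gram length used for the overlap check in the paper

def ngrams(text, n=N):
    """All n-grams of a text under naive whitespace tokenization."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def build_index(training_docs):
    """Union of all training n-grams (a set stands in for a scalable index)."""
    index = set()
    for doc in training_docs:
        index |= ngrams(doc)
    return index

def is_contaminated(example, index):
    """An eval example is contaminated if it shares any 13-gram with training."""
    return not ngrams(example).isdisjoint(index)

train = ["one two three four five six seven eight nine ten eleven twelve thirteen fourteen"]
index = build_index(train)
leaked = "say one two three four five six seven eight nine ten eleven twelve thirteen end"
fresh = "a short unrelated sentence"
print(is_contaminated(leaked, index), is_contaminated(fresh, index))
```

Examples shorter than 13 tokens produce no n-grams at all and are never flagged, which is one reason such filters can miss overlaps.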
Notes from the paper
- The gap between zero-, one-, and few-shot performance often grows with model capacity; larger models are more proficient meta-learners.
- GPT-3's context size is 2048 tokens, which limits how many few-shot examples can be given to the model.
- One-shot learning resembles how humans pick up new tasks, which is why the authors include it in the comparisons.
- Data quality is an important aspect of pretraining an LLM. The authors take reasonable steps to ensure that data quality is high: "3 steps to improve the average quality of our datasets: (1) we downloaded and filtered a version of CommonCrawl based on similarity to a range of high-quality reference corpora, (2) we performed fuzzy deduplication at the document level, within and across datasets, to prevent redundancy and preserve the integrity of our held-out validation set as an accurate measure of overfitting, and (3) we also added known high-quality reference corpora to the training mix to augment CommonCrawl and increase its diversity."
- The raw CommonCrawl dataset contains nearly a trillion words; after filtering it is 570GB, roughly equivalent to 400 billion byte-pair-encoded tokens.
- Not all datasets are seen by GPT-3 equally during training. Higher-quality curated datasets (such as Wikipedia) are sampled 2–3 times more often than lower-quality ones (such as CommonCrawl).
- They found that the potential contamination in the dataset is high, but observed almost no effect on performance.
- A lesson for pretraining: "larger models can typically use a larger batch size, but require a smaller learning rate".
- GPT-3 did not excel in all scenarios: "GPT-3 appears to be weak in the few-shot or one-shot setting at some tasks that involve comparing two sentences or snippets".
- Another concern for practical adoption of GPT-3 is its size, even for inference. However, authors mention that distillation techniques might be a way forward to reduce the size of the model for practical deployment.
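The fuzzy document-level deduplication mentioned in the data-quality quote above can be approximated with Jaccard similarity over word shingles. This is a toy sketch under my own assumptions: the shingle size and threshold are illustrative, the pairwise comparison is O(n²), and production pipelines typically use MinHash/LSH instead; the paper's exact method is not reproduced here.

```python
def shingles(doc, k=5):
    """Set of k-word shingles of a document (whitespace tokenized)."""
    toks = doc.lower().split()
    if len(toks) < k:
        return {tuple(toks)}
    return {tuple(toks[i:i + k]) for i in range(len(toks) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def dedupe(docs, threshold=0.8):
    """Keep a document only if it is not near-duplicate of any kept one."""
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc)
        if all(jaccard(s, t) < threshold for t in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept

docs = [
    "the quick brown fox jumps over the lazy dog near the river bank",
    "the quick brown fox jumps over the lazy dog near the river bank",
    "a completely unrelated document discussing transformer language models instead",
]
unique = dedupe(docs)
```

Deduplication matters for the reason the quote gives: duplicated documents inflate memorization and make a held-out validation set a less reliable overfitting signal.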
Evaluation Notes
GPT-3 is evaluated on many datasets and tasks:
Language modeling, Cloze and Completion Tasks
- Penn Tree Bank (PTB) dataset — GPT-3 achieves a new SOTA by a 15-point margin, reaching 20.50 perplexity.
- LAMBADA dataset — tests modeling of long-range dependencies in text. GPT-3 achieves 76%, an 8% gain over the previous SOTA.
- HellaSwag dataset — GPT-3 achieves 79.3% accuracy in the few-shot setting, outperforming a fine-tuned 1.5B-parameter model.
- StoryCloze dataset — GPT-3 achieves 87.7% in the few-shot setting (K=70), while a fine-tuned BERT-based model outperforms it by 4.1%.
Closed book question answering tasks
- GPT-3 matches or comes close to fine-tuned Google T5-11B models on three different datasets.
Language translation
- The dataset used to pretrain GPT-3 is 93% English and 7% other languages.
- GPT-3 is found to be good at translating to English but suffers in the other direction.
Winograd Schema-like tasks
- GPT-3 performs slightly worse than SOTA or human performance.
- On the more challenging Winogrande dataset, GPT-3 achieves 77.7%, while a fine-tuned RoBERTa reaches 79% and the SOTA is 84.6%.
Common sense reasoning or Q/A
- PIQA dataset — GPT-3 achieves SOTA, beating a fine-tuned RoBERTa model (with a possible data-contamination caveat noted by the authors).
- ARC dataset — results similar to a fine-tuned RoBERTa baseline, but worse than SOTA by 27% and 22% on its two subsets.
- OpenBookQA dataset — GPT-3 falls 20 points short of SOTA, but performs similarly to a fine-tuned BERT-large model.
Reading comprehension
- Evaluated on 5 datasets. Results are on par with baselines like BERT but short of SOTA performance.
SuperGLUE benchmark
- GPT-3 outperforms a fine-tuned BERT-large on four of eight tasks.
- The few-shot score improves steadily with the number of in-context examples, scaling up to 32 examples before hitting the context-size limit.
NLI
- RTE and ANLI datasets — on RTE, few-shot GPT-3 performs similarly to a single-task fine-tuned BERT-Large; on the adversarial ANLI dataset, performance is close to random for all but the largest model and remains well below SOTA.
Synthetic and qualitative tasks
- GPT-3 performed poorly on the synthetic task of reversing the characters of a word, which the authors attribute to BPE tokenization: the model operates on subword tokens rather than individual characters.
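The BPE explanation can be made concrete with a toy byte-pair-style tokenizer. The merge table below is entirely made up for illustration (GPT-3's real vocabulary has tens of thousands of merges); the point is only that frequent words collapse into single tokens, so the model never sees the individual letters it would need to reverse.

```python
# Hypothetical merge table: pairs are merged in order, as in BPE training.
MERGES = [("h", "e"), ("l", "l"), ("he", "ll"), ("hell", "o")]

def bpe(word, merges=MERGES):
    """Apply BPE-style merges to a word, starting from single characters."""
    symbols = list(word)
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]  # merge the adjacent pair
            else:
                i += 1
    return symbols

print(bpe("hello"))  # ['hello'] — the whole word is one opaque token
print(bpe("help"))   # ['he', 'l', 'p'] — rarer words split into subwords
```

A character-reversal task forces the model to decompose a token it has only ever seen as an atomic unit, which plausibly explains the weak performance.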
Conclusion
This is the first paper to demonstrate that a sufficiently large pretrained LLM can achieve SOTA-level results while learning from only a few in-context examples, opening the door to general-purpose LLMs that do not need fine-tuning for specific tasks.