MEASURING MASSIVE MULTITASK LANGUAGE UNDERSTANDING

Abstract

We propose a new test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem solving ability. We find that while most recent models have near random-chance accuracy, the very largest GPT-3 model improves over random chance by almost 20 percentage points on average. However, on every one of the 57 tasks, the best models still need substantial improvements before they can reach expert-level accuracy. Models also have lopsided performance and frequently do not know when they are wrong. Worse, they still have near-random accuracy on some socially important subjects such as morality and law. By comprehensively evaluating the breadth and depth of a model's academic and professional understanding, our test can be used to analyze models across many tasks and to identify important shortcomings.

1. INTRODUCTION

Natural Language Processing (NLP) models have achieved superhuman performance on a number of recently proposed benchmarks. However, these models are still well below human-level performance for language understanding as a whole, suggesting a disconnect between our benchmarks and the actual capabilities of these models. The General Language Understanding Evaluation benchmark (GLUE) (Wang et al., 2018) was introduced in 2018 to evaluate performance on a wide range of NLP tasks, and top models achieved superhuman performance within a year. To address the shortcomings of GLUE, researchers designed the SuperGLUE benchmark with more difficult tasks (Wang et al., 2019). About a year after the release of SuperGLUE, performance is again essentially human-level (Raffel et al., 2019). While these benchmarks evaluate linguistic skills more than overall language understanding, an array of commonsense benchmarks have been proposed to measure basic reasoning and everyday knowledge (Zellers et al., 2019; Huang et al., 2019; Bisk et al., 2019). However, these recent benchmarks have similarly seen rapid progress (Khashabi et al., 2020). Overall, the near human-level performance on these benchmarks suggests that they are not capturing important facets of language understanding.

Transformer models have driven this recent progress by pretraining on massive text corpora, including all of Wikipedia, thousands of books, and numerous websites. These models consequently see extensive information about specialized topics, most of which is not assessed by existing NLP benchmarks. It therefore remains an open question just how capable current language models are at learning and applying knowledge from many domains.

To bridge the gap between the wide-ranging knowledge that models see during pretraining and the existing measures of success, we introduce a new benchmark for assessing models across a diverse set of subjects that humans learn. We design the benchmark to measure knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. This makes the benchmark more challenging and more similar to how we evaluate humans. The benchmark covers 57 subjects across STEM, the humanities, the social sciences, and more. It ranges in difficulty from an elementary level to an advanced professional level, and it tests both world knowledge and problem-solving ability. Subjects range from traditional areas, such as mathematics and history, to more specialized areas like law and ethics (Hendrycks et al., 2020). The granularity and breadth of the subjects make the benchmark ideal for identifying a model's blind spots.

We find that meaningful progress on our benchmark has only become possible in recent months. In particular, few-shot models up to 13 billion parameters (Brown et al., 2020) achieve random-chance performance of 25% accuracy, but the 175 billion parameter GPT-3 model reaches a much higher 43.9% accuracy (see Figure 1b). On the other hand, unlike human professionals, GPT-3 does not excel at any single subject. Instead, we find that performance is lopsided, with GPT-3 attaining almost 70% accuracy for its best subject but near-random performance for several other subjects. Our results indicate that while recent advances have been impressive, state-of-the-art models still struggle at learning and applying knowledge from pretraining.
The tasks with near-random accuracy include calculation-heavy subjects such as physics and mathematics and subjects related to human values such as law and morality. This second weakness is particularly concerning because it will be important for future models to have a strong understanding of what is legal and what is ethical. Worryingly, we also find that GPT-3 does not have an accurate sense of what it does or does not know, since its average confidence can be up to 24 percentage points off from its actual accuracy. We comprehensively evaluate the breadth and depth of a model's text understanding by covering numerous topics that humans are incentivized to learn. Since our test consists of 57 tasks, it can be used to analyze aggregate properties of models across tasks and to track important shortcomings. The test and code are available at github.com/hendrycks/test.
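For concreteness, the two aggregate quantities discussed above, macro-averaged accuracy over the 57 tasks and the gap between a model's average confidence and its actual accuracy, can be computed with a short script. The following is a minimal sketch, assuming a hypothetical per-question record layout; it is illustrative and not the released evaluation code.

```python
# A minimal sketch (hypothetical data layout, not the released evaluation code)
# of the two aggregate statistics above: macro-averaged accuracy across tasks
# and the gap between average confidence and actual accuracy.

def macro_accuracy(correct_by_task):
    """correct_by_task: dict mapping task name -> list of bools, one per question."""
    per_task = [sum(flags) / len(flags) for flags in correct_by_task.values()]
    return sum(per_task) / len(per_task)  # unweighted mean over all 57 tasks

def confidence_accuracy_gap(confidences, correct):
    """confidences: probability the model assigned to its chosen answer, per question.
    correct: whether that chosen answer was right. Returns |mean confidence - accuracy|."""
    mean_conf = sum(confidences) / len(confidences)
    accuracy = sum(correct) / len(correct)
    return abs(mean_conf - accuracy)

# Toy illustration: four-way multiple choice puts random chance at 25% accuracy,
# so a 43.9% macro average is roughly 19 points above chance, and a
# confidence_accuracy_gap of 0.24 corresponds to the 24-point miscalibration above.
```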

2. RELATED WORK

Pretraining. The dominant paradigm in NLP is to pretrain large models on massive text corpora, including educational books and websites. In the process, these models are exposed to information about a wide range of topics. Petroni et al. (2019) found that recent models learn enough information from pretraining that they can serve as knowledge bases; a minimal probing sketch of this idea appears below. However, no prior work has comprehensively measured the knowledge models have across many real-world domains. Until recently, researchers primarily evaluated models by fine-tuning them on downstream tasks (Devlin et al., 2019). However, larger pretrained models like GPT-3 (Brown et al., 2020) have made it possible to achieve competitive performance without fine-tuning by using few-shot learning, which removes the need for a large fine-tuning set. With the advent of strong zero-shot and few-shot learning, it is now possible to curate a diverse set of tasks for evaluation and remove the possibility of models exploiting "spurious cues" (Geirhos et al., 2020; Hendrycks et al., 2019b) in a dataset to achieve high performance.

Benchmarks. Many recent benchmarks aim to assess a model's general world knowledge and basic reasoning ability by testing its "commonsense." A number of commonsense benchmarks have been proposed.
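The knowledge-base finding of Petroni et al. (2019) referenced above is easy to reproduce in miniature. The sketch below queries a pretrained masked language model for a fact via the Hugging Face transformers fill-mask pipeline; the choice of model and query are illustrative assumptions, not part of this paper.

```python
# A minimal sketch, in the spirit of Petroni et al. (2019), of querying a
# pretrained masked language model as a knowledge base. The model and the
# query are illustrative choices, not artifacts of this paper.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
for prediction in unmasker("The capital of France is [MASK].")[:3]:
    # Each prediction includes the filled-in token and the model's probability.
    print(prediction["token_str"], round(prediction["score"], 3))
```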



Figure 1(a): Shot, prompt, and predicted answer.

  The following are multiple choice questions about high school mathematics.

  How many numbers are in the list 25, 26, ..., 100?
  (A) 75 (B) 76 (C) 22 (D) 23
  Answer: B

  Compute i + i^2 + i^3 + ... + i^258 + i^259.
  (A) -1 (B) 1 (C) i (D) -i
  Answer: A

  If 4 daps = 7 yaps, and 5 yaps = 3 baps, how many daps equal 42 baps?
  (A) 28 (B) 21 (C) 40 (D) 30
  Answer: C

(a) An example of few-shot learning and inference using GPT-3. The final answer letter ("C", rendered as blue underlined bold text in the original figure) is the autocompleted response from GPT-3, while the preceding text is the user-inputted prompt. In this 2-shot learning example, there are two instruction examples and one initially incomplete example. On average, GPT-3 has low accuracy on high school mathematics questions.
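The prompt format shown in Figure 1(a) can be assembled programmatically. Below is a minimal sketch assuming a hypothetical question-record layout and a hypothetical choice_logprobs model API that returns a log-probability for each candidate answer letter; it mirrors the few-shot evaluation described in the introduction rather than reproducing the released code exactly.

```python
# A minimal sketch of the k-shot prompt format shown in Figure 1(a).
# `choice_logprobs` is a hypothetical stand-in for a model API that returns
# log p(letter | prompt) for a candidate answer letter.

CHOICES = ["A", "B", "C", "D"]

def format_example(q, include_answer=True):
    """q: dict with keys 'question', 'options' (4 strings), and 'answer' (a letter)."""
    s = q["question"]
    for letter, option in zip(CHOICES, q["options"]):
        s += f"\n({letter}) {option}"
    s += "\nAnswer:"
    if include_answer:
        s += f" {q['answer']}\n\n"
    return s

def build_prompt(subject, dev_examples, question):
    prompt = f"The following are multiple choice questions about {subject}.\n\n"
    for ex in dev_examples:  # k-shot demonstrations, k = len(dev_examples)
        prompt += format_example(ex)
    prompt += format_example(question, include_answer=False)
    return prompt

def predict(prompt, choice_logprobs):
    # Pick whichever answer letter the model assigns the highest probability.
    return max(CHOICES, key=lambda c: choice_logprobs(prompt, c))
```

With dev_examples holding the two solved questions from the figure, build_prompt reproduces the 2-shot prompt shown above, and predict returns whichever of A-D the model ranks most likely as the completion.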

