TASK AMBIGUITY IN HUMANS AND LANGUAGE MODELS

Abstract

Language models have recently achieved strong performance across a wide range of NLP benchmarks. However, unlike benchmarks, real-world tasks are often poorly specified, and agents must deduce the user's intended behavior from a combination of context, instructions, and examples. We investigate how both humans and models behave in the face of such task ambiguity by proposing AmbiBench, a new benchmark of six ambiguously-specified classification tasks. We evaluate humans and models on AmbiBench by seeing how well they identify the intended task using (1) instructions with varying degrees of ambiguity, and (2) different numbers of labeled examples. We find that the combination of model scaling (to 175B parameters) and training with human feedback data enables models to approach or exceed the accuracy of human participants across tasks, but that either one alone is not sufficient. In addition, we show how to dramatically improve the accuracy of language models trained without large-scale human feedback training by finetuning on a small number of ambiguous in-context examples, providing a promising direction for teaching models to generalize well in the face of ambiguity.

1. INTRODUCTION

Language models have recently been applied to a wide range of NLP benchmarks, ranging from question answering, summarization, and logical reasoning, to solving riddles, dark humor detection, and ASCII word recognition (Brown et al., 2020; Srivastava et al., 2022). Performance across tasks has improved as models and datasets have grown in size, raising the prospect of a route towards generalist NLP models with broad utility. However, one feature many of these benchmarks share is that they are carefully designed to make the desired task very clear to the language model, since this is a prerequisite for establishing performance on that task.

Figure 1: Complex tasks are often hard to specify precisely, leaving important pieces of information missing. Agents should be able to fill in the blanks by combining information from instructions and examples in order to identify the intended behavior.

Unfortunately, real-world uses of language models are unlikely to feature such thought and clarity in their task specification. Rather than iterating over and perfecting a specification for their tasks, everyday users of language models may wish to define tasks on an as-needed basis, without worrying that they will be misunderstood. More pressingly, in complex domains featuring high-dimensional inputs and outputs (e.g. programming, verification, generation), it is unlikely that even a thoughtful task specification will perfectly capture which features of an input and output are salient to the task. This is especially important for the safe and robust deployment of language models, as such undesirable dependencies can be hidden hazards that are only revealed when a model fails catastrophically in a new setting (Geirhos et al., 2020).

To operationalize this problem, we introduce AmbiBench, a new benchmark of six ambiguously-specified tasks. Each input in AmbiBench is a sentence (e.g. The dog is in the meadow) that has multiple associated classification tasks based on different linguistic features (e.g. contains an animal, contains an outdoor location). Task ambiguity arises when more than one task is consistent with the provided instructions or labeled examples.

We establish how well different models and humans perform on ambiguously-specified tasks, given a wide range of task specifications including clear vs. unclear instructions and zero vs. multiple examples. We find that the largest models trained with human feedback data (HFD) match or outperform human participants across all specifications we try, though all underperform a Bayesian oracle. We also show how to improve standard language models' performance by finetuning them on a small set of in-context examples that demonstrate the desired generalization. This form of meta-learning dramatically improves a model's ability to learn new ambiguously-specified tasks. This suggests a possible mechanism for why the HFD models outperform standard language models (discussed in Section 4.4), as well as a promising direction for improving how models learn in ambiguous contexts.
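To make this setup concrete, below is a minimal Python sketch. It is not the actual AmbiBench data or oracle implementation: the word lists, candidate tasks, and the uniform-prior oracle are illustrative assumptions. It shows how a single labeled example can be consistent with more than one candidate task, and how a Bayesian-oracle-style predictor (averaging the predictions of tasks still consistent with the examples) remains uncertain until a disambiguating example arrives.

import re

# Illustrative sketch (not the actual AmbiBench data or oracle): several candidate
# classification tasks, each a binary feature of a sentence, can all be consistent
# with the same labeled examples.

ANIMALS = {"dog", "cat", "owl"}
OUTDOOR_LOCATIONS = {"meadow", "park", "forest"}

def words(sentence):
    return set(re.findall(r"[a-z]+", sentence.lower()))

# Candidate tasks: each maps a sentence to a label in {0, 1}.
CANDIDATE_TASKS = {
    "contains an animal": lambda s: int(bool(words(s) & ANIMALS)),
    "contains an outdoor location": lambda s: int(bool(words(s) & OUTDOOR_LOCATIONS)),
}

def consistent_tasks(examples):
    """Return the candidate tasks that agree with every labeled example."""
    return [name for name, task in CANDIDATE_TASKS.items()
            if all(task(x) == y for x, y in examples)]

def oracle_prediction(examples, query):
    """Posterior-mean prediction under a uniform prior over candidate tasks:
    average the predictions of the tasks still consistent with the examples."""
    survivors = consistent_tasks(examples)
    return sum(CANDIDATE_TASKS[name](query) for name in survivors) / len(survivors)

# One labeled example is ambiguous: both candidate tasks explain it.
examples = [("The dog is in the meadow.", 1)]
print(consistent_tasks(examples))                               # both tasks survive
print(oracle_prediction(examples, "The cat is in the house."))  # 0.5 (uncertain)

# A disambiguating example rules out the outdoor-location task.
examples.append(("The owl is in the house.", 1))
print(consistent_tasks(examples))                               # ['contains an animal']
print(oracle_prediction(examples, "The cat is in the house."))  # 1.0

Under a uniform prior over a finite candidate set, this posterior-mean prediction is simply the fraction of still-consistent tasks that label the query positively.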
To summarize our contributions, we:

1. Introduce and motivate the problem of studying task ambiguity in large language models.
2. Evaluate humans and models on a new benchmark of ambiguously-specified tasks, demonstrating that while pure language models fail to disambiguate the intended task well, sufficiently large models trained with human feedback data are able to approach or even exceed the performance of our human participants in resolving the ambiguity between tasks.
3. Show how finetuning on ambiguous in-context prompts and examples can enable traditional language models to surpass the performance of HFD models when evaluated on unseen tasks, providing a promising route towards models that capably manage task ambiguity.

2. RELATED WORK

A separate line of work studies ambiguity in the inputs to otherwise clearly-specified tasks (et al., 2019; Guo et al., 2021; Aliannejadi et al., 2021; Sun et al., 2022; Wu et al., 2022). Our work differs from these prior streams of work by studying task ambiguity (Finn et al., 2018; Tamkin et al., 2022c), where the task the agent is being asked to perform is ambiguous, rather than an ambiguous input for a clear task. This is of special relevance for self-supervised learning models that are trained for adaptation to a broad range of downstream tasks (Bommasani et al., 2021; Tamkin et al., 2022b). In these settings, models must infer the correct task from a user's specification, as opposed to a possibly unsafe or undesirable task that is also consistent with that specification.



Importantly, task ambiguity is distinct from clearly-specified tasks with ambiguous inputs, e.g. determining the referent of the pronoun in sentences like "the nurse handed the doctor her phone." Here, the task is clear (determine who "her" refers to), but there is not enough information in the input to answer it.
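As a hypothetical illustration of this distinction (the sentences and rules below are invented for exposition, not drawn from AmbiBench), the following sketch contrasts an ambiguous input under a clear task with a clear input under an ambiguous task.

# Hypothetical illustration of the distinction; not AmbiBench data.

# (a) Ambiguous input, clear task: coreference resolution is well defined,
#     but this particular sentence does not determine the answer.
coreference_question = "Who does 'her' refer to?"
ambiguous_input = "The nurse handed the doctor her phone."
possible_referents = {"the nurse", "the doctor"}  # both readings are grammatical

# (b) Ambiguous task, clear input: the example sentence is perfectly clear,
#     but more than one labeling rule explains its label.
labeled_example = ("The dog is in the meadow.", 1)
candidate_rules = {
    "contains an animal": lambda s: int("dog" in s or "cat" in s),
    "contains an outdoor location": lambda s: int("meadow" in s or "park" in s),
}
consistent = [name for name, rule in candidate_rules.items()
              if rule(labeled_example[0]) == labeled_example[1]]
print(consistent)  # both rules survive; the task itself remains ambiguous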

