Q: DO LARGE LANGUAGE MODELS UNDERSTAND IMPLICATURE? A: DO PIGS FLY?

Abstract

Despite the widespread use of LLMs as conversational agents, evaluations of their performance fail to capture a crucial aspect of communication: interpreting language in context. Humans interpret language using beliefs and prior knowledge about the world. For example, we intuitively understand the response "I wore gloves" to the question "Did you leave fingerprints?" as meaning "No". To investigate whether LLMs have the ability to make this type of inference, known as an implicature, we design a simple task and evaluate widely used state-of-the-art models. We find that, although the task requires only a binary inference (yes or no), most models perform close to random. Models adapted to be "aligned with human intent" perform much better, but still show a significant gap with human performance. We present our findings as a starting point for further research into evaluating how LLMs interpret language in context and to drive the development of more pragmatic and useful models of human discourse.

1. INTRODUCTION

User: "Have you seen my phone?" InstructGPT: "Yes, I have seen your phone." InstructGPT's response[1] is a perfectly fine answer to the question, but a human might answer differently. They might respond "it's in your bag," bypassing the obvious follow-up question ("where is it?"). Giving such a helpful and efficient answer is an example of pragmatic language use that goes beyond the semantic meaning of utterances. Meaning is determined not only by a combination of words, but also by context, beliefs, and social institutions (Wittgenstein, 1953; Grice, 1975; Huang, 2017). Consider another exchange, in which Esther asks her friend Juan "Can you come to my party on Friday?" and Juan responds "I have to work." We resolve Juan's response into a decline by using the contextual commonsense knowledge that having to work on a Friday night precludes attendance. Both these exchanges contain an implicature: an utterance that conveys something other than its literal meaning[2]. Implicatures illustrate how context contributes to meaning, distinguishing writing and speaking from communicating (Green, 1996). We cannot fully understand utterances without understanding their implications, nor can a computational model. Indeed, the term "communication" presupposes that the speaker's implications are understood by the addressee. Although communication encompasses much more than implicatures, such as assertives and other illocutionary acts, we view implicature understanding as a necessary condition for communicating with humans. Being able to resolve seemingly novel implicatures and, more broadly, to engage in pragmatic understanding constitutes an essential and ubiquitous aspect of our everyday use of language.
Large language models (LLMs) have demonstrated remarkable ability on a variety of downstream tasks such as planning (Huang et al., 2022a), commonsense reasoning (Kojima et al., 2022), information retrieval (Lewis et al., 2020; Kim et al., 2022), and code completion (Austin et al., 2021; Biderman & Raff, 2022), to name just a few. When finetuned with human feedback, LLMs obtain higher ratings on desiderata like helpfulness (Ouyang et al., 2022; Bai et al., 2022), and are proposed as conversational agents (Thoppilan et al., 2022). Despite the widespread use and deployment of LLMs as conversational agents, there has been limited evaluation of their ability to navigate contextual commonsense knowledge. This raises an important question: to what extent do large language models understand conversational implicature? To answer this question we use a publicly available dataset of conversational implicatures and propose an evaluation protocol on top of it (Figure 1). We evaluate a range of state-of-the-art models that can be categorised into four distinct groups: base LLMs (like OPT (Zhang et al., 2022)), instructable LLMs finetuned on downstream tasks (like Flan-T5 (Chung et al., 2022)), LLMs finetuned on conversational data (like BlenderBot (Ng et al., 2019)), and instructable LLMs finetuned with an unknown method (i.e. the latest versions of OpenAI's InstructGPT-3 series[3]). We evaluate models zero-shot and test whether performance improves when presenting in-context examples (few-shot evaluation). Our results suggest that implicature resolution is a very challenging task for LLMs.

Figure 1: A schematic depiction of the protocol we propose to evaluate whether language models can interpret language in context. Each example in the test set is wrapped in templates and transformed into an incoherent example by swapping "yes" and "no". The model is said to understand the implicature if it assigns a higher likelihood to the coherent text than to the incoherent text.
Most models obtain around 60% accuracy on the test set, whereas humans obtain 86% and random performance is 50%. InstructGPT-3 consistently outperforms the other models across almost all model sizes considered, but even here zero-shot evaluation leaves a gap of 14% with the average human. In-context prompting shrinks this gap to 6% for the best of OpenAI's models, but it helps little for the others: at 30-shot they all still perform worse than InstructGPT-3 does at zero-shot. We conduct a comprehensive error analysis by manually grouping the test examples into categories, and find that the performance increase for the largest models seems driven by the simplest examples in the dataset, which require no context to be resolved. For these examples the conventional meaning of the words entails a proposition, e.g. "some people came to the party" implying "not all people came". When isolating the best model's performance on implicatures that do require commonsense knowledge to be resolved (like the one in Figure 1), the gap with the human average becomes 24% zero-shot and 9% few-shot. Furthermore, our scaling analysis shows that most of the model classes we evaluate do not exhibit increased performance when scaled up. Based on this result, we hypothesise that further scaling alone is unlikely to yield significant improvements. The main contributions of this work are as follows: (i) we motivate implicature understanding as a crucial aspect of communication that is currently missing from evaluations of LLMs, (ii) we design an implicature resolution task and propose a comprehensive evaluation protocol, on which we evaluate both humans and LLMs and find that it poses a significant challenge for state-of-the-art LLMs, and (iii) we perform a comprehensive error analysis and identify opportunities for future work.
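The likelihood-comparison protocol of Figure 1 can be sketched in code. The template wording, the example dictionary, and the toy scorer below are illustrative assumptions, not the paper's exact implementation; a real evaluation would score each completed template with a language model's summed token log-probabilities.

```python
# Sketch of the Figure 1 evaluation protocol. The template text and the
# toy scorer are assumptions for illustration; a real run would score each
# string with a language model's summed token log-probabilities.

TEMPLATE = ('Esther asked "{utterance}" and Juan responded "{response}", '
            "which means {meaning}.")

def swap_yes_no(meaning: str) -> str:
    """Produce the incoherent variant by flipping the resolved answer."""
    return "no" if meaning == "yes" else "yes"

def resolves_implicature(example: dict, log_likelihood) -> bool:
    """True iff the model prefers the coherent completion over the swapped one."""
    coherent = TEMPLATE.format(utterance=example["utterance"],
                               response=example["response"],
                               meaning=example["implicature"])
    incoherent = TEMPLATE.format(utterance=example["utterance"],
                                 response=example["response"],
                                 meaning=swap_yes_no(example["implicature"]))
    return log_likelihood(coherent) > log_likelihood(incoherent)

def toy_log_likelihood(text: str) -> float:
    # Hypothetical stand-in for an LM scorer: prefers the reading in which
    # wearing gloves means leaving no fingerprints.
    return 0.0 if "means no." in text else -1.0

example = {"utterance": "Did you leave fingerprints?",
           "response": "I wore gloves",
           "implicature": "no"}
print(resolves_implicature(example, toy_log_likelihood))  # → True
```

Accuracy over a test set is then simply the fraction of examples for which the coherent wrapping is preferred, which is why random performance sits at 50%.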

[1] Appendix A contains details on how this answer was obtained from InstructGPT-3.
[2] In Appendix B we present a comprehensive introduction to implicature.
[3] The finetuning method is unpublished and might differ from the original InstructGPT (Ouyang et al., 2022).

