Q: DO LARGE LANGUAGE MODELS UNDERSTAND IMPLICATURE? A: DO PIGS FLY?

Abstract

Despite widespread use of LLMs as conversational agents, evaluations of performance fail to capture a crucial aspect of communication: interpreting language in context. Humans interpret language using beliefs and prior knowledge about the world. For example, we intuitively understand the response "I wore gloves" to the question "Did you leave fingerprints?" as meaning "No". To investigate whether LLMs have the ability to make this type of inference, known as an implicature, we design a simple task and evaluate widely used state-of-the-art models. We find that, despite only evaluating on utterances that require a binary inference (yes or no), most perform close to random. Models adapted to be "aligned with human intent" perform much better, but still show a significant gap with human performance. We present our findings as the starting point for further research into evaluating how LLMs interpret language in context and to drive the development of more pragmatic and useful models of human discourse.

1. INTRODUCTION

User: "Have you seen my phone?" InstructGPT: "Yes, I have seen your phone." InstructGPT's response is a perfectly fine answer to the question, but a human might answer differently. They might respond "it's in your bag," bypassing the obvious follow-up question ("where is it?"). Giving such a helpful and efficient answer is an example of pragmatic language use that goes beyond the semantic meaning of utterances. Meaning is determined not only by a combination of words, but also by context, beliefs, and social institutions (Wittgenstein, 1953; Grice, 1975; Huang, 2017). Consider another exchange where Esther asks her friend Juan "Can you come to my party on Friday?" and Juan responds "I have to work." We resolve Juan's response into a decline by using the contextual commonsense knowledge that having to work on a Friday night precludes attendance. Both these exchanges contain an implicature: an utterance that conveys something other than its literal meaning. Implicatures illustrate how context contributes to meaning, distinguishing writing and speaking from communicating (Green, 1996). We cannot fully understand utterances without understanding their implications, nor can a computational model. Indeed, the term "communication" presupposes that the speaker's implications are understood by the addressee. Although communication encompasses much more than implicatures, such as assertives and other illocutionary acts, we view implicature understanding as a necessary condition for communicating with humans. Being able to resolve seemingly novel implicatures and, more broadly, to engage in pragmatic understanding constitutes an essential and ubiquitous aspect of our everyday use of language.
Large language models (LLMs) have demonstrated remarkable ability on a variety of downstream tasks such as planning (Huang et al., 2022a), commonsense reasoning (Kojima et al., 2022), information retrieval (Lewis et al., 2020; Kim et al., 2022) and code completion (Austin et al., 2021; Biderman & Raff, 2022), to name just a few. When finetuned with human feedback, LLMs obtain higher ratings on desiderata like helpfulness (Ouyang et al., 2022; Bai et al., 2022), and are proposed as conversational agents (Thoppilan et al., 2022). Despite the widespread use and deployment of LLMs as conversational agents, there has been limited evaluation of their ability to navigate contextual commonsense knowledge. This raises an important question: to what extent do large language models understand conversational implicature? To answer this question we use a publicly available dataset of conversational implicatures and propose an evaluation protocol on top of it (Figure 1). We evaluate a range of state-of-the-art models that can be categorised into four distinct groups: base LLMs (like OPT (Zhang et al., 2022)), instructable LLMs finetuned on downstream tasks (like Flan-T5 (Chung et al., 2022)), LLMs finetuned on conversational data (like BlenderBot (Ng et al., 2019)), and instructable LLMs finetuned with an unknown method (i.e. the latest versions of OpenAI's InstructGPT-3 series). We evaluate both zero-shot and test whether performance improves by presenting in-context examples (few-shot evaluation). Our results suggest that implicature resolution is a very challenging task for LLMs.

Figure 1: A schematic depiction of the protocol we propose to evaluate whether language models can interpret language in context. Each example in the test set gets wrapped in templates and transformed into an incoherent example by swapping "yes" and "no". The model is said to understand the implicature if it assigns a higher likelihood to the coherent text than the incoherent text.
Most models obtain around 60% accuracy on the test set, whereas humans obtain 86% and random performance is 50%. InstructGPT-3 consistently outperforms other models across almost all model sizes considered, but even here zero-shot evaluation leaves a gap of 14% with the average human. In-context prompting can shrink this gap to 6% for the best of OpenAI's models. However, it does not help much for other models; at 30-shot they still all perform worse than InstructGPT-3 does at zero-shot. We do a comprehensive error analysis by manually grouping the test examples into categories and uncover that the performance increase for the largest models seems driven by the simplest examples in the dataset, which require no context to be resolved. For these examples the conventional meaning of the words entails a proposition, e.g. "some people came to the party" implying "not all people came". When isolating the best model's performance on implicatures that do require commonsense knowledge to be resolved (like the one in Figure 1), the gap between zero-shot and the human average becomes 24%, and the gap between few-shot and the human average becomes 9%. Furthermore, scaling analysis shows that most of the model classes we evaluate do not exhibit increased performance when scaled up. Based on this result, we hypothesise it is unlikely that further scaling alone will lead to significant improvements. The main contributions of this work are as follows: (i) we motivate implicature understanding as a crucial aspect of communication that is currently missing from evaluations of LLMs, (ii) we design an implicature resolution task and propose a comprehensive evaluation protocol on which we evaluate both humans and LLMs, finding that it poses a significant challenge for state-of-the-art LLMs, and (iii) we perform a comprehensive error analysis and identify opportunities for future work.

2. RELATED WORK

LLMs have demonstrated remarkable performance on tasks for which they were not explicitly trained (Brown et al., 2020). Building on the hypothesis that these abilities arise due to implicit multitask learning (Radford et al., 2019), the recent works of Sanh et al. (2022) and Wei et al. (2022) explicitly train LLMs in a supervised multitask fashion, leading to models that are better zero-shot learners with fewer parameters. Besides rapidly saturating language understanding benchmarks (Kiela et al., 2021), these advancements make LLMs beneficial foundations for agents performing a plethora of tasks (Adolphs et al., 2022; Reed et al., 2022). The trend towards using these models as agents brings along with it increased urgency for alignment with human values (Kenton et al., 2021). However, larger models trained with next-word prediction are generally more toxic and unhelpful (Gehman et al., 2020; Bender et al., 2021; Lin et al., 2022). Recent work mitigates this with approaches like prompting and finetuning on human-annotated outputs (Askell et al., 2021; Ouyang et al., 2022; Thoppilan et al., 2022). The produced models are more aligned on desiderata such as informativeness when evaluated by dedicated benchmarks and humans. We argue, however, that there is still something missing in these benchmarks. What is helpful and informative, as Kasirzadeh & Gabriel (2022) also point out, depends on the context in which a conversation is held. Consequently, any application of language models that requires communicating with humans will rely on pragmatic communication skills, something that is not explicitly captured by the benchmarks used to evaluate the alignment of LLMs. The standard set of benchmarks LLMs are further evaluated on covers tasks like question answering (Berant et al., 2013). Zheng et al. (2021) are the first to fill this gap with a dataset of conversational implicatures, called GRICE.
This is important pioneering work highlighting the difficulty of implicature for language models, but their evaluations require task-specific training. In contrast, our evaluation protocol is applicable out-of-the-box and is much more comprehensive, evaluating models up to 176 billion parameters and using in-context prompting. Additionally, Zheng et al. (2021) benchmark synthetic data, whereas this work evaluates performance on naturally occurring implicatures (George & Mamidi, 2020). We believe this to be a better representation of the true distribution of implicatures in natural dialogue. Relatedly, Valmeekam et al. (2022) introduce an extensive evaluation suite for planning and find that "GPT-3 is, as of right now, pretty ineffective in reasoning about actions and change."

3. THE EVALUATION PROTOCOL

In this section we outline the full evaluation protocol we use to answer the research question "To what extent do large language models understand conversational implicature?". We focus on simple binary implicatures that require inferring "yes" or "no" (like the one in Figure 1). As a proxy for "understanding", we say a model understands an utterance if it assigns higher likelihood to a coherent utterance than to a similar but incoherent one, detailed below.

Zero-shot evaluation. Consider the example from the introduction packed into a single utterance: Esther asked "Can you come to my party on Friday?" and Juan responded "I have to work", which means no. We can transform this example to be incoherent (in the sense that it becomes pragmatically inconsistent with expected use) by replacing the word "no" with "yes": Esther asked "Can you come to my party on Friday?" and Juan responded "I have to work", which means yes. If the model understands the implicature, it should assign higher likelihood to the first of the two sentences above, namely the most coherent one. Importantly, both sentences have exactly the same words except for the binary implicature "yes" or "no", making the assigned likelihood scores directly comparable. Formally, let the coherent prompt be x and the augmented, incoherent prompt be x′. A model outputs a likelihood p parameterised by weights θ. We say a model pragmatically understands an example x when it assigns p_θ(x) > p_θ(x′). This is equivalent to evaluating whether the model assigns a higher likelihood to the correct continuation of the two options. Note that this is a more lenient evaluation protocol than sometimes used for language models, where models are evaluated on their ability to generate the correct continuation, in this case "no". However, "no" is not the only coherent continuation here, and marginalising over all possible correct continuations is intractable.
The more lenient evaluation does capture implicature understanding, because the choice of "no" versus "yes" is only determined by the resolution of the implicature. We use a dataset of conversational implicatures curated by George & Mamidi (2020). It contains conversational implicatures that, like in Figure 1 , are presented in utterance-response-implicature tuples. Of these, 718 are binary implicatures that we can convert into an incoherent sentence. We randomly sample 600 examples for the test set and keep the remaining 118 as a development set to improve implicature understanding after pretraining through in-context prompting or finetuning. Few-shot in-context evaluation. We add k examples of the task to the prompt, e.g. with k = 2: The following examples are coherent sentences: Esther asked "Have you found him yet?" and Juan responded "They're still looking", which means no. Esther asked "Are you having fun?" and Juan responded "Is the pope Catholic?", which means yes.

Finish the following sentence:

Esther asked "Can you come to my party on Friday?" and Juan responded "I have to work", which means no. We evaluate the models' k-shot capabilities for k ∈ {1, 5, 10, 15, 30} by randomly sampling k examples from the development set for each test example. We opt for a random sampling approach in place of the predominant approach in prior work, which leverages the same ordered set of k prompts for each test example. This change in protocol allows us to control for two sources of randomness. Firstly, examples have different levels of informativeness. Secondly, recent work found that the order in which examples are presented matters (Lu et al., 2022). Ideally, to marginalise over these random factors, we would evaluate each test example with all permutations of k examples from the development set. This requires 118!/(118−k)! evaluations for each test example, which is intractable. Instead, we estimate performance per test example by randomly sampling from the development set. In this way we control for some of the variance in performance, but avoid extra evaluations.

Controlling for prompt sensitivity. It has been shown that language models are sensitive to the wording of the prompt (Efrat & Levy, 2020; Tan et al., 2021; Reynolds & McDonell, 2021a; Webson & Pavlick, 2021). To control for this factor of randomness we manually curate six different prompt templates and measure performance across these different wordings. One of the templates has already been presented in the examples in this section, namely "Esther asked <utterance> and Juan responded <response>, which means <implicature>". Another prompt template is: "Question: <utterance>, response: <response>, meaning: <implicature>". The former we call natural prompts and the latter structured prompts. Each group has three templates that only differ slightly in wording.
This grouping allows us to look at the variance due to slight changes in wording as well as performance difference due to a completely different way of presenting the example. The full list of prompts can be found in Table 4 . As Perez et al. (2021) point out, for the few-shot evaluation to be truly few-shot, we formulate these prompt templates before any evaluation is done and never use more than k examples from the development set for a test example.
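As a concrete sketch of the protocol above, the construction of coherent/incoherent pairs and the likelihood comparison can be implemented as follows. The scoring function here is a toy stand-in introduced purely for illustration; in the actual experiments the scores would be an LLM's log-likelihoods.

```python
import math

def make_pair(utterance, response, implicature, template):
    """Wrap one dataset tuple in a template, producing the coherent text
    and its incoherent counterpart with "yes"/"no" swapped."""
    flipped = "no" if implicature == "yes" else "yes"
    coherent = template.format(u=utterance, r=response, i=implicature)
    incoherent = template.format(u=utterance, r=response, i=flipped)
    return coherent, incoherent

def resolves_implicature(score_fn, coherent, incoherent):
    """A model 'understands' an example iff it assigns a higher
    log-likelihood to the coherent wording: p_theta(x) > p_theta(x')."""
    return score_fn(coherent) > score_fn(incoherent)

# Toy stand-in scorer (hypothetical): penalises the one incoherent wording.
def toy_score(text):
    return 0.0 if '"I have to work", which means yes' in text else math.log(2)

template = 'Esther asked "{u}" and Juan responded "{r}", which means {i}'
coh, inc = make_pair("Can you come to my party on Friday?",
                     "I have to work", "no", template)
print(resolves_implicature(toy_score, coh, inc))  # True
```

Swapping `toy_score` for a real sequence log-probability function turns this into the zero-shot evaluation described above; the few-shot variant only changes how the prompt string is built.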

4. EXPERIMENTS

The set of large language model classes we evaluate can be grouped into four distinct categories: (1) base models (like RoBERTa (Liu et al., 2019) and OPT), (2) instructable LLMs finetuned on downstream tasks (like Flan-T5), (3) LLMs finetuned on conversational data (like BlenderBot), and (4) instructable LLMs finetuned with an unknown method (OpenAI's API models). Each group contains one or more model classes for which we evaluate a range of model sizes. A detailed categorisation of the models and the attributes we discuss in the results can be found in Appendix D. We make use of the OpenAI and Cohere APIs as well as the pretrained models in the transformers library (Wolf et al., 2020) and EleutherAI's framework to evaluate them (Gao et al., 2021). All code used for this paper can be found on GitHub and the dataset is made publicly available on HuggingFace. We separately treat zero-shot and few-shot in-context evaluation, discussing performance for different model sizes of each model class and the variance over the prompt templates. Additionally, we manually group the test examples into categories and analyse what type of examples are difficult for the models. We contrast the models' performance with human performance. To this end, each test example gets annotated by five humans. We split the test set in four and assign each annotator a subset, giving us twenty annotators in total. Details on the human experiment can be found in Appendix E. Detailed performance broken down by model and prompt template can be found in Appendix F.5.

4.1. ZERO-SHOT EVALUATION

The best performing model classes overall. Table 1 shows the best zero-shot accuracy each model class achieved on the implicature task. The OpenAI models ("UNK FT") perform significantly better than any other. The best accuracy is achieved by InstructGPT-3-175B (i.e. text-davinci-001, a 175 billion parameter model) at 72% ± 2.8. This leaves a gap of 13.9% with human average performance. Text-davinci-002 comes second with a zero-shot accuracy of 70.6% ± 2.3, but the difference with InstructGPT-3-175B is not significant. All models in the other groups obtain performance closer to random than to humans (between 53.4% for BlenderBot-2.7B and 63.3% for Flan-T5-780M), showing a gap of at least 23% with the average human. We hypothesise that the instruction finetuning applied to OpenAI's API models is especially important for this task, but since the method is unknown we cannot analyse it further. In Appendix F.1 we reframe the task such that models can contrast the coherent and incoherent prompt, but this did not improve performance. Moreover, in Appendix F.3 we address the stochasticity arising from the fact that OpenAI's and Cohere's models are behind an API. After running the zero-shot experiment ten times through each API we conclude there is some stochasticity, but it is too small to impact the conclusions.

Table 1: Column "all templates" has the mean performance on all templates. The std is over prompt templates for the models and over annotators for humans. The rightmost two columns hold a breakdown into the mean performance on the templates of the groups "structured" and "natural" respectively.

Sensitivity to prompt wording. As detailed in Table 4, each example in the test set is wrapped in six different prompt templates. The standard deviation in Table 1 shows the estimated sensitivity to different prompt wordings. The standard deviation ranges from 0.3 for BlenderBot to 7.0 for T0-11B when looking at all templates. This variation is often much smaller when separating the performance over structured and natural prompts. Cohere-52B and BLOOM-7B1 are better at naturally worded prompts (templates 2, 5, and 6 in Table 4), whereas OpenAI's models, T0-11B, and OPT-30B are better at structured prompts (templates 1, 3, and 4 in Table 4). All in all, the sensitivity to prompt wording does not seem to be a problem for this task; the best and worst evaluations for each model do not change the fact that InstructGPT-3-175B performs best, but significantly worse than humans.
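The per-template aggregation described above (mean and standard deviation over all six templates, plus the structured/natural breakdown) amounts to a few lines of bookkeeping. The accuracies below are illustrative placeholders, not numbers from the paper's tables; the template grouping (structured = 1, 3, 4; natural = 2, 5, 6) follows the text.

```python
import statistics

# Hypothetical per-template accuracies for one model (illustrative only).
acc = {1: 0.715, 2: 0.660, 3: 0.720, 4: 0.705, 5: 0.648, 6: 0.672}
structured, natural = [1, 3, 4], [2, 5, 6]  # grouping used in the paper

def summarise(template_ids):
    """Mean and sample standard deviation of accuracy over a template group."""
    vals = [acc[t] for t in template_ids]
    return statistics.mean(vals), statistics.stdev(vals)

overall = summarise(list(acc))
print(f"all templates: {overall[0]:.3f} +/- {overall[1]:.3f}")
print(f"structured:    {summarise(structured)[0]:.3f}")
print(f"natural:       {summarise(natural)[0]:.3f}")
```

A large gap between the structured and natural means for a model signals the kind of wording sensitivity discussed above.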


The effect of scaling. The left plot in Figure 2 shows the scaling laws we obtained from the model classes for which we know the number of non-embedding parameters. We again observe that OpenAI's instructable models perform significantly better than almost all other models on this task. Surprisingly, for many models the slope of the line is either near zero or decreasing; the only models whose performance clearly improves with scale are OpenAI's instructable models.

Breaking down performance per example type. Generalised implicatures require little or no context to be understood. They are the simplest type of example in the test set, and generally imply the same thing ("some" almost always implies "not all"). Particularised implicatures, by contrast, do require context to be resolved. For example, from Table 2, we need the context that it is undesirable to stay up late drinking when one has to get up early (see Appendix B for more on generalised vs. particularised implicatures). The world knowledge type requires knowledge of the physical world to be resolved; for the example in Table 2, we need to know that you cannot leave fingerprints when wearing gloves. Idiom types contain an idiom or a metaphor that one needs to know or understand to resolve the implicature, and finally rhetorical question types contain a question like "Is the Pope Catholic?", often requiring factual knowledge to be resolved. In Figure 3 the relative accuracy difference with the mean is shown for the model classes Cohere and InstructGPT-3 (an absolute plot can be found in Appendix F.4). The performance increase for larger models seems driven by the simple examples in the dataset that require no context to be resolved. We hypothesise that scaling up model size alone will not help with more complex implicature resolution. Moreover, as mentioned in Section 1, even though particularised implicatures do require context to be resolved, they all imply a simple "yes" or "no". We conjecture that implicatures entailing several non-binary propositions are unlikely to be resolved by current SOTA language models.
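The "relative accuracy difference with the mean" plotted per category can be computed as below. The per-example results are toy data invented for illustration; only the computation mirrors the analysis described above.

```python
from collections import defaultdict

# Hypothetical per-example results: (category, was the model correct?) pairs.
results = [
    ("generalised", True), ("generalised", True), ("generalised", True),
    ("particularised", True), ("particularised", False),
    ("world knowledge", False), ("world knowledge", True),
]

def per_category_delta(results):
    """Accuracy per category minus overall accuracy (relative difference)."""
    overall = sum(correct for _, correct in results) / len(results)
    by_cat = defaultdict(list)
    for cat, correct in results:
        by_cat[cat].append(correct)
    return {cat: sum(v) / len(v) - overall for cat, v in by_cat.items()}

deltas = per_category_delta(results)
# In this toy data, generalised examples land above the mean and the
# context-heavy categories below it, matching the pattern reported above.
```

A positive delta means the category is easier than average for the model; the paper's finding is that generalised examples sit well above the mean while particularised and world-knowledge examples sit below it.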
On prompting. There is a narrative around large language models that if they fail a task, it might be that the prompt was not the right one (through works like Reynolds & McDonell (2021b) and Kojima et al. (2022)). The idea is that they can be prompted to simulate almost anything, if you set them up correctly. Because implicature resolution is a ubiquitous result of learning language, we hold the view that a model should be able to do this task if a prompt is given in coherent natural language. Nonetheless, in an additional effort to find the "let's think step-by-step" (Kojima et al., 2022) of zero-shot implicature resolution, we try three more prompt templates. We evaluate a base large language model and the two best performing instructable models: GPT-3-175B, InstructGPT-3-175B, and text-davinci-002.

Table 3: Zero-shot accuracy over three additional prompt templates for a base LLM and two instructable models.

Model                 Accuracy
GPT-3-175B            59.2% ± 4.5
InstructGPT-3-175B    66.1% ± 3.2
text-davinci-002      67.7% ± 9.6

The prompts we use are taken from recent work that proposes a dialogue agent trained with human feedback (Glaese et al., 2022), but adapted to the task of implicature resolution. The full prompts are presented in Table 5 and the results in Table 3. The new templates do not improve performance for any of these models. The variance over the prompt templates for text-davinci-002 is very high, and the best of these three templates does achieve a slightly higher accuracy than the others: 74.5%. These results do not change the picture sketched so far. Of course, we will never claim a black swan does not exist, but given the breadth of our experiments we can conclude that using current LLMs to interpret language in context is non-trivial and advancements are needed.

4.2. FEW-SHOT IN-CONTEXT EVALUATION

The effect of larger k. We prompt the models with in-context examples from the development set to prime them for the task (detailed results in Appendix F.5). The highest accuracy we obtain is 80.6% ± 1.22, by text-davinci-002 for k = 30. This shrinks the gap with the average human to 5.6% and with the best human to 9.2%. Note that humans were tested zero-shot. When looking at the structured prompts, the accuracy is even slightly higher at 81.7% ± 0.9. The best performance due to in-context prompting of the other model groups is obtained by OPT-13B with 67.4% ± 2.1. Note that this is a worse accuracy than OpenAI's instructable models achieve zero-shot. The right plot in Figure 2 shows the relative performance increase due to prompting for the models InstructGPT-3-175B, Cohere-52B, and OPT-175B. In-context prompting boosts performance up to k = 5; for higher k the performance barely increases. For OPT-175B there is a large variance in the effect. We stopped at k = 30 because the models' context windows could not handle more examples. Regardless, from Figure 2 it seems unlikely that larger k would increase performance significantly. In Appendix F.2 we estimate the variance over prompt order for text-davinci-002, which is again low enough to conclude it will not impact the results.

The effect of in-context examples on sensitivity to prompt wording. Figure 4 shows the relative performance increase due to in-context prompting broken down per prompt template. For InstructGPT-3-175B, most templates benefit similarly from more in-context examples, except for template 1. Perhaps surprisingly, we see that this template already achieves a performance of 76.5% at the zero-shot evaluation and does not improve much with few-shot prompting. For Cohere-52B and OPT-175B we see a clear grouping between the structured prompts (dashed lines) and natural prompts (dotted lines).
Cohere struggles significantly more with the structured prompts than with the natural prompts in the zero-shot evaluation, and few-shot prompting can mitigate that, lowering the standard deviation over prompt templates from 4 at k = 0 to 1.89 at k = 30. OPT benefits from prompting for the natural prompts, but not for the structured prompts.

Breaking down performance per example type. We observe again that the context-heavy examples are more difficult for the best performing model, text-davinci-002, at k = 30. Recall that humans obtain a performance of 83.2% on the particularised examples. The model text-davinci-002 obtains a performance of 74.4%, leaving a gap of 8.8% with the average human.
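The few-shot protocol from Section 3, where a fresh random sample of k development examples is drawn for every test example rather than reusing one fixed ordered set, can be sketched as follows. The template and example tuples are taken from the text; the function name is our own.

```python
import random

def build_k_shot_prompt(test_example, dev_set, k, template, rng):
    """Sample k in-context examples afresh for each test example, so that
    exemplar choice and order are marginalised over rather than fixed."""
    shots = rng.sample(dev_set, k)  # sampling without replacement
    lines = ["The following examples are coherent sentences:"]
    lines += [template.format(**ex) for ex in shots]
    lines.append("Finish the following sentence:")
    lines.append(template.format(**test_example))
    return "\n".join(lines)

template = 'Esther asked "{u}" and Juan responded "{r}", which means {i}'
dev_set = [
    {"u": "Have you found him yet?", "r": "They're still looking", "i": "no"},
    {"u": "Are you having fun?", "r": "Is the pope Catholic?", "i": "yes"},
    {"u": "Did you leave fingerprints?", "r": "I wore gloves", "i": "no"},
]
test_ex = {"u": "Can you come to my party on Friday?",
           "r": "I have to work", "i": "no"}
prompt = build_k_shot_prompt(test_ex, dev_set, k=2, template=template,
                             rng=random.Random(0))
```

Passing a seeded `random.Random` per test example keeps the evaluation reproducible while still varying exemplars and their order across the test set.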

5. CONCLUSION AND FUTURE WORK

Large language models have made remarkable progress on fluency and coherence in recent years. We argue, however, that a central aspect of language understanding is still missing. To understand language means to understand its pragmatics: its usage in context. We design a protocol that evaluates LLMs on binary implicature resolution and establish a significant gap with human understanding. The best performing models leave a gap of 13.9% with the average human in the zero-shot setting, and of 5.6% when k = 30. All other models obtain performance closer to random than to human performance. Model scaling plots and few-shot evaluations show that increasing model size and prompt size is unlikely to close the gap. Moreover, when isolating performance on a context-heavy subset of the test set the gap becomes more pronounced. On context-heavy examples the gap with the average human for the best model is 23.5% in the zero-shot setting, and 8.8% when k = 30. We conjecture that a large part of the zero-shot performance increase for larger models is driven by simple examples in the dataset that require no context to be resolved. We further conjecture that the large difference in performance between OpenAI's text-davinci models and all other LLMs can be explained by the type of instruction finetuning they apply; however, without access to other comparably instruction-finetuned models we cannot verify this.

The type of implicatures we study is a simple type of conversational implicature that can be resolved to a yes or a no. This leaves ample room for the design of benchmarks with complex implicatures entailing more interesting propositions. Humans resolve much more complex propositions intuitively in conversation. For example, imagine Esther now asking "Can I use your stapler?" and Juan responding "Here's the key to my office." Juan is implicating that (1) Esther can use the stapler, (2) the stapler is located in the office, and (3) the office is currently locked.
Additionally, an interesting question for future work is at which accuracy models become indistinguishable from humans. This could be answered with a type of Turing test in which a human must distinguish an LLM from another human by prompting both with a sequence of implicatures. We believe substantial work needs to be done to move beyond fluent text generation towards communication with autonomous agents, and we hope this work will allow researchers to measure progress towards this goal.

6. REPRODUCIBILITY STATEMENT

We share all the data, human annotations, code used for the evaluations, and the raw results in the supplementary material. Additionally, in Appendix F.3 we estimate the variance due to stochasticity in the APIs of OpenAI and Cohere. Of course, if either OpenAI or Cohere decides to change the models behind the API, the results might look different. We publish the exact date and time each API was queried for the results in Appendix G. Finally, in Appendix F.2 we estimate the variance over the prompt order of the in-context examples.

7. ETHICS STATEMENT

In this work, we conduct a study with human subjects (see Appendix E for details). To get matched with participants, we used the platform Prolific. Prolific complies with ethical standards according to UK law (e.g. complying with the GDPR). We compensated participants with a UK living wage at 15 GBP an hour, which is 6 GBP an hour more than the 9 GBP per hour Prolific recommends. Implicature is an aspect of pragmatics, and pragmatic language impairments are universal in Autism Spectrum Disorder (ASD) (American Psychiatric Association, 2013). Difficulties in understanding scalar implicatures are claimed to be present in people with ASD (Volden, 2017), although the nature of the relation has proven hard to establish and has recently been debated (Katsos et al., 2011; Schaeken et al., 2018). For the purposes of this work, whether or not implicature understanding relates to ASD is not important. We took the following steps to make sure no sensitive data is collected or published. The human annotations we obtain are anonymous, related to a participant only by their Prolific ID for the purposes of compensation. In publishing the human annotations, we will not publish the Prolific ID of participants or anything else related to the participants. Additionally, we did not collect or request any personal or demographic characteristics of the participants, apart from that they are all native English speakers.

B BACKGROUND ON CONVERSATIONAL IMPLICATURE

Juan's response in the introduction seemingly violates the Gricean maxim of relevance: after all, he starts talking about work when Esther asks him about a party. However, because Juan agreed to be relevant he must be implying that having to work means he cannot come to the party. Grice contrasts conversational implicatures that arise through context with conventional implicatures. These are implicatures where the conventional meaning of the word determines what is implicated. An example given by Grice is the following sentence: "he is an Englishman; he is therefore brave."
Grice notes that this sentence does not literally state that an Englishman being brave is a direct consequence of him being English, but it is implied by the conventional meaning of the word 'therefore'. Since then, issues with the Gricean cooperative principle have been pointed out by many. In this work, we focus on conversational implicatures and not on conventional implicatures. All conversational implicatures are negotiable by context, but the way they depend on context can be different. Grice (1975) identifies generalised conversational implicatures and particularised conversational implicatures. The former require little or no context to be resolved. For example, "some athletes smoke" can imply "not all athletes smoke", but might also imply "I do not know whether all athletes smoke" when it is a response to the question "do you know whether all athletes smoke?" (Davis, 2019). The latter only arise in certain contexts. For example, the response "I have an early morning" to the question "do you want to stay for a drink?".

C DETAILED PROMPT TEMPLATES

Finish the following text: Karen asked "<utterance>" and William responded "<response>", which means <implicature>

6 Finish the following text: Bob asked "<utterance>" and Alice responded "<response>", which means <implicature>

D MODEL CATEGORIZATION

Table 6 contains details on the model classes that are a part of each group of models we evaluate, along with their model sizes.

E HUMAN EVALUATION

The participants for the human evaluation in this paper were recruited using Prolific (www.prolific.co). The setup of the experiment is as follows. We divide the test set of 600 examples into four non-overlapping subsets of 150 examples. Each set of 150 examples was given to five unique annotators. This means each example in the test set is labeled five times by different people, and we have in total twenty annotators for the whole test set (five different ones for each of the four subsets). The only constraint for the annotators is that they are native English speakers. Each example is wrapped in prompt template 2 (see Table 4) and presented in a Google form. The reason to wrap all examples in prompt template 2, as opposed to a mixture of all six templates, is that although models have been shown to be very sensitive to prompt wording, humans are less likely to perform differently for different prompt templates. All templates are coherent natural language that any native English speaker will understand. That said, to confirm this hypothesis
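The annotation layout described above (600 examples, four non-overlapping subsets of 150, five annotators per subset) can be sketched as a small assignment routine; the function and identifier names are our own.

```python
def assign_annotations(n_examples=600, n_subsets=4, annotators_per_subset=5):
    """Split the test set into non-overlapping subsets and assign each subset
    to several unique annotators, as in the human evaluation described above."""
    size = n_examples // n_subsets
    subsets = [list(range(i * size, (i + 1) * size)) for i in range(n_subsets)]
    assignment = {}  # annotator id -> list of example indices to label
    for s, subset in enumerate(subsets):
        for a in range(annotators_per_subset):
            assignment[f"subset{s}-annotator{a}"] = subset
    return subsets, assignment

subsets, assignment = assign_annotations()
# Twenty annotators in total; every example is labelled exactly five times.
```

Because the subsets are disjoint, each example receives labels from exactly five annotators, which is what makes the per-subset inter-annotator agreement in Table 7 well defined.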

# Prompt template 7

The following text shows an interaction between two humans called Esther and Juan. In the interaction, Esther will ask Juan a question, and Juan will give an answer that contains an implicature. An implicature is an utterance that means something other than the literal meaning of the words. The implicature of Juan's response is yes or no. You, the AI assistant, are asked to finish the text with yes or no. The task begins: Esther asked "<utterance>" and Juan responded "<response>", which means <implicature>

# Prompt template 8

The following text shows an interaction between two humans called Esther and Juan. In the interaction, Esther will ask Juan a question, and Juan will give an answer that has a meaning besides the literal meaning of the words. That meaning is either yes or no. You, the AI assistant, are asked to finish the text with the correct meaning, either yes or no. The task begins: Esther asked "<utterance>" and Juan responded "<response>", which means <implicature>

# Prompt template 9

The following text shows an interaction between two humans called Esther and Juan. In the interaction, Esther will ask Juan a question, and Juan will give an answer that has a meaning besides the literal meaning of the words. That meaning is either yes or no. You, a highly intelligent and knowledgeable AI assistant, are asked to finish the text with the correct meaning, either yes or no. The task begins: Esther asked "<utterance>" and Juan responded "<response>", which means <implicature>

The participants are asked to choose the correct continuation, yes or no (see Figure 6a). As recommended by Prolific, we subject the participants to an attention test (see Figure 6b). At three random places in the form, we add a question that does not contain an implicature and obviously maps to "yes". In this way, if a participant fails at least two of these questions, we can conclude they were not paying attention and remove their answers from the result.
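The exclusion rule above (discard an annotator who fails at least two of the three attention checks) can be sketched as follows; the function name and question identifiers are our own invention, not the study's code:

```python
def passes_attention_test(answers: dict, attention_items: dict,
                          max_fails: int = 1) -> bool:
    """Return True if the annotator fails at most `max_fails` of the
    attention-check questions (each of which obviously maps to 'yes')."""
    fails = sum(1 for item, correct in attention_items.items()
                if answers.get(item) != correct)
    return fails <= max_fails

# Three hypothetical attention-check questions, all answering "yes".
attention = {"q17": "yes", "q48": "yes", "q102": "yes"}

attentive = {"q17": "yes", "q48": "yes", "q102": "no"}   # 1 fail: keep
inattentive = {"q17": "no", "q48": "no", "q102": "yes"}  # 2 fails: discard
```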
In practice, this happened once; we decided to pay the participant regardless, but discarded their results, which were close to random. Table 7 shows the performance of each annotator on the subset they annotated. The average human performance across subsets and annotators is 86.2% ± 2.3, the best performance is 89.8% ± 2.2, and the worst is 83.5% ± 1.5. The column "IAA" shows the average Cohen's kappa coefficient, i.e. the pairwise inter-annotator agreement for each annotator per subset. All agreements are substantial according to the interpretation guidelines for Cohen's kappa (between 0.61 and 0.80).

F.1 CONTRASTIVE TASK

In this section we reframe the implicature resolution task as a contrastive one, allowing the model to contrast the coherent and the incoherent sentence within a single prompt.

Contrastive task. In the ranking task the model is required to assign a higher likelihood to the coherent utterance than to the incoherent one (pθ(x) > pθ(x̄)). In assigning a likelihood to x, the model has no knowledge of x̄, and vice versa. We hypothesize that the task might become easier if we reformulate it as a contrastive task. Consider the following prompt p.

Which of the following sentences is coherent:

A: Esther asked "Can you come to my party on Friday?" and Juan responded "I have to work", which means no.
B: Esther asked "Can you come to my party on Friday?" and Juan responded "I have to work", which means yes.

Answer:

We can now evaluate the models' ability to identify the coherent sentence by checking whether they assign pθ(A | p) > pθ(B | p). Note that this can again be framed as a ranking task of assigning a higher likelihood to the coherent prompt: if we finish the above prompt p by adding "A" to make a coherent prompt x and "B" to make an incoherent prompt x̄, we can again formulate the task as pθ(x) > pθ(x̄). The difference is that within both the coherent and the incoherent prompt, the model can contrast the coherent and incoherent utterance with each other. We randomise the assignment of A and B to the utterances. We run a small experiment with the contrastive task using the best performing model overall, OpenAI's text-davinci-002, for k = {0, 1, 5}. We use two prompt templates, and for each template we try three different multiple-choice answer formats: A and B as above, one and two, or the full text of the answer. For the last option the coherent prompt x looks as follows:

Which of the following sentences is coherent:

A: Esther asked "Can you come to my party on Friday?" and Juan responded "I have to work", which means no.
B: Esther asked "Can you come to my party on Friday?" and Juan responded "I have to work", which means yes.

Answer: Esther asked "Can you come to my party on Friday?" and Juan responded "I have to work", which means no.

Throughout the experiments, we kept this randomly sampled order the same, meaning that if you re-run the 5-shot evaluation you get exactly the same orderings. The reason for this is that we want to evaluate each model equally. In this section we ask how the performance changes for the best performing model if we select another random order. We do this for the 5-shot evaluation, because the results show that adding more in-context examples barely helps performance. Table 9 shows the results of this experiment.
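Keeping the randomly sampled ordering identical across re-runs, as described above, comes down to fixing the sampler's seed; a minimal sketch (the function name and seed value are ours, not the paper's code):

```python
import random

def sample_ordered_examples(dev_set, k, seed=0):
    """Sample an ordered list of k in-context examples from the
    development set; a fixed seed makes the ordering reproducible."""
    rng = random.Random(seed)
    return rng.sample(dev_set, k)

dev_set = [f"example-{i}" for i in range(20)]
first = sample_ordered_examples(dev_set, k=5, seed=0)
second = sample_ordered_examples(dev_set, k=5, seed=0)
# Same seed, same ordered sample, so every model sees identical prompts.
```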
Some prompt templates seem to be more sensitive to prompt example ordering than others, but for none of them is the variance high enough to change any conclusions.
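Returning to the contrastive framing, the decision rule compares the likelihoods of the two completed prompts. The sketch below mocks the model's log-probabilities (the scores and helper names are ours; a real run would query the LM for sequence likelihoods):

```python
import random

def contrastive_choice(logp_a: float, logp_b: float) -> str:
    """Decide which option is coherent: prefer the completion whose
    full prompt receives the higher log-likelihood, i.e. test
    p(A | prompt) > p(B | prompt)."""
    return "A" if logp_a > logp_b else "B"

def randomise_options(coherent: str, incoherent: str, rng: random.Random):
    """Randomly assign the coherent/incoherent completions to the
    labels A and B, returning the two options and the gold label."""
    if rng.random() < 0.5:
        return coherent, incoherent, "A"
    return incoherent, coherent, "B"

rng = random.Random(42)
option_a, option_b, gold = randomise_options(
    '..., which means no.', '..., which means yes.', rng)

# Mocked log-likelihoods: a model that resolves the implicature assigns
# the coherent completion the higher score.
logp_a, logp_b = (-12.3, -15.8) if gold == "A" else (-15.8, -12.3)

predicted = contrastive_choice(logp_a, logp_b)
correct = predicted == gold
```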

F.3 VARIANCE OVER API RUNS

In this section we comment on the reproducibility of research done using APIs. Two of the model classes we evaluate sit behind an API, meaning we do not have control over what happens to the prompt before the model processes it. We run the main (zero-shot) evaluation ten more times for the largest models of OpenAI and Cohere, text-davinci-002 and Cohere-52B. The results of this experiment are shown in Tables 10 and 11. From this we conclude that there is some stochasticity in the API that we have no control over, somewhat more for OpenAI than for Cohere, but again we can be relatively confident that the conclusions will not change because of it. The results of this work are therefore reproducible with access to the same models currently behind the API. Unfortunately, once OpenAI or Cohere change the models behind the API, these results are no longer exactly reproducible. For completeness, we list the timestamp at which each result was obtained in Appendix G.
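The spread over repeated API runs reported in Tables 10 and 11 is just the sample mean and standard deviation of the per-run accuracies; for illustration (the accuracy values below are made up, not the tables' numbers):

```python
import statistics

# Hypothetical zero-shot accuracies from repeated runs of the same
# evaluation through an API.
run_accuracies = [0.72, 0.71, 0.73, 0.72, 0.72]

mean = statistics.mean(run_accuracies)
std = statistics.stdev(run_accuracies)  # sample standard deviation
```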

F.4 ABSOLUTE TYPE LABEL ANALYSIS

Figure 7 shows the absolute accuracy for the type labels (from Section 4.1) that show a significant pattern; particularised and generalised. We observe increasing performance for generalised implicatures with scale, and decreasing or random performance for particularised implicatures.

F.5 DETAILED RESULTS PER MODEL

This section contains the results used for the zero-shot and few-shot evaluations in the main text (Section 4), broken down per prompt template. See Tables 12 through 58.

G TIMESTAMPS API CALLS

For reproducibility purposes, Tables 59 and 60 contain the dates and times at which the OpenAI and Cohere APIs were queried for the results.



Appendix A contains details on how this answer was obtained from InstructGPT-3.
In Appendix B we present a comprehensive introduction to implicature.
The method is unpublished and might differ from the original InstructGPT (Ouyang et al., 2022). Note that there are several important aspects unknown for models behind APIs, like Cohere and OpenAI.
Supplied in supplementary material. When the anonymity period is over, a link will appear here.
For all of OpenAI's API models except text-davinci-002, the size is assumed to align with the GPT-3 paper. There is reasonable evidence for this to be true: https://blog.eleuther.ai/gpt3-model-sizes/



of language modelling benchmarks are widespread (Raji et al., 2021; Bender et al., 2021; Bender & Koller, 2020; Raji et al., 2022). These works question whether the evaluation protocols measure what researchers claim they do. In a similar spirit to our work, Valmeekam et al. (2022) point out that although many works claim to use LLMs to "plan" (Ahn et al., 2022; Shah et al., 2022; Huang et al., 2022b), they either do not evaluate whether LLMs can plan or use limited benchmarks that cannot justify the claims being made. Valmeekam et al. (

Figure 2: Left: zero-shot accuracy for different sizes of the model classes. The error bars show the standard deviation over prompt templates. OpenAI's instructable models perform better than most other models. For all models there is a significant gap between the best accuracy and human accuracy. Right: relative performance increase over zero-shot due to in-context examples, shown for the largest models of the InstructGPT, Cohere, and OPT classes (note that they are of different sizes). The error bars show the standard deviation over prompt templates. Performance increases strictly up to k = 5, and only slightly after. For OPT-175B there is a large variance over prompt templates.

Figure 3: Relative accuracy (w.r.t. mean accuracy) for each example type for Cohere and InstructGPT-3. A point above the dotted line means the model gets that type right more often than the average performance on the test set. Particularised (context-heavy) examples are significantly more difficult than generalised (context-free) examples for both model classes. The type labels World knowledge, Idiom, and Rhetorical question do not show a significantly meaningful pattern.

Figure 4: Relative performance increase over 0-shot due to in-context prompting. Structured prompt templates are dashed lines (1, 3, 4) and natural prompt templates dotted lines (2, 5, 6).

In Figure 5, the screen shown to potential participants on Prolific is shown. Participants are paid 15 pounds an hour, which was the living wage at the time of the experiment and more than the 12 dollars an hour Prolific recommends. The 150 test examples are wrapped in prompt template 2 (see Table 4).

Figure 5: A screenshot of how the experiment is presented to potential annotators on Prolific (www. prolific.co).

(a) The start of the Google form participants are asked to fill out for the human study. (b) Part of the Google form the participants are asked to fill out. The second question in this image is part of the attention test: Juan's response does not contain an implicature but simply gives away the correct answer.

Figure 6: Screenshots of the Google form participants fill out as part of the implicature study.

Figure 7: The absolute accuracy for each example type for model classes Cohere and InstructGPT-3. Particularised (context-heavy) examples are significantly more difficult than generalised (contextfree) examples for both model classes. The type labels World knowledge, Idiom, and Rhetorical question do not show a significantly meaningful pattern and are left out of this plot. The error bars are standard deviation over prompt templates.

al., 2013; Joshi et al., 2017; Kwiatkowski et al., 2019), language completion (Levesque et al., 2012; Paperno et al., 2016; Mostafazadeh et al., 2016; Zellers et al., 2019; Sakaguchi et al., 2021), common-sense reasoning (Mihaylov et al., 2018; Clark et al., 2018; Bisk et al., 2020; Bhakthavatsalam et al., 2021), reading comprehension (Lai et al., 2017; Choi et al., 2018; Reddy et al., 2019; Dua et al., 2019), natural language inference (Rajpurkar et al., 2018; Nie et al., 2020), and more (Wang et al., 2019; Srivastava et al., 2022). Even though implicature is one of the most important aspects of language pragmatics (Levinson, 1983), none of these benchmarks explicitly evaluate implicature understanding. Reddy et al. (2019) evaluate implicit coreference among other aspects of conversation. This may indirectly measure performance on implicatures. However, unlike our work, it fails to decouple performance on implicatures from other aspects of pragmatics. Zheng et al. (

The zero-shot accuracy for the best performing model of each class. The largest model does not always perform the best (i.e. for EleutherAI, BLOOM, OPT, GPT-3, BlenderBot, and Flan-T5).

An example from the dataset for each type of implicature found in the test set. The rightmost column shows the number of examples of that type we manually found in the test set.

Cohere-52B obtains a mean performance of 58.5%, whereas on generalised examples it achieves 73.9% and on particularised examples 51.5%, which is close to random performance. For InstructGPT-3-175B the mean performance is 72.3%, whereas on generalised examples it is 79.3% and on particularised examples 59.7%. Humans also do worse on the particularised examples (83.2%), but the gap with their mean is smaller. Comparing the accuracy on these examples with humans uncovers a larger gap: 23.5% for InstructGPT-3-175B and 31.7% for Cohere-52B.
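The reported gaps follow from simple subtraction of each model's particularised accuracy from the human one; a quick check of the arithmetic:

```python
# Particularised accuracies (%) as reported in the text.
human_particularised = 83.2
instructgpt_particularised = 59.7
cohere_particularised = 51.5

gap_instructgpt = round(human_particularised - instructgpt_particularised, 1)
gap_cohere = round(human_particularised - cohere_particularised, 1)
```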

Thoppilan et al., 2022; Chowdhery et al., 2022) it is impossible to substantiate this hypothesis. Additionally, to substantiate the hypothesis that model size alone will not close the gap, future work with larger model sizes is needed.

Levinson, 1983; Sperber & Wilson, 1986; Davis, 1998; Lepore & Stone, 2014). The most influential alternative theory is relevance theory by Sperber & Wilson (1986). They do away with the cooperative principle and instead theorise that implicatures arise because speakers try to produce utterances that are both as relevant as possible and require the least effort to process. Another point of contention is the incorporation of conventional implicatures on the pragmatics side. Bach (1999) argues that there is no such thing as conventional implicatures, and that they are simply instances of something else. Based on a thorough treatment of what Grice calls conventional implicatures, Bach argues that all examples of them can be filed under other concepts within semantics, like utterance modifiers (called "utterance modifiers" instead of "sentence modifiers" because they go against the semantic content of the rest of the sentence). Potts (2005) also argues that to explain conventional implicatures we can stay on semantic turf. Indeed, even Grice himself says conventional implicatures derive from the meaning of the words, not from conversational context. However, Potts does not claim conventional implicatures do not exist, but instead argues they arise from a combination of lexical meaning and novel ways of combining words, the latter being the well-known principle of compositionality, an important part of semantics, not of pragmatics. Potts provides us with an illuminating demarcation between conventional and conversational implicatures: conventional implicatures are never negotiable by context, whereas conversational implicatures are context-dependent and can always be cancelled without causing incoherent discourse. Consider again the sentence "he is an Englishman; he is therefore brave." and the sentence "Eddie has three bicycles" (implicating that Eddie has exactly three bicycles and no more).
The implicature of the former sentence cannot be cancelled by new context without contradiction, whereas for the latter, if we continue by saying "In fact, Eddie has 10 bicycles, he is a bicycle junkie", we have cancelled the implicature. This demarcation clearly puts conventional implicatures on the semantic side and conversational implicatures on the pragmatic side. Potts goes on to provide a formal theory of conventional implicatures. In later work, Potts (2006) describes how pragmatic pressures interacting with context cause conversational implicatures to arise. He shows how sensitive conversational implicatures are to small changes in the context: novel information about a speaker's belief state might completely change what is implied. There are many more models of implicature that aim to explain how humans understand language in context. Most notably, Frank & Goodman (2012) formalise the view that speakers produce utterances that are helpful and no longer than necessary with a Bayesian model called the rational speech act (RSA) model. Many variants of the RSA framework have since been proposed; for example, Goodman & Frank (2016) extend it to handle nonliteral uses of language, like irony and metaphor. In the context of computational models, prior work uses insights from pragmatics to show that the use of certain words can make a language model produce biased completions (Patel & Pavlick (2021), e.g. saying someone "claimed" something rather than "said" something), and to inform bias and sentiment classifiers (Greene & Resnik, 2009; Recasens et al., 2013).

Table 4 contains the full prompt templates we used for the main evaluation, and Table 5 contains the extra prompt templates.

The six templates we wrap the test examples in to present to the models. Templates 1, 3, and 4 are of the structured type, and 2, 5, and 6 of the natural type. Within each type, the templates differ only slightly in wording.

The three additional templates we wrap the test examples in to present to the models, adapted from Glaese et al. (2022).

Model categorization for each of the models. UNK stands for unknown, FT for finetuning, MT for multitask, and DL for dialogue.

The performance of the human annotators on the subsets of the test set. Subset 1 through 4 are non-overlapping and cover the whole test set. Annotator X for subset Y might be a different human than annotator X for subset Z. IAA is the average pairwise inter-annotator agreement (Cohen's kappa coefficient) between annotators per subset.
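The statistic in the IAA column, pairwise Cohen's kappa, can be computed as below; the annotator labels in the example are hypothetical, not the study's data:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    chance = sum(counts_a[c] * counts_b[c] for c in counts_a) / n ** 2
    if chance == 1.0:  # both annotators used a single identical label
        return 1.0
    return (observed - chance) / (1 - chance)

# Hypothetical yes/no labels from two annotators on eight examples.
ann_1 = ["yes", "yes", "no", "no", "yes", "no", "yes", "yes"]
ann_2 = ["yes", "yes", "no", "yes", "yes", "no", "yes", "no"]
kappa = cohens_kappa(ann_1, ann_2)  # moderate agreement
```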

Performance on the implicature task framed contrastively by OpenAI's text-davinci-002. The mean and standard deviation are reported over two different prompt templates (templates 1 and 2).

In Table 8, perhaps surprisingly, we see that the contrastive task is much more difficult than the original ranking task. For k = 0, the result is random except for the prompt where the multiple-choice options are A and B. For k = {1, 5} the full-text ranking does best, but is still significantly worse than the original ranking setup. Because of these disappointing results, we did not evaluate the other models contrastively. Future work must establish whether the contrastive setup is worse across all model classes and sizes.

F.2 VARIANCE OVER PROMPT ORDERING

As mentioned in Section 3, models are sensitive to the ordering of the k examples in the prompt. Instead of marginalising over this random factor by evaluating all possible prompt orderings, we randomly sampled an ordered set of examples from the development set for each test example.





Accuracy per prompt template for BERT-cased.

Accuracy per prompt template for RoBERTa-base.

Accuracy per prompt template for RoBERTa-large.

Accuracy per prompt template for GPT-2-medium.

Accuracy per prompt template for GPT-2-large.

Accuracy per prompt template for GPT-2-xl.

Accuracy per prompt template for EleutherAI-125M.

Accuracy per prompt template for EleutherAI-1.3B.

Accuracy per prompt template for EleutherAI-2.7B.

Accuracy per prompt template for EleutherAI-6B.

Accuracy per prompt template for EleutherAI-20B.

Accuracy per prompt template for BLOOM-560M.

Accuracy per prompt template for BLOOM-1B1.

Accuracy per prompt template for BLOOM-1B7.

Accuracy per prompt template for BLOOM-3B.

Accuracy per prompt template for BLOOM-7B1.

Accuracy per prompt template for BLOOM-176B.

Accuracy per prompt template for OPT-125M.

Accuracy per prompt template for OPT-350M.

Accuracy per prompt template for OPT-1.3B.

Accuracy per prompt template for OPT-2.7B.

Accuracy per prompt template for OPT-6.7B.

Accuracy per prompt template for OPT-13B.

Accuracy per prompt template for OPT-30B.

Accuracy per prompt template for OPT-66B.

Accuracy per prompt template for OPT-175B.

Accuracy per prompt template for Cohere-409.3M (Cohere-small).

Accuracy per prompt template for Cohere-6.067B (Cohere-medium).

Accuracy per prompt template for Cohere-13.12B (Cohere-large).

Accuracy per prompt template for Cohere-52B (Cohere-xl).

Accuracy per prompt template for GPT-3-350M (ada).

Accuracy per prompt template for GPT-3-1.3B (babbage).

Accuracy per prompt template for GPT-3-6.7B (curie).

Accuracy per prompt template for GPT-3-175B (davinci).

Accuracy per prompt template for BlenderBot-90M.

Accuracy per prompt template for BlenderBot-2.7B.

Accuracy per prompt template for BlenderBot-9.4B.

Accuracy per prompt template for T0-3B.

Accuracy per prompt template for T0-11B.

Accuracy per prompt template for Flan-T5-780M.

Accuracy per prompt template for Flan-T5-3B.

Accuracy per prompt template for Flan-T5-11B.

Accuracy per prompt template for InstructGPT-3-350M (text-ada-001).

Accuracy per prompt template for InstructGPT-3-1.3B (text-babbage-001).

Accuracy per prompt template for InstructGPT-3-6.7B (text-curie-001).

Accuracy per prompt template for InstructGPT-3-175B (text-davinci-001).

Accuracy per prompt template for text-davinci-002-unknown.


Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019.

With temperatures t = {0, 0.7, 1}, all three of text-davinci-002's responses were similar to:

User: "Have you seen my phone?" InstructGPT: "Yes, I have seen your phone."

The model text-davinci-001 consistently generates:

User: "Have you seen my phone?" InstructGPT: "No I have not seen your phone."

We tried extending the prompt, which gave similar results for text-davinci-002:

The following is a request from a user. InstructGPT is a helpful and friendly conversational agent that tries to assist its users. User: "Have you seen my phone?" InstructGPT: "Yes, I have seen your phone."

The same approach makes text-davinci-001 a bit more helpful:

The following is a request from a user. InstructGPT is a helpful and friendly conversational agent that tries to assist its users. User: "Have you seen my phone?" InstructGPT: "I haven't seen your phone, what type of phone is it?"

This is just a small experiment to illustrate a point; the model goes wrong half of the time, even when prompted to be a helpful assistant. Of course, InstructGPT-3 cannot see, so the only "truthful" response is no.

B BACKGROUND ON IMPLICATURE

The first influential consideration of implicature is Grice (1975) . In his work, Grice continues the trend of moving away from purely logical accounts of language started by Wittgenstein (1921) by hypothesising implicatures arise in conversation when some mutually agreed upon maxims seem to be violated. For example, if we agree on only making relevant contributions to conversation,

