UNPACKING LARGE LANGUAGE MODELS WITH CONCEPTUAL CONSISTENCY

Abstract

If a Large Language Model (LLM) answers "yes" to the question "Are mountains tall?" then does it know what a mountain is? Can you rely on it responding correctly or incorrectly to other questions about mountains? The success of Large Language Models (LLMs) indicates they are increasingly able to answer queries like these accurately, but that ability does not necessarily imply a general understanding of concepts relevant to the anchor query. We propose conceptual consistency to measure an LLM's understanding of relevant concepts. This novel metric characterizes a model by measuring how consistent its responses are to queries about conceptually relevant background knowledge. To compute it we extract background knowledge by traversing paths between concepts in a knowledge base and then try to predict the model's response to the anchor query from the background knowledge. We investigate the performance of current LLMs in a commonsense reasoning setting using the CSQA dataset and the ConceptNet knowledge base. While conceptual consistency, like other metrics, does increase with the scale of the LLM used, we find that popular models do not necessarily have high conceptual consistency. Our analysis also shows significant variation in conceptual consistency across different kinds of relations, concepts, and prompts. This serves as a step toward building models that humans can apply a theory of mind to, and thus interact with intuitively.

1. INTRODUCTION

Large Language Models (LLMs) have had many exciting recent successes. These include high performance and even emergent capabilities using just zero or few-shot prompting (Brown et al., 2020; Wei et al., 2022a), but overall performance remains low compared to humans on a wide range of tasks, even for the largest models (Srivastava et al., 2022), and our understanding of how these models work is still limited. A popular explanation of low performance and inconsistencies is that LLMs are simply learning to mimic the data used to train them, and that this basic pattern recognition limits generalizability, exposing the limits of any understanding in LLMs (Zhang et al., 2022a; Bender & Koller, 2020). We would use a similar line of reasoning to guess how an LLM would answer the question "Can GPT-3 see?": if it performed well on examples from the same distribution we would say it is likely to answer correctly, and likely to answer incorrectly if it performed poorly on those examples. Though valid, this explanation is incomplete because it is entirely agnostic to the specific content of the statement. We would apply the exact same reasoning and come to the same conclusion for similar statements about, say, blood banks or mock trials, as long as they were from the same distribution (in this example, the CSQA2 dataset (Talmor et al., 2021)). This is in contrast to our day-to-day life, where our Theory of Mind allows us to understand other agents (people) by attributing beliefs, intentions, and desires to them (Premack & Woodruff, 1978) in a way that allows us to usefully predict their behavior (Rabinowitz et al., 2018; Dennett, 1991). Beliefs are most relevant here, and should be conceptual in order to best support human understanding (Yeh et al., 2021). Ideally we would also be able to apply this kind of understanding to LLMs, predicting that the model is more likely correct about GPT-3's sight if it knows about GPT-3 generally than if it does not.
This would be a conceptual model of the LLM that allows us to predict its behavior.

Figure 1: Example target question with relevant and irrelevant background knowledge. A model is conceptually consistent when its knowledge of relevant background information (sharing concepts with the target) is consistent with its ability to answer questions correctly.

We want to build models for which that kind of understanding is possible, so in this work we take a step toward that goal by modeling the conceptual knowledge of an LLM and predicting when the LLM will be correct from that model. Our conceptual model is based on an LLM's answers to questions about background knowledge relevant to a particular anchor task (e.g., question answering), which we assume to be a reasonable though imperfect measurement of what the LLM can be said to "know." From this and a measurement of question answering performance we compute conceptual consistency (Figure 1), quantifying whether a model's knowledge of relevant background is consistent with its ability to perform the anchor task. Unlike standard approaches to evaluation, this approach relies on example-specific context.

Defining background knowledge is key to our approach because we need it to be relevant enough to establish meaningful consistency. Given the target query "Can GPT-3 see?" a relevant background query might be "Was GPT-3 built by OpenAI?" while "Is the sky blue?" would be an irrelevant query. Instead of requiring a background fact to logically support the target query in some way, we say a background fact is relevant if it can tell us something about how a typical human language user would respond. Given a ground truth response Y to the target, a human response Ŷ, and respective responses Y_K and Ŷ_K to a potentially relevant background fact, we define relevance using a conditional probability.
If P(Y = Ŷ | Y_K = Ŷ_K) ≠ P(Y = Ŷ), then the background fact is relevant because it is not independent of the target and gives us information about whether the speaker will be right or wrong. Knowing that GPT-3 was built by OpenAI makes it more likely that someone will also know GPT-3 cannot see, because that fact is true and involves relevant concepts. While knowing the color of the sky is laudable, it has no conceptual overlap with GPT-3's ability to see. In this paper, we focus on this kind of conceptual relevance.

After extracting background knowledge we use prompting to measure how a given LLM handles the background knowledge and how it performs at the anchor task. For this we study three varieties of generative language models across multiple scales up to 66 billion parameters and use a majority vote style prompting procedure to maximize the robustness of our approach to linguistic variations. Our core contributions are:

• We extract conceptually relevant background knowledge with respect to examples from an anchor task and map it onto background knowledge questions.
• We use a novel majority vote style zero-shot prompting procedure applied to generative language models to measure LLM performance.
• We measure conceptual consistency, focusing on generative language models and showing that consistency is low, though it does increase with model size up to 66 billion parameters.
• We report conceptual patterns in model behavior that fall out of our analysis.
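The relevance criterion above lends itself to a simple empirical estimate. As a minimal sketch (not the paper's implementation), suppose we have parallel boolean indicators recording, over many respondents, whether each answered the target question and a candidate background fact correctly; the background fact is relevant when the conditional accuracy differs from the marginal accuracy. The function name and toy data below are illustrative assumptions:

```python
def relevance_gap(target_correct, background_correct):
    """Estimate |P(Y = Ŷ | Y_K = Ŷ_K) - P(Y = Ŷ)| from paired observations.

    target_correct[i] is True when respondent i answered the target
    question correctly; background_correct[i] is True when the same
    respondent answered the background fact correctly. A nonzero gap
    means the background fact is not independent of the target, i.e.
    it is relevant in the sense defined above.
    """
    n = len(target_correct)
    p_target = sum(target_correct) / n  # marginal P(Y = Ŷ)
    # Conditional P(Y = Ŷ | Y_K = Ŷ_K): restrict to respondents who
    # answered the background fact correctly.
    n_background = sum(background_correct)
    both = sum(t for t, b in zip(target_correct, background_correct) if b)
    p_conditional = both / n_background
    return abs(p_conditional - p_target)


# Toy data: respondents who know the background fact tend to also get
# the target right, so the gap is nonzero and the fact looks relevant.
gap = relevance_gap([True, True, False, False],
                    [True, True, False, True])
```

Note that the paper's method applies this idea to a single model rather than a population of respondents, predicting the model's anchor response from its background responses; the sketch above only illustrates the dependence test itself.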

2. RELATED WORKS

Much work has been devoted to studying the limits of large language models, beginning with BERT (Devlin et al., 2019). Typical evaluation of language models measures performance on datasets of labeled examples across a range of tasks, such as those that constitute the amalgamated BIG-Bench benchmark (Srivastava et al., 2022). In general, these models are explained as simply mimicking the data they were trained with. Low level critiques question the ability of LLMs to reason logically (Zhang et al., 2022a) or pass adapted psychological tests of language in humans (Ettinger,

