MOCA: COGNITIVE SCAFFOLDING FOR LANGUAGE MODELS IN CAUSAL AND MORAL JUDGMENT TASKS

Abstract

Human commonsense understanding of the physical and social world is organized around intuitive theories. These theories support making causal and moral judgments. When something bad happens, we naturally ask: who did what, and why? A rich literature in cognitive science has studied people's causal and moral intuitions. This work has revealed a number of factors that systematically influence people's judgments, such as the presence of norms and whether or not the protagonist in a scenario was aware of their action's potential consequences. Here, we investigate whether large language models (LLMs) make causal and moral judgments about text-based scenarios that align with those of human participants. We find that without any annotations, LLMs and human participants are not well aligned (17%-39% agreement). However, with simple expert-written instructions, LLMs can accurately annotate which relevant factors are present in a scenario. We demonstrate how these annotations can be used to bring LLMs into closer alignment with people (36.3%-47.2% agreement). These results show how insights from cognitive science can help scaffold language models to more closely match human intuitions on challenging commonsense evaluation tasks.

1. INTRODUCTION

We live in a complex world where most things that happen are the result of a multitude of factors. How do humans handle this complexity? Cognitive scientists have proposed that we do so by organizing our understanding of the world into intuitive theories (Gerstenberg & Tenenbaum, 2017; Wellman & Gelman, 1992). Accordingly, people have intuitive theories of the physical and social world with which they reason about how objects and agents interact with one another (Battaglia et al., 2013; Ullman et al., 2017; Gerstenberg et al., 2021; Baker et al., 2017). Intuitive theories support a variety of cognitive functions, including making predictions about the future, drawing inferences about the past, and giving explanations of the present (Davis & Marcus, 2015; Lake et al., 2017). Concepts related to causality and morality form key ingredients of people's physical and social theories. For instance, people's causal knowledge helps them parse what happens into actions, events, and their relationships (Davis & Marcus, 2015). People's moral knowledge helps them tell apart good from bad (Alicke et al., 2015).

Over the last few years, large language models (LLMs) have become increasingly successful at emulating certain aspects of human commonsense reasoning, in tasks such as physical reasoning (Tsividis et al., 2021), visual reasoning (Buch et al., 2022), moral reasoning (Hendrycks et al., 2020), and text comprehension (Brown et al., 2020; Liu et al., 2019b; Bommasani et al., 2021). Here, we investigate to what extent pretrained LLMs align with human intuitions about the role of objects and agents in text-based scenarios, using judgments about causality and morality as two case studies. We show how insights from cognitive science can help to align LLMs more closely with the intuitions of human participants.

Prior work on alignment between LLMs and human intuitions usually collected evaluation datasets in two stages.
In the first stage, participants write stories given open-ended instructions. In the second stage, another group of participants labels these participant-generated stories (e.g. Hendrycks et al., 2020). The upside of this approach is the ease of obtaining a large number of examples in a short period of time. The downside is that the crowd-sourced stories are often not carefully written and lack experimental control. Here, we take a different approach. Instead of relying on participant-generated scenarios, we collected two datasets from the existing cognitive science literature: one on causal judgments, and another on moral judgments. These scenarios were carefully constructed by researchers with the intention of systematically manipulating one or a few factors that have been theorized to influence people's judgments. Using these scenarios, we can design a series of experiments that measure the LLMs' alignment with human intuition and use the scientific framework around human judgments to propose a step-by-step reasoning process that helps LLMs align more closely with humans.

Causal Judgments. Judging which of several events was "the" cause of an outcome often comes naturally to people. Making causal judgments goes beyond simply identifying that two events in a story are linked by a causal verb (such as "A caused B"). People are sensitive to additional aspects of the story, such as the normality of the events, whether the outcome came about through action or omission, and the time course over which the events unfolded. Thus, making causal judgments like humans do requires going at least one step further than what typical natural language understanding tasks assess (Wang et al., 2018).

Moral Judgments. To better understand people's moral intuitions, cognitive scientists ask people to judge how permissible an action is in a moral dilemma.
In recent years, a particular type of moral dilemma, the trolley problem, has received much attention in language understanding (Hendrycks et al., 2020; Emelin et al., 2021; Jiang et al., 2021). However, some of these trolley problems (Thomson, 1985) can be "solved" solely by numerical comparison ("killing one person is more morally permissible than killing five people"). In real life, judging moral permissibility is much more complex: is harm inevitable or avoidable? How is the action causally linked to the harm? What alternative actions could a person have taken? By systematically varying these factors across scenarios, cognitive scientists have begun to uncover how people's moral judgments work (Cushman & Young, 2009; Winfield et al., 2019).

Our Contributions. We summarized the main experimental findings of 24 cognitive science papers into factors that have been shown to systematically influence people's judgments in causal and moral stories (Table 1). Relying on these factors, we evaluate and inspect LLMs' performance on a new, richly annotated dataset with human judgments to better understand when and why LLMs and humans align. We ask the following research questions:

(R1): Do LLMs make the same causal and moral judgments as people? (Finding: No.)

(R2): Do LLMs improve when the relevant causal and moral factors in the story are made explicit and highlighted? (Finding: Yes. After a process called Thought-as-Text Translation, alignment increases.)

(R3): Can LLMs identify systematic factors related to moral and causal judgments? (Finding: Yes. We treat factor identification as a few-shot natural language understanding task and evaluate LLMs on 11 factors relevant to causal and moral judgments.)

(R4): Can LLMs produce more human-aligned judgments? (Finding: Yes, by combining the insights from R2 and R3.)
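The two-stage scaffolding idea behind R2 and R3 can be sketched in code: first query a model with one factor-identification question per scenario, then translate the resulting annotations into text and prepend them to the judgment query. This is a minimal illustrative sketch, not the paper's actual implementation; the function names, the three example factors, and the prompt wording are all hypothetical, and `answer_fn` stands in for an arbitrary LLM call.

```python
# Hypothetical sketch of a two-stage factor-annotation pipeline.
# Factor names and prompt templates are illustrative assumptions.

FACTOR_QUESTIONS = {
    "norm_violation": "Does any agent in the story violate a norm?",
    "awareness": "Is the agent aware of the potential consequences of their action?",
    "action_vs_omission": "Does the outcome result from an action or an omission?",
}

def annotate_factors(story: str, answer_fn) -> dict:
    """Stage 1: ask the model one question per factor.

    `answer_fn(prompt) -> str` is a placeholder for an LLM call."""
    annotations = {}
    for factor, question in FACTOR_QUESTIONS.items():
        prompt = f"Story: {story}\nQuestion: {question}\nAnswer briefly."
        annotations[factor] = answer_fn(prompt).strip().lower()
    return annotations

def build_judgment_prompt(story: str, annotations: dict) -> str:
    """Stage 2: translate the annotations into text ('thought-as-text'
    style) and prepend them to the final judgment query."""
    notes = "\n".join(
        f"- {factor.replace('_', ' ')}: {answer}"
        for factor, answer in annotations.items()
    )
    return (
        f"Story: {story}\n"
        f"Relevant factors identified in the story:\n{notes}\n"
        "Given these factors, did the agent cause the outcome? Answer yes or no."
    )

# Usage with a stub model that always answers "yes":
story = "Alice took the last parking spot reserved for staff, and Bob was late."
annotations = annotate_factors(story, lambda p: "yes")
judgment_prompt = build_judgment_prompt(story, annotations)
```

The key design point is that stage 1 turns factor identification into a few-shot natural language understanding task the model handles well, so that stage 2 only has to condition on explicit, textual factor statements rather than infer them implicitly.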

2. RELATED WORK

For causal reasoning, there is an active line of research at the intersection of natural language processing (NLP) and commonsense reasoning that involves extracting and representing causal relationships among entities in text. In some cases, these relationships are based on commonsense knowledge of how objects behave in the world (Sap et al., 2019; Bosselut et al., 2019; Talmor et al., 2019). In other cases, models identify scenario-specific causal relations and outcomes. Sometimes, the causal relationship between entities is explicitly stated (e.g. "those cancers were caused by radiation") (Hendrickx et al., 2010), while at other times the relationship is left implicit and needs to be inferred (Mostafazadeh et al., 2016). Causal reasoning is also included in broad benchmarks for language understanding, such as the Choice of Plausible Alternatives (COPA) task in SuperGLUE (Roemmele et al., 2011; Wang et al., 2019).

For moral reasoning, tasks have focused on evaluations of agents in narrative-like text. These tasks and datasets vary in the amount of structure they provide, ranging from pairs of free-form anecdotes and judgment labels (Lourie et al., 2021; Hendrycks et al., 2020), to inputs with components separated

