WIKIWHY: ANSWERING AND EXPLAINING CAUSE-AND-EFFECT QUESTIONS

Abstract

As large language models (LLMs) grow larger and more sophisticated, assessing their "reasoning" capabilities in natural language grows more challenging. Recent question answering (QA) benchmarks that attempt to assess reasoning are often limited by a narrow scope of covered situations and subject matters. We introduce WIKIWHY, a QA dataset built around a novel auxiliary task: explaining why an answer is true in natural language. WIKIWHY contains over 9,000 "why" question-answer-rationale triples, grounded in Wikipedia facts across a diverse set of topics. Each rationale is a set of supporting statements connecting the question to the answer. WIKIWHY serves as a benchmark for the reasoning capabilities of LLMs because it demands rigorous, explicit rationales for each answer, demonstrating the acquisition of implicit commonsense knowledge that is unlikely to be easily memorized. GPT-3 baselines achieve only 38.7% human-evaluated correctness in the end-to-end answer-and-explain condition, leaving significant room for future improvement.

1. INTRODUCTION

Error analyses of practical NLP systems demonstrate that some of the mistakes made by state-of-the-art models would be avoided by basic human intuition (Shuster et al., 2022), and some of the most challenging tasks for models are the same ones that might be trivial to human children. With modern systems' impressive performance on tasks such as grammar correction showing that manipulating language is not the issue, LLMs seem to face a more fundamental lack of common sense: an understanding of everyday phenomena and how they interact with each other and the world at large. As striking gains in subjective performance on summarization, creative text generation, and apparent language understanding continue to be called into question, the development of strong benchmarks to assess the reasoning capabilities of these LLMs grows more important.

One popular approach to measuring reasoning capability is through performance on question answering (QA) benchmark tasks, where direct queries for information act as a straightforward examination of a system's "understanding." Classic QA datasets, however, are primarily concerned with retrieving factoids to answer questions of "Who," "What," "When," and "Where." These questions have been shown to be answerable with high accuracy by simple pattern-matching approaches (Wadhwa et al., 2018), limiting their ability to measure the aforementioned reasoning capability. Looking to maintain the breadth of topics covered while increasing the difficulty of the QA task, researchers introduced multi-hop QA datasets like HotpotQA (Yang et al., 2018). While challenging, the task's extra complexity mostly leads to unnatural questions that can be addressed with iterated factoid retrieval and entity resolution, rather than a necessary understanding of how different entities interact. Noticeably absent from these prior datasets are "why" questions, which prompt not for factoids but for explanations: reasoning made explicit.
The task of explanation exercises reasoning and produces explicit, interpretable "thought" processes. Capitalizing on these properties, this paper introduces WIKIWHY, a novel dataset containing "why" question-answer pairs. Each WIKIWHY entry contains a rationale explaining the QA pair's causal relation (Figure 1), summing to a total of 14,238 explanation elements. In the context of recent multimodal, self-supervised approaches aiming to capture intuitions unlearnable from text alone (Chadha & Jain, 2021), WIKIWHY presents an opportunity to investigate a specific kind of information absent from text: implicit commonsense assumptions. Compared to other QA datasets with rationales, WIKIWHY covers a significantly broader range of 11 topics, which may prove valuable for developing the skill of applied reasoning in varied, specific situations. Our experiments in explanation generation and human evaluation demonstrate that state-of-the-art generative models struggle to produce satisfying explanations for WIKIWHY cause-effect relations. Our experiments also demonstrate how our proposed task might be used to diagnose a lack of "understanding" of certain relations. Our key contributions are thus:

• We propose explanation within cause-effect relations as a novel problem formulation for exploring LLM reasoning ability.

• We create WIKIWHY, the first question-answering dataset focusing on reasoning within causal relations, spanning 11 topics.

• We perform experiments on state-of-the-art generative models to investigate various settings and establish baseline results with sizable room for improvement.

• We introduce idea-level evaluation metrics for free-form text (explanation) generation and a human judgment correlation analysis, demonstrating that (1) reference similarity is strongly correlated with explanation correctness, and (2) the metrics we introduce correlate with this proxy.



Figure 1: A simple example of an entry from WIKIWHY: a cause and effect sourced from a Wikipedia passage, a "why" question and its answer about this relation, and, most importantly, a rationale that explains why the cause leads to the effect.
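To make the structure in Figure 1 concrete, the sketch below models a single entry as a small Python dataclass. The field names (`passage`, `cause`, `effect`, `question`, `answer`, `rationale`) and the example values are illustrative assumptions for exposition, not the dataset's actual release schema.

```python
# Hypothetical sketch of one WIKIWHY entry, mirroring the components
# described in Figure 1. Field names and values are illustrative, not
# the dataset's actual schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class WikiWhyEntry:
    passage: str                # Wikipedia passage grounding the relation
    cause: str                  # cause extracted from the passage
    effect: str                 # effect extracted from the passage
    question: str               # "why" question about the effect
    answer: str                 # the cause, phrased as an answer
    rationale: List[str] = field(default_factory=list)  # supporting statements

entry = WikiWhyEntry(
    passage="Ice is less dense than liquid water, so icebergs float.",
    cause="ice is less dense than liquid water",
    effect="icebergs float",
    question="Why do icebergs float?",
    answer="Because ice is less dense than liquid water.",
    rationale=[
        "Objects less dense than a fluid experience a net upward buoyant force.",
        "Floating ice displaces a weight of water equal to its own weight.",
    ],
)

# The rationale is a *set of statements*, not a single sentence: each element
# is one implicit step connecting the cause to the effect.
print(len(entry.rationale))
```

Under this framing, the end-to-end task takes `question` as input and asks a model to produce both `answer` and `rationale`, while the explanation-only task supplies the cause-effect pair and asks only for the rationale.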

2. RELATED WORK

Cause and Effect. Causality has been a subject of rigorous work in various fields. In the philosophy of science, Pearl (2009) has contributed seminal work on causal models, Bayesian networks, and causal strength via interventions and counterfactuals. These ideas have even been incorporated into QA tasks through knowledge-graph approaches, for example to filter spurious latent correlations (Sui et al., 2022). While our work emphasizes cause and effect, we are unconcerned with causal strength: starting from Wikipedia-grounded relations ensures the relations are valid. Instead, we are interested in the information encoded in LLMs rather than in augmented structures such as knowledge graphs.

Multi-hop Question Answering. While datasets such as HotpotQA (Yang et al., 2018) and HybridQA (Chen et al., 2020) are instrumental in gauging models' ability to handle multiple sources and modalities, they are focused on iterated factoid retrieval. Although chaining multiple facts into

