MIND'S EYE: GROUNDED LANGUAGE MODEL REASONING THROUGH SIMULATION

Abstract

Successful and effective communication between humans and AI relies on a shared experience of the world. By training solely on written text, current language models (LMs) miss the grounded experience humans have of the real world: their failure to relate language to the physical world causes knowledge to be misrepresented and leads to obvious mistakes in their reasoning. We present Mind's Eye, a paradigm to ground language model reasoning in the physical world. Given a physical reasoning question, we use a computational physics engine (DeepMind's MuJoCo) to simulate the possible outcomes, and then include the simulation results as part of the input, which enables language models to perform grounded reasoning. Experiments on 39 tasks in a physics alignment benchmark demonstrate that Mind's Eye improves reasoning ability by a large margin (27.9% zero-shot and 46.0% few-shot absolute accuracy improvement on average). Smaller language models armed with Mind's Eye obtain performance similar to models 100× larger. Finally, we confirm the robustness of Mind's Eye through ablation studies.

1. INTRODUCTION

"In questions of science, the authority of a thousand is not worth the humble reasoning of a single individual." --Galileo Galilei, 1632

"Do objects fall proportionately to their weight?" This famous question was once controversial, until Galileo's Leaning Tower of Pisa experiment[1]: Galileo dropped two balls of different masses from the same height (i.e., experiment) and concluded that their time of descent was independent of their mass (i.e., inductive reasoning). Such an experiment-reasoning paradigm has been used by humans for centuries to ground reasoning about complicated problems (Newell, 1980) and to transfer learned knowledge to unfamiliar domains (Novak & Gowin, 1984).

Current language models (LMs) follow a different path: by training on natural language, they attempt to reverse engineer the physical world so that they can reason about it. Large-scale pre-trained LMs have achieved revolutionary performance on many tasks, such as solving math word problems (Roy & Roth, 2015; Ling et al., 2017; Cobbe et al., 2021) and commonsense reasoning (Talmor et al., 2022; Geva et al., 2021). However, these models do not experience firsthand the situations described by the language (McClelland et al., 2020), and lack the ability to find correct answers by performing experiments as humans do. As a consequence, when asked the same free-fall question, one of the most widely used LMs, GPT-3[2] (Brown et al., 2020), though achieving superhuman performance on many reasoning tasks, generates the wrong answer: "The heavier object will fall faster." (as shown in Figure 1). Due to the lack of grounded reasoning, current LMs also have issues with truthfulness (Lin et al., 2021) and factuality (Petroni et al., 2020).

Question: Two baseballs X and Y are released from rest at the same height. X is heavier than Y. Which baseball will fall to the ground faster?

Answer from Vanilla LM: The heavier baseball will fall to the ground faster because it has more mass and therefore more gravity.

Answer from CoT + LM (Chain-of-Thought on similar questions): Since the acceleration of an object can be computed as a = F/m, the heavier one will have more gravity, so a = mg/m > g. The heavier one will fall to the ground faster.

Answer from Mind's Eye + LM: Hints: X and Y have the same acceleration. So the answer is: they will fall at the same rate. Both baseballs will fall to the ground at the same time.
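The physics behind the figure's question can be verified directly: for free fall, the net force is F = mg, so the acceleration a = F/m = g cancels the mass, and the descent time t = sqrt(2h/g) is identical for both baseballs. The following is a minimal sketch in plain Python (not part of the paper's pipeline); the function name and the drop height are illustrative choices:

```python
import math

G = 9.81  # gravitational acceleration, m/s^2


def descent_time(height_m: float, mass_kg: float) -> float:
    """Time for an object released from rest to fall height_m meters.

    The gravitational force is F = m*g, so a = F/m = g: the mass
    cancels, and t = sqrt(2h/g) does not depend on it.
    """
    acceleration = (mass_kg * G) / mass_kg  # equals G for any mass
    return math.sqrt(2 * height_m / acceleration)


# Baseballs X (heavier) and Y dropped from the same 20 m height:
t_x = descent_time(20.0, mass_kg=0.20)
t_y = descent_time(20.0, mass_kg=0.14)
assert abs(t_x - t_y) < 1e-12  # they land at the same time
```

This is exactly the step the CoT answer in the figure gets wrong: it treats the larger force as a larger acceleration, forgetting that the same mass also appears in the denominator of a = F/m.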

Figure 1: Current language models are still challenged by simple questions that require a good understanding of the physical world. The answer elicited by Chain-of-Thought can still be wrong if the required knowledge is missing or misrepresented in LMs. Mind's Eye, instead, enables grounded LM reasoning by directly simulating the scene in the given question; the LM can then reason over the injected ground-truth rationale to generate the correct answer.

To tackle these problems, existing remedies include improved prompting techniques, such as inserting hand-written decomposed reasoning steps into few-shot demonstrations (Wei et al., 2022; Zhou et al., 2022). These methods are inherently limited because their reasoning ability relies entirely on the knowledge perpetuated in the LM: their performance suffers if the knowledge learnt by the LM is incorrect (Petroni et al., 2019) or outdated (Dhingra et al., 2022). To incorporate external knowledge, retrieval-augmented LMs such as REALM (Guu et al., 2020), RAG (Lewis et al., 2020), and RETRO (Borgeaud et al., 2022) retrieve relevant documents as additional evidence for given questions, and may also fine-tune the LM on the question-document-answer triplets. However, knowledge presented in written language is known to suffer from reporting bias (Bisk et al., 2020), whereby everyday unspoken facts or rarely seen (but practically possible) compositions are commonly missing from text (Paik et al., 2021). A correct and complete understanding of the properties and interactions of the physical world is not only essential for achieving human-level reasoning (Lake et al., 2017), but also fundamental to building general-purpose embodied intelligence (Huang et al., 2022). In this work, we investigate to what extent current LMs understand the basic rules and principles of the physical world, and describe how to ground their reasoning with the aid of simulation.
Our contributions are three-fold:

• We propose a new multi-task physics alignment dataset, UTOPIA, whose aim is to benchmark how well current LMs can understand and reason over basic laws of physics (§2). The dataset contains 39 sub-tasks covering six common scenes that involve understanding basic principles of physics (e.g., conservation of momentum in elastic collisions), and all ground-truth answers are automatically generated by a physics engine. We find that current large-scale LMs are still quite limited on many basic physics-related questions (24% accuracy for GPT-3 175B zero-shot, and 38.2% few-shot).

• We explore a paradigm that adds physics simulation to the LM reasoning pipeline (§3) to ground the reasoning in the physical world. Specifically, we first use a model to transform the given text-form question into rendering code, and then run the corresponding simulation in a physics engine (i.e., MuJoCo (Todorov et al., 2012)). Finally we append the
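The text-to-simulation-to-prompt pipeline described in the second contribution can be sketched end to end. MuJoCo and a learned text-to-code model are the paper's actual components; everything below is a hypothetical stand-in — the helper names (`question_to_scene`, `simulate`, `augment_prompt`) are invented for illustration, and the "engine" is a few lines of Euler integration rather than a real physics simulator:

```python
import re


def question_to_scene(question: str) -> dict:
    """Stand-in for the LM that converts a text question into
    simulation code. Here: a crude pattern match that extracts two
    named objects released from rest at a fixed height."""
    names = re.findall(r"\b([A-Z])\b", question)[:2] or ["X", "Y"]
    return {"objects": names, "height_m": 10.0, "g": 9.81}


def simulate(scene: dict, dt: float = 1e-4) -> dict:
    """Toy stand-in for the physics-engine step loop: integrate free
    fall for each object until it reaches the ground."""
    times = {}
    for name in scene["objects"]:
        y, v, t = scene["height_m"], 0.0, 0.0
        while y > 0.0:
            v += scene["g"] * dt  # a = g, independent of mass
            y -= v * dt
            t += dt
        times[name] = t
    return times


def augment_prompt(question: str, times: dict) -> str:
    """Append the simulation outcome to the question as a hint, in the
    spirit of Mind's Eye's injected rationale."""
    same = abs(max(times.values()) - min(times.values())) < 1e-3
    hint = ("Hints: the objects reach the ground at the same time."
            if same else
            "Hints: the objects reach the ground at different times.")
    return f"{question}\n{hint}\nAnswer:"


question = ("Two baseballs X and Y are released from rest at the same "
            "height. X is heavier than Y. Which will fall faster?")
prompt = augment_prompt(question, simulate(question_to_scene(question)))
```

The resulting `prompt` ends with the simulation-derived hint followed by "Answer:", so the downstream LM reasons over ground-truth evidence instead of relying solely on its parametric knowledge.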



[1] In his Physics, Aristotle (384-322 BC) claims that the speed at which two identically shaped objects fall is directly proportional to their weights; this was later challenged by the Aristotelian commentator John Philoponus.

[2] Specifically, we use text-davinci-002, which is the "most capable GPT-3 model" at the time of writing from OpenAI: https://beta.openai.com/docs/models/overview.

