MIND'S EYE: GROUNDED LANGUAGE MODEL REASONING THROUGH SIMULATION

Abstract

Successful and effective communication between humans and AI relies on a shared experience of the world. Because they are trained solely on written text, current language models (LMs) miss the grounded, real-world experience of humans: their inability to relate language to the physical world causes knowledge to be misrepresented and leads to obvious mistakes in their reasoning. We present Mind's Eye, a paradigm that grounds language model reasoning in the physical world. Given a physical reasoning question, we use a computational physics engine (DeepMind's MuJoCo) to simulate the possible outcomes, and then include the simulation results as part of the input, which enables language models to perform grounded reasoning. Experiments on 39 tasks in a physics alignment benchmark demonstrate that Mind's Eye improves reasoning ability by a large margin (27.9% zero-shot and 46.0% few-shot absolute accuracy improvement on average). Smaller language models armed with Mind's Eye can match the performance of models roughly 100× larger. Finally, we confirm the robustness of Mind's Eye through ablation studies.

1. INTRODUCTION

"In questions of science, the authority of a thousand is not worth the humble reasoning of a single individual." --Galileo Galilei, 1632

"Do objects fall proportionately to their weight?" This famous question remained controversial until Galileo's Leaning Tower of Pisa experiment[1]: Galileo dropped two balls of different masses from the same height (i.e., experiment) and concluded that their time of descent was independent of their mass (i.e., inductive reasoning). Such an experiment-reasoning paradigm has been used by humans for centuries to ground reasoning about complicated problems (Newell, 1980) and to transfer learned knowledge to unfamiliar domains (Novak & Gowin, 1984).

Current language models (LMs) follow a different path: by training on natural language, they attempt to reverse-engineer the physical world so that they can reason about it. Large-scale pre-trained LMs have achieved revolutionary performance on many tasks, such as solving math word problems (Roy & Roth, 2015; Ling et al., 2017; Cobbe et al., 2021) and commonsense reasoning (Talmor et al., 2022; Geva et al., 2021). However, these models do not experience firsthand the situations described by the language (McClelland et al., 2020), and they lack the ability to find correct answers by performing experiments as humans do. As a consequence, when asked the same free-fall question, one of the most widely used LMs, GPT-3[2] (Brown et al., 2020), despite achieving superhuman performance on many reasoning tasks, generates the wrong answer: "The heavier object will fall faster." (as shown in Figure 1). Due to this lack of grounded reasoning, current LMs also have issues with truthfulness (Lin et al., 2021) and factuality (Petroni et al., 2020).
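The pipeline sketched above, simulating the physical scenario first and injecting the outcome into the LM prompt, can be illustrated with a toy example. This is a hypothetical sketch only: the function names, the closed-form free-fall "simulator", and the prompt template are illustrative assumptions, not the paper's actual implementation, which uses DeepMind's MuJoCo engine to produce the simulation results.

```python
import math

def simulate_free_fall(mass_a, mass_b, height=10.0, g=9.81):
    """Toy stand-in for a physics engine (the paper uses MuJoCo).
    In idealized free fall, descent time t = sqrt(2h/g) is
    independent of mass."""
    t = math.sqrt(2 * height / g)
    return {"time_a": t, "time_b": t}

def build_grounded_prompt(question, sim_result):
    """Prepend simulation outcomes to the question, so the LM reasons
    over grounded evidence rather than text-only priors.
    The template wording here is an illustrative assumption."""
    hint = (
        f"Simulation result: object A lands after {sim_result['time_a']:.2f} s, "
        f"object B lands after {sim_result['time_b']:.2f} s."
    )
    return f"{hint}\nQuestion: {question}\nAnswer:"

question = "Do objects fall proportionately to their weight?"
result = simulate_free_fall(mass_a=1.0, mass_b=10.0)
prompt = build_grounded_prompt(question, result)
print(prompt)
```

With the simulated descent times in the prompt, an LM no longer needs to reverse-engineer the physics from text alone; it only needs to read off that the two times are equal.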



[1] In Physics, Aristotle (384-322 BC) claims that the speed at which two identically shaped objects fall is directly proportional to their weight, a claim later challenged by the Aristotelian commentator John Philoponus.
[2] Specifically, we use text-davinci-002, the "most capable GPT-3 model" at the time of writing from OpenAI: https://beta.openai.com/docs/models/overview.

