EMPIRICALLY VERIFYING HYPOTHESES USING REINFORCEMENT LEARNING

Abstract

This paper formulates hypothesis verification as an RL problem. Specifically, we aim to build an agent that, given a hypothesis about the dynamics of the world, can take actions to generate observations which can help predict whether the hypothesis is true or false. Existing RL algorithms fail to solve this task, even for simple environments. We formally define this problem, and develop environments to test different algorithms' performance. We analyze methods which are given additional pre-training rewards and find that the most effective of these is one that exploits the underlying structure of many hypotheses, factorizing them as {pre-condition, action sequence, post-condition} triplets. By leveraging this structure we show that RL agents are able to succeed. Furthermore, subsequent fine-tuning of the policies allows the agent to correctly verify hypotheses not amenable to this factorization.

1. INTRODUCTION

Empirical research on early learning Gopnik (2012); Kushnir & Gopnik (2005) shows that infants build an understanding of the world by constantly formulating hypotheses about how some physical aspect of the world might work, and then proving or disproving them through deliberate play. Through this process the child builds up a consistent causal understanding of the world. This contrasts with the manner in which current ML systems operate. Both traditional i.i.d. and interactive learning settings use a single user-specified objective function that codifies a high-level task, and the optimization routine finds the set of parameters (weights) that maximizes performance on the task. The learned representation (knowledge of how the world works) is embedded in the weights of the model, which makes it harder to inspect, hypothesize about, or enforce domain constraints that might exist. Hypothesis generation and testing, on the other hand, is a process explored in classical approaches to AI Brachman & Levesque (2004). In this paper we take a modest step towards the classical AI problem of building agents capable of testing hypotheses about their world using modern ML approaches. The problem we address is illustrated in Figure 1. Agents are placed in a world which has several interactive elements. They are provided with a hypothesis (an "action sentence" Pearl (2009)) about the underlying mechanics of the world via a text string (e.g. "A will be true if we do B"). The task is to determine whether the hypothesis is true or not. This problem cannot be solved without interaction with a dynamic world (comparing the state before and after taking action B). A key novelty in our work is formulating the task in a manner that permits the application of modern RL methods, allowing raw state observations to be used rather than abstract Boolean expressions of events.
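Concretely, this formulation can be sketched as a standard episodic RL loop in which the hypothesis text is part of the observation and the only task reward is for a correct final verdict. The sketch below is illustrative only; `ToyEnv`, `run_episode`, and the scripted policy/predictor are our own stand-ins, not the paper's environments or learned networks.

```python
class ToyEnv:
    """Minimal stand-in environment: one Boolean feature toggled by 'flip'."""
    def reset(self):
        self.lit = False
        return {"lit": self.lit}

    def step(self, action):
        if action == "flip":
            self.lit = not self.lit
        return {"lit": self.lit}


def run_episode(env, policy, predictor, hypothesis_text, label, max_steps=4):
    """One verification episode: the agent observes the world together with
    the hypothesis string, interacts for up to max_steps, then commits to a
    true/false verdict. The only task reward is for a correct verdict."""
    obs = env.reset()
    trajectory = [obs]
    for _ in range(max_steps):
        action = policy(obs, hypothesis_text)
        obs = env.step(action)
        trajectory.append(obs)
    verdict = predictor(trajectory, hypothesis_text)
    return 1.0 if verdict == label else 0.0


# Scripted stand-ins for the learned policy and predictor, for the
# hypothesis "if you flip then lit becomes true":
policy = lambda obs, h: "flip" if not obs["lit"] else "noop"
predictor = lambda traj, h: any(o["lit"] for o in traj)
print(run_episode(ToyEnv(), policy, predictor, "if you flip then lit", True))  # 1.0
```

The point of the sketch is the reward structure: nothing rewards interaction itself, which is why (as shown later) naive end-to-end RL struggles to discover observation-generating behavior.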
To do this, we use a model composed of two different deep parametric functions which are learned through interaction: (i) a policy that generates observations relevant to verification of the hypothesis, and (ii) a prediction function which uses those observations to predict whether the hypothesis is true. We first show that agents trained end-to-end using deep RL cannot learn policies that generate observations sufficient to verify the hypothesis. To remedy this, we exploit the underlying structure of hypotheses: they can often be formulated as a triplet of a pre-condition (P), an action sequence (collectively B), and a post-condition (A) that is causally related to the pre-condition and actions. Using this structure, we can seed our action policy to learn behaviors which alter the truth of the pre-condition and post-condition. This allows agents to learn policies that generate meaningful observations for training the prediction function. We further demonstrate that these policies can be adapted to verify more general hypotheses that do not necessarily fit the triplet structure. Our experiments show that this approach outperforms naive RL and several flavors of intrinsic motivation designed to encourage the agent to interact with the objects of interest.
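The {pre-condition, action sequence, post-condition} factorization can be made concrete with a small hand-coded example. Everything below (`Hypothesis`, `step`, `verify`, the crafting rule) is an illustrative toy of our own: in the actual setup a learned policy generates the observations and a learned prediction network judges the outcome, rather than this scripted oracle.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, bool]

@dataclass
class Hypothesis:
    """Triplet factorization: pre-condition P, action sequence B, post-condition A."""
    pre: Callable[[State], bool]   # P: predicate on the state before acting
    actions: List[str]             # B: actions the agent must execute
    post: Callable[[State], bool]  # A: predicate expected to hold afterwards

def step(state: State, action: str) -> State:
    """Toy world dynamics: crafting at a table with a stick yields a torch."""
    next_state = dict(state)
    if action == "craft" and state.get("at_table") and state.get("has_stick"):
        next_state["has_torch"] = True
    return next_state

def verify(hyp: Hypothesis, state: State) -> bool:
    """Scripted oracle: satisfy P, execute B, then test A on the result."""
    if not hyp.pre(state):
        return False  # pre-condition not satisfied; hypothesis untestable here
    for action in hyp.actions:
        state = step(state, action)
    return hyp.post(state)

# "when you are at craftingtable and you have stick and then you craft
#  then torch is made"
hyp = Hypothesis(
    pre=lambda s: s.get("at_table", False) and s.get("has_stick", False),
    actions=["craft"],
    post=lambda s: s.get("has_torch", False),
)
print(verify(hyp, {"at_table": True, "has_stick": True}))  # True
```

The pre-training rewards described later exploit exactly this decomposition: the policy is rewarded for reaching states where P holds and for changing the truth of A, which is what a random or vanilla-RL policy fails to do.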

2. RELATED WORK

Knowledge representation and reasoning (KRR) Brachman & Levesque (2004) is a central theme of traditional AI. Commonsense reasoning approaches Davis (1990); Davis & Marcus (2015); Liu & Singh (2004), e.g. CYC Lenat (1995), codify everyday knowledge into a schema that permits inference and question answering. However, the underlying operations are logic-based and occur purely within the structured representation, with no mechanism for interaction with an external world. Expert systems Giarratano & Riley (1998) instead focus on narrow domains of knowledge, but are similarly self-contained. Logic-based planning methods Fikes & Nilsson (1971); Colaco & Sridharan (2015) generate abstract plans that could be regarded as action sequences for an agent. By contrast, our approach is statistical in nature, relying on reinforcement learning (RL) to guide the agent. Our approach builds on the recent interest Mao et al. (2019); Garcez et al. (2012) in neural-symbolic approaches that combine neural networks with symbolic representations. In particular, some recent works Zhang & Stone (2015); Lu et al. (2018) have attempted to combine RL with KRR, for tasks such as navigation and dialogue. These take the world dynamics learned by RL and make them usable in declarative form within the knowledge base, which is then used to improve the underlying RL policy. In contrast, in our approach, the role of RL is to verify a formal statement about the world. Our work also shares some similarity with Konidaris et al. (2018), where ML methods are used to learn mappings from world states to representations a planner can use.

Causality and RL: There are now extensive and sophisticated formalizations of (statistical) causality Pearl (2009). These provide a framework for an agent to draw conclusions about its world and verify hypotheses, as in this work. This is the approach taken in Dasgupta et al. (2019), where RL is used to train an agent that operates directly on a causal Bayesian network (CBN) in order to predict the results of interventions on the values of its nodes. In contrast, the approach in this work is to sidestep this formalization, with the hope of training agents that test hypotheses without building explicit CBNs. Unlike Dasgupta et al. (2019), our agents intervene on the actual world (where interventions may take many actions), rather than on the abstract CBN. Nevertheless, we find that it is necessary to add inductive bias to the training of the agent; here we use pre-training on (P, B, A) triplets. These approaches are complementary; one could combine explicit generation and analysis of CBNs as an abstract representation of an environment with our training protocols. Our work is thus most similar to Denil et al. (2016), which uses RL directly on the world, and in which the agent is rewarded for answering questions that require experimentation. However, in that work (and in Dasgupta et al. (2019)) the "question" in each world is the same; thus, while learning to interact led to higher answer accuracy, random experimental policies could still find correct answers. In this work, by contrast, the space of possible questions for any given world is combinatorial, and random experimentation (and indeed vanilla RL) is insufficient to answer them.

Cognitive development: Empirical research on early learning Gopnik (2012); Kushnir & Gopnik (2005) shows that infants build an understanding of the world in ways that parallel the scientific process: constantly formulating hypotheses about how some physical aspect of the world might work, and then proving or disproving them through deliberate play. Through this process the child builds up an abstract, consistent causal understanding of the world. Violations of this understanding elicit measurable surprise Spelke et al. (1992).

"when you are at craftingtable and you have stick and then you craft then torch is made" [answer: true]

Figure 1: Example "crafting" world. The agent verifies a hypothesis (provided as text). Acting according to a learned policy, the agent manipulates the observation into one that allows a learned predictor to determine whether the hypothesis is true. The learning of the policy and predictor is aided by a pre-training phase, during which an intermediate reward signal is provided by utilizing hypotheses that factor into {pre-condition state, action sequence, post-condition state}.

