EMPIRICALLY VERIFYING HYPOTHESES USING REINFORCEMENT LEARNING

Abstract

This paper formulates hypothesis verification as an RL problem. Specifically, we aim to build an agent that, given a hypothesis about the dynamics of the world, can take actions to generate observations which can help predict whether the hypothesis is true or false. Existing RL algorithms fail to solve this task, even for simple environments. We formally define this problem, and develop environments to test different algorithms' performance. We analyze methods which are given additional pre-training rewards and find that the most effective of these is one that exploits the underlying structure of many hypotheses, factorizing them as {pre-condition, action sequence, post-condition} triplets. By leveraging this structure we show that RL agents are able to succeed. Furthermore, subsequent fine-tuning of the policies allows the agent to correctly verify hypotheses not amenable to this factorization.

1. INTRODUCTION

Empirical research on early learning (Gopnik, 2012; Kushnir & Gopnik, 2005) shows that infants build an understanding of the world by constantly formulating hypotheses about how some physical aspect of the world might work, and then proving or disproving them through deliberate play. Through this process the child builds up a consistent causal understanding of the world. This contrasts with the manner in which current ML systems operate. Both traditional i.i.d. and interactive learning settings use a single user-specified objective function that codifies a high-level task, and the optimization routine finds the set of parameters (weights) that maximizes performance on the task. The learned representation (knowledge of how the world works) is embedded in the weights of the model, which makes it difficult to inspect, to form hypotheses about, or to enforce domain constraints that might exist. Hypothesis generation and testing, on the other hand, is a process explored in classical approaches to AI (Brachman & Levesque, 2004).

In this paper we take a modest step towards the classical AI problem of building agents capable of testing hypotheses about their world using modern ML approaches. The problem we address is illustrated in Figure 1. Agents are placed in a world containing several interactive elements. They are provided with a hypothesis (an "action sentence" (Pearl, 2009)) about the underlying mechanics of the world via a text string (e.g. "A will be true if we do B"). The task is to determine whether the hypothesis is true. This problem cannot be solved without interaction with a dynamic world (comparing the state before and after taking action B). A key novelty in our work is formulating the task in a manner that permits the application of modern RL methods, allowing raw state observations to be used rather than abstract Boolean expressions of events.
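To make the task concrete, it can be cast as an episodic RL loop: the agent receives the hypothesis string, interacts with the world, and is rewarded only for a correct final true/false prediction. The following is a minimal sketch under our own illustrative assumptions; the environment, hypothesis strings, and reward scheme are hypothetical and not the paper's actual implementation.

```python
# Hedged sketch: hypothesis verification as an episodic RL task.
# All names and dynamics here are illustrative assumptions.
import random


class ToyHypothesisEnv:
    """A one-switch world: the light turns on iff the switch is toggled.

    Each episode samples a hypothesis string that is either true or false
    about these dynamics; the agent must interact, then answer.
    """

    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def reset(self):
        self.light_on = False
        # A true hypothesis describes the real dynamics; a false one does not.
        self.truth = self.rng.random() < 0.5
        self.hypothesis = (
            "the light will be on if we toggle the switch"
            if self.truth
            else "the light will be on if we wait"
        )
        return {"light_on": self.light_on, "hypothesis": self.hypothesis}

    def step(self, action):
        # action: "toggle", "wait", or a terminal answer ("true"/"false")
        if action in ("toggle", "wait"):
            if action == "toggle":
                self.light_on = not self.light_on
            return {"light_on": self.light_on}, 0.0, False
        # Terminal prediction: reward +1 iff the answer matches the truth.
        correct = (action == "true") == self.truth
        return {"light_on": self.light_on}, 1.0 if correct else -1.0, True


# A hand-coded verifying policy: toggle the switch, observe the outcome,
# then answer by comparing the observation against the hypothesis text.
env = ToyHypothesisEnv()
env.reset()
obs, _, _ = env.step("toggle")
answer = "true" if obs["light_on"] == ("toggle" in env.hypothesis) else "false"
_, reward, done = env.step(answer)
```

Note that the reward is only available at the end of the episode and only for the prediction, which is part of what makes the exploration problem hard for end-to-end RL.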
To do this, we use a model composed of two deep parametric functions learned through interaction: (i) a policy that generates observations relevant to verifying the hypothesis, and (ii) a prediction function that uses those observations to predict whether the hypothesis is true. We first show that agents trained end-to-end using deep RL fail to learn policies that generate the observations needed to verify the hypothesis. To remedy this, we exploit the underlying structure of hypotheses: they can often be formulated as a triplet of a pre-condition (P), an action sequence (collectively B), and a post-condition (A) that is causally related to the pre-condition and actions. Using this structure, we can seed our action policy to learn behaviors which alter the truth of the pre-condition and post-condition. This allows agents to learn policies that generate meaningful observations for training the prediction function. We further demonstrate that these policies can be adapted to verify more general hypotheses that do not necessarily fit the triplet structure. Our experiments
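The triplet factorization can be sketched in code. The template parser and the shaping reward below are hypothetical, chosen only to make the {pre-condition, action sequence, post-condition} structure and the seeding signal concrete; the paper's actual parser and reward may differ.

```python
# Hedged sketch of the {pre-condition, action sequence, post-condition}
# factorization and the pre-training signal it enables. The hypothesis
# template and reward shaping are illustrative assumptions.
from dataclasses import dataclass
from typing import List


@dataclass
class Triplet:
    pre_condition: str   # P: must hold before acting
    actions: List[str]   # B: the behavior the hypothesis refers to
    post_condition: str  # A: the claimed consequence


def factorize(hypothesis: str) -> Triplet:
    # Toy template parser for strings of the form
    # "if <P>, doing <B> makes <A> true" (Python 3.9+ for removeprefix).
    body = hypothesis.removeprefix("if ")
    pre, rest = body.split(", doing ")
    acts, post = rest.split(" makes ")
    return Triplet(pre, acts.split(" then "), post.removesuffix(" true"))


def pretraining_reward(cond_true_before: bool, cond_true_after: bool) -> float:
    # Seed reward: pay the policy for flipping the truth value of a
    # pre- or post-condition, so its rollouts expose the causal link
    # between the action sequence and the conditions.
    return 1.0 if cond_true_before != cond_true_after else 0.0


t = factorize(
    "if the door is closed, doing push then pull makes the door open true"
)
```

A policy pre-trained with this shaped reward generates trajectories in which P and A actually change, giving the prediction function informative observations to learn from.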

