CAUSAL CURIOSITY: RL AGENTS DISCOVERING SELF-SUPERVISED EXPERIMENTS FOR CAUSAL REPRESENTATION LEARNING

Abstract

Humans show an innate ability to learn the regularities of the world through interaction. By performing experiments in our environment, we are able to discern the causal factors of variation and infer how they affect the dynamics of our world. Analogously, here we attempt to equip reinforcement learning agents with the ability to perform experiments that facilitate a categorization of the rolled-out trajectories, and to subsequently infer the causal factors of the environment in a hierarchical manner. We introduce a novel intrinsic reward, called causal curiosity, and show that it allows our agents to learn optimal sequences of actions and to discover causal factors in the dynamics of the environment. The learned behavior allows the agent to infer a binary quantized representation for the ground-truth causal factors in every environment. Additionally, we find that these experimental behaviors are semantically meaningful (e.g., to differentiate between heavy and light blocks, our agents learn to lift them) and are learned in a self-supervised manner with approximately 2.5 times less data than conventional supervised planners. We show that these behaviors can be re-purposed and fine-tuned (e.g., from lifting to pushing or other downstream tasks). Finally, we show that knowledge of the causal factor representations aids zero-shot learning for more complex tasks.

1. INTRODUCTION

Discovering causation in the environments an agent might encounter remains an open and challenging problem for causal reinforcement learning (Schölkopf (2015); Bengio et al. (2013); Schölkopf (2019)). Most approaches take the form of BAMDPs (Bayes-Adaptive Markov Decision Processes) (Zintgraf et al. (2019)) or Hi-Param MDPs (Hidden-Parameter MDPs) (Doshi-Velez & Konidaris (2016); Yao et al. (2018); Killian et al. (2017); Perez et al. (2020)), which condition the transition function p(s_{t+1} | s_t, a_t; H) and/or the reward function R(r_{t+1} | s_t, a_t, s_{t+1}; H) of each environment on hidden parameters (also referred to as causal factors in some of the above studies). Let s ∈ S, a ∈ A, r ∈ R, and H ∈ 𝓗, where S, A, R, and 𝓗 are the sets of states, actions, rewards, and feasible hidden parameters. In the physical world, and in the case of mechanical systems, examples of a parameter h_j ∈ H include gravity, coefficients of friction, and the masses and sizes of objects. Typically, H is treated as a latent variable for which an embedding is learned during training using variational methods (Kingma et al. (2014); Ilse et al. (2019)). Let s_{0:T} be the entire state trajectory of length T, and let a_{0:T} be the sequence of actions applied by the agent that results in s_{0:T}. In an environment parameterized by these causal factors, these latent-variable approaches define a probability distribution over the entire sequence of rewards, states, and actions conditioned on a latent z, p(r_{0:T}, s_{0:T}, a_{0:T-1}; z), which, due to the Markov assumption, factorizes as

p(r_{0:T}, s_{0:T}, a_{0:T-1}; z) = ∏_{t=0}^{T-1} p(r_{t+1} | s_t, a_t, s_{t+1}, z) p(s_{t+1} | s_t, a_t, z) p(a_t | s_t, z).    (1)

At test time, the agent infers the causal factor associated with its environment by observing the trajectories produced by its initial actions, which can be issued by any policy, e.g., model-based reinforcement learning.
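The factorization in Eq. (1) can be sketched directly in code. The following is an illustrative sketch, not the paper's implementation: the three log-density callables (`log_p_reward`, `log_p_trans`, `log_p_policy`) are hypothetical stand-ins for learned reward, transition, and policy models conditioned on the latent z.

```python
def trajectory_log_prob(states, actions, rewards, z,
                        log_p_reward, log_p_trans, log_p_policy):
    """Log-probability of a trajectory under the Markov factorization of Eq. (1).

    states has length T+1; actions and rewards have length T. Each callable
    returns a scalar log-density for its factor, conditioned on the latent z.
    """
    total = 0.0
    T = len(actions)
    for t in range(T):
        # p(a_t | s_t, z): the policy term
        total += log_p_policy(actions[t], states[t], z)
        # p(s_{t+1} | s_t, a_t, z): the transition term
        total += log_p_trans(states[t + 1], states[t], actions[t], z)
        # p(r_{t+1} | s_t, a_t, s_{t+1}, z): the reward term
        total += log_p_reward(rewards[t], states[t], actions[t],
                              states[t + 1], z)
    return total
```

Because the factorization is a product of per-step densities, the log-probability decomposes into a sum over time steps, which is what the loop accumulates.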
More specifically, at each time step, the transition function is affected by a subset of the global causal factors. This subset is implicitly defined by the current state and the action taken. For example, if a body in an environment loses contact with the ground, the coefficient of friction between the body and the ground no longer affects the outcome of any action that is taken. Likewise, the outcome of an upward force applied by the agent to a body on the ground is unaffected by the friction coefficient. We can therefore take advantage of this natural discontinuity to discern causal factors. Without knowledge of how independent causal mechanisms affect the outcome of a particular action in a given state, it is impossible for the agent to conclude where the variation it encounters comes from. Unsurprisingly, Hi-Param and BAMDP approaches fail to learn a disentangled embedding for the causal factors, making their behaviors uninterpretable (Perez et al. (2020)). For example, if a body in an environment remains stationary under a particular force, a Hi-Param or BAMDP agent may apply a larger force to achieve its goal of, say, moving the body, but it will be unable to conclude whether the "un-movability" was caused by high friction or by the high mass of the body. Additionally, these approaches require human-supervised reward engineering, making them difficult to apply outside of the simulated environments they are tested in. Instead of focusing on maximizing reward for some particular task, our goal is to allow agents to discover causal processes through exploratory interaction. During training, our agents discover self-supervised experimental behaviors which they apply to a set of training environments. These behaviors allow them to learn about the various causal mechanisms that govern the transitions in each environment.
During inference in a novel environment, they perform these discovered behaviors sequentially and use the outcome of each behavior to infer the embedding for a single causal factor (Figure 1). The main challenge in learning a disentangled representation for the causal factors of the world is that several causal factors may affect the outcome of behaviors in each environment. For example, when pushing a body along the ground, the outcome, i.e., whether the body moves or how far it is pushed, depends on several factors, e.g., mass, shape and size, frictional coefficients, etc. However, if, instead of pushing along the ground, the agent executes a perfect grasp-and-lift behavior, only mass will affect whether the body is lifted off the ground. Thus, not all experimental behaviors are created equal: the outcomes of some behaviors are caused by fewer causal factors than others. Our agents learn these behaviors without supervision using causal curiosity, an intrinsic reward. The outcome of a single such experimental behavior is then used to infer a binary quantized embedding describing the single isolated causal factor. Even though causal factors of variation in the physical world are easily identifiable to humans, a concrete definition is required to back up our proposed method. We conjecture that the causality of a factor of variation depends on the actions available to the agent. If the set of actions an agent can take is very limited, there is no way for it to discern a diverse set of causal factors in the environment.
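The push-versus-lift contrast can be made concrete with a toy simulation. This is our own illustrative construction, not the paper's environment: `push_displacement` and `lift_success` are hypothetical outcome functions, with simplified point-mass physics and arbitrarily chosen forces.

```python
def push_displacement(mass, friction, force=10.0, g=9.8):
    """Proxy for how far a push moves a body: net acceleration after
    kinetic friction, clipped at zero when friction wins.
    The outcome is entangled with BOTH mass and friction."""
    net_force = force - friction * mass * g
    return max(net_force / mass, 0.0)

def lift_success(mass, lift_force=15.0, g=9.8):
    """A perfect grasp-and-lift succeeds iff the applied force exceeds
    the body's weight. Friction plays no role: the outcome isolates mass."""
    return lift_force > mass * g
```

Comparing two environments with equal mass but different friction, the push outcome differs while the lift outcome is identical; the lift behavior's outcome is caused by fewer factors, which is exactly why such behaviors yield cleaner one-factor embeddings.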
Definition 1 (Causal factors). Consider the POMDP (O, S, A, p, r) with observation space O, state space S, action space A, transition function p, and reward function r. Let o_{0:T} ∈ O^T denote a trajectory of observations and T be the length of such trajectories. Let d(·, ·) : O^T × O^T → R^+ be a distance function defined on the space of trajectories of length T. The set H = {h_1, h_2, ..., h_K} is called a set of ε-causal factors if for every h_j ∈ H, there exists a unique sequence of actions a_{0:T} that clusters the observation trajectories into two sets S and S' such that

min{ d(o_{0:T}, o'_{0:T}) : o_{0:T} ∈ S, o'_{0:T} ∈ S' } > ε,    (2)

and such that h_j is the cause of the trajectory of states obtained, i.e.,

p(o_{0:T} | do(h_j = k), a_{0:T}) ≠ p(o_{0:T} | do(h_j = k'), a_{0:T})   ∀ k ≠ k'.    (3)

Intuitively, a factor of variation affecting a set of environments is called causal if there exists a sequence of actions available to the agent such that the resultant trajectories are clustered into two or more sets (for simplicity, here we assume binary clusters). This is analogous to the human ability to conclude whether objects are heavy or light, big or small. For a gentle introduction to the intuition behind this definition, we refer the reader to Appendix D.
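The clustering criterion of Definition 1 can be sketched for the simple case where each trajectory is summarized by a scalar outcome. This is a hedged illustration under our own assumptions: the scalar summaries, the largest-gap split, and the threshold ε are illustrative choices standing in for the trajectory distance d(·, ·) and a general binary clustering.

```python
import numpy as np

def is_causal_split(outcomes, eps):
    """Split sorted scalar outcomes at their largest gap into S and S'.

    The factor qualifies as causal (for this action sequence) when the
    minimum inter-cluster distance -- here, the largest gap itself --
    exceeds eps, mirroring Eq. (2) of Definition 1.
    """
    x = np.sort(np.asarray(outcomes, dtype=float))
    gaps = np.diff(x)               # distances between adjacent outcomes
    i = int(np.argmax(gaps))        # widest gap = candidate cluster boundary
    S, S_prime = x[:i + 1], x[i + 1:]
    return bool(gaps[i] > eps), S, S_prime
```

For instance, outcomes of a lift behavior across environments with light and heavy blocks would form two well-separated groups (lifted vs. not lifted), passing the ε test, whereas outcomes entangled with several factors tend to smear out and fail it.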

