CAUSAL CURIOSITY: RL AGENTS DISCOVERING SELF-SUPERVISED EXPERIMENTS FOR CAUSAL REPRESENTATION LEARNING

Abstract

Humans show an innate ability to learn the regularities of the world through interaction. By performing experiments in our environment, we are able to discern the causal factors of variation and infer how they affect the dynamics of our world. Analogously, here we attempt to equip reinforcement learning agents with the ability to perform experiments that facilitate a categorization of the rolled-out trajectories, and to subsequently infer the causal factors of the environment in a hierarchical manner. We introduce a novel intrinsic reward, called causal curiosity, and show that it allows our agents to learn optimal sequences of actions, and to discover causal factors in the dynamics. The learned behavior allows the agent to infer a binary quantized representation of the ground-truth causal factors in every environment. Additionally, we find that these experimental behaviors are semantically meaningful (e.g., to differentiate between heavy and light blocks, our agents learn to lift them), and are learned in a self-supervised manner with approximately 2.5 times less data than conventional supervised planners. We show that these behaviors can be re-purposed and fine-tuned (e.g., from lifting to pushing or other downstream tasks). Finally, we show that the knowledge of causal factor representations aids zero-shot learning for more complex tasks.
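The binary quantized representation mentioned above can be pictured with a minimal toy sketch (illustrative only, not the paper's actual method or reward): run the same experimental action sequence in several environments, record a scalar outcome of each rollout (e.g., how far a block moved when lifted or pushed), and two-means cluster those outcomes so that each environment receives one bit per causal factor. All names and values below are hypothetical.

```python
import numpy as np

def binary_quantize(outcomes, iters=20):
    """Two-means clustering of scalar trajectory outcomes.

    Illustrative only: assigns each environment a 1-bit label
    (e.g., "heavy" vs. "light") from the result of one experiment.
    """
    # Initialize the two cluster centers at the extreme outcomes.
    c = np.array([outcomes.min(), outcomes.max()], dtype=float)
    for _ in range(iters):
        # Assign each outcome to its nearest center (0 or 1).
        labels = (np.abs(outcomes - c[0]) > np.abs(outcomes - c[1])).astype(int)
        # Recompute each center as the mean of its assigned outcomes.
        for k in (0, 1):
            if np.any(labels == k):
                c[k] = outcomes[labels == k].mean()
    return labels

# Hypothetical final block positions after an identical "push" experiment
# in eight environments that differ only in the block's hidden mass:
outcomes = np.array([9.8, 10.1, 2.0, 1.9, 10.0, 2.2, 9.9, 2.1])
print(binary_quantize(outcomes))  # [1 1 0 0 1 0 1 0] (light vs. heavy)
```

Repeating this with a different experimental behavior (e.g., lifting instead of pushing) would yield another bit, building up the hierarchical, quantized causal representation described in the abstract.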



environments an agent might encounter remains an open and challenging problem for causal reinforcement learning (Schölkopf (2015); Bengio et al. (2013); Schölkopf (2019)). Most approaches take the form of BAMDPs (Bayes-Adaptive Markov Decision Processes) (Zintgraf et al. (2019)) or Hi-Param MDPs (Hidden-Parameter MDPs) (Doshi-Velez & Konidaris (2016); Yao et al. (2018); Killian et al. (2017); Perez et al. ()), which condition the transition function p(s_{t+1} | s_t, a_t; H) and/or the reward function R(r_{t+1} | s_t, a_t, s_{t+1}; H) of each environment on hidden parameters (also referred to as causal factors in some of the above studies). Let s ∈ S, a ∈ A, r ∈ R, and H ∈ H, where S, A, R, and H are the sets of states, actions, rewards, and feasible hidden parameters, respectively. In the physical world, and in the case of mechanical systems in particular, examples of a hidden parameter h_j ∈ H include gravity, coefficients of friction, and the masses and sizes of objects. Typically, H is treated as a latent variable for which an embedding is learned during training using variational methods (Kingma et al. (2014); Ilse et al. (2019)). Let s_{0:T} be the entire state trajectory of length T, and let a_{0:T-1} be the sequence of actions applied by the agent during that trajectory, which results in s_{0:T}. In an environment parameterized by these causal factors, such latent-variable approaches define a probability distribution over the entire sequence of rewards, states, and actions conditioned on a latent z, p(r_{0:T}, s_{0:T}, a_{0:T-1}; z), which factorizes as

p(r_{0:T}, s_{0:T}, a_{0:T-1}; z) = ∏_{t=0}^{T-1} p(r_{t+1} | s_t, a_t, s_{t+1}, z) p(s_{t+1} | s_t, a_t, z) p(a_t | s_t, z)    (1)

due to the Markov assumption. At test time, the agent infers the causal factor associated with its environment by observing the trajectories produced by its initial actions, which may be issued by any policy, such as model-based reinforcement learning. In practice, however, discovering causal factors in a physical environment is prone to various challenges that are caused by the disjointed nature of the influence of these factors on the produced
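As a toy illustration of such a hidden-parameter environment, the sketch below (entirely hypothetical; the class name, dynamics, and parameters are not from the paper) conditions a one-dimensional transition function on a hidden mass. Rolling out the same action sequence under different hidden masses produces distinguishable trajectories, which is what allows an agent to categorize environments by their causal factors.

```python
import numpy as np

rng = np.random.default_rng(0)

class HiddenParamEnv:
    """Toy 1-D hidden-parameter MDP: a block pushed by force a.

    The hidden parameter h (the block's mass) conditions the transition
    s_{t+1} = s_t + a/h + noise, in the spirit of a Hi-Param MDP.
    Illustrative only; not the paper's environment.
    """

    def __init__(self, mass):
        self.mass = mass   # hidden causal factor h, unknown to the agent
        self.state = 0.0

    def step(self, action):
        # Transition conditioned on the hidden parameter.
        self.state = self.state + action / self.mass + rng.normal(0.0, 0.01)
        return self.state

def rollout(env, actions):
    """Apply a fixed action sequence and return the state trajectory s_{1:T}."""
    return np.array([env.step(a) for a in actions])

# The same "experiment" (a constant push for 10 steps) yields different
# trajectories under different hidden masses:
actions = np.ones(10)
traj_light = rollout(HiddenParamEnv(mass=1.0), actions)
traj_heavy = rollout(HiddenParamEnv(mass=5.0), actions)
print(traj_light[-1] > traj_heavy[-1])  # the light block travels farther
```

In the notation of Eq. (1), each rollout is a sample of s_{0:T} given a_{0:T-1} and the hidden parameter; observing such trajectories is what lets the agent infer the latent z at test time.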

