ACTION AND PERCEPTION AS DIVERGENCE MINIMIZATION

Abstract

We introduce a unified objective for action and perception of intelligent agents. Extending representation learning and control, we minimize the joint divergence between the combined system of agent and environment and a target distribution. Intuitively, such agents use perception to align their beliefs with the world, and use actions to align the world with their beliefs. Minimizing the joint divergence to an expressive target maximizes the mutual information between the agent's representations and inputs, thus inferring representations that are informative of past inputs and exploring future inputs that are informative of the representations. This lets us explain intrinsic objectives, such as representation learning, information gain, empowerment, and skill discovery, from minimal assumptions. Moreover, interpreting the target distribution as a latent variable model suggests powerful world models as a path toward highly adaptive agents that seek large niches in their environments, rendering task rewards optional. The framework provides a common language for comparing a wide range of objectives, advances the understanding of latent variables for decision making, and offers a recipe for designing novel objectives. We recommend deriving future agent objectives from the joint divergence to facilitate comparison, to point out the agent's target distribution, and to identify the intrinsic objective terms needed to reach that distribution.



Figure 1: Overview of methods connected by the introduced framework of action and perception as divergence minimization. Each latent variable leads to a mutual information term between that variable and the data. The mutual information with past inputs explains representation learning. The mutual information with future inputs explains information gain, empowerment, and skill discovery. By leveraging multiple latent variables for the decision making process, agents can naturally combine several of these objectives. This figure shows the methods that derive from the well-established KL divergence; analogous method trees can be derived by choosing different divergence measures.

1. INTRODUCTION

To achieve goals in complex environments, intelligent agents need to perceive their environments and choose effective actions. These two processes, perception and action, are often studied in isolation. Despite the many objectives that have been proposed in the fields of representation learning and reinforcement learning, it remains unclear how these objectives relate to each other and which fundamentally new objectives remain to be discovered. Based on the KL divergence (Kullback and Leibler, 1951), we propose a unified framework for action and perception that connects a wide range of objectives, facilitating our understanding of them while providing a recipe for designing novel agent objectives. Our findings are conceptual in nature and this paper includes no empirical study. Instead, we offer a unified picture of a wide range of methods that have been shown to be successful in practice in prior work. The contributions of this paper are as follows.

Unified objective function for perception and action  We propose joint KL minimization as a principled framework for designing and comparing agent objectives. KL minimization was proposed separately for perception as variational inference (Jordan et al., 1999; Alemi and Fischer, 2018) and for action as KL control (Todorov, 2008; Kappen et al., 2009). Based on this insight, we formulate action and perception as jointly minimizing the KL from the world to a unified target distribution. The target serves both as the model under which the agent infers its representations and as the reward for its actions. This extends variational inference to controllable inputs, while extending KL control to latent representations. We show a novel decomposition of the joint KL divergence that explains several representation learning and exploration objectives.
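For concreteness, the joint objective can be sketched in equations. The notation below is illustrative rather than taken verbatim from the paper: P denotes the actual distribution realized by the agent and environment over inputs x and latents z, tau the target distribution, and q a variational posterior.

```latex
% Sketch of the unified objective: the agent shapes the actual
% distribution P of inputs x and latents z toward the target tau.
\min_{P} \; \mathrm{KL}\!\left[ P(x, z) \,\|\, \tau(x, z) \right]
  = \min_{P} \; \mathbb{E}_{P(x, z)}\!\left[ \ln P(x, z) - \ln \tau(x, z) \right]

% Perception as a special case: with the input distribution p(x) fixed
% and P(x, z) = p(x)\, q(z \mid x), minimizing over q recovers
% variational inference, i.e. maximizing the expected ELBO:
\mathrm{KL}\!\left[ p(x)\, q(z \mid x) \,\|\, \tau(x, z) \right]
  = -\,\mathbb{E}_{p(x)}\!\Big[
      \underbrace{\mathbb{E}_{q(z \mid x)}\big[\ln \tau(x \mid z)\big]
        - \mathrm{KL}\!\left[ q(z \mid x) \,\|\, \tau(z) \right]}_{\text{ELBO}}
    \Big] - \mathrm{H}\!\left[ p(x) \right]
```

Since the input entropy H[p(x)] does not depend on q, minimizing the joint KL over the posterior alone is exactly ELBO maximization; action enters when the agent can also influence p(x).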
Divergence minimization additionally connects deep reinforcement learning to the free energy principle (Friston, 2010; 2019), while simplifying and overcoming limitations of its active inference implementations (Friston et al., 2017) that we discuss in Appendix B.

Understanding latent variables for decision making  Divergence minimization with an expressive target maximizes the mutual information between inputs and latents. Agents thus infer representations that are informative of past inputs and explore future inputs that are informative of the representations. For the past, this yields reconstruction (Hinton et al., 2006; Kingma and Welling, 2013) or contrastive learning (Gutmann and Hyvärinen, 2010; Oord et al., 2018). For the future, it yields information gain exploration (Lindley, 1956). Stochastic skills and actions are realized over time, so their past terms are constant; for the future, they lead to empowerment (Klyubin et al., 2005) and skill discovery (Gregor et al., 2016). RL as inference (Rawlik et al., 2010) does not maximize mutual information because its target is factorized. To optimize a consistent objective across past and future, latent representations should be accompanied by information gain exploration.

Expressive world models for large ecological niches  The more flexible an agent's target or model, the better the agent can adapt to its environment. By minimizing the divergence between the world and the model, the agent converges to a natural equilibrium or niche where it can accurately predict its inputs and that it can inhabit despite external perturbations (Schrödinger, 1944; Wiener, 1948; Haken, 1981; Friston, 2013; Berseth et al., 2019). While surprise minimization can lead to trivial solutions, divergence minimization encourages the niche to match the agent's model class, thus visiting all inputs in proportion to how well they can be understood.
This suggests designing expressive world models of sensory inputs (Ebert et al., 2017; Hafner et al., 2018; Gregor et al., 2019) as a path toward building highly adaptive agents, while rendering task rewards optional.
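The role of the target's factorization can be checked numerically. The decomposition below is a standard information-theoretic identity: the KL from a joint distribution to a fully factorized target splits into the mutual information plus marginal KL terms, so a factorized target penalizes mutual information, while an expressive joint target need not. This is a toy sketch with made-up distributions, not an implementation of any agent.

```python
import numpy as np

# Toy joint distribution p(x, z) over two binary variables (made-up
# numbers) and a factorized target tau(x) * tau(z).
p_xz = np.array([[0.30, 0.10],
                 [0.05, 0.55]])      # rows index x, columns index z
tau_x = np.array([0.5, 0.5])
tau_z = np.array([0.4, 0.6])

p_x = p_xz.sum(axis=1)               # marginal p(x)
p_z = p_xz.sum(axis=0)               # marginal p(z)

def kl(p, q):
    """KL divergence between two discrete distributions in nats."""
    return float(np.sum(p * np.log(p / q)))

# Left side: KL from the joint to the fully factorized target.
lhs = kl(p_xz, np.outer(tau_x, tau_z))

# Right side: mutual information I(x; z) plus marginal KL terms.
mi = kl(p_xz, np.outer(p_x, p_z))    # I(x; z) >= 0
rhs = mi + kl(p_x, tau_x) + kl(p_z, tau_z)

# The two sides agree: a factorized target pays for every nat of
# mutual information, so minimizing this KL discourages I(x; z).
print(lhs, rhs, mi)
```

Because the mutual information enters with a positive sign, an agent minimizing the KL to a factorized target (as in RL as inference) is pushed toward independent inputs and latents, whereas a joint target can reward informative representations.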

2. FRAMEWORK

This section introduces the framework of action and perception as divergence minimization (APD). To unify action and perception, we formulate the two processes as joint KL minimization with a shared target distribution. The target distribution expresses the agent's preferences over system configurations and is also the probabilistic model under which the agent infers its representations. Using an expressive model as the target maximizes the mutual information between the latent variables and the sequence of sensory inputs, thus inferring latent representations that are informative of past inputs and exploring future inputs that are informative of the representations. We assume knowledge of basic concepts from probability and information theory, which are reviewed in Appendix D.

2.1 JOINT KL MINIMIZATION

Consider a stochastic system described by a joint probability distribution over random variables. For example, the random variables for supervised learning are the inputs and labels; for an agent, they are the sequence of sensory inputs, internal representations, and actions. More generally, we combine

