ACTION AND PERCEPTION AS DIVERGENCE MINIMIZATION

Abstract

We introduce a unified objective for action and perception of intelligent agents. Extending representation learning and control, we minimize the joint divergence between the combined system of agent and environment and a target distribution. Intuitively, such agents use perception to align their beliefs with the world, and use actions to align the world with their beliefs. Minimizing the joint divergence to an expressive target maximizes the mutual information between the agent's representations and inputs, thus inferring representations that are informative of past inputs and exploring future inputs that are informative of the representations. This lets us explain intrinsic objectives, such as representation learning, information gain, empowerment, and skill discovery, from minimal assumptions. Moreover, interpreting the target distribution as a latent variable model suggests powerful world models as a path toward highly adaptive agents that seek large niches in their environments, rendering task rewards optional. The framework provides a common language for comparing a wide range of objectives, advances the understanding of latent variables for decision making, and offers a recipe for designing novel objectives. We recommend deriving future agent objectives from the joint divergence to facilitate comparison, to point out the agent's target distribution, and to identify the intrinsic objective terms needed to reach that distribution.

1. INTRODUCTION

To achieve goals in complex environments, intelligent agents need to perceive their environments and choose effective actions. These two processes, perception and action, are often studied in isolation. Despite the many objectives that have been proposed in the fields of representation learning and reinforcement learning, it remains unclear how the objectives relate to each other and which fundamentally new objectives remain to be discovered. Based on the KL divergence (Kullback and Leibler, 1951), we propose a unified framework for action and perception that connects a wide range of objectives to facilitate our understanding of them while providing a recipe for designing novel agent objectives. Our findings are conceptual in nature and this paper includes no empirical study. Instead, we offer a unified picture of a wide range of methods that have been shown to be successful in practice in prior work. The contributions of this paper are described as follows.

Unified objective function for perception and action  We propose joint KL minimization as a principled framework for designing and comparing agent objectives. KL minimization was proposed separately for perception as variational inference (Jordan et al., 1999; Alemi and Fischer, 2018) and for actions as KL control (Todorov, 2008; Kappen et al., 2009). Based on this insight, we formulate action and perception as jointly minimizing the KL from the world to a unified target distribution. The target serves both as the model to infer representations and as the reward for actions. This extends variational inference to controllable inputs, while extending KL control to latent representations. We show a novel decomposition of the joint KL divergence that explains several representation learning and exploration objectives.
Divergence minimization additionally connects deep reinforcement learning to the free energy principle (Friston, 2010; 2019), while simplifying and overcoming limitations of its active inference implementations (Friston et al., 2017) that we discuss in Appendix B.

Understanding latent variables for decision making  Divergence minimization with an expressive target maximizes the mutual information between inputs and latents. Agents thus infer representations that are informative of past inputs and explore future inputs that are informative of the representations. For the past, this yields reconstruction (Hinton et al., 2006; Kingma and Welling, 2013) or contrastive learning (Gutmann and Hyvärinen, 2010; Oord et al., 2018). For the future, it yields information gain exploration (Lindley et al., 1956). Stochastic skills and actions are realized over time, so their past terms are constant. For the future, they lead to empowerment (Klyubin et al., 2005) and skill discovery (Gregor et al., 2016). RL as inference (Rawlik et al., 2010) does not maximize mutual information because its target is factorized. To optimize a consistent objective across past and future, latent representations should be accompanied by information gain exploration.

Expressive world models for large ecological niches  The more flexible an agent's target or model, the better the agent can adapt to its environment. Minimizing the divergence between the world and the model, the agent converges to a natural equilibrium or niche where it can accurately predict its inputs and that it can inhabit despite external perturbations (Schrödinger, 1944; Wiener, 1948; Haken, 1981; Friston, 2013; Berseth et al., 2019). While surprise minimization can lead to trivial solutions, divergence minimization encourages the niche to match the agent's model class, thus visiting all inputs proportionally to how well they can be understood.
This suggests designing expressive world models of sensory inputs (Ebert et al., 2017; Hafner et al., 2018; Gregor et al., 2019) as a path toward building highly adaptive agents, while rendering task rewards optional.

2. FRAMEWORK

This section introduces the framework of action and perception as divergence minimization (APD). To unify action and perception, we formulate the two processes as joint KL minimization with a shared target distribution. The target distribution expresses the agent's preferences over system configurations and is also the probabilistic model under which the agent infers its representations. Using an expressive model as the target maximizes the mutual information between the latent variables and the sequence of sensory inputs, thus inferring latent representations that are informative of past inputs and exploring future inputs that are informative of the representations. We assume knowledge of basic concepts from probability and information theory that are reviewed in Appendix D.

2.1. JOINT KL MINIMIZATION

Consider a stochastic system described by a joint probability distribution over random variables. For example, the random variables for supervised learning are the inputs and labels, and for an agent they are the sequence of sensory inputs, internal representations, and actions. More generally, we combine

Formulation               Preferences   Latent Entropy   Input Entropy
Divergence Minimization        ✓              ✓                ✓
Active Inference               ✓              ✓
Expected Reward                ✓

Table 1: High-level comparison of different agent objectives. All objectives express preferences over system configurations as a scalar value. Active inference additionally encourages entropic latents. Divergence minimization additionally encourages entropic inputs. Active inference makes additional choices about the optimization, as detailed in Appendix B, and the motivation for our work is in part to offer a simpler alternative to active inference. We show that when using expressive models as preferences, the entropy terms result in a wide range of task-agnostic agent objectives.

all input variables into x and the remaining variables, which we term latents, into z. We will see that different latents correspond to different representation learning and exploration objectives. The random variables are distributed according to their generative process or actual distribution p_φ. Parts of the actual distribution can be unknown, such as the data distribution, and parts can be influenced by varying the parameter vector φ, such as the distribution of stochastic representations or actions. As a counterpart to the actual distribution, we define the desired target distribution τ over the same support. It describes our preferences over system configurations and can be unnormalized,

  Actual distribution:  x, z ~ p_φ(x, z)        Target distribution:  τ(x, z)    (1)

We formulate the problem of joint KL minimization as changing the parameters φ to bring the actual distribution of all random variables as close as possible to the target distribution, as measured by the KL divergence (Kullback and Leibler, 1951; Li et al., 2017; Alemi and Fischer, 2018),

  min_φ  KL[ p_φ(x, z) ‖ τ(x, z) ]    (2)

All expectations and KLs throughout the paper are integrals under the actual distribution, so they can be estimated from samples of the system and depend on φ. Equation 2 is the reverse KL or information projection used in variational inference (Csiszár and Matus, 2003).
Examples  For representation learning, p_φ is the joint of data and belief distributions and τ is a latent variable model. Note that we use p_φ to denote not the model under which we infer beliefs but the generative process of inputs and their representations. For control, p_φ is the trajectory distribution under the current policy and τ corresponds to the utility of the trajectory. The parameters φ include everything the optimizer can change directly, such as sufficient statistics of representations, model parameters, and policy parameters.

Target parameters  There are two ways to denote deterministic values within our framework, also known as MAP estimates in the probabilistic modeling literature (Bishop, 2006). We can either use a fixed target distribution and a latent variable that follows a point mass distribution (Dirac, 1958), or we can explicitly parameterize the target using a deterministic parameter as τ_φ. In either case, τ refers to the fixed model class. The two approaches are equivalent because in both cases the target receives a deterministic value that has no entropy regularizer. For more details, see Appendix A.1.

Assumptions  Divergence minimization uses only two inductive biases, namely that the agent optimizes an objective and that it uses random variables to represent uncertainty. Choosing the well-established KL as the divergence measure is an additional assumption. It corresponds to maximizing the expected log probability under the target while encouraging high entropy for all variables in the system to avoid overconfidence, as detailed in Appendix C. Common objectives with different degrees of entropy regularization are summarized in Table 1.

Generality  Alternative divergence measures would lead to different optimization dynamics, different solutions if the target cannot be reached, and potentially novel objectives for representation learning and exploration.
Nonetheless, the KL can describe any converged system, trivially by choosing its actual distribution as the target, and thus offers a simple and complete mathematical perspective for comparing a wide range of specific objectives that correspond to different latent variables and target distributions.

Figure 2: The target distribution can be interpreted as a probabilistic model of the system. Given the target, perception aligns the agent's beliefs with past inputs while actions align future inputs with its beliefs. There are many ways to specify the target, for example as a latent variable model that explains past inputs and predicts future inputs and an optional reward factor that is shown as a filled square.
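Because all expectations in Equation 2 are taken under the actual distribution, the joint KL can be estimated purely from samples of the system. The following sketch, using hypothetical one-dimensional Gaussian actual and target distributions that are not from the paper, estimates the reverse KL by Monte Carlo and can be checked against the Gaussian closed form:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy system: the actual distribution p_phi is a 1-D Gaussian
# whose mean/std the optimizer could change; the target tau is a fixed Gaussian.
def log_p(x, mean, std):
    # Log density of a Gaussian N(mean, std^2).
    return -0.5 * ((x - mean) / std) ** 2 - np.log(std) - 0.5 * np.log(2 * np.pi)

def joint_kl_estimate(mean, std, target_mean=0.0, target_std=1.0, n=100_000):
    # KL[p_phi || tau] = E_{x ~ p_phi}[ln p_phi(x) - ln tau(x)],
    # estimated from samples of the system, as in Equation 2.
    x = mean + std * rng.standard_normal(n)
    return np.mean(log_p(x, mean, std) - log_p(x, target_mean, target_std))

# Gaussian closed form for comparison:
# KL = ln(s2/s1) + (s1^2 + (m1 - m2)^2) / (2 s2^2) - 1/2
print(joint_kl_estimate(1.0, 2.0))  # close to ln(1/2) + (4 + 1)/2 - 1/2 ≈ 1.307
```

Changing `mean` and `std` to reduce this estimate is the sample-based analogue of minimizing Equation 2 over φ.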

2.2. INFORMATION BOUNDS

We show that for expressive targets that capture dependencies between the variables in the system, minimizing the joint KL increases both the preferences and the mutual information between inputs x and latents z. This property allows divergence minimization to explain a wide range of existing representation learning and exploration objectives. We use the term representation learning for inferring deterministic or stochastic variables from inputs, which includes local representations of individual inputs and global representations such as model parameters.

Latent preferences  The joint KL can be decomposed in multiple ways, for example into a marginal KL plus a conditional KL or by grouping marginal with conditional terms. To reveal the mutual information maximization, we decompose the joint KL into a preference seeking term and an information seeking term. The decomposition can be done either with the information term expressed over inputs and the preferences expressed over latents, or the other way around,

  KL[ p_φ(x, z) ‖ τ(x, z) ]  =  E[ KL[ p_φ(z | x) ‖ τ(z) ] ]  −  E[ ln τ(x | z) − ln p_φ(x) ]    (3)
       joint divergence        realizing latent preferences        information bound

All expectations throughout the paper are over all variables, under the actual distribution, and thus depend on the parameters φ. The first term on the right side of Equation 3 is a KL regularizer that keeps the belief p_φ(z | x) over latent variables close to the marginal latent preferences τ(z). The second term is a variational bound on the mutual information I(x; z) (Barber and Agakov, 2003). The bound is expressed in input space. Maximizing the conditional ln τ(x | z) seeks latent variables that accurately predict inputs while minimizing the marginal ln p_φ(x) seeks diverse inputs.

Variational free energy  When the agent cannot influence its inputs, such as when learning from a fixed dataset, the input entropy E[−ln p_φ(x)] is not parameterized and can be dropped from Equation 3.
This yields the free energy or ELBO objective used by variational inference to infer approximate posterior beliefs in latent variable models (Hinton and Van Camp, 1993; Jordan et al., 1999). The free energy regularizes the belief p_φ(z | x) to stay close to the prior τ(z) while reconstructing inputs via τ(x | z). However, in reinforcement and active learning, inputs can be influenced and thus the input entropy should be kept, which makes the information bound explicit.

Input preferences  Analogously, we decompose the joint KL the other way around. The first term on the right side of Equation 4 is a KL regularizer that keeps the conditional input distribution p_φ(x | z) close to the marginal input preferences τ(x). This term is analogous to the objective in KL control (Todorov, 2008; Kappen et al., 2009), except that the inputs now depend upon latent variables via the policy. The second term is again a variational bound on the mutual information I(x; z), this time expressed in latent space. Intuitively, the bound compares the belief τ(z | x) after observing the inputs and the belief p_φ(z) before observing any inputs to measure the gained information,

  KL[ p_φ(x, z) ‖ τ(x, z) ]  =  E[ KL[ p_φ(x | z) ‖ τ(x) ] ]  −  E[ ln τ(z | x) − ln p_φ(z) ]    (4)
       joint divergence        realizing input preferences        information bound

The information bounds are tighter the better the target conditional approximates the actual conditional, meaning that the agent becomes better at maximizing mutual information as it learns more about the relation between the two variables. This requires an expressive target that captures correlations between inputs and latents, such as a latent variable model or deep neural network. Maximizing the mutual information accounts for both learning latent representations that are informative of inputs as well as exploring inputs that are informative of the latent representations.
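The variational information bounds in Equations 3 and 4 can be made concrete with a small discrete example. The sketch below, using a hypothetical 2×2 joint distribution not from the paper, shows that E[ln τ(z | x) − ln p(z)] lower-bounds I(x; z), is tight when τ(z | x) equals the true conditional, and collapses to zero when the target factorizes:

```python
import numpy as np

# Hypothetical 2x2 joint actual distribution p(x, z); rows index x, columns z.
p = np.array([[0.4, 0.1],
              [0.1, 0.4]])
p_x = p.sum(axis=1, keepdims=True)   # marginal over x
p_z = p.sum(axis=0)                  # marginal over z

# True mutual information I(x; z) in nats.
true_mi = np.sum(p * np.log(p / (p_x * p_z)))

def bound(tau_z_given_x):
    # Barber-Agakov bound: E[ln tau(z|x) - ln p(z)] <= I(x; z).
    return np.sum(p * (np.log(tau_z_given_x) - np.log(p_z)))

exact = p / p_x                      # tau(z|x) = p(z|x): the bound is tight
factorized = np.tile(p_z, (2, 1))    # tau(z|x) = p(z): the bound collapses to 0
print(true_mi, bound(exact), bound(factorized))
```

The factorized case mirrors the discussion of factorized targets later in the paper: without dependencies between x and z in the target, the information term vanishes.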

2.3. MODELS AS PREFERENCES

The target distribution defines our preferences over system configurations. However, we can also interpret it as a probabilistic model, or an energy-based model if unnormalized (LeCun et al., 2006). This is because minimizing the joint KL infers beliefs over latent variables that approximate the posteriors under the model, as shown in Section 2.2. Because the target is not parameterized, it corresponds to the fixed model class, with parameters being inferred as latent variables, optionally using point mass distributions. As the agent brings the actual distribution closer to the target, the target also becomes a better predictor of the actual distribution. Divergence minimization thus emphasizes that the model class simply expresses preferences over latent representations and inputs and lets us interpret inference as bringing the joint of data and belief distributions toward the model joint.

Input preferences  Minimizing the joint divergence also minimizes the divergence between the agent's input distribution p_φ(x) and the marginal input distribution under its target or model τ(x). The marginal input distribution of the model is thus the agent's preferred input distribution, which the agent aims to sample from in the environment. Because τ(x) marginalizes out all latent variables and parameters, it describes how well an input sequence x can possibly be described by the model class, as used in the Bayes factor (Jeffreys, 1935; Kass and Raftery, 1995). Divergence minimizing agents thus seek out inputs proportionally to how well their models can learn to predict them through inference, while avoiding inputs that are inherently unpredictable given their model class. Because the target can be unnormalized, we can combine a latent variable model with a reward factor of the form exp(r(x)) to create a target that incorporates task rewards. The reward factor adds preferences for certain inputs without affecting the remaining variables in the model.
We describe examples of such reward factors in Appendix A.4 and Section 3.1.

Action perception cycle  Interpreting the target as a model shows that divergence minimization is consistent with the idea of perception as inference suggested by Helmholtz (Helmholtz, 1866; Gregory, 1980). Expressing preferences as models is inspired by the free energy principle and active inference (Friston, 2010; Friston et al., 2012; 2017), which we compare to in Appendix B. Divergence minimization inherits an interpretation of action and perception from active inference that we visualize in Figure 2a. While action and perception both minimize the same joint KL, they affect different variables. Perception is based on inputs and affects the beliefs over representations, while actions are based on the representations and affect inputs. Given a unified target, perception thus aligns the agent's beliefs with the world while actions align the world with its beliefs.

Niche seeking  The information bounds responsible for representation learning and exploration are tighter under expressive targets, as shown in Section 2.2. What happens when we move beyond task rewards and simply define the target as a flexible model? The more flexible the target and belief family, the better the agent can minimize the joint KL. Eventually, the agent will converge to a natural equilibrium or ecological niche where it can predict its inputs well and that it can inhabit despite external perturbations (Wiener, 1948; Ashby, 1961). Niche seeking connects to surprise minimization (Schrödinger, 1944; Friston, 2013; Berseth et al., 2019), which aims to maximize the marginal likelihood of inputs under a model. In environments without external perturbations, this can lead to trivial solutions once they are explored. Divergence minimization instead aims to match the marginal input distribution of the model. This encourages large niches that cover all inputs that the agent can learn to predict.
Moreover, it suggests that expressive world models lead to autonomous agents that understand and inhabit large niches, rendering task rewards optional.
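The effect of an unnormalized reward factor on the preferred input distribution can be sketched in a few lines. The numbers for `tau_model` and `reward` below are hypothetical and chosen only for illustration: multiplying the model marginal by exp(r(x)) and renormalizing tilts the niche toward rewarding inputs without touching the rest of the model.

```python
import numpy as np

# Hypothetical discrete input space with a model marginal tau_model(x) and a
# task reward r(x). The unnormalized target tau(x) is proportional to
# tau_model(x) * exp(r(x)).
tau_model = np.array([0.5, 0.3, 0.2])   # marginal input preferences of the model
reward = np.array([0.0, 0.0, 2.0])      # extra preference for the third input

unnormalized = tau_model * np.exp(reward)
preferred = unnormalized / unnormalized.sum()
print(preferred)  # probability mass shifts toward the rewarded input
```

Setting the reward to zero everywhere recovers the model marginal, which is why task rewards are optional in this framework.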

2.4. PAST AND FUTURE

Representations are computed from past inputs, while exploration targets future inputs. To identify the two processes, we thus need to consider how an agent optimizes the joint KL after observing past inputs x< and before observing future inputs x>, as discussed in Figure 2b. For example, past inputs can be stored in an experience dataset and future inputs can be approximated by planning with a learned world model, on-policy trajectories, or replay of past inputs (Sutton, 1991).

Table 2: Divergence minimization accounts for a wide range of agent objectives. Each latent variable used by the agent contributes a future objective term. Moreover, latent variables that are not observed over time, such as latent representations and model parameters, additionally each contribute a past objective term. Combining multiple latent variables combines their objective terms. Refer to Section 3 for detailed derivations of these individual examples and citations of the listed agents.

To condition the joint KL on past inputs, we first split the information bound in Equation 4 into two smaller bounds, on the past mutual information I(x<; z) and the additional future mutual information I(x>; z | x<),

  E[ ln τ(z | x) − ln p_φ(z) ]   (information bound)
    = E[ ln τ(z | x) − ln p_φ(z | x<) + ln p_φ(z | x<) − ln p_φ(z) ]
    ≥ E[ ln τ(z | x) − ln p_φ(z | x<) ]   (future information bound)
      + E[ ln τ(z | x<) − ln p_φ(z) ]   (past information bound)    (5)

Equation 5 splits the belief update from the prior p_φ(z) to the posterior τ(z | x) into two updates via the intermediate belief p_φ(z | x<) and then applies the variational bound from Barber and Agakov (2003) to allow both updates to be approximate. Splitting the information bound lets us separate past and future terms in the joint KL, or even separate individual time steps. It also lets us separately choose to express terms in input or latent space.
This decomposition is one of our main contributions and shows how the joint KL divergence accounts for both representation learning and exploration,

  KL[ p_φ(x, z) ‖ τ(x, z) ]
    ≤ E[ KL[ p_φ(z | x<) ‖ τ(z) ] ]            (realizing past latent preferences)
    − E[ ln τ(x< | z) − ln p_φ(x<) ]           (representation learning)
    + E[ KL[ p_φ(x> | x<, z) ‖ τ(x> | x<) ] ]  (realizing future input preferences)
    − E[ ln τ(z | x) − ln p_φ(z | x<) ]        (exploration)    (6)

Conditioning on past inputs x< removes their expectation and renders p_φ(x<) constant. While some latent variables in the set z are never realized, such as latent state estimates or model parameters, other latent variables become observed over time, such as stochastic actions or skills. Because the agent selects the values of these variables, we have to condition the objective terms on them as causal interventions (Pearl, 1995; Ortega and Braun, 2010). In practice, this means replacing all occurrences of z by the unobserved latents z> and conditioning those terms on the observed latents do(z<). To keep notation simple, we omit this step in our notation.

To build an intuition about Equation 6, we discuss the four terms on the right-hand side. The first two terms involve the past while the last two terms involve the future. The first term keeps the agent's belief p_φ(z | x<) close to the prior τ(z) to incorporate inductive biases. The second term encourages the belief to be informative of the past inputs so that the inputs are reconstructed by τ(x< | z), where p_φ(x<) is a constant because x< are observed. The third term is the control objective that steers toward future inputs that match the preferred input distribution τ(x> | x<). The fourth term is an information bound that seeks out future inputs that are informative of the latent representations in z and encourages actions or skills in z that maximally influence future inputs.
The decomposition shows that the joint KL accounts for both learning informative representations of past inputs and exploring informative future inputs as two sides of the same coin. From this, we derive several representation and exploration objectives by including different latent variables in the set z. These objectives are summarized in Table 2 and derived with detailed examples in Section 3.

Representation learning  Because past inputs are observed, the past information bound only affects the latents. Expressed as in Equation 3, it leads to reconstruction (Hinton et al., 2006), and as in Equation 4, it leads to contrastive learning (Gutmann and Hyvärinen, 2010; Oord et al., 2018). This accounts for local representations of individual inputs, as well as global representations, such as latent parameters. Moreover, representations can be inferred online or amortized using an encoder (Kingma and Welling, 2013). Latents with point estimates are equivalent to target parameters and are thus optimized jointly to tighten the variational bounds. Because past actions and skills are realized, their mutual information with realized past inputs is constant and thus contributes no past objective terms.

Exploration  Under a flexible target, latents in z result in information-maximizing exploration. For latent representations, this is known as expected information gain and encourages informative future inputs that convey the most bits about the latent variable, such as world model parameters, policy parameters, or state estimates (Lindley et al., 1956; Sun et al., 2011). For stochastic actions, a fully factorized target leads to maximum entropy RL. An expressive target yields empowerment, maximizing the agent's influence over the world (Klyubin et al., 2005).
For skills, it yields skill discovery or diversity that learns distinct modes of behavior that together cover many different trajectories (Gregor et al., 2016; Florensa et al., 2017; Eysenbach et al., 2018; Achiam et al., 2018) .
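The expected information gain term that Equation 6 assigns to latent representations can be illustrated with a one-step example. The sketch below uses a hypothetical discretized Bernoulli parameter with a uniform belief, not taken from the paper, and computes the expected reduction in belief entropy from observing the next input, which equals the mutual information between observation and parameter:

```python
import numpy as np

# Expected information gain (Lindley, 1956) for a Bernoulli parameter theta,
# discretized on a grid with a uniform prior belief (hypothetical setup).
theta = np.linspace(0.01, 0.99, 99)
prior = np.ones_like(theta) / len(theta)

def entropy(p):
    # Shannon entropy of a discrete belief, in nats.
    return -np.sum(p * np.log(p))

# Predictive probability of observing y = 1 under the current belief.
p1 = np.sum(prior * theta)

def posterior(y):
    # Bayes update of the belief after observing outcome y.
    lik = theta if y == 1 else 1 - theta
    post = prior * lik
    return post / post.sum()

# Expected reduction in belief entropy = mutual information I(y; theta).
gain = p1 * (entropy(prior) - entropy(posterior(1))) + \
       (1 - p1) * (entropy(prior) - entropy(posterior(0)))
print(gain)  # positive: the next observation is informative about theta
```

An agent maximizing this quantity over actions would steer toward observations that most reduce its uncertainty, which is the exploration term in Equation 6 for never-observed latents.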

3. EXAMPLES

We use the framework of action and perception as divergence minimization presented in Section 2 to derive a wide range of concrete objective functions that have been proposed in the literature, shown in Figure 1. For this, we analyze the cases of different latent variables and factorizations of the actual and target distributions. These derivations serve as practical examples for producing new objective functions within our framework. We start by describing maximum entropy RL because of its popularity in the literature. Due to space constraints, we refer to Appendix A for the remaining examples, which include variational inference, amortized inference, filtering, KL control, empowerment, skill discovery, and information gain.

Designing novel objectives  In practice, an agent is determined by its target distribution, belief family, and optimization algorithm. Our framework thus suggests breaking down the implementation of an agent into the same three components that are typically considered in probabilistic modeling. As Section 2 showed, the target distribution is also the model under which the agent infers its beliefs about the world. We also saw that more expressive models allow agents to further increase the mutual information between their inputs and latents. To design an agent that learns a lot about the world, we should thus design expressive world models and use them as the target distribution. For example, these could include latent state estimates, latent parameters, latent skills, hierarchical latents, or temporal abstraction. Each world model corresponds to a new agent objective.

3.1. MAXIMUM ENTROPY RL

Figure 3: Actual and target distributions for maximum entropy RL.

Maximum entropy RL (Williams and Peng, 1991; Kappen et al., 2009; Rawlik et al., 2010; Tishby and Polani, 2011; Fox et al., 2015; Schulman et al., 2017; Haarnoja et al., 2018) chooses stochastic actions to maximize a task reward while remaining close to an action prior. The action prior is typically independent of the inputs, corresponding to a factorized target. The objective thus does not contain a mutual information term. Despite factorized targets being common in practice, we suggest that expressive targets, such as world models, are preferable in the longer term. Figure 3 shows the actual and target distributions for maximum entropy RL. The input sequence is x ≐ {x_t} and the action sequence is a ≐ {a_t}. In the graphical model, these are grouped into past actions and inputs ax<, future actions a>, and future inputs x>. The actual distribution consists of the fixed environment dynamics and a stochastic policy. The target consists of a reward factor, an action prior that is often the same for all time steps, and the environment dynamics,

  Actual:  p_φ(x, a) ≐ ∏_t p(x_t | x_{1:t−1}, a_{1:t−1}) p_φ(a_t | x_{1:t}, a_{1:t−1})
                             (environment)               (policy)
  Target:  τ(x, a) ∝ ∏_t exp(r(x_t)) p(x_t | x_{1:t−1}, a_{1:t−1}) τ(a_t)    (7)
                         (reward)        (environment)             (action prior)

Minimizing the joint KL results in a complexity regularizer in action space and the expected reward. Including the environment dynamics in the target cancels out the curiosity term, as in the expected reward case in Appendix A.4, leaving maximum entropy RL to explore only in action space. Moreover, including the environment dynamics in the target gives up direct control over the agent's input preferences, as they depend not just on the reward but also on the marginal of the environment dynamics.
Because the target distribution is factorized and does not capture dependencies between x and a, maximum entropy RL does not maximize their mutual information,

  KL[ p_φ ‖ τ ]  =  Σ_t  E[ KL[ p_φ(a_t | x_{1:t}, a_{1:t−1}) ‖ τ(a_t) ] ]  −  E[ r(x_t) ]    (8)
                              (complexity)                                 (expected reward)

The action complexity KL can be simplified into an entropy regularizer by choosing a uniform action prior, as in SQL (Haarnoja et al., 2017) and SAC (Haarnoja et al., 2018). The action prior can also depend on the past inputs and incorporate knowledge from previous tasks, as in Distral (Teh et al., 2017) and work by Tirumala et al. (2019) and Galashov et al. (2019). Divergence minimization motivates combining maximum entropy RL with input density exploration by removing the environment dynamics from the target distribution. The resulting agent aims to converge to the input distribution that is proportional to the exponentiated task reward. Moreover, divergence minimization shows that the difference between maximum entropy RL and empowerment, which we describe in Appendix A.5, is whether the target factorizes actions and inputs or captures their dependencies.
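For a single time step, the minimizer of Equation 8 has a closed form: the optimal stochastic policy is the action prior tilted by the exponentiated reward, π(a) ∝ τ(a) exp(r(a)). A minimal sketch with hypothetical rewards and a uniform prior, in which case the policy reduces to a softmax over rewards:

```python
import numpy as np

# One-step maximum entropy "RL": minimizing KL[pi || tau] - E[r] over the
# policy gives pi(a) proportional to tau(a) * exp(r(a)).
rewards = np.array([1.0, 2.0, 0.5])     # hypothetical reward per action
prior = np.array([1/3, 1/3, 1/3])       # uniform action prior tau(a)

unnorm = prior * np.exp(rewards)
policy = unnorm / unnorm.sum()

# Objective value at the optimum: KL[pi || tau] - E_pi[r], which equals
# -ln sum_a tau(a) exp(r(a)).
kl = np.sum(policy * np.log(policy / prior))
objective = kl - np.sum(policy * rewards)
print(policy, objective)
```

With a uniform prior the complexity term becomes an entropy bonus, matching the SQL/SAC special case discussed above; a non-uniform prior would tilt the softmax toward preferred actions.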

4. RELATED WORK

Divergence minimization  Various problems have been formulated as minimizing a divergence between two distributions. TherML (Alemi and Fischer, 2018) studies representation learning as KL minimization. We follow their interpretation of the data and belief as the actual distribution, although their target is only defined by its factorization. ALICE (Li et al., 2017) describes adversarial learning as joint distribution matching, while Kirsch et al. (2020) unify information-based objectives. Ghasemipour et al. (2019) describe imitation learning as minimizing divergences between the inputs of learned and expert behavior. None of these works consider combined representation learning and control. Thompson sampling minimizes the forward KL to explain action and perception as exact inference (Ortega and Braun, 2010). In comparison, we optimize the backward KL to support intractable models and connect to a wide range of practical objectives.

Active inference  The presented framework is inspired by the free energy principle, which studies the dynamics of agent and environment as stationary SDEs (Friston, 2010; 2019). We inherit the interpretations of active inference, which implements agents based on the free energy principle (Friston et al., 2017). While divergence minimization matches the input distribution under the model, active inference maximizes the probability of inputs under it, resulting in smaller niches. Moreover, active inference optimizes the exploration terms only with respect to actions, which requires a specific action prior. Finally, typical implementations of active inference involve an expensive Bayesian model average over possible action sequences, limiting its applications to date (Friston et al., 2015; 2020). We compare to active inference in detail in Appendix B. Generalized free energy (Parr and Friston, 2019) studies a unified objective similar to ours, although its entropy terms are defined heuristically rather than derived from a general principle.
Control as inference  It is well known that RL can be formulated as KL minimization over inputs and actions (Todorov, 2008; Kappen et al., 2009; Rawlik et al., 2010; Ortega and Braun, 2011; Levine, 2018), as well as skills (Hausman et al., 2018; Tirumala et al., 2019; Galashov et al., 2019). We build upon this literature and extend it to agents with latent representations, leading to variational inference on past inputs and information seeking exploration for future inputs. Divergence minimization relates the above methods and motivates an additional entropy regularizer for inputs (Todorov, 2008; Lee et al., 2019b; Xin et al., 2020). SLAC (Lee et al., 2019a) combines representation learning and control but does not consider the future mutual information, so their objective changes over time. In comparison, we derive the terms from a general principle and point out the information gain term that results in an objective that is consistent over time. The information gain term may also address concerns about maximum entropy RL raised by O'Donoghue et al. (2020).

5. CONCLUSION

We introduce a general objective for action and perception of intelligent agents, based on minimizing the KL divergence. To unify the two processes, we formulate them as joint KL minimization with a shared target distribution. This target distribution is the probabilistic model under which the agent infers its representations and expresses the agent's preferences over system configurations. We summarize the key takeaways as follows:

• Unified objective for action and perception Divergence minimization with an expressive target maximizes the mutual information between latents and inputs. This leads to inferring representations that are informative of past inputs and exploration of future inputs that are informative of the representations. To optimize a consistent objective that does not change over time, any latent representation should be accompanied by a corresponding exploration term.

• Understanding of latent variables for decision making Different latents lead to different objective terms. Latent representations are never observed, leading to both representation learning and information gain exploration. Actions and skills become observed over time and thus do not encourage representation learning but lead to generalized empowerment and skill discovery.

• Adaptive agents through expressive world models Divergence minimization agents with an expressive target find niches where they can accurately predict their inputs and that they can inhabit despite external perturbations. The niches correspond to the inputs that the agent can learn to understand, which is facilitated by the exploration terms. This suggests designing powerful world models as a path toward building autonomous agents, without the need for task rewards.

• General recipe for designing novel objectives When introducing new agent objectives, we recommend deriving them from the joint KL by choosing a latent structure and target. For information-maximizing agents, the target is an expressive model, leaving different latent structures to be explored. Deriving novel objectives from the joint KL facilitates comparison, renders explicit the target distribution, and highlights the intrinsic objective terms needed to reach that distribution.

• Discovering new families of agent objectives Our work shows that a family of representation learning and exploration objectives can be derived from minimizing a joint KL between the system and a target distribution. Different divergence measures give rise to new families of such agent objectives that could be easier to optimize or converge to better optima for infeasible targets. We leave exploring those objective families and comparing them empirically as future work.

Without constraining the class of targets, our framework is general and can describe any system. This by itself offers a framework for comparing many existing methods. However, interpreting the target as a model further suggests that intelligent agents may use especially expressive models as targets. This hypothesis should be investigated in future work by examining artificial agents with expressive world models or by modeling the behavior of natural agents as divergence minimization.

A ADDITIONAL EXAMPLES

This section leverages the presented framework to explain a wide range of objectives in a unifying review, as outlined in Figure 1. For this, we include different variables in the actual distribution, choose different target distributions, and then rewrite the joint KL to recover familiar objectives. We start with perception, the case with latent representations but uncontrollable inputs, and then turn to action, the case without latent representations but with controllable inputs. We then turn to combined action and perception. The derivations follow the general recipe described in Section 2. The same steps can be followed for new latent structures and target distributions to yield novel agent objectives.

A.1 VARIATIONAL INFERENCE

Following Helmholtz, we describe perception as inference under a model (Helmholtz, 1866; Gregory, 1980; Dayan et al., 1995). Inference computes a posterior over representations by conditioning the model on inputs. Because this has no closed form in general, variational inference optimizes a parameterized belief to approximate the posterior (Peterson, 1987; Hinton and Van Camp, 1993; Jordan et al., 1999). Figure 4 shows variational inference for the example of supervised learning using a BNN (Denker et al., 1987; MacKay, 1992a; Blundell et al., 2015). The inputs are images $x \doteq \{x_i\}$ and their classes $y \doteq \{y_i\}$, and we infer the latent parameters $w$ as a global representation of the data set (Alemi and Fischer, 2018). The parameters depend on the inputs only through the optimization process that produces $\phi$. The target consists of a parameter prior and a conditional likelihood that uses the parameters to predict classes from images,

$$\text{Actual:}\quad p_\phi(x, y, w) \doteq \underbrace{p_\phi(w)}_{\text{belief}} \prod_i \underbrace{p(x_i, y_i)}_{\text{data}}, \qquad \text{Target:}\quad \tau(x, y, w) \propto \underbrace{\tau(w)}_{\text{prior}} \prod_i \underbrace{\tau(y_i \mid x_i, w)}_{\text{likelihood}}.$$

Applying the framework, we minimize the KL between the actual and target joints. Because the data distribution is fixed here, the input marginal $p(x, y)$ is a constant.
In this case, the KL famously results in the free energy or ELBO objective (Hinton and Van Camp, 1993; Jordan et al., 1999) that trades off remaining close to the prior and enabling accurate predictions. The objective can be interpreted as the description length of the data set under entropy coding (Huffman, 1952; MacKay, 2003) because it measures the nats needed for storing both parameter belief and prediction residuals,

$$\mathrm{KL}\big[p_\phi \,\big\|\, \tau\big] = \underbrace{\mathrm{KL}\big[p_\phi(w) \,\big\|\, \tau(w)\big]}_{\text{complexity}} - \underbrace{\mathbb{E}\big[\ln \tau(y \mid x, w)\big]}_{\text{accuracy}} + \underbrace{\mathbb{E}\big[\ln p(x, y)\big]}_{\text{constant}}.$$

Variational methods for BNNs (Peterson, 1987; Hinton and Van Camp, 1993; Blundell et al., 2015) differ in their choices of prior and belief distributions and inference algorithm. This includes hierarchical priors (Louizos and Welling, 2016; Ghosh and Doshi-Velez, 2017), data priors (Louizos and Welling, 2016; Hafner et al., 2019b; Sun et al., 2019), flexible posteriors (Louizos and Welling, 2016; Sun et al., 2017; Louizos and Welling, 2017; Zhang et al., 2018; Chang et al., 2019), low-rank posteriors (Izmailov et al., 2018; Dusenberry et al., 2020), and improved inference algorithms (Wen et al., 2018; Immer et al., 2020). BNNs have been leveraged in RL for robustness (Okada et al., 2020; Tran et al., 2019) and exploration (Houthooft et al., 2016; Azizzadenesheli et al., 2018).

Target parameters While expressive beliefs over model parameters lead to a global search for their values, provide uncertainty estimates for predictions, and enable directed exploration in the RL setting, they can be computationally expensive. When these properties are not needed, we can choose a point mass distribution $p_\phi(w) \to \delta_\phi(w) \doteq \{1 \text{ if } w = \phi \text{ else } 0\}$ to simplify the expectations and avoid the entropy and mutual information terms, which are zero for this variable (Dirac, 1958),

$$\underbrace{\mathrm{KL}\big[p_\phi(w) \,\big\|\, \tau(w)\big]}_{\text{complexity}} - \underbrace{\mathbb{E}\big[\ln \tau(y \mid x, w)\big]}_{\text{accuracy}} \;\to\; \underbrace{-\ln \tau(\phi)}_{\text{complexity}} - \underbrace{\mathbb{E}\big[\ln \tau(y \mid x, \phi)\big]}_{\text{accuracy}} \doteq \mathbb{E}\big[\underbrace{-\ln \tau_\phi(y \mid x)}_{\text{parameterized target}}\big].$$
Point mass beliefs result in MAP or maximum likelihood estimates (Bishop, 2006; Murphy, 2012) that are equivalent to parameterizing the target as $\tau_\phi$. Parameterizing the target is thus a notational choice for random variables with point mass beliefs. Technically, we also require the prior over target parameters to be integrable, but this holds in practice, where parameter spaces are finite. Beyond global representations such as model parameters, an agent can also infer local representations, or latent codes, for individual inputs. They can summarize inputs more compactly, enable interpolation between inputs, and facilitate generalization to unseen inputs. In this case, we can use amortized inference (Kingma and Welling, 2013; Rezende et al., 2014; Ha et al., 2016) to learn an encoder that maps each input to its corresponding belief. The encoder is shared among inputs to reuse computation. It can also compute beliefs for new inputs without further optimization, although optimization can refine the belief (Kim et al., 2018).
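As a sanity check of this reduction, the sketch below uses a toy discrete parameter space (the prior and likelihood values are made up for illustration) and verifies that plugging a one-hot belief into the complexity-minus-accuracy objective recovers the MAP loss $-\ln\tau(\phi) - \mathbb{E}[\ln\tau(y \mid x, \phi)]$:

```python
import numpy as np

# Discrete parameter w ∈ {0, 1, 2} with a prior τ(w) and hypothetical
# per-parameter data log-likelihoods E[ln τ(y | x, w)]. All numbers are made up.
prior = np.array([0.5, 0.3, 0.2])
loglik = np.array([-3.0, -1.0, -2.0])

def elbo_loss(belief):
    """Complexity minus accuracy: KL[p(w) || τ(w)] - E_p(w)[ln τ(y | x, w)]."""
    complexity = np.sum(belief * (np.log(belief + 1e-12) - np.log(prior)))
    accuracy = np.sum(belief * loglik)
    return complexity - accuracy

# A one-hot (point mass) belief on w = 1 recovers the MAP objective
# -ln τ(1) - E[ln τ(y | x, 1)], since the Dirac entropy term is zero.
point_mass = np.array([0.0, 1.0, 0.0])
map_loss = -np.log(prior[1]) - loglik[1]
assert abs(elbo_loss(point_mass) - map_loss) < 1e-6
```

Any softer belief pays the same complexity and accuracy terms in expectation, plus it keeps the entropy that the point mass gives up.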

A.2 AMORTIZED INFERENCE

Figure 5 shows amortized inference on the example of a VAE (Kingma and Welling, 2013; Rezende et al., 2014). The inputs are images $x \doteq \{x_i\}$ and we infer their latent codes $z \doteq \{z_i\}$. The actual distribution consists of the unknown and fixed data distribution and the parameterized encoder $p_\phi(z_i \mid x_i)$. The target is a probabilistic model defined as the prior over codes and the decoder that computes the conditional likelihood of each image given its code. We parameterize the target here, but one could also introduce an additional latent variable to infer a distribution over decoder parameters as in Appendix A.1,

$$\text{Actual:}\quad p_\phi(x, z) \doteq \prod_i \underbrace{p(x_i)}_{\text{data}} \underbrace{p_\phi(z_i \mid x_i)}_{\text{encoder}}, \qquad \text{Target:}\quad \tau_\phi(x, z) \doteq \prod_i \underbrace{\tau_\phi(x_i \mid z_i)}_{\text{decoder}} \underbrace{\tau(z_i)}_{\text{prior}}.$$

Because the data distribution is still fixed, minimizing the joint KL again results in the variational free energy or ELBO objective that trades off prediction accuracy and belief simplicity. However, by including the constant input marginal, we highlight that the prediction term is a variational bound on the mutual information that encourages the representations to be informative of their inputs,

$$\mathrm{KL}\big[p_\phi \,\big\|\, \tau_\phi\big] = \underbrace{\mathbb{E}\,\mathrm{KL}\big[p_\phi(z \mid x) \,\big\|\, \tau(z)\big]}_{\text{complexity}} - \underbrace{\mathbb{E}\big[\ln \tau_\phi(x \mid z) - \ln p(x)\big]}_{\text{information bound}}.$$

In input space, the information bound leads to reconstruction as in DBNs (Hinton et al., 2006), VAEs (Kingma and Welling, 2013; Rezende et al., 2014), and latent dynamics models (Krishnan et al., 2015; Karl et al., 2016). In latent space, it leads to contrastive learning as in NCE (Gutmann and Hyvärinen, 2010), CPC (Oord et al., 2018; Guo et al., 2018), CEB (Fischer, 2020), and SimCLR (Chen et al., 2020). To maximize their mutual information, $x$ and $z$ should be strongly correlated under the target distribution, which explains the empirical benefits of ramping up the decoder variance throughout learning (Bowman et al., 2015; Eslami et al., 2018) or scaling the temperature of the contrastive loss (Chen et al., 2020).
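A minimal numerical sketch of the two terms for a single input, assuming a Gaussian encoder, a standard normal prior, and a unit-variance Gaussian decoder (the tanh decoder and all shapes are illustrative stand-ins, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_kl(mu, logvar):
    """Closed-form complexity term KL[N(mu, exp(logvar)) || N(0, 1)]."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)

def gaussian_nll(x, x_mean, var=1.0):
    """-ln τ(x | z) for a Gaussian decoder with fixed variance."""
    return 0.5 * np.sum((x - x_mean) ** 2 / var + np.log(2 * np.pi * var))

# One input, hypothetical encoder outputs, and a reparameterized code.
x = rng.normal(size=4)
mu, logvar = rng.normal(size=2), np.zeros(2)
z = mu + np.exp(0.5 * logvar) * rng.normal(size=2)   # sample from the belief
x_mean = np.tanh(z).repeat(2)                        # stand-in decoder output

elbo_loss = gaussian_kl(mu, logvar) + gaussian_nll(x, x_mean)
assert elbo_loss > 0
```

The constant $-\ln p(x)$ of the information bound is dropped here, as it does not affect the gradients with respect to the encoder and decoder parameters.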
The target defines the variational family and includes inductive biases (Tschannen et al., 2019). Both forms have enabled learning world models for planning (Ebert et al., 2018; Ha and Schmidhuber, 2018; Zhang et al., 2019; Hafner et al., 2018; 2019a) and accelerated RL (Lange and Riedmiller, 2010; Jaderberg et al., 2016; Lee et al., 2019a; Yarats et al., 2019; Gregor et al., 2019).

A.3 FUTURE INPUTS

Figure 6: Future inputs (graphical model over past inputs $x_<$, future inputs $x_>$, latents $z$, and parameters $\phi$).

Before moving to actions, we discuss perception with unobserved future inputs that are outside of our control (Ghahramani and Jordan, 1995). This is typical in supervised learning, where the test set is unavailable during training (Bishop, 2006), in online learning, where training inputs become available over time (Amari, 1967), and in filtering, where only inputs up to the current time are available (Kalman, 1960). Figure 6 shows missing inputs on the example of filtering with an HMM (Stratonovich, 1960; Kalman, 1960; Karl et al., 2016), although the same graphical model applies to supervised learning with a BNN or representation learning with a VAE given train and test data sets. The inputs $x \doteq \{x_<, x_>\}$ consist of past images $x_<$ and future images $x_>$ that follow an unknown and fixed data distribution. We represent the input sequence using a chain $z$ of corresponding compact latent states. However, the representations are computed only based on $x_<$ because $x_>$ is not yet available, as expressed in the factorization of the actual distribution,

$$\text{Actual:}\quad p_\phi(x, z) \doteq \underbrace{p(x_<, x_>)}_{\text{data}} \underbrace{p_\phi(z \mid x_<)}_{\text{belief}}, \qquad \text{Target:}\quad \tau_\phi(x, z) \doteq \underbrace{\tau_\phi(x_< \mid z)}_{\text{likelihood}} \underbrace{\tau_\phi(x_> \mid z)}_{\text{prediction}} \underbrace{\tau(z)}_{\text{prior}}.$$

Bayesian assumption Bayesian reasoning operates within the model class $\tau$ and makes the assumption that the model class is correct. Under this assumption, the future inputs $x_> \sim p(x_> \mid x_<, z) = p(x_> \mid x_<)$ follow the target distribution $\tau_\phi(x_> \mid x_<, z) = \tau_\phi(x_> \mid z)$.
This renders the divergence of future inputs given the other variables zero, so that $x_>$ does not need to be considered for optimization, recovering standard variational inference from Appendix A.1,

$$\mathrm{KL}\big[p_\phi \,\big\|\, \tau_\phi\big] = \underbrace{\mathrm{KL}\big[p_\phi(x_<, z) \,\big\|\, \tau_\phi(x_<, z)\big]}_{\text{variational inference}} + \underbrace{\mathbb{E}\,\mathrm{KL}\big[p(x_> \mid x_<) \,\big\|\, \tau_\phi(x_> \mid z)\big]}_{\text{uncontrolled future}}.$$

Assuming that future inputs follow the model distribution is appropriate when the model accurately reflects our knowledge about future inputs. However, the assumption does not always hold, for example for data augmentation or distillation (Hinton et al., 2015) that generate data from another distribution to improve the model. Importantly, assuming that future inputs already follow the target is not appropriate when they can be influenced, because there would be no need to intervene.

A.4 CONTROL

Figure 7: Control (graphical model over past inputs $x_<$, future inputs $x_>$, and parameters $\phi$).

We describe behavior as an optimal control problem where the agent chooses actions to move its distribution of sensory inputs toward a preference distribution over inputs that can be specified via rewards (Morgenstern and Von Neumann, 1953; Lee et al., 2019b). We first cover deterministic actions, which lead to KL control (Kappen et al., 2009; Todorov, 2008) and input density exploration (Schmidhuber, 1991; Bellemare et al., 2016; Pathak et al., 2017). Figure 7 shows deterministic control with the input sequence $x \doteq \{x_t\}$ that the agent can partially influence by varying the parameters $\phi$ of the deterministic policy, control rule, or plan. In the graphical model, we group the input sequence into past inputs $x_<$ and future inputs $x_>$. There are no internal latent variables. The target describes the preferences over input sequences and can be unnormalized,

$$\text{Actual:}\quad p_\phi(x) \doteq \prod_t \underbrace{p_\phi(x_t \mid x_{1:t-1})}_{\text{controlled dynamics}}, \qquad \text{Target:}\quad \tau(x) \doteq \prod_t \underbrace{\tau(x_t \mid x_{1:t-1})}_{\text{preferences}}.$$

Minimizing the KL between the actual and target joints maximizes log preferences and the input entropy.
Maximizing the input entropy is a simple form of exploration known as input density exploration that encourages rare inputs and aims for a uniform distribution over inputs (Schmidhuber, 1991; Oudeyer et al., 2007). This differs from the action entropy of maximum entropy RL in Section 3.1 and from the information gain in Appendix A.7, which takes inherent stochasticity into account,

$$\mathrm{KL}\big[p_\phi \,\big\|\, \tau\big] = -\sum_t \Big( \underbrace{\mathbb{E}\big[\ln \tau(x_t \mid x_{1:t-1})\big]}_{\text{expected preferences}} + \underbrace{\mathrm{H}\big[p_\phi(x_t \mid x_{1:t-1})\big]}_{\text{curiosity}} \Big).$$

Task reward Motivated by risk-sensitivity (Pratt, 1964; Howard and Matheson, 1972), KL control (Kappen et al., 2009) defines the preferences as exponential task rewards, $\tau(x_t \mid x_{1:t-1}) \propto \exp(r(x_t))$. KL-regularized control (Todorov, 2008) defines the preferences with an additional passive dynamics term, $\tau(x_t \mid x_{1:t-1}) \propto \exp(r(x_t))\,\bar\tau(x_t \mid x_{1:t-1})$, where $\bar\tau$ denotes the passive dynamics. Expected reward (Sutton and Barto, 2018) corresponds to the preferences $\tau_\phi(x_t \mid x_{1:t-1}) \propto \exp(r(x_t))\,p_\phi(x_t \mid x_{1:t-1})$ that include the controlled dynamics. This cancels out the curiosity term in the joint KL, leading to a simpler objective that does not encourage rare inputs, which might limit exploration of the environment.

Input density exploration Under divergence minimization, maximizing the input entropy is not an exploration heuristic but an inherent part of the control objective. In practice, the input entropy is often estimated by learning a density model of individual inputs, as in pseudo-counts (Bellemare et al., 2016), latent variable models as in SkewFit (Pong et al., 2019), unnormalized models as in RND (Burda et al., 2018), and non-parametric models as in reachability (Savinov et al., 2018). More accurately, it can be estimated by a sequence model of inputs, as in ICM (Pathak et al., 2017). The expectation over inputs is estimated by sampling episodes from either the actual environment, a replay buffer, or a learned model of the environment (Sutton, 1991).
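A minimal count-based sketch of this intrinsic reward, assuming hypothetical discretized states (the class and state names are illustrative, not any of the cited estimators):

```python
from collections import Counter
import math

# Input density exploration sketch: the intrinsic reward is -ln p(x) under an
# empirical density model over discretized inputs, so rare inputs are rewarded.
# Pseudo-counts and the other cited methods replace counts with learned models.
class DensityExploration:
    def __init__(self):
        self.counts = Counter()
        self.total = 0

    def reward(self, x):
        """Curiosity bonus -ln p(x) under a lightly smoothed count model."""
        self.counts[x] += 1
        self.total += 1
        prob = self.counts[x] / (self.total + 1)
        return -math.log(prob)

explorer = DensityExploration()
common = [explorer.reward("state_a") for _ in range(50)]
novel = explorer.reward("state_b")   # first visit: much rarer, higher bonus
assert novel > common[-1]
```

The bonus for a frequently visited state decays toward zero, while the first visit to a new state yields a large reward, matching the preference for rare inputs described above.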
A.5 EMPOWERMENT

Figure 8: Empowerment (graphical model over past actions and inputs $ax_<$, future inputs $x_>$, future actions $a_>$, and parameters $\phi$).

Remaining in the stochastic control setting of Section 3.1, we consider a different target distribution that predicts actions from inputs. This corresponds to an exploration objective that we term generalized empowerment, which maximizes the mutual information between the sequence of future inputs and future actions. It encourages the agent to influence its environment in as many ways as possible while avoiding actions that have no predictable effect. Figure 8 shows stochastic control with an expressive target that captures correlations between inputs and actions. The input sequence is $x \doteq \{x_t\}$ and the action sequence is $a \doteq \{a_t\}$. In the graphical model, these are grouped into past actions and inputs $ax_<$, future actions $a_>$, and future inputs $x_>$. The actual distribution consists of the environment and the stochastic policy. The target predicts actions from the inputs before and after them using a reverse predictor. We use uniform input preferences here, but the target can also include an additional reward factor as in Section 3.1,

$$\text{Actual:}\quad p_\phi(x, a) \doteq \prod_t \underbrace{p(x_t \mid x_{1:t-1}, a_{1:t-1})}_{\text{environment}} \underbrace{p_\phi(a_t \mid x_{1:t}, a_{1:t-1})}_{\text{policy}}, \qquad \text{Target:}\quad \tau_\phi(x, a) \propto \prod_t \underbrace{\tau_\phi(a_t \mid x_{1:T}, a_{1:t-1})}_{\text{reverse predictor}}.$$

Minimizing the joint KL reveals an information bound between future actions and inputs and a control term that maximizes input entropy and, if specified, task rewards. Empowerment (Klyubin et al., 2005) was originally introduced as potential empowerment to "keep your options open" and was later studied as realized empowerment to "use your options" (Salge et al., 2014). Realized empowerment maximizes the mutual information $\mathrm{I}\big[x_{t+k}; a_{t:t+k} \mid x_{1:t}, a_{1:t-1}\big]$. Divergence minimization generalizes this to the mutual information $\mathrm{I}\big[x_{t:T}; a_{t:T} \mid x_{1:t}, a_{1:t-1}\big]$ between the sequences of future actions and future inputs.
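The variational bound on the mutual information between actions and inputs can be illustrated on a toy one-step channel. The sketch below (all distributions made up for illustration) plugs the exact posterior in as the reverse predictor, which makes the bound tight and recovers the channel's mutual information:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy channel: action a ∈ {0, 1} uniform; the input x echoes a with prob 0.9.
# The empowerment bound E[ln τ(a|x) - ln p(a)] uses a reverse predictor τ(a|x).
p_a = np.array([0.5, 0.5])
p_x_given_a = np.array([[0.9, 0.1],    # p(x | a=0)
                        [0.1, 0.9]])   # p(x | a=1)

a = rng.choice(2, size=100_000, p=p_a)
flip = rng.random(a.size) >= 0.9       # 10% of steps flip the echoed action
x = np.where(flip, 1 - a, a)

p_x = p_a @ p_x_given_a                        # input marginal
post = (p_x_given_a * p_a[:, None]) / p_x      # exact reverse predictor τ(a|x)
bound = np.mean(np.log(post[a, x]) - np.log(p_a[a]))

# With the exact reverse predictor, the bound matches the mutual information.
joint = p_x_given_a * p_a[:, None]
true_mi = np.sum(joint * np.log(joint / (p_a[:, None] * p_x)))
assert abs(bound - true_mi) < 0.01
```

A learned, imperfect reverse predictor would only lower the estimate, consistent with the bound being variational.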
The $k$-step variant is recovered by a target that conditions the reverse predictor on fewer inputs. Realized empowerment measures the agent's influence on its environment and can be interpreted as maximizing information throughput with the action marginal $p_\phi(a_t \mid a_{1:t-1})$ as source, the environment as noisy channel, and the reverse predictor as decoder,

$$\mathrm{KL}\big[p_\phi \,\big\|\, \tau_\phi\big] = \underbrace{\mathbb{E}\,\mathrm{KL}\big[p(x \mid a) \,\big\|\, \tau(x)\big]}_{\text{control}} - \underbrace{\mathbb{E}\big[\ln \tau_\phi(a \mid x) - \ln p_\phi(a)\big]}_{\text{generalized empowerment}},$$
$$\underbrace{\mathbb{E}\big[\ln \tau_\phi(a \mid x) - \ln p_\phi(a)\big]}_{\text{generalized empowerment}} \geq \sum_t \mathbb{E}\big[\underbrace{\ln \tau_\phi(a_t \mid x, a_{1:t-1})}_{\text{decoder}} - \underbrace{\ln p_\phi(a_t \mid a_{1:t-1})}_{\text{source}}\big].$$

Empowerment has been studied for continuous state spaces (Salge et al., 2013), for image inputs (Mohamed and Rezende, 2015), optimized using a variational bound (Karl et al., 2017), combined with input density exploration (de Abril and Kanai, 2018) and task rewards (Leibfried et al., 2019), and used for task-agnostic exploration of locomotion behaviors (Zhao et al., 2020). Divergence minimization suggests generalizing empowerment from the input $k$ steps ahead to the sequence of all future inputs. This can be seen as combining empowerment terms of different horizons. Moreover, we offer a principled motivation for combining empowerment with input density exploration. In comparison to maximum entropy RL in Section 3.1, empowerment captures correlations between $x$ and $a$ in its target distribution and thus leads to information maximization. Moreover, it encourages the agent to converge to the input distribution that is proportional to the exponentiated reward.

A.6 SKILL DISCOVERY

Figure 9: Skill discovery (graphical model over past variables $zax_<$, future skills $z_>$, future inputs $x_>$, future actions $a_>$, and parameters $\phi$).

Many complex tasks can be broken down into sequences of simpler steps. To leverage this idea, we can condition a policy on temporally abstract options or skills (Sutton et al., 1999). Skill discovery aims to learn useful skills, either for a specific task or without rewards to solve downstream tasks later on.
Where empowerment maximizes the mutual information between inputs and actions, skill discovery can be formulated as maximizing the mutual information between inputs and skills (Gregor et al., 2016). Figure 9 shows skill discovery with the input sequence $x \doteq \{x_t\}$, action sequence $a \doteq \{a_t\}$, and the sequence of temporally abstract skills $z \doteq \{z_k\}$. The graphical model groups the sequences into past and future variables. The actual distribution consists of the fixed environment, an abstract policy that selects skills by sampling from a fixed distribution as shown here or as a function of past inputs, and the low-level policy that selects actions based on past inputs and the current skill. The target consists of an action prior and a reverse predictor for the skills and could further include a reward factor,

$$\text{Actual:}\quad p_\phi(x, a, z) \doteq \prod_{k=1}^{T/K} \underbrace{p_\phi(z_k)}_{\text{abstract policy}} \prod_{t=1}^{T} \underbrace{p_\phi(a_t \mid x_{1:t}, a_{1:t-1}, z_{\lceil t/K \rceil})}_{\text{policy}} \underbrace{p(x_t \mid x_{1:t-1}, a_{1:t-1})}_{\text{environment}},$$
$$\text{Target:}\quad \tau_\phi(x, a, z) \propto \prod_{k=1}^{T/K} \underbrace{\tau_\phi(z_k \mid x)}_{\text{reverse predictor}} \prod_{t=1}^{T} \underbrace{\tau(a_t)}_{\text{action prior}}.$$

Minimizing the joint KL results in a control term as in Appendix A.5, a complexity regularizer for actions as in Section 3.1, and a variational bound on the mutual information between the sequences of inputs and skills. The information bound is a generalization of skill discovery (Gregor et al., 2016; Florensa et al., 2017). Conditioning the reverse predictor only on inputs that align with the duration of the skill recovers skill discovery. Maximizing the mutual information between skills and inputs encourages the agent to learn skills that together realize as many different input sequences as possible while avoiding overlap between the sequences realized by different skills. VIC (Gregor et al., 2016) introduced information-based skill discovery as an extension of empowerment, motivating a line of work including SNN (Florensa et al., 2017), DIAYN (Eysenbach et al., 2018), work by Hausman et al.
(2018), VALOR (Achiam et al., 2018), and work by Tirumala et al. (2019) and Shankar and Gupta (2020). DADS (Sharma et al., 2019) estimates the mutual information in input space by combining a forward predictor of skills with a contrastive bound. Divergence minimization suggests a generalization of skill discovery where actions should not just consider the current skill but also seek out regions of the environment where many skills are applicable.

A.7 INFORMATION GAIN

Agents need to explore initially unknown environments to achieve goals. Learning about the world is beneficial even when it does not serve maximizing the currently known reward signal, because the knowledge might become useful later on, during this or later tasks. Reducing uncertainty requires representing uncertainty about the aspects we want to explore, such as dynamics parameters, policy parameters, or state representations. To efficiently reduce uncertainty, the agent should select actions that maximize the expected information gain (Lindley, 1956). Figure 10 shows information gain exploration on the example of latent model parameters and deterministic actions. The inputs are a sequence $x \doteq \{x_t\}$ and the latent parameters are a global representation $w$. The graphical model separates inputs into past inputs $x_<$ and future inputs $x_>$. The actual distribution consists of the controlled dynamics and the parameter belief. Amortized latent state representations would include a link from $x_<$ to $z$. Latent policy parameters would include a link from $w$ to $x_>$. The target distribution is a latent variable model that explains past inputs and predicts future inputs, as in Appendix A.3. The target could further include a reward factor,

$$\text{Actual:}\quad p_\phi(x, w) \doteq \underbrace{p_\phi(w)}_{\text{belief}} \prod_t \underbrace{p_\phi(x_t \mid x_{1:t-1})}_{\text{controlled dynamics}}, \qquad \text{Target:}\quad \tau(x, w) \doteq \underbrace{\tau(w)}_{\text{prior}} \prod_t \underbrace{\tau(x_t \mid x_{1:t-1}, w)}_{\text{likelihood}}.$$
Minimizing the KL between the two joints reveals a control term as in previous sections and the information bound between inputs and the latent representation, as derived in Section 2.2. In contrast to Appendix A.3, we can now influence future inputs. This leads to learning representations that are informative of past inputs and exploring future inputs that are informative of the representations. The mutual information between the representation and future inputs is the expected information gain (Lindley, 1956; MacKay, 1992b) that encourages inputs that are expected to convey the most bits about the representation to maximally reduce uncertainty in the belief,

$$\mathrm{KL}\big[p_\phi \,\big\|\, \tau_\phi\big] \leq \underbrace{\mathbb{E}\,\mathrm{KL}\big[p_\phi(w \mid x_<) \,\big\|\, \tau(w)\big]}_{\text{simplicity}} - \underbrace{\mathbb{E}\big[\ln \tau_\phi(x_< \mid w) - \ln p_\phi(x_<)\big]}_{\text{representation learning}} + \underbrace{\mathbb{E}\,\mathrm{KL}\big[p_\phi(x_> \mid x_<, w) \,\big\|\, \tau_\phi(x_> \mid x_<)\big]}_{\text{control}} - \underbrace{\mathbb{E}\big[\ln \tau_\phi(w \mid x) - \ln p_\phi(w \mid x_<)\big]}_{\text{information gain}},$$
$$\underbrace{\mathbb{E}\big[\ln \tau_\phi(w \mid x) - \ln p_\phi(w \mid x_<)\big]}_{\text{information gain}} \geq \sum_{t' > t} \mathbb{E}\big[\underbrace{\ln \tau_\phi(w \mid x_{1:t'}) - \ln p_\phi(w \mid x_{1:t'-1})}_{\text{intrinsic reward}}\big].$$

Information gain can be estimated by planning (Sun et al., 2011) or from past environment interaction (Schmidhuber, 1991). State representations lead to agents that disambiguate unobserved environment states, for example by opening doors to see objects behind them, as in active inference (Da Costa et al., 2020), INDIGO (Azar et al., 2019), and DVBF-LM (Mirchev et al., 2018). Model parameters lead to agents that discover the rules of their environment, as in active inference (Friston et al., 2015), VIME (Houthooft et al., 2016), MAX (Shyam et al., 2018), and Plan2Explore (Sekar et al., 2020). SLAM resolves uncertainty over both states and dynamics (Moutarlier and Chatila, 1989). Policy parameters lead to agents that explore to find the best behavior, as in bootstrapped DQN (Osband et al., 2016) and Bayesian DQN (Azizzadenesheli et al., 2018).
One might think exploration should seek inputs with large error, but reconstruction and exploration optimize the same objective. Maximizing information gain minimizes the reconstruction error at future time steps by steering toward diverse but predictable inputs. Divergence minimization shows that every latent representation should be accompanied by an expected information gain term, so that the agent optimizes a consistent objective for past and future time steps. Moreover, it shows that representations should be optimized jointly with the policy to support both reconstruction and action choice (Lange and Riedmiller, 2010; Jaderberg et al., 2016; Lee et al., 2019a; Yarats et al., 2019).
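As a toy illustration of the intrinsic reward, the following sketch (a hypothetical Bernoulli dynamics parameter with a gridded belief, not an implementation from the paper) shows that the expected information gain of one more observation is large for an uncertain belief and nearly zero once the belief has collapsed:

```python
import numpy as np

# Expected information gain sketch: the agent holds a discrete belief over the
# unknown success probability w of an environment transition. The expected
# information gain of one more outcome x is I[w; x] = H[x] - E_w H[x | w].
def expected_info_gain(belief, w_grid):
    p_success = belief @ w_grid                   # marginal p(x = 1)
    h_x = -p_success * np.log(p_success) - (1 - p_success) * np.log(1 - p_success)
    h_x_given_w = belief @ (-w_grid * np.log(w_grid)
                            - (1 - w_grid) * np.log(1 - w_grid))
    return h_x - h_x_given_w                      # mutual information in nats

w_grid = np.linspace(0.01, 0.99, 99)
uncertain = np.ones(99) / 99                      # flat belief: much to learn
confident = np.exp(-((w_grid - 0.7) ** 2) / 1e-4) # belief collapsed near 0.7
confident /= confident.sum()

assert expected_info_gain(uncertain, w_grid) > expected_info_gain(confident, w_grid)
```

An information-gain agent would prefer the transition governed by the uncertain parameter, and its interest fades as the belief concentrates, which is why the objective stays consistent over time.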

B ACTIVE INFERENCE

Divergence minimization is motivated by the free energy principle (Friston, 2010; 2019) and its implementation, active inference (Friston et al., 2017). Both approaches share the interpretation of models as preferences (Wald, 1947; Brown, 1981; Friston et al., 2012) and account for a variety of intrinsic objectives (Friston et al., 2020). However, typical implementations of active inference have been limited to simple tasks as of today, a problem that divergence minimization overcomes. Active inference differs from divergence minimization in the three aspects discussed below.

Maximizing the input probability Divergence minimization aims to match the distribution of the system to the target distribution. Therefore, the agent aims to receive inputs that follow the marginal distribution of inputs under the model. In contrast, active inference aims to maximize the probability of inputs under the model. This is often described as minimizing Bayesian surprise. Therefore, the agent aims to receive inputs that are the most probable under its model. Mathematically, this difference stems from the conditional input entropy of the actual system that distinguishes the joint KL divergence in Equation 2 from the expected free energy used in active inference,

$$\underbrace{\mathrm{KL}\big[p_\phi(x, z) \,\big\|\, \tau(x, z)\big]}_{\text{joint divergence}} = \underbrace{\mathbb{E}\big[-\ln \tau(x \mid z)\big] + \mathbb{E}\,\mathrm{KL}\big[p_\phi(z \mid x) \,\big\|\, \tau(z)\big]}_{\text{expected free energy}} - \underbrace{\mathbb{E}\big[-\ln p_\phi(x)\big]}_{\text{input entropy}}.$$

Both formulations include the entropy of latent variables and thus the information gain that encourages the agent to explore informative future inputs. Moreover, in complex environments, it is unlikely that the agent ever learns everything, so that its beliefs concentrate and it stops exploring. However, in this hypothetical scenario, active inference converges to the input that is most probable under its model. In contrast, divergence minimization aims to converge to sampling from the marginal input distribution under the model, resulting in a larger niche.
That said, it is possible to construct a target distribution that includes the input entropy of the actual system and thus overcome this difference. Expected free energy action prior Divergence minimization optimizes the same objective with respect to representations and actions. Therefore, actions optimize the expected information gain and representations optimize not just past accuracy but also change to support actions in maximizing the expected information gain. In contrast, active inference first optimizes the expected free energy to compute a prior over policies. After that, it optimizes the free energy with respect to both representations and actions. This means active inference optimizes the information gain only with respect to actions, without the representations changing to support better action choice based on future objective terms. Bayesian model average over policies Typical implementations of active inference compute the action prior using a Bayesian model average. This involves computing the expected free energy for every possible policy or action sequence that is available to the agent. The action prior is then computed as the softmax over the computed values. Enumerating all policies is intractable for larger action spaces or longer planning horizons, thus limiting the applicability of active inference implementations. In contrast, divergence minimization absorbs the objective terms for action and perception into a single variational optimization thereby finessing the computational complexity of computing a separate action prior. This leads to a simple framework, allowing us to draw close connections to the deep RL literature and to scale to challenging tasks, as evidenced by the many established methods that are explained under the divergence minimization framework.
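The identity relating the joint divergence to the expected free energy can be checked numerically. The sketch below uses an arbitrary random discrete joint (all numbers illustrative, chosen only to exercise the algebra):

```python
import numpy as np

# Check: KL[p(x,z) || τ(x,z)] = E[-ln τ(x|z)] + E KL[p(z|x) || τ(z)] - H[x],
# i.e. joint divergence = expected free energy - input entropy.
rng = np.random.default_rng(0)
p_xz = rng.random((3, 2)); p_xz /= p_xz.sum()    # actual joint p(x, z)
t_xz = rng.random((3, 2)); t_xz /= t_xz.sum()    # target joint τ(x, z)

p_x = p_xz.sum(1)
p_z_given_x = p_xz / p_x[:, None]
t_z = t_xz.sum(0)
t_x_given_z = t_xz / t_z

joint_kl = np.sum(p_xz * np.log(p_xz / t_xz))
efe = (np.sum(p_xz * -np.log(t_x_given_z))           # E[-ln τ(x | z)]
       + np.sum(p_xz * np.log(p_z_given_x / t_z)))   # E KL[p(z|x) || τ(z)]
input_entropy = np.sum(p_x * -np.log(p_x))

assert np.isclose(joint_kl, efe - input_entropy)
```

The extra input entropy term is exactly what makes divergence-minimizing agents match, rather than concentrate on, the model's input distribution.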

C KL INTERPRETATION

Minimizing the KL divergence has a variety of interpretations. In simple terms, it says "optimize a function but don't be too confident." Decomposing Equation 2 shows that we maximize the expected log target while encouraging high entropy of all the random variables. Both terms are expectations under $p_\phi$ and thus depend on the parameter vector $\phi$,

$$\mathrm{KL}\big[p_\phi(x, z) \,\big\|\, \tau(x, z)\big] = \underbrace{\mathbb{E}\big[-\ln \tau(x, z)\big]}_{\text{energy}} - \underbrace{\mathrm{H}\big[x, z\big]}_{\text{entropy}}.$$

The energy term expresses which system configurations we prefer. It is also known as the cross-entropy loss or expected log loss (Bishop, 2006; Murphy, 2012), the energy function when unnormalized (LeCun et al., 2006), and the agent's preferences in control (Morgenstern and Von Neumann, 1953). The entropy term prevents all random variables in the system from becoming deterministic, encouraging a global search over their possible values. It implements the maximum entropy principle to avoid overconfidence (Jaynes, 1957), Occam's razor to prevent overfitting (Jefferys and Berger, 1992), bounded rationality to halt optimization before reaching the point solution (Ortega and Braun, 2011), and risk-sensitivity to account for model misspecification (Pratt, 1964; Howard and Matheson, 1972).

Expected utility The entropy distinguishes the KL from the expected utility objective that is typical in RL (Sutton and Barto, 2018). Using a distribution as the optimization target is more general, as every system has a distribution but not every system has a utility function it is optimal for. Moreover, the dynamics of any stochastic system maximize only its log stationary distribution (Ao et al., 2013; Friston, 2013; Ma et al., 2015). This motivates using the desired distribution as the optimization target. Expected utility is recovered in the limit of a sharp target that outweighs the entropy.
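The energy-entropy decomposition can be checked numerically on a small discrete example (the two distributions are chosen arbitrarily for illustration):

```python
import numpy as np

# Check the decomposition KL[p || τ] = E_p[-ln τ] - H[p]:
# cross-entropy (energy) minus entropy of the actual distribution.
p = np.array([0.7, 0.2, 0.1])      # actual distribution
tau = np.array([0.5, 0.3, 0.2])    # target distribution

kl = np.sum(p * np.log(p / tau))
energy = np.sum(p * -np.log(tau))     # expected negative log target
entropy = np.sum(p * -np.log(p))      # entropy of the actual distribution

assert np.isclose(kl, energy - entropy)
```

Making `p` sharper lowers the entropy term and so raises the KL unless the energy improves to compensate, which is the "don't be too confident" reading above.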

D BACKGROUND

This section introduces notation, defines basic information-theoretic quantities, and briefly reviews KL control and variational inference for latent variable models.

Expectation A random variable $x$ represents an unknown quantity that could take on one of multiple values $\bar{x}$, each with an associated probability mass or density $p(x = \bar{x})$. Applying a function to a random variable yields a new random variable $y = f(x)$. The expectation of a random variable is the weighted average of the values it could take on, weighted by their probability,

$$\mathbb{E}\big[f(x)\big] \doteq \int f(x)\, p(x)\, dx.$$

We use integrals here, as used for random variables that take on continuous values. For discrete variables, the integrals simplify to sums.

Information The information of an event $x$ measures the number of bits it contains (Shannon, 1948). Intuitively, rare events contain more information. The information is defined as the code length of the event under an optimal encoding for $x \sim p(x)$,

$$\mathrm{I}(x) \doteq \ln \frac{1}{p(x)} = -\ln p(x).$$

The logarithm base 2 measures information in bits and the natural base in the unit nats.

Entropy The entropy of a random variable $x$ is the expected information of its events. It quantifies the randomness or uncertainty of the random variable. Similarly, the conditional entropy measures the uncertainty about $x$ that we expect to remain after observing another variable $y$,

$$\mathrm{H}\big[x\big] \doteq \mathbb{E}\big[-\ln p(x)\big], \qquad \mathrm{H}\big[x \mid y\big] \doteq \mathbb{E}\big[-\ln p(x \mid y)\big].$$

Note that the conditional entropy uses an expectation over both variables. A deterministic distribution reaches the minimum entropy of zero. The uniform distribution reaches the maximum entropy, the logarithm of the number of possible events.

KL divergence The Kullback-Leibler divergence (Kullback and Leibler, 1951) measures the directed similarity of one distribution to another distribution. The KL divergence is defined as the expectation under $p$ of the log difference between the two distributions $p$ and $\tau$,

$$\mathrm{KL}\big[p(x) \,\big\|\, \tau(x)\big] \doteq \mathbb{E}\big[\ln p(x) - \ln \tau(x)\big] = \mathbb{E}\big[-\ln \tau(x)\big] - \mathrm{H}\big[x\big].$$
(29) The KL divergence is non-negative and reaches zero if and only if p = τ . Also known as relative entropy, it is the expected number of additional bits needed to describe x when using the code for a different distribution τ to encode events from x ∼ p(x). This is shown by the decomposition as cross-entropy minus entropy shown above. Analogously to the conditional entropy, the conditional KL divergence is an expectation over both variables under the first distribution. Mutual information The mutual information, or simply information, between two random variables x and y measures how many bits the value of x carries about the unobserved value of y. It is defined as the entropy of one variable minus its conditional entropy given the other variable, I X; Y . = H X -H X Y = E ln p(x y) -ln p(x) = KL p(x, y) p(x)p(y) . (30) The mutual information is symmetric in its arguments and non-negative. It reaches zero if and only if x and y are independent so that p(x, y) = p(x)p(y). Intuitively, it is higher the better we can predict one variable from the other and the more random the variable is by itself. It can also be written as KL divergence between the joint and product of marginals. Variational bound Computing the exact mutual information requires access to both the conditional and marginal distributions. When the conditional is unknown, replacing it with another distribution bounds the mutual information from below (Barber and Agakov, 2003; Poole et al., 2019) , I x; z ≥ I x; z -E KL p(x | z) τ φ (x | z) = E ln τ φ (x z) -ln p(x) . (31) Maximizing the bound with respect to the parameters φ tightens the bound, thus bringing τ φ (x | z) closer to p(x | z). Improving the bound through optimization gives it the name variational bound. The more flexible the family of τ φ (x | z), the more accurate the bound can become. Dirac distribution The Dirac distribution (Dirac, 1958) , also known as point mass, represents a random variable x with certain event x. 
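As a concrete check of these definitions, the following sketch computes entropy, conditional entropy, and mutual information for an assumed toy joint distribution over two binary variables, confirming that the entropy-difference form and the KL form of the mutual information in Equation 30 agree. The distribution itself is illustrative, not from the paper.

```python
import math

# Assumed toy joint distribution p(x, y) over two binary variables.
p_joint = {
    (0, 0): 0.4, (0, 1): 0.1,
    (1, 0): 0.2, (1, 1): 0.3,
}

# Marginals p(x) and p(y) by summing out the other variable.
p_x = {x: sum(v for (xi, _), v in p_joint.items() if xi == x) for x in (0, 1)}
p_y = {y: sum(v for (_, yi), v in p_joint.items() if yi == y) for y in (0, 1)}

def entropy(dist):
    # H[x] = E[-ln p(x)], measured in nats.
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

# Conditional entropy H[x | y] takes the expectation over both variables,
# with p(x | y) = p(x, y) / p(y).
h_x_given_y = -sum(
    v * math.log(v / p_y[y]) for (x, y), v in p_joint.items() if v > 0
)

# Mutual information as the entropy difference I[x; y] = H[x] - H[x | y].
mi_entropy = entropy(p_x) - h_x_given_y

# Mutual information as KL[p(x, y) || p(x) p(y)].
mi_kl = sum(
    v * math.log(v / (p_x[x] * p_y[y])) for (x, y), v in p_joint.items() if v > 0
)

assert abs(mi_entropy - mi_kl) < 1e-12  # the two definitions agree
assert mi_kl >= 0.0                     # mutual information is non-negative
print(round(mi_kl, 4))                  # → 0.0863
```

Since the two variables are correlated but far from deterministic, the mutual information is small but strictly positive.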
Dirac distribution  The Dirac distribution (Dirac, 1958), also known as a point mass, represents a random variable x with certain event \bar{x}. We show an intuitive definition here; for a rigorous definition using measure theory, see Rudin (1966),

    \delta_{\bar{x}}(x) \doteq \begin{cases} 1 & \text{if } x = \bar{x} \\ 0 & \text{else.} \end{cases}   (32)

The expectation under a Dirac distribution is simply the inner expression evaluated at the certain event, \mathrm{E}_{\delta_{\bar{x}}(x)}[f(x)] = f(\bar{x}). The entropy of a Dirac-distributed random variable is therefore \mathrm{H}[x] = -\ln \delta_{\bar{x}}(\bar{x}) = 0, and its mutual information with any other random variable is also zero.

KL control  KL control (Todorov, 2008; Kappen et al., 2009) minimizes the KL divergence between the trajectory x \sim p_\varphi(x) of inputs x \doteq \{x_1, x_2, \ldots, x_T\} and a target distribution \tau(x) \propto \exp(r(x)) defined in terms of a reward r(x),

    \mathrm{KL}[\underbrace{p_\varphi(x)}_{\text{trajectory}} \,\|\, \underbrace{\tau(x)}_{\text{target}}] = \underbrace{-\mathrm{E}[\ln \tau(x)]}_{\text{expected reward}} \underbrace{-\ \mathrm{H}[x]}_{\text{entropy}}.

The KL between the two distributions is minimized with respect to the control rule or action sequence φ, revealing an expected reward term and an entropy regularizer. Because the expectations are over terms of the trajectory x, they are integrals under its distribution p_φ.

Variational inference  Latent variable models explain inputs x using latent variables z. They define a prior τ(z) and an observation model τ(x | z). To infer the posterior τ(z | x) that represents a given input x, we need to condition the model on the input. However, this requires inverting the observation model using Bayes' rule and has no closed form in general. To overcome this intractability, variational inference (Hinton and Van Camp, 1993; Jordan et al., 1999) optimizes a parameterized belief p_φ(z | x) to approximate the posterior by minimizing the KL,

    \mathrm{KL}[p_\varphi(z \mid x) \,\|\, \tau(z \mid x)] + \underbrace{\ln \tau(x)}_{\text{constant}} = \underbrace{\mathrm{KL}[p_\varphi(z \mid x) \,\|\, \tau(z)]}_{\text{complexity}} \underbrace{-\ \mathrm{E}[\ln \tau(x \mid z)]}_{\text{accuracy}}.

Adding the marginal τ(x), which does not depend on φ, completes the intractable posterior to the joint, which can be factorized into the available parts τ(z) and τ(x | z). This reveals a complexity regularizer that keeps the belief close to the prior and an accuracy term that encourages the belief to be representative of the input. This objective is known as the variational free energy; its negative is the evidence lower bound (ELBO).
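The KL control decomposition can be verified numerically. The sketch below assumes a toy trajectory space of three events and an arbitrary reward; with τ(x) ∝ exp(r(x)), the KL to the target equals the negative expected reward minus the entropy, up to the constant normalizer ln Z.

```python
import math

# Assumed toy setup: three possible trajectories with rewards r(x),
# a controlled distribution p(x), and the target tau(x) ∝ exp(r(x)).
rewards = {"a": 1.0, "b": 0.5, "c": -1.0}
p = {"a": 0.6, "b": 0.3, "c": 0.1}

z = sum(math.exp(r) for r in rewards.values())       # normalizer of the target
tau = {x: math.exp(r) / z for x, r in rewards.items()}

kl = sum(p[x] * math.log(p[x] / tau[x]) for x in p)

expected_reward = sum(p[x] * rewards[x] for x in p)
entropy = -sum(p[x] * math.log(p[x]) for x in p)

# KL[p || tau] = -E[r(x)] - H[x] + ln Z, so minimizing the KL with respect
# to p maximizes expected reward plus entropy (ln Z does not depend on p).
assert abs(kl - (-expected_reward - entropy + math.log(z))) < 1e-12
assert kl >= 0.0
```

Changing p toward τ drives the KL to zero, which is exactly the maximum-entropy reward-seeking behavior described above.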



Figure 2: Action and perception minimize the joint KL divergence to a unified target distribution that can be interpreted as learning a probabilistic model of the system. Given the target, perception aligns the agent's beliefs with past inputs, while actions align future inputs with its beliefs. There are many ways to specify the target, for example as a latent variable model that explains past inputs and predicts future inputs, with an optional reward factor that is shown as a filled square.

Figure 4: Variational Inference

Figure 5: Amortized Inference

Local representations represent individual inputs. They can summarize inputs more compactly, enable interpolation between inputs, and facilitate generalization to unseen inputs. In this case, we can use amortized inference (Kingma and Welling, 2013; Rezende et al., 2014; Ha et al., 2016) to learn an encoder that maps each input to its corresponding belief. The encoder is shared among inputs to reuse computation. It can also compute beliefs for new inputs without further optimization, although optimization can refine the belief (Kim et al., 2018). Figure 5 shows amortized inference on the example of a VAE (Kingma and Welling, 2013; Rezende et al., 2014). The inputs are images x \doteq \{x_i\} and we infer their latent codes z \doteq \{z_i\}. The actual distribution consists of the unknown and fixed data distribution and the parameterized encoder p_φ(z_i | x_i). The target is a probabilistic model defined as the prior over codes and the decoder that computes the conditional likelihood of each image given its code. We parameterize the target here, but one could also introduce an additional latent variable to infer a distribution over decoder parameters as in Appendix A.1.
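A minimal sketch of amortized inference, assuming one-pixel binary images, a binary latent code, and hand-picked (untrained) encoder and decoder parameters; all names and numbers here are illustrative, not the paper's implementation. A single shared encoder produces a belief for every input, and each resulting ELBO lower-bounds the log evidence of that input.

```python
import math

def encoder(x):
    # Shared amortized encoder q(z | x): maps any input to belief parameters,
    # here a Bernoulli probability for the binary code z (hand-picked sketch).
    q1 = 0.9 if x == 1 else 0.2
    return {0: 1.0 - q1, 1: q1}

def decoder(x, z):
    # Conditional likelihood tau(x | z): a Bernoulli per code (assumed values).
    p1 = 0.8 if z == 1 else 0.3
    return p1 if x == 1 else 1.0 - p1

prior = {0: 0.5, 1: 0.5}  # tau(z)

def elbo(x):
    # ELBO = accuracy - complexity
    #      = E_q[ln tau(x | z)] - KL[q(z | x) || tau(z)].
    q = encoder(x)
    accuracy = sum(q[z] * math.log(decoder(x, z)) for z in q)
    complexity = sum(q[z] * math.log(q[z] / prior[z]) for z in q if q[z] > 0)
    return accuracy - complexity

# One shared encoder yields a belief for every input, with no per-input
# optimization, and the ELBO lower-bounds the log evidence ln tau(x).
for x in (0, 1):
    evidence = math.log(sum(prior[z] * decoder(x, z) for z in prior))
    assert elbo(x) <= evidence + 1e-12
```

Training would adjust the encoder and decoder parameters to tighten this bound across the dataset; here they are fixed to keep the sketch self-contained.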

    \mathrm{KL}[p_\varphi \,\|\, \tau_\varphi] = \underbrace{\mathrm{E}\,\mathrm{KL}[p(x \mid a) \,\|\, \tau(x)]}_{\text{control}} + \underbrace{\mathrm{E}\,\mathrm{KL}[p_\varphi(a \mid x, z) \,\|\, \tau(a)]}_{\text{complexity}} \underbrace{-\ \mathrm{E}[\ln \tau_\varphi(z \mid x) - \ln p_\varphi(z)]}_{\text{skill discovery}}.

Figure 10: Information Gain




Acknowledgements Hidden for review.

