LATENT STATE MARGINALIZATION AS A LOW-COST APPROACH FOR IMPROVING EXPLORATION

Abstract

While the maximum entropy (MaxEnt) reinforcement learning (RL) framework, often touted for its exploration and robustness capabilities, is usually motivated from a probabilistic perspective, the use of deep probabilistic models has not gained much traction in practice due to their inherent complexity. In this work, we propose the adoption of latent variable policies within the MaxEnt framework, which we show can provably approximate any policy distribution and, additionally, naturally emerge under the use of world models with a latent belief state. We discuss why latent variable policies are difficult to train and how naïve approaches can fail, then subsequently introduce a series of improvements centered around low-cost marginalization of the latent state, allowing us to make full use of the latent state at minimal additional cost. We instantiate our method under the actor-critic framework, marginalizing both the actor and critic. The resulting algorithm, referred to as Stochastic Marginal Actor-Critic (SMAC), is simple yet effective. We experimentally validate our method on continuous control tasks, showing that effective marginalization can lead to better exploration and more robust training. Our implementation is open sourced at https://github.com/zdhNarsil/

1. INTRODUCTION

Figure 1: The world model infers latent states from observation inputs. While most existing methods only take one sample or the mean from this latent belief distribution, the agent of the proposed SMAC algorithm marginalizes out the latent state for improving exploration. Icons are adapted from Mendonca et al. (2021).

A fundamental goal of machine learning is to develop methods capable of sequential decision making, where reinforcement learning (RL) has achieved great success in recent decades. One of the core problems in RL is exploration, the process by which an agent learns to interact with its environment. To this end, a useful paradigm is the principle of maximum entropy, which defines the optimal solution to be one with the highest amount of randomness that solves the task at hand. While the maximum entropy (MaxEnt) RL framework (Todorov, 2006; Rawlik et al., 2012) is often motivated for learning complex multi-modal^1 behaviors through a stochastic agent, the algorithms that are most often used in practice rely on simple agents that only make local perturbations around a single action. Part of this is due to the need to compute the entropy of the agent and use it as part of the training objective. Meanwhile, the use of more expressive models has not gained nearly as much traction in the community. While there exist works that have increased the flexibility of their agents by making use of more complex distributions such as energy-based models (Haarnoja et al., 2017), normalizing flows (Haarnoja et al., 2018a; Ward et al., 2019), mixture-of-experts (Ren et al., 2021), and autoregressive models (Zhang et al., 2021b), these constructions often result in complicated training procedures and are inefficient in practice.
Instead, we note that a relatively simple approach to increasing expressiveness is to make use of latent variables, providing the agent with its own inference procedure for modeling stochasticity in the observations, environment, and unseen rewards. Introducing latent variables into the policy makes it possible to capture a diverse set of scenarios that are compatible with the history of observations. In particular, a majority of approaches for handling partial observability make use of world models (Hafner et al., 2019; 2020), which already result in a latent variable policy, but existing training algorithms do not make use of the latent belief state to its fullest extent. This is due in part to the fact that latent variable policies do not admit a simple expression for their entropy, and we show that naïvely estimating the entropy can lead to catastrophic failures during policy optimization. Furthermore, high-variance stochastic updates for maximizing entropy do not immediately distinguish between local random perturbations and multi-modal exploration. We propose remedies to these aforementioned downsides of latent variable policies, making use of recent advances in stochastic estimation and variance reduction. When instantiated in the actor-critic framework, the result is a simple yet effective policy optimization algorithm that can perform better exploration and lead to more robust training in both fully-observed and partially-observed settings. Our contributions can be summarized as follows:

• We motivate the use of latent variable policies for improving exploration and robustness to partial observations, encompassing policies trained on world models as a special instance.
• We discuss the difficulties in applying latent variable policies within the MaxEnt RL paradigm. We then propose several stochastic estimation methods centered around cost-efficiency and variance reduction.
• When applied to the actor-critic framework, this yields an algorithm (SMAC; Figure 1) that is simple, effective, and adds minimal cost.
• We show through experiments that SMAC is more sample-efficient and can more robustly find optimal solutions than competing actor-critic methods in both fully-observed and partially-observed continuous control tasks.
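To make the entropy difficulty above concrete: a latent variable policy defines the marginal π(a|x) = E_{z∼p(z|x)}[π(a|z, x)], which generally has no closed-form density, so its entropy must be estimated. The following is a minimal NumPy sketch of a Monte Carlo estimate of log π(a|x); the prior, action head, and all function names here are illustrative assumptions for exposition, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_logpdf(a, mean, std):
    """Log-density of a scalar Gaussian, broadcast over `mean`."""
    return -0.5 * ((a - mean) / std) ** 2 - np.log(std) - 0.5 * np.log(2 * np.pi)

def sample_latent(x, k):
    # Latent belief p(z|x): assumed standard normal for this toy example.
    return rng.normal(size=k)

def action_mean(x, z):
    # Action head pi(a|z, x): a toy nonlinear function of observation and latent.
    return np.tanh(x + z)

def marginal_log_prob(a, x, k=128):
    """Estimate log pi(a|x) = log E_{z~p(z|x)}[pi(a|z,x)] with k latent samples."""
    z = sample_latent(x, k)
    logp = gaussian_logpdf(a, action_mean(x, z), 0.5)
    # Numerically stable log-mean-exp over latent samples.
    m = logp.max()
    return m + np.log(np.mean(np.exp(logp - m)))
```

A single-sample estimate (k=1) is a high-variance, biased estimate of the marginal log-density; increasing k tightens it, which is the kind of low-cost marginalization the text argues for.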

2.1. MAXIMUM ENTROPY REINFORCEMENT LEARNING

We first consider a standard Markov decision process (MDP) setting. We denote states x_t ∈ S and actions a_t ∈ A, for timesteps t ∈ ℕ. There exists an initial state distribution p(x_1), a stochastic transition distribution p(x_t | x_{t−1}, a_{t−1}), and a deterministic reward function r_t : S × A → ℝ. We can then learn a policy π(a_t | x_t) such that the expected sum of rewards is maximized under trajectories τ ≜ (x_1, a_1, …, x_T, a_T) sampled from the policy and the transition distributions. While it is known that the fully-observed MDP setting admits at least one deterministic policy as a solution (Sutton & Barto, 2018; Puterman, 1990), efficiently searching for an optimal policy generally requires exploring a sufficiently large part of the state space and keeping track of a frontier of current best solutions. As such, many works focus on the use of stochastic policies, often in conjunction with the maximum entropy (MaxEnt) framework,

max_π E_{p(τ)} [ Σ_{t=0}^∞ γ^t ( r_t(x_t, a_t) + α H(π(·|x_t)) ) ],  where  H(π(·|x_t)) = E_{a_t∼π(·|x_t)} [−log π(a_t|x_t)],   (1)

where p(τ) is the trajectory distribution induced by the policy π, H(·) denotes entropy, and γ is a discount factor. The MaxEnt RL objective has appeared many times in the literature (e.g., Todorov (2006); Rawlik et al. (2012); Nachum et al. (2017)), and is recognized for its exploration (Hazan et al., 2019) and robustness (Eysenbach & Levine, 2022) capabilities. It can be equivalently interpreted as variational inference from a probabilistic modeling perspective (Norouzi et al., 2016; Levine, 2018; Lee et al., 2020a). Intuitively, MaxEnt RL encourages the policy to obtain sufficiently high reward while acting as randomly as possible, capturing the largest possible set of optimal actions. Furthermore, it also
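As a concrete reading of Eq. (1), the sketch below computes the entropy-regularized return for a diagonal Gaussian policy, for which the entropy is available in closed form (H = 0.5·log(2πe·σ²) per action dimension). This is a toy illustration of the objective only, not the SMAC algorithm; the function names and argument layout are assumptions:

```python
import math

def gaussian_entropy(stds):
    """Closed-form entropy of a diagonal Gaussian with per-dimension stddevs."""
    return sum(0.5 * math.log(2 * math.pi * math.e * s ** 2) for s in stds)

def maxent_return(rewards, policy_stds, alpha=0.2, gamma=0.99):
    """Discounted sum of r_t + alpha * H(pi(.|x_t)) from Eq. (1).

    rewards[t] plays the role of r_t(x_t, a_t); policy_stds[t] are the
    per-dimension action stddevs of the policy at state x_t.
    """
    total = 0.0
    for t, (r, stds) in enumerate(zip(rewards, policy_stds)):
        total += gamma ** t * (r + alpha * gaussian_entropy(stds))
    return total
```

With α = 0 this reduces to the ordinary discounted return; a larger α rewards the policy for retaining stochasticity, which is the exploration pressure the text describes.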



* Work done during an internship at Meta AI. Correspondence to: <dinghuai.zhang@mila.quebec>.
^1 By multi-modality, we are referring to distributions with different and diverse modes.




