LATENT STATE MARGINALIZATION AS A LOW-COST APPROACH FOR IMPROVING EXPLORATION

Abstract

While the maximum entropy (MaxEnt) reinforcement learning (RL) framework, often touted for its exploration and robustness capabilities, is usually motivated from a probabilistic perspective, the use of deep probabilistic models has not gained much traction in practice due to their inherent complexity. In this work, we propose the adoption of latent variable policies within the MaxEnt framework, which we show can provably approximate any policy distribution and, additionally, naturally emerge under the use of world models with a latent belief state. We discuss why latent variable policies are difficult to train and how naïve approaches can fail, then introduce a series of improvements centered around low-cost marginalization of the latent state, allowing us to make full use of the latent state at minimal additional cost. We instantiate our method under the actor-critic framework, marginalizing both the actor and critic. The resulting algorithm, referred to as Stochastic Marginal Actor-Critic (SMAC), is simple yet effective. We experimentally validate our method on continuous control tasks, showing that effective marginalization can lead to better exploration and more robust training. Our implementation is open sourced at https://github.com/zdhNarsil/Stochastic-Marginal-Actor-Critic.

1. INTRODUCTION

Figure 1: The world model infers latent states from observation inputs. While most existing methods only take one sample or the mean from this latent belief distribution, the agent of the proposed SMAC algorithm marginalizes out the latent state to improve exploration. Icons are adapted from Mendonca et al. (2021).

A fundamental goal of machine learning is to develop methods capable of sequential decision making, an area where reinforcement learning (RL) has achieved great success in recent decades. One of the core problems in RL is exploration, the process by which an agent learns to interact with its environment. To this end, a useful paradigm is the principle of maximum entropy, which defines the optimal solution to be the one with the highest amount of randomness that still solves the task at hand. While the maximum entropy (MaxEnt) RL framework (Todorov, 2006; Rawlik et al., 2012) is often motivated by the goal of learning complex multi-modal¹ behaviors through a stochastic agent, the algorithms most often used in practice rely on simple agents that only make local perturbations around a single action. Part of the reason is the need to compute the entropy of the agent and use it as part of the training objective. Meanwhile, the use of more expressive models has not gained nearly as much traction in the community. While there exist works that have increased the flexibility of their agents by making use of more complex distributions such as energy-based models (Haarnoja et al., 2017), normalizing flows (Haarnoja et al., 2018a; Ward et al., 2019), mixture-of-experts (Ren et al., 2021), and

