ITERATIVE AMORTIZED POLICY OPTIMIZATION

Abstract

Policy networks are a central feature of deep reinforcement learning (RL) algorithms for continuous control, enabling the estimation and sampling of high-value actions. From the variational inference perspective on RL, policy networks, when employed with entropy or KL regularization, are a form of amortized optimization, optimizing network parameters rather than the policy distributions directly. However, this direct amortized mapping can empirically yield suboptimal policy estimates and limited exploration. Given this perspective, we consider the more flexible class of iterative amortized optimizers. We demonstrate that the resulting technique, iterative amortized policy optimization, yields performance improvements over direct amortization methods on benchmark continuous control tasks.

1. INTRODUCTION

Reinforcement learning (RL) algorithms involve policy evaluation and policy optimization (Sutton & Barto, 2018). Given a policy, one can estimate the value for each state or state-action pair following that policy, and given a value estimate, one can improve the policy to maximize the value. This latter procedure, policy optimization, can be challenging in continuous control due to instability and poor asymptotic performance. In deep RL, where policies over continuous actions are often parameterized by deep networks, such issues are typically tackled using regularization from previous policies (Schulman et al., 2015; 2017) or by maximizing policy entropy (Mnih et al., 2016; Fox et al., 2016). These techniques can be interpreted as variational inference (Levine, 2018), using optimization to infer a policy that yields high expected return while satisfying prior policy constraints. This smooths the optimization landscape, improving stability and performance (Ahmed et al., 2019).

However, one subtlety arises: when used with entropy or KL regularization, policy networks perform amortized optimization (Gershman & Goodman, 2014). That is, rather than optimizing the action distribution, e.g. mean and variance, many deep RL algorithms, such as soft actor-critic (SAC) (Haarnoja et al., 2018b; c), instead optimize a network to output these parameters, learning to optimize the policy. Typically, this is implemented as a direct mapping from states to action distribution parameters. While direct amortization schemes have improved the efficiency of variational inference as encoder networks (Kingma & Welling, 2014; Rezende et al., 2014; Mnih & Gregor, 2014), they are also suboptimal (Cremer et al., 2018; Kim et al., 2018; Marino et al., 2018b). This suboptimality is referred to as the amortization gap (Cremer et al., 2018), translating into a gap in the RL objective.
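To make the notion of direct amortization concrete, the following is a minimal sketch of a policy network that maps a state to Gaussian action-distribution parameters in a single feedforward pass. The dimensions, the single linear layer, and the random weights are illustrative placeholders, not the architecture used in SAC or in this paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions for a toy continuous-control task.
STATE_DIM, ACTION_DIM = 4, 2

# A tiny stand-in "policy network": one linear layer mapping the state
# directly to Gaussian parameters (mean and log standard deviation).
# The weights are random placeholders, not a trained policy.
W_mean = 0.1 * rng.normal(size=(ACTION_DIM, STATE_DIM))
W_logstd = 0.1 * rng.normal(size=(ACTION_DIM, STATE_DIM))

def direct_policy(state):
    """Direct amortization: one pass from the state to the action
    distribution parameters, with no further per-state optimization."""
    mean = W_mean @ state
    log_std = W_logstd @ state
    return mean, log_std

state = rng.normal(size=STATE_DIM)
mean, log_std = direct_policy(state)

# Sample an action via the reparameterization trick, as in SAC.
action = mean + np.exp(log_std) * rng.normal(size=ACTION_DIM)
```

The key point is that the distribution parameters are fixed once the forward pass completes; any suboptimality of this mapping relative to the best achievable distribution is the amortization gap.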
Likewise, direct amortization is typically restricted to a single estimate of the distribution, limiting the ability to sample diverse solutions. In RL, this translates into a deficiency in exploration.

Inspired by techniques and improvements from variational inference, we investigate iterative amortized policy optimization. Iterative amortization (Marino et al., 2018b) uses gradients or errors to iteratively update the parameters of a distribution. Unlike direct amortization, which receives gradients only after outputting the distribution, iterative amortization uses these gradients online, thereby learning to perform iterative optimization. In generative modeling settings, iterative amortization tends to empirically outperform direct amortization (Marino et al., 2018b; a), with the added benefit of finding multiple modes of the optimization landscape (Greff et al., 2019). Using MuJoCo environments (Todorov et al., 2012) from OpenAI gym (Brockman et al., 2016), we demonstrate performance improvements of iterative amortized policy optimization over direct amortization in model-free and model-based settings. We analyze various aspects of policy optimization, including iterative policy refinement, adaptive computation, and zero-shot optimizer transfer. Identifying policy networks as a form of amortization clarifies suboptimal aspects of direct approaches.
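The iterative scheme can be sketched on a toy problem. Here the entropy-regularized objective uses a hypothetical quadratic critic, so the expected return has a closed form for a diagonal Gaussian policy, and the distribution parameters are refined over several gradient steps. In the actual method, a learned update network consumes these gradients; plain gradient ascent stands in for it in this simplified sketch:

```python
import numpy as np

ACTION_DIM = 2
alpha = 0.1                       # entropy temperature (illustrative value)
a_star = np.array([0.5, -0.3])    # argmax of the toy critic (hypothetical)

def objective(mean, std):
    """Toy regularized objective: E[Q(a)] + alpha * entropy, with a
    quadratic critic Q(a) = -||a - a_star||^2, evaluated in closed form
    for a diagonal Gaussian policy N(mean, diag(std^2))."""
    expected_q = -np.sum((mean - a_star) ** 2) - np.sum(std ** 2)
    entropy = np.sum(np.log(std))  # up to an additive constant
    return expected_q + alpha * entropy

# Initialize the distribution parameters, then iteratively refine them
# using the objective's gradients. A learned update network would
# consume these gradients; plain gradient ascent stands in for it here.
mean, std = np.zeros(ACTION_DIM), np.ones(ACTION_DIM)
lr = 0.1
history = [objective(mean, std)]
for _ in range(50):
    grad_mean = -2.0 * (mean - a_star)       # d objective / d mean
    grad_std = -2.0 * std + alpha / std      # d objective / d std
    mean = mean + lr * grad_mean
    std = std + lr * grad_std
    history.append(objective(mean, std))
```

Each refinement step improves the objective, and the number of steps can be varied at run time, which is what enables the adaptive-computation analysis mentioned above.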

