ITERATIVE AMORTIZED POLICY OPTIMIZATION

Abstract

Policy networks are a central feature of deep reinforcement learning (RL) algorithms for continuous control, enabling the estimation and sampling of high-value actions. From the variational inference perspective on RL, policy networks, when employed with entropy or KL regularization, are a form of amortized optimization, optimizing network parameters rather than the policy distributions directly. However, this direct amortized mapping can empirically yield suboptimal policy estimates and limited exploration. Given this perspective, we consider the more flexible class of iterative amortized optimizers. We demonstrate that the resulting technique, iterative amortized policy optimization, yields performance improvements over direct amortization methods on benchmark continuous control tasks.

1. INTRODUCTION

Reinforcement learning (RL) algorithms involve policy evaluation and policy optimization (Sutton & Barto, 2018). Given a policy, one can estimate the value of each state or state-action pair under that policy, and given a value estimate, one can improve the policy to maximize the value. This latter procedure, policy optimization, can be challenging in continuous control due to instability and poor asymptotic performance. In deep RL, where policies over continuous actions are often parameterized by deep networks, such issues are typically tackled using regularization from previous policies (Schulman et al., 2015; 2017) or by maximizing policy entropy (Mnih et al., 2016; Fox et al., 2016). These techniques can be interpreted as variational inference (Levine, 2018), using optimization to infer a policy that yields high expected return while satisfying prior policy constraints. This smooths the optimization landscape, improving stability and performance (Ahmed et al., 2019). However, one subtlety arises: when used with entropy or KL regularization, policy networks perform amortized optimization (Gershman & Goodman, 2014). That is, rather than optimizing the action distribution directly, e.g., its mean and variance, many deep RL algorithms, such as soft actor-critic (SAC) (Haarnoja et al., 2018b;c), instead optimize a network to output these parameters, learning to optimize the policy. Typically, this is implemented as a direct mapping from states to action distribution parameters. While direct amortization schemes have improved the efficiency of variational inference as encoder networks (Kingma & Welling, 2014; Rezende et al., 2014; Mnih & Gregor, 2014), they are also suboptimal (Cremer et al., 2018; Kim et al., 2018; Marino et al., 2018b). This suboptimality is referred to as the amortization gap (Cremer et al., 2018), which translates into a gap in the RL objective.
Likewise, direct amortization is typically restricted to a single estimate of the distribution, limiting the ability to sample diverse solutions. In RL, this translates into a deficiency in exploration. Inspired by techniques and improvements from variational inference, we investigate iterative amortized policy optimization. Iterative amortization (Marino et al., 2018b) uses gradients or errors to iteratively update the parameters of a distribution. Unlike direct amortization, which receives gradients only after outputting the distribution, iterative amortization uses these gradients online, thereby learning to perform iterative optimization. In generative modeling settings, iterative amortization tends to empirically outperform direct amortization (Marino et al., 2018b;a), with the added benefit of finding multiple modes of the optimization landscape (Greff et al., 2019). Using MuJoCo environments (Todorov et al., 2012) from OpenAI gym (Brockman et al., 2016), we demonstrate performance improvements of iterative amortized policy optimization over direct amortization in model-free and model-based settings. We analyze various aspects of policy optimization, including iterative policy refinement, adaptive computation, and zero-shot optimizer transfer. Identifying policy networks as a form of amortization clarifies suboptimal aspects of direct approaches to policy optimization. Iterative amortization, by harnessing gradient-based feedback during policy optimization, offers an effective and principled improvement.
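To make the distinction concrete, the sketch below contrasts the two amortization schemes on a toy objective over distribution parameters $\lambda$. This is an illustrative example, not the paper's implementation: the quadratic objective, the fixed linear "network," and the use of plain gradient ascent in place of a learned update network $f_\phi(\lambda, \nabla_\lambda \mathcal{J})$ are all stand-in assumptions.

```python
import numpy as np

def objective(lam):
    """Toy policy objective J(lambda) over distribution parameters lambda.
    Stands in for E_pi[Q] - alpha * KL; maximized at lambda = 2."""
    return -np.sum((lam - 2.0) ** 2)

def grad(lam, eps=1e-5):
    """Central-difference gradient of the objective (exact for quadratics)."""
    return np.array([(objective(lam + eps * e) - objective(lam - eps * e)) / (2 * eps)
                     for e in np.eye(len(lam))])

def direct_amortization(state):
    # A direct policy network maps the state straight to distribution
    # parameters in one shot (a fixed linear map stands in for the network).
    return 0.5 * state

def iterative_amortization(state, n_steps=20, lr=0.1):
    # An iterative amortized optimizer refines its estimate using the
    # objective's gradient at each step. Plain gradient ascent stands in
    # for the learned update network f_phi(lambda, grad J).
    lam = 0.5 * state  # same initialization as the direct estimate
    for _ in range(n_steps):
        lam = lam + lr * grad(lam)
    return lam
```

Because the iterative optimizer keeps consuming gradient feedback, its final estimate lands closer to the optimum than the one-shot direct mapping, shrinking the amortization gap in this toy setting.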

2.1. PRELIMINARIES

We consider Markov decision processes (MDPs), where $s_t \in \mathcal{S}$ and $a_t \in \mathcal{A}$ are the state and action at time $t$, resulting in reward $r_t = r(s_t, a_t)$. Environment state transitions are given by $s_{t+1} \sim p_\text{env}(s_{t+1} | s_t, a_t)$, and the agent is defined by a parametric distribution, $p_\theta(a_t | s_t)$, with parameters $\theta$. The discounted sum of rewards is denoted as $R(\tau) = \sum_t \gamma^t r_t$, where $\gamma \in (0, 1]$ is the discount factor and $\tau = (s_1, a_1, \dots)$ is a trajectory. The distribution over trajectories is $p(\tau) = \rho(s_1) \prod_{t=1}^{T} p_\text{env}(s_{t+1} | s_t, a_t) \, p_\theta(a_t | s_t)$, where the initial state is drawn from the distribution $\rho(s_1)$. The standard RL objective consists of maximizing the expected discounted return, $\mathbb{E}_{p(\tau)}[R(\tau)]$. For convenience of presentation, we use the undiscounted setting ($\gamma = 1$), though the formulation can be applied with any valid $\gamma$.
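The discounted return $R(\tau) = \sum_t \gamma^t r_t$ from the definition above can be sketched directly; this is a minimal helper for illustration, not code from the paper.

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Compute R(tau) = sum_t gamma^t * r_t for one trajectory's rewards,
    indexing from t = 0 so the first reward is undiscounted."""
    rewards = np.asarray(rewards, dtype=float)
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * rewards))
```

For example, `discounted_return([1.0, 1.0, 1.0], gamma=0.5)` gives $1 + 0.5 + 0.25 = 1.75$, while `gamma=1.0` recovers the undiscounted sum used in the rest of the section.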

2.2. KL-REGULARIZED REINFORCEMENT LEARNING

Various works have formulated RL, planning, and control problems in terms of probabilistic inference (Dayan & Hinton, 1997; Attias, 2003; Toussaint & Storkey, 2006; Todorov, 2008; Botvinick & Toussaint, 2012; Levine, 2018). These approaches consider the agent-environment interaction as a graphical model, then convert reward maximization into maximum marginal likelihood estimation, learning and inferring a policy that results in maximal reward. This conversion is accomplished by introducing one or more binary observed variables (Cooper, 1988), denoted as $\mathcal{O}$, with $p(\mathcal{O} = 1 | \tau) \propto \exp(R(\tau)/\alpha)$, where $\alpha$ is a temperature hyper-parameter. These new variables are often referred to as "optimality" variables (Levine, 2018). We would like to infer latent variables, $\tau$, and learn parameters, $\theta$, that yield the maximum log-likelihood of optimality, i.e., $\log p(\mathcal{O} = 1)$. Evaluating this likelihood requires marginalizing the joint distribution, $p(\mathcal{O} = 1) = \int p(\tau, \mathcal{O} = 1) \, d\tau$. This involves averaging over all trajectories, which is intractable in high-dimensional spaces. Instead, we can use variational inference to lower bound this objective, introducing a structured approximate posterior distribution: $\pi(\tau | \mathcal{O}) = \prod_{t=1}^{T} p_\text{env}(s_{t+1} | s_t, a_t) \, \pi(a_t | s_t, \mathcal{O})$. This provides the following lower bound on the objective, $\log p(\mathcal{O} = 1)$:

$$\log \int p(\mathcal{O} = 1 | \tau) \, p(\tau) \, d\tau \geq \int \pi(\tau | \mathcal{O}) \log \frac{p(\mathcal{O} = 1 | \tau) \, p(\tau)}{\pi(\tau | \mathcal{O})} \, d\tau = \mathbb{E}_\pi [R(\tau)/\alpha] - D_\text{KL}(\pi(\tau | \mathcal{O}) \, \| \, p(\tau)). \quad (3)$$

Equivalently, we can multiply by $\alpha$, defining the variational RL objective as

$$\mathcal{J}(\pi, \theta) \equiv \mathbb{E}_\pi [R(\tau)] - \alpha D_\text{KL}(\pi(\tau | \mathcal{O}) \, \| \, p(\tau)).$$

This objective consists of the expected return (i.e., the standard RL objective) and a KL divergence between $\pi(\tau | \mathcal{O})$ and $p(\tau)$. In terms of states and actions, this objective is written as

$$\mathcal{J}(\pi, \theta) = \sum_{t=1}^{T} \mathbb{E}_{s_t, r_t \sim p_\text{env}, \, a_t \sim \pi} \left[ r_t - \alpha D_\text{KL}(\pi(a_t | s_t, \mathcal{O}) \, \| \, p_\theta(a_t | s_t)) \right].$$

At a given timestep, $t$, one can optimize this objective by estimating the future terms in the summation using a "soft" action-value ($Q_\pi$) network (Haarnoja et al., 2017) or model (Piché et al., 2019).
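For diagonal Gaussian policies and priors, the per-timestep summand $r_t - \alpha D_\text{KL}(\pi \| p_\theta)$ has a closed form. The sketch below illustrates this computation; the function names and example parameter values are ours, not the paper's.

```python
import numpy as np

def gaussian_kl(mu_q, log_std_q, mu_p, log_std_p):
    """KL( N(mu_q, diag sigma_q^2) || N(mu_p, diag sigma_p^2) ),
    summed over action dimensions (standard closed form)."""
    var_q = np.exp(2 * log_std_q)
    var_p = np.exp(2 * log_std_p)
    kl = log_std_p - log_std_q + (var_q + (mu_q - mu_p) ** 2) / (2 * var_p) - 0.5
    return float(np.sum(kl))

def kl_regularized_reward(r_t, mu_q, log_std_q, mu_p, log_std_p, alpha=0.1):
    """One summand of J: r_t - alpha * KL(pi(a_t|s_t,O) || p_theta(a_t|s_t))."""
    return r_t - alpha * gaussian_kl(mu_q, log_std_q, mu_p, log_std_p)
```

When the approximate posterior matches the prior, the KL penalty vanishes and the summand reduces to the plain reward, recovering the standard RL objective for that timestep.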
For instance, sampling $s_t \sim p_\text{env}$ and slightly abusing notation, we can write the objective at time $t$ as

$$\mathcal{J}(\pi, \theta) = \mathbb{E}_\pi \left[ Q_\pi(s_t, a_t) \right] - \alpha D_\text{KL}(\pi(a_t | s_t, \mathcal{O}) \, \| \, p_\theta(a_t | s_t)).$$
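As a concrete illustration of optimizing the time-$t$ objective $\mathbb{E}_\pi[Q_\pi] - \alpha D_\text{KL}$, the sketch below performs gradient ascent on the parameters of a diagonal Gaussian policy against a toy quadratic soft Q-function and a standard-normal prior. The Q-function, the prior, `A_STAR`, and the numerical-gradient optimizer are all illustrative assumptions, not the paper's setup.

```python
import numpy as np

A_STAR = np.array([1.0, -0.5])  # hypothetical maximizer of the toy Q-function
ALPHA = 0.1                     # temperature hyper-parameter

def expected_q(mu, log_std):
    """E_{a ~ N(mu, diag sigma^2)}[Q(a)] for Q(a) = -||a - a*||^2 (closed form)."""
    var = np.exp(2 * log_std)
    return -np.sum((mu - A_STAR) ** 2) - np.sum(var)

def kl_to_prior(mu, log_std):
    """KL( N(mu, diag sigma^2) || N(0, I) ), a stand-in for p_theta(a_t|s_t)."""
    var = np.exp(2 * log_std)
    return float(np.sum(-log_std + 0.5 * (var + mu ** 2) - 0.5))

def soft_objective(params):
    """Time-t objective J = E_pi[Q] - alpha * KL over params = (mu, log_std)."""
    mu, log_std = params[:2], params[2:]
    return expected_q(mu, log_std) - ALPHA * kl_to_prior(mu, log_std)

def refine(params, n_steps=200, lr=0.05, eps=1e-5):
    """Gradient ascent on the time-t objective via central differences."""
    for _ in range(n_steps):
        g = np.array([(soft_objective(params + eps * e) -
                       soft_objective(params - eps * e)) / (2 * eps)
                      for e in np.eye(len(params))])
        params = params + lr * g
    return params
```

For this quadratic Q, the optimal mean is available analytically as $\mu^* = 2a^*/(2 + \alpha)$: the KL penalty shrinks the policy mean toward the prior's, and the entropy-like variance term keeps the policy stochastic rather than collapsing to a point estimate.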