ITERATIVE AMORTIZED POLICY OPTIMIZATION

Abstract

Policy networks are a central feature of deep reinforcement learning (RL) algorithms for continuous control, enabling the estimation and sampling of high-value actions. From the variational inference perspective on RL, policy networks, when employed with entropy or KL regularization, are a form of amortized optimization, optimizing network parameters rather than the policy distributions directly. However, this direct amortized mapping can empirically yield suboptimal policy estimates and limited exploration. Given this perspective, we consider the more flexible class of iterative amortized optimizers. We demonstrate that the resulting technique, iterative amortized policy optimization, yields performance improvements over direct amortization methods on benchmark continuous control tasks.

1. INTRODUCTION

Reinforcement learning (RL) algorithms involve policy evaluation and policy optimization (Sutton & Barto, 2018). Given a policy, one can estimate the value for each state or state-action pair following that policy, and given a value estimate, one can improve the policy to maximize the value. This latter procedure, policy optimization, can be challenging in continuous control due to instability and poor asymptotic performance. In deep RL, where policies over continuous actions are often parameterized by deep networks, such issues are typically tackled using regularization from previous policies (Schulman et al., 2015; 2017) or by maximizing policy entropy (Mnih et al., 2016; Fox et al., 2016). These techniques can be interpreted as variational inference (Levine, 2018), using optimization to infer a policy that yields high expected return while satisfying prior policy constraints. This smooths the optimization landscape, improving stability and performance (Ahmed et al., 2019). However, one subtlety arises: when used with entropy or KL regularization, policy networks perform amortized optimization (Gershman & Goodman, 2014). That is, rather than optimizing the action distribution, e.g. mean and variance, many deep RL algorithms, such as soft actor-critic (SAC) (Haarnoja et al., 2018b; c), instead optimize a network to output these parameters, learning to optimize the policy. Typically, this is implemented as a direct mapping from states to action distribution parameters. While direct amortization schemes have improved the efficiency of variational inference as encoder networks (Kingma & Welling, 2014; Rezende et al., 2014; Mnih & Gregor, 2014), they are also suboptimal (Cremer et al., 2018; Kim et al., 2018; Marino et al., 2018b). This suboptimality is referred to as the amortization gap (Cremer et al., 2018), translating into a gap in the RL objective.
Likewise, direct amortization is typically restricted to a single estimate of the distribution, limiting the ability to sample diverse solutions. In RL, this translates into a deficiency in exploration. Inspired by techniques and improvements from variational inference, we investigate iterative amortized policy optimization. Iterative amortization (Marino et al., 2018b) uses gradients or errors to iteratively update the parameters of a distribution. Unlike direct amortization, which receives gradients only after outputting the distribution, iterative amortization uses these gradients online, thereby learning to perform iterative optimization. In generative modeling settings, iterative amortization tends to empirically outperform direct amortization (Marino et al., 2018b; a), with the added benefit of finding multiple modes of the optimization landscape (Greff et al., 2019). Using MuJoCo environments (Todorov et al., 2012) from OpenAI gym (Brockman et al., 2016), we demonstrate performance improvements of iterative amortized policy optimization over direct amortization in model-free and model-based settings. We analyze various aspects of policy optimization, including iterative policy refinement, adaptive computation, and zero-shot optimizer transfer. Identifying policy networks as a form of amortization clarifies suboptimal aspects of direct approaches to policy optimization. Iterative amortization, by harnessing gradient-based feedback during policy optimization, offers an effective and principled improvement.

2.1. PRELIMINARIES

We consider Markov decision processes (MDPs), where s_t ∈ S and a_t ∈ A are the state and action at time t, resulting in reward r_t = r(s_t, a_t). Environment state transitions are given by s_{t+1} ∼ p_env(s_{t+1}|s_t, a_t), and the agent is defined by a parametric distribution, p_θ(a_t|s_t), with parameters θ. The discounted sum of rewards is denoted as R(τ) = ∑_t γ^t r_t, where γ ∈ (0, 1] is the discount factor and τ = (s_1, a_1, ...) is a trajectory. The distribution over trajectories is p(τ) = ρ(s_1) ∏_{t=1}^T p_env(s_{t+1}|s_t, a_t) p_θ(a_t|s_t), where the initial state is drawn from the distribution ρ(s_1). The standard RL objective consists of maximizing the expected discounted return, E_{p(τ)}[R(τ)]. For convenience of presentation, we use the undiscounted setting (γ = 1), though the formulation can be applied with any valid γ.
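As a quick concrete check of this notation, the discounted return R(τ) = ∑_t γ^t r_t can be computed directly (a minimal numpy sketch; the function name is ours, not from the paper):

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """R(tau) = sum_t gamma^t * r_t for a single trajectory (t starting at 0)."""
    rewards = np.asarray(rewards, dtype=float)
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * rewards))

# With gamma = 0.5: 1 + 0.5*2 + 0.25*4 = 3.0
print(discounted_return([1.0, 2.0, 4.0], gamma=0.5))  # 3.0
```

Setting gamma=1.0 recovers the undiscounted case used in the formulation above.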

2.2. KL-REGULARIZED REINFORCEMENT LEARNING

Various works have formulated RL, planning, and control problems in terms of probabilistic inference (Dayan & Hinton, 1997; Attias, 2003; Toussaint & Storkey, 2006; Todorov, 2008; Botvinick & Toussaint, 2012; Levine, 2018). These approaches consider the agent-environment interaction as a graphical model, then convert reward maximization into maximum marginal likelihood estimation, learning and inferring a policy that results in maximal reward. This conversion is accomplished by introducing one or more binary observed variables (Cooper, 1988), denoted as O, with p(O = 1|τ) ∝ exp(R(τ)/α), where α is a temperature hyper-parameter. These new variables are often referred to as "optimality" variables (Levine, 2018). We would like to infer latent variables, τ, and learn parameters, θ, that yield the maximum log-likelihood of optimality, i.e. log p(O = 1). Evaluating this likelihood requires marginalizing the joint distribution, p(O = 1) = ∫ p(τ, O = 1) dτ. This involves averaging over all trajectories, which is intractable in high-dimensional spaces. Instead, we can use variational inference to lower bound this objective, introducing a structured approximate posterior distribution:

π(τ|O) = ∏_{t=1}^T p_env(s_{t+1}|s_t, a_t) π(a_t|s_t, O). (2)

This provides the following lower bound on the objective, log p(O = 1):

log ∫ p(O = 1|τ) p(τ) dτ ≥ ∫ π(τ|O) log [ p(O = 1|τ) p(τ) / π(τ|O) ] dτ (3)
= E_π[R(τ)/α] − D_KL(π(τ|O) || p(τ)).

Equivalently, we can multiply by α, defining the variational RL objective as:

J(π, θ) ≡ E_π[R(τ)] − α D_KL(π(τ|O) || p(τ)).

This objective consists of the expected return (i.e., the standard RL objective) and a KL divergence between π(τ|O) and p(τ). In terms of states and actions, this objective is written as:

J(π, θ) = E_{s_t, r_t ∼ p_env; a_t ∼ π} [ ∑_{t=1}^T ( r_t − α log [ π(a_t|s_t, O) / p_θ(a_t|s_t) ] ) ].
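The state-action form of the objective admits a simple single-trajectory Monte Carlo estimate; the sketch below (our own illustrative code, with an arbitrary α) makes the reward and KL terms explicit:

```python
import numpy as np

def variational_objective(rewards, logp_pi, logp_prior, alpha=0.2):
    """J = sum_t [ r_t - alpha * (log pi(a_t|s_t, O) - log p_theta(a_t|s_t)) ],
    estimated from one sampled trajectory (undiscounted, as in the text)."""
    r = np.asarray(rewards, dtype=float)
    kl_terms = np.asarray(logp_pi, dtype=float) - np.asarray(logp_prior, dtype=float)
    return float(np.sum(r - alpha * kl_terms))

# When pi matches the prior, the KL penalty vanishes and J reduces to the return.
print(variational_objective([1.0, 2.0], [-0.5, -0.5], [-0.5, -0.5]))  # 3.0
```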
At a given timestep, t, one can optimize this objective by estimating the future terms in the summation using a "soft" action-value (Q^π) network (Haarnoja et al., 2017) or model (Piché et al., 2019). For instance, sampling s_t ∼ p_env and slightly abusing notation, we can write the objective at time t as:

J(π, θ) = E_π[Q^π(s_t, a_t)] − α D_KL(π(a_t|s_t, O) || p_θ(a_t|s_t)). (7)
Figure 1: Amortization. Left: Optimization over two dimensions of the policy mean, μ_1 and μ_3, for a particular state. A direct amortized policy network outputs a suboptimal estimate, yielding an amortization gap in performance. An iterative amortized policy network finds an improved estimate. Right: Diagrams of direct and iterative amortization. Larger circles denote distributions, and smaller red circles denote terms in the objective, J (Eq. 7). Dashed arrows denote amortization. Iterative amortization uses gradient feedback during optimization, while direct amortization does not.

Policy optimization in the KL-regularized setting corresponds to maximizing J w.r.t. π. We often consider parametric policies, in which π is defined by distribution parameters, λ, e.g. Gaussian mean, μ, and variance, σ². In this case, policy optimization corresponds to maximizing:

λ ← arg max_λ J(π, θ). (8)

Optionally, we can then also learn the policy prior parameters, θ (Abdolmaleki et al., 2018).

2.3. KL-REGULARIZED POLICY NETWORKS PERFORM DIRECT AMORTIZATION

Policy-based approaches to RL typically do not directly optimize the action distribution parameters, e.g. through gradient-based optimization. Instead, the action distribution parameters are output by a function approximator (deep network), f_φ, which is trained using deterministic (Silver et al., 2014; Lillicrap et al., 2016) or stochastic gradients (Williams, 1992; Heess et al., 2015). When combined with entropy or KL regularization, this policy network is a form of amortized optimization (Gershman & Goodman, 2014), learning to estimate policies. Again denoting the action distribution parameters, e.g. mean and variance, as λ, for a given state, s, we can express this direct mapping as

λ ← f_φ(s), (direct amortization) (9)

and we denote the corresponding policy as π_φ(a|s, O; λ). Thus, f_φ attempts to learn to optimize Eq. 8. This setup is shown in Figure 1 (Right). Without entropy or KL regularization, i.e. π_φ(a|s) = p_θ(a|s), we can instead interpret the network as directly integrating the LHS of Eq. 3, which is less efficient and more challenging. Adding regularization smooths the optimization landscape, resulting in more stable improvement and higher asymptotic performance (Ahmed et al., 2019). Viewing policy networks as a form of amortized variational optimizer (inference model) (Eq. 9) allows us to see that they are similar to encoder networks in variational autoencoders (VAEs) (Kingma & Welling, 2014; Rezende et al., 2014). This raises the following question: are policy networks providing fully-optimized policy objectives? In VAEs, it is empirically observed that amortization results in suboptimal approximate posterior estimates, with the resulting gap in the variational bound referred to as the amortization gap (Cremer et al., 2018). In the RL setting, this means that an amortized policy, π_φ, results in worse performance than the optimal policy within the parametric policy class, which we denote as π̂.
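A direct amortized policy network of the form λ ← f_φ(s) can be sketched as a small feedforward network; the architecture, layer sizes, and random (untrained) weights below are our own illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

class DirectPolicy:
    """Direct amortization: one forward pass s -> lambda = (mu, sigma).
    In practice f_phi is trained on stochastic gradients of J; here the
    weights are random placeholders."""
    def __init__(self, state_dim, action_dim, hidden=64):
        self.W1 = rng.normal(0.0, 0.1, (state_dim, hidden))
        self.W2 = rng.normal(0.0, 0.1, (hidden, 2 * action_dim))

    def __call__(self, s):
        h = np.tanh(s @ self.W1)
        mu, log_sigma = np.split(h @ self.W2, 2)
        return mu, np.exp(log_sigma)

policy = DirectPolicy(state_dim=11, action_dim=3)
mu, sigma = policy(rng.normal(size=11))
action = mu + sigma * rng.normal(size=3)  # reparameterized sample a ~ N(mu, sigma^2)
```

The reparameterized sample in the last line is what enables the path-wise derivative estimator used to train such networks.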
Thus, the amortization gap is the gap in the following inequality: J(π_φ, θ) ≤ J(π̂, θ). Because J is a variational bound on the RL objective, i.e. expected return, a looser bound, due to amortization, prevents an agent from more completely optimizing this objective. To visualize the RL amortization gap, in Figure 1 (Left), we display the optimization surface, J, for two dimensions of the policy mean at a particular state in the MuJoCo environment Hopper-v2. We see that the estimate of a direct amortized policy (diamond) is suboptimal, far from the optimal estimate (star). Additional 2D plots are shown in Figure B.3. However, note that the absolute difference in the objective due to direct amortization is relatively small compared with the objective itself. That is, suboptimal estimates tend to have only a minor impact on evaluation performance, as we show in Appendix B.4. Rather, policy suboptimality hinders data collection, sampling fewer actions with high value estimates. Further, direct amortization is typically limited to a single static estimate of the policy, unable to directly adapt to the RL objective and therefore limiting exploration. To improve upon this scheme, in Section 3, we turn to a technique developed in generative modeling, iterative amortization (Marino et al., 2018b), which retains the efficiency benefits of amortization while employing a more flexible iterative estimation procedure.

2.4. RELATED WORK

Previous works have investigated methods for improving policy optimization. QT-Opt (Kalashnikov et al., 2018) uses the cross-entropy method (CEM) (Rubinstein & Kroese, 2013), an iterative derivative-free optimizer, to optimize a Q-value estimator for robotic grasping. CEM and related methods are also used in model-based RL for performing model-predictive control (Nagabandi et al., 2018; Chua et al., 2018; Piché et al., 2019; Hafner et al., 2019). Gradient-based policy optimization, in contrast, is less common (Henaff et al., 2017; Srinivas et al., 2018; Bharadhwaj et al., 2020); however, gradient-based optimization can also be combined with CEM (Amos & Yarats, 2020). Most policy-based methods use direct amortization, either using a feedforward (Haarnoja et al., 2018b) or recurrent (Guez et al., 2019) network. Similar approaches have also been applied to model-based value estimates (Byravan et al., 2020; Clavera et al., 2020; Amos et al., 2020), as well as combining direct amortization with model predictive control (Lee et al., 2019) and planning (Rivière et al., 2020). A separate line of work has explored improving the policy distribution, using normalizing flows (Haarnoja et al., 2018a; Tang & Agrawal, 2018) and latent variables (Tirumala et al., 2019). In principle, iterative amortization can perform policy optimization in each of these settings.

3.1. FORMULATION

Iterative amortized optimizers (Marino et al., 2018b) utilize some form of error or gradient to update the approximate posterior distribution parameters. While various forms exist, we consider gradient-encoding models (Andrychowicz et al., 2016) due to their generality. Compared with direct amortization in Eq. 9, we use iterative amortized optimizers of the general form

λ ← f_φ(s, λ, ∇_λ J), (iterative amortization)

where f_φ updates the current policy estimate using the objective gradient (Marino et al., 2018b). In practice, the update is carried out using a "highway" gating operation (Hochreiter & Schmidhuber, 1997; Srivastava et al., 2015). Denoting ω_φ ∈ [0, 1] as the gate and δ_φ as the update, both of which are output by f_φ, the gating operation is expressed as

λ ← ω_φ ⊙ λ + (1 − ω_φ) ⊙ δ_φ,

where ⊙ denotes element-wise multiplication. This update is typically run for a fixed number of steps, and, as with a direct policy, the iterative optimizer is trained using stochastic gradient estimates of ∇_φ J, obtained through the path-wise derivative estimator (Kingma & Welling, 2014; Rezende et al., 2014; Heess et al., 2015). Because the gradients ∇_λ J must be estimated online, i.e. during policy optimization, this scheme requires some way of estimating J online, e.g. through a parameterized Q-value network (Mnih et al., 2013) or a differentiable model (Heess et al., 2015).
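The gated update can be sketched end-to-end on a toy objective. Here f_phi is a hand-coded stand-in for the learned optimizer network, and the gradient is computed by finite differences rather than autodiff; both are our own simplifications:

```python
import numpy as np

def grad(J, lam, eps=1e-5):
    """Finite-difference estimate of grad_lambda J (a stand-in for autodiff)."""
    g = np.zeros_like(lam)
    for i in range(lam.size):
        d = np.zeros_like(lam)
        d[i] = eps
        g[i] = (J(lam + d) - J(lam - d)) / (2.0 * eps)
    return g

def f_phi(lam, g):
    """Toy stand-in for the learned optimizer network: outputs a gate omega
    in [0, 1] and a proposed update delta (here, a unit gradient step)."""
    omega = np.full_like(lam, 0.5)
    delta = lam + g
    return omega, delta

def iterative_step(lam, g):
    """Highway-gated update: lambda <- omega * lambda + (1 - omega) * delta."""
    omega, delta = f_phi(lam, g)
    return omega * lam + (1.0 - omega) * delta

J = lambda lam: -float(np.sum((lam - 2.0) ** 2))  # toy objective, maximized at 2
lam = np.zeros(3)                                  # initial policy estimate
for _ in range(5):                                 # 5 iterations per step, as in Sec. 4.1
    lam = iterative_step(lam, grad(J, lam))
# lam converges toward the optimum at 2.0
```

In the paper, f_phi is a trained network and J is estimated online from Q-networks or a model; only the update rule itself is faithful here.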

3.2.1. ADDED FLEXIBILITY

Iterative amortized optimizers are more flexible than their direct counterparts, incorporating feedback from the objective during policy optimization (Algorithm 2), rather than only after optimization (Algorithm 1). Increased flexibility improves the accuracy of optimization, thereby tightening the variational bound (Marino et al., 2018b; a) . We see this flexibility in Figure 1 (Left), where an iterative amortized policy network, despite being trained with a different value estimator, is capable of iteratively optimizing the policy estimate (blue dots), quickly arriving near the optimal estimate. Direct amortization is typically restricted to a single estimate, inherently limiting exploration. In contrast, iterative amortized optimizers, by using stochastic gradients and random initialization, can traverse the optimization landscape. As with any iterative optimization scheme, this allows iterative amortization to obtain multiple valid estimates (Greff et al., 2019) . We illustrate this capability across two action dimensions in Figure 2 for a state in the Ant-v2 MuJoCo environment. Over multiple policy optimization runs, iterative amortization finds multiple modes, sampling from two high-value regions of the action space. This provides increased flexibility in action exploration.

3.2.2. MITIGATING VALUE OVERESTIMATION

Model-free approaches generally estimate Q^π using function approximation and temporal difference learning. However, this comes with the pitfall of value overestimation, i.e. positive bias in the estimate, Q̂^π (Thrun & Schwartz, 1993). This issue is tied to uncertainty in the value estimate, though it is distinct from optimism under uncertainty. If the policy can exploit regions of high uncertainty, the resulting target values will introduce positive bias into the estimate. More flexible policy optimizers may exacerbate the problem, exploiting this uncertainty to a greater degree. Further, a rapidly changing policy increases the difficulty of value estimation (Rajeswaran et al., 2020). A common remedy is double Q-learning (van Hasselt, 2010), which maintains two value estimators; Fujimoto et al. (2018) apply and improve upon this technique for actor-critic settings, estimating the target Q-value as the minimum of two Q-networks, Q_ψ1 and Q_ψ2:

Q̂^π(s, a) = min_{i=1,2} Q_ψi(s, a),

where ψi denotes the "target" network parameters. As noted by Fujimoto et al. (2018), this not only counteracts value overestimation, but also penalizes high-variance value estimates, because the minimum decreases with the variance of the estimate. Ciosek et al. (2019) noted that, for a bootstrapped ensemble of two Q-networks, the minimum operation can be interpreted as estimating

Q̂^π(s, a) = μ_Q(s, a) − β σ_Q(s, a),

with mean μ_Q(s, a) ≡ (1/2) ∑_{i=1,2} Q_ψi(s, a), standard deviation σ_Q(s, a) ≡ ((1/2) ∑_{i=1,2} (Q_ψi(s, a) − μ_Q(s, a))²)^{1/2}, and β = 1. Thus, to further penalize high-variance value estimates, preventing value overestimation, we can increase β. For large β, however, value estimates become overly pessimistic, negatively impacting training. Thus, β reduces target value variance at the cost of increased bias. Due to the flexibility of iterative amortization, the default β = 1 results in increased value bias (Figure 3a) and a more rapidly changing policy (Figure 3b) as compared with direct amortization. Further penalizing high-variance target values with β = 2.5 reduces value overestimation and improves policy stability. For details, see Appendix A.2. Recent techniques for mitigating overestimation have been proposed, such as adjusting the temperature, α (Fox, 2019). In offline RL, this issue has been tackled through the action prior (Fujimoto et al., 2019; Kumar et al., 2019; Wu et al., 2019) or by altering Q-network training (Agarwal et al., 2019; Kumar et al., 2020). While such techniques could be used here, increasing β provides a simple solution with no additional computational overhead.
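The equivalence between the two-network minimum and the μ_Q − βσ_Q estimate noted by Ciosek et al. (2019) is easy to verify numerically (an illustrative numpy sketch; function names are ours):

```python
import numpy as np

def pessimistic_q(q1, q2, beta=1.0):
    """Target value mu_Q - beta * sigma_Q over a two-network ensemble.
    With beta = 1, this exactly recovers min(q1, q2)."""
    q = np.stack([q1, q2])
    mu = q.mean(axis=0)
    sigma = q.std(axis=0)  # population std over the two estimates
    return mu - beta * sigma

q1, q2 = np.array([1.0, 3.0]), np.array([2.0, 1.0])
print(pessimistic_q(q1, q2, beta=1.0))  # equals np.minimum(q1, q2): [1. 1.]
print(pessimistic_q(q1, q2, beta=2.5))  # more pessimistic: [0.25 -0.5]
```

For two estimates, σ_Q is half the absolute gap between them, so μ_Q − σ_Q lands exactly on the smaller value; raising β simply pushes the target further below the mean.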

4. EXPERIMENTS

4.1. SETUP

To focus on policy optimization, we implement iterative amortized policy optimization using the soft actor-critic (SAC) setup described by Haarnoja et al. (2018c). This uses two Q-networks, a uniform action prior, p_θ(a|s) = U(−1, 1), and a tuning scheme for the temperature, α. In our experiments, "direct" refers to the direct amortization employed in SAC, i.e. a direct policy network, and "iterative" refers to iterative amortization. Both approaches use the same network architecture, adjusting only the number of inputs and outputs to accommodate gradients, current policy estimates, and gated updates (Sec. 3.1). Unless otherwise stated, we use 5 iterations per time step for iterative amortization, following Marino et al. (2018b). For details, refer to Appendix A and Haarnoja et al. (2018b; c).

4.2.2. PERFORMANCE COMPARISON

We evaluate iterative amortized policy optimization on MuJoCo (Todorov et al., 2012) continuous control tasks from OpenAI gym (Brockman et al., 2016). In Figure 5, we compare the cumulative reward of direct and iterative amortized policy optimization across environments. Each curve shows the mean and ± standard deviation of 5 random seeds. In all cases, iterative amortized policy optimization matches or outperforms the baseline direct method, both in sample efficiency and final performance. Across environments, iterative amortization also yields more consistent performance.

4.2.3. DECREASED AMORTIZATION GAP

To evaluate policy optimization, we estimate per-step amortization gaps using the experiments from Figure 5, performing additional iterations of gradient ascent on J, w.r.t. the policy parameters, λ ≡ [μ, σ] (see Appendix A.3). We also evaluate the iterative agents trained with 5 iterations for an additional 5 amortized iterations. Results are shown in Figure 6. We emphasize that it is challenging to directly compare amortization gaps across optimization schemes, as these involve different value functions, and therefore different objectives. Likewise, we estimate the amortization gap using the learned Q-value networks, which may be biased (Figure 3). Nevertheless, we find that iterative amortized policy optimization achieves, on average, lower amortization gaps than direct amortization across all environments. Further amortized iterations at evaluation yield further estimated improvement, demonstrating generalization. However, we note that the amortization gaps are relatively small compared to the estimated discounted objective. Accordingly, when we evaluate the more fully optimized policies in the environment, we do not observe a noticeable increase in performance (see Appendix B.4). This demonstrates that policy suboptimality is not a significant concern for evaluation. Rather, improved policy optimization is helpful for training. This allows the agent to collect data where value estimates are highest and ultimately improve these value estimates. Indeed, when we train iterative amortization while varying the policy optimization iterations per step (Section 4.2.4), we observe that the estimated amortization gap again decreases with increasing iterations, but now with a corresponding increase in performance (Figure B.6). Thus, reducing the amortization gap only indirectly improves task performance by improving training, with the relationship depending on the Q-value estimator and other factors.
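The gap-estimation procedure (detailed in Appendix A.3) amounts to gradient ascent on J w.r.t. λ, initialized from the amortized estimate. Below is a minimal sketch on a toy objective, using plain gradient ascent with finite-difference gradients; the paper uses Adam with automatic differentiation and a learned value estimate:

```python
import numpy as np

def refine(J, lam0, lr=5e-3, steps=100, eps=1e-5):
    """Refine policy parameters lambda by gradient ascent on J, starting from
    the amortized estimate lam0; returns the refined lambda and the estimated
    amortization gap J(lam) - J(lam0)."""
    lam = np.array(lam0, dtype=float)
    for _ in range(steps):
        g = np.zeros_like(lam)
        for i in range(lam.size):
            d = np.zeros_like(lam)
            d[i] = eps
            g[i] = (J(lam + d) - J(lam - d)) / (2.0 * eps)
        lam = lam + lr * g
    return lam, J(lam) - J(lam0)

# Toy stand-in for the variational objective, maximized at lambda = 1.
# Larger lr and more steps than the paper's defaults, for fast convergence here.
J = lambda lam: -float(np.sum((lam - 1.0) ** 2))
lam_hat, gap = refine(J, np.zeros(4), lr=0.1, steps=200)
# gap is approximately J(1) - J(0) = 0 - (-4) = 4
```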

4.2.4. VARYING ITERATIONS

Direct amortized policy optimization is restricted to a single forward pass through the network. Iterative amortization, in contrast, is capable of improving during policy optimization with additional computation time. To demonstrate this capability, we train iterative amortized policy optimization while varying the number of iterations in {1, 2, 5}. In Figure 7, we see that increasing the number of amortized optimization iterations generally improves sample efficiency (Walker2d-v2), asymptotic performance (Ant-v2), or both. Thus, the quality of the policy optimizer can play a significant role in determining performance through training.

4.2.5. ITERATIVE AMORTIZATION WITH MODEL-BASED VALUE ESTIMATES

While our analysis has centered on the model-free setting, iterative amortized policy optimization can also be applied to model-based value estimates. As model-based RL remains an active research area (Janner et al., 2019), we provide a proof-of-concept in this setting, using a learned deterministic model on HalfCheetah-v2 (see Appendix A.5). As shown in Figure 8a, iterative amortization outperforms direct amortization in this setting. Iterative amortization refines planned trajectories, shown for a single state dimension in Figure 8b, yielding corresponding improvements (Figure 8c). Further, because we are learning an iterative policy optimizer, we can zero-shot transfer a policy optimizer trained with a model-free value estimator to a model-based value estimator (Figure 8d). This is not possible with a direct amortized optimizer, which does not use value estimates online during policy optimization. Iterative amortization is capable of generalizing to new value estimates, instantly incorporating updated value estimates in policy optimization. This demonstrates and highlights the opportunity for improving model-based planning through iterative amortization.

5. DISCUSSION

We have introduced iterative amortized policy optimization, a flexible and powerful policy optimization technique. Using the MuJoCo continuous control suite, we have demonstrated improved performance over direct amortization with both model-based and model-free value estimates. Iterative amortization provides a drop-in replacement and improvement over direct policy networks in deep RL. Although iterative amortized policy optimization requires additional computation, this could be combined with some form of adaptive computation time (Graves, 2016; Figurnov et al., 2018), gauging the required iterations. Likewise, efficiency depends, in part, on the policy initialization, which could be improved by learning the action prior, p_θ(a|s) (Abdolmaleki et al., 2018). The power of iterative amortization is in using the value estimate during policy optimization to iteratively improve the policy online. This is a form of negative feedback control (Astrom & Murray, 2008), using errors to guide policy optimization. Beyond providing a more powerful optimizer, we are hopeful that iterative amortized policy optimization, by using online feedback, will enable a range of improved RL algorithms, capable of instantly adapting to different value estimates, as shown in Figure 8d.

A EXPERIMENT DETAILS

A.1 2D PLOTS

In Figures 1 and 2, we plot the estimated variational objective, J, as a function of two dimensions of the policy mean, µ. To create these plots, we first perform policy optimization (direct amortization in Figure 1 and iterative amortization in Figure 2), estimating the policy mean and variance. This is performed using on-policy trajectories from evaluation episodes (for a direct agent in Figure 1 and an iterative agent in Figure 2). While holding all other dimensions of the policy constant, we then estimate the variational objective while varying two dimensions of the mean (1 & 3 in Figure 1 and 2 & 6 in Figure 2). Iterative amortization is additionally performed while preventing any updates to the constant dimensions. Even in this restricted setting, iterative amortization is capable of optimizing the policy. Additional 2D plots comparing direct vs. iterative amortization on other environments are shown in Figure 14, where we see similar trends.
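The grid-evaluation procedure above can be sketched as follows. The quadratic objective, the probed dimensions, and the grid range are toy stand-ins for the learned objective J and the actual action dimensions; they are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def objective_slice(J, mu, dims, grid):
    """Evaluate the objective J over a 2D grid of two policy-mean
    dimensions, holding all other dimensions of mu fixed."""
    d0, d1 = dims
    surface = np.empty((len(grid), len(grid)))
    for i, x in enumerate(grid):
        for j, y in enumerate(grid):
            mu_probe = mu.copy()
            mu_probe[d0], mu_probe[d1] = x, y
            surface[i, j] = J(mu_probe)
    return surface

# Toy objective: a quadratic bowl standing in for the estimated J.
J = lambda mu: -np.sum((mu - 0.5) ** 2)
mu = np.zeros(6)                      # stand-in for the amortized mean
grid = np.linspace(-1.0, 1.0, 21)
surface = objective_slice(J, mu, dims=(1, 3), grid=grid)
```

Plotting `surface` as a contour map over `grid` × `grid` reproduces the kind of 2D slice shown in Figures 1 and 2.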

A.2 VALUE BIAS ESTIMATION

We estimate the bias in the Q-value estimator using a similar procedure as Fujimoto et al. (2018), comparing the estimate from the Q-networks, Q̂_π, with a Monte Carlo estimate of the true future objective in the actual environment, Q_π, over a set of state-action pairs. To enable comparison across setups, we collect 100 state-action pairs using a uniform random policy, then evaluate the estimator's bias, E_{s,a}[Q̂_π − Q_π], throughout training. To obtain the Monte Carlo estimate of Q_π, we use 100 action samples, which are propagated through all future time steps. The result is discounted using the same discount factor as used during training, γ = 0.99, as well as the same Lagrange multiplier, α. Figure 3 shows the mean ± standard deviation across the 100 state-action pairs.
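The Monte Carlo side of this estimate can be sketched as below. The environment, policy, and critic value here are toy stand-ins (the constant-reward check and the hypothetical critic estimate of 90 are illustrative assumptions, not values from the paper):

```python
import numpy as np

def mc_q_estimate(env_step, reward_fn, policy, s, a, gamma=0.99,
                  n_samples=100, horizon=200):
    """Monte Carlo estimate of Q_pi(s, a): the mean discounted return
    over sampled trajectories propagated through the environment."""
    returns = []
    for _ in range(n_samples):
        state, action, ret, disc = s, a, 0.0, 1.0
        for _ in range(horizon):
            ret += disc * reward_fn(state, action)
            state = env_step(state, action)
            action = policy(state)
            disc *= gamma
        returns.append(ret)
    return float(np.mean(returns))

# Toy check: constant reward 1 gives Q = (1 - gamma^H) / (1 - gamma).
q_mc = mc_q_estimate(lambda s, a: s, lambda s, a: 1.0, lambda s: 0.0,
                     s=0.0, a=0.0, gamma=0.99, n_samples=1, horizon=200)
bias = 90.0 - q_mc  # bias of a hypothetical critic estimate Qhat = 90
```

Averaging `bias` over the collected state-action pairs yields the quantity plotted in Figure 3.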

A.3 AMORTIZATION GAP ESTIMATION

Calculating the amortization gap in the RL setting is challenging, as properly evaluating the variational objective, J, involves unrolling the environment. During training, the objective is estimated using a set of Q-networks and/or a learned model. However, finding the optimal policy distribution under these learned value estimates may not accurately reflect the amortization gap, as the value estimator likely contains positive bias (Figure 3). Because the value estimator is typically locally accurate near the current policy, we estimate the amortization gap by performing gradient ascent on J w.r.t. the policy distribution parameters, λ, initializing from the amortized estimate (from π_φ). This is a form of semi-amortized variational inference (Hjelm et al., 2016; Krishnan et al., 2018; Kim et al., 2018). We use the Adam optimizer (Kingma & Ba, 2014) with a learning rate of 5 × 10⁻³ for 100 gradient steps, which we found consistently converged, yielding the estimated optimized policy, π̂. We estimate the gap using 100 on-policy states, calculating J(θ, π̂) − J(θ, π), i.e., the improvement in the objective after gradient-based optimization. Figure 6 shows the resulting mean ± standard deviation. We also run iterative amortized policy optimization for an additional 5 iterations during this evaluation, which empirically yields a further decrease in the estimated amortization gap.

A.4 HYPERPARAMETERS

Our setup follows that of soft actor-critic (SAC) (Haarnoja et al., 2018b; c), using a uniform action prior, i.e. entropy regularization, and two Q-networks (Fujimoto et al., 2018). Off-policy training is performed using a replay buffer (Lin, 1992; Mnih et al., 2013). Training hyperparameters are given in Table 6.

Temperature  Following Haarnoja et al. (2018c), we adjust the temperature, α, to maintain a specified entropy constraint, with target entropy −|A|, where |A| is the size of the action space, i.e. the dimensionality.
Policy  We use the same network architecture (number of layers, units/layer, non-linearity) for both direct and iterative amortized policy optimizers (Table 2). Each policy network results in Gaussian distribution parameters, and we apply a tanh transform to ensure a ∈ [−1, 1] (Haarnoja et al., 2018b). In the case of a Gaussian, the distribution parameters are λ = [µ, σ]. The inputs and outputs of each optimizer form are given in Table 1. Again, δ and ω are respectively the update and gate of the iterative amortized optimizer (Eq. 11), each of which are defined for both µ and σ. Following Marino et al. (2018b), we apply layer normalization (Ba et al., 2016) individually to each of the inputs to iterative amortized optimizers. We initialize iterative amortization with µ = 0 and σ = 1; however, these could be initialized from a learned action prior (Marino et al., 2018a).

Figure 9: Amortized Optimizers. Diagrams of (a) direct and (b) iterative amortized policy optimization. As in Figure 1, larger circles represent probability distributions, and smaller red circles represent terms in the objective. Red dotted arrows represent gradients. In addition to the state, s_t, iterative amortization uses the current policy distribution estimate, λ, and the policy optimization gradient, ∇_λ J, to iteratively optimize J. Like direct amortization, the optimizer network parameters, φ, are updated using ∇_φ J. This generally requires some form of stochastic gradient estimation to differentiate through a_t ∼ π(a_t|s_t, O; λ).

Q-value

We investigated two Q-value network architectures. Architecture A (Table 3) is the same as that from Haarnoja et al. (2018b). Architecture B (Table 4) is a wider, deeper network with highway connectivity (Srivastava et al., 2015), layer normalization (Ba et al., 2016), and ELU nonlinearities (Clevert et al., 2015). We initially compared each Q-value network architecture using each policy optimizer on each environment, as shown in Figure 10. The results in Figure 5 were obtained using the better-performing architecture in each case, given in Table 5. As in Fujimoto et al. (2018), we use an ensemble of 2 separate Q-networks in each experiment.

Value Pessimism (β)  As discussed in Section 3.2.2, the increased flexibility of iterative amortization allows it to potentially exploit inaccurate value estimates. We increased the pessimism hyperparameter, β, to further penalize variance in the value estimate. Experiments with direct amortization use the default β = 1 in all environments, as we did not find that increasing β helped in this setup. For iterative amortization, we use β = 1.5 on Hopper-v2 and β = 2.5 on all other environments. This is only applied during training; while collecting data in the environment, we use β = 1 so as not to overly penalize exploration.

A.5 MODEL-BASED VALUE ESTIMATION

The state transition network predicts the next state as a residual update, µ_{s_{t+1}} = s_t + ∆_{s_t}.

Model Training  The state transition and reward networks are both trained using maximum log-likelihood training, using data examples from the replay buffer. Training is performed at the same frequency as policy and Q-network training, using the same batch size (256) and network optimizer. However, we perform 10³ updates at the beginning of training, using the initial random steps, in order to start with a reasonable model estimate.

Value Estimation  To estimate Q-values, we combine short model rollouts with the model-free estimates from the Q-networks. Specifically, we unroll the model and policy, obtaining state, reward, and policy estimates at current and future time steps.
We then apply the Q-value networks to these future state-action estimates. Future rewards and value estimates are combined using the Retrace estimator (Munos et al., 2016). Denoting the estimate from the Q-network as Q_ψ(s, a) and the reward estimate as r(s, a), we calculate the Q-value estimate at the current time step as

Q̂_π(s_t, a_t) = Q_ψ(s_t, a_t) + E[ Σ_{t'=t}^{t+h} γ^{t'−t} λ^{t'−t} ( r(s_{t'}, a_{t'}) + γ V_ψ(s_{t'+1}) − Q_ψ(s_{t'}, a_{t'}) ) ],   (13)

where λ is an exponential weighting factor, h is the rollout horizon, and the expectation is evaluated under the model and policy. In the variational RL setting, the state-value, V_π(s), is

V_π(s) = E_π[ Q_π(s, a) − α log ( π(a|s, O) / p_θ(a|s) ) ].   (14)

In Eq. 13, we approximate V_π by using the Q-network to approximate Q_π in Eq. 14, yielding V_ψ(s). Finally, to ensure consistency between the model and the Q-value networks, we use the model-based estimate from Eq. 13 to provide target values for the Q-networks, as in Janner et al. (2019).

Future Policy Estimates  Evaluating the expectation in Eq. 13 requires estimates of π at future time steps. This is straightforward with direct amortization, which employs a feedforward policy; with iterative amortization, however, this entails recursively applying an iterative optimization procedure. Alternatively, we could use the prior, p_θ(a|s), at future time steps, but this does not apply in the max-entropy setting, where the prior is uniform. For computational efficiency, we instead learn a separate direct (amortized) policy for model-based rollouts. That is, with iterative amortization, we create a separate direct network using the same hyperparameters from Table 2. This network distills iterative amortization into a direct amortized optimizer through the KL divergence, D_KL(π_it. || π_dir.). Rollout policy networks are common in model-based RL (Silver et al., 2016; Piché et al., 2019).
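The Retrace-style estimate in Eq. 13 can be sketched with scalar stand-ins, omitting the expectation over model and policy samples; the rollout values below are illustrative, not from the paper.

```python
def retrace_q(q0, rewards, values, qs, gamma=0.99, lam=0.9):
    """Model-based Q estimate in the style of Eq. 13: the Q-network
    estimate at the current step plus discounted, lambda-weighted TD
    corrections accumulated along an h-step model rollout."""
    estimate = q0
    for k, (r, v_next, q) in enumerate(zip(rewards, values, qs)):
        estimate += (gamma * lam) ** k * (r + gamma * v_next - q)
    return estimate

# One-step rollout; qs[0] is the same Q_psi(s_t, a_t) as q0.
q_hat = retrace_q(q0=1.0, rewards=[1.0], values=[2.0], qs=[1.0])
```

When every TD correction r + γ V_ψ − Q_ψ is zero, the estimate reduces to the model-free Q-network value, as expected.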

B.1 IMPROVEMENT PER STEP

In Figure 12 , we plot the average improvement in the variational objective per step throughout training, with each curve showing a different random seed. That is, each plot shows the average change in the variational objective after running 5 iterations of iterative amortized policy optimization. With the exception of HalfCheetah-v2, the improvement remains relatively constant throughout training and consistent across seeds.

B.2 COMPARISON WITH ITERATIVE OPTIMIZERS

Iterative amortized policy optimization obtains the accuracy benefits of iterative optimization while retaining the efficiency benefits of amortization. In Section 4, we compared the accuracy of iterative and direct amortization, finding that iterative amortization yields reduced amortization gaps (Figure 6) and improved performance (Figure 5). In this section, we compare iterative amortization with two popular iterative optimizers: Adam (Kingma & Ba, 2014), a gradient-based optimizer, and the cross-entropy method (CEM) (Rubinstein & Kroese, 2013), a gradient-free optimizer.

[Figure: diagram of model-based value estimation, combining the learned model (s_{t+1}|s_t, a_t and r(s_t, a_t)) with the policy and value networks via the Retrace estimator (Munos et al., 2016), given in Eq. 13.]

[Figure: comparison with Adam (Kingma & Ba, 2014) and CEM (Rubinstein & Kroese, 2013). These iterative optimizers require over an order of magnitude more iterations to reach comparable performance with iterative amortization, making them impractical in many applications.]
M P 7 M R t B 1 M G Y K b J h N Z e f 0 2 D E D O t T G Z Y x 0 y v 6 c y J i y d q I i 1 6 k Y j u 3 v W k H + V e u m O L w I M x E n K U L M v w 8 N U 0 l R 0 + I H d C A M c J Q T B x g 3 w m m l f M w M 4 + g + N b f p I Q i z Q l y x Z u 5 8 p P L K M f 3 J F D I S V K / z f U y O t L s w V v / Q g i 8 O J B Z S 9 1 V n S V 5 x l g S / D V g E 7 X o t O K v V 7 x v V q + u Z O W V y S I 7 I C Q n I O b k i t 6 R J W o Q T Q 9 7 I O / n w V r x T r + 4 1 v l u 9 0 m z m g M y F d / k F Y 4 r H + Q = = < / l a t To compare the accuracy and efficiency of the optimizers, we collect 100 states for each seed in each environment from the model-free experiments in Section 4.2.2. For each optimizer, we optimize the variational objective, J , starting from the same initialization. Tuning the step size, we found that 0.01 yielded the steepest improvement without diverging for both Adam and CEM. Gradients are evaluated with 10 action samples. For CEM, we sample 100 actions and fit a Gaussian mean and variance to the top 10 samples. This is comparable with QT-Opt (Kalashnikov et al., 2018) , which draws 64 samples and retains the top 6 samples. Each plot shows the optimization objective over two dimensions of the policy mean, µ. This optimization surface contains the value function trained using a direct amortized policy. The black diamond, denoting the estimate of this direct policy, is generally near-optimal, but does not match the optimal estimate (red star). Iterative amortized optimizers are capable of generalizing to these surfaces in each case, reaching optimal policy estimates. The results, averaged across states and random seeds, are shown in Figure 13 . CEM (gradientfree) is less efficient than Adam (gradient-based), which is unsurprising, especially considering that Adam effectively approximates higher-order curvature through momentum terms. 
However, Adam and CEM both require over an order of magnitude more iterations to reach comparable performance with iterative amortization. While iterative amortized policy optimization does not always obtain asymptotically optimal estimates, we note that these networks were trained with only 5 iterations, yet they continue to improve and remain stable far beyond this limit. Finally, comparing per-iteration wall-clock time for each optimizer, iterative amortization is only roughly 1.25× slower than CEM and 1.15× slower than Adam; given the order-of-magnitude difference in the number of iterations required, iterative amortization remains substantially more efficient overall.
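The CEM baseline described above (sample 100 actions, refit a Gaussian to the top 10 by objective value) can be sketched as follows. The quadratic objective here is a toy stand-in for the variational objective J, which in the experiments is given by the learned Q-value and regularization terms; the iteration count and dimensions are illustrative.

```python
import numpy as np

np.random.seed(0)

def cem_policy_optimization(objective, act_dim, n_samples=100, n_elite=10, n_iters=25):
    """Cross-entropy method (CEM) over a Gaussian action distribution.

    Mirrors the setup above: draw n_samples actions, score them with the
    objective, and refit the Gaussian mean and variance to the n_elite best.
    """
    mean, std = np.zeros(act_dim), np.ones(act_dim)
    for _ in range(n_iters):
        samples = mean + std * np.random.randn(n_samples, act_dim)
        scores = np.array([objective(a) for a in samples])
        elite = samples[np.argsort(scores)[-n_elite:]]  # top n_elite by score
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean, std

# Toy stand-in for J: a concave quadratic whose optimum, (0.5, -0.3), plays
# the role of the optimal policy mean.
target = np.array([0.5, -0.3])
mean, std = cem_policy_optimization(lambda a: -np.sum((a - target) ** 2), act_dim=2)
```

Being gradient-free, each CEM update uses only the ranking of the sampled actions, which is why it typically needs many more iterations than gradient-based optimizers on smooth objectives.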

B.3 ADDITIONAL 2D OPTIMIZATION PLOTS

In Figure 1, we provided an example of the suboptimal optimization resulting from direct amortization on the Hopper-v2 environment. We also demonstrated that iterative amortization is capable of automatically generalizing to this optimization surface, outperforming the direct optimizer. To show that this is a general phenomenon, in Figure 14, we present examples of corresponding 2D plots for each of the other environments considered in this paper. As before, we see that direct amortization is near-optimal but, with the exception of HalfCheetah-v2, does not match the optimal estimate. In contrast, iterative amortization is able to find the optimal estimate, again generalizing to these unseen optimization surfaces.

B.4 ADDITIONAL OPTIMIZATION & THE AMORTIZATION GAP

In Section 4, we compared the performance of direct and iterative amortization, as well as their estimated amortization gaps. In this section, we provide additional results analyzing the relationship between policy optimization and performance in the actual environment. As we have emphasized previously (see Section 3.2.2), this relationship is complex, as optimizing an inaccurate Q-value estimate does not improve task performance. The amortization gap quantifies the suboptimality, in the objective J, of the policy estimate. As described in Section A.3, we estimate the optimized policy by performing additional gradient-based optimization on the policy distribution parameters (mean and variance). However, when we deploy this optimized policy for evaluation in the actual environment, as shown for direct amortization in Figure 15, we do not observe a noticeable difference in performance. Thus, while amortization may find suboptimal policy estimates, the actual difference in the objective is either too small or too inaccurate to affect performance at test time.

Likewise, in Section 4.2.2, we observed that using additional amortized iterations during evaluation further decreased the amortization gap for iterative amortization. However, when we deploy this more fully optimized policy in the environment, as shown in Figure 16, we do not generally observe a corresponding performance improvement. In fact, on HalfCheetah-v2 and Walker2d-v2, we observe a slight decrease in performance. This further highlights the fact that additional policy optimization may exploit inaccurate Q-value estimates. Importantly, in Figures 15 and 16, the additional policy optimization is only performed for evaluation. That is, the data collected with the more fully optimized policy is not used for training and therefore cannot be used to correct the inaccurate value estimates.
Thus, while more accurate policy optimization, as quantified by the amortization gap, may not substantially affect evaluation performance, it does play a significant role in improving training. This was shown in Section 4.2.4, where we observed that training with additional iterative amortized policy optimization iterations, i.e., a more flexible policy optimizer, generally results in improved performance. By using a more accurate (or exploitative) policy for data collection, the agent is better able to evaluate its Q-value estimates, a benefit that accrues over the course of training. This trend is shown for HalfCheetah-v2 in Figure 17, where we observed the largest difference in performance across numbers of iterations. We generally observe that increasing the number of iterations during training improves performance and decreases the amortization gap. Interestingly, when performance dips for the agents trained with 2 iterations, there is a corresponding increase in the amortization gap.
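The gap-estimation procedure described above (further gradient-based optimization directly on the distribution parameters, then measuring the improvement in J) can be sketched as follows. The concave quadratic objective, its gradient, and the slightly-off "amortized" estimate are all hypothetical stand-ins; the paper also optimizes the variance, which is omitted here for brevity.

```python
import numpy as np

def amortization_gap(J_fn, grad_J, mu_amortized, n_steps=200, lr=0.05):
    """Estimate the amortization gap for an amortized policy mean estimate.

    Starting from the amortized estimate, run additional gradient ascent
    directly on the policy mean, then report the improvement in J.
    """
    mu = mu_amortized.astype(float).copy()
    for _ in range(n_steps):
        mu += lr * grad_J(mu)  # gradient ascent on the objective J
    return J_fn(mu) - J_fn(mu_amortized), mu

# Toy concave objective with optimum mu* = (0.2, -0.4); the amortized
# estimate is near-optimal but slightly off.
mu_star = np.array([0.2, -0.4])
J_fn = lambda mu: -np.sum((mu - mu_star) ** 2)
grad_J = lambda mu: -2.0 * (mu - mu_star)
gap, mu_opt = amortization_gap(J_fn, grad_J, np.array([0.4, -0.2]))
# A positive gap indicates the amortized estimate was suboptimal in J.
```

Note that a positive gap in J need not translate into an environment-performance gap, which is exactly the point of the analysis above.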

B.5 MULTIPLE POLICY ESTIMATES

As discussed in Section 3.2.1, iterative amortization has the added benefit of potentially obtaining multiple policy distribution estimates, due to stochasticity in the optimization procedure (as well as the initialization). In contrast, unless latent variables or normalizing flows are incorporated into the policy, direct amortization is limited to a single policy estimate. To estimate the degree to which iterative amortization obtains multiple policy estimates during training, we perform two separate runs of policy optimization per state and evaluate the L2 distance between the means of these policy estimates (after applying the tanh). Note that in MuJoCo action spaces, which are bounded to [-1, 1] in each dimension, the maximum distance is 2√|A|, where |A| is the size of the action space. We average the policy mean distance over 100 states and all experiment seeds, with the results shown in Figure 18. In all environments, we see that the average distance initially increases during training, then remains relatively constant for Hopper-v2 and HalfCheetah-v2 and decreases slightly for Walker2d-v2 and Ant-v2. Note that the distance for direct amortization would be exactly 0 throughout. This indicates that iterative amortization does indeed obtain multiple policy estimates, maintaining some portion of multi-estimate policies throughout training.
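The distance metric above is straightforward to compute; the tanh squashing bounds each action dimension to [-1, 1], which gives the 2√|A| maximum. A minimal sketch, with arbitrary pre-squash means for illustration:

```python
import numpy as np

def policy_mean_distance(mu_1, mu_2):
    """L2 distance between two policy means after tanh squashing.

    tanh bounds each action dimension to [-1, 1], so for an action space of
    size |A| the distance is at most 2 * sqrt(|A|).
    """
    return np.linalg.norm(np.tanh(mu_1) - np.tanh(mu_2))

# Pre-squash means that saturate tanh in opposite directions approach the
# maximum distance, 2 * sqrt(2) for |A| = 2.
d = policy_mean_distance(np.full(2, 10.0), np.full(2, -10.0))
```

Averaging this quantity over states and seeds, as in Figure 18, gives a scalar summary of how multi-modal the obtained policy estimates are.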



Figure 2: Estimating Multiple Policy Modes. Unlike direct amortization, which is restricted to a single estimate, iterative amortization can effectively sample from multiple high-value action modes. This is shown for a particular state in Ant-v2, with multiple optimization runs across two action dimensions (Left). Each square denotes an initialization. The optimizer finds both modes, with the densities plotted on the Right. This capability provides increased flexibility in action exploration.

Figure 3: Mitigating Value Overestimation. Using the same value estimation setup (β = 1 in Eq. 12), shown on Ant-v2, iterative amortization results in (a) higher value overestimation bias (closer to zero is better) and (b) a more rapidly changing policy, as compared with direct amortization. Increasing β helps to mitigate these issues by further penalizing variance in the value estimate.

Figure 4: Policy Optimization. Visualization over time steps of (a) one dimension of the policy distribution and (b) the improvement in the objective, ∆J , across policy optimization iterations. (c) Comparison of iterative amortization with Adam (Kingma & Ba, 2014) (gradient-based) and CEM (Rubinstein & Kroese, 2013) (gradient-free). Iterative amortization is substantially more efficient.

Figure 6: Decreased Amortization Gap. Estimated amortization gaps per step for direct and iterative amortized policy optimization. Iterative amortization achieves comparable or lower gaps across environments. Gaps are estimated using stochastic gradient-based optimization over 100 random states. Curves show the mean and ± standard deviation over 5 random seeds.

Figure 8: Optimizing Model-Based Value Estimates. (a) Performance comparison of direct and iterative amortization using model-based value estimates. (b) Planned trajectories over policy optimization iterations. (c) The corresponding estimated objective increases over iterations. (d) Zero-shot transfer of iterative amortization from model-free (MF) to model-based (MB) estimates.

Figure 10: Value Architecture Comparison. Plots show performance for ≥ 3 seeds for each value architecture (A or B) for each policy optimization technique (direct or iterative). Note: results for iterative + B on Hopper-v2 were obtained with an overly pessimistic value estimate (β = 2.5 rather than β = 1.5) and are consequently worse.

Figure 12: Per-Step Improvement. Each plot shows the per-step improvement in the estimated variational RL objective, J , throughout training resulting from iterative amortized policy optimization. Each curve denotes a different random seed.

Figure 14: 2D Plots. Each plot shows the optimization objective over two dimensions of the policy mean, µ. This optimization surface contains the value function trained using a direct amortized policy. The black diamond, denoting the estimate of this direct policy, is generally near-optimal, but does not match the optimal estimate (red star). Iterative amortized optimizers are capable of generalizing to these surfaces in each case, reaching optimal policy estimates.

Figure 15: Test-Time Gradient-Based Optimization. Each plot compares the performance of direct amortization vs. direct amortization with 50 additional gradient-based policy optimization iterations. Note that this additional optimization is only performed at test time.

Figure 18: Distance Between Policy Means. Each plot shows the L2 distance between the estimated policy means from two separate policy optimization runs at a given state. Results are averaged over 100 on-policy states at each point in training and over experiment seeds.

Policy Inputs & Outputs.

Policy Networks.

Q-value Network Architecture A.

Q-value Network Architecture B.

Q-value Network Architecture by Environment.

Training Hyperparameters.

Model. We use separate networks to estimate the state transition dynamics, p_env(s_{t+1}|s_t, a_t), and the reward function, r(s_t, a_t). The network architecture is given in Table 7. Each network outputs the mean of a Gaussian distribution; the standard deviation is a separate, learnable parameter. The reward network directly outputs the mean estimate, whereas the state transition network outputs a residual estimate, ∆_{s_t}, yielding an updated mean estimate through:

µ_{s_{t+1}} = s_t + ∆_{s_t}
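The residual parameterization described above can be sketched as follows; the linear map standing in for the network, along with the state and action dimensions, is purely illustrative and not the architecture from Table 7.

```python
import numpy as np

def transition_mean(state, action, residual_net):
    """Mean of the Gaussian state transition estimate.

    The state transition network predicts a residual, Delta_s, rather than
    the next state directly; the mean is then s_t + Delta_s. (The standard
    deviation is a separate learnable parameter, omitted here.)
    """
    delta = residual_net(np.concatenate([state, action]))
    return state + delta  # residual update: mu = s_t + Delta_s

# Illustrative stand-in for the network: a fixed linear map.
rng = np.random.default_rng(0)
W = 0.1 * rng.normal(size=(3, 5))  # state dim 3, action dim 2 (arbitrary)
residual_net = lambda x: W @ x

mu_next = transition_mean(np.zeros(3), np.ones(2), residual_net)
```

Predicting a residual centers the network's output near zero for slowly changing states, which typically eases learning of the transition dynamics.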

Model Network Architectures.

Model-Based Hyperparameters.

Figure 11: Model-Based Value Estimation. Diagram of model-based value estimation (shown with direct amortization). For clarity, the diagram is shown without the policy prior network, p_θ(a_t|s_t). The model consists of a deterministic reward estimate, r(s_t, a_t), (green diamond) and a state estimate, s_{t+1}|s_t, a_t, (orange diamond). The model is unrolled over a horizon, H, and the Q-value is estimated using the Retrace estimator (Munos et al., 2016), given in Eq. 13.

