GREEDY ACTOR-CRITIC: A NEW CONDITIONAL CROSS-ENTROPY METHOD FOR POLICY IMPROVEMENT

Abstract

Many policy gradient methods are variants of Actor-Critic (AC), where a value function (critic) is learned to facilitate updating the parameterized policy (actor). The update to the actor involves a log-likelihood update weighted by the action-values, with the addition of entropy regularization for soft variants. In this work, we explore an alternative update for the actor, based on an extension of the cross-entropy method (CEM) to condition on inputs (states). The idea is to start with a broader policy and slowly concentrate around maximally valued actions, using a maximum likelihood update towards actions in the top percentile per state. The speed of this concentration is controlled by a proposal policy that concentrates at a slower rate than the actor. We first provide a policy improvement result in an idealized setting, and then prove that our conditional CEM (CCEM) strategy tracks a CEM update per state, even with changing action-values. We empirically show that our GreedyAC algorithm, which uses CCEM for the actor update, performs better than Soft Actor-Critic and is much less sensitive to entropy regularization.

1. INTRODUCTION

Many policy optimization strategies update the policy towards the Boltzmann policy. This strategy was popularized by Soft Q-Learning (Haarnoja et al., 2017) and Soft Actor-Critic (SAC) (Haarnoja et al., 2018a), but has a long history in reinforcement learning (Kober & Peters, 2008; Neumann, 2011). Recent work (Vieillard et al., 2020a; Chan et al., 2021) has highlighted that an even broader variety of policy optimization methods can be seen as optimizing either a forward or reverse KL divergence to the Boltzmann policy, as in SAC. In fact, even the original Actor-Critic (AC) update (Sutton, 1984) can be seen as optimizing a reverse KL divergence, with zero entropy.

The use of the Boltzmann policy underlies many methods for good reason: it guarantees policy improvement (Haarnoja et al., 2018a). More specifically, this is the case when learning entropy-regularized action-values $Q^\pi_\tau$ for a policy $\pi$ with regularization parameter $\tau > 0$. The Boltzmann policy for a state is proportional to $\exp(Q^\pi_\tau(s, a)\,\tau^{-1})$. The level of emphasis on high-valued actions is controlled by $\tau$: the larger the entropy level $\tau$, the less the probabilities in the Boltzmann policy are peaked around maximally valued actions.

This choice, however, has several limitations. The policy improvement guarantee is for the entropy-regularized MDP, rather than the original MDP. Entropy regularization is used to encourage exploration (Ziebart et al., 2008; Mei et al., 2019) and improve the optimization surface (Ahmed et al., 2019; Shani et al., 2020), resulting in a trade-off between improving the learning process and converging to the optimal policy. Additionally, SAC and other methods are well-known to be sensitive to the entropy regularization parameter (Pourchot & Sigaud, 2019).
Prior work has explored optimizing entropy during learning (Haarnoja et al., 2018b); however, this optimization introduces yet another hyperparameter to tune, and this approach may be less performant than a simple grid search (see Appendix D). It is reasonable to investigate alternative policy improvement approaches that could potentially improve our actor-critic algorithms.

In this work we propose a new greedification strategy towards this goal. The basic idea is to iteratively take the top percentile of actions, ranked according to the learned action-values. The procedure slowly concentrates on the maximal action(s), across states, for the given action-values. The update itself is simple: $N \in \mathbb{N}$ actions are sampled according to a proposal policy, the actions are sorted based on the magnitude of the action-values, and the policy is updated to increase the probability of the $\lceil \rho N \rceil$ maximally valued actions for $\rho \in (0, 1)$. We call this algorithm for the actor Conditional CEM (CCEM), because it is an extension of the well-known Cross-Entropy Method (CEM) (Rubinstein, 1999) to condition on inputs. We leverage theory for CEM to validate that our algorithm concentrates on maximally valued actions across states over time.

We introduce GreedyAC, a new AC algorithm that uses CCEM for the actor. GreedyAC has several advantages over using Boltzmann greedification. First, we show that our new greedification operator ensures a policy improvement for the original MDP, rather than a different entropy-regularized MDP. Second, we can still leverage entropy to prevent policy collapse, but only incorporate it into the proposal policy. This ensures the agent considers potentially optimal actions for longer, but does not skew the actor.
In fact, it is possible to decouple the role of entropy for exploration and policy collapse within GreedyAC: the actor could have a small amount of entropy to encourage exploration, and the proposal policy a higher level of entropy to avoid policy collapse. Potentially because of this decoupling, we find that GreedyAC is much less sensitive to the choice of entropy regularizer than SAC. This design of the algorithm may help it avoid getting stuck in a locally optimal action, and empirical evidence for CEM suggests it can be quite effective for this purpose (Rubinstein & Kroese, 2004). In addition to our theoretical support for CCEM, we provide an empirical investigation comparing GreedyAC, SAC, and a vanilla AC, highlighting that GreedyAC performs consistently well, even in problems like the MuJoCo environment Swimmer and pixel-based control where SAC performs poorly.

2. BACKGROUND AND PROBLEM FORMULATION

The interaction between the agent and environment is formalized by a Markov decision process $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\mathcal{P} : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, \infty)$ is the one-step state transition dynamics, $\mathcal{R} : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ is the reward function, and $\gamma \in [0, 1]$ is the discount rate. We assume an episodic problem setting, where the start state $S_0 \sim d_0$ for start state distribution $d_0 : \mathcal{S} \to [0, \infty)$ and the length of the episode $T$ is random, depending on when the agent reaches termination. At each discrete timestep $t = 1, 2, \ldots, T$, the agent finds itself in some state $S_t$ and selects an action $A_t$ drawn from its stochastic policy $\pi : \mathcal{S} \times \mathcal{A} \to [0, \infty)$. The agent then transitions to state $S_{t+1}$ according to $\mathcal{P}$ and observes a scalar reward $R_{t+1} \doteq \mathcal{R}(S_t, A_t, S_{t+1})$.

For a parameterized policy $\pi_w$ with parameters $w$, the agent attempts to maximize the objective $J(w) = \mathbb{E}_{\pi_w}\left[\sum_{t=0}^{T} \gamma^t R_{t+1}\right]$, where the expectation is according to start state distribution $d_0$, transition dynamics $\mathcal{P}$, and policy $\pi_w$. Policy gradient methods, like REINFORCE (Williams, 1992), attempt to obtain (unbiased) estimates of the gradient of this objective to directly update the policy. The difficulty is that the policy gradient is expensive to sample, because it requires sampling return trajectories from states sampled from the visitation distribution under $\pi_w$, as per the policy gradient theorem (Sutton et al., 1999). Theory papers analyze such an idealized algorithm (Kakade & Langford, 2002; Agarwal et al., 2021), but in practice this strategy is rarely used. Instead, it is much more common to (a) ignore bias in the state distribution (Thomas, 2014; Imani et al., 2018; Nota & Thomas, 2020) and (b) use biased estimates of the return, in the form of a value function critic. The action-value function $Q^\pi(s, a) \doteq \mathbb{E}_\pi\left[\sum_{k=1}^{T-t} \gamma^{k-1} R_{t+k} \mid S_t = s, A_t = a\right]$ is the expected return from a given state and action, when following policy $\pi$.
Many PG methods, specifically variants of Actor-Critic, estimate these action-values with a parameterized $Q_\theta(s, a)$, to use the update $Q_\theta(s, a)\nabla \ln \pi_w(a \mid s)$ or one with a baseline $[Q_\theta(s, a) - V(s)]\nabla \ln \pi_w(a \mid s)$, where the value function $V(s)$ is also typically learned. The state $s$ is sampled from a replay buffer, and $a \sim \pi_w(\cdot \mid s)$, for the update. There has been a flurry of work, and success, pursuing this path, including methods such as OffPAC (Degris et al., 2012b), SAC (Haarnoja et al., 2018a), SQL (Haarnoja et al., 2017), TRPO (Schulman et al., 2015) and many other variants of related ideas (Peters et al., 2010; Silver et al., 2014; Schulman et al., 2016; Lillicrap et al., 2016; Wang et al., 2017; Gu et al., 2017; Schulman et al., 2017; Abdolmaleki et al., 2018; Mei et al., 2019; Vieillard et al., 2020b).

Following close behind are unification results that make sense of this flurry of work (Tomar et al., 2020; Vieillard et al., 2020a; Chan et al., 2021; Lazić et al., 2021). They highlight that many methods include a mirror descent component (to minimize KL to the most recent policy) and an entropy-regularization component (Vieillard et al., 2020a). In particular, these methods are better thought of as (approximate) policy iteration approaches that update towards the Boltzmann policy, in some cases using a mirror descent update. The Boltzmann policy $B_\tau Q(s, a)$ for a given $Q$ is

$$B_\tau Q(s, a) = \frac{\exp(Q(s, a)\,\tau^{-1})}{\int_{\mathcal{A}} \exp(Q(s, b)\,\tau^{-1})\,db}$$

for entropy parameter $\tau$. As $\tau \to 0$, this policy puts all weight on greedy actions. As $\tau \to \infty$, all actions are weighted uniformly. This policy could be directly used as the new greedy policy. However, because it is expensive to sample from $B_\tau Q(s, a)$, typically a parameterized policy $\pi_w$ is learned to approximate $B_\tau Q(s, a)$, by minimizing a KL divergence. As the entropy goes to zero, we get an unregularized update that corresponds to the vanilla AC update (Chan et al., 2021).
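To make the role of the entropy parameter concrete, the following minimal sketch evaluates the Boltzmann policy over a small finite set of actions. The finite action set, the function name `boltzmann_policy`, and the toy action-values are assumptions made purely for illustration; in the continuous-action setting the normalizer is an integral that generally cannot be computed this way.

```python
import math


def boltzmann_policy(q_values, tau):
    """Boltzmann (softmax) probabilities over a finite action set (illustrative)."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    # Subtract the maximum before exponentiating for numerical stability;
    # this leaves the normalized probabilities unchanged.
    q_max = max(q_values)
    weights = [math.exp((q - q_max) / tau) for q in q_values]
    normalizer = sum(weights)
    return [w / normalizer for w in weights]


q = [1.0, 2.0, 3.0]                      # toy action-values, not from the paper
peaked = boltzmann_policy(q, tau=0.01)   # small tau: nearly greedy
flat = boltzmann_policy(q, tau=1000.0)   # large tau: nearly uniform
```

With a small entropy parameter almost all probability sits on the highest-valued action, while a large one yields a nearly uniform policy, matching the two limits described above.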

3. CONDITIONAL CEM

Though using the Boltzmann policy has been successful, it does have some limitations. The primary limitation is that it is sensitive to the choice of entropy (Pourchot & Sigaud, 2019; Chan et al., 2021). A natural question is what other strategies we can use for this greedification step in these approximate policy iteration algorithms, and how they compare to this common approach. We propose and motivate a new approach in this section, and then focus the paper on providing insight into its benefits and limitations, in contrast to using the Boltzmann policy.

Let us motivate our approach by describing the well-known global optimization algorithm called the Cross Entropy Method (CEM) (Rubinstein, 1999). Global optimization strategies are designed to find the global optimum of a general function $f(\beta)$ for some parameters $\beta$. For example, for parameters $\beta$ of a neural network, $f$ may be the loss function on a sample of data. An advantage of these methods is that they do not rely on gradient-based strategies, which are prone to getting stuck in local optima. Instead, they use randomized search strategies, which have optimality guarantees in some settings (Hu et al., 2012) and have been shown to be effective in practice (Peters & Schaal, 2007; Hansen et al., 2003; Szita & Lörincz, 2006; Salimans et al., 2017).

CEM maintains a distribution $p(\beta)$ over parameters $\beta$, iteratively narrowing the range of plausible solutions. The algorithm maintains a current threshold $f_t$, which slowly increases over time as it narrows on the maximal $\beta$. On iteration $t$, $N$ parameter vectors $\beta_1, \ldots, \beta_N$ are sampled from $p_t$; the algorithm only keeps $\beta^*_1, \ldots, \beta^*_h$ where $f(\beta^*_i) \ge f_t$ and discards the rest. The KL divergence is reduced between $p_t$ and this empirical distribution $\hat{I} = \{\beta^*_1, \ldots, \beta^*_h\}$, for $h \le N$. This step corresponds to increasing the likelihood of the $\beta$ in the set $\hat{I}$.
Iteratively, the distribution over parameters $p_t$ narrows around $\beta$ with higher values under $f$. To make it more likely to find the global optimum, the initial distribution $p_0$ is a wide distribution, such as a Gaussian distribution with mean zero $\mu_0 = 0$ and a diagonal covariance $\Sigma_0$ of large magnitude.
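The CEM loop described above can be sketched for a one-dimensional problem as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the distribution is a scalar Gaussian, the elite set is the top $\lceil \rho N \rceil$ samples (so the threshold $f_t$ is implicit), and the KL reduction is carried out as a full maximum likelihood refit of the mean and standard deviation. The function name `cem_maximize` and all default values are hypothetical.

```python
import math
import random
import statistics


def cem_maximize(f, mean=0.0, std=5.0, n_samples=50, rho=0.2, iterations=30, seed=0):
    """Maximize a scalar function f with a 1-D Gaussian CEM (illustrative sketch)."""
    rng = random.Random(seed)
    # Keep at least two elite points so a spread can be estimated.
    n_elite = max(2, math.ceil(rho * n_samples))
    for _ in range(iterations):
        # Sample candidates from the current distribution p_t.
        samples = [rng.gauss(mean, std) for _ in range(n_samples)]
        # Keep the highest-valued candidates; f_t is implicitly the lowest elite value.
        elite = sorted(samples, key=f, reverse=True)[:n_elite]
        # Maximum likelihood Gaussian fit to the elite set.
        mean = statistics.fmean(elite)
        std = statistics.pstdev(elite) + 1e-6  # small floor avoids a zero-width Gaussian
    return mean, std


# Toy objective with its maximum at 3.0, starting from a wide Gaussian centered at 0.
best_mean, best_std = cem_maximize(lambda x: -(x - 3.0) ** 2)
```

Starting from a wide initial distribution, the fitted Gaussian narrows around the maximizer of the toy objective over the iterations.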

Algorithm 1 Percentile Empirical Distribution(N, ρ)

    Evaluate and sort in descending order: $Q_\theta(S_t, a_{i_1}) \ge \ldots \ge Q_\theta(S_t, a_{i_N})$
    return $\hat{I}(S_t) = \{a_{i_1}, \ldots, a_{i_h}\}$ (where $h = \lceil \rho N \rceil$)

CEM attempts to find the single-best set of optimal parameters for a single optimization problem. The straightforward use in reinforcement learning is to learn the single-best set of policy parameters $w$ (Szita & Lörincz, 2006; Mannor et al., 2003). Our goal, however, is to (repeatedly) find maximally valued actions $a^*$ conditioned on each state for $Q(s, \cdot)$. The global optimization strategy could be run on each step to find the exact best action for each current state, as in QT-Opt (Kalashnikov et al., 2018) and follow-ups (Simmons-Edler et al., 2019; Shao et al., 2022), but this is expensive and throws away prior information about the function surface obtained when previous optimizations were executed.

We extend CEM to be (a) conditioned on state and (b) learned iteratively over time. The key modification when extending CEM to Conditional CEM (CCEM), to handle these two key differences, is to introduce another proposal policy that concentrates more slowly. This proposal policy is entropy-regularized to ensure that we keep a broader set of potential actions when sampling, in case the action-values have changed substantially since the previous update to that state. The main policy (the actor) does not use entropy regularization, allowing it to more quickly start acting according to currently greedy actions, without collapsing. We visualize this in Figure 1.


Figure 1: In the left figure we see multiple updates for both policies of the CCEM in a single state. We use uniform policies, for interpretability. In the rightmost figure, we show an actual progression of CCEM with Gaussian policies, when executed on the action-values depicted in the leftmost figure. The Actor policy (in black) concentrates more quickly than the Proposal policy (in red).

The CCEM algorithm is presented in Algorithm 2. On each step, the proposal policy, $\pi_{w'_t}(\cdot \mid S_t)$, is sampled to provide a set of actions $a_1, \ldots, a_N$ from which we construct the empirical distribution $\hat{I}(S_t) = \{a^*_1, \ldots, a^*_h\}$ of maximally valued actions. The actor parameters $w_t$ are updated using a gradient ascent step on the log-likelihood of the actions $\hat{I}(S_t)$. The proposal parameters $w'_t$ are updated using a similar update, but with an entropy regularizer.

To obtain $\hat{I}(S_t)$, we select $a^*_i \in \{a_1, \ldots, a_N\}$ where $Q(S_t, a^*_i)$ are in the top $(1 - \rho)$ quantile values. For example, for $\rho = 0.2$, approximately the top 20% of actions are chosen, with $h = \lceil \rho N \rceil$. Implicitly, $f_t$ is $Q_\theta(S_t, a^*_h)$ for $a^*_h$ the action with the lowest value in this top percentile. This procedure is summarized in Algorithm 1.

Greedy Actor-Critic, in Algorithm 3, puts this all together. We use experience replay, and the CCEM algorithm on a mini-batch. The updates involve obtaining the sets $\hat{I}(S)$ for every $S$ in the mini-batch $B$ and updating with the gradient $\frac{1}{|B|} \sum_{S \in B} \sum_{a \in \hat{I}(S)} \nabla_w \ln \pi_w(a \mid S)$. The Sarsa update to the critic involves (1) sampling an on-policy action from the actor $A' \sim \pi_w(\cdot \mid S')$ for each tuple in the mini-batch and (2) using the update $\frac{1}{|B|} \sum_{(S, A, S', R, A') \in B} (R + \gamma Q_\theta(S', A') - Q_\theta(S, A)) \nabla_\theta Q_\theta(S, A)$. Other critic updates are possible; we discuss alternatives and connections to related algorithms in Appendix A.
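The selection step of Algorithm 1 amounts to sorting the sampled actions by their action-values and keeping the top $\lceil \rho N \rceil$. A minimal sketch is given below; the function name, the callable `q_fn(state, action)` standing in for $Q_\theta$, and the toy action-value function are assumptions for illustration only.

```python
import math


def percentile_empirical_distribution(actions, q_fn, state, rho):
    """Return the ceil(rho * N) sampled actions with the highest action-values."""
    h = math.ceil(rho * len(actions))
    # Sort in descending order of Q(state, a) and keep the first h actions.
    ranked = sorted(actions, key=lambda a: q_fn(state, a), reverse=True)
    return ranked[:h]


# Toy action-value function peaked at a = 0.5 (hypothetical, state is ignored).
toy_q = lambda s, a: -(a - 0.5) ** 2
candidates = [i / 10 for i in range(-10, 11)]  # 21 candidate actions in [-1, 1]
elite = percentile_empirical_distribution(candidates, toy_q, state=None, rho=0.1)
```

Here 21 candidates with $\rho = 0.1$ give $h = 3$, and the returned set contains the three candidates closest to the peak of the toy action-value function.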
Algorithm 2 Conditional CEM for the Actor

    Input: $S_t$ and $Q_\theta$, $N \in \mathbb{N}$, $\rho \in (0, 1)$
    if actions discrete and $|\mathcal{A}| \le 1/\rho$ then
        $\hat{I}(S_t) = \arg\max_{a \in \mathcal{A}} Q_\theta(S_t, a)$
    else
        Sample $N$ actions $a_i \sim \pi_{w'}(\cdot \mid S_t)$
        Obtain $\hat{I}(S_t)$ using Algorithm 1
    end if
    $w \leftarrow w + \alpha_{p,t} \sum_{a \in \hat{I}(S_t)} \nabla_w \ln \pi_w(a \mid S_t)$
    $w' \leftarrow w' + \alpha_{p,t} \left[ \sum_{a \in \hat{I}(S_t)} \nabla_{w'} \ln \pi_{w'}(a \mid S_t) + \tau \nabla_{w'} H(\pi_{w'}(\cdot \mid S_t)) \right]$

Algorithm 3 Greedy Actor-Critic

    Initialize parameters $\theta$, $w$, $w'$, replay buffer $\mathcal{B}$
    Obtain initial state $S$
    while agent interacting with the environment do
        Take action $A \sim \pi_w(\cdot \mid S)$, observe $R$, $S'$
        Add $(S, A, S', R)$ to the buffer $\mathcal{B}$
        Grab a random mini-batch $B$ from buffer $\mathcal{B}$
        Update $\theta$ using Sarsa for policy $\pi_w$ on $B$
        Update $w$, $w'$ using Algorithm 2 on $B$
    end while

CCEM for Discrete Actions. Although we can use the same algorithm for discrete actions, we can make it simpler when we have a small number of discrete actions. Our algorithm is designed around the fact that it is difficult to solve for the maximal action for $Q_\theta(S_t, a)$ for continuous actions; we slowly identify this maximal action across states. For a small set of discrete actions, it is easy to get this maximizing action. If $|\mathcal{A}| \le 1/\rho$, then the top percentile consists of the one top action (or the top actions if there are ties); we can directly set $\hat{I}(S_t) = \arg\max_{a \in \mathcal{A}} Q_\theta(S_t, a)$ and do not need to maintain a proposal policy. For this reason, we focus our theory on the continuous-action setting, which is the main motivation for using CEM for the actor update.
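As a rough illustration of the two updates in Algorithm 2, the sketch below performs one CCEM step for a single state with scalar Gaussian policies. Several simplifications are assumptions and not part of the paper: each policy is a state-independent pair `(mu, log_sigma)` rather than a neural network, `q_fn` is a fixed toy function of the action, and plain gradient ascent with a shared step size is used. It relies on two standard facts about a Gaussian: the gradient of $\ln \mathcal{N}(a; \mu, \sigma^2)$ with respect to $(\mu, \ln \sigma)$ is $((a - \mu)/\sigma^2,\ (a - \mu)^2/\sigma^2 - 1)$, and its entropy is $\ln \sigma$ plus a constant.

```python
import math
import random


def gaussian_loglik_grad(a, mu, log_sigma):
    """Gradient of ln N(a; mu, sigma^2) with respect to (mu, log_sigma)."""
    sigma2 = math.exp(2.0 * log_sigma)
    return (a - mu) / sigma2, (a - mu) ** 2 / sigma2 - 1.0


def ccem_step(actor, proposal, q_fn, rng, n=30, rho=0.1, lr=0.01, tau=0.1):
    """One CCEM update for a single state (illustrative, state-independent policies)."""
    mu_p, log_sigma_p = proposal
    # Sample N actions from the proposal policy and keep the top ceil(rho * N).
    actions = [rng.gauss(mu_p, math.exp(log_sigma_p)) for _ in range(n)]
    elite = sorted(actions, key=q_fn, reverse=True)[: math.ceil(rho * n)]

    def ascend(params, entropy_scale):
        mu, log_sigma = params
        g_mu = g_ls = 0.0
        for a in elite:
            d_mu, d_ls = gaussian_loglik_grad(a, mu, log_sigma)
            g_mu += d_mu
            g_ls += d_ls
        # Gaussian entropy is log_sigma + const, so its gradient is (0, 1).
        g_ls += entropy_scale
        return mu + lr * g_mu, log_sigma + lr * g_ls

    # The actor uses no entropy term; the proposal adds tau times the entropy gradient.
    return ascend(actor, 0.0), ascend(proposal, tau)


rng = random.Random(0)
toy_q = lambda a: -(a - 1.0) ** 2  # hypothetical action-values peaked at a = 1
actor, proposal = ccem_step((0.0, 0.0), (0.0, 0.0), toy_q, rng)
```

When both policies start from the same parameters, the only difference after one step is the entropy term, which leaves the proposal slightly broader than the actor. Plain gradient steps on Gaussian parameters can become unstable once the standard deviation is small, which is one reason an adaptive optimizer is a natural choice in practice.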

4. THEORETICAL GUARANTEES

In this section, we motivate that the target policy underlying CCEM guarantees policy improvement, and characterize the ODE underlying CCEM. We show it tracks a CEM update in expectation across states and slowly concentrates around maximally valued actions even while the action-values are changing.

4.1. POLICY IMPROVEMENT UNDER AN IDEALIZED SETTING

We first consider the setting where we have access to $Q^\pi$, as is typically done for characterizing the policy improvement properties of an operator (Haarnoja et al., 2018a; Ghosh et al., 2020; Chan et al., 2021) as well as for the original policy improvement theorem (Sutton & Barto, 2018). Our update moves our policy towards a percentile-greedy policy that redistributes probability solely to the $(1 - \rho)$-quantile according to magnitudes under $Q(s, a)$. More formally, let $f^\rho_Q(\pi; s)$ be the threshold such that $\int_{\{a \in \mathcal{A} \mid Q(s, a) \ge f^\rho_Q(\pi; s)\}} \pi(a \mid s)\,da = \rho$, namely the threshold that gives the set of actions in the top $1 - \rho$ quantile, according to magnitudes under $Q(s, \cdot)$. Then we can define the percentile-greedy policy as

$$\pi_\rho(a \mid s, Q, \pi) = \begin{cases} \pi(a \mid s)/\rho & \text{if } Q(s, a) \ge f^\rho_Q(\pi; s) \\ 0 & \text{else} \end{cases}$$

where dividing by $\rho$ renormalizes the distribution. Computing this policy would be onerous; instead, we only sample the KL divergence to this policy, using a sample percentile. Nonetheless, this percentile-greedy policy represents the target policy that the actor updates towards (in the limit of samples $N$ for the percentile).

Intuitively, this target policy should give policy improvement, as it redistributes the weight of low-valued actions proportionally to high-valued actions. We formalize this in the following theorem. We write $\pi_\rho(a \mid s)$ instead of $\pi_\rho(a \mid s, Q^\pi, \pi)$, when it is clear from context.

Theorem 4.1. For a given policy $\pi$, action-value $Q^\pi$ and $\rho > 0$, the percentile-greedy policy $\pi_\rho$ in $\pi$ and $Q^\pi$ is guaranteed to be at least as good as $\pi$ in all states:

$$\int_{\mathcal{A}} \pi_\rho(a \mid s, Q^\pi, \pi)\, Q^{\pi_\rho}(s, a)\,da \ge \int_{\mathcal{A}} \pi(a \mid s)\, Q^\pi(s, a)\,da$$

Proof. The proof is a straightforward modification of the policy improvement theorem. Notice that

$$\int_{\mathcal{A}} \pi_\rho(a \mid s)\, Q^\pi(s, a)\,da = \int_{\{a \in \mathcal{A} \mid Q(s, a) \ge f^\rho_Q(\pi; s)\}} \frac{\pi(a \mid s)}{\rho}\, Q^\pi(s, a)\,da \ge \int_{\mathcal{A}} \pi(a \mid s)\, Q^\pi(s, a)\,da$$

by the definition of percentiles, for any state $s$.
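Rewriting $\int_{\mathcal{A}} \pi(a \mid s)\, Q^\pi(s, a)\,da = \mathbb{E}_\pi[Q^\pi(s, A)]$,

$$\begin{aligned}
V^\pi(s) &= \mathbb{E}_\pi[Q^\pi(s, A)] \le \mathbb{E}_{\pi_\rho}[Q^\pi(s, A)] \\
&= \mathbb{E}_{\pi_\rho}\left[R_{t+1} + \gamma\, \mathbb{E}_\pi[Q^\pi(S_{t+1}, A_{t+1})] \mid S_t = s\right] \\
&\le \mathbb{E}_{\pi_\rho}\left[R_{t+1} + \gamma\, \mathbb{E}_{\pi_\rho}[Q^\pi(S_{t+1}, A_{t+1})] \mid S_t = s\right] \\
&\le \mathbb{E}_{\pi_\rho}\left[R_{t+1} + \gamma R_{t+2} + \gamma^2\, \mathbb{E}_\pi[Q^\pi(S_{t+2}, A_{t+2})] \mid S_t = s\right] \\
&\;\;\vdots \\
&\le \mathbb{E}_{\pi_\rho}\left[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots + \gamma^{T-1} R_T \mid S_t = s\right] \\
&= \mathbb{E}_{\pi_\rho}[Q^{\pi_\rho}(s, A)] = V^{\pi_\rho}(s)
\end{aligned}$$

This result is a sanity check to ensure the target policy is sensible in our update. Note that the Boltzmann policy only guarantees improvement under the entropy-regularized action-values.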
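The first inequality in the proof, that the percentile-greedy policy has at least as high an expected action-value as the policy it is derived from, can be checked numerically on a toy example. The sketch below assumes a single state with a finite action set in which the top-valued actions carry probability mass exactly $\rho$; the function names and the toy values are hypothetical and are not results from the paper.

```python
def percentile_greedy(pi, q, rho):
    """Percentile-greedy policy for a finite action set.

    Assumes the highest-valued actions carry total probability exactly rho under pi.
    """
    order = sorted(range(len(q)), key=lambda i: q[i], reverse=True)
    new_pi = [0.0] * len(pi)
    mass = 0.0
    for i in order:
        if mass >= rho - 1e-12:  # stop once the top-rho mass is covered
            break
        new_pi[i] = pi[i] / rho  # renormalize the kept actions by rho
        mass += pi[i]
    return new_pi


def expected_value(pi, q):
    """Expected action-value under policy pi."""
    return sum(p * v for p, v in zip(pi, q))


pi = [0.1] * 10                      # uniform policy over 10 actions
q = [float(i) for i in range(10)]    # toy action-values 0, 1, ..., 9
pi_rho = percentile_greedy(pi, q, rho=0.2)
```

In this example the two highest-valued actions each receive probability 0.5, so the expected action-value rises from about 4.5 under the uniform policy to 8.5 under the percentile-greedy policy.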

4.2. CCEM TRACKS THE GREEDY ACTION

Beyond the idealized setting, we would like to understand the properties of the stochastic algorithm. CCEM is not a gradient descent approach, so we need to reason about its dynamics, namely the underlying ODE. We expect CCEM to behave like CEM per state, but with some qualifiers. First, CCEM uses a parameterized policy conditioned on state, meaning that there is aliasing between the action distributions per state. CEM, on the other hand, does not account for such aliasing. We identify conditions on the parameterized policy and use an ODE that takes expectations over states.

Second, the function we are attempting to maximize is also changing with time, because the action-values are updating. We address this issue using a two-timescale stochastic approximation approach, where the action-values $Q_\theta$ change more slowly than the policy, allowing the policy to track the maximally valued actions. The policy itself has two timescales, to account for its own parameters changing at different timescales. Actions for the maximum likelihood step are selected according to older (slower) parameters $w'$, so that it is as if the primary (faster) parameters $w$ are updated using samples from a fixed distribution. These two policies correspond to our proposal policy (slow) and actor (fast).

We show that the ODE for the CCEM parameters $w_t$ is based on the gradient

$$\nabla_{w(t)}\, \mathbb{E}_{S \sim \nu,\, A \sim \pi_{w'}(\cdot \mid S)}\left[ I_{\{Q_\theta(S, A) \ge f^\rho_\theta(w'; S)\}} \ln \pi_{w(t)}(A \mid S) \right]$$

where $\theta$ and $w'$ are changing at slower timescales, and so are effectively fixed from the perspective of the faster changing $w_t$. The term per-state is exactly the update underlying CEM, and so we can think of this ODE as one for an expected CEM Optimizer, across states for parameterized policies. We say that CCEM tracks this expected CEM Optimizer, because $\theta$ and $w'$ are changing with time. We provide an informal theorem statement here for Theorem B.1, with a proof sketch. The main result, including all conditions, is given in Appendix B.
We discuss some of the (limitations of the) conditions after the proof sketch.

Informal Result: Let $\theta_t$ be the action-value parameters with stepsize $\alpha_{q,t}$, and $w_t$ be the policy parameters with stepsize $\alpha_{a,t}$, with $w'_t$ a more slowly changing set of policy parameters set to $w'_t = (1 - \alpha'_{a,t}) w'_t + \alpha'_{a,t} w_t$ for stepsize $\alpha'_{a,t} \in (0,$ [...] and $w'_t$ evolves faster than $\theta_t$. Under these four conditions, the CCEM Actor tracks the expected CEM Optimizer.

Proof Sketch: The stochastic update to the Actor is not a direct gradient-descent update. Each update to the Actor is a CEM update, which requires a different analysis to ensure that the stochastic noise remains bounded and is asymptotically negligible. Further, the classical results of CEM also do not immediately apply, because such updates assume distribution parameters can be directly computed. Here, distribution parameters are conditioned on state, as outputs from a parametrized function. We identify conditions on the parametrized policy to ensure well-behaved CEM updates.

The multi-timescale analysis allows us to focus on the updates of the Actor $w_t$, assuming the action-value parameter $\theta$ and action-sampling parameter $w'$ are quasi-static. These parameters are allowed to change with time, as they will in practice, but are moving at a sufficiently slower timescale relative to $w_t$ and hence the analysis can be undertaken as if they are static.

The first step in the proof is to formulate the update to the weights as a projected stochastic recursion, simply meaning a stochastic update where after each update the weights are projected to a compact, convex set to keep them bounded. The stochastic recursion is reformulated into a summation involving the mean vector field $g^\theta(w_t)$ (which depends on the action-value parameters $\theta$), martingale noise, and a loss term $\ell^\theta_t$ that is due to having approximate quantiles.
The key steps are then to show almost surely that the mean vector field $g^\theta$ is locally Lipschitz, the martingale noise is quadratically bounded and that the loss term $\ell^\theta_t$ decays to zero asymptotically. For the first and second, we identify conditions on the policy parameterization that guarantee these. For the final case, we adapt the proof for sampled quantiles approaching true quantiles for CEM, with modifications to account for expectations over the conditioning variable, the state. ■

This result has several limitations. First, it does not perfectly characterize the CCEM algorithm that we actually use. We do not use the update $w'_t = (1 - \alpha'_{a,t}) w'_t + \alpha'_{a,t} w_t$, and instead use entropy regularization to make $w'_t$ concentrate more slowly than $w_t$. The principle is similar; empirically we found entropy regularization to be an effective strategy to achieve this condition. Second, the theory assumes the state distribution is fixed, and not influenced by $\pi_w$. It is standard to analyze the properties of (off-policy) algorithms for fixed datasets as a first step, as was done for Q-learning (Jaakkola et al., 1994). It allows us to separate concerns, and just ask: does our method concentrate on maximal actions across states? An important next step is to characterize the full Greedy Actor-Critic algorithm, beyond just understanding the properties of the CCEM component.
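For intuition about the slowly changing proposal parameters analyzed in the theory (and, as noted above, not the mechanism used in the practical algorithm), the averaging update can be sketched as follows. The function name `slow_tracking_update`, the list-of-floats parameter representation, and the stepsize value are assumptions for illustration.

```python
def slow_tracking_update(w_slow, w_fast, alpha_slow):
    """Move the slow parameters a fraction alpha_slow of the way towards the fast ones."""
    return [(1.0 - alpha_slow) * ws + alpha_slow * wf for ws, wf in zip(w_slow, w_fast)]


# Toy example: the actor parameters are held fixed while the proposal tracks them.
w_actor = [1.0, -2.0]
w_proposal = [0.0, 0.0]
for _ in range(100):
    w_proposal = slow_tracking_update(w_proposal, w_actor, alpha_slow=0.05)
```

With the actor held fixed, the proposal parameters approach it geometrically; a smaller stepsize makes the proposal lag further behind the actor.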

5. EMPIRICAL RESULTS

We are primarily interested in investigating sensitivity to hyperparameters. This sensitivity reflects how difficult it can be to get AC methods working on a new task, which is relevant for both applied settings and research. AC methods have been notoriously difficult to tune due to the interacting time scales of the actor and critic (Degris et al., 2012a), further compounded by the sensitivity in the entropy scale. The use of modern optimizers may reduce some of the sensitivity in stepsize selection; these experiments help understand if that is the case. Further, a very well-tuned algorithm may not be representative of performance across problems. We particularly examine the impacts of selecting a single set of hyperparameters across environments, in contrast to tuning per environment.

We chose to conduct experiments in small, challenging domains appropriately sized for extensive experiment repetition. Ensuring significance in results and carefully exploring hyperparameter sensitivity required many experiments. Our final plots required ∼30,000 runs across all environments, algorithms, and hyperparameters. Further, contrary to popular belief, classic control domains are a challenge for Deep RL agents (Ghiassian et al., 2020), and performance differences in these environments have been shown to extend to larger environments (Obando-Ceron & Castro, 2021).

5.1. ALGORITHMS

We focus on comparing GreedyAC to Soft Actor-Critic (SAC), both because this allows us to compare to a method that uses the Boltzmann target policy on action-values and because SAC continues to have the most widely reported success. We additionally include VanillaAC as a baseline, a basic AC variant which does not include any of the tricks SAC utilizes to improve performance, such as action reparameterization to estimate the policy gradient or double Q functions to mitigate maximization bias. For discrete actions, policies are parameterized using Softmax distributions. For continuous actions, policies are parameterized using Gaussian distributions, except SAC which uses a squashed Gaussian policy as per the original work. We tested SAC with a Gaussian policy, and it performed worse.

All algorithms use neural networks. Feedforward networks consist of two hidden layers of 64 units (classic control environments) or 256 units (Swimmer-v3 environment). Convolutional networks consist of one convolutional layer with 3 kernels of size 16 followed by a fully connected layer of size 128. All algorithms use the Adam optimizer (Kingma & Ba, 2014), experience replay, and target networks for the value functions. See Appendix C.1 for a full discussion of hyperparameters.

5.2. ENVIRONMENTS

We use the classic versions of Mountain Car (Sutton & Barto, 2018), Pendulum (Degris et al., 2012a), and Acrobot (Sutton & Barto, 2018). Each environment is run with both continuous and discrete action spaces; states are continuous. Discrete actions consist of the two extreme continuous actions and 0. All environments use a discount factor of γ = 0.99, and episodes are cut off at 1,000 timesteps, teleporting the agent back to the start state (but not causing termination). To demonstrate the potential of GreedyAC at scale, we also include experiments on Freeway and Breakout from MinAtar (Young & Tian, 2019) as well as on Swimmer-v3 from OpenAI Gym (Brockman et al., 2016). On MinAtar, episodes are cut off at 2,500 timesteps.

In Mountain Car, the goal is to drive an underpowered car up a hill. State consists of the position in [-1.2, 0.6] and velocity in [-0.7, 0.7]. The agent starts in a random position in [-0.6, -0.4] and velocity 0. The action is the force to apply to the car, in [-1, 1]. The reward is -1 per step.

In Pendulum, the goal is to hold a pendulum with a fixed base in a vertical position. State consists of the angle (normalized in [-π, π)) and angular momentum (in [-1, 1]). The agent starts with the pendulum facing downwards and 0 velocity. The action is the torque applied to the fixed base, in [-2, 2]. The reward is the cosine of the angle of the pendulum from the positive y-axis.

In Acrobot, the agent controls a doubly-linked pendulum with a fixed base. The goal is to swing the second link one link's length above the fixed base. State consists of the angle of each link (in [-π, π)) and the angular velocity of each link (in [-4π, 4π] and [-9π, 9π] respectively). The agent starts with random angles and angular velocities in [-0.1, 0.1]. The action is the torque applied to the joint between the two links, in [-1, 1]. The reward is -1 per step.

5.3. EXPERIMENTAL DETAILS

We sweep hyperparameters for 40 runs, tuning over the first 10 runs and reporting results using the final 30 runs for the best hyperparameters. We sweep critic step size $\alpha = 10^x$ for $x \in \{-5, -4, \ldots, -1\}$. We set the actor step size to be $\kappa \times \alpha$ and sweep $\kappa \in \{10^{-3}, 10^{-2}, 10^{-1}, 1, 2, 10\}$. We sweep entropy scales $\tau = 10^y$ for $y \in \{-3, -2, -1, 0, 1\}$. For the classic control experiments, we used fixed batch sizes of 32 samples and a replay buffer capacity of 100,000 samples. For the MinAtar experiments, we used fixed batch sizes of 32 samples and a buffer capacity of 1 million. For the Swimmer experiments, we used fixed batch sizes of 100 samples and a buffer capacity of 1 million. For CCEM, we fixed $\rho = 0.1$ and sample $N = 30$ actions.

To select hyperparameters across environments, we must normalize performance to provide an aggregate score. We use near-optimal performance as the normalizer for each environment, with a score of 1 meaning equal to this performance. We only use this normalization to average scores across environments. We report learning curves using the original unnormalized returns. For more details, see Appendix C.2.

Per-environment Tuning: We first examine how well the algorithms can perform when they are tuned per environment. In Figure 2, we see that SAC performs well in Pendulum-CA (continuous actions) and in Pendulum-DA (discrete actions) but poorly in the other settings. SAC learns slower than GreedyAC and VanillaAC on Acrobot. GreedyAC performs worse than SAC in Pendulum-CA, but still performs acceptably, nearly reaching the same final performance. SAC performs poorly on both versions of Mountain Car. That AC methods struggle with Acrobot is common wisdom, but here we see that both GreedyAC and VanillaAC do well on this problem. GreedyAC is the clear winner in Mountain Car.
Across-environment Tuning: We next examine the performance of the algorithms when they are forced to select one hyperparameter setting across continuous- or discrete-action environments separately, shown in Figure 3. We expect algorithms that are less sensitive to their parameters to suffer less degradation. Under this regime, GreedyAC has a clear advantage over SAC. GreedyAC maintains acceptable performance across all environments, sometimes learning more slowly than under per-environment tuning, but having reasonable behavior. SAC performs poorly on two-thirds of the environments. GreedyAC is less sensitive than VanillaAC under across-environment tuning and performs at least as well as VanillaAC.
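The sweep and the across-environment selection described in this section can be sketched as follows. The grid values are taken from the text; everything else is an assumption for illustration: the helper name `select_across_environments`, the dictionary layout, and the toy scores, which are placeholders and not results from the paper. The scores passed in are assumed to be already normalized per environment as described in Appendix C.2.

```python
import itertools

# Hyperparameter grid as described in the text.
critic_step_sizes = [10.0 ** x for x in range(-5, 0)]   # 10^-5, ..., 10^-1
actor_scales = [1e-3, 1e-2, 1e-1, 1.0, 2.0, 10.0]       # kappa; actor step size is kappa * alpha
entropy_scales = [10.0 ** y for y in range(-3, 2)]      # 10^-3, ..., 10^1
grid = list(itertools.product(critic_step_sizes, actor_scales, entropy_scales))


def select_across_environments(normalized_scores):
    """Pick the setting with the highest mean normalized score across environments.

    normalized_scores maps a setting to a dict of {environment: normalized score}.
    """
    def mean_score(setting):
        per_env = normalized_scores[setting]
        return sum(per_env.values()) / len(per_env)

    return max(normalized_scores, key=mean_score)


# Toy placeholder scores for two hypothetical settings (not results from the paper).
toy_scores = {
    "setting_a": {"env_1": 0.9, "env_2": 0.2},
    "setting_b": {"env_1": 0.6, "env_2": 0.7},
}
chosen = select_across_environments(toy_scores)
```

In the toy example the second setting is chosen: it is not the best in either environment individually, but it has the higher average, which is the kind of trade-off that across-environment tuning exposes.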

5.4. RESULTS

Hyperparameter Sensitivity: We examine the sensitivity of GreedyAC and SAC to their entropy scales, focusing on the continuous-action environments. We plot sensitivity curves, one for each entropy scale, with the stepsize on the x-axis and the average return across all steps and all 40 runs on the y-axis. Because there are two stepsizes, we have two sets of plots. When examining the sensitivity to the critic stepsize, we select the best actor stepsize. We do the same for the actor stepsize plots. We provide the plots with individual lines in Appendix C.3 and here focus on a more summarized view.

We set the actor step-size scale to 1.0 and a critic step-size of 10 -foot_3 for both GreedyAC and SAC, the defaults of SAC. We set the entropy scale of SAC to $10^{-3}$ based on a grid search. Figure 5a indicates that GreedyAC can learn a good policy from high-dimensional inputs, with performance comparable to DQN Rainbow.

Finally, we ran GreedyAC on Swimmer-v3 from OpenAI Gym (Brockman et al., 2016). We tuned over one run and then ran the tuned hyperparameters for an additional 9 runs to generate Figure 5b. We report online and offline performance. Offline evaluation is performed every 10,000 steps, for 10 episodes, where only the mean action is selected and learning is disabled. We report SAC's final performance on Swimmer from the SpinningUp benchmark. GreedyAC is clearly not state-of-the-art here (most methods are not); however, GreedyAC steadily improves throughout the experiment.

7. CONCLUSION

In this work, we introduced a new Actor-Critic (AC) algorithm called GreedyAC, which uses a new update to the actor based on an extension of the cross-entropy method (CEM). The idea is to (a) define a percentile-greedy target policy and (b) update the actor towards this target policy, by reducing a KL divergence to it. This percentile-greedy policy guarantees policy improvement, and we prove that our Conditional CEM algorithm tracks the actions of maximal value under changing action-values. We conclude with an in-depth empirical study, showing that GreedyAC has significantly lower sensitivity to its hyperparameters than SAC does.

A RELATED POLICY OPTIMIZATION ALGORITHMS

As mentioned in the main text, there are many policy optimization algorithms that can be seen as approximate policy iteration (API) rather than performing gradient descent on a policy objective. An overview and survey are given by Vieillard et al. (2020a) and Chan et al. (2021) . There, many methods are shown to either minimize a forward or reverse KL-divergence to the Boltzmann policy. Our approach similarly updates the actor using a KL-divergence to a target policy, but here that target policy is the percentile-greedy policy. By doing a maximum likelihood update with actions sampled under the percentile-greedy policy, we are reducing the forward KL-divergence to the percentile-greedy policy in Equation 2. Our CCEM update for the actor is new, but there are several approaches that resemble the idea, particularly those that try to match an expert. This includes dual policy iteration methods (DPI) (Parisotto et al., 2016; Sun et al., 2018; Steckelmacher et al., 2019) and RL as classification methods (Lagoudakis & Parr, 2003; Lazaric et al., 2010; Farahmand et al., 2015) . DPI has two policies, one which is guiding the other. For example, one policy might be an expensive tree search and another a learned neural network, trained to mimic the first (expert or guide) policy. CCEM, on the other hand, uses two policies differently. Our actor does not imitate our proposal policy. Rather, the proposal policy is used to improve the search over the nonconcave surface of Q. It samples actions more broadly, to make it more likely to find a maximizing action. Further, the actor increases the likelihood of only the top actions and does not imitate the proposal policy. In contrast, Bootstrap DPI (Steckelmacher et al., 2019, Equation 5) uses an update based on Actor-Mimic (Parisotto et al., 2016) , where the policy increases likelihood of actions for the softmax policies it is trying to mimic. 
The resemblance arises from the fact that (Steckelmacher et al., 2019, Equation 5) can be seen as a sum over forward KL divergences to softmax policies (for discrete actions), just like we have a forward KL divergence but to the percentile-greedy policy (for discrete or continuous actions).

The other class of algorithms, RL as classification, also looks similar due to using a forward KL divergence. These algorithms reduce the problem to identifying "positive" actions in a state (producing maximal returns) and "negative" actions in a state (producing non-maximal returns). If a cross-entropy loss is used, then this corresponds to maximizing the likelihood of the positive actions and minimizing the likelihood of the negative ones. More generally, other classification algorithms can be used that do not involve maximizing likelihood (like SVMs). The RL as classification algorithms primarily focus on how to obtain these positive and negative actions, and otherwise look quite different from GreedyAC, in addition to being restricted to a discrete set of actions.

Finally, we can also consider the connection to Conservative Policy Iteration (CPI) (Kakade & Langford, 2002) and a generalization called Deep CPI (Vieillard et al., 2020b). CPI updates the policy to be an interpolation between the greedy policy $G(Q)$ and the current policy $\pi$, to get the new policy $\pi' = (1 - \alpha)\pi + \alpha G(Q)$ for $\alpha \in [0, 1]$. Deep CPI extends this idea to parameterized policies, instead minimizing a forward KL to this interpolation policy. GreedyAC could be seen as another way to obtain a conservative update, because it does not move the actor all the way to the greedy policy. Instead, it moves towards the percentile-greedy policy (in Equation 2), which shifts probability to the upper percentile of actions. Similarly to the interpolation policy, this percentile-greedy policy depends on the previous policy and on increasing probability for maximally valued actions.
As yet, Deep CPI has not been shown to enjoy the same theoretical guarantees as CPI: minimizing a forward KL to the interpolation policy does not provide the same guarantees. It remains an open question how to implement this conservative update in deep RL, and it would be interesting to understand if the CCEM update could provide an alternative route to obtaining such guarantees.
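To make the CPI interpolation discussed above concrete, the following is a minimal sketch for a single state with a finite action set. It assumes tabular action-values and a deterministic greedy policy that breaks ties by taking the first maximizer; the function names `greedy_policy` and `cpi_update` are hypothetical and only illustrate the formula $\pi' = (1 - \alpha)\pi + \alpha G(Q)$.

```python
import numpy as np


def greedy_policy(q_values):
    """Return G(Q) for one state: all probability on an argmax action.

    Assumption: ties are broken by taking the first maximizer.
    """
    greedy = np.zeros(len(q_values), dtype=float)
    greedy[int(np.argmax(q_values))] = 1.0
    return greedy


def cpi_update(pi, q_values, alpha):
    """Interpolate between the current policy and the greedy policy."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    pi = np.asarray(pi, dtype=float)
    return (1.0 - alpha) * pi + alpha * greedy_policy(q_values)


pi = np.array([0.5, 0.3, 0.2])   # current policy over three actions
q = np.array([1.0, 3.0, 2.0])    # action-values in the same state
new_pi = cpi_update(pi, q, alpha=0.1)
```

With a small $\alpha$ the new policy stays close to $\pi$ while shifting a fraction $\alpha$ of the probability mass to the greedy action, which is the conservative behaviour the text refers to.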

B CONVERGENCE ANALYSIS OF THE ACTOR

We provided an informal proof statement and proof sketch in Section 4.2, to provide intuition for the result. Here, we provide the formal proof in the following subsections. We first provide some definitions, particularly for the quantile function which is central to the analysis. We then lay out the assumptions, and discuss some policy parameterizations to satisfy those assumptions. We finally state the theorem, with proof, and provide one lemma needed to prove the theorem in the final subsection.

B.1 NOTATION AND DEFINITIONS

Notation: For a set $A$, let $\mathring{A}$ represent the interior of $A$, while $\partial A$ is the boundary of $A$. The abbreviation a.s. stands for almost surely and i.o. stands for infinitely often. Let $\mathbb{N}$ represent the set $\{0, 1, 2, \ldots\}$. For a set $A$, we let $I_A$ be the indicator (characteristic) function of $A$, defined as $I_A(x) = 1$ if $x \in A$ and $0$ otherwise. Let $E_g[\cdot]$, $V_g[\cdot]$ and $P_g(\cdot)$ denote the expectation, variance and probability measure w.r.t. $g$. For a $\sigma$-field $\mathcal{F}$, let $E[\cdot \mid \mathcal{F}]$ represent the conditional expectation w.r.t. $\mathcal{F}$. A function $f : X \to Y$ is called Lipschitz continuous if $\exists L \in (0, \infty)$ s.t. $\|f(x_1) - f(x_2)\| \le L\|x_1 - x_2\|$, $\forall x_1, x_2 \in X$. A function $f$ is called locally Lipschitz continuous if for every $x \in X$, there exists a neighbourhood $U$ of $x$ such that $f|_U$ is Lipschitz continuous. Let $C(X, Y)$ represent the space of continuous functions from $X$ to $Y$. Also, let $B_r(x)$ represent an open ball of radius $r$ centered at $x$. For a positive integer $M$, let $[M] \doteq \{1, 2, \ldots, M\}$.

Definition 1. A function $\Gamma : U \subseteq \mathbb{R}^{d_1} \to V \subseteq \mathbb{R}^{d_2}$ is Frechet differentiable at $x \in U$ if there exists a bounded linear operator $\hat{\Gamma}_x : \mathbb{R}^{d_1} \to \mathbb{R}^{d_2}$ such that the limit

$$\lim_{\epsilon \downarrow 0} \frac{\Gamma(x + \epsilon y) - x}{\epsilon} \tag{3}$$

exists and is equal to $\hat{\Gamma}_x(y)$. We say $\Gamma$ is Frechet differentiable if the Frechet derivative of $\Gamma$ exists at every point in its domain.

Definition 2. Given a bounded real-valued continuous function $H : \mathbb{R}^d \to \mathbb{R}$ with $H(a) \in [H_l, H_u]$ and a scalar $\rho \in [0, 1]$, we define the $(1-\rho)$-quantile of $H(A)$ w.r.t. the PDF $g$ (denoted as $f^\rho(H, g)$) as follows:

$$f^\rho(H, g) \doteq \sup_{\ell \in [H_l, H_u]} \{ P_g(H(A) \ge \ell) \ge \rho \},$$

where $P_g$ is the probability measure induced by the PDF $g$, i.e., for a Borel set $A$, $P_g(A) \doteq \int_A g(a)\,da$.

This quantile operator will be used to succinctly write the quantile for $Q_\theta(S, \cdot)$, with actions selected according to $\pi_w$, i.e.,

$$f^\rho_\theta(w; s) \doteq f^\rho(Q_\theta(s, \cdot), \pi_w(\cdot|s)) = \sup_{\ell \in [Q^\theta_l, Q^\theta_u]} \{ P_{\pi_w(\cdot|s)}(Q_\theta(s, A) \ge \ell) \ge \rho \}. \tag{5}$$

•, •, •)| < R max < ∞.
We analyze the long run behaviour of the conditional cross-entropy recursion (actor), which is defined as follows:

$$w_{t+1} \doteq \Gamma^W\Big\{ w_t + \alpha_{a,t} \frac{1}{N_t} \sum_{A \in \Xi_t} I_{\{Q_{\theta_t}(S_t, A) \ge \hat{f}^\rho_{t+1}\}} \nabla_{w_t} \ln \pi_w(A|S_t) \Big\}, \tag{6}$$

where $\Xi_t \doteq \{A_{t,1}, A_{t,2}, \ldots, A_{t,N_t}\} \overset{iid}{\sim} \pi_{w'_t}(\cdot|S_t)$, and

$$w'_{t+1} \doteq w'_t + \alpha'_{a,t}(w_{t+1} - w'_t). \tag{7}$$

Here, $\Gamma^W\{\cdot\}$ is the projection operator onto the compact (closed and bounded) and convex set $W \subset \mathbb{R}^m$ with a smooth boundary $\partial W$. Therefore, $\Gamma^W$ maps vectors in $\mathbb{R}^m$ to the nearest vectors in $W$ w.r.t. the Euclidean distance (or equivalent metric). Convexity and compactness ensure that the projection is unique and belongs to $W$.

Assumption 2. The pre-determined, deterministic, step-size sequences $\{\alpha_{a,t}\}_{t \in \mathbb{N}}$, $\{\alpha'_{a,t}\}_{t \in \mathbb{N}}$ and $\{\alpha_{q,t}\}_{t \in \mathbb{N}}$ are positive scalars which satisfy the following:

$$\sum_{t \in \mathbb{N}} \alpha_{a,t} = \sum_{t \in \mathbb{N}} \alpha'_{a,t} = \sum_{t \in \mathbb{N}} \alpha_{q,t} = \infty, \qquad \sum_{t \in \mathbb{N}} \big( \alpha^2_{a,t} + \alpha'^2_{a,t} + \alpha^2_{q,t} \big) < \infty,$$

$$\lim_{t \to \infty} \frac{\alpha'_{a,t}}{\alpha_{a,t}} = 0, \qquad \lim_{t \to \infty} \frac{\alpha_{q,t}}{\alpha_{a,t}} = 0.$$

The first conditions in Assumption 2 are the classical Robbins-Monro conditions (Robbins & Monro, 1985) required for stochastic approximation algorithms. The last two conditions enable the different stochastic recursions to have separate timescales. Indeed, they ensure that the $w_t$ recursion is faster compared to the recursions of $\theta_t$ and $w'_t$. This timescale divide is needed to obtain the desired asymptotic behaviour, as we describe in the next section.

Assumption 3. The pre-determined, deterministic, sample length schedule $\{N_t \in \mathbb{N}\}_{t \in \mathbb{N}}$ is positive and strictly monotonically increases to $\infty$, with $\inf_{t \in \mathbb{N}} \frac{N_{t+1}}{N_t} > 1$.

Assumption 3 states that the number of samples increases to infinity and is primarily required to ensure that the estimation error arising due to the estimation of sample quantiles eventually decays to 0. Practically, one can indeed consider a fixed, finite, positive integer for $N_t$ which is large enough to accommodate the acceptable error.

Assumption 4. The sequence $\{\theta_t\}_{t \in \mathbb{N}}$ satisfies $\theta_t \in \Theta$, where $\Theta \subset \mathbb{R}^n$ is a convex, compact set. Also, for $\theta \in \Theta$, let $Q_\theta(s, a) \in [Q^\theta_l, Q^\theta_u]$, $\forall s \in S, a \in A$.

Assumption 4 assumes stability of the Expert, and minimally only requires that the values remain in a bounded range. We make no additional assumptions on the convergence properties of the Expert, as we simply need stability to prove the Actor tracks the update.

Assumption 5. For $\theta \in \Theta$ and $s \in S$, let $P_{A \sim \pi_{w'}(\cdot|s)}(Q_\theta(s, A) \ge \ell) > 0$, $\forall \ell \in [Q^\theta_l, Q^\theta_u]$ and $\forall w' \in W$.

Assumption 5 implies that there always exists a strictly positive probability mass beyond every threshold $\ell \in [Q^\theta_l, Q^\theta_u]$. This assumption is easily satisfied when $Q_\theta(s, a)$ is continuous in $a$ and $\pi_w(\cdot|s)$ is a continuous probability density function.

Assumption 6.

(i) $\displaystyle \sup_{w, w' \in W,\, \theta \in \Theta,\, \ell \in [Q^\theta_l, Q^\theta_u]} E_{A \sim \pi_{w'}(\cdot|S)}\Big[ \big\| I_{\{Q_\theta(S,A) \ge \ell\}} \nabla_w \ln \pi_w(A|S) - E_{A \sim \pi_{w'}(\cdot|S)}\big[ I_{\{Q_\theta(S,A) \ge \ell\}} \nabla_w \ln \pi_w(A|S) \,\big|\, S \big] \big\|_2^2 \,\Big|\, S \Big] < \infty$ a.s.,

(ii) $\displaystyle \sup_{w, w' \in W,\, \theta \in \Theta,\, \ell \in [Q^\theta_l, Q^\theta_u]} E_{A \sim \pi_{w'}(\cdot|S)}\Big[ \big\| I_{\{Q_\theta(S,A) \ge \ell\}} \nabla_w \ln \pi_w(A|S) \big\|_2^2 \,\Big|\, S \Big] < \infty$ a.s.

Assumption 7. For $s \in S$, $\nabla_w \ln \pi_w(\cdot|s)$ is locally Lipschitz continuous w.r.t. $w$.

Assumptions 6 and 7 are technical requirements that can be more easily characterized when we consider $\pi_w$ to belong to the natural exponential family (NEF) of distributions.

Definition 3. Natural exponential family of distributions (NEF) (Morris, 1982): These probability distributions over $\mathbb{R}^m$ are represented by

$$\{ \pi_\eta(x) \doteq h(x) e^{\eta^\top T(x) - K(\eta)} \mid \eta \in \Lambda \subset \mathbb{R}^d \},$$

where $\eta$ is the natural parameter, $h : \mathbb{R}^m \to \mathbb{R}$, $T : \mathbb{R}^m \to \mathbb{R}^d$ is called the sufficient statistic, and $K(\eta) \doteq \ln \int h(x) e^{\eta^\top T(x)}\,dx$ is called the cumulant function of the family. The space $\Lambda$ is defined as $\Lambda \doteq \{\eta \in \mathbb{R}^d \mid |K(\eta)| < \infty\}$. Also, the above representation is assumed minimal. A few popular distributions which belong to the NEF family include Binomial, Poisson, Bernoulli, Gaussian, Geometric and Exponential distributions.
We parametrize the policy $\pi_w(\cdot|S)$ using a neural network, which implies that when we consider NEF for the stochastic policy, the natural parameter $\eta$ of the NEF is being parametrized by $w$. To be more specific, we have $\{\psi_w : S \to \Lambda \mid w \in \mathbb{R}^m\}$ to be the function space induced by the neural network of the actor, i.e., for a given state $s \in S$, $\psi_w(s)$ represents the natural parameter of the NEF policy $\pi_w(\cdot|s)$. Further,

$$\begin{aligned}
\nabla_w \ln \pi_w(A|S) &= \nabla_w \big[ \ln(h(A)) + \psi_w(S)^\top T(A) - K(\psi_w(S)) \big] \\
&= \nabla_w \psi_w(S) \big( T(A) - \nabla_\eta K(\psi_w(S)) \big) \\
&= \nabla_w \psi_w(S) \big( T(A) - E_{A \sim \pi_w(\cdot|S)}[T(A)] \big).
\end{aligned}$$

Therefore Assumption 7 can be directly satisfied by assuming that $\psi_w$ is twice continuously differentiable w.r.t. $w$.

Assumption 8. For every $\theta \in \Theta$, $s \in S$ and $w \in W$, $f^\rho_\theta(w; s)$ (from Eq. (5)) exists and is unique.

The above assumption ensures that the true $(1-\rho)$-quantile is unique, and the assumption is usually satisfied for most distributions and a well-behaved $Q_\theta$.
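The NEF score identity used above, $\nabla_\eta \ln \pi_\eta(a) = T(a) - \nabla_\eta K(\eta)$, can be checked numerically on a simple member of the family. The sketch below assumes a one-dimensional Gaussian with unit variance, for which $T(a) = a$, $K(\eta) = \eta^2/2$ and $h(a) = e^{-a^2/2}/\sqrt{2\pi}$; the function names are hypothetical, and the example differentiates with respect to the natural parameter directly rather than through a network $\psi_w$.

```python
import math


def log_density(a, eta):
    """ln pi_eta(a) = ln h(a) + eta * T(a) - K(eta) for a unit-variance Gaussian."""
    log_h = -0.5 * a * a - 0.5 * math.log(2.0 * math.pi)
    return log_h + eta * a - 0.5 * eta * eta


def score(a, eta):
    """Closed-form score: T(a) - dK/d(eta) = a - eta (the mean of T(A) is eta)."""
    return a - eta


def finite_difference_score(a, eta, eps=1e-6):
    """Central finite-difference approximation of d/d(eta) ln pi_eta(a)."""
    return (log_density(a, eta + eps) - log_density(a, eta - eps)) / (2.0 * eps)


a, eta = 1.3, 0.4
closed_form = score(a, eta)
numerical = finite_difference_score(a, eta)
```

The two values agree up to floating-point error, illustrating that the gradient of the log-likelihood is the deviation of the sufficient statistic from its mean. With a neural network actor, this quantity would additionally be multiplied by $\nabla_w \psi_w(S)$ via the chain rule.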

B.3 MAIN THEOREM

To analyze the algorithm, we employ the ODE-based analysis proposed in (Borkar, 2008; Kushner & Clark, 2012). The actor recursions (Eqs. (6)-(7)) represent a classical two-timescale stochastic approximation recursion, where there exists a bilateral coupling between the individual stochastic recursions (6) and (7). Since the step-size schedules $\{\alpha_{a,t}\}_{t \in \mathbb{N}}$ and $\{\alpha'_{a,t}\}_{t \in \mathbb{N}}$ satisfy $\alpha'_{a,t}/\alpha_{a,t} \to 0$, we have $\alpha'_{a,t} \to 0$ relatively faster than $\alpha_{a,t} \to 0$. This disparity induces a pseudo-heterogeneous rate of convergence (or timescales) between the individual stochastic recursions, which further amounts to the asymptotic emergence of a stable coherent behaviour which is quasi-asynchronous. This pseudo-behaviour can be interpreted using multiple viewpoints. When viewed from the faster timescale recursion (controlled by $\alpha_{a,t}$), the slower timescale recursion (controlled by $\alpha'_{a,t}$) appears quasi-static, i.e., almost a constant. Likewise, when observed from the slower timescale, the faster timescale recursion seems equilibrated. The existence of this stable long run behaviour under certain standard assumptions of stochastic approximation algorithms is rigorously established in (Borkar, 1997) and also in Chapter 6 of (Borkar, 2008). For our stochastic approximation setting (Eqs. (6)-(7)), we can directly apply this characterization of the long run behaviour of two-timescale stochastic approximation algorithms, after ensuring that our setting complies with the prerequisites of the characterization, by considering the slow timescale stochastic recursion (7) to be quasi-stationary (i.e., $w'_t \equiv w'$, a.s., $\forall t \in \mathbb{N}$), while analyzing the limiting behaviour of the faster timescale recursion (6). Similarly, we let $\theta_t$ be quasi-stationary too (i.e., $\theta_t \equiv \theta$, a.s., $\forall t \in \mathbb{N}$).
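The asymptotic behaviour of the slower timescale recursion is further analyzed by considering the faster timescale temporal variable $w_t$ with the limit point so obtained during quasi-stationary analysis. Define the filtration $\{\mathcal{F}_t\}_{t \in \mathbb{N}}$, a family of increasing natural $\sigma$-fields, where $\mathcal{F}_t \doteq \sigma(\{w_i, w'_i, (S_i, A_i, R_i, S'_i), \Xi_i;\ 0 \le i \le t\})$.

Theorem B.1. Let $w'_t \equiv w'$, $\theta_t \equiv \theta$, $\forall t \in \mathbb{N}$ a.s. Let Assumptions 1-8 hold. Then the stochastic sequence $\{w_t\}_{t \in \mathbb{N}}$ generated by the stochastic recursion (6) asymptotically tracks the ODE:

$$\frac{d}{dt} w(t) = \hat{\Gamma}^W_{w(t)}\Big( \nabla_{w(t)} E_{S \sim \nu, A \sim \pi_{w'}(\cdot|S)}\Big[ I_{\{Q_\theta(S,A) \ge f^\rho_\theta(w';S)\}} \ln \pi_{w(t)}(A|S) \Big] \Big), \quad t \ge 0. \tag{10}$$

In other words, $\lim_{t \to \infty} w_t \in K$ a.s., where $K$ is the set of stable equilibria of the ODE (10) contained inside $W$.

Proof. Firstly, we rewrite the stochastic recursion (6) under the hypothesis that $\theta_t$ and $w'_t$ are quasi-stationary, i.e., $\theta_t \equiv \theta$ a.s. and $w'_t \equiv w'$ a.s., as follows:

$$w_{t+1} \doteq \Gamma^W\Big\{ w_t + \alpha_{a,t} \frac{1}{N_t} \sum_{A \in \Xi_t} I_{\{Q_\theta(S_t,A) \ge \hat{f}^\rho_{t+1}\}} \nabla_{w_t} \ln \pi_w(A|S_t) \Big\}, \tag{11}$$

where $f^\rho_\theta(w'; S) \doteq f^\rho(Q_\theta(S, \cdot), \pi_{w'}(\cdot|S))$ and $\nabla_{w_t} \doteq \nabla_{w = w_t}$, i.e., the gradient w.r.t. $w$ at $w_t$. Define

$$g^\theta(w) \doteq E_{S_t \sim \nu, A \sim \pi_{w'}(\cdot|S_t)}\Big[ I_{\{Q_\theta(S_t,A) \ge f^\rho_\theta(w';S_t)\}} \nabla_w \ln \pi_w(A|S_t) \Big],$$

$$M_{t+1} \doteq \frac{1}{N_t} \sum_{A \in \Xi_t} I_{\{Q_\theta(S_t,A) \ge \hat{f}^\rho_{t+1}\}} \nabla_{w_t} \ln \pi_w(A|S_t) - E\Big[ \frac{1}{N_t} \sum_{A \in \Xi_t} I_{\{Q_\theta(S_t,A) \ge \hat{f}^\rho_{t+1}\}} \nabla_{w_t} \ln \pi_w(A|S_t) \,\Big|\, \mathcal{F}_t \Big],$$

$$\ell^\theta_t \doteq E\Big[ \frac{1}{N_t} \sum_{A \in \Xi_t} I_{\{Q_\theta(S_t,A) \ge \hat{f}^\rho_{t+1}\}} \nabla_{w_t} \ln \pi_w(A|S_t) \,\Big|\, \mathcal{F}_t \Big] - E_{S_t \sim \nu, A \sim \pi_{w'}(\cdot|S_t)}\Big[ I_{\{Q_\theta(S_t,A) \ge f^\rho_\theta(w';S_t)\}} \nabla_{w_t} \ln \pi_w(A|S_t) \Big].$$

Then, adding and subtracting the two expectations, we can rewrite Equation (11) as

$$\begin{aligned}
w_{t+1} = \Gamma^W\Big\{ w_t + \alpha_{a,t} \Big( & E_{S_t \sim \nu, A \sim \pi_{w'}(\cdot|S_t)}\Big[ I_{\{Q_\theta(S_t,A) \ge f^\rho_\theta(w';S_t)\}} \nabla_{w_t} \ln \pi_w(A|S_t) \Big] \\
& - E_{S_t \sim \nu, A \sim \pi_{w'}(\cdot|S_t)}\Big[ I_{\{Q_\theta(S_t,A) \ge f^\rho_\theta(w';S_t)\}} \nabla_{w_t} \ln \pi_w(A|S_t) \Big] \\
& + E\Big[ \frac{1}{N_t} \sum_{A \in \Xi_t} I_{\{Q_\theta(S_t,A) \ge \hat{f}^\rho_{t+1}\}} \nabla_{w_t} \ln \pi_w(A|S_t) \,\Big|\, \mathcal{F}_t \Big] \\
& - E\Big[ \frac{1}{N_t} \sum_{A \in \Xi_t} I_{\{Q_\theta(S_t,A) \ge \hat{f}^\rho_{t+1}\}} \nabla_{w_t} \ln \pi_w(A|S_t) \,\Big|\, \mathcal{F}_t \Big] \\
& + \frac{1}{N_t} \sum_{A \in \Xi_t} I_{\{Q_\theta(S_t,A) \ge \hat{f}^\rho_{t+1}\}} \nabla_{w_t} \ln \pi_w(A|S_t) \Big) \Big\} \\
= \Gamma^W\Big\{ w_t + \alpha_{a,t} \Big( & g^\theta(w_t) + M_{t+1} + \ell^\theta_t \Big) \Big\}.
\end{aligned} \tag{15}$$

A few observations are in order:

B1. $\{M_{t+1}\}_{t \in \mathbb{N}}$ is a martingale difference noise sequence w.r.t. the filtration $\{\mathcal{F}_t\}_{t \in \mathbb{N}}$, i.e., $M_{t+1}$ is $\mathcal{F}_{t+1}$-measurable and integrable, $\forall t \in \mathbb{N}$, and $E[M_{t+1} \mid \mathcal{F}_t] = 0$ a.s., $\forall t \in \mathbb{N}$.

B2. $g^\theta$ is locally Lipschitz continuous. This follows from Assumption 7.

B3. $\ell^\theta_t \to 0$ a.s. as $t \to \infty$ (by Lemma 2 below).

B4. The iterates $\{w_t\}_{t \in \mathbb{N}}$ are bounded almost surely, i.e., $\sup_{t \in \mathbb{N}} \|w_t\| < \infty$ a.s. This is ensured by the explicit application of the projection operator $\Gamma^W\{\cdot\}$ over the iterates $\{w_t\}_{t \in \mathbb{N}}$ at every iteration onto the bounded set $W$.

B5. $\exists L \in (0, \infty)$ s.t. $E[\|M_{t+1}\|^2 \mid \mathcal{F}_t] \le L(1 + \|w_t\|^2)$ a.s. This follows from Assumption 6 (ii).

Now, we rewrite the stochastic recursion (15) as follows:

$$\begin{aligned}
w_{t+1} &\doteq w_t + \alpha_{a,t} \frac{\Gamma^W\{ w_t + \alpha_{a,t}( g^\theta(w_t) + M_{t+1} + \ell^\theta_t ) \} - w_t}{\alpha_{a,t}} \\
&= w_t + \alpha_{a,t} \Big( \hat{\Gamma}^W_{w_t}(g^\theta(w_t)) + \hat{\Gamma}^W_{w_t}(M_{t+1}) + \hat{\Gamma}^W_{w_t}(\ell^\theta_t) + o(\alpha_{a,t}) \Big),
\end{aligned}$$

where $\hat{\Gamma}^W$ is the Frechet derivative (Definition 1). The above stochastic recursion is also a stochastic approximation recursion with the vector field $\hat{\Gamma}^W_{w_t}(g^\theta(w_t))$, the noise term $\hat{\Gamma}^W_{w_t}(M_{t+1})$, the bias term $\hat{\Gamma}^W_{w_t}(\ell^\theta_t)$, and an additional error term $o(\alpha_{a,t})$ which is asymptotically inconsequential. Also, note that $\Gamma^W$ is a single-valued map since the set $W$ is assumed convex, and the limit exists since the boundary $\partial W$ is considered smooth. Further, for $w \in \mathring{W}$, we have

$$\hat{\Gamma}^W_w(u) \doteq \lim_{\epsilon \to 0} \frac{\Gamma^W\{w + \epsilon u\} - w}{\epsilon} = \lim_{\epsilon \to 0} \frac{w + \epsilon u - w}{\epsilon} = u \quad \text{(for sufficiently small } \epsilon\text{)},$$

i.e., $\hat{\Gamma}^W_w(\cdot)$ is an identity map for $w \in \mathring{W}$.

Now by appealing to Theorem 2, Chapter 2 of (Borkar, 2008) along with the observations B1-B5, we conclude that the stochastic recursion (6) asymptotically tracks the following ODE almost surely:

$$\begin{aligned}
\frac{d}{dt} w(t) &= \hat{\Gamma}^W_{w(t)}(g^\theta(w(t))), \quad t \ge 0 \\
&= \hat{\Gamma}^W_{w(t)}\Big( E_{S \sim \nu, A \sim \pi_{w'}(\cdot|S)}\Big[ I_{\{Q_\theta(S,A) \ge f^\rho_\theta(w';S)\}} \nabla_{w(t)} \ln \pi_w(A|S) \Big] \Big) \\
&= \hat{\Gamma}^W_{w(t)}\Big( \nabla_{w(t)} E_{S \sim \nu, A \sim \pi_{w'}(\cdot|S)}\Big[ I_{\{Q_\theta(S,A) \ge f^\rho_\theta(w';S)\}} \ln \pi_w(A|S) \Big] \Big).
\end{aligned}$$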
The interchange of expectation and the gradient in the last equality follows from the dominated convergence theorem and Assumption 7 (Rubinstein & Shapiro, 1993). The above ODE is a gradient flow with dynamics restricted inside $W$. This further implies that the stochastic recursion (6) converges to a (possibly sample path dependent) asymptotically stable equilibrium point of the above ODE inside $W$.

B.4 PROOF OF LEMMA 2 TO SATISFY CONDITION 3

In this section, we show that $\ell^\theta_t \to 0$ a.s. as $t \to \infty$, in Lemma 2. To do so, we first need to prove several supporting lemmas. Lemma 1 shows that, for a given Actor and Expert, the sample quantile converges to the true quantile. Using this lemma, we can then prove Lemma 2. In the following subsection, we provide three supporting lemmas about convexity and Lipschitz properties of the sample quantiles, required for the proof of Lemma 1.

For this section, we require the following characterization of $f^\rho(Q_\theta(s, \cdot), w')$. Please refer to Lemma 1 of (Homem-de Mello, 2007) for more details.

$$f^\rho(Q_\theta(s, \cdot), w') = \arg\min_{\ell \in [Q^\theta_l, Q^\theta_u]} E_{A \sim \pi_{w'}(\cdot|s)}[\Psi(Q_\theta(s, A), \ell)],$$

where $\Psi(y, \ell) \doteq (y - \ell)(1 - \rho) I_{\{y \ge \ell\}} + (\ell - y)\rho I_{\{\ell \ge y\}}$.

Similarly, the sample estimate of the true $(1-\rho)$-quantile, i.e., $\hat{f}^\rho \doteq Q^{(\lceil (1-\rho)N \rceil)}_{\theta,s}$ (where $Q^{(i)}_{\theta,s}$ is the $i$-th order statistic of the random sample $\{Q_\theta(s, A)\}_{A \in \Xi}$ with $\Xi \doteq \{A_i\}_{i=1}^N \overset{iid}{\sim} \pi_{w'}(\cdot|s)$), can be characterized as the unique solution of the stochastic counterpart of the above optimization problem, i.e.,

$$\hat{f}^\rho = \arg\min_{\ell \in [Q^\theta_l, Q^\theta_u]} \frac{1}{N} \sum_{A \in \Xi,\, |\Xi| = N} \Psi(Q_\theta(s, A), \ell).$$

Lemma 1. Assume $\theta_t \equiv \theta$, $w'_t \equiv w'$, $\forall t \in \mathbb{N}$. Also, let Assumptions 3-5 hold. Then, for a given state $s \in S$,

$$\lim_{t \to \infty} \hat{f}^\rho_t = f^\rho(Q_\theta(s, \cdot), w') \quad \text{a.s.},$$

where $\hat{f}^\rho_t \doteq Q^{(\lceil (1-\rho)N_t \rceil)}_{\theta,s}$ (where $Q^{(i)}_{\theta,s}$ is the $i$-th order statistic of the random sample $\{Q_\theta(s, A)\}_{A \in \Xi_t}$ with $\Xi_t \doteq \{A_i\}_{i=1}^{N_t} \overset{iid}{\sim} \pi_{w'}(\cdot|s)$).

Proof. The proof is similar to arguments in Lemma 7 of (Hu et al., 2007).
Since the state $s$ and the expert parameter $\theta$ are considered fixed, we assume the following notation in the proof. Let $\hat{f}^\rho_{t|s,\theta} \doteq \hat{f}^\rho_t$ and $f^\rho_{|s,\theta} \doteq f^\rho(Q_\theta(s, \cdot), w')$, where $\hat{f}^\rho_t$ and $f^\rho(Q_\theta(s, \cdot), w')$ are defined in Equations (19) and (20).

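Consider the open cover $\{B_r(\ell), \ell \in [Q^\theta_l, Q^\theta_u]\}$ of $[Q^\theta_l, Q^\theta_u]$. Since $[Q^\theta_l, Q^\theta_u]$ is compact, there exists a finite sub-cover, i.e., $\exists \{\ell_1, \ell_2, \ldots, \ell_M\}$ s.t. $\cup_{i=1}^M B_r(\ell_i) = [Q^\theta_l, Q^\theta_u]$. Let

$$\vartheta(\ell) \doteq E_{A \sim \pi_{w'}(\cdot|s)}[\Psi(Q_\theta(s, A), \ell)] \quad \text{and} \quad \hat{\vartheta}_t(\ell) \doteq \frac{1}{N_t} \sum_{A \in \Xi_t,\, |\Xi_t| = N_t} \Psi(Q_\theta(s, A), \ell), \quad \Xi_t \overset{iid}{\sim} \pi_{w'}(\cdot|s).$$

Now, by the triangle inequality, we have for $\ell \in [Q^\theta_l, Q^\theta_u]$, with $\ell_j$ chosen such that $\ell \in B_r(\ell_j)$,

$$\begin{aligned}
|\vartheta(\ell) - \hat{\vartheta}_t(\ell)| &\le |\vartheta(\ell) - \vartheta(\ell_j)| + |\vartheta(\ell_j) - \hat{\vartheta}_t(\ell_j)| + |\hat{\vartheta}_t(\ell_j) - \hat{\vartheta}_t(\ell)| \\
&\le L_\rho |\ell - \ell_j| + |\vartheta(\ell_j) - \hat{\vartheta}_t(\ell_j)| + \hat{L}_\rho |\ell_j - \ell| \\
&\le (L_\rho + \hat{L}_\rho) r + |\vartheta(\ell_j) - \hat{\vartheta}_t(\ell_j)|,
\end{aligned} \tag{22}$$

where $L_\rho$ and $\hat{L}_\rho$ are the Lipschitz constants of $\vartheta(\cdot)$ and $\hat{\vartheta}_t(\cdot)$ respectively.

For $\delta > 0$, take $r = \delta / (2(L_\rho + \hat{L}_\rho))$, so that $(L_\rho + \hat{L}_\rho) r = \delta/2$. Also, by Kolmogorov's strong law of large numbers (Theorem 2.3.10 of (Sen & Singer, 2017)), we have $\hat{\vartheta}_t(\ell) \to \vartheta(\ell)$ a.s. This implies that there exists $T \in \mathbb{N}$ s.t. $|\vartheta(\ell_j) - \hat{\vartheta}_t(\ell_j)| < \delta/2$, $\forall t \ge T$, $\forall j \in [M]$. Then from Eq. (22), we have

$$|\vartheta(\ell) - \hat{\vartheta}_t(\ell)| \le \delta/2 + \delta/2 = \delta, \quad \forall \ell \in [Q^\theta_l, Q^\theta_u].$$

This implies $\hat{\vartheta}_t$ converges uniformly to $\vartheta$. By Lemmas 3 and 4, $\hat{\vartheta}_t$ and $\vartheta$ are strictly convex and Lipschitz continuous, and so because $\hat{\vartheta}_t$ converges uniformly to $\vartheta$, the sequence of minimizers of $\hat{\vartheta}_t$ converges to the minimizer of $\vartheta$ (see Lemma 5, Appendix B.6 for an explicit justification). These minimizers correspond to $\hat{f}^\rho_t$ and $f^\rho(Q_\theta(s, \cdot), w')$ respectively, and so $\lim_{N_t \to \infty} \hat{f}^\rho_t = f^\rho(Q_\theta(s, \cdot), w')$ a.s.

Now, for $\delta > 0$ and $r \doteq \delta / (2(L_\rho + \hat{L}_\rho))$, we obtain the following from Eq. (22):

$$|\vartheta(\ell) - \hat{\vartheta}_t(\ell)| \le \delta/2 + |\vartheta(\ell_j) - \hat{\vartheta}_t(\ell_j)|,$$

so that $\{|\vartheta(\ell_j) - \hat{\vartheta}_t(\ell_j)| \le \delta/2,\ \forall j \in [M]\} \Rightarrow \{|\vartheta(\ell) - \hat{\vartheta}_t(\ell)| \le \delta,\ \forall \ell \in [Q^\theta_l, Q^\theta_u]\}$, and therefore

$$\begin{aligned}
P_{\pi_{w'}}\big( |\vartheta(\ell) - \hat{\vartheta}_t(\ell)| \le \delta,\ \forall \ell \in [Q^\theta_l, Q^\theta_u] \big) &\ge P_{\pi_{w'}}\big( |\vartheta(\ell_j) - \hat{\vartheta}_t(\ell_j)| \le \delta/2,\ \forall j \in [M] \big) \\
&= 1 - P_{\pi_{w'}}\big( |\vartheta(\ell_j) - \hat{\vartheta}_t(\ell_j)| > \delta/2,\ \exists j \in [M] \big) \\
&\ge 1 - \sum_{j=1}^M P_{\pi_{w'}}\big( |\vartheta(\ell_j) - \hat{\vartheta}_t(\ell_j)| > \delta/2 \big) \\
&\ge 1 - M \max_{j \in [M]} P_{\pi_{w'}}\big( |\vartheta(\ell_j) - \hat{\vartheta}_t(\ell_j)| > \delta/2 \big) \\
&\ge 1 - 2M \exp\Big( \frac{-2 N_t \delta^2}{4 (Q^\theta_u - Q^\theta_l)^2} \Big),
\end{aligned} \tag{23}$$

where $P_{\pi_{w'}} \doteq P_{A \sim \pi_{w'}(\cdot|s)}$. The last inequality follows from Hoeffding's inequality (Hoeffding, 1963) along with the fact that $E_{\pi_{w'}}[\hat{\vartheta}_t(\ell_j)] = \vartheta(\ell_j)$ and $\sup_{\ell \in [Q^\theta_l, Q^\theta_u]} |\vartheta(\ell)| \le Q^\theta_u - Q^\theta_l$.

Now, the sub-differential of $\vartheta(\ell)$ is given by

$$\partial_\ell \vartheta(\ell) = \Big[ \rho - P_{A \sim \pi_{w'}(\cdot|s)}(Q_\theta(s, A) \ge \ell),\ \rho - 1 + P_{A \sim \pi_{w'}(\cdot|s)}(Q_\theta(s, A) \le \ell) \Big].$$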
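By the definition of sub-gradient we obtain

$$c\,|\hat{f}^\rho_{t|s,\theta} - f^\rho_{|s,\theta}| \le |\vartheta(\hat{f}^\rho_{t|s,\theta}) - \vartheta(f^\rho_{|s,\theta})|,\ c \in \partial_\ell \vartheta(\ell) \quad \Rightarrow \quad C\,|\hat{f}^\rho_{t|s,\theta} - f^\rho_{|s,\theta}| \le |\vartheta(\hat{f}^\rho_{t|s,\theta}) - \vartheta(f^\rho_{|s,\theta})|,$$

where $C \doteq \max\big\{ \rho - P_{A \sim \pi_{w'}(\cdot|s)}(Q_\theta(s, A) \ge f^\rho_{|s,\theta}),\ \rho - 1 + P_{A \sim \pi_{w'}(\cdot|s)}(Q_\theta(s, A) \le f^\rho_{|s,\theta}) \big\}$. Further,

$$\begin{aligned}
C\,|\hat{f}^\rho_{t|s,\theta} - f^\rho_{|s,\theta}| &\le |\vartheta(\hat{f}^\rho_{t|s,\theta}) - \vartheta(f^\rho_{|s,\theta})| \\
&\le |\vartheta(\hat{f}^\rho_{t|s,\theta}) - \hat{\vartheta}_t(\hat{f}^\rho_{t|s,\theta})| + |\hat{\vartheta}_t(\hat{f}^\rho_{t|s,\theta}) - \vartheta(f^\rho_{|s,\theta})| \\
&\le |\vartheta(\hat{f}^\rho_{t|s,\theta}) - \hat{\vartheta}_t(\hat{f}^\rho_{t|s,\theta})| + \sup_{\ell \in [Q^\theta_l, Q^\theta_u]} |\hat{\vartheta}_t(\ell) - \vartheta(\ell)| \\
&\le 2 \sup_{\ell \in [Q^\theta_l, Q^\theta_u]} |\hat{\vartheta}_t(\ell) - \vartheta(\ell)|.
\end{aligned} \tag{26}$$

From Eqs. (23) and (26), we obtain for $\epsilon > 0$

$$\begin{aligned}
P_{w'}\big( N_t^\alpha |\hat{f}^\rho_{t|s,\theta} - f^\rho_{|s,\theta}| \ge \epsilon \big) &\le P_{w'}\Big( N_t^\alpha \sup_{\ell \in [Q^\theta_l, Q^\theta_u]} |\hat{\vartheta}_t(\ell) - \vartheta(\ell)| \ge \frac{\epsilon}{2} \Big) \\
&\le 2M \exp\Big( \frac{-2 N_t \epsilon^2}{16 N_t^{2\alpha} (Q^\theta_u - Q^\theta_l)^2} \Big) = 2M \exp\Big( \frac{-2 N_t^{1-2\alpha} \epsilon^2}{16 (Q^\theta_u - Q^\theta_l)^2} \Big).
\end{aligned}$$

For $\alpha \in (0, 1/2)$ and $\inf_{t \in \mathbb{N}} \frac{N_{t+1}}{N_t} \ge \tau > 1$ (by Assumption 3), we then have

$$\sum_{t=1}^\infty 2M \exp\Big( \frac{-2 N_t^{1-2\alpha} \epsilon^2}{16 (Q^\theta_u - Q^\theta_l)^2} \Big) \le \sum_{t=1}^\infty 2M \exp\Big( \frac{-2 \tau^{(1-2\alpha)t} N_0^{1-2\alpha} \epsilon^2}{16 (Q^\theta_u - Q^\theta_l)^2} \Big) < \infty.$$

Therefore, by Borel-Cantelli's Lemma (Durrett, 1991), we have

$$P_{w'}\big( N_t^\alpha |\hat{f}^\rho_{t|s,\theta} - f^\rho_{|s,\theta}| \ge \epsilon \ \ \text{i.o.} \big) = 0.$$

Thus we have $N_t^\alpha |\hat{f}^\rho_{t|s,\theta} - f^\rho_{|s,\theta}| \to 0$ a.s. as $N_t \to \infty$.

Lemma 2. Almost surely, $\ell^\theta_t \to 0$ as $N_t \to \infty$.

Proof of Lemma 2: Consider

$$E\Big[ \frac{1}{N_t} \sum_{A \in \Xi_t} I_{\{Q_\theta(S_t,A) \ge \hat{f}^\rho_{t+1}\}} \nabla_{w_t} \ln \pi_w(A|S_t) \,\Big|\, \mathcal{F}_t \Big] = E\Big[ E_{\Xi_t}\Big[ \frac{1}{N_t} \sum_{A \in \Xi_t} I_{\{Q_\theta(S_t,A) \ge \hat{f}^\rho_{t+1}\}} \nabla_{w_t} \ln \pi_w(A|S_t) \Big] \,\Big|\, S_t = s, w'_t \Big].$$

Notice that, because of the conditions on $\pi_{w'}(\cdot|s)$, we know that the sample average converges with an exponential rate in the number of samples, for arbitrary $w' \in W$. Namely, for $\epsilon > 0$ and $N \in \mathbb{N}$, we have

$$P_{\Xi \overset{iid}{\sim} \pi_{w'}(\cdot|s)}\Big( \Big\| \frac{1}{N} \sum_{A \in \Xi} I_{\{Q_\theta(s,A) \ge f^\rho(Q_\theta(s,\cdot), \pi_{w'}(\cdot|s))\}} \nabla_w \ln \pi_w(A|s) - E_{A \sim \pi_{w'}(\cdot|s)}\Big[ I_{\{Q_\theta(s,A) \ge f^\rho(Q_\theta(s,\cdot), \pi_{w'}(\cdot|s))\}} \nabla_w \ln \pi_w(A|s) \Big] \Big\| \ge \epsilon \Big) \le C_1 \exp(-c_2 N^{c_3} \epsilon^{c_4}),$$

$\forall \theta \in \Theta$, $w, w' \in W$, $s \in S$, where $C_1, c_2, c_3, c_4 > 0$.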
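Therefore, for $\alpha' > 0$, we have

$$\begin{aligned}
& P\Big( N_t^{\alpha'} \Big\| \frac{1}{N_t} \sum_{A \in \Xi_t} I_{\{Q_\theta(s,A) \ge f^\rho_{\theta,s}\}} \nabla_{w_t} \ln \pi_w(A|s) - E\Big[ I_{\{Q_\theta(s,A) \ge f^\rho_{\theta,s}\}} \nabla_{w_t} \ln \pi_w(A|s) \Big] \Big\| \ge \epsilon \Big) \\
&\quad \le C_1 \exp\Big( -\frac{c_2 N_t^{c_3} \epsilon^{c_4}}{N_t^{c_4 \alpha'}} \Big) = C_1 \exp\big( -c_2 N_t^{c_3 - c_4 \alpha'} \epsilon^{c_4} \big) \le C_1 \exp\big( -c_2 \tau^{(c_3 - c_4 \alpha')t} N_0^{c_3 - c_4 \alpha'} \epsilon^{c_4} \big),
\end{aligned}$$

where $f^\rho_{\theta,s} \doteq f^\rho(Q_\theta(s, \cdot), \pi_{w'}(\cdot|s))$ and $\inf_{t \in \mathbb{N}} \frac{N_{t+1}}{N_t} \ge \tau > 1$ (by Assumption 3). For $c_3 - c_4 \alpha' > 0 \Rightarrow \alpha' < c_3/c_4$, we have

$$\sum_{t=1}^\infty C_1 \exp\big( -c_2 \tau^{(c_3 - c_4 \alpha')t} N_0^{c_3 - c_4 \alpha'} \epsilon^{c_4} \big) < \infty.$$

Therefore, by Borel-Cantelli's Lemma (Durrett, 1991), we have

$$P\Big( N_t^{\alpha'} \Big\| \frac{1}{N_t} \sum_{A \in \Xi_t} I_{\{Q_\theta(s,A) \ge f^\rho_{\theta,s}\}} \nabla_{w_t} \ln \pi_w(A|s) - E\Big[ I_{\{Q_\theta(s,A) \ge f^\rho_{\theta,s}\}} \nabla_{w_t} \ln \pi_w(A|s) \Big] \Big\| \ge \epsilon \ \ \text{i.o.} \Big) = 0.$$

This implies that

$$N_t^{\alpha'} \Big\| \frac{1}{N_t} \sum_{A \in \Xi_t} I_{\{Q_\theta(s,A) \ge f^\rho_{\theta,s}\}} \nabla_{w_t} \ln \pi_w(A|s) - E\Big[ I_{\{Q_\theta(s,A) \ge f^\rho_{\theta,s}\}} \nabla_{w_t} \ln \pi_w(A|s) \Big] \Big\| \to 0 \quad \text{a.s.}$$

The above result implies that the sample average converges at a rate $O(N_t^{\alpha'})$, where $0 < \alpha' < c_3/c_4$, independent of $w, w' \in W$. By Lemma 1, we have the sample quantiles $\hat{f}^\rho_t$ also converging to the true quantile at a rate $O(N_t^\alpha)$ independent of $w, w' \in W$. Now the claim follows directly from Assumption 6 (ii) and the bounded convergence theorem. ■

B.5 SUPPORTING LEMMAS FOR LEMMA 1

Lemma 3. Let Assumption 5 hold. For $\theta \in \Theta$, $w' \in W$, $s \in S$ and $\ell \in [Q^\theta_l, Q^\theta_u]$, we have

1. $E_{A \sim \pi_{w'}(\cdot|s)}[\Psi(Q_\theta(s, A), \ell)]$ is Lipschitz continuous.

2. $\frac{1}{N} \sum_{A \in \Xi,\, |\Xi| = N} \Psi(Q_\theta(s, A), \ell)$ (with $\Xi \overset{iid}{\sim} \pi_{w'}(\cdot|s)$) is Lipschitz continuous with Lipschitz constant independent of the sample length $N$.

Proof. Let $\ell_1, \ell_2 \in [Q^\theta_l, Q^\theta_u]$, $\ell_2 \ge \ell_1$. By Assumption 5 we have $P_{A \sim \pi_{w'}(\cdot|s)}(Q_\theta(s, A) \ge \ell_1) > 0$ and $P_{A \sim \pi_{w'}(\cdot|s)}(Q_\theta(s, A) \ge \ell_2) > 0$.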
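Now, writing $Q \doteq Q_\theta(s, A)$ for brevity,

$$\begin{aligned}
& E_{A \sim \pi_{w'}(\cdot|s)}[\Psi(Q, \ell_1)] - E_{A \sim \pi_{w'}(\cdot|s)}[\Psi(Q, \ell_2)] \\
&= E_{A \sim \pi_{w'}(\cdot|s)}\big[ (Q - \ell_1)(1 - \rho) I_{\{Q \ge \ell_1\}} + (\ell_1 - Q)\rho I_{\{\ell_1 \ge Q\}} \big] - E_{A \sim \pi_{w'}(\cdot|s)}\big[ (Q - \ell_2)(1 - \rho) I_{\{Q \ge \ell_2\}} + (\ell_2 - Q)\rho I_{\{\ell_2 \ge Q\}} \big] \\
&= E_{A \sim \pi_{w'}(\cdot|s)}\big[ (Q - \ell_1)(1 - \rho) I_{\{Q \ge \ell_1\}} + (\ell_1 - Q)\rho I_{\{\ell_1 \ge Q\}} - (Q - \ell_2)(1 - \rho) I_{\{Q \ge \ell_2\}} - (\ell_2 - Q)\rho I_{\{\ell_2 \ge Q\}} \big] \\
&= E_{A \sim \pi_{w'}(\cdot|s)}\big[ (1 - \rho)(\ell_2 - \ell_1) I_{\{Q \ge \ell_2\}} + \rho(\ell_1 - \ell_2) I_{\{Q \le \ell_1\}} + \big( -(1 - \rho)\ell_1 - \rho \ell_2 + \rho Q + (1 - \rho) Q \big) I_{\{\ell_1 \le Q \le \ell_2\}} \big] \\
&\le (1 - \rho)|\ell_2 - \ell_1| + (2\rho + 1)|\ell_2 - \ell_1| = (\rho + 2)|\ell_2 - \ell_1|.
\end{aligned}$$

Similarly, we can prove the latter claim also. This completes the proof of Lemma 3.

Lemma 4. For $\theta \in \Theta$, $w' \in W$, $s \in S$ and $\ell \in [Q^\theta_l, Q^\theta_u]$, we have that $E_{A \sim \pi_{w'}(\cdot|s)}[\Psi(Q_\theta(s, A), \ell)]$ and $\frac{1}{N} \sum_{A \in \Xi,\, |\Xi| = N} \Psi(Q_\theta(s, A), \ell)$ (with $\Xi \overset{iid}{\sim} \pi_{w'}(\cdot|s)$) are strictly convex.

Proof. For $\lambda \in [0, 1]$ and $\ell_1, \ell_2 \in [Q_l, Q_u]$ with $\ell_1 \le \ell_2$, we have

$$\begin{aligned}
& E_{A \sim \pi_{w'}(\cdot|S)}\big[ \Psi(Q_\theta(S, A), \lambda \ell_1 + (1 - \lambda)\ell_2) \big] \\
&= E_{A \sim \pi_{w'}(\cdot|S)}\big[ (1 - \rho)\big( Q_\theta(S, A) - \lambda \ell_1 - (1 - \lambda)\ell_2 \big) I_{\{Q_\theta(S,A) \ge \lambda \ell_1 + (1 - \lambda)\ell_2\}} + \rho \big( \lambda \ell_1 + (1 - \lambda)\ell_2 - Q_\theta(S, A) \big) I_{\{Q_\theta(S,A) \le \lambda \ell_1 + (1 - \lambda)\ell_2\}} \big].
\end{aligned} \tag{28}$$

Notice that

$$\big( Q_\theta(S, A) - \lambda \ell_1 - (1 - \lambda)\ell_2 \big) I_{\{Q_\theta(S,A) \ge \lambda \ell_1 + (1 - \lambda)\ell_2\}} = \big( \lambda Q_\theta(S, A) - \lambda \ell_1 + (1 - \lambda) Q_\theta(S, A) - (1 - \lambda)\ell_2 \big) I_{\{Q_\theta(S,A) \ge \lambda \ell_1 + (1 - \lambda)\ell_2\}}.$$

We consider how one of these components simplifies.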
Writing $Q \doteq Q_\theta(S, A)$ and $E \doteq E_{A \sim \pi_{w'}(\cdot|S)}$ for brevity,

$$\begin{aligned}
E\big[ (\lambda Q - \lambda \ell_1) I_{\{Q \ge \lambda \ell_1 + (1 - \lambda)\ell_2\}} \big] &= \lambda E\big[ (Q - \ell_1) I_{\{Q \ge \lambda \ell_1\}} - (Q - \ell_1) I_{\{\lambda \ell_1 \le Q \le \lambda \ell_1 + (1 - \lambda)\ell_2\}} \big] \\
&\le \lambda E\big[ (Q - \ell_1) I_{\{Q \ge \lambda \ell_1\}} \big] && \triangleright\ -(Q - \ell_1) \le 0 \text{ for } \lambda \ell_1 \le Q \le \lambda \ell_1 + (1 - \lambda)\ell_2 \\
&\le \lambda E\big[ (Q - \ell_1) I_{\{Q \ge \ell_1\}} \big] && \triangleright\ (Q - \ell_1) \le 0 \text{ for } \lambda \ell_1 \le Q \le \ell_1.
\end{aligned}$$

Similarly, we get

$$\begin{aligned}
E\big[ (Q - \ell_2) I_{\{Q \ge \lambda \ell_1 + (1 - \lambda)\ell_2\}} \big] &\le E\big[ (Q - \ell_2) I_{\{Q \ge \ell_2\}} \big], \\
E\big[ (\ell_1 - Q) I_{\{Q \le \lambda \ell_1 + (1 - \lambda)\ell_2\}} \big] &\le E\big[ (\ell_1 - Q) I_{\{Q \le \ell_1\}} \big], \\
E\big[ (\ell_2 - Q) I_{\{Q \le \lambda \ell_1 + (1 - \lambda)\ell_2\}} \big] &\le E\big[ (\ell_2 - Q) I_{\{Q \le \ell_2\}} \big].
\end{aligned}$$

Therefore, for Equation (28), we get

$$\begin{aligned}
(28) &\le \lambda (1 - \rho) E\big[ (Q - \ell_1) I_{\{Q \ge \ell_1\}} \big] + (1 - \lambda)(1 - \rho) E\big[ (Q - \ell_2) I_{\{Q \ge \ell_2\}} \big] + \lambda \rho E\big[ (\ell_1 - Q) I_{\{Q \le \ell_1\}} \big] + (1 - \lambda)\rho E\big[ (\ell_2 - Q) I_{\{Q \le \ell_2\}} \big] \\
&= \lambda E[\Psi(Q, \ell_1)] + (1 - \lambda) E[\Psi(Q, \ell_2)].
\end{aligned}$$

We can prove the second claim similarly. This completes the proof of Lemma 4.

Proof. Let $c = \liminf_n x^*_n$. We employ proof by contradiction here. For that, we assume $x^* > c$. Now, note that $f(x^*) < f(c)$ and $f(x^*) < f((x^* + c)/2)$ (by the definition of $x^*$). Also, by the strict convexity of $f$, we have $f((x^* + c)/2) < (f(x^*) + f(c))/2 < f(c)$. Therefore, we have

$$f(c) > f((x^* + c)/2) > f(x^*).$$

Let $r_1 \in \mathbb{R}$ be such that $f(c) > r_1 > f((x^* + c)/2)$. Now, since $\|f_n - f\|_\infty \to 0$ as $n \to \infty$, there exists a positive integer $N$ s.t. $|f_n(c) - f(c)| < f(c) - r_1$, $\forall n \ge N$. Therefore, $f_n(c) - f(c) > r_1 - f(c) \Rightarrow f_n(c) > r_1$. Similarly, we can show that $f_n((x^* + c)/2) < r_1$. Therefore, we have $f_n(c) > f_n((x^* + c)/2)$. Similarly, we can show that $f_n((x^* + c)/2) > f_n(x^*)$.
Finally, we obtain

$$f_n(c) > f_n((x^* + c)/2) > f_n(x^*), \quad \forall n \ge N.$$

Now, by the extreme value theorem for continuous functions, we obtain that for $n \ge N$, $f_n$ achieves its minimum (say at $x_p$) in the closed interval $[c, (x^* + c)/2]$. Note that $f_n(x_p) \nless f_n((x^* + c)/2)$ (if so then $f_n(x_p)$ will be a local minimum of $f_n$ since $f_n(x^*) < f_n((x^* + c)/2)$). Also, $f_n(x_p) \ngtr f_n((x^* + c)/2)$, since $x_p$ minimizes $f_n$ over this interval. Therefore, $f_n$ achieves its minimum in the closed interval $[c, (x^* + c)/2]$ at the point $(x^* + c)/2$. This further implies that $x^*_n > (x^* + c)/2$. Therefore, $\liminf_n x^*_n \ge (x^* + c)/2 \Rightarrow c \ge (x^* + c)/2 \Rightarrow c \ge x^*$. This is a contradiction and implies

$$\liminf_n x^*_n \ge x^*. \tag{31}$$

Now consider $g_n(x) = f_n(-x)$. Note that $g_n$ is also continuous and strictly convex. Indeed, for $\lambda \in [0, 1]$, we have

$$g_n(\lambda x_1 + (1 - \lambda) x_2) = f_n(-\lambda x_1 - (1 - \lambda) x_2) < \lambda f_n(-x_1) + (1 - \lambda) f_n(-x_2) = \lambda g_n(x_1) + (1 - \lambda) g_n(x_2).$$

Applying the result from Eq. (31) to the sequence $\{g_n\}_{n \in \mathbb{N}}$, we obtain that $\liminf_n (-x^*_n) \ge -x^*$. This implies $\limsup_n x^*_n \le x^*$. Therefore,

$$\liminf_n x^*_n \ge x^* \ge \limsup_n x^*_n \ge \liminf_n x^*_n.$$

Hence, $\liminf_n x^*_n = \limsup_n x^*_n = x^*$.

C EXPERIMENTAL DETAILS

C.1 HYPERPARAMETER DETAILS

In this section, we outline the tuned hyperparameters for each algorithm on each environment in our experiments. For each algorithm, hyperparameters were tuned over an initial 10 runs with different random seeds. Each algorithm saw the same 10 initial random seeds. For a list of all hyperparameters swept, see Section 5.3. In Table 1, we list the tuned hyperparameters for each algorithm when tuning across continuous-action environments. In Table 2, we list the tuned hyperparameters for each algorithm when tuning across discrete-action environments. In Tables 3, 4, and 5, we list the tuned hyperparameters when tuning per-environment for GreedyAC, VanillaAC, and SAC respectively. Finally, Table 6 outlines the hyperparameters used in the experiments on Swimmer.

C.2 NORMALIZATION APPROACH

For each environment, we find the best return achieved by any agent, across all runs, as a simple approximation to a near-optimal return. Table 7 lists these returns for each environment. Then, to obtain a normalized score, we use $1 - \frac{\text{BestValue} - \text{AlgValue}}{|\text{BestValue}|}$, where the numerator is guaranteed to be nonnegative. If AlgValue = BestValue, we get the highest value of 1. If AlgValue is half of BestValue (with BestValue positive), we get $1 - \frac{0.5\,\text{BestValue}}{|\text{BestValue}|} = 0.5$. If AlgValue is significantly worse than BestValue, the score is much lower. The AlgValue that we normalize is the point depicted on the sensitivity plot. It corresponds to the Average Return across timesteps and across runs for the algorithm, with that hyperparameter setting in that environment.

For the experiments in Figure 3, where we tune across the complete set of discrete- or continuous-action environments, we first compute the normalized scores just described. Then, we compute the average normalized scores for each algorithm and hyperparameter setting across discrete-action and continuous-action environments separately. We then choose the hyperparameter setting for each algorithm for the discrete- and continuous-action environments based on the hyperparameter setting which resulted in the highest normalized scores. The learning curves for these hyperparameters, for each algorithm, are shown in Figure 3.
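As a minimal sketch of the normalization and across-environment selection described above (not the authors' code), the following Python computes the normalized score and picks the hyperparameter setting with the highest average score. The function names and the dictionary layout are hypothetical, and the average returns in the usage example are made up for illustration; only the two best returns are taken from Table 7.

```python
def normalized_score(alg_value, best_value):
    """Return 1 - (BestValue - AlgValue) / |BestValue|."""
    if best_value == 0:
        raise ValueError("BestValue must be nonzero for this normalization")
    return 1.0 - (best_value - alg_value) / abs(best_value)


def best_setting_across_envs(avg_returns, best_returns):
    """Pick the setting with the highest mean normalized score.

    avg_returns: {setting: {env: average return}}  (hypothetical layout)
    best_returns: {env: best return achieved by any agent on env}
    """
    mean_scores = {}
    for setting, per_env in avg_returns.items():
        scores = [normalized_score(value, best_returns[env])
                  for env, value in per_env.items()]
        mean_scores[setting] = sum(scores) / len(scores)
    best = max(mean_scores, key=mean_scores.get)
    return best, mean_scores


# Best returns for continuous-action Acrobot and Pendulum, from Table 7.
best_returns = {"Acrobot": -56.0, "Pendulum": 930.0}
# Made-up average returns for two hypothetical hyperparameter settings.
avg_returns = {
    "setting_a": {"Acrobot": -112.0, "Pendulum": 465.0},
    "setting_b": {"Acrobot": -56.0, "Pendulum": 744.0},
}
best, scores = best_setting_across_envs(avg_returns, best_returns)
print(best, scores)
```

Dividing by the absolute value keeps the score at most 1 for environments with negative returns, such as Acrobot, as well as for those with positive returns.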

Environment     Continuous   Discrete
Acrobot         -56          -56
Mountain Car    -65          -83
Pendulum        930          932

Table 7: Approximate return achieved by an optimal policy. We approximate the return achievable by a near-optimal policy on environment E by finding the highest return achieved over all runs of all hyperparameters and all agents on environment E.

C.3 SENSITIVITY PLOTS

We plot parameter sensitivity curves, which include a line for each entropy scale, with the stepsize on the x-axis. Because there are two stepsizes, we have two sets of plots: one for the critic stepsize and one for the actor stepsize. When examining the sensitivity to the critic stepsize, we select the corresponding best actor stepsize. This means that for each point (critic stepsize, entropy scale) = $(\alpha, \tau)$ on the sensitivity plot for the critic stepsize, we find the best actor stepsize and report the performance for that triplet averaged over all 40 runs. We follow the same procedure when plotting the actor stepsize on the x-axis, but maximize over the critic stepsize.

Modern variants of SAC utilize a trick to automatically adapt the entropy scale hyperparameter during training (Haarnoja et al., 2018b). To decide which variant of SAC to use in this work, we performed an ablation study of SAC with and without automatic entropy tuning. We ran SAC with automatic entropy tuning for 10 runs. Hyperparameters were swept in the same sets as listed in Section 5.3. Additionally, we swept entropy scale step-sizes $\beta = 10^z$ for $z \in \{-4, -3, -2\}$ for automatic entropy tuning. Figure 8 shows the learning curves of SAC with automatic entropy tuning, over 10 runs, and SAC without automatic entropy tuning, over the 40 runs conducted for the experiments in the main text. As can be seen in the figure, performing a grid search over the entropy scale hyperparameter never degrades performance compared to using automatic entropy tuning, and in some cases results in better performance. Because of this, we decided to use manual entropy tuning through a grid search in our experiments, which also allows us to characterize the sensitivity of SAC's performance with respect to the entropy scale hyperparameter.
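The selection procedure behind the critic-stepsize sensitivity curves in this section can be sketched as follows. This is an illustrative sketch under assumed data structures, not the authors' plotting code: the function name and the layout of `results` (a dictionary keyed by critic stepsize, actor stepsize, and entropy scale) are hypothetical.

```python
def critic_sensitivity_curves(results):
    """Build one curve per entropy scale, maximizing over the actor stepsize.

    results: {(critic_stepsize, actor_stepsize, entropy_scale): [return per run]}
    Returns {entropy_scale: [(critic_stepsize, best mean return), ...]},
    with each curve sorted by critic stepsize.
    """
    best = {}
    for (critic, actor, tau), runs in results.items():
        mean_return = sum(runs) / len(runs)  # average over runs
        key = (tau, critic)
        # Keep the best actor stepsize for this (entropy scale, critic stepsize).
        if key not in best or mean_return > best[key]:
            best[key] = mean_return

    curves = {}
    for (tau, critic), value in best.items():
        curves.setdefault(tau, []).append((critic, value))
    for points in curves.values():
        points.sort()
    return curves
```

Swapping the roles of the critic and actor stepsizes in the key gives the corresponding curves for the actor stepsize.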



CEM has been used for policy optimization, but for two very different purposes. It has been used to directly optimize the policy gradient objective (Mannor et al., 2003; Szita & Lörincz, 2006). CEM has also been used to solve for the maximal action, running CEM each time we want to find $\max_{a'} Q(S', a')$, for an algorithm called QT-Opt (Kalashnikov et al., 2018). A follow-up algorithm adds an explicit deterministic policy to minimize a squared error to this maximal action (Simmons-Edler et al., 2019) and another updates the actor with this action rather than the on-policy action (Shao et al., 2022). We do not directly use CEM, but rather extend the idea underlying CEM to provide a new policy update.

See https://spinningup.openai.com/en/latest/spinningup/bench.html

For a distribution in NEF, there may exist multiple representations of the form (8). However, for the distribution, there definitely exists a representation where the components of the sufficient statistic are linearly independent, and such a representation is referred to as minimal.
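For readers unfamiliar with CEM as an action maximizer, the following is a generic one-dimensional sketch of the idea of approximating a maximal action for a fixed state by repeatedly sampling actions and refitting a Gaussian to the top-valued samples. It is not the QT-Opt implementation and not the CCEM update of this paper; the function name, the Gaussian proposal, and all default parameters are assumptions chosen for illustration.

```python
import random


def cem_maximize(q, mean=0.0, std=1.0, n_samples=100, n_elite=10,
                 n_iters=20, seed=0):
    """Approximate argmax_a q(a) over scalar actions with a basic CEM loop."""
    rng = random.Random(seed)
    for _ in range(n_iters):
        # Sample candidate actions from the current Gaussian proposal.
        actions = [rng.gauss(mean, std) for _ in range(n_samples)]
        # Keep the top-valued (elite) actions.
        elite = sorted(actions, key=q, reverse=True)[:n_elite]
        # Refit the Gaussian to the elite set by maximum likelihood.
        mean = sum(elite) / n_elite
        var = sum((a - mean) ** 2 for a in elite) / n_elite
        std = max(var ** 0.5, 1e-6)  # floor avoids a degenerate proposal
    return mean


# Toy action-value function for a fixed state, maximized at a = 0.7.
approx_action = cem_maximize(lambda a: -(a - 0.7) ** 2)
print(approx_action)
```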



1]. Assume: (1) States $S_t$ are sampled from a fixed marginal distribution. (2) $\nabla_w \ln \pi_w(\cdot|s)$ is locally Lipschitz w.r.t. $w$, $\forall s \in \mathcal{S}$. (3) Parameters $w_t$ and $\theta_t$ remain bounded almost surely. (4) Stepsizes are chosen for three different timescales: $w_t$ evolves faster than $w'$

Figure 2: Learning curves when tuning hyperparameters per-environment, averaged over 30 runs with standard errors.

Figure 3: Learning curves when tuning hyperparameters across environments, averaged over 30 runs with standard errors.

Figure 4: A sensitivity region plot for entropy, for GreedyAC (top row) and SAC (bottom row) in the continuous action problems.

Figure 4 depicts the range of performance obtained across entropy scales. The plot is generated by filling in the region between the curves for each entropy scale. If this sensitivity region is broad, then the algorithm performed very differently across different entropy scales and so is sensitive to the entropy. SAC has much wider sensitivity regions than GreedyAC. Those of GreedyAC are generally narrow, indicating that the stepsize rather than entropy was the dominant factor. Further, the bands of performance are generally at the top of the plot. When SAC exhibits narrower regions than GreedyAC, those regions are lower on the plot, indicating overall poor performance.

6 SCALING GREEDY-AC

Let Assumption 5 hold. Then, for θ

Let $\{f_n \in C(\mathbb{R}, \mathbb{R})\}_{n \in \mathbb{N}}$ be a sequence of strictly convex, continuous functions converging uniformly to a strictly convex function $f$. Let $x^*_n = \arg\min_x f_n(x)$ and $x^* = \arg\min_{x \in \mathbb{R}} f(x)$. Then $\lim_{n \to \infty} x^*_n = x^*$.
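A small numerical illustration of this statement, under assumptions of our own choosing: take $f_n(x) = x^2 + \sin(x)/n$, which is strictly convex for $n \ge 1$ (its second derivative is $2 - \sin(x)/n > 0$) and converges uniformly to $f(x) = x^2$ since $\sup_x |f_n(x) - f(x)| \le 1/n$. The example family, the helper names, and the bisection bracket are hypothetical and serve only to show the minimizers approaching $x^* = 0$.

```python
import math


def argmin_convex(df, lo=-10.0, hi=10.0, tol=1e-12):
    """Find the minimizer of a strictly convex function by bisection
    on its (strictly increasing) derivative df over [lo, hi]."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if df(mid) > 0:
            hi = mid  # minimizer lies to the left of mid
        else:
            lo = mid  # minimizer lies to the right of mid
    return 0.5 * (lo + hi)


def minimizer_of_f_n(n):
    """Minimizer of f_n(x) = x^2 + sin(x)/n, using f_n'(x) = 2x + cos(x)/n."""
    return argmin_convex(lambda x: 2.0 * x + math.cos(x) / n)


for n in (1, 10, 100, 1000):
    print(n, minimizer_of_f_n(n))
```

The printed minimizers shrink toward 0, the minimizer of the limit function, as $n$ grows.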

CA                1.0    1e-3   10.0
Mountain Car-DA   2.0    1e-3   -
Pendulum-CA       1e-1   1e-2   10.0
Pendulum-DA       1.0    1e-3   -

Table 3: Hyperparameters tuned per-environment for GreedyAC.

Hyperparameters Chosen for GreedyAC on Swimmer.

Figure 6: Sensitivity curves for the critic step-size hyperparameter α for GreedyAC and SAC, with one line for each entropy scale tested. The critic step-size is plotted on a logarithmic scale on the x-axis.

Given a realization of the transition dynamics of the MDP in the form of a sequence of transition tuples $O \doteq \{(S_t, A_t, R_t, S'_t)\}_{t \in \mathbb{N}}$, where the state $S_t \in \mathcal{S}$ is drawn using a latent sampling distribution $\nu$, while $A_t \in \mathcal{A}$ is the action chosen at state $S_t$, the transitioned state $\mathcal{S} \ni S'$

