GREEDY ACTOR-CRITIC: A NEW CONDITIONAL CROSS-ENTROPY METHOD FOR POLICY IMPROVEMENT

Abstract

Many policy gradient methods are variants of Actor-Critic (AC), where a value function (critic) is learned to facilitate updating the parameterized policy (actor). The update to the actor involves a log-likelihood update weighted by the action-values, with the addition of entropy regularization for soft variants. In this work, we explore an alternative update for the actor, based on an extension of the cross-entropy method (CEM) to condition on inputs (states). The idea is to start with a broader policy and slowly concentrate around maximally valued actions, using a maximum likelihood update towards actions in the top percentile per state. The speed of this concentration is controlled by a proposal policy, which concentrates at a slower rate than the actor. We first provide a policy improvement result in an idealized setting, and then prove that our Conditional CEM (CCEM) strategy tracks a CEM update per state, even with changing action-values. We empirically show that our GreedyAC algorithm, which uses CCEM for the actor update, performs better than Soft Actor-Critic and is much less sensitive to entropy regularization.

1. INTRODUCTION

Many policy optimization strategies update the policy towards the Boltzmann policy. This strategy was popularized by Soft Q-Learning (Haarnoja et al., 2017) and Soft Actor-Critic (SAC) (Haarnoja et al., 2018a), but has a long history in reinforcement learning (Kober & Peters, 2008; Neumann, 2011). In fact, recent work (Vieillard et al., 2020a; Chan et al., 2021) has highlighted that an even broader variety of policy optimization methods can be seen as optimizing either a forward or reverse KL divergence to the Boltzmann policy, as in SAC. Even the original Actor-Critic (AC) update (Sutton, 1984) can be seen as optimizing a reverse KL divergence, with zero entropy. The use of the Boltzmann policy underlies many methods for good reason: it guarantees policy improvement (Haarnoja et al., 2018a). More specifically, this is the case when learning entropy-regularized action-values Q^π_τ for a policy π with regularization parameter τ > 0. The Boltzmann policy for a state is proportional to exp(Q^π_τ(s, a) / τ). The level of emphasis on high-valued actions is controlled by τ: the higher the magnitude of the entropy level (larger τ), the less the probabilities in the Boltzmann policy are peaked around maximally valued actions. This choice, however, has several limitations. The policy improvement guarantee is for the entropy-regularized MDP, rather than the original MDP. Entropy regularization is used to encourage exploration (Ziebart et al., 2008; Mei et al., 2019) and improve the optimization surface (Ahmed et al., 2019; Shani et al., 2020), resulting in a trade-off between improving the learning process and converging to the optimal policy. Additionally, SAC and other methods are well known to be sensitive to the entropy regularization parameter (Pourchot & Sigaud, 2019).
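To make the role of τ concrete, the following minimal sketch (with hypothetical action-values for a single state) computes the Boltzmann policy proportional to exp(Q^π_τ(s, a) / τ) over a discrete action set, showing how a small τ peaks the distribution on the maximal action while a large τ flattens it towards uniform:

```python
import math

def boltzmann(q_values, tau):
    """Boltzmann policy: probabilities proportional to exp(Q / tau)."""
    # Subtract the max before exponentiating for numerical stability;
    # this leaves the normalized probabilities unchanged.
    m = max(q_values)
    exps = [math.exp((q - m) / tau) for q in q_values]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical action-values for one state with three discrete actions.
q = [1.0, 2.0, 4.0]

low_tau = boltzmann(q, tau=0.1)    # sharply peaked on the maximal action
high_tau = boltzmann(q, tau=10.0)  # nearly uniform over all actions
```

With τ = 0.1 essentially all probability mass sits on the third (maximal) action, while τ = 10 spreads mass almost evenly, illustrating the trade-off described above.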
Prior work has explored optimizing entropy during learning (Haarnoja et al., 2018b); however, this optimization introduces yet another hyperparameter to tune, and this approach may be less performant than a simple grid search (see Appendix D). It is reasonable to investigate alternative policy improvement approaches that could potentially improve our actor-critic algorithms. In this work we propose a new greedification strategy towards this goal. The basic idea is to iteratively take the top percentile of actions, ranked according to the learned action-values. The procedure slowly concentrates on the maximal action(s), across states, for the given action-values. The update itself is simple: N ∈ ℕ actions are sampled according to a proposal policy, the actions are sorted based on the magnitude of the action-values, and the policy is updated to increase the probability of the ⌈ρN⌉ maximally valued actions for ρ ∈ (0, 1). We call this algorithm for the actor Conditional CEM (CCEM), because it is an extension of the well-known Cross-Entropy Method (CEM) (Rubinstein, 1999) to condition on inputs.¹ We leverage theory for CEM to validate that our algorithm concentrates on maximally valued actions across states over time. We introduce GreedyAC, a new AC algorithm that uses CCEM for the actor. GreedyAC has several advantages over using Boltzmann greedification. First, we show that our new greedification operator ensures a policy improvement for the original MDP, rather than a different entropy-regularized MDP. Second, we can still leverage entropy to prevent policy collapse, but only incorporate it into the proposal policy. This ensures the agent considers potentially optimal actions for longer, but does not skew the actor.
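The percentile step just described can be sketched in a few lines. The example below is illustrative only: `q_func` and `proposal_sample` are hypothetical stand-ins for the learned action-values and the proposal policy, and the returned elite set is what the actor's maximum likelihood update would target.

```python
import math
import random

def ccem_elite_actions(state, q_func, proposal_sample, n=50, rho=0.1):
    """One CCEM percentile step for a single state (illustrative sketch)."""
    # Sample N candidate actions from the proposal policy for this state.
    actions = [proposal_sample(state) for _ in range(n)]
    # Rank the candidates by the learned action-values, highest first.
    ranked = sorted(actions, key=lambda a: q_func(state, a), reverse=True)
    # Keep the top ceil(rho * N) "elite" actions; the actor is updated
    # to increase its log-likelihood of exactly these actions.
    return ranked[:math.ceil(rho * n)]

# Toy 1-d action space where Q(s, a) = -(a - 0.5)^2, so actions near
# 0.5 are maximally valued (both functions are hypothetical).
random.seed(0)
q = lambda s, a: -(a - 0.5) ** 2
proposal = lambda s: random.uniform(-1.0, 1.0)
elites = ccem_elite_actions(None, q, proposal, n=50, rho=0.1)
```

Repeating this step while the actor (and, more slowly, the proposal policy) concentrates on the elite set is what drives the concentration on maximally valued actions per state.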
In fact, it is possible to decouple the role of entropy for exploration and policy collapse within GreedyAC: the actor could have a small amount of entropy to encourage exploration, and the proposal policy a higher level of entropy to avoid policy collapse. Potentially because of this decoupling, we find that GreedyAC is much less sensitive to the choice of entropy regularizer than SAC. This design may also help the algorithm avoid getting stuck in a locally optimal action, and empirical evidence for CEM suggests it can be quite effective for this purpose (Rubinstein & Kroese, 2004). In addition to our theoretical support for CCEM, we provide an empirical investigation comparing GreedyAC, SAC, and a vanilla AC, highlighting that GreedyAC performs consistently well, even in problems like the MuJoCo environment Swimmer and pixel-based control where SAC performs poorly.

2. BACKGROUND AND PROBLEM FORMULATION

The interaction between the agent and environment is formalized by a Markov decision process (S, A, P, R, γ), where S is the state space, A is the action space, P : S × A × S → [0, ∞) is the one-step state transition dynamics, R : S × A × S → ℝ is the reward function, and γ ∈ [0, 1] is the discount rate. We assume an episodic problem setting, where the start state S_0 ∼ d_0 for start state distribution d_0 : S → [0, ∞) and the length of the episode T is random, depending on when the agent reaches termination. At each discrete timestep t = 1, 2, . . . , T, the agent finds itself in some state S_t and selects an action A_t drawn from its stochastic policy π : S × A → [0, ∞). The agent then transitions to state S_{t+1} according to P and observes a scalar reward R_{t+1} ≐ R(S_t, A_t, S_{t+1}). For a parameterized policy π_w with parameters w, the agent attempts to maximize the objective J(w) = E_{π_w}[ Σ_{t=0}^{T} γ^t R_{t+1} ], where the expectation is according to start state distribution d_0, transition dynamics P, and policy π_w. Policy gradient methods, like REINFORCE (Williams, 1992), attempt to obtain (unbiased) estimates of the gradient of this objective to directly update the policy. The difficulty is that the policy gradient is expensive to sample, because it requires sampling return trajectories from states sampled from the visitation distribution under π_w, as per the policy gradient theorem (Sutton et al., 1999). Theory papers analyze such an idealized algorithm (Kakade & Langford, 2002; Agarwal et al., 2021), but in practice this strategy is rarely used. Instead, it is much more common to (a) ignore bias in the state distribution (Thomas, 2014; Imani et al., 2018; Nota & Thomas, 2020) and (b) use biased estimates of the return, in the form of a value function critic. The action-value function Q^π(s, a) ≐ E_π[ Σ_{k=1}^{T−t} γ^{k−1} R_{t+k} | S_t = s, A_t = a ] is the expected return from a given state and action, when following policy π.
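The objective J(w) above is simply the expected discounted sum of rewards along an episode. As a quick numerical illustration, the discounted return for one episode's observed reward sequence can be computed directly:

```python
def discounted_return(rewards, gamma):
    """Compute sum_{t=0}^{T} gamma^t * R_{t+1} for one episode,
    where rewards[t] is the reward R_{t+1} received at step t."""
    g = 0.0
    for t, r in enumerate(rewards):
        g += (gamma ** t) * r
    return g

# Three rewards of 1 with gamma = 0.5 give 1 + 0.5 + 0.25 = 1.75.
g = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

Averaging this quantity over episodes generated by π_w gives a Monte Carlo estimate of J(w); the critic Q_θ discussed next replaces these sampled returns with a learned (biased) estimate.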
Many PG methods, specifically variants of Actor-Critic, estimate these action-values with a parameterized Q_θ(s, a), to use the update Q_θ(s, a) ∇ ln π_w(a|s), or one with a baseline, [Q_θ(s, a) − V(s)] ∇ ln π_w(a|s), where the value function V(s) is also typically learned. For the update, the state s is sampled from a replay buffer and a ∼ π_w(·|s).
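For intuition, the baseline-weighted log-likelihood update above can be written out for the simplest case: a tabular softmax policy over a few discrete actions, where ∇ ln π_w(a|s) with respect to preference i is (1{i = a} − π(i)). This is a minimal sketch, not the parameterization used later in the paper:

```python
import math

def softmax(prefs):
    """Softmax policy over action preferences for one state."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def ac_update(prefs, a, advantage, lr=0.1):
    """One actor step: prefs += lr * advantage * grad ln pi(a|s).
    For tabular softmax preferences, the gradient of ln pi(a|s)
    with respect to preference i is (1{i == a} - pi(i))."""
    pi = softmax(prefs)
    return [p + lr * advantage * ((1.0 if i == a else 0.0) - pi[i])
            for i, p in enumerate(prefs)]

prefs = [0.0, 0.0, 0.0]
# Suppose action 1 had advantage Q(s, 1) - V(s) = 2.0 (hypothetical):
new_prefs = ac_update(prefs, a=1, advantage=2.0)
```

A positive advantage increases the probability of the sampled action; a negative one decreases it, which is exactly the behavior the weighted log-likelihood update is designed to produce.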



¹ Code available at https://github.com/samuelfneumann/GreedyAC. CEM has been used for policy optimization before, but for two very different purposes. It has been used to directly optimize the policy gradient objective (Mannor et al., 2003; Szita & Lörincz, 2006). CEM has also been used to solve for the maximal action, running CEM each time max_{a'} Q(S', a') is needed, in an algorithm called QT-Opt (Kalashnikov et al., 2018). A follow-up algorithm adds an explicit deterministic policy that minimizes a squared error to this maximal action (Simmons-Edler et al., 2019), and another updates the actor with this action rather than the on-policy action (Shao et al., 2022). We do not directly use CEM, but rather extend the idea underlying CEM to provide a new policy update.

