GREEDY ACTOR-CRITIC: A NEW CONDITIONAL CROSS-ENTROPY METHOD FOR POLICY IMPROVEMENT

Abstract

Many policy gradient methods are variants of Actor-Critic (AC), where a value function (critic) is learned to facilitate updating the parameterized policy (actor). The update to the actor involves a log-likelihood update weighted by the action-values, with the addition of entropy regularization for soft variants. In this work, we explore an alternative update for the actor, based on an extension of the cross-entropy method (CEM) to condition on inputs (states). The idea is to start with a broader policy and slowly concentrate around maximally valued actions, using a maximum likelihood update towards actions in the top percentile per state. The speed of this concentration is controlled by a proposal policy that concentrates at a slower rate than the actor. We first provide a policy improvement result in an idealized setting, and then prove that our conditional CEM (CCEM) strategy tracks a CEM update per state, even with changing action-values. We empirically show that our GreedyAC algorithm, which uses CCEM for the actor update, performs better than Soft Actor-Critic and is much less sensitive to entropy regularization.

1. INTRODUCTION

Many policy optimization strategies update the policy towards the Boltzmann policy. This strategy was popularized by Soft Q-Learning (Haarnoja et al., 2017) and Soft Actor-Critic (SAC) (Haarnoja et al., 2018a), but has a long history in reinforcement learning (Kober & Peters, 2008; Neumann, 2011). Recent work (Vieillard et al., 2020a; Chan et al., 2021) has highlighted that an even broader variety of policy optimization methods can be seen as optimizing either a forward or reverse KL divergence to the Boltzmann policy, as in SAC. In fact, even the original Actor-Critic (AC) update (Sutton, 1984) can be seen as optimizing a reverse KL divergence, with zero entropy. The Boltzmann policy underlies many methods for good reason: it guarantees policy improvement (Haarnoja et al., 2018a). More specifically, this is the case when learning entropy-regularized action-values Q^π_τ for a policy π with regularization parameter τ > 0. The Boltzmann policy for a state is proportional to exp(Q^π_τ(s, a) / τ). The level of emphasis on high-valued actions is controlled by τ: the larger the entropy regularization (larger τ), the less the probabilities in the Boltzmann policy are peaked around maximally valued actions.

This choice, however, has several limitations. The policy improvement guarantee holds for the entropy-regularized MDP, rather than the original MDP. Entropy regularization is used to encourage exploration (Ziebart et al., 2008; Mei et al., 2019) and improve the optimization surface (Ahmed et al., 2019; Shani et al., 2020), resulting in a trade-off between improving the learning process and converging to the optimal policy. Additionally, SAC and other methods are well known to be sensitive to the entropy regularization parameter (Pourchot & Sigaud, 2019).
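To make the role of τ concrete, the following is a minimal sketch (not from the paper's codebase) of computing the Boltzmann policy over a discrete set of action-values, using the standard max-subtraction trick for numerical stability:

```python
import numpy as np

def boltzmann_policy(q_values, tau):
    """Action probabilities proportional to exp(Q(s, a) / tau)."""
    logits = np.asarray(q_values, dtype=np.float64) / tau
    logits -= logits.max()  # stability: exponentiate non-positive values only
    probs = np.exp(logits)
    return probs / probs.sum()

q = [1.0, 2.0, 3.0]
# Large tau: near-uniform probabilities; small tau: peaked at the argmax.
print(boltzmann_policy(q, tau=10.0))
print(boltzmann_policy(q, tau=0.1))
```

With τ = 10 the three probabilities differ by less than 0.07, while with τ = 0.1 nearly all mass sits on the highest-valued action, illustrating the exploration/greediness trade-off the text describes.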
Prior work has explored optimizing the entropy scale during learning (Haarnoja et al., 2018b); however, this optimization introduces yet another hyperparameter to tune, and the approach may be less performant than a simple grid search (see Appendix D). It is therefore reasonable to investigate alternative policy improvement approaches that could improve our actor-critic algorithms. Code available at https://github.com/samuelfneumann/GreedyAC.
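The percentile-based actor update motivated above can be sketched for a single state as follows. This is an illustrative toy, not the paper's implementation: the names `proposal_sample` and `q_func` are placeholders, the proposal is an arbitrary sampler, and only the selection of "elite" actions (the maximum-likelihood targets) is shown, not the actor's gradient step.

```python
import numpy as np

def ccem_elite_actions(state, proposal_sample, q_func, rho=0.1):
    """One conditional-CEM selection step for a single state (sketch).

    Samples actions from a broader proposal policy, evaluates them with
    the critic, and keeps the top rho-fraction by action-value. These
    elite actions serve as targets for a maximum likelihood update to
    the actor, concentrating it around highly valued actions.
    """
    actions = proposal_sample(state)                  # shape (N, action_dim)
    values = np.array([q_func(state, a) for a in actions])
    n_elite = max(1, int(rho * len(actions)))
    elite_idx = np.argsort(values)[-n_elite:]         # highest-valued actions
    return actions[elite_idx]

# Toy check: Q prefers actions near 0, so elites cluster near 0.
rng = np.random.default_rng(0)
proposal = lambda s: rng.normal(0.0, 2.0, size=(100, 1))
q = lambda s, a: -float(a[0] ** 2)
elites = ccem_elite_actions(None, proposal, q)
print(elites.shape)  # (10, 1)
```

Repeating this per sampled state, with the proposal policy concentrating more slowly than the actor, is the conditional extension of CEM that the paper develops.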

