CASA: BRIDGING THE GAP BETWEEN POLICY IMPROVEMENT AND POLICY EVALUATION WITH CONFLICT AVERSE POLICY ITERATION

Abstract

We study the problem of model-free reinforcement learning, which is often solved following the principle of Generalized Policy Iteration (GPI). While GPI is typically an interplay between policy evaluation and policy improvement, most conventional model-free methods with function approximation assume the independence of the GPI steps, despite the inherent connections between them. In this paper, we present a method that attempts to eliminate the inconsistency between the policy evaluation step and the policy improvement step, leading to a conflict averse GPI solution with gradient-based function approximation. Our method balances exploitation and exploration between policy-based and value-based methods, and is applicable to existing methods of both families. We conduct extensive experiments to study the theoretical properties of our method and demonstrate its effectiveness on the Atari 200M benchmark.

1. INTRODUCTION

Model-free reinforcement learning has made many impressive breakthroughs in a wide range of Markov Decision Processes (MDPs) (Vinyals et al., 2019; Pedersen, 2019; Badia et al., 2020). Broadly, these methods fall into two categories: value-based methods such as DQN (Mnih et al., 2015) and Rainbow (Hessel et al., 2017), and policy-based methods such as TRPO (Schulman et al., 2015), PPO (Schulman et al., 2017), and IMPALA (Espeholt et al., 2018). Value-based methods learn state-action values and select actions according to those values. Their main target is to approximate the fixed point of the Bellman equation through generalized policy iteration (GPI) (Sutton & Barto, 2018), which generally consists of policy evaluation and policy improvement. One characteristic of value-based methods is that the policy is not improved until a more accurate state-action value estimate has been obtained through iterations of policy evaluation. Previous works equip value-based methods with carefully designed structures to achieve more promising reward learning and sample efficiency (Wang et al., 2016; Schaul et al., 2015; Kapturowski et al., 2018). Policy-based methods learn a parameterized policy directly, without consulting state-action values. One characteristic of policy-based methods is that they incorporate a policy improvement phase in every training step, whereas value-based methods only change the policy once the action with the highest state-action value changes. In principle, policy-based methods therefore perform policy improvement more frequently than value-based methods. We observe that value-based and policy-based methods sit at the two extremes of GPI: value-based methods will not improve the policy until a more accurate policy evaluation is achieved, while policy-based methods improve the policy at every training step, even when the policy evaluation has not converged.
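The contrast between the two extremes can be seen in a toy numerical sketch (all numbers are hypothetical): a small policy-evaluation update typically leaves the greedy action, and hence the value-based policy, unchanged, while a single policy-gradient step shifts the action distribution immediately.

```python
import numpy as np

# One state, two actions; numbers are illustrative only.
Q = np.array([1.0, 3.0])              # state-action value estimates
greedy_before = Q.argmax()
Q[0] += 0.1                            # a small policy-evaluation update
greedy_after = Q.argmax()
print(greedy_before == greedy_after)   # True: greedy policy did not move

theta = np.array([1.0, 3.0])           # policy logits for the same state
pi = np.exp(theta) / np.exp(theta).sum()
theta[0] += 0.1                        # one policy-gradient step on the logits
pi_new = np.exp(theta) / np.exp(theta).sum()
print(np.allclose(pi, pi_new))         # False: the policy changed immediately
```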
To mitigate the defects of each, we pursue a technique capable of balancing flexibly between the two extremes. We first study the gradients of policy improvement and policy evaluation and observe that they are positively correlated statistically throughout training. To investigate whether the gradients of policy improvement and policy evaluation can be made parallel, we propose CASA, Critic AS an Actor, which satisfies a weaker compatibility condition (Sutton et al., 1999) and enhances gradient consistency between policy improvement and policy evaluation. Delving further into its properties, we find that CASA is an innovative combination of value-based and policy-based methods. When policy-based methods are equipped with CASA, the collapse to a sub-optimal solution as the entropy goes to zero is prevented by the evaluation of the state-action values, which encourages exploration. When value-based methods are equipped with CASA, policy improvement via the policy gradient is equivalent to the evaluation of the state-action values plus a self-bootstrapped policy improvement, which enhances exploitation. To enable CASA for large-scale off-policy learning, we introduce the Doubly-Robust Trace (DR-Trace), which exploits the doubly-robust estimator (Jiang & Li, 2016) and guarantees the synchronous convergence of the state-action values and the state values. Our main contributions are as follows: (i) We present a novel method, CASA, which enhances gradient consistency between policy evaluation and policy improvement, together with extensive studies of the behavior of the gradients. (ii) We demonstrate that CASA can be freely applied to both policy-based and value-based algorithms with motivating examples. (iii) We present an extensive empirical study on the Atari benchmark, where our conflict-averse algorithm brings substantial improvements over the baseline methods.
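As a rough illustration of what gradient agreement between the two GPI steps means, the sketch below (not the CASA algorithm itself; the parameters, targets, and advantages are illustrative placeholders) computes the cosine similarity between a squared-error policy-evaluation gradient and a softmax policy-improvement gradient taken at shared parameters:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two gradient vectors."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Shared parameters: used both as Q-estimates and as policy logits.
theta = np.array([0.5, -0.2, 0.1])

# Policy-evaluation gradient of 0.5 * ||theta - q_target||^2.
q_target = np.array([1.0, 0.0, -0.5])   # hypothetical evaluation targets
g_eval = theta - q_target

# Policy-improvement gradient: softmax policy pi = softmax(theta),
# objective J = sum_a pi_a * A_a, with toy advantage estimates A.
pi = np.exp(theta) / np.exp(theta).sum()
A = q_target - pi @ q_target            # advantages under the toy values
g_improve = -(pi * A - pi * (pi @ A))   # gradient of -J (minimization)

print(cosine(g_eval, g_improve) > 0)    # the two gradients agree here
```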

2. PRELIMINARY

Consider an infinite-horizon MDP defined by a tuple $(\mathcal{S}, \mathcal{A}, p, r, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $p : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$ is the state transition probability function, $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function, and $\gamma$ is the discount factor. The policy is a mapping $\pi : \mathcal{S} \times \mathcal{A} \to [0, 1]$ that assigns a distribution over the action space given a state. The objective of reinforcement learning is to maximize the return, i.e., the cumulative discounted reward, $J = \mathbb{E}_{\tau \sim \pi}\big[\sum_t \gamma^t r(s_t, a_t)\big]$, where $\tau = \{s_0, a_0, r_0, \dots\}$ is a trajectory sampled by $\pi$ through policy-environment interaction. Value-based methods maximize $J$ by estimating various types of value functions: the state value function is defined as $V^\pi(s) = \mathbb{E}_\pi[\sum_t \gamma^t r_t \mid s_0 = s]$, the state-action value function is defined as $Q^\pi(s, a) = \mathbb{E}_\pi[\sum_t \gamma^t r_t \mid s_0 = s, a_0 = a]$, and the advantage function is defined as $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$. In value-based methods, the policy can be improved through GPI until it converges to the optimal policy. For an approximate state-action value function $Q_\theta$ that estimates $Q^\pi$, policy evaluation is conducted by minimizing $\mathbb{E}_\pi[(Q^\pi(s, a) - Q_\theta(s, a))^2]$, where $Q^\pi$ is estimated by various methods, e.g., the $\lambda$-return (Sutton, 1988) and Retrace (Munos et al., 2016). Policy improvement is usually achieved by greedily selecting actions with the highest state-action values.
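The GPI loop described above, alternating policy evaluation against a Bellman target with greedy policy improvement, can be sketched in tabular form; the 2-state, 2-action MDP below is purely illustrative:

```python
import numpy as np

# Toy deterministic MDP (hypothetical): P[s, a] is the next state,
# R[s, a] the reward. Action 1 in state 1 is the best loop (reward 2).
P = np.array([[0, 1], [0, 1]])
R = np.array([[0.0, 1.0], [0.0, 2.0]])
gamma = 0.9

Q = np.zeros((2, 2))
pi = np.zeros(2, dtype=int)          # greedy policy: one action per state

for _ in range(100):
    # Policy evaluation: push Q toward the Bellman target under pi.
    for s in range(2):
        for a in range(2):
            s_next = P[s, a]
            Q[s, a] = R[s, a] + gamma * Q[s_next, pi[s_next]]
    # Policy improvement: act greedily with respect to the current Q.
    pi = Q.argmax(axis=1)

print(pi)                            # greedy policy after iteration
```

Here Q[1, 1] converges to 2 / (1 - 0.9) = 20, the return of repeatedly taking action 1 in state 1.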

