CASA: BRIDGING THE GAP BETWEEN POLICY IMPROVEMENT AND POLICY EVALUATION WITH CONFLICT AVERSE POLICY ITERATION

Abstract

We study the problem of model-free reinforcement learning, which is often solved following the principle of Generalized Policy Iteration (GPI). While GPI is typically an interplay between policy evaluation and policy improvement, most conventional model-free methods with function approximation assume the GPI steps are independent, despite the inherent connections between them. In this paper, we present a method that attempts to eliminate the inconsistency between the policy evaluation step and the policy improvement step, leading to a conflict-averse GPI solution with gradient-based function approximation. Our method is capable of balancing exploitation and exploration between policy-based and value-based regimes, and it is applicable to existing methods of both kinds. We conduct extensive experiments to study the theoretical properties of our method and demonstrate its effectiveness on the Atari 200M benchmark.

1. INTRODUCTION

Model-free reinforcement learning has made many impressive breakthroughs in a wide range of Markov Decision Processes (MDPs) (Vinyals et al., 2019; Pedersen, 2019; Badia et al., 2020). Overall, these methods can be cast into two categories: value-based methods such as DQN (Mnih et al., 2015) and Rainbow (Hessel et al., 2017), and policy-based methods such as TRPO (Schulman et al., 2015), PPO (Schulman et al., 2017) and IMPALA (Espeholt et al., 2018). Value-based methods learn state-action values and select actions according to those values. Their main target is to approximate the fixed point of the Bellman equation through generalized policy iteration (GPI) (Sutton & Barto, 2018), which generally consists of policy evaluation and policy improvement. One characteristic of value-based methods is that the policy is not improved until a more accurate state-action value is estimated by iterations of policy evaluation. Previous works equip value-based methods with carefully designed structures to achieve better returns and sample efficiency (Wang et al., 2016; Schaul et al., 2015; Kapturowski et al., 2018). Policy-based methods learn a parameterized policy directly without consulting state-action values. One characteristic of policy-based methods is that they incorporate a policy improvement phase in every training step; in contrast, value-based methods only change the policy once the action with the highest state-action value changes. In principle, policy-based methods perform policy improvement more frequently than value-based methods. We notice that value-based and policy-based methods lie at the two extremes of GPI: value-based methods do not improve the policy until a more accurate policy evaluation is achieved, while policy-based methods improve the policy at every training step even when the policy evaluation has not converged.
To mitigate the defects of each, we pursue a technique capable of balancing between the two extremes flexibly. We first study the gradients of policy improvement and policy evaluation and notice that they are positively correlated statistically throughout the entire training process. To find out whether the gradients of policy improvement and policy evaluation can be made parallel, we propose CASA, Critic AS an Actor, which satisfies a weaker compatible condition (Sutton et al., 1999) and enhances the gradient consistency between policy improvement and policy evaluation. Delving further into its properties, we find that CASA is an innovative combination of value-based and policy-based methods. When policy-based methods are equipped with CASA, the collapse to a sub-optimal solution as the entropy goes to zero is prevented by the evaluation of the state-action values, which encourages exploration. When value-based methods are equipped with CASA, the policy improvement via policy gradient is equivalent to the evaluation of the state-action values plus a self-bootstrapped policy improvement, which enhances exploitation. To enable CASA for large-scale off-policy learning, we introduce Doubly-Robust Trace (DR-Trace), which exploits the doubly-robust estimator (Jiang & Li, 2016) and guarantees the synchronous convergence of the state-action values and the state values. Our main contributions are as follows: (i) We present a novel method, CASA, which enhances the gradient consistency between policy evaluation and policy improvement, and we present extensive studies on the behavior of the gradients. (ii) We demonstrate that CASA can be freely applied to both policy-based and value-based algorithms with motivating examples. (iii) We present an extensive empirical study on the Atari benchmark, where our conflict-averse algorithm brings substantial improvements over the baseline methods.

2. PRELIMINARY

Consider an infinite-horizon MDP defined by a tuple (S, A, p, r, γ), where S is the state space, A is the action space, p : S × A × S → [0, 1] is the state transition probability function, r : S × A → R is the reward function, and γ is the discount factor. The policy is a mapping π : S × A → [0, 1] that assigns a distribution over the action space given a state. The objective of reinforcement learning is to maximize the return, or cumulative discounted reward: maximize J = E_{traj∼π}[Σ_t γ^t r(s_t, a_t)], where traj = {s_0, a_0, r_0, ...} is a trajectory sampled by π through policy-environment interaction. Value-based methods maximize J by estimating various types of value functions: the state value function is defined as V^π(s) = E_π[Σ_t γ^t r_t | s_0 = s], the state-action value function as Q^π(s, a) = E_π[Σ_t γ^t r_t | s_0 = s, a_0 = a], and the advantage function as A^π(s, a) = Q^π(s, a) − V^π(s). In value-based methods, the policy can be improved through GPI until it converges to the optimal policy. For an approximated state-action value function Q_θ that estimates Q^π, policy evaluation is conducted by minimizing E_π[(Q^π(s, a) − Q_θ(s, a))^2], where Q^π is estimated by various methods, e.g., λ-return (Sutton, 1988) and ReTrace (Munos et al., 2016). Policy improvement is usually achieved by greedily selecting actions with the highest state-action values.

Figure 1: The GPI process in our work. Unlike (Sutton & Barto, 2018), we evaluate π by Q instead of V, and we improve π using policy gradient ascent (pg for brevity) instead of greedy selection. The learning procedure is shown by the black arrows, i.e., E → I → E → I · · ·.

Figure 2: GPI with function approximation. Due to the constraint of the approximated function space, the ideal policy iteration cannot actually be achieved.
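As a concrete illustration, the evaluate-then-improve loop described above can be sketched in tabular form; the toy MDP below (transitions, rewards, discount) is entirely hypothetical and not from the paper:

```python
import numpy as np

# Minimal tabular GPI sketch on a hypothetical 2-state, 2-action MDP.
P = np.array([[0, 1], [0, 1]])          # P[s, a] -> next state (deterministic)
R = np.array([[0.0, 1.0], [0.0, 2.0]])  # R[s, a]; action 1 pays more in both states
gamma = 0.9

def policy_evaluation(pi, n_iters=200):
    """Iterate Q(s, a) <- r(s, a) + gamma * Q(s', pi(s')) to approximate Q^pi."""
    Q = np.zeros((2, 2))
    for _ in range(n_iters):
        Q = R + gamma * Q[P, pi[P]]
    return Q

pi = np.array([0, 0])   # start from a suboptimal deterministic policy
for _ in range(5):      # GPI: evaluate fully, then improve greedily
    Q = policy_evaluation(pi)
    pi = Q.argmax(axis=1)
```

Here the improvement step is the greedy arg max described above; the paper's Figure 1 instead replaces it with a policy-gradient ascent step.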
The underlying process of GPI with function approximation can be regarded as performing policy improvement and policy evaluation in an ideal space and then projecting back into the approximated function space (Sutton & Barto, 2018; Ghosh et al., 2020). Policy-based methods maximize J by optimizing a parameterized policy π_θ according to the policy gradient theorem (Sutton & Barto, 2018), ∇_θ J = E_π[Ψ(s, a)∇_θ log π_θ(a|s)]. The vanilla policy gradient uses Ψ = Σ_{t=0}^∞ γ^t r_t. Actor-critic algorithms approximate Ψ(s, a) with a baseline; e.g., IMPALA (Espeholt et al., 2018) adopts Ψ(s, a) = r + γV^π(s′) − V_θ(s) and uses V-Trace to estimate V^π.

3.1. MOTIVATION

We use V_θ to estimate V^π, Q_θ to estimate Q^π, and π_θ to represent the policy, where θ represents all parameters to be optimized. In this work, there is one backbone followed by two individual heads: the advantage function and the policy share one head, and the state value function is the other head. Hence the policy reuses all parameters of the value functions except for the temperature τ, which belongs to the policy only. We keep τ static in this work. We use E to represent policy evaluation, which gives the gradient ascent direction θ ← θ + ηE_π[(Q^π − Q_θ)∇_θ Q_θ]. We use I to represent policy improvement, which gives θ ← θ + ηE_π[(Q^π − V_θ)∇_θ log π_θ]. Let us recap the GPI process as shown in Figure 1. To get rid of function approximation error, we first assume the approximation function enjoys infinite capacity. We use <x, y> to denote the angle between two vectors, where <x, y> = arccos(x·y / (||x||·||y||)) with arccos : [−1, 1] → [0, π]. We define an important notion β, the angle between the gradient ascent directions of I and E:

β := < E_π[(Q^π − Q_θ)∇_θ Q_θ], E_π[(Q^π − V_θ)∇_θ log π_θ] >.

When β = 0, i.e., cos(β) = 1, I and E are parallel to each other, which is the blue arrow in Figure 1, and there is no longer any conflict between the gradient ascent directions of I and E. When β = π/2, i.e., cos(β) = 0, I and E are perpendicular. When β = π, i.e., cos(β) = −1, I and E point in exactly opposite directions. Next, we assume the representation capacity of the approximation function is limited. When function approximation is involved, i.e., Q^π is estimated by Q_θ and π is approximated by π_θ, from the view of operators (Ghosh et al., 2020), each of I and E can be further decomposed into two operators, as shown in Figure 2: one performs the policy improvement or the policy evaluation, and the other projects back into the restricted function space.
When β > 0, GPI with function approximation involves two projection operators in each iteration, which introduces inevitable approximation error. When β = 0, if function approximation error is not considered, the gradient conflict between I and E is totally eliminated. If we consider the limitation of the approximation function, similarly to the blue arrow in Figure 1, one iteration (represented by two black arrows and two dotted arrows) can be united into one arrow and one dotted arrow (not shown in Figure 2 but analogous to the blue arrow in Figure 1), where the gradient conflict is eliminated and the two projection operators are correspondingly reduced to one. As stated above, if β = 0 holds, we can expect the gradient conflict between policy improvement and policy evaluation to be eliminated and the function approximation error to be reduced. However, β is usually estimated by sampling with stochasticity, so it is difficult to enforce β = 0 by optimizing θ. Instead, we consider another notion χ, obtained by removing the step sizes and taking the expectation outside, so that the angle for each state is fully controllable by θ:

χ := E_π[cos < ∇_θ Q_θ, ∇_θ log π_θ >].

In fact, χ is highly correlated to the compatible value function (Sutton et al., 1999), and Theorem 3 shows that χ = 1 is a necessary condition for the compatible condition ∇_θ Q_θ = ∇_θ log π_θ, i.e., a weaker compatible condition. More details about compatible value functions are in Appendix A. To further understand the behavior of β and χ, we track cos(β) and χ for two algorithms, PPO and R2D2, as representatives of policy-based and value-based methods, respectively. Figure 3 shows an important fact: both χ and cos(β) are statistically positive for both the original and the adjusted versions, which means that arccos(χ) and β are likely to be less than π/2 with neural-network-approximated functions.
The aforementioned conceptual and empirical findings inspire us to raise the following question about GPI: can we guarantee χ = 1, so that cos(β) is also pushed closer to 1?
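A minimal sketch of how cos(β) and χ differ as statistics, under the simplifying assumption of unit TD-error weights and synthetic per-state gradients (all numbers below are illustrative):

```python
import numpy as np

def cos_angle(x, y, eps=1e-8):
    """cos<x, y>, guarded as in Sec. 4.1: x.y / (max(||x||, eps) * max(||y||, eps))."""
    return x @ y / (max(np.linalg.norm(x), eps) * max(np.linalg.norm(y), eps))

# Hypothetical per-state gradients of Q_theta and log pi_theta (rows = states).
rng = np.random.default_rng(0)
grad_q = rng.normal(size=(32, 8))
grad_logpi = 0.5 * grad_q            # proportional per state, positive factor

# chi averages the per-state cosines; cos(beta) is the cosine of the
# batch-averaged gradient directions.
chi = np.mean([cos_angle(q, p) for q, p in zip(grad_q, grad_logpi)])
cos_beta = cos_angle(grad_q.mean(axis=0), grad_logpi.mean(axis=0))
```

When the per-state gradients are proportional with positive factors, both statistics equal one; in general cos(β) also depends on the TD-error weights, which this sketch omits.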

3.2. FORMULATION

Denote by τ ∈ R_+ a positive temperature and by sg the stop-gradient operator. CASA can estimate V_θ and A_θ by any function parameterized by θ, from which π_θ and Q_θ are derived as follows:

π_θ(·|s) = softmax(A_θ(s, ·)/τ),
Ā_θ(s, a) = A_θ(s, a) − Σ_{a′} sg(π_θ(a′|s)) A_θ(s, a′),
Q_θ(s, a) = Ā_θ(s, a) + sg(V_θ(s)).      (6)

Note that there are two sg operators in equation 6. The first is used for computing the advantage as Ā_θ = A_θ − E_π[A_θ] = A_θ − sg(π_θ)·A_θ; this sg guarantees that the gradients of policy improvement and policy evaluation are parallel, which we elaborate later. Intuitively, it also means that we keep π_θ unchanged while evaluating the policy π_θ. The second sg operator appears in Q_θ = Ā_θ + sg(V_θ). Just as (Chen & He, 2020) interprets sg in siamese representation learning as a case of the EM algorithm (Dempster et al., 1977), a similar interpretation exists here: Q_θ = Ā_θ + sg(V_θ) decomposes the estimation of Q_θ into a two-stage problem, where the first stage estimates the advantage of each action without changing the expectation, and the second stage estimates the expectation. Equation 6 also includes a straightforward refinement of dueling-DQN. Dueling-DQN estimates Q^π by Q_θ = A_θ + V_θ, but it cannot guarantee E_π[A_θ] = 0, i.e., E_π[Q_θ] = V_θ, due to function approximation error. If we instead estimate Q^π by Q_θ = A_θ − E_π[A_θ] + V_θ, the necessary condition E_π[Q_θ] = V_θ is satisfied without loss of generality.
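The forward pass of equation 6 can be sketched as follows; since sg only affects gradients, it is a no-op in this forward-only sketch, and the advantage logits and value are illustrative:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def casa_head(A, V, tau=1.0):
    """Forward pass of equation 6 for a single state; sg(.) is omitted because
    it is the identity in the forward direction."""
    pi = softmax(A / tau)
    A_bar = A - pi @ A     # centered advantage: E_pi[A_bar] = 0
    Q = A_bar + V
    return pi, A_bar, Q

A = np.array([1.0, 0.0, -1.0])   # illustrative advantage logits
pi, A_bar, Q = casa_head(A, V=2.0)
```

By construction E_π[Ā] = 0, so E_π[Q] = V holds exactly, which is the refinement of dueling-DQN described above.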

3.3. PATH CONSISTENCY BETWEEN POLICY EVALUATION AND POLICY IMPROVEMENT

For brevity, we omit θ, and V, Q, A, π all denote approximated functions. Denote the estimates of V and Q by V^π and Q^π, respectively; for instance, one choice is to compute V^π by V-Trace (Espeholt et al., 2018) and Q^π by ReTrace (Munos et al., 2016). At training time, policy evaluation is achieved by updating θ to minimize

L_V(θ) = E_π[(V^π − V)^2],  L_Q(θ) = E_π[(Q^π − Q)^2],      (7)

which gives the ascent directions

∇_θ L_V(θ) = E_π[(V^π − V)∇_θ V],  ∇_θ L_Q(θ) = E_π[(Q^π − Q)∇_θ Q].      (8)

Policy improvement is performed by policy gradient, which gives the ascent direction

∇_θ J(τ, θ) = E_π[τ(Q^π − V)∇_θ log π],

where J(τ, θ) = τ E_π[Σ_t γ^t r_t]. The additional factor τ frees the scale of the gradient from τ. The final gradient ascent direction of θ is

α_1 ∇_θ L_V + α_2 ∇_θ L_Q + α_3 ∇_θ J.      (9)

With (V, Q, π) defined in equation 6, by Lemma E.1 we have

∇_θ Q = (1 − π)∇_θ A = τ ∇_θ log π.      (10)

For brevity, denote the shared gradient path as g = (1 − π)∇_θ A. Plugging equation 10 into equations 7 and 8, we have

∇_θ L_Q = E_π[(Q^π − Q)g],  ∇_θ J = E_π[(Q^π − V)g].      (11)

By equation 11, ∇_θ L_Q and ∇_θ J walk along the same gradient path g for each state; by equation 10, this is exactly the case χ = 1. Since all parameters estimating Q and π are shared except for τ, we call the method Critic AS an Actor. Subtracting ∇_θ L_Q from ∇_θ J, we have

∇_θ J = ∇_θ L_Q + E_π[(Q − V)g].      (12)

Here E_π[(Q − V)g] is a self-bootstrapped policy gradient with the function-approximated Q. Recalling that value-based methods improve the policy by greedily selecting actions according to Q, applying ∇_θ J to θ additionally utilizes Q to perform policy improvement, a greedier usage of Q than its usual one.
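The identity in equation 10 can be checked numerically on a toy head where A(s, ·) is parameterized directly by per-action logits; the numbers below are arbitrary:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# Check grad Q = (1 - pi) grad A = tau * grad log pi with A(s, .) = theta.
tau = 1.5
theta = np.array([0.3, -0.7, 1.2])
pi = softmax(theta / tau)
n = len(theta)

# With sg(pi) and sg(V) held constant, dQ_a/dtheta_b = delta_ab - pi_b.
grad_Q = np.eye(n) - pi[None, :]

def num_grad_logpi(a, eps=1e-6):
    """Central finite differences of log pi_a w.r.t. theta."""
    g = np.zeros(n)
    for b in range(n):
        tp, tm = theta.copy(), theta.copy()
        tp[b] += eps
        tm[b] -= eps
        g[b] = (np.log(softmax(tp / tau)[a]) - np.log(softmax(tm / tau)[a])) / (2 * eps)
    return g
```

The finite-difference gradient of log π, scaled by τ, should match grad_Q row by row, confirming that the stop-gradient structure makes the two gradient paths proportional.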
If we further exploit the structural information of (V, Q, π) as defined by equation 6, by Lemma E.2 we have E_π[(Q − V)g] = τ E_π[(Q − V)∇_θ log π] = −τ^2 ∇_θ H[π], and therefore

∇_θ L_Q = ∇_θ J + τ^2 ∇_θ H[π].      (13)

Equation 13 shows that ∇_θ L_Q is a policy gradient with an entropy regularization. If we apply ∇_θ L_Q to θ for policy-based methods, an entropy regularization works implicitly through α_2 ∇_θ L_Q in equation 9, which prevents the policy from collapsing to a sub-optimal solution.
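The identity behind equation 13 (Lemma E.2) can likewise be checked numerically on a toy head with per-action logits, comparing the analytic left side with a finite-difference entropy gradient; all values are illustrative:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def entropy(theta, tau):
    p = softmax(theta / tau)
    return -(p * np.log(p)).sum()

# Check E_pi[(Q - V) grad log pi] = -tau * grad H[pi] with A(s, .) = theta.
tau = 0.8
theta = np.array([0.5, -0.2, 1.1])
pi = softmax(theta / tau)
A_bar = theta - pi @ theta   # Q - V = A - E_pi[A] under equation 6

# Left side: sum_a pi_a * A_bar_a * grad log pi_a, with grad log pi_a = (e_a - pi)/tau
lhs = sum(pi[a] * A_bar[a] * (np.eye(3)[a] - pi) / tau for a in range(3))

# Right side: -tau * grad H via central finite differences
eps = 1e-6
grad_H = np.zeros(3)
for b in range(3):
    tp, tm = theta.copy(), theta.copy()
    tp[b] += eps
    tm[b] -= eps
    grad_H[b] = (entropy(tp, tau) - entropy(tm, tau)) / (2 * eps)
rhs = -tau * grad_H
```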

3.4. DR-TRACE AND OFF-POLICY TRAINING

Table 1: Comparison between DR-Trace and V-Trace/ReTrace.
TD error: DR-Trace: δ^DR_t = r_t + γ V(s_{t+1}) − Q(s_t, a_t);  V-Trace/ReTrace: δ^V_t = r_t + γ V(s_{t+1}) − V(s_t), δ^Q_t = r_t + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t).
V^π: DR-Trace: E_µ[V_t + Σ_{k≥0} γ^k c_{[t:t+k−1]} ρ_{t+k} δ^DR_{t+k}];  V-Trace: E_µ[V_t + Σ_{k≥0} γ^k c_{[t:t+k−1]} ρ_{t+k} δ^V_{t+k}].
Q^π: DR-Trace: E_µ[Q_t + Σ_{k≥0} γ^k c_{[t+1:t+k−1]} (1_{k=0} + 1_{k>0} ρ_{t+k}) δ^DR_{t+k}];  ReTrace: E_µ[Q_t + Σ_{k≥0} γ^k c_{[t+1:t+k]} δ^Q_{t+k}].
∇J: DR-Trace: E_µ[ρ_t (Q^π_t − V_t) ∇log π];  V-Trace: E_µ[ρ_t (r_t + γ V^π_{t+1} − V_t) ∇log π].

To enable off-policy training with a behavior policy µ, one choice is to estimate V^π and Q^π in equations 7 and 8 by V-Trace and ReTrace. Since CASA estimates (V, Q, π), applying the Doubly Robust estimator (Jiang & Li, 2016) is feasible and suitable. We propose DR-Trace and find that the convergence rate and the fixed point of DR-Trace are the same as V-Trace's, following its convergence proof. For completeness, Table 1 compares DR-Trace with V-Trace/ReTrace. More details are in Appendix D.
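A single-trajectory sketch of the DR-Trace target for V, with V-Trace-style truncated importance weights; the function and argument names are ours, not the paper's:

```python
import numpy as np

def dr_trace_v(r, V, Q_sa, rho, gamma=0.99, rho_bar=1.0, c_bar=1.0, lam=1.0):
    """Single-trajectory DR-Trace target for V (a sketch).
    r[t]: rewards; V[t]: value estimates of length T+1; Q_sa[t] = Q(s_t, a_t);
    rho[t]: importance ratios pi/mu. Uses delta^DR_t = r_t + gamma*V(s_{t+1}) - Q(s_t, a_t)
    with truncated weights rho_t = min(rho, rho_bar) and c_t = lam * min(rho, c_bar)."""
    T = len(r)
    rho_t = np.minimum(rho, rho_bar)
    c_t = lam * np.minimum(rho, c_bar)
    delta = r + gamma * V[1:] - Q_sa          # delta^DR_t
    targets = np.zeros(T)
    acc = 0.0
    for t in reversed(range(T)):              # backward recursion over the sum
        acc = rho_t[t] * delta[t] + gamma * c_t[t] * acc
        targets[t] = V[t] + acc
    return targets

targets = dr_trace_v(
    r=np.array([1.0, 0.0]),
    V=np.array([0.0, 0.0, 0.0]),
    Q_sa=np.array([0.0, 0.0]),
    rho=np.array([1.0, 1.0]),
    gamma=0.5,
)
```

The backward recursion implements the k-sum in Table 1's V^π row: each step folds ρ_t δ^DR_t into the discounted, c-weighted tail.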

4.1. BASIC SETUP

We employ a Learner-Actor pipeline (Espeholt et al., 2018) for large-scale training. The motivation and ablation experiments on PPO and R2D2 do not use LSTM; only the experiments on CASA+DR-Trace use LSTM (Hochreiter & Schmidhuber, 1997), for comparison with other algorithms. We use burn-in (Kapturowski et al., 2018) when LSTM is used. All estimated values share the same backbone, which is followed by two fully connected layers for each individual head. We use no intrinsic reward and no entropy regularization in any experiment. We find that using life information can greatly increase performance on some games; however, to stay general, we do not end the episode when a life is lost. All hyperparameters are in Appendix F. For brevity, we denote ∇L_V = E_π[(V^π − V_θ)∇V_θ], ∇L_Q = E_π[(Q^π − Q_θ)∇Q_θ] and ∇J = E_π[(Q^π − V_θ)∇log π_θ], where the expectation is a batch-wise average in our implementation. When we write <a, b> with a, b ∈ {∇L_V, ∇L_Q, ∇J}, we first calculate the batch-wise averaged gradients of a and b, then calculate the angle in between. When we write cos<∇Q, ∇log π> or χ, we mean E_π[cos<∇_θ Q_θ, ∇_θ log π_θ>], which first calculates element-wise cosines and then takes a batch-wise average. To avoid numerical problems, we calculate x·y/(||x||·||y||) as x·y/(max(||x||, 10^-8)·max(||y||, 10^-8)).

4.2. APPLICATION OF CASA ON REPRESENTATIVE ALGORITHMS

CASA is applicable to existing algorithms; we take PPO and R2D2 for demonstration. The application of CASA to PPO is straightforward. Applying CASA to R2D2 directly is not possible, since both the ϵ-greedy policy and the arg max Q policy break the gradient; this is the same problem as calculating policy improvement gradients for any value-based method. We therefore use a surrogate policy π_surrogate = softmax(A/τ), which is discussed in Appendix B. Table 2 summarizes the adjustments of function approximations and training gradients.

Table 2: Examples of applying CASA to policy-based methods (PPO) and value-based methods (R2D2).
PPO:       (V, logit) = (V_θ, logit_θ), π = softmax(logit);  gradient: 0.5∇L_V + ∇J.
PPO+CASA:  (V, A) = (V_θ, A_θ), π = softmax(A/τ), Ā = A − sg(π)·A, Q = Ā + sg(V);  gradient: 0.5∇L_V + ∇L_Q + ∇J.
R2D2:      (V, A) = (V_θ, A_θ), Q = A + V;  gradient: ∇L_Q.
R2D2+CASA: (V, A) = (V_θ, A_θ), π = softmax(A/τ), Ā = A − sg(π)·A, Q = Ā + sg(V);  gradient: 0.5∇L_V + ∇L_Q + ∇J.

Since PPO+CASA and R2D2+CASA share the same function approximation, and recalling that value-based methods improve the policy only when a more accurate evaluation is achieved while policy-based methods improve the policy at every step, we can balance the two flexibly, with χ = 1, through α_1, α_2, α_3 in equation 9. In Figure 3, algorithms with CASA show much higher cos(β) and χ. PPO+CASA explores more than the original PPO, as the entropy of π does not easily drop to zero. R2D2+CASA tends to distinguish the state-action values more sharply, where we use the entropy of Q to measure how greedy the current state-action values are.
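The surrogate policy's limiting behavior is easy to visualize: softmax(A/τ) concentrates on arg max A (equivalently arg max Q under the dueling structure) as τ → 0+. The advantage values below are made up for illustration:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# softmax(A/tau) sharpens toward the greedy arg max policy as tau shrinks.
A = np.array([0.2, 1.0, -0.5])
policies = {tau: softmax(A / tau) for tau in (1.0, 0.1, 0.01)}
```

At τ = 1 the policy remains stochastic, while at τ = 0.01 almost all mass sits on the greedy action, which is why the surrogate recovers the value-based target policy in the limit while still providing a usable gradient.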

4.3. BEHAVIOR OF GRADIENTS ON DIFFERENT STRUCTURES

Table 3: Behavior of gradients on different structure types.
PPO+CASA: Q = A_θ − sg(π_θ)·A_θ + sg(V_θ)
type 1:   Q = A_θ − π_θ·A_θ + sg(V_θ)
type 2:   Q = A_θ − sg(π_θ)·A_θ + V_θ
type 3:   Q = A_θ + sg(V_θ)
type 4:   Q = A_θ + V_θ
type 5:   Q = Q_θ

Types 1 and 2 are CASA-like structures, where type 1 removes sg from π and type 2 removes sg from V_θ. Types 3 and 4 are dueling-like structures, where type 3 adds sg to V for dueling-Q and type 4 is dueling-Q. Type 5 uses a new head to estimate Q_θ separately, which can be considered an auxiliary task for estimating Q^π. Though we have shown that CASA satisfies ∇Q ∝ ∇log π, i.e., χ = 1, it is unknown whether the structure of CASA is unique. As Q = A − E_π[A] + sg(V) is a direct refinement of dueling-DQN, we try several different structures on top of PPO+CASA. All settings for estimating state-action values are shown in Table 3. We always use 0.5∇L_V + ∇L_Q + ∇J as the training gradient. We present Breakout and Qbert in Figure 4. For clarity, we group PPO+CASA and type 3 as the sg-V group, and type 2 and type 4 as the no-sg-V group. The sg-V group has higher χ and higher cos(β), i.e., it is closer to the compatible condition and to consistency between the two GPI steps, and each member of the no-sg-V group is always worse than its counterpart in the sg-V group. PPO+CASA has χ = 1 and the highest cos(β). Type 1 achieves lower returns than PPO+CASA; hence, when applying a CASA-like structure, stopping the gradient of π is always preferred. Type 5, which uses an individual head to estimate Q^π, performs the worst; hence a well-designed CASA-like or dueling-like structure is always preferred. From the scatter plot and box plot in Figure 4, χ and cos(β) are positively correlated across the different structures. This phenomenon partially answers the question raised in Section 3.1: for these specifically designed structures, χ and cos(β) show positive correlation.

5. RELATED WORKS

Both value-based and policy-based approaches comply with the principle of GPI, but the two GPI steps are only coarsely related to each other, so jointly optimizing both functions may bring conflicts. Despite this crucial issue in GPI with function approximation, most established model-free algorithms adopt a standard policy improvement/evaluation regime without conflict-diminishing mechanisms. The issue of reducing conflicts among multiple models trained simultaneously was considered in earlier machine learning literature, such as robust parameter estimation for multiple estimators under incomplete data (Robins & Rotnitzky, 1995; Lunceford & Davidian, 2004; Kang & Schafer, 2007) and multitask learning with gradient similarity measures (Chen et al., 2020; Yu et al., 2020; Javaloy & Valera, 2022). When the idea was introduced to reinforcement learning, the earliest attempts tackled conservative and safe policy iteration (Kakade & Langford, 2002; Hazan & Kale, 2011; Pirotta et al., 2013). Recently, more works have studied GPI in a fine-grained manner. In (Ghosh et al., 2020), a new Bellman operator is introduced that implements GPI with a policy improvement operator and a projection operator, where the projection finds the best approximation of the policy among realizable policies. In (Raileanu & Fergus, 2021), the policy and value updates are decoupled by approximating two networks with representation regularization. In (Cobbe et al., 2021), GPI is separated into a policy improvement step and a feature distillation step. In contrast to the aforementioned works, we tackle the conflicts in GPI at the gradient level, with theoretical analysis. Our work is also related to (Nachum et al., 2017), which utilizes both the unbiasedness and stability of on-policy training and the data efficiency of off-policy training to form a soft consistency error.
Our work bridges the gap between the two GPI steps from an alternative angle, establishing a closer relationship between the policy and value functions in their functional forms, without focusing on off-policy correction. Due to the difficulty of controlling the gap between the GPI steps directly, we instead consider χ. The condition χ = 1 is closely related to the compatible value function (Sutton et al., 1999; Kakade, 2001), as shown in Section 3.1 and Appendix A.

6. LIMITATION

Note that CASA is only applied to discrete action spaces for now, although CASA is applicable to any function approximation that can estimate the advantage functions of all actions. We provide additional discussion on continuous action spaces in Appendix C. Since π shares all parameters of the value functions, χ = 1 is obtained, but π sacrifices the freedom of being parameterized separately. We conjecture that CASA is one endpoint of a trade-off curve between χ and the freedom of π, where the other endpoint is a π that shares no parameters with the value functions.

7. ETHICS AND REPRODUCIBILITY STATEMENT

This paper is aimed at academic issues in deep reinforcement learning, and our experiments are at an early stage, but the work may provide opportunities for malicious applications of reinforcement learning in the future. We describe all details needed to reproduce the main experimental results in Appendix F.

8. CONCLUSION

This paper attempts to eliminate the gradient inconsistency between policy improvement and policy evaluation. The proposed actor-critic design, Critic AS an Actor (CASA), enhances the consistency of the two GPI steps by satisfying a weaker compatible condition. We present both theoretical analysis and empirical evaluation of CASA. The results show that our method achieves state-of-the-art performance with a noticeable gain over several strong baselines on the ALE 200-million-frame (200M) benchmark. We also present several ablation studies, which demonstrate the effectiveness of the method's theoretical properties. Future work includes studying the connection between the compatible condition and the gradient consistency between policy improvement and policy evaluation.

A COMPATIBLE VALUE FUNCTION

The original policy gradient with a compatible value function is stated as follows.

Theorem 1 (Sutton et al. (1999)). Let Q_w be a state-action function with parameter w and π_θ be a policy function with parameter θ. If Q_w satisfies E_π[(Q^π − Q_w)∇_w Q_w] = 0 and ∇_w Q_w = ∇_θ log π_θ, then ∇_θ J = E_π[Q_w ∇_θ log π_θ].

If we let w = θ in Theorem 1, so that Q_w and π_θ share parameters, we have the following theorem.

Theorem 2. Let Q_θ be a state-action function with parameter θ and π_θ be a policy function with the same parameter θ. If Q_θ satisfies E_π[(Q^π − Q_θ)∇_θ Q_θ] = 0 and ∇_θ Q_θ = ∇_θ log π_θ, then ∇_θ J = E_π[Q_θ ∇_θ log π_θ].

Define χ := E_π[cos<∇_θ Q_θ, ∇_θ log π_θ>]. We show that χ = 1 is a necessary condition for the compatible condition ∇_θ Q_θ = ∇_θ log π_θ.

Theorem 3. i) If ∇_θ Q_θ ∝ ∇_θ log π_θ for all states, then χ = 1. ii) If χ = 1, then ∇_θ Q_θ ∝ ∇_θ log π_θ for all states.

By Theorem 3, χ = 1 is equivalent to ∇_θ Q_θ ∝ ∇_θ log π_θ, and ∇_θ Q_θ ∝ ∇_θ log π_θ is a necessary condition for ∇_θ Q_θ = ∇_θ log π_θ; hence χ = 1 is a necessary condition for ∇_θ Q_θ = ∇_θ log π_θ.

Proof. i) Since ∇_θ Q_θ ∝ ∇_θ log π_θ, we have <∇_θ Q_θ, ∇_θ log π_θ> = 0. By the definition of χ, we have χ = E_π[cos<∇_θ Q_θ, ∇_θ log π_θ>] = E_π[1] = 1. ii) Since χ ≤ 1 and cos(x) is monotonically decreasing as x goes from 0 to π, the equality χ = 1 only holds when all states satisfy <∇_θ Q_θ, ∇_θ log π_θ> = 0, which means ∇_θ Q_θ ∝ ∇_θ log π_θ.
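Theorem 3 can be illustrated numerically: per-state proportionality with positive factors (matching the angle-zero case in the proof) gives χ = 1 exactly, while generic unrelated gradients give χ < 1. The gradients below are synthetic:

```python
import numpy as np

def cosine(x, y):
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

# Synthetic per-state gradients of Q_theta (rows = states).
rng = np.random.default_rng(1)
grad_q = rng.normal(size=(16, 6))
scales = rng.uniform(0.1, 2.0, size=(16, 1))   # positive per-state factors

# Proportional per-state gradients: every cosine is exactly 1.
chi_prop = np.mean([cosine(q, p) for q, p in zip(grad_q, scales * grad_q)])

# Unrelated gradients: individual cosines scatter around 0, so chi < 1.
chi_free = np.mean([cosine(q, p) for q, p in zip(grad_q, rng.normal(size=(16, 6)))])
```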

B GRADIENTS BETWEEN POLICY IMPROVEMENT AND POLICY EVALUATION

Table 5: PPO and its adjusted versions.
PPO:       (V, logit) = (V_θ, logit_θ), π = softmax(logit);  gradient: 0.5∇L_V + ∇J.
PPO ver.1: (Q, logit) = (Q_θ, logit_θ), π = softmax(logit), V = sg(π)·Q;  gradient: 0.5∇L_V + ∇J;  tracked angles: cos<∇L_Q, ∇J>, cos<∇Q, ∇log π>.
PPO ver.2: (Q, logit) = (Q_θ, logit_θ), π = softmax(logit), V = sg(π)·Q;  gradient: 0.5∇L_V + ∇L_Q + ∇J;  tracked angles: cos<∇L_Q, ∇J>, cos<∇Q, ∇log π>.
PPO+CASA:  (V, A) = (V_θ, A_θ), π = softmax(A/τ), Ā = A − sg(π)·A, Q = Ā + sg(V);  gradient: 0.5∇L_V + ∇L_Q + ∇J;  tracked angles: cos<∇L_Q, ∇J>, cos<∇Q, ∇log π>.

Table 6: R2D2 is the original R2D2. R2D2 ver.1 is an adapted version that includes ∇L_V in training. R2D2+CASA applies CASA to R2D2, as described in Sec. 4.2.
R2D2:       (V, A) = (V_θ, A_θ), Q = A + V, π = softmax(A/τ);  gradient: ∇L_Q;  tracked angle: cos<∇L_Q, ∇J>.
R2D2 ver.1: (V, A) = (V_θ, A_θ), Q = A + V, π = softmax(A/τ);  gradient: 0.5∇L_V + ∇L_Q;  tracked angle: cos<∇L_Q, ∇J>.
R2D2+CASA:  (V, A) = (V_θ, A_θ), π = softmax(A/τ), Ā = A − sg(π)·A, Q = Ā + sg(V);  gradient: 0.5∇L_V + ∇L_Q + ∇J;  tracked angle: cos<∇L_Q, ∇J>.

To understand the behavior of β := <E_π[(Q^π − Q_θ)∇_θ Q_θ], E_π[(Q^π − V_θ)∇_θ log π_θ]> and χ := E_π[cos<∇_θ Q_θ, ∇_θ log π_θ>] in reinforcement learning algorithms, we choose PPO as a representative of policy-based methods and R2D2 as a representative of value-based methods. Define L_V(θ) = E_π[(V^π − V_θ)^2], L_Q(θ) = E_π[(Q^π − Q_θ)^2], and ∇_θ J(θ) = E_π[(Q^π − V_θ)∇_θ log π]. These three kinds of loss functions are the usual ones in reinforcement learning, aiming to estimate the state values, the state-action values and the policy. We do not discuss the estimation of V^π and Q^π here, as they are estimated in the usual way of PPO and R2D2. All hyperparameters are listed in Appendix F.
For brevity, we write cos<∇Q, ∇log π> = E_π[cos<∇_θ Q_θ, ∇_θ log π_θ>], and

cos<∇L_Q, ∇J> = cos<E_π[(Q^π − Q_θ)∇_θ Q_θ], E_π[(Q^π − V_θ)∇_θ log π_θ]>,
cos<∇L_V, ∇J> = cos<E_π[(V^π − V_θ)∇_θ V_θ], E_π[(Q^π − V_θ)∇_θ log π_θ]>,
cos<∇L_V, ∇L_Q> = cos<E_π[(V^π − V_θ)∇_θ V_θ], E_π[(Q^π − Q_θ)∇_θ Q_θ]>.

The fact that PPO has only ∇_θ L_V and ∇_θ J while R2D2 has only ∇_θ L_Q is the main difficulty in tracking cos(β) and χ. To solve this, we adjust PPO and R2D2 into different versions. For PPO, we replace the estimation of V_θ by sg(π)·Q_θ, where Q_θ is estimated by function approximation and V_θ is obtained by taking the expectation of Q_θ. All versions of PPO are listed in Table 5. For R2D2, we point out that although ϵ-greedy is applied to interact with environments, ϵ is only used for exploration and the final target policy of a value-based method is simply arg max Q_θ. Because arg max Q_θ breaks the gradient, we use a surrogate policy to approximate the gradient of policy improvement. Since R2D2 uses the dueling structure and softmax(A_θ/τ) = softmax(Q_θ/τ) → arg max Q_θ as τ → 0+, we use π_surrogate = softmax(A_θ/τ) to calculate the policy gradient. We only use π_surrogate on the learner to calculate the gradient; the policy that interacts with environments is still ϵ-greedy. All versions of R2D2 are listed in Table 6.

C ON DISCUSSING APPLICATION OF CASA ON CONTINUOUS ACTION SPACE

Since CASA is only applied to discrete action spaces in the main text, we discuss here whether CASA is applicable to continuous action spaces. For brevity, we let τ = 1 and write equation 6 as:

π = softmax(A),  Ā = A − E_π[A],  Q = Ā + sg(V).

The difficulty comes from estimating two quantities: softmax(A) and E_π[A]. A discrete action space is countable, so these two quantities have closed-form expressions, while a continuous action space is uncountable, so an accurate estimation of them is intractable. We could apply Monte Carlo methods to approximate them, but a more elegant closed-form expression may be preferred. The problem then becomes: how can we efficiently estimate the state-action values, advantages, or policy probabilities of all actions in a continuous action space without loss of generality? This is a representational design problem beyond the scope of this paper, so we do not pursue it in depth. However, in the hope of inspiring a better solution, we provide one practical way of applying CASA to continuous action spaces based on kernel methods. Let a_0, ..., a_k be basis actions in the action space and A(s, a_0), ..., A(s, a_k) be advantage functions for tuples of the state and the basis actions; they can either share parameters or be isolated. Let K(·, ·) be a kernel function defined on the product of two action spaces. For any a in the action space, we can estimate A(s, a) by a decomposition such as

A(s, a) = (1/Z_a)(K(a_0, a)A(s, a_0) + · · · + K(a_k, a)A(s, a_k)),

where Z_a = Σ_{i=0}^k K(a_i, a) is a normalization constant. Since K(·, a) is a closed-form function of a, and the set {A(s, a_0), ..., A(s, a_k)} is finite, we obtain closed-form expressions of both softmax(A) and E_π[A].
CASA can then be applied directly to this expression, with one function estimating V and another estimating the advantages of all actions in closed form with only the state as input. The policy is defined directly as the softmax of all advantages. In detail, we define π, Ā and Q as in equation 15.

D DR-TRACE

Since CASA estimates (V, Q, π), two questions arise: i) how to guarantee that π_V-Trace = π_ReTrace, and ii) how to exploit (V, Q, π) to obtain a better estimate. Although one could apply V-Trace to estimate V and ReTrace to estimate Q with hyperparameters chosen so that π_V-Trace = π_ReTrace, it is more reasonable to estimate (V, Q) jointly. Inspired by Doubly Robust estimation, which has been shown to maximally reduce the variance, we introduce DR-Trace, which estimates V by
V^DR_π(s_t) := E_μ[V(s_t) + Σ_{k≥0} γ^k c_{[t:t+k−1]} ρ_{t+k} δ^DR_{t+k}],
where μ is the behavior policy and
δ^DR_t := r_t + γV(s_{t+1}) − Q(s_t, a_t)
is the one-step Doubly Robust error, and estimates Q by
Q^DR_π(s_t, a_t) := E_{s_{t+1}, r_t ∼ p(·,·|s_t, a_t)}[r_t + γV^DR_π(s_{t+1})] = E_μ[Q(s_t, a_t) + Σ_{k≥0} γ^k c_{[t+1:t+k−1]} ρ̃_{t,k} δ^DR_{t+k}],
where ρ̃_{t,k} = 1_{k=0} + 1_{k>0} ρ_{t+k}.

Theorem 4. Define Ā = A − E_π[A], Q = Ā + sg(V), T(Q) := E_μ[Q(s_t, a_t) + Σ_{k≥0} γ^k c_{[t+1:t+k−1]} ρ̃_{t,k} δ^DR_{t+k}], S(V) := E_μ[V(s_t) + Σ_{k≥0} γ^k c_{[t:t+k−1]} ρ_{t+k} δ^DR_{t+k}], U(Q, V) = (T(Q) − E_π[Q] + S(V), S(V)), and U^(n)(Q, V) = U(U^(n−1)(Q, V)). Then U^(n)(Q, V) → (Q^π̃, V^π̃) as n → +∞, where π̃ is the policy given in Theorem E.1.

Lemma E.1. Define Ā = A − E_π[A], Q = Ā + sg(V). Then ∇Q = (1 − π)∇A.

Proof. As Q = Ā + sg(V) = A − sg(π)·A + sg(V), it is immediate that ∇Q = (1 − π)∇A. For log π, this is the standard derivative of the cross entropy, so ∇log π = (1 − π)∇(A/τ) = (1 − π)∇A/τ.

Lemma E.2. Define Ā = A − E_π[A], Q = Ā + sg(V), π = softmax(A/τ). Then E_π[(Q − V)∇log π] = −τ∇H[π].

Proof. Since π = exp(A/τ)/Z with Z = Σ_A exp(A/τ), we have A = τ log π + τ log Z. Based on the observation that E_π[f(s)∇log π(·|s)] = 0, we have E_π[E_π[A]·∇log π] = 0 and E_π[log Z·∇log π] = 0. On the one hand,
E_π[(Q − V)∇log π] = E_π[A∇log π] − E_π[E_π[A]·∇log π] = τE_π[log π·∇log π] + τE_π[log Z·∇log π] = τE_π[log π·∇log π].
On the other hand,
∇H[π] = −∇Σ_A π_i log π_i = −Σ_A ∇π_i·log π_i − Σ_A π_i∇log π_i = −Σ_A π_i∇log π_i·log π_i − Σ_A π_i(∇π_i/π_i) = −E_π[log π·∇log π],
where the last step uses Σ_A π_i(∇π_i/π_i) = ∇Σ_A π_i = 0. Hence E_π[(Q − V)∇log π] = −τ∇H[π].
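The DR-Trace value estimate above can be sketched as a per-trajectory target computation. This is a simplified, unvectorized illustration under assumed inputs; the clipping constants, array shapes and the function name are our choices, not the paper's implementation:

```python
import numpy as np

def dr_trace_v_targets(rewards, values, q_sa, rhos, gamma=0.99,
                       clip_rho=1.0, clip_c=1.0):
    """DR-Trace value targets along one trajectory (a sketch).

    rewards[t] = r_t, values[t] = V(s_t) with a bootstrap value at the end
    (so values has length T+1), q_sa[t] = Q(s_t, a_t), and
    rhos[t] = pi(a_t|s_t) / mu(a_t|s_t) are importance ratios.
    delta^DR_t = r_t + gamma * V(s_{t+1}) - Q(s_t, a_t).
    """
    T = len(rewards)
    rho = np.minimum(clip_rho, rhos)  # clipped rho_t
    c = np.minimum(clip_c, rhos)      # clipped c_t
    deltas = rewards + gamma * values[1:] - q_sa
    targets = np.array(values[:-1], dtype=float)
    for t in range(T):
        coef = 1.0  # gamma^k * c_t * ... * c_{t+k-1}, empty product at k = 0
        for k in range(T - t):
            targets[t] += coef * rho[t + k] * deltas[t + k]
            coef *= gamma * c[t + k]
    return targets
```

Note that when Q is self-consistent, i.e. Q(s_t, a_t) = r_t + γV(s_{t+1}), every δ^DR vanishes and the targets reduce to V itself; the Q-targets of Theorem 4 can be built analogously with ρ̃_{t,k}.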
Theorem E.1. Define Ā = A − E_π[A], Q = Ā + sg(V). Define
T(Q) := E_μ[Q(s_t, a_t) + Σ_{k≥0} γ^k c_{[t+1:t+k−1]} ρ̃_{t,k} δ^DR_{t+k}],
S(V) := E_μ[V(s_t) + Σ_{k≥0} γ^k c_{[t:t+k−1]} ρ_{t+k} δ^DR_{t+k}],
U(Q, V) = (T(Q) − E_π[Q] + S(V), S(V)),
U^(n)(Q, V) = U(U^(n−1)(Q, V)).
Then U^(n)(Q, V) → (Q^π̃, V^π̃) as n → +∞, where
π̃(a|s) = min{ρ̄μ(a|s), π(a|s)} / Σ_{b∈A} min{ρ̄μ(b|s), π(b|s)}.

Remark. T(Q) − E_π[Q] + S(V) is exactly how Q is updated at training time. Since Q = Ā + sg(V), if we perform gradient ascent on Q and V in the directions ∇L_Q(θ) and ∇L_V(θ) respectively, the change of Q comes from two sources: ∇L_Q(θ), which changes A, and ∇L_V(θ), which changes V. Because the gradient of V is stopped when estimating Q, the latter is captured by "subtract the old baseline, add the new baseline", i.e. the term −E_π[Q] + S(V) in Theorem E.1.

Proof. Define T̃(Q) = −E_π[Q] + T(Q), Ũ(Q, V) = (T̃(Q), S(V)), and Ũ^(n)(Q, V) = Ũ(Ũ^(n−1)(Q, V)). By Lemma E.3, T̃^(n)(Q) converges to some A* as n → ∞. This process does not influence the estimation of V, since the gradient of V is stopped when estimating Q, and by the proof of Lemma E.3, A* does not depend on V. By Lemma E.4, S^(n)(V) converges to some V* as n → ∞. Hence Ũ^(n)(Q, V) → (A*, V*) as n → +∞. By definition, U(Q, V) = (T̃(Q) + S(V), S(V)); regarding T̃(Q) + S(V) as Q and S(V) as V,
U^(2)(Q, V) = U(T̃(Q) + S(V), S(V)) = (T(T̃(Q) + S(V)) − S(V) + S^(2)(V), S^(2)(V)) = (T̃^(2)(Q) + S^(2)(V), S^(2)(V)).
By induction, U^(n)(Q, V) = (T̃^(n)(Q) + S^(n)(V), S^(n)(V)) → (A* + V*, V*) as n → +∞. As in (Espeholt et al., 2018),
π̃(a|s) = min{ρ̄μ(a|s), π(a|s)} / Σ_{b∈A} min{ρ̄μ(b|s), π(b|s)}
is the policy for which the Bellman equation holds, i.e. E_μ[ρ_t(r_t + γV_{t+1} − V_t) | F_t] = 0, and U(Q^π̃, V^π̃) = (Q^π̃, V^π̃). So we have (A* + V*, V*) = (Q^π̃, V^π̃).
Lemma E.3. Define Ā = A − E_π[A], Q = Ā + sg(V). Then the operator
T(Q) := E_μ[Q(s_t, a_t) + Σ_{k≥0} γ^k c_{[t+1:t+k−1]} ρ̃_{t,k} δ^DR_{t+k}]
is a contraction mapping with respect to Q.

Remark. Note that T(Q) is exactly the Q-estimate of Appendix D. Since Q = Ā + sg(V), the gradient of V is stopped when estimating Q, so updating Q does not change V, which is equivalent to updating A. Without loss of generality, we assume V is fixed at V* in the proof.

Proof. Ā = A − E_π[A] gives E_π[Ā] = 0, which guarantees that no matter how we update A, we always have E_π[Q] = V*. Based on this observation, define T̃(Q) := −E_π[Q] + T(Q); it suffices to prove that T̃(Q) is a contraction mapping. For brevity, denote Q_t = Q(s_t, a_t), A_t = A(s_t, a_t), V*_t = V*(s_t). Noticing that ρ̃_{t,0} = 1 and letting F denote the filtration, we can rewrite T̃ as
T̃(Q) = E_μ[A_t + Σ_{k≥0} γ^k c_{[t+1:t+k−1]} ρ̃_{t,k} δ^DR_{t+k}] = E_μ[−V*_t + Σ_{k≥0} γ^k c_{[t+1:t+k−1]} ρ̃_{t,k} r_{t+k} + Σ_{k≥0} γ^{k+1} c_{[t+1:t+k−1]} ∆_k],   (16)
where
∆_k = E_μ[ρ̃_{t,k} V*_{t+k+1} − c_{t+k} ρ̃_{t,k+1} Q_{t+k+1} | F_{t+k}].   (17)
By the definition of Q, E_μ[V*_{t+k+1} | F_{t+k}] = E_μ[E_π[Q_{t+k+1} | F_{t+k+1}] | F_{t+k}], so we can rewrite equation 17 as
∆_k = E_μ[(ρ̃_{t,k} π_{t+k+1}/μ_{t+k+1} − c_{t+k} ρ̃_{t,k+1}) Q_{t+k+1} | F_{t+k}].   (18)
For any Q_1 = A_1 + sg(V*) and Q_2 = A_2 + sg(V*), since E_μ[(ρ̃_{t,k} π_{t+k+1}/μ_{t+k+1} − c_{t+k} ρ̃_{t,k+1}) | F_{t+k}] ≥ 0, by equations 16 and 18 we have ||T̃(Q_1) − T̃(Q_2)|| ≤ C||Q_1 − Q_2||, where
C = E_μ[Σ_{k≥0} γ^{k+1} c_{[t+1:t+k−1]} (ρ̃_{t,k} π_{t+k+1}/μ_{t+k+1} − c_{t+k} ρ̃_{t,k+1})] = E_μ[Σ_{k≥0} γ^{k+1} c_{[t+1:t+k−1]} (ρ̃_{t,k} − c_{t+k} ρ̃_{t,k+1})] = 1 − (1 − γ)E_μ[Σ_{k≥0} γ^k c_{[t+1:t+k−1]} ρ̃_{t,k}] ≤ 1 − (1 − γ) < 1.
Hence T̃(Q) is a contraction mapping and converges to some fixed function, which we denote by A*; consequently T(Q) is also a contraction mapping and converges to A* + V*.

Lemma E.4. Define Q = A + sg(V) with E_π[A] = 0. Then the operator
S(V) := E_μ[V(s_t) + Σ_{k≥0} γ^k c_{[t:t+k−1]} ρ_{t+k} δ^DR_{t+k}]
is a contraction mapping with respect to V.

Remark. Note that S(V) is exactly the V-estimate of Appendix D.
Proof. As in Lemma E.3, we obtain
∆_k = E_μ[(ρ_{t+k} − c_{t+k} ρ_{t+k+1}) V_{t+k+1} − c_{t+k} ρ_{t+k+1} A*_{t+k+1} | F_{t+k}],
so that
∆¹_k − ∆²_k = E_μ[(ρ_{t+k} − c_{t+k} ρ_{t+k+1})(V¹_{t+k+1} − V²_{t+k+1}) | F_{t+k}].
The remaining proof is identical to that of (Espeholt et al., 2018).

G EVALUATION OF CASA ON ATARI GAMES

Random scores and average human scores are from (Badia et al., 2020). Human World Records (HWR) are from (Toromanoff et al., 2019). Rainbow's scores are from (Hessel et al., 2017). IMPALA's scores are from (Espeholt et al., 2018). LASER's scores are from (Schmitt et al., 2020), no sweep at 200M.



π(s, a) = exp(A(s, a)) / ∫ exp(A(s, b)) db,
Ā(s, a) = A(s, a) − ∫ sg(π(s, b)) A(s, b) db,
Q(s, a) = Ā(s, a) + sg(V(s)).   (15)
This definition satisfies the consistency of CASA on continuous action spaces:
∇log π(s, a) = ∇A(s, a) − ∇∫ exp(A(s, b)) db / ∫ exp(A(s, b)) db = ∇A(s, a) − ∫ exp(A(s, b)) ∇A(s, b) db / ∫ exp(A(s, b)) db = ∇A(s, a) − ∫ π(s, b) ∇A(s, b) db = ∇Ā(s, a) = ∇Q(s, a).
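The key derivative identity used here, ∇log π = ∇A − E_π[∇A], can be checked numerically in the discretized (quadrature) case with τ = 1. This is a sanity check we add for illustration, not part of the paper:

```python
import numpy as np

def log_pi(A):
    """log softmax of an advantage vector, numerically stabilized."""
    z = A - A.max()
    return z - np.log(np.exp(z).sum())

rng = np.random.default_rng(1)
m, eps = 7, 1e-6
A = rng.normal(size=m)          # advantages at m quadrature actions
pi = np.exp(log_pi(A))

# Finite-difference Jacobian of log pi with respect to the advantages A.
J = np.zeros((m, m))
for j in range(m):
    A_pert = A.copy()
    A_pert[j] += eps
    J[:, j] = (log_pi(A_pert) - log_pi(A)) / eps

# Analytic form from the consistency argument: d log pi_i / d A_j = delta_ij - pi_j.
assert np.allclose(J, np.eye(m) - pi[None, :], atol=1e-4)
```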

Figure 5: Ablation study with and without DR-Trace on Breakout, ChopperCommand and Krull.

Figure 4: The ablation results evaluated on Breakout (top row) and Qbert (bottom row). From left to right: the return curve, χ, cos(β), a scatter plot of (χ, cos(β)), and a box plot of (χ, cos(β)). Each scatter point is one batch sampled from every 100 consecutive batches. Each box is the interquartile range of the scatter points.

4.4 EVALUATION OF CASA ON ATARI GAMES

We present an extensive evaluation of CASA: we train CASA + DR-Trace on 57 Atari games and report the results in terms of two metrics. The first is the Human Normalized Score (HNS), which normalizes the reward by the random policy and the human expert policy. The other is the Standardized Atari BEnchmark for RL (SABER), which normalizes the reward by the random policy and human world records, with the normalized score capped at 200%. We consider SABER because recent studies show that the median HNS can easily be hacked, since it is sensitive to improvements on a small subset of games. Table 4 summarizes the results; evaluation scores on the Atari benchmark are presented in %. Though LASER outperforms CASA in median HNS and mean SABER, CASA outperforms LASER in median SABER and mean HNS. Overall, these results demonstrate that the conflict-averse strategy efficiently boosts performance in large-scale training scenarios and outperforms strong on-policy and off-policy algorithms. Hyperparameters and individual game scores are presented in Appendix F and Appendix G, respectively.
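The two normalizations can be written down directly. This is a sketch of the standard formulas; the per-game random, human and world-record reference scores must be supplied from the cited sources:

```python
def hns(score, random_score, human_score):
    """Human Normalized Score, in percent."""
    return 100.0 * (score - random_score) / (human_score - random_score)

def saber(score, random_score, world_record, cap=200.0):
    """SABER score: normalized by the human world record, capped at 200%."""
    return min(cap, 100.0 * (score - random_score) / (world_record - random_score))
```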

Ziyu Wang, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot, and Nando de Freitas. Dueling network architectures for deep reinforcement learning. In International Conference on Machine Learning, pp. 1995-2003, 2016.

Tianhe Yu, Saurabh Kumar, Abhishek Gupta, Sergey Levine, Karol Hausman, and Chelsea Finn. Gradient surgery for multi-task learning. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.

PPO is the original PPO. PPO ver.1 and PPO ver.2 are adapted versions used to calculate ∇_θ L_Q. PPO+CASA applies CASA to PPO, as described in Sec. 4.2.



F HYPERPARAMETERS

Our Python packages are shown in Table 7. All experiments follow the shared hyperparameters in Table 8. The specific hyperparameters for PPO, R2D2 and CASA+DR-Trace are shown in Table 9, Table 10 and Table 11, respectively. The only exceptions are the V-loss scaling, Q-loss scaling and π-loss scaling, which may be zero in some ablation settings; we state these three hyperparameters explicitly for every experiment.


