LEARNING VALUE FUNCTIONS IN DEEP POLICY GRADIENTS USING RESIDUAL VARIANCE

Abstract

Policy gradient algorithms have proven successful in diverse decision-making and control tasks. However, these methods suffer from high sample complexity and instability issues. In this paper, we address these challenges by providing a different approach for training the critic in the actor-critic framework. Our work builds on recent studies indicating that traditional actor-critic algorithms do not succeed in fitting the true value function, calling for the need to identify a better objective for the critic. In our method, the critic uses a new state-value (resp. state-action-value) function approximation that learns the value of the states (resp. state-action pairs) relative to their mean value rather than the absolute value as in conventional actor-critic. We prove the theoretical consistency of the new gradient estimator and observe dramatic empirical improvement across a variety of continuous control tasks and algorithms. Furthermore, we validate our method in tasks with sparse rewards, where we provide experimental evidence and theoretical insights.

1. INTRODUCTION

Model-free deep reinforcement learning (RL) has been successfully used in a wide range of problem domains, ranging from teaching computers to control robots to playing sophisticated strategy games (Silver et al., 2014; Schulman et al., 2016; Lillicrap et al., 2016; Mnih et al., 2016). State-of-the-art policy gradient algorithms currently combine ingenious learning schemes with neural networks as function approximators in the so-called actor-critic framework (Sutton et al., 2000; Schulman et al., 2017; Haarnoja et al., 2018). While such methods demonstrate great performance in continuous control tasks, several discrepancies persist between what motivates the conceptual framework of these algorithms and what is implemented in practice to obtain maximum gains. For instance, research aimed at improving the learning of value functions often restricts the class of function approximators through different assumptions, then proposes a critic formulation that allows for a more stable policy gradient. However, recent studies (Tucker et al., 2018; Ilyas et al., 2020) indicate that state-of-the-art policy gradient methods (Schulman et al., 2015; 2017) fail to fit the true value function and that recently proposed state-action-dependent baselines (Gu et al., 2016; Liu et al., 2018; Wu et al., 2018) do not reduce gradient variance more than state-dependent ones. These findings leave the reader skeptical about actor-critic algorithms, suggesting that recent research tends to improve performance by introducing a bias rather than by stabilizing the learning. Consequently, attempting to find a better baseline is questionable, as critics would typically fail to fit it (Ilyas et al., 2020). In Tucker et al. (2018), the authors argue that "much larger gains could be achieved by instead improving the accuracy of the value function". Following this line of thought, we are interested in ways to better approximate the value function.
One approach addressing this issue is to put more focus on relative state-action values, an idea introduced in the literature on advantage reinforcement learning (Harmon & Baird III) and followed by works on dueling (Wang et al., 2016) neural networks. More recent work (Lin & Zhou, 2020) also suggests that considering the relative action values, or more precisely the ranking of actions in a state, leads to better policies. The main argument behind this intuition is that it suffices to identify the optimal actions to solve a task. We extend this principle of relative action value with respect to the mean value to cover both state- and state-action-value functions with a new objective for the critic: minimizing the variance of residual errors. In essence, this modified loss function puts more focus on the values of states (resp. state-action pairs) relative to their mean value rather than their absolute values, with the intuition that solving a task corresponds to identifying the optimal action(s) rather than estimating the exact value of each state. In summary, this paper:
• Introduces Actor with Variance Estimated Critic (AVEC), an actor-critic method providing a new training objective for the critic based on the residual variance.
• Provides evidence for the improvement of the value function approximation as well as theoretical consistency of the modified gradient estimator.
• Demonstrates experimentally that AVEC, when coupled with state-of-the-art policy gradient algorithms, yields a significant performance boost on a set of challenging tasks, including environments with sparse rewards.
• Provides empirical evidence supporting a better fit of the true value function and a substantial stabilization of the gradient.

2. RELATED WORK

Our approach builds on three lines of research, of which we give a quick overview: policy gradient algorithms, regularization in policy gradient methods, and exploration in RL. Policy gradient methods use stochastic gradient ascent on a policy gradient estimator. This was originally formulated as the REINFORCE algorithm (Williams, 1992). Kakade & Langford (2002) later introduced conservative policy iteration and provided lower bounds on the minimum objective improvement. Peters et al. (2010) replaced regularization by a trust region constraint to stabilize training. In addition, extensive research has investigated methods to improve the stability of gradient updates: although it is possible to obtain an unbiased estimate of the policy gradient from empirical trajectories, the corresponding variance can be extremely high. To improve stability, Weaver & Tao (2001) show that subtracting a baseline (Williams, 1992) from the value function in the policy gradient can be very beneficial in reducing variance without introducing bias. However, in practice, these modifications of the actor-critic framework usually result in improved performance without a significant variance reduction (Tucker et al., 2018; Ilyas et al., 2020). Currently, two of the most dominant on-policy methods are proximal policy optimization (PPO) (Schulman et al., 2017) and trust region policy optimization (TRPO) (Schulman et al., 2015), both of which require new samples to be collected for each gradient step. Another direction of research that overcomes this limitation is off-policy algorithms, which therefore benefit from all sample transitions; soft actor-critic (SAC) (Haarnoja et al., 2018) is one such approach achieving state-of-the-art performance.
Several works also investigate regularization effects on the policy gradient (Jaderberg et al., 2016; Namkoong & Duchi, 2017; Kartal et al., 2019; Flet-Berliac & Preux, 2019; 2020); regularization is often used to shift the bias-variance trade-off towards reducing the variance while introducing a small bias. In RL, it frequently serves to encourage exploration and takes the form of an entropy term (Williams & Peng, 1991; Schulman et al., 2017). Moreover, while regularization in machine learning generally consists in smoothing over the observation space, in the RL setting Thodoroff et al. (2018) show that it is possible to smooth over the temporal dimension as well. Furthermore, Zhao et al. (2016) analyze the effects of a regularization using the variance of the policy gradient (an idea reminiscent of SVRG descent (Johnson & Zhang, 2013)), which proves to provide more consistent policy improvements at the expense of reduced performance. In contrast, as we will see later, AVEC neither changes the policy network optimization procedure nor involves any additional computational cost. Exploration has been studied from different angles in RL. One common strategy is ε-greedy, where the agent explores with probability ε by taking a random action. This method, just like entropy regularization, enforces uniform exploration and has achieved recent success in game-playing environments (Mnih et al., 2013; Van Hasselt et al., 2015; Mnih et al., 2016). On the other hand, for most policy-based RL, exploration is a natural component of any algorithm following a stochastic policy, which chooses sub-optimal actions with non-zero probability. Furthermore, the policy gradient literature contains exploration methods based on uncertainty estimates of values (Kaelbling, 1993; Tokic, 2010), as well as algorithms that provide an intrinsic exploration or curiosity bonus (Schmidhuber, 2006; Bellemare et al., 2016; Flet-Berliac et al., 2021).
While existing research may share some motivations with our method, no previous work in RL applies the variance of residual errors as an objective loss function. In the context of linear regression, Brown (1947) considers a median-unbiased estimator minimizing the risk with respect to the absolute-deviation loss function (Pham-Gia & Hung, 2001), which is similar in spirit to the variance of residual errors; their motivation is nonetheless different from ours. Indeed, they seek robustness to outliers whereas, when considering noiseless RL problems, one usually seeks to capture those (sometimes rare) signals corresponding to the rewards.

3.1. BACKGROUND AND NOTATIONS

We consider an infinite-horizon Markov Decision Process (MDP) with continuous states s ∈ S, continuous actions a ∈ A, transition distribution s_{t+1} ∼ P(s_t, a_t) and reward function r_t ∼ R(s_t, a_t). Let π_θ(a|s) denote a stochastic policy with parameter θ; we restrict policies to Gaussian distributions. In the following, π and π_θ denote the same object. The agent repeatedly interacts with the environment by sampling action a_t ∼ π(·|s_t), receives reward r_t and transitions to a new state s_{t+1}. The objective is to maximize the expected sum of discounted rewards
\[ J(\pi) \triangleq \mathbb{E}_{\tau \sim \pi}\Big[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\Big], \]
where γ ∈ [0, 1) is a discount factor (Puterman, 1994) and τ = (s_0, a_0, r_0, s_1, a_1, r_1, ...) is a trajectory sampled from the environment using policy π. We denote the value of a state s in the MDP framework while following a policy π by
\[ V^{\pi}(s) \triangleq \mathbb{E}_{\tau \sim \pi}\Big[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \,\Big|\, s_0 = s\Big] \]
and the value of performing action a in state s and then following policy π by
\[ Q^{\pi}(s, a) \triangleq \mathbb{E}_{\tau \sim \pi}\Big[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \,\Big|\, s_0 = s, a_0 = a\Big]. \]
Finally, the advantage function, which quantifies how much better an action a is than the average action in state s, is denoted A^π(s, a) ≜ Q^π(s, a) − V^π(s).
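For concreteness, the discounted return defining J(π) can be estimated from a sampled trajectory by a simple backward recursion; a minimal sketch (the function name is ours, not from the paper):

```python
def discounted_return(rewards, gamma=0.99):
    """Monte-Carlo estimate of sum_t gamma^t * r_t via the backward
    recursion G <- r_t + gamma * G."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# A toy trajectory with a single reward of 1 at the third step: return = gamma^2.
print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))
```

Averaging this quantity over trajectories starting from a given state (or state-action pair) yields the Monte-Carlo estimates of V^π and Q^π used throughout the paper.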

3.2. CRITICS IN DEEP POLICY GRADIENTS

In this section, we consider the case where the value functions are learned using function estimators and then used in an approximation of the gradient. Without loss of generality, we consider algorithms that approximate the state-value function V; the analysis holds for algorithms that approximate the state-action-value function Q. Let f_φ : S → R be an estimator of V^π with parameter φ. Traditionally, f_φ is learned by minimizing the mean squared error (MSE) against an empirical target V̂^π. At iteration k, the critic minimizes
\[ \mathcal{L}_{\mathrm{AC}} = \mathbb{E}_s\Big[\big(f_\phi(s) - \hat V^{\pi_{\theta_k}}(s)\big)^2\Big], \]
where the states s are collected under policy π_{θ_k}, and V̂^{π_{θ_k}}(s) is an empirical estimate of V (see Section 4.3 for details). Similarly, using f_φ : S × A → R instead, one can fit an empirical target Q̂^π.
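The conventional critic objective above amounts to a plain batch MSE; a hedged numpy sketch (helper name ours):

```python
import numpy as np

def critic_mse(values, targets):
    """L_AC: mean squared error between critic predictions f_phi(s) and the
    empirical targets V_hat(s) over a batch of visited states."""
    values, targets = np.asarray(values, float), np.asarray(targets, float)
    return float(np.mean((values - targets) ** 2))
```

In an actor-critic implementation, the critic parameters φ would then be updated by gradient descent on this scalar.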

4. METHOD: ACTOR WITH VARIANCE ESTIMATED CRITIC

In this section, we introduce AVEC and discuss its correctness, motivations and implementation.

4.1. DEFINING AN ALTERNATIVE CRITIC

Recent work (Ilyas et al., 2020) empirically demonstrates that while the value network succeeds in the supervised learning task of fitting the empirical target V̂^π (resp. Q̂^π), it does not fit V^π (resp. Q^π). We address this deficiency in the estimation of the critic by introducing an alternative value network loss. Following empirical evidence indicating that the problem is the approximation error and not the estimator per se, AVEC adopts a loss that can provide a better approximation error and yields better estimators of the value function (as will be shown in Section 5.3). At update k:
\[ \mathcal{L}_{\mathrm{AVEC}} = \mathbb{E}_s\Big[\Big(\big(f_\phi(s) - \hat V^{\pi_{\theta_k}}(s)\big) - \mathbb{E}_s\big[f_\phi(s) - \hat V^{\pi_{\theta_k}}(s)\big]\Big)^2\Big], \tag{3} \]
with states s collected using π_{θ_k}. Note that the gradient flows through f_φ twice in Eq. 3. We then define our bias-corrected estimator g_φ : S → R such that
\[ g_\phi(s) = f_\phi(s) + \mathbb{E}_s\big[\hat V^{\pi_{\theta_k}}(s) - f_\phi(s)\big]. \]
Analogously to Eq. 3, we define an alternative critic for the estimation of Q^π by replacing V̂^π by Q̂^π and f_φ(s) by f_φ(s, a).

Proposition (AVEC Policy Gradient). If f_φ : S × A → R satisfies the parameterization assumption (Sutton et al., 2000), then g_φ provides an unbiased policy gradient:
\[ \nabla_\theta J(\pi_\theta) = \mathbb{E}_{(s,a) \sim \pi_\theta}\big[\nabla_\theta \log(\pi_\theta(s, a))\, g_\phi(s, a)\big]. \]
Proof. See Appendix A. This result also holds for the estimation of V^{π_θ} with f_φ : S → R.
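The residual-variance loss and the bias correction can be sketched over a batch of value predictions; a minimal numpy illustration (helper names ours), assuming the empirical-mean proxy for the expectations:

```python
import numpy as np

def avec_loss(values, targets):
    """Eq. 3 (batch proxy): variance of the residuals f_phi(s) - V_hat(s)."""
    res = np.asarray(values, float) - np.asarray(targets, float)
    return float(np.mean((res - res.mean()) ** 2))

def bias_corrected(values, targets):
    """g_phi = f_phi + E[V_hat - f_phi]: shift the critic by the mean residual
    so that the resulting policy-gradient estimator remains unbiased."""
    values, targets = np.asarray(values, float), np.asarray(targets, float)
    return values + (targets - values).mean()

# The loss is blind to any constant offset of the critic; the correction g_phi
# restores the right mean level.
v, t = np.array([1.0, 2.0, 3.0]), np.array([0.0, 2.0, 4.0])
assert avec_loss(v + 5.0, t) == avec_loss(v, t)
```

This makes the design explicit: the critic is trained only on relative values, and g_φ re-centers it before it is used in the gradient.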

4.2. BUILDING MOTIVATION

Here, we present the intuition behind using AVEC for actor-critic algorithms. Tucker et al. (2018) and Ilyas et al. (2020) indicate that the approximation error V̂^π − V^π is problematic, suggesting that the variance of the empirical targets V̂^π(s_t) is high. Using L_AVEC, our approach reduces the variance term of the MSE (or distance to V̂^π) but mechanically also increases the bias. Our intuition is that since the bias is already quite substantial (Ilyas et al., 2020), it may be possible to reduce the variance enough so that even though the bias increases, the total MSE decreases.

State-value function estimation. In this case, optimizing the critic with L_AVEC can be interpreted as fitting the centered target V̄̂^π(s) = V̂^π(s) − E_{s'}[V̂^π(s')] using the MSE. We show that the targets V̄̂^π are better estimations of V̄^π(s) = V^π(s) − E_{s'}[V^π(s')] than V̂^π are of V^π. To illustrate this, consider T independent random variables (X_i)_{i∈{1,...,T}}. We denote X̄_i = X_i − (1/T) Σ_{j=1}^T X_j and V(X) the variance of X. Then
\[ \mathbb{V}(\bar X_i) = \mathbb{V}(X_i) - \frac{2}{T}\mathbb{V}(X_i) + \frac{1}{T^2}\sum_{j=1}^{T}\mathbb{V}(X_j), \]
and V(X̄_i) < V(X_i) as long as, for all i, (1/T) Σ_{j=1}^T V(X_j) < 2 V(X_i), or more generally when state-values are not strongly negatively correlated and not very discordant. This entails that V̄̂^π has a more compact span and is consequently easier to fit. This analysis shows that the variance term of the MSE is reduced compared to traditional actor-critic algorithms, but does not guarantee that it counterbalances the bias increase. Nevertheless, in practice, the bias is already so high that the increase due to learning with AVEC is only marginal, and the total MSE decreases. We empirically demonstrate this claim in Section 5.3.

State-action-value function estimation. In this case, Eq. 3 translates into replacing V̂^π(s) by Q̂^π(s, a) and f_φ(s) by f_φ(s, a), and the rationale for optimizing the residual variance of the value function instead of the full MSE becomes more straightforward: the practical use of the Q-function is to disentangle the relative values of actions for each state (Sutton et al., 2000). AVEC's effect on relative values is illustrated in a didactic one-variable regression example in Fig. 1, where grey markers are observations and the blue line is our current estimate. Minimizing the MSE, the line is expected to move towards the orange one in order to reduce errors uniformly. Minimizing the residual variance, it is expected to move near the red one. In fact, L_AVEC tends to further penalize observations that are far away from the mean, implying that AVEC allows a better recovery of the "shape" of the target near extrema. In particular, we see in the figure that the maximum and minimum observation values are quickly identified. Were the approximators linear and the target state-values independent, the two losses would be equivalent, since ordinary least squares provides minimum-variance mean-unbiased estimation. It should be noted that, as in all the works related to ours, we consider noiseless tasks, i.e. the transition matrix is deterministic. As such, there are no outliers, and extreme state-action values correspond to learning signals. In this context, high estimation errors indicate where (in the state or state-action space) the training of the value function should be improved.
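The equivalence claimed for linear approximators can be checked numerically: with a linear model, the slope minimizing the variance of the residuals coincides with the OLS slope, since the residual variance is blind to the intercept. A small sketch on synthetic data (our own construction, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 3.0 * x + 1.0 + rng.normal(scale=0.1, size=200)

# OLS slope: minimizes the full MSE (with an intercept).
slope_mse = np.cov(x, y, bias=True)[0, 1] / np.var(x)

# Slope minimizing the residual variance Var(y - a*x) over a grid of candidates;
# no intercept is needed because the variance is invariant to constant shifts.
grid = np.linspace(0.0, 6.0, 6001)
slope_avec = grid[np.argmin([np.var(y - a * x) for a in grid])]
```

Up to the grid resolution, the two criteria select the same slope; only the intercept is left unconstrained by the residual-variance loss, which is exactly the degree of freedom the bias-corrected estimator g_φ restores.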

4.3. IMPLEMENTATION

We apply this new formulation to three of the most dominant deep policy gradient methods to study whether it results in a better estimation of the value function. A better estimation of the value function implies better policy improvements. We now describe how AVEC incorporates its residual-variance objective into the critics of PPO (Schulman et al., 2017), TRPO (Schulman et al., 2015) and SAC (Haarnoja et al., 2018). Let B be a batch of transitions. In PPO and TRPO, AVEC modifies the learning of V_φ (line 12 of Algorithm 1) using
\[ \mathcal{L}^{1}_{\mathrm{AVEC}}(\phi) = \mathbb{E}_{s \sim B}\Big[\Big(\big(f_\phi(s) - \hat V^{\pi}(s)\big) - \mathbb{E}_{s \sim B}\big[f_\phi(s) - \hat V^{\pi}(s)\big]\Big)^2\Big], \]
then V_φ = f_φ(s) + E_{s∼B}[V̂^π(s) − f_φ(s)], where V̂^π(s_t) = f_{φ_old}(s_t) + A_t, with f_{φ_old}(s_t) the estimates given by the last value function and A_t the advantage of the policy, i.e. the returns minus the expected values (A_t is often estimated using generalized advantage estimation (Schulman et al., 2016)). In SAC, AVEC modifies the objective function of (Q_{φ_i})_{i=1,2} (line 13 of Algorithm 2 in Appendix C) using
\[ \mathcal{L}^{2}_{\mathrm{AVEC}}(\phi_i) = \mathbb{E}_{(s,a) \sim B}\Big[\Big(\big(f_{\phi_i}(s,a) - \hat Q^{\pi}(s,a)\big) - \mathbb{E}_{(s,a) \sim B}\big[f_{\phi_i}(s,a) - \hat Q^{\pi}(s,a)\big]\Big)^2\Big], \]
then Q_{φ_i} = f_{φ_i}(s,a) + E_{(s,a)∼B}[Q̂^π(s,a) − f_{φ_i}(s,a)], where Q̂^π(s,a) is estimated using temporal difference (see Haarnoja et al. (2018)): Q̂^π(s_t, a_t) = r(s_t, a_t) + γ E_{s_{t+1}∼π}[V_ψ(s_{t+1})], with ψ the value function parameter (see Algorithm 2). The reader may have noticed that L¹_AVEC and L²_AVEC slightly differ from Eq. 3: the residual variance of the value function (L_AVEC) is not tractable since, a priori, state-values are dependent and their joint law is unknown. Consequently, in practice, we use the empirical variance proxy assuming independence (cf. Appendix D). Greensmith et al. (2004) provide some support for this approximation by showing that weakly dependent variables tend to concentrate more than independent ones.
Finally, notice that AVEC does not modify any other part of the considered algorithms, which makes its implementation straightforward and leaves the computational complexity unchanged.

5. EXPERIMENTAL STUDY

In this section, we conduct experiments along four orthogonal directions. We point out that a comparison to variance-reduction methods is not considered in this paper: Tucker et al. (2018) demonstrated that their implementations diverge from the unbiased methods presented in the respective papers and revealed that not only do they fail to reduce the variance of the gradient, but their unbiased versions do not improve performance either. Note that in all experiments we choose the hyperparameters providing the best performance for the considered methods, which can only penalize AVEC (cf. Appendix E). In all the figures hereafter (except Fig. 3c and 3d), lines are average performances and shaded areas represent one standard deviation. (·%) is the change in performance due to AVEC.

5.1. CONTINUOUS CONTROL

Algorithm 1 AVEC coupled with PPO or TRPO. J_ALGO denotes the policy loss of either algorithm (described in Schulman et al. (2017; 2015)).
1: Input parameters: λ_π ≥ 0, λ_V ≥ 0
2: Initialize policy parameter θ and value function parameter φ
3: for each update step do
4:   B ← ∅
5:   for each environment step do
6:     a_t ∼ π_θ(s_t)
7:     s_{t+1} ∼ P(s_t, a_t)
8:     B ← B ∪ {(s_t, a_t, r_t, s_{t+1})}
9:   end for
10:  for each gradient step do
11:    θ ← θ − λ_π ∇̂_θ J_ALGO(π_θ)
12:    φ ← φ − λ_V ∇̂_φ L¹_AVEC(φ)
13:  end for
14: end for

For ease of comparison with other methods, we evaluate AVEC on the MuJoCo (Todorov et al., 2012) and PyBullet (Coumans & Bai, 2016) continuous control benchmarks (see Appendix G for details) using OpenAI Gym (Brockman et al., 2016). Note that the PyBullet versions of the locomotion tasks are harder than their MuJoCo equivalents. We choose a representative set of tasks for the experimental evaluation; their action and observation space dimensions are reported in Appendix H. We assess the benefits of AVEC when coupled with the most prominent policy gradient algorithms, currently state-of-the-art methods: PPO (Schulman et al., 2017) and TRPO (Schulman et al., 2015), both on-policy methods, and SAC (Haarnoja et al., 2018), an off-policy maximum entropy deep RL algorithm. We provide the list of hyperparameters and further implementation details in Appendices D and E. Table 1 reports the results while Fig. 2 and 8 show the total average return for SAC and PPO. TRPO results are provided in Appendix F for readability. When coupled with SAC and PPO, AVEC brings a very significant improvement (on average +26% for SAC and +39% for PPO) in the performance of the policy gradient algorithms, an improvement which is consistent across tasks. As for TRPO, while the improvement in performance is less striking, AVEC is still more sample-efficient in all tasks. Overall, AVEC improves TRPO, PPO and SAC in terms of both performance and efficiency.
This does not imply that our method would also improve other policy gradient methods that use the traditional actor-critic framework, but since we evaluate our method coupled with three of the best performing on-and off-policy algorithms, we believe that these experiments are sufficient to prove the relevance of AVEC. Furthermore, in our experiments we do not seek the best hyperparameters for the AVEC variants, we simply adopt the parameters allowing us to optimally reproduce the baselines. Alternatively, if one seeks to evaluate AVEC independently of a considered baseline, further hyperparameter tuning should produce better results. Notice that since no additional calculations are needed in AVEC's implementation, computational complexity remains unchanged.

5.2. SPARSE REWARD SIGNALS

Domains with sparse rewards are challenging to solve with uniform exploration, as agents receive no feedback on their actions before starting to collect rewards. In such conditions AVEC performs better, suggesting that the shape of the value function is better approximated, which encourages exploration. The relative value estimate of an unseen state is more accurate: as discussed in Section 4.2, AVEC identifies extreme state-values (e.g., non-zero rewards in tasks with sparse rewards) faster. In Fig. 3a and 3b, we report the performance of AVEC in the Acrobot and MountainCar environments, both of which have sparse rewards. AVEC enhances TRPO and PPO in both experiments. While PPO and AVEC-PPO both reach the best possible performance, AVEC-PPO exhibits better sample efficiency. Fig. 3c and 3d illustrate how the agent improves its exploration strategy in MountainCar: while the PPO agent remains stuck at the bottom of the hill (red), the graphs suggest that AVEC-PPO learns the difficult locomotion principles in the absence of rewards and visits a much larger part of the state space (green). This improved performance in sparse environments can be explained by the fact that AVEC is able to pick up on experienced positive rewards more easily. Moreover, the reconstructed shape of the value function is more accurate around such rewarding states, which pushes the agent to explore further around experienced states with high values.

5.3. ANALYSIS OF THE VALUE FUNCTION AND OF THE GRADIENT

We observe that PPO better fits the empirical target than when equipped with AVEC, which is to be expected since vanilla PPO optimizes the MSE directly. This result, set against the remarkable improvement in the performance of AVEC-PPO (Fig. 2), suggests that AVEC might be a better estimator of the true value function. We examine this claim below because, if true, it would indicate that it is indeed possible to simultaneously improve the performance of the agents and the stability of the method.

Learning the True Target.
A fundamental premise of policy gradient methods is that optimizing the objective based on an empirical estimation of the value function leads to a better policy, which is why we investigate the quality of fit of the true target. To approximate the true value function, we fit the returns sampled from the current policy using a large number of transitions (3 × 10^5). Fig. 5 shows that g_φ is far closer to the true value function than the estimator obtained with the MSE during the first half of training (the horizon is 10^6 steps), and as close to it thereafter. Comparing Fig. 5 with Fig. 4, we see that the distance to the true target is close to the estimation error for AVEC-PPO, while for PPO it is at least two orders of magnitude higher at all times. We further investigate these results in Fig. 9 in Appendix B.2, where we study the variation of the squared bias and variance components of the MSE to the true target (MSE = Var + Bias²). We find, as expected, that using AVEC reduces the variance term significantly while slightly increasing the bias term, which Fig. 5 confirms is negligible since the total MSE is substantially reduced (‖g_φ(AVEC) − V^π‖² ≤ ‖V_φ(PPO) − V^π‖², where V_φ(PPO) is the value function estimator in PPO). For completeness, we also analyze the distance to the true target for the Q-function estimator in SAC and AVEC-SAC in AntBullet and HalfCheetahBullet in Appendix B.3, with similar results and interpretation. We conclude that AVEC improves the value function approximation, and we expect the gradient to be more stable. Empirical Variance Reduction. We choose to study the gradient variance using the average pairwise cosine similarity metric, as it allows a comparison with Ilyas et al. (2020), with whom we share the same experimental setup and scales. Fig. 6 shows that AVEC yields a higher average (10 batches per iteration) pairwise cosine similarity, which means closer batch-estimates of the gradient and, in turn, indicates smaller gradient variance.
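The similarity metric itself is simple to compute from per-batch gradient estimates; a hedged sketch (helper name ours, input assumed of shape (number of batches, number of parameters)):

```python
import numpy as np

def avg_pairwise_cos_sim(grads):
    """Average pairwise cosine similarity between per-batch gradient estimates;
    values near 1 mean the batch estimates agree, i.e. low gradient variance."""
    g = np.asarray(grads, float)
    g = g / np.linalg.norm(g, axis=1, keepdims=True)  # unit-normalize each estimate
    sims = g @ g.T                                     # all pairwise cosines
    n = len(g)
    # Average over the n*(n-1) off-diagonal entries (exclude self-similarity).
    return float((sims.sum() - np.trace(sims)) / (n * (n - 1)))
```

Identical gradient estimates give a similarity of 1, orthogonal ones give 0, so higher values directly translate into more consistent gradient directions across batches.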
Further analysis with additional tasks is included in Appendix B.4. The variance reduction effect observed in several environments suggests that AVEC is the first method since the introduction of the value function baseline to further reduce the variance of the gradient and improve performance.

5.4. ABLATION STUDY

In this section, we examine how changing the relative importance of the bias and the residual variance in the loss of the value network affects learning. For this study, we choose difficult PyBullet tasks and use PPO because it is more efficient than TRPO and requires fewer computations than SAC. For an estimator ŷ of (y_i)_{i∈{1,...,n}}, we write
\[ \mathrm{Bias} = \frac{1}{n}\sum_{i=1}^{n}(\hat y_i - y_i) \quad \text{and} \quad \mathrm{Var} = \frac{1}{n-1}\sum_{i=1}^{n}(\hat y_i - y_i - \mathrm{Bias})^2. \]
Consequently, MSE = Var + Bias². We denote L_α = Var + α·Bias², with α ∈ R. In Fig. 7, Bias-α means that we use L_α and Var-α means that we use L_{1/α}. We observe that, while no consistent ordering of the choices of α emerges, AVEC seems to outperform all other weightings. Note that, for readability purposes, the graphs have been split; the curves of AVEC-PPO and PPO are the same in Fig. 7a and 7c, and in Fig. 7b and 7d. A more extensive hyperparameter study with more values of α might provide even higher performance; nevertheless, we believe that the stability of an algorithm is crucial for reliable performance. As such, the tuning of hyperparameters needed to achieve good results should remain mild.
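The interpolated loss of this ablation can be written as a small helper (a sketch, name ours; we use the 1/n variance here so that the identity MSE = Var + Bias² holds exactly on a finite batch):

```python
import numpy as np

def l_alpha(pred, target, alpha):
    """L_alpha = Var + alpha * Bias^2: alpha = 0 recovers the residual variance
    (AVEC) and alpha = 1 recovers the full MSE."""
    err = np.asarray(pred, float) - np.asarray(target, float)
    bias = err.mean()
    var = err.var()  # 1/n normalization, so var + bias**2 == mean(err**2)
    return float(var + alpha * bias ** 2)
```

Sweeping α between 0 and 1 thus interpolates between the AVEC objective and the conventional actor-critic objective, which is exactly the axis explored in Fig. 7.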

6. DISCUSSION

In this work, we introduce a new training objective for the critic in actor-critic algorithms to better approximate the true value function. In addition to being well-motivated by recent studies on the behaviour of deep policy gradient algorithms, we demonstrate that this modification is both theoretically sound and intuitively supported by the need to improve the approximation error of the critic. The application of Actor with Variance Estimated Critic (AVEC) to state-of-the-art policy gradient methods produces considerable gains in performance (on average +26% for SAC and +39% for PPO) over the standard actor-critic training, without any additional hyperparameter tuning. First, for SAC-like algorithms where the critic learns a state-action-value function, our results strongly suggest that state-actions with extreme values are identified more quickly. Second, for PPO-like methods where the critic learns the state-values, we show that the variance of the gradient is reduced and empirically demonstrate that this is due to a better approximation of the state-values. In sparse reward environments, the theoretical intuition behind a variance estimated critic is more explicit and is also supported by empirical evidence. In addition to corroborating the results in Ilyas et al. (2020) showing that the value estimator fails to fit V^π, we propose a method that succeeds in improving both the sample complexity and the stability of prominent actor-critic algorithms. Furthermore, AVEC benefits from its simplicity of implementation: no further assumptions are required (such as the horizon awareness of Tucker et al. (2018), introduced to remedy the deficiency of existing variance-reduction methods) and the modification of current algorithms represents only a few lines of code. In this paper, we have demonstrated the benefits of a more thorough analysis of the critic objective in policy gradient methods.
Despite our strongly favourable results, we do not claim that the residual variance is the optimal loss for the state-value or state-action-value functions, and we note that the design of comparably superior estimators for critics in deep policy gradient methods merits further study. In future work, we anticipate further analysis of the bias-variance trade-off and an extension of the results to stochastic environments; for the latter, we consider the problem of noise separation, as it is the first obstacle to accessing the variance and distinguishing extreme values from outliers.

A UNBIASED AVEC POLICY GRADIENT

In this section, we consider the case in which the state-action-value function of a policy π_θ is approximated. We prove that, given some assumptions on this estimator function, we can use it to yield a valid gradient direction, i.e., we are able to prove policy improvement when following this direction. In this setting, the critic minimizes the following loss:
\[ \mathcal{L}_{\mathrm{AVEC}} = \mathbb{E}_{(s,a) \sim \pi}\Big[\big(\hat Q^{\pi_\theta}(s,a) - f_\phi(s,a) - \mathbb{E}_{(s,a) \sim \pi}[\hat Q^{\pi_\theta}(s,a) - f_\phi(s,a)]\big)^2\Big]. \]
When a local optimum is reached, the gradient of this expression is zero:
\[ \nabla_\phi \mathcal{L}_{\mathrm{AVEC}} = \mathbb{E}_{(s,a) \sim \pi}\Big[\big(\hat Q^{\pi_\theta}(s,a) - f_\phi(s,a) - \mathbb{E}_{(s,a) \sim \pi}[\hat Q^{\pi_\theta} - f_\phi]\big)\Big(\frac{\partial f_\phi(s,a)}{\partial \phi} - \mathbb{E}_{(s,a) \sim \pi}\Big[\frac{\partial f_\phi(s,a)}{\partial \phi}\Big]\Big)\Big] = 0. \]
In the expression above, the term involving the expected partial derivative vanishes because the first bracket is centered:
\[ \mathbb{E}_{(s,a) \sim \pi}\Big[\hat Q^{\pi_\theta}(s,a) - f_\phi(s,a) - \mathbb{E}_{(s,a) \sim \pi}[\hat Q^{\pi_\theta} - f_\phi]\Big] \cdot \mathbb{E}_{(s,a) \sim \pi}\Big[\frac{\partial f_\phi(s,a)}{\partial \phi}\Big] = 0. \]
The gradient condition at the local optimum therefore simplifies to:
\[ \mathbb{E}_{(s,a) \sim \pi}\Big[\big(\hat Q^{\pi_\theta}(s,a) - f_\phi(s,a) - \mathbb{E}_{(s,a) \sim \pi}[\hat Q^{\pi_\theta} - f_\phi]\big)\frac{\partial f_\phi(s,a)}{\partial \phi}\Big] = 0. \tag{4} \]
Then, if we denote g_φ(s,a) = f_φ(s,a) + E_{(s,a)∼π}[Q̂^π(s,a) − f_φ(s,a)] and use the policy parameterization assumption
\[ \frac{\partial f_\phi(s,a)}{\partial \phi} = \frac{\partial \pi_\theta(s,a)}{\partial \theta} \frac{1}{\pi_\theta(s,a)}, \tag{5} \]
we obtain:
\[ \nabla_\theta J = \mathbb{E}_{(s,a) \sim \pi_\theta}\big[\nabla_\theta \log(\pi_\theta(s,a))\, g_\phi(s,a)\big]. \]

Proof. By combining the parameterization assumption in Eq. 5 with Eq. 4, we have:
\[ \mathbb{E}_{(s,a) \sim \pi_\theta}\Big[\big(\hat Q^{\pi_\theta}(s,a) - g_\phi(s,a)\big)\frac{\partial \pi_\theta(s,a)}{\partial \theta}\frac{1}{\pi_\theta(s,a)}\Big] = 0. \]
Since this expression is null, we have the following:
\[ \nabla_\theta J = \mathbb{E}_{(s,a) \sim \pi_\theta}\big[\nabla_\theta \log(\pi_\theta(s,a))\,\hat Q^{\pi_\theta}(s,a)\big] = \mathbb{E}_{(s,a) \sim \pi_\theta}\big[\nabla_\theta \log(\pi_\theta(s,a))\,\hat Q^{\pi_\theta}(s,a)\big] - \mathbb{E}_{(s,a) \sim \pi_\theta}\Big[\big(\hat Q^{\pi_\theta}(s,a) - g_\phi(s,a)\big)\frac{\partial \pi_\theta(s,a)}{\partial \theta}\frac{1}{\pi_\theta(s,a)}\Big] = \mathbb{E}_{(s,a) \sim \pi_\theta}\big[\nabla_\theta \log(\pi_\theta(s,a))\, g_\phi(s,a)\big]. \]

Remark. While the proof appears more or less generic, the assumption in Eq. 5 is extremely constraining for the possible approximators. Sutton et al. (2000) quote J. Tsitsiklis, who believes that a g_φ linear in the features of the policy may be the only feasible solution for this condition. Concretely, such an assumption cannot hold since neural networks are the standard approximators used in practice. Moreover, empirical analysis (Ilyas et al., 2020) indicates that commonly used algorithms fail to fit the true value function. However, this does not rule out the usefulness of the approach but rather calls for more questioning of the true effect of such biased baselines.

B ADDITIONAL EXPERIMENTS

B.1 CONTINUOUS CONTROL: WALKER2D

Fig. 8 shows the total average return of AVEC coupled with SAC and PPO on the Walker2d task. As in the other continuous control tasks considered from MuJoCo and PyBullet, AVEC brings a significant performance improvement (+26% for SAC and +33% for PPO), confirming the generality of our approach.

B.4 VARIANCE REDUCTION

In Fig. 11, we study the empirical variance of the gradient by measuring the average pairwise cosine similarity (10 gradient measurements) in two additional tasks: HopperBullet and Walker2DBullet. We also vary the trajectory size used in the estimation of the gradient.

C IMPLEMENTATION OF AVEC COUPLED WITH SAC

In Algorithm 2, $J_V$ is the squared residual error objective used to train the soft value function. See Haarnoja et al. (2018) for further details and notation on SAC that are not directly relevant here.

Algorithm 2 AVEC coupled with SAC.
1: Input parameters: β ∈ [0, 1], λ_V ≥ 0, λ_Q ≥ 0, λ_π ≥ 0
2: Initialize policy parameters θ, value function parameters ψ and ψ̄, and Q-function parameters φ_1 and φ_2
3: D ← ∅
4: for each iteration do
5:   for each environment step do
6:     a_t ∼ π_θ(a_t|s_t)
7:     s_{t+1} ∼ P(s_t, a_t)
8:     D ← D ∪ {(s_t, a_t, r_t, s_{t+1})}
9:   end for
10:  for each gradient step do
11:    ψ ← ψ - λ_V ∇_ψ J_V(ψ)
12:    φ_i ← φ_i - λ_Q ∇_{φ_i} L²_AVEC(φ_i) for i ∈ {1, 2}
13:    θ ← θ - λ_π ∇_θ J(π_θ)
14:    ψ̄ ← βψ + (1 - β)ψ̄
15:  end for
16: end for
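The critic update φ_i ← φ_i - λ_Q ∇_{φ_i} L²_AVEC(φ_i) can be sketched in miniature. The code below (our illustration, not the authors' implementation) performs one such gradient step for a hypothetical one-parameter linear critic on synthetic regression targets standing in for the soft Bellman targets, and checks that the residual variance decreases:

```python
import random

random.seed(0)

# Synthetic data: features x and targets q = 2x + noise (stand-ins for
# state-action features and soft Bellman targets; true slope is 2).
xs = [random.gauss(0.0, 1.0) for _ in range(64)]
qs = [2.0 * x + random.gauss(0.0, 0.1) for x in xs]

def loss_and_grad(phi):
    # AVEC loss: unbiased sample variance of the residuals q - f_phi(x),
    # for the linear critic f_phi(x) = phi * x.
    res = [q - phi * x for q, x in zip(qs, xs)]
    T = len(res)
    m = sum(res) / T
    xbar = sum(xs) / T
    loss = sum((r - m) ** 2 for r in res) / (T - 1)
    # d(res_t - m)/dphi = -x_t + xbar, hence:
    grad = sum(2.0 * (r - m) * (xbar - x) for r, x in zip(res, xs)) / (T - 1)
    return loss, grad

phi, lr = 0.0, 0.1          # lr plays the role of lambda_Q
l0, g = loss_and_grad(phi)
phi -= lr * g               # one critic update step
l1, _ = loss_and_grad(phi)
assert l1 < l0              # the update reduces the residual variance
```

In AVEC-SAC the same step is taken with neural network critics and automatic differentiation; the toy version only makes the direction of the update explicit.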

D IMPLEMENTATION DETAILS

Theoretically, $\mathcal{L}^{\text{AVEC}}$ is defined as the residual variance of the value function (cf. Eq. 3). However, the state-values of a non-optimal policy are dependent, and this variance is not tractable without access to the joint law of the state-values. Consequently, to implement AVEC in practice we use the best-known proxy at hand, namely the empirical variance formula under an independence assumption:

$$\mathcal{L}^{\text{AVEC}} = \frac{1}{T-1}\sum_{t=1}^{T}\Big[\big(f_\phi(s_t) - \hat{V}^\pi(s_t)\big) - \frac{1}{T}\sum_{t'=1}^{T}\big(f_\phi(s_{t'}) - \hat{V}^\pi(s_{t'})\big)\Big]^2,$$

where $T$ is the size of the sampled trajectory.
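In code, this proxy is simply the unbiased sample variance of the residuals along one sampled trajectory. A minimal sketch (ours, not the authors' implementation; `values_pred` and `returns` are hypothetical names for $f_\phi(s_t)$ and $\hat{V}^\pi(s_t)$):

```python
def avec_loss(values_pred, returns):
    """Empirical AVEC critic loss: the unbiased (1/(T-1)) sample variance of
    the residuals f_phi(s_t) - V_hat(s_t) along a trajectory of length T."""
    residuals = [f - v for f, v in zip(values_pred, returns)]
    T = len(residuals)
    m = sum(residuals) / T                     # mean residual
    return sum((r - m) ** 2 for r in residuals) / (T - 1)

# Example: residuals [0.5, 1.5, 2.5] have sample variance 1.0.
print(avec_loss([1.0, 2.0, 3.0], [0.5, 0.5, 0.5]))   # -> 1.0
```

Note that, unlike the MSE, this loss is invariant to a constant shift of all predictions, consistent with the claim that AVEC learns values relative to their mean.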

F COMPARATIVE EVALUATION OF AVEC WITH TRPO

In order to evaluate the performance gains from using AVEC instead of the usual actor-critic framework, we conduct additional experiments with the TRPO (Schulman et al., 2015) algorithm. Fig. 12 shows the learning curves, while Table 5 reports the results.



Greensmith et al. (2004) analyze the dependent case: in general, weakly dependent variables tend to concentrate more than independent ones.



Figure 1: Comparison of simple models derived when $\mathcal{L}^{\text{AVEC}}$ is used instead of the MSE.

(a) We validate the superiority of AVEC compared to the traditional actor-critic training. (b) We evaluate AVEC in environments with sparse rewards. (c) We clarify the practical implications of using AVEC by examining the bias in both the empirical and true value function estimations as well as the variance in the empirical gradient. (d) We provide an ablation analysis and study the bias-variance trade-off in the critic by considering two continuous control tasks.

Figure 2: Comparative evaluation (6 seeds) of AVEC with SAC and PPO on PyBullet ("TaskBullet") and MuJoCo ("Task") tasks. X-axis: number of timesteps. Y-axis: average total reward.

Figure 3: (a,b): Comparative evaluation (6 seeds) of AVEC in sparse reward tasks. X-axis: number of timesteps. Y-axis: average total reward. (c,d): Respectively state visitation frequency and phase portrait of visited states of AVEC-TRPO (green) and TRPO (red) in MountainCar.

Figure 4: $L_2$ distance to $V^\pi$.

Figure 5: $L_2$ distance to $V^\pi$. X-axis: we run PPO and AVEC-PPO and, for each $t \in \{1, 2, 4, 6, 9\}\cdot 10^5$, we stop training, use the current policy to collect $3\cdot 10^5$ transitions, and estimate $V^\pi$.

Figure 6: Average gradient cosine-similarity.

Figure 7: Sensitivity (6 seeds) of AVEC-PPO with respect to (a,b): the bias; (c,d): the variance. X-axis: number of timesteps. Y-axis: average total reward.

Figure 8: Comparative evaluation (6 seeds) of AVEC with SAC (left) and PPO (right) on the Walker2d MuJoCo task. Lines are average performances and shaded areas represent one standard deviation.

Figure 11: Average cosine similarity between gradient measurements. AVEC empirically reduces the variance compared to PPO or PPO without a baseline (PPO-nobaseline). Trajectory size used in the estimation of the gradient variance: 3000 (upper row), 6000 (middle row), 9000 (lower row). Lines are average performances and shaded areas represent one standard deviation.

Figure 12: Comparative evaluation of AVEC with TRPO. We run with 6 different seeds: lines are average performances and shaded areas represent one standard deviation.

Average total reward of the last 100 episodes over 6 runs of $10^6$ timesteps. Comparative evaluation of AVEC with SAC and PPO. ± corresponds to a single standard deviation over trials and (.%) is the change in performance due to AVEC.


Average total reward of the last 100 episodes over 6 runs of $10^6$ timesteps. Comparative evaluation of AVEC with TRPO. ± corresponds to a single standard deviation over trials and (.%) is the change in performance due to AVEC.

B.2 VARIATION OF THE BIAS AND VARIANCE TERMS: PPO

In Fig. 9, we show the variation of the bias and variance terms in the MSE between the estimators (of AVEC-PPO and PPO) and the true target:

$$\mathbb{E}\big[\| g_\phi - V^\pi \|_2^2\big] = \text{Bias}(\text{AVEC})^2 + \text{Var}(\text{AVEC}) \quad\text{and}\quad \mathbb{E}\big[\| V_\phi^{\text{PPO}} - V^\pi \|_2^2\big] = \text{Bias}(\text{PPO})^2 + \text{Var}(\text{PPO}),$$

where $V_\phi^{\text{PPO}}$ is the value function estimator in PPO. We observe that the variation of the variance term is more substantial than that of the bias term. Together with Fig. 5, which shows that the distance of the estimator to $V^\pi$ is lower when using AVEC, these results confirm that the variance reduction effect counterbalances the bias increase. Note that the % variation of the variance term is always negative in our experiments; the shaded areas suggesting otherwise are merely an artifact of assuming symmetrical deviations, itself due to the Gaussianity assumption needed to construct confidence intervals. Fig. 9 X-axis: we run PPO and AVEC-PPO and, for each $t \in \{1, 2, 4, 6, 9\}\cdot 10^5$, we stop training, use the current policy to interact with the environment for $3\cdot 10^5$ transitions, and use these transitions to estimate the true value function. Lines are average variations and shaded areas represent one standard deviation (5 seeds).
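The decomposition above is the standard bias-variance identity for the mean squared error of an estimator. A quick numerical check on a handful of hypothetical value estimates (invented numbers, purely illustrative):

```python
from statistics import mean, pvariance

v_true = 3.0                                 # hypothetical true value
estimates = [2.1, 3.4, 2.8, 3.9, 2.6]        # hypothetical estimates over seeds

mse = mean((x - v_true) ** 2 for x in estimates)
bias_sq = (mean(estimates) - v_true) ** 2    # squared bias of the estimator
var = pvariance(estimates)                   # population (1/N) variance

# MSE decomposes exactly into squared bias plus variance.
assert abs(mse - (bias_sq + var)) < 1e-12
```

The identity is exact when the variance uses the 1/N normalization (hence `pvariance`); with the unbiased 1/(N-1) estimate the two sides only agree in expectation.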

B.3 LEARNING THE TRUE TARGET: SAC

In Fig. 10 , we compare the error between the Q-function estimator and the true Q-function for SAC and AVEC-SAC in AntBullet and HalfCheetahBullet. We note a modest but consistent reduction in this error when using AVEC coupled with SAC, echoing the significant performance gains in Fig. 2 . 

E EXPERIMENT DETAILS

In all experiments, we use the same hyperparameter values for all tasks, namely the best-performing ones reported in the literature or in the respective open-source implementation documentation. We thus ensure the best performance for the conventional actor-critic framework. In other words, since we are interested in evaluating the impact of the new critic, everything else is kept as is. Note that this experimental protocol may not benefit AVEC. In Tables 2, 3 and 4, we report the list of hyperparameters common to all continuous control experiments.

Environment: Description

Ant-v2: Make a four-legged creature walk forward as fast as possible.
AntBulletEnv-v0: Idem. Ant is heavier, encouraging it to typically have two or more legs on the ground (source: PyBullet Guide - url).
HalfCheetah-v2: Make a 2D cheetah robot run.
HalfCheetahBulletEnv-v0: Idem.
Humanoid-v2: Make a three-dimensional bipedal robot walk forward as fast as possible, without falling over.
Reacher-v2: Make a 2D robot reach a randomly located target.
Walker2d-v2: Make a 2D robot walk forward as fast as possible.
Acrobot-v1: Swing the end of a two-joint acrobot up to a given height.
MountainCar-v0: Get an underpowered car to the top of a hill.

H DIMENSIONS OF STUDIED TASKS

