A DEEPER LOOK AT DISCOUNTING MISMATCH IN ACTOR-CRITIC ALGORITHMS

Abstract

We investigate the discounting mismatch in actor-critic algorithm implementations from a representation learning perspective. Theoretically, actor-critic algorithms usually have discounting for both actor and critic, i.e., there is a γ^t term in the actor update for the transition observed at time t in a trajectory, and the critic is a discounted value function. Practitioners, however, usually ignore the discounting (γ^t) for the actor while using a discounted critic. We investigate this mismatch in two scenarios. In the first scenario, we consider optimizing an undiscounted objective (γ = 1), where the γ^t term disappears naturally (1^t = 1). We propose to interpret the discounting in the critic in terms of a bias-variance-representation trade-off and provide supporting empirical results. In the second scenario, we consider optimizing a discounted objective (γ < 1) and propose to interpret the omission of the discounting in the actor update from an auxiliary task perspective, again providing supporting empirical results.

1. INTRODUCTION

Actor-critic algorithms have enjoyed great success both theoretically (Williams, 1992; Sutton et al., 2000; Konda, 2002; Schulman et al., 2015a) and empirically (Mnih et al., 2016; Silver et al., 2016; Schulman et al., 2017; OpenAI, 2018). There is, however, a longstanding gap between the theory behind actor-critic algorithms and how practitioners implement them. Let γ, γ_A, and γ_C be the discount factors for defining the objective, updating the actor, and updating the critic respectively. Theoretically, no matter whether γ = 1 or γ < 1, we should always use γ_A = γ_C = γ (Sutton et al., 2000; Schulman et al., 2015a), or at least keep γ_A = γ_C if Blackwell optimality (Veinott, 1969; Weitzman, 2001) is considered. Practitioners, however, usually use γ_A = 1 and γ_C < 1 in their implementations (Dhariwal et al., 2017; Caspi et al., 2017; Zhang, 2018; Kostrikov, 2018; Achiam, 2018; Liang et al., 2018; Stooke & Abbeel, 2019). Although this mismatch and its theoretical disadvantages have been recognized by Thomas (2014) and Nota & Thomas (2020), whether and why it yields benefits in practice has not been systematically studied. In this paper, we empirically investigate this mismatch from a representation learning perspective. We consider two scenarios separately.

Scenario 1: The true objective is undiscounted (γ = 1). The theory prescribes γ_A = γ_C = γ = 1. Practitioners, however, usually use γ_A = γ = 1 but γ_C < 1, introducing bias. We explain this mismatch with the following hypothesis: Hypothesis 1. γ_C < 1 optimizes a bias-variance-representation trade-off. It is easy to see that γ_C < 1 reduces the variance of bootstrapping targets. Beyond this, we provide empirical evidence showing that when γ_C < 1, it may become easier to find a good representation than when γ_C = 1. Consequently, although using γ_C < 1 introduces bias, it can facilitate representation learning.
For our empirical study, we make use of recently introduced techniques, such as fixed-horizon temporal difference learning (De Asis et al., 2019) and distributional reinforcement learning (Bellemare et al., 2017), to disentangle the various effects the discount factor has on the learning process.

Scenario 2: The true objective is discounted (γ < 1). Theoretically, there is a γ^t term in the actor update for a transition observed at time t in a trajectory (Sutton et al., 2000; Schulman et al., 2015a). Practitioners, however, usually ignore this term while using a discounted critic, i.e., γ_A = 1 and γ_C = γ < 1 are used. We explain this mismatch with the following hypothesis: Hypothesis 2. Using γ_C = γ < 1 and γ_A = 1 is effectively similar to using γ_C = γ_A = γ < 1 plus an auxiliary loss that sometimes facilitates representation learning. Our empirical study involves implementing the auxiliary task explicitly by using an additional policy to optimize the difference term between the loss with γ_A = 1 and the loss with γ_A < 1. We also design new benchmarking environments where the sign of the reward function is flipped after a certain time step, such that later transitions differ from earlier ones. In that setting, γ_A = 1 becomes harmful.

2. BACKGROUND

Table 1: Roles of the different discount factors
  γ   : defines the objective
  γ_A : updates the actor
  γ_C : updates the critic

Markov Decision Processes: We consider an infinite-horizon MDP with a finite state space S, a finite action space A, a bounded reward function r : S → R, a transition kernel p : S × S × A → [0, 1], an initial state distribution μ_0, and a discount factor γ ∈ [0, 1]. The initial state S_0 is sampled from μ_0. At time step t, an agent in state S_t takes action A_t ∼ π(·|S_t), where π : A × S → [0, 1] is the policy it follows. The agent then receives a reward R_{t+1} ≐ r(S_t) and proceeds to the next state S_{t+1} ∼ p(·|S_t, A_t). The return of the policy π at time step t is defined as G_t ≐ Σ_{i=1}^∞ γ^{i−1} R_{t+i}, which allows us to define the state value function v^γ_π(s) ≐ E[G_t | S_t = s] and the state-action value function q^γ_π(s, a) ≐ E[G_t | S_t = s, A_t = a]. We consider episodic tasks, where we assume there is an absorbing state s_∞ ∈ S such that r(s_∞) = 0 and p(s_∞|s_∞, a) = 1 for any a ∈ A. When γ < 1, v^γ_π and q^γ_π are always well defined. When γ = 1, to ensure v^γ_π and q^γ_π are well defined, we further assume a finite expected episode length. Let T^π_s be a random variable denoting the first time step at which an agent following π hits s_∞ given S_0 = s. We assume T_max ≐ sup_{π∈Π} max_s E[T^π_s] < ∞, where π is parameterized by θ and Π is the corresponding function class. Similar assumptions are used in stochastic shortest path problems (e.g., Section 2.2 of Bertsekas & Tsitsiklis (1996)). In our experiments, all environments have a hard time limit of 1000 steps, i.e., T_max = 1000. This is standard practice; classic RL environments also have an upper limit on episode length (e.g., 27k in Bellemare et al. (2013, ALE)). Following Pardo et al. (2018), we add the (normalized) time step t to the state to keep the environment Markovian. We measure the performance of a policy π with J_γ(π)
= E_{S_0∼μ_0}[v^γ_π(S_0)]. Vanilla Policy Gradient: Sutton et al. (2000) compute ∇_θ J_γ(π) as

∇_θ J_γ(π) = Σ_s d^γ_π(s) Σ_a q^γ_π(s, a) ∇_θ π(a|s),   (1)

where d^γ_π(s) ≐ Σ_{t=0}^∞ γ^t Pr(S_t = s | μ_0, p, π) for γ < 1 and d^γ_π(s) ≐ E[Σ_{t=0}^{T^π_{S_0}} Pr(S_t = s | S_0, p, π)] for γ = 1. Note that d^γ_π remains well defined for γ = 1 when T_max < ∞. To optimize the policy performance J_γ(π), one can follow (1) and, at time step t, update θ_t as

θ_{t+1} ← θ_t + α γ_A^t q^{γ_C}_π(S_t, A_t) ∇_θ log π(A_t|S_t),   (2)

where α is a learning rate. If we replace q^{γ_C}_π with a learned value function, the update rule (2) becomes an actor-critic algorithm, where the actor refers to π and the critic refers to the learned approximation of q^{γ_C}_π. In practice, an estimate of v^{γ_C}_π instead of q^{γ_C}_π is usually learned. Theoretically, we should have γ_A = γ_C = γ. Practitioners, however, usually ignore the γ_A^t term in (2) and use γ_C < γ_A = 1. What this update truly optimizes remains an open problem (Nota & Thomas, 2020). TRPO and PPO: To improve the stability of actor-critic algorithms, Schulman et al. (2015a) propose Trust Region Policy Optimization (TRPO), based on the performance improvement lemma:

Lemma 1. (Theorem 1 in Schulman et al. (2015a)) For γ < 1 and any two policies π and π′,
J_γ(π′) ≥ J_γ(π) + Σ_s d^γ_π(s) Σ_a π′(a|s) Adv^γ_π(s, a) − (4γ max_{s,a} |Adv^γ_π(s, a)| / (1 − γ)^2) ε(π, π′),
where Adv^γ_π(s, a) ≐ E_{s′∼p(·|s,a)}[r(s) + γ v^γ_π(s′) − v^γ_π(s)] is the advantage, ε(π, π′) ≐ max_s D_KL(π(·|s) || π′(·|s)), and D_KL refers to the KL divergence.

To facilitate our empirical study, we first make a theoretical contribution by extending Lemma 1 to the undiscounted setting:

Lemma 2. Assuming T_max < ∞, for γ = 1 and any two policies π and π′,
J_γ(π′) ≥ J_γ(π) + Σ_s d^γ_π(s) Σ_a π′(a|s) Adv^γ_π(s, a) − 4 max_{s,a} |Adv^γ_π(s, a)| T_max^2 ε(π, π′).

The proof of Lemma 2 is provided in the appendix.
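As a concrete illustration of update (2), the following minimal sketch applies the per-step weight γ_A^t to a REINFORCE-style actor update. The names (`actor_gradient_step`, `grad_log_pi`) are illustrative, not from any released implementation:

```python
import numpy as np

def actor_gradient_step(theta, trajectory, q_values, gamma_A, alpha, grad_log_pi):
    """One pass of update (2): theta <- theta + alpha * gamma_A^t * q * grad log pi.

    trajectory: list of (state, action) pairs observed at t = 0, 1, ...;
    q_values[t] approximates q^{gamma_C}_pi(S_t, A_t);
    grad_log_pi(theta, s, a) returns the score vector for (s, a).
    """
    for t, (s, a) in enumerate(trajectory):
        theta = theta + alpha * (gamma_A ** t) * q_values[t] * grad_log_pi(theta, s, a)
    return theta
```

Setting gamma_A = 1 recovers the practitioners' variant in which every time step contributes with equal weight.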
A practical implementation of Lemmas 1 and 2 is to compute a new policy θ via gradient ascent on the clipped objective

L(θ) ≐ Σ_{t=0}^∞ γ_A^t min( (π_θ(A_t|S_t) / π_{θ_old}(A_t|S_t)) Adv^{γ_C}_{π_{θ_old}}(S_t, A_t), clip(π_θ(A_t|S_t) / π_{θ_old}(A_t|S_t)) Adv^{γ_C}_{π_{θ_old}}(S_t, A_t) ),   (3)

where S_t and A_t are sampled from π_{θ_old}, and clip(x) ≐ max(min(x, 1 + ε), 1 − ε) with ε a hyperparameter. Theoretically, we should have γ_A = γ_C, but practical algorithms like Proximal Policy Optimization (Schulman et al., 2017, PPO) usually use γ_C < γ_A = 1. Policy Evaluation: We now introduce several policy evaluation techniques used in our empirical study. Let v̂ be our estimate of v^γ_π. At time step t, Temporal Difference learning (TD, Sutton (1988)) updates v̂ as

v̂(S_t) ← v̂(S_t) + α(R_{t+1} + γ v̂(S_{t+1}) − v̂(S_t)).

Alternatively, one can learn fixed-horizon values by updating

v̂_i(S_t) ← v̂_i(S_t) + α(R_{t+1} + v̂_{i−1}(S_{t+1}) − v̂_i(S_t))  (i = 1, . . . , H),   (4)

where v̂_0(s) ≐ 0. In other words, to learn v̂_H, we need to learn {v̂_i}_{i=1,...,H} simultaneously. De Asis et al. (2019) call (4) Fixed Horizon Temporal Difference learning (FHTD). As G_t is a random variable, Bellemare et al. (2017) propose to learn its full distribution instead of only its expectation, yielding the distributional reinforcement learning (RL) paradigm. They use a categorical distribution with 51 atoms uniformly distributed in [−V_max, V_max] to approximate the distribution of G_t, where V_max is a hyperparameter. In this paper, we refer to the corresponding policy evaluation algorithm as C51. Methodology: We consider MuJoCo robot simulation tasks from OpenAI Gym (Brockman et al., 2016) as our benchmark. Given its popularity for understanding deep RL algorithms (Henderson et al., 2017; Ilyas et al., 2018; Engstrom et al., 2019; Andrychowicz et al., 2020) and designing new deep RL algorithms (Fujimoto et al., 2018; Haarnoja et al., 2018), we believe our empirical results are relevant to most practitioners.
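A minimal tabular sketch of the FHTD update (4), assuming a finite state space indexed by integers; the array layout (`v` of shape (H+1, num_states), with row 0 pinned to v̂_0 ≡ 0) is our own choice:

```python
import numpy as np

def fhtd_update(v, s, r, s_next, alpha):
    """One FHTD step (4): for each horizon i = 1..H,
    v[i][s] <- v[i][s] + alpha * (r + v[i-1][s_next] - v[i][s]),
    where v[0] is identically zero. v has shape (H+1, num_states)."""
    H = v.shape[0] - 1
    for i in range(1, H + 1):
        v[i, s] += alpha * (r + v[i - 1, s_next] - v[i, s])
    return v
```

Note that learning the horizon-H value requires maintaining all H shorter-horizon estimates, exactly as described in the text.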
We choose PPO, a simple yet effective and widely used algorithm, as the representative actor-critic algorithm for our empirical study. PPO is usually equipped with generalized advantage estimation (Schulman et al., 2015b, GAE), which has a tunable hyperparameter λ. The roles of λ and γ_C are similar. To reduce its confounding effect, we do not use GAE in our experiments, i.e., the advantage estimate for our actor is simply the TD error R_{t+1} + γ_C v̂(S_{t+1}) − v̂(S_t). The PPO pseudocode we follow is provided in Algorithm 1 in the appendix, and we refer to it as the default PPO implementation. We use the standard architecture and optimizer across all tasks; in particular, the actor and the critic do not share layers. We conduct a thorough learning rate search in Ant for each algorithmic configuration (i.e., a curve in a figure) and then use the same learning rate for all other tasks. When using FHTD and C51, we also include H and V_max in the grid search. All details are provided in the appendix. We report the average episode return of the ten most recent episodes against the number of interactions with the environment. Curves are averages over ten independent runs, with shaded regions indicating standard errors.
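For concreteness, the clipped objective (3) combined with the one-step TD-error advantage used in our experiments can be sketched as follows; the helper names are hypothetical, not from the released PPO implementations cited above:

```python
import numpy as np

def ppo_clipped_objective(ratio, adv, eps=0.2, gamma_A=1.0):
    """Clipped surrogate (3) for one trajectory:
    sum_t gamma_A^t * min(ratio_t * adv_t, clip(ratio_t, 1-eps, 1+eps) * adv_t)."""
    ratio = np.asarray(ratio, dtype=float)
    adv = np.asarray(adv, dtype=float)
    t = np.arange(len(ratio))
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.sum((gamma_A ** t) * np.minimum(ratio * adv, clipped * adv))

def td_advantage(r_next, v_s, v_s_next, gamma_C):
    """The advantage estimate used here: the one-step TD error
    R_{t+1} + gamma_C * v(S_{t+1}) - v(S_t) (GAE is deliberately not used)."""
    return r_next + gamma_C * v_s_next - v_s
```

The default implementations set gamma_A = 1 in this objective while gamma_C < 1 enters only through the advantage, which is exactly the mismatch under study.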

3. OPTIMIZING THE UNDISCOUNTED OBJECTIVE (SCENARIO 1)

When our goal is to optimize the undiscounted objective J_{γ=1}(π), one theoretically grounded option is to use γ_A = γ_C = γ = 1. By using γ_A = 1 and γ_C < 1, practitioners introduce bias. We first empirically confirm that introducing bias in this way indeed has empirical advantages. A simple first hypothesis is that γ_C < 1 leads to lower variance in Monte Carlo bootstrapping targets than γ_C = 1 and thus optimizes a bias-variance trade-off. We show, however, that there are empirical advantages of γ_C < 1 that cannot be explained by this bias-variance trade-off alone, indicating that additional factors beyond variance are at play. We then present empirical evidence identifying representation learning as such an additional factor, leading to the bias-variance-representation trade-off of Hypothesis 1. All experiments in this section use γ_A = 1. Bias-variance trade-off: To investigate the advantages of using γ_C < 1, we first test default PPO with γ_C ∈ {0.95, 0.97, 0.99, 0.995, 1}. We find that the best discount factor always satisfies γ_C < 1 and that γ_C = 1 usually leads to a performance drop (Figure 1). In default PPO, although the advantage is computed as the one-step TD error, the update target for the critic v̂(S_t) is almost always a Monte Carlo return. As there is no γ_A^t term in the actor update, we should theoretically use γ_C = γ_A = 1 when computing the Monte Carlo return, which usually leads to high variance. Consequently, a simple hypothesis for the empirical advantages of using γ_C < 1 is a bias-variance trade-off. We find, however, that there is more at play.
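The variance-reduction part of this hypothesis is easy to illustrate numerically: with i.i.d. noisy rewards, the variance of the Monte Carlo return shrinks as the discount decreases. The toy simulation below is purely illustrative and uses none of the benchmark tasks:

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_return_variance(gamma, episodes=2000, length=100):
    """Empirical variance of the Monte Carlo return G_0 = sum_t gamma^t * R_{t+1}
    for i.i.d. Gaussian rewards; a toy illustration of variance control."""
    rewards = rng.normal(1.0, 1.0, size=(episodes, length))
    discounts = gamma ** np.arange(length)
    returns = rewards @ discounts  # one discounted return per episode
    return returns.var()
```

For unit-variance rewards the undiscounted return has variance near the episode length, while γ = 0.9 shrinks it to roughly 1/(1 − γ²) ≈ 5, a large gap even in this idealized setting.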

Beyond bias-variance trade-off:

To reduce the effect of γ_C in controlling the variance, we benchmark PPO-TD (Algorithm 2 in the appendix). PPO-TD is the same as default PPO except that the critic is updated with one-step TD, i.e., the update target for v̂(S_t) is now R_{t+1} + γ_C v̂(S_{t+1}). Although Figure 2 shows that PPO-TD (γ_C = 1) outperforms PPO (γ_C = 1) by a large margin, indicating that the bias-variance trade-off may be at play, Figure 3 suggests that for PPO-TD as well, γ_C < 1 is still preferable to γ_C = 1. To further study this phenomenon, we benchmark PPO-TD-Ex (Algorithm 3 in the appendix), in which we provide N extra transitions to the critic by sampling multiple actions at any single state and using an averaged bootstrapping target. The update target for v̂(S_t) in PPO-TD-Ex is (1/(N+1)) Σ_{i=0}^N [R^i_{t+1} + γ_C v̂(S^i_{t+1})]. Here R^0_{t+1} and S^0_{t+1} refer to the original reward and successor state. To get R^i_{t+1} and S^i_{t+1} for i ∈ {1, . . . , N}, we first sample an action A^i_t from the sampling policy, then reset the environment to S_t, and finally execute A^i_t to get R^i_{t+1} and S^i_{t+1}. Importantly, we do not count those N extra transitions on the x-axis when plotting. The advantage for the actor update in PPO-TD-Ex is estimated with R^0_{t+1} + v̂(S^0_{t+1}) − v̂(S_t) regardless of γ_C, to further control the influence of variance. The critic v̂ is not trained on the extra successor states {S^i_{t+1}}_{i=1,...,N}, so the quality of the prediction v̂(S^i_{t+1}) depends mainly on the generalization of v̂. Intuitively, if v̂ generalizes well, providing a proper amount of transitions this way should improve or maintain the overall performance, as they help reduce variance. As shown by Figure 4, PPO-TD-Ex (γ_C = 0.99) roughly follows this intuition. Surprisingly, however, providing extra data to PPO-TD-Ex (γ_C = 1) leads to a significant performance drop (Figure 5). This drop suggests that the larger variance from the randomness of S_{t+1} is not the only issue when using γ_C = 1 to train the critic.
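The averaged bootstrapping target of PPO-TD-Ex can be sketched in a few lines; this is a minimal illustration and the function name is ours:

```python
import numpy as np

def averaged_td_target(rewards, v_next, gamma_C):
    """PPO-TD-Ex critic target: (1/(N+1)) * sum_i (R^i_{t+1} + gamma_C * v(S^i_{t+1})).

    rewards, v_next: length-(N+1) arrays; index 0 is the transition actually
    taken, indices 1..N come from the extra resampled actions at S_t."""
    rewards = np.asarray(rewards, dtype=float)
    v_next = np.asarray(v_next, dtype=float)
    return np.mean(rewards + gamma_C * v_next)
```

Because the extra successor states are never themselves trained on, the quality of this target hinges on how well the critic generalizes to them, which is the point of the experiment.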
The quality of the estimate v̂, at least in terms of making predictions on the untrained states {S^i_{t+1}}_{i=1,...,N}, is lower when γ_C = 1 is used than when γ_C < 1 is used. In other words, the generalization of v̂ is poor when γ_C = 1. The curves for PPO-TD-Ex (γ_C = 0.995) are a mixture of γ_C = 0.99 and γ_C = 1 and are provided in Figure 16 in the appendix. In the undiscounted setting, we should theoretically use R_{t+1} + v̂(S_{t+1}) as the update target for the critic. When γ_C < 1 is used instead, the update target becomes R_{t+1} + γ_C v̂(S_{t+1}) and the variance resulting from the randomness of S_{t+1} becomes less pronounced. So here γ_C trades off bias with variance, similar to its role in the Monte Carlo bootstrapping targets of default PPO. We refer to this effect of γ_C as variance control. However, γ_C can also affect the difficulty of learning a good estimate v̂ of v^{γ_C}_π; we refer to this effect of γ_C as learnability control (Lehnert et al., 2018; Laroche & van Seijen, 2018; Romoff et al., 2019). Inspired by the poor generalization of v̂ when γ_C = 1, we investigate learnability control mainly from the representation learning perspective. By representation learning, we refer to learning the bottom layers (backbone) of a neural network. The last layer of the neural network is then interpreted as a linear function approximator whose features are the output of the backbone. This interpretation of representation learning is widely used in the RL community; see, e.g., Jaderberg et al. (2016); Chung et al. (2018); Veeriah et al. (2019).

Bias-representation trade-off:

To separate variance control and learnability control, ideally we would investigate the update target R_{t+1} + γ_{C,1} v̂(S_{t+1}), where v̂ is trained to approximate v^{γ_{C,2}}_π and γ_{C,2} < γ_{C,1} = 1. Learning an estimate v̂ of v^{γ_{C,2}}_π, however, implies using the update target R_{t+1} + γ_{C,2} v̂(S_{t+1}): the two effects of γ_{C,2} then get mixed again. To resolve this dilemma, we consider the update target R_{t+1} + v̂_{H−1}(S_{t+1}), where v̂_{H−1} is trained to approximate v^{H−1}_π, i.e., we use FHTD to train the critic in PPO, which we refer to as PPO-FHTD (Algorithm 4 in the appendix). PPO-FHTD implements γ_{C,1} = 1 directly, and manipulating H changes the horizon of the policy evaluation problem, which is also one of the effects of manipulating γ_{C,2}. We test two parameterizations of PPO-FHTD to investigate representation learning. In the first parameterization, to learn v^H_π, we parameterize {v̂_i}_{i=1,...,H} as H different heads over the same representation layer (backbone). In the second parameterization, we always learn {v̂_i}_{i=1,...,1024} as 1024 different heads over the same representation layer, whatever H we are interested in. To approximate v^H_π, we then simply use the output of the H-th head. A diagram (Figure 13) in the appendix further illustrates the difference between the two parameterizations. Figure 6 shows that, by tuning H for FHTD, PPO-FHTD with the first parameterization matches or exceeds the performance of PPO-TD (γ_C < 1) in most tasks, and that the best H is always smaller than 1024. Figure 6 also shows that the performance of PPO-FHTD (H = 1024) is very close to that of PPO-TD (γ_C = 1), indicating that learning {v̂_i}_{i=1,...,1023} is not an additional overhead for the network in terms of learning v^{H=1024}_π, i.e., increasing H does not pose additional challenges in terms of network capacity. However, Figure 7 suggests that for the second parameterization, H = 1024 is almost always among the best choices of H.
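The two head parameterizations can be sketched with a shared linear feature map; the class `FixedHorizonHeads` below is our own illustrative construction, assuming the backbone output is already given as a feature vector, and is not the paper's actual network:

```python
import numpy as np

class FixedHorizonHeads:
    """Linear sketch of the PPO-FHTD value heads: `num_heads` linear heads
    over a shared feature backbone, where head i predicts v^i_pi.
    Parameterization 1 sets num_heads = H; parameterization 2 always uses
    num_heads = 1024 and simply reads off the H-th head."""

    def __init__(self, feature_dim, num_heads):
        # One weight row per fixed-horizon value head.
        self.W = np.zeros((num_heads, feature_dim))

    def predict(self, features, horizon):
        """Value estimate of the horizon-th head (1-indexed)."""
        return self.W[horizon - 1] @ features
```

Under parameterization 2, the backbone must support all 1024 heads at once regardless of which H is read out, which is what lets the comparison isolate the representation effect.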
Comparing Figures 6 and 7, we conclude that in the tested domains, learning v^H_π for different H requires different representations. This suggests interpreting the results in Figure 6 as a bias-representation trade-off: using a larger H is less biased, but representation learning may become harder due to the longer policy evaluation horizon. Consequently, an intermediate H achieves the best performance in Figure 6. As reducing H cannot bring advantages in representation learning under the second parameterization, the less biased H, i.e., the larger H, usually performs better in Figure 7. Overall, γ_C optimizes a bias-representation trade-off by changing the policy evaluation horizon H. We further conjecture that representation learning may be harder for a longer horizon because good representations become rarer. We provide a simulated example to support this. Consider policy evaluation on the simple Markov Reward Process (MRP) from Figure 8. We assume the reward for each transition is fixed and randomly generated in [0, 1]. Let x_s ∈ R^K be the feature vector of a state s; we set its i-th component as x_s[i] ≐ tanh(ξ), where ξ is a random variable uniformly distributed in [−2, 2]. We chose this feature setup as we use tanh as the activation function in our PPO. We use X ∈ R^{N×K} to denote the feature matrix. To create state aliasing (McCallum, 1997), which is common under function approximation, we first randomly split the N states into S_1 and S_2 such that |S_1| = αN and |S_2| = (1 − α)N, where α is the proportion of states to be aliased. Then for every s ∈ S_1, we randomly select an ŝ ∈ S_2 and set x_s ← x_ŝ. Finally, we add Gaussian noise N(0, 0.1^2) to each element of X. We use N = 100 and K = 30 in our simulation and report the normalized representation error (NRE) as a function of γ. For a feature matrix X, the NRE is computed analytically as NRE(γ)
= min_w ||Xw − v_γ||_2 / ||v_γ||_2, where v_γ is the analytically computed true value function of the MRP. We report the results in Figure 9, where each data point is averaged over 10 randomly generated feature matrices (X) and reward functions. In this MRP, the average representation error becomes larger as γ increases, which suggests that learning a good representation under a large γ and state aliasing may be harder than with a smaller γ. We report the unnormalized representation error in Figure 17 in the appendix, where the trend is much clearer. Overall, though we do not claim a monotonic relationship between the discount factor and the difficulty of representation learning, our empirical study does suggest that representation learning is a key factor at play in the misuse of discounting in actor-critic algorithms, beyond the widely recognized bias-variance trade-off. In the appendix, we provide additional experiments involving distributional RL to further support the bias-variance-representation trade-off hypothesis, under the assumption that the benefits of distributional RL come mainly from improved representation learning (Bellemare et al., 2017; Munos, 2018; Petroski Such et al., 2019).

4. OPTIMIZING THE DISCOUNTED OBJECTIVE (SCENARIO 2)

When our goal is to optimize the discounted objective J_{γ<1}(π), theoretically we should keep the γ_A^t term in the actor update and use γ_C < 1. Practitioners, however, usually ignore this γ_A^t term (i.e., set γ_A = 1), introducing bias. Figure 10 shows that even if we use the discounted return as the performance metric, the biased implementation of PPO still outperforms the theoretically grounded implementation DisPPO in the domains we tested. Here PPO refers to the default PPO implementation, where γ_A = 1 and γ_C = γ < 1, and DisPPO (Algorithm 6 in the appendix) adds the missing γ_A^t term to PPO by using γ_A = γ_C = γ < 1. We propose to interpret the empirical advantages of PPO over DisPPO with Hypothesis 2.
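As a small aside, the discounted episodic return used as the performance metric in this scenario can be computed with a standard backward recursion; a minimal sketch, with the function name ours:

```python
def discounted_return(rewards, gamma):
    """Discounted episodic return sum_t gamma^t * R_{t+1}, the evaluation
    metric in this scenario, computed by the backward recursion
    G <- R + gamma * G over the episode's reward sequence."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

Note that under this metric, rewards beyond roughly the effective horizon 1/(1 − γ) contribute very little, which is what makes the later transitions of an episode nearly irrelevant to the objective itself.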
For all experiments in this section, we use γ_C = γ < 1. An auxiliary task perspective: The biased policy update implementing (2) while ignoring γ_A^t can be decomposed into two parts as Δ_t = γ^t Δ_t + (1 − γ^t) Δ_t, where Δ_t ≐ q^{γ_C}_π(S_t, A_t) ∇_θ log π(A_t|S_t). We propose to interpret the difference term between the biased implementation (Δ_t) and the theoretically grounded implementation (γ^t Δ_t), i.e., the (1 − γ^t) q^{γ_C}_π(S_t, A_t) ∇_θ log π(A_t|S_t) term, as the gradient of an auxiliary objective with a dynamic weighting 1 − γ^t. Let J_{s,μ}(π) ≐ Σ_a π(a|s) q^γ_μ(s, a); we have ∇_θ J_{s,μ}(π)|_{μ=π} = E_{a∼π(·|s)}[q^γ_π(s, a) ∇_θ log π(a|s)]. This objective changes at every time step (through μ). Inspired by the decomposition, we augment PPO with this auxiliary task, yielding AuxPPO (Algorithm 7 and Figure 13 in the appendix). In AuxPPO, we have two policies π and π′, parameterized by θ and θ′ respectively. The two policies are two heads over the same neural network backbone, where π is used for interaction with the environment and π′ is the policy for the auxiliary task. AuxPPO optimizes θ and θ′ simultaneously by considering the joint loss

L(θ, θ′) ≐ Σ_{t=0}^∞ γ^t min( (π_θ(A_t|S_t) / π_{θ_old}(A_t|S_t)) Adv^{γ_C}_{π_{θ_old}}(S_t, A_t), clip(π_θ(A_t|S_t) / π_{θ_old}(A_t|S_t)) Adv^{γ_C}_{π_{θ_old}}(S_t, A_t) ) + Σ_{t=0}^∞ (1 − γ^t) min( (π′_{θ′}(A_t|S_t) / π_{θ_old}(A_t|S_t)) Adv^{γ_C}_{π_{θ_old}}(S_t, A_t), clip(π′_{θ′}(A_t|S_t) / π_{θ_old}(A_t|S_t)) Adv^{γ_C}_{π_{θ_old}}(S_t, A_t) ),

where S_t and A_t are obtained by executing π_{θ_old}. We additionally synchronize θ′ with θ periodically to avoid an off-policy learning issue. Flipped rewards: Besides AuxPPO, we also design novel environments with flipped rewards to investigate Hypothesis 2. Recall that we include the time step in the state; this allows us to create a new environment simply by defining a new reward function r′(s, t) ≐ r(s) I_{t≤t_0} − r(s) I_{t>t_0}, where I is the indicator function.
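The flipped reward function r′ and the criterion for choosing t_0 can be sketched as follows (a minimal sketch; both function names are ours):

```python
def flipped_reward(r, t, t0):
    """Reward of the flipped environment: r'(s, t) = r(s) for t <= t0 and
    -r(s) afterwards. Since the (normalized) time step is part of the state,
    the flipped environment remains Markovian."""
    return r if t <= t0 else -r

def flip_time(gamma, threshold=0.05):
    """t0 = min{t : gamma^t < threshold}, found by accumulating discounts."""
    t, discount = 0, 1.0
    while discount >= threshold:
        t += 1
        discount *= gamma
    return t
```

With threshold 0.05 and γ = 0.99, for instance, the flip happens around step 299, so the flipped rewards are almost invisible to the discounted objective yet dominate the later part of every episode.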
During an episode, within the first t_0 steps, this new environment is the same as the original one. After t_0 steps, the sign of the reward is flipped. We select t_0 such that γ^{t_0} is sufficiently small, e.g., we define t_0 ≐ min{t : γ^t < 0.05}. With this criterion for selecting t_0, the later transitions (i.e., transitions after t_0 steps) have little influence on the evaluation objective, the discounted return. Consequently, the later transitions affect the overall learning process mainly through representation learning. DisPPO rarely makes use of the later transitions due to the γ_A^t term in the gradient update. AuxPPO makes use of the later transitions only through representation learning. PPO exploits the later transitions for representation learning, and the later transitions also affect the control policy of PPO directly. Results: When we consider the original environments, Figure 11 shows that in 8 out of 12 tasks, PPO outperforms DisPPO, even when the performance metric is the discounted episodic return. In all those 8 tasks, by using the difference term as an auxiliary task, AuxPPO is able to improve upon DisPPO. In 6 out of those 8 tasks, AuxPPO roughly matches the performance of PPO at the end of training. For γ ∈ {0.93, 0.9} in Ant, the improvement of AuxPPO is not clear; we conjecture that this is because the learning of the π-head (the control head) in AuxPPO is much slower than the learning of π in PPO due to the γ^t term. Overall, this suggests that the benefit of PPO over DisPPO comes mainly from representation learning. When we consider the environments with flipped rewards, PPO is outperformed by DisPPO and AuxPPO by a large margin in 11 out of 12 tasks. The transitions after t_0 steps are not directly relevant when the performance metric is the discounted return.
However, learning on those transitions may still improve representation learning, provided that they are similar to the earlier transitions, which is the case in the original environments. PPO and AuxPPO therefore outperform DisPPO. When those transitions are very different from the earlier ones, however, which is the case in the environments with flipped rewards, learning to control on them directly becomes distracting. PPO is therefore outperformed by DisPPO. Different from PPO, AuxPPO does not learn to control on later transitions. Provided that the network has enough capacity, the control head π_θ in AuxPPO is not affected much by the irrelevant transitions. The performance of AuxPPO is therefore similar to that of DisPPO. To summarize, Figure 11 suggests that using γ_A = 1 simply encodes an inductive bias that all transitions are equally important. When this inductive bias is helpful for learning, γ_A = 1 implicitly implements auxiliary tasks, thus improving representation learning and the overall performance. When this inductive bias is detrimental, however, γ_A = 1 can lead to significant performance drops. AuxPPO appears to be a safe choice that does not depend much on the correctness of this inductive bias.

5. RELATED WORK

The mismatch in actor-critic algorithm implementations has been studied before. Thomas (2014) focuses on the natural policy gradient setting and shows that the biased implementation ignoring γ_A^t can be interpreted as the gradient of the average reward objective under the strong assumption that the state distribution is independent of the policy. Nota & Thomas (2020) prove that without this assumption, the biased implementation is not the gradient of any stationary objective. This does not contradict our auxiliary task perspective, as our objective J_{s,μ}(π) changes at every time step. Nota & Thomas (2020) further provide a counterexample showing that following the biased gradient can lead to a policy with poor performance w.r.t. both the discounted and undiscounted objectives. Both Thomas (2014) and Nota & Thomas (2020), however, focus on the theoretical disadvantages of the biased gradient and regard ignoring γ_A^t as the source of the bias. We instead regard the introduction of γ_C < 1 in the critic as the source of the bias in the undiscounted setting and investigate its empirical advantages, which are more relevant to practitioners. Moreover, our representation learning perspective on this mismatch is, to our knowledge, novel. Although we propose the bias-variance-representation trade-off, we do not claim that this is all the discount factor affects; it also has many other effects (e.g., Sutton, 1995). Bengio et al. (2020) study the effect of the bootstrapping parameter λ in TD(λ) on generalization. Our work studies the effect of the discount factor γ on representation learning in the context of the misuse of discounting in actor-critic algorithms, sharing a similar spirit with Bengio et al. (2020).

6. CONCLUSION

In this paper, we investigate the longstanding mismatch between theorists and practitioners in actor-critic algorithms from a representation learning perspective. Although the theoretical understanding of policy gradient algorithms has recently advanced significantly (Agarwal et al., 2019; Wu et al., 2020), this mismatch has drawn little attention. We hope our empirical study helps practitioners better understand actor-critic algorithms and design more efficient ones in the setting of deep RL, where representation learning emerges as a major consideration, and that it draws more attention to the mismatch itself, enabling the community to finally close this longstanding gap.

A PROOF OF LEMMA 2

Proof. The proof is based on Appendix B of Schulman et al. (2015a), where perturbation theory is used to prove the performance improvement bound (Lemma 1). To simplify notation, we use vectors and functions interchangeably, i.e., we also use r and μ_0 to denote the reward vector and the initial distribution vector. J(π) and d_π(s) are shorthand for J_γ(π) and d^γ_π(s) with γ = 1. All vectors are column vectors. Let S^+ be the set of states excluding s_∞, i.e., S^+ ≐ S \ {s_∞}. We define P_π ∈ R^{|S^+|×|S^+|} such that P_π(s, s′) ≐ Σ_a π(a|s) p(s′|s, a), and let G ≐ Σ_{t=0}^∞ P_π^t. By standard Markov chain theory, G(s, s′) is the expected number of times that s′ is visited before s_∞ is hit given S_0 = s. T_max < ∞ implies that G is well defined and G = (I − P_π)^{−1}. Moreover, T_max < ∞ also implies G(s, s′) ≤ T_max for all s, s′, i.e., ||G||_∞ ≤ T_max. We have J(π) = μ_0^T G r. Let G′ ≐ (I − P_{π′})^{−1} and Δ ≐ P_{π′} − P_π; then J(π′) − J(π) = μ_0^T (G′ − G) r and G′^{−1} − G^{−1} = −Δ. Left-multiplying by G′ and right-multiplying by G gives G − G′ = −G′ Δ G, i.e., G′ = G + G′ Δ G. Expanding G′ on the right-hand side once more yields G′ = G + G Δ G + G′ Δ G Δ G. So we have J(π′) − J(π) = μ_0^T G Δ G r + μ_0^T G′ Δ G Δ G r. It is easy to see that μ_0^T G = d_π^T and G r = v_π, so (using Σ_a π(a|s) Adv_π(s, a) = 0, by the Bellman equation) μ_0^T G Δ G r = d_π^T Δ v_π = Σ_s d_π(s) Σ_a π′(a|s) Adv_π(s, a). We now bound μ_0^T G′ Δ G Δ G r. First, |(Δ G r)(s)| = |Σ_{s′} Σ_a (π′(a|s) − π(a|s)) p(s′|s, a) v_π(s′)| = |Σ_a (π′(a|s) − π(a|s)) (r(s) + Σ_{s′} p(s′|s, a) v_π(s′) − v_π(s))| = |Σ_a (π′(a|s) − π(a|s)) Adv_π(s, a)| ≤ 2 max_s D_TV(π′(·|s), π(·|s)) max_{s,a} |Adv_π(s, a)|, where D_TV is the total variation distance. So ||Δ G r||_∞ ≤ 2 max_s D_TV(π′(·|s), π(·|s)) max_{s,a} |Adv_π(s, a)|. Moreover, for any vector x, |(Δ x)(s)| ≤ 2 max_s D_TV(π′(·|s), π(·|s)) ||x||_∞, hence ||Δ x||_∞ ≤ 2 max_s D_TV(π′(·|s), π(·|s)) ||x||_∞.
So ||Δ||_∞ ≤ 2 max_s D_TV(π′(·|s), π(·|s)), and
|μ_0^T G′ Δ G Δ G r| ≤ ||μ_0||_1 ||G′||_∞ ||Δ||_∞ ||G||_∞ ||Δ G r||_∞ ≤ 4 T_max^2 max_s D_TV^2(π′(·|s), π(·|s)) max_{s,a} |Adv_π(s, a)| ≤ 4 T_max^2 max_s D_KL(π(·|s) || π′(·|s)) max_{s,a} |Adv_π(s, a)|,
which completes the proof. Note this perturbation-based proof of Lemma 2 holds only for r : S → R. For r : S × A → R, one can turn to a coupling-based proof as in Schulman et al. (2015a), which, however, complicates the presentation and deviates from the main purpose of this paper. We therefore leave it for future work.
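As a sanity check on the algebra above, the second-order expansion G′ = G + GΔG + G′ΔGΔG can be verified numerically on random sub-stochastic matrices; this snippet is an illustration, not part of the proof:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_transient_kernel(n):
    """Random sub-stochastic transition matrix over the transient states S+:
    each row leaks some probability mass to the absorbing state s_inf, so
    (I - P) is invertible and G = (I - P)^{-1} is well defined."""
    return rng.dirichlet(np.ones(n + 1), size=n)[:, :n]

n = 5
P, P_prime = random_transient_kernel(n), random_transient_kernel(n)
G = np.linalg.inv(np.eye(n) - P)
G_prime = np.linalg.inv(np.eye(n) - P_prime)
Delta = P_prime - P

# Second-order expansion used in the proof: G' = G + G*Delta*G + G'*Delta*G*Delta*G.
expansion = G + G @ Delta @ G + G_prime @ Delta @ G @ Delta @ G
```

The identity is exact (not an approximation), since it follows from substituting G′ = G + G′ΔG into itself once, so the check should hold to machine precision.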

B EXPERIMENT DETAILS

B.1 METHODOLOGY

We use HalfCheetah, Walker2d, Hopper, Ant, Humanoid, and HumanoidStandup as our benchmarks. We exclude other tasks as we find PPO plateaus quickly there. The tasks we consider have a hard time limit of 1000 steps. Following Pardo et al. (2018), we add time step information to the state, i.e., there is an additional scalar t/1000 in the observation vector. Following Achiam (2018), we estimate the KL divergence between the current policy θ and the sampling policy θ_old when optimizing the loss (3). When the estimated KL divergence is greater than a threshold, we stop updating the actor and update only the critic with the current data. We use Adam (Kingma & Ba, 2014) as the optimizer and perform a grid search for the initial learning rates of the Adam optimizers. Let α_A and α_C := βα_A be the learning rates for the actor and critic respectively. For each algorithmic configuration (i.e., a curve in a figure), we tune α_A ∈ {0.125, 0.25, 0.5, 1, 2} × 3 × 10^{-4} and β ∈ {1, 3} with a grid search in Ant with 3 independent runs, maximizing the average return of the last 100 training episodes. In particular, α_A = 3 × 10^{-4} and β = 3 are roughly the default learning rates of the PPO implementation in Achiam (2018). We then run each algorithmic configuration with its best α_A and α_C in all tasks. Overall, we find that after removing GAE, smaller learning rates are preferred. When we use FHTD, we additionally consider H ∈ {16, 32, 64, 128, 256, 512, 1024} in the grid search. When we use C51, we additionally consider V_max ∈ {20, 40, 80, 160, 320, 640, 1280, 2560, 5120, 10240, 81920, 163840, 327680} in the grid search. We use PPO-TD with γ_C = 0.99 as an example to study how the best hyperparameter configuration in Ant transfers to other games. As shown in Figure 12, the best learning rates for Ant (α_A = 3 × 10^{-4} and β = 3) yield reasonably good performance in all the other games except Humanoid. In this paper, we do not draw a conclusion from a single task.
So an outlier is unlikely to affect the overall conclusion. In the discounted setting, we consider only Ant, HalfCheetah, and their variants. For Walker2d, Hopper, and Humanoid, we find the average episode length of all algorithms is smaller than t_0, i.e., the flipped reward rarely takes effect. For HumanoidStandup, the scale of the reward is too large. To summarize, the other four environments are not well-suited for the purpose of our empirical study. Moreover, in the discounted setting, we performed the grid search of the learning rates for both Ant and HalfCheetah.
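The KL-triggered early stopping described above can be sketched as follows; this is a minimal illustration in the spirit of Achiam (2018), with hypothetical function names (`update_actor`, `logp_fn`, etc.) rather than the paper's actual code:

```python
import numpy as np

def ppo_epoch(update_actor, update_critic, minibatches,
              logp_old_fn, logp_fn, kl_target=0.01):
    """One optimization epoch: always update the critic, but stop
    updating the actor once the estimated KL divergence between the
    sampling policy and the current policy exceeds kl_target."""
    actor_frozen = False
    for batch in minibatches:
        update_critic(batch)
        # KL(pi_old || pi) estimated as mean(log pi_old - log pi).
        approx_kl = np.mean(logp_old_fn(batch) - logp_fn(batch))
        if approx_kl >= kl_target:
            actor_frozen = True
        if not actor_frozen:
            update_actor(batch)
    return actor_frozen

# Toy usage: each "batch" here is just a number chosen so that the
# estimated KL equals that number.
calls = {"actor": 0, "critic": 0}
frozen = ppo_epoch(
    lambda b: calls.update(actor=calls["actor"] + 1),
    lambda b: calls.update(critic=calls["critic"] + 1),
    [0.001, 0.02, 0.001],
    logp_old_fn=lambda b: np.array([b]),
    logp_fn=lambda b: np.array([0.0]))
# The critic is updated on all 3 minibatches; the actor only on the
# first, since the KL estimate crosses the threshold on the second.
```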

B.2 ALGORITHM DETAILS

The pseudocode of all implemented algorithms is provided in Algorithms 1-7, with their architectures illustrated in Figure 13. For hyperparameters that are not included in the grid search, we use the same values as Dhariwal et al. (2017); Achiam (2018). In particular, we set the rollout length K = 2048, the number of optimization epochs K_opt = 320, the minibatch size B = 64, and the maximum KL divergence threshold KL_target = 0.01. We clip the probability ratio π_θ(a|s) / π_θold(a|s) into [1 - 0.2, 1 + 0.2]. We use N_s = 51 supports for PPO-C51. We use two-hidden-layer neural networks for function approximation. Each hidden layer has 64 hidden units and a tanh activation function. The output layer of the actor network has a tanh activation function and is interpreted as the mean of an isotropic Gaussian distribution, whose standard deviation is a global state-independent variable, as suggested by Schulman et al. (2015a).

Hypothesis 1 and the previous empirical study suggest that representation learning may be the main bottleneck of PPO-TD (γ_C = 1). To further support this, we benchmark PPO-C51 (γ_C = 1) (Algorithm 5 in the appendix), where the critic of PPO is trained with C51. C51 is usually considered to improve representation learning by implicitly providing auxiliary tasks (Bellemare et al., 2017; Munos, 2018; Petroski Such et al., 2019). Figure 14 shows that training the critic with C51 indeed leads to a performance improvement, and PPO-C51 (γ_C = 1) sometimes outperforms PPO-TD (γ_C < 1) by a large margin. Figure 15 further shows that when V_max is optimized for PPO-C51, the benefit of using γ_C < 1 in PPO-C51 is less pronounced than in PPO-TD, indicating that the roles of γ_C < 1 and distributional learning may overlap. Figures 6, 7, & 9 suggest that the overlap lies in representation learning.

Algorithm 1: PPO
  Input:
    θ, ψ: parameters of π, v
    α_A, α_C: Initial learning rates of the Adam optimizers for θ, ψ
    K, K_opt, B: rollout length, number of optimization epochs, and minibatch size
    KL_target: maximum KL divergence threshold
  S_0 ~ μ_0
  while True do
    Initialize a buffer M
    θ_old ← θ
    for i = 0, ..., K - 1 do
      A_i ~ π_θold(·|S_i)
      Execute A_i, get R_{i+1}, S_{i+1}
      if S_{i+1} is a terminal state then m_i ← 0, S_{i+1} ~ μ_0 else m_i ← 1 end
    end
    G_K ← v(S_K)
    for i = K - 1, ..., 0 do
      G_i ← R_{i+1} + γ_C m_i G_{i+1}
      Adv_i ← R_{i+1} + γ_C m_i v_ψ(S_{i+1}) - v_ψ(S_i)
      Store (S_i, A_i, G_i, Adv_i) in M
    end
    Normalize Adv_i in M as Adv_i ← (Adv_i - mean({Adv_i})) / std({Adv_i})
    for o = 1, ..., K_opt do
      Sample a minibatch {(S_i, A_i, G_i, Adv_i)}_{i=1,...,B} from M
      L(ψ) ← (1 / 2B) Σ_{i=1}^{B} (v_ψ(S_i) - G_i)^2  /* No gradient through G_i */
      L(θ) ← (1 / B) Σ_{i=1}^{B} min{ρ_i Adv_i, clip(ρ_i, 1 - 0.2, 1 + 0.2) Adv_i}, where ρ_i := π_θ(A_i|S_i) / π_θold(A_i|S_i)
      Perform one gradient update to ψ minimizing L(ψ) with Adam
      if (1 / B) Σ_{i=1}^{B} [log π_θold(A_i|S_i) - log π_θ(A_i|S_i)] < KL_target then
        Perform one gradient update to θ maximizing L(θ) with Adam
      end
    end
  end

Algorithm 3 (PPO-TD-Ex) differs in its rollout and critic target: at every S_i, besides the on-trajectory transition, N extra transitions are drawn from the oracle (p, r):

  for i = 0, ..., K - 1 do
    for j = 0, ..., N do
      A_i^j ~ π_θold(·|S_i), R_{i+1}^j ← r(S_i, A_i^j), S_{i+1}^j ~ p(·|S_i, A_i^j)
      if S_{i+1}^j is a terminal state then m_i^j ← 0, S_{i+1}^j ~ μ_0 else m_i^j ← 1 end
    end
    S_{i+1} ← S_{i+1}^0
  end
  for i = K - 1, ..., 0 do
    Adv_i ← R_{i+1}^0 + γ_C m_i^0 v_ψ(S_{i+1}^0) - v_ψ(S_i^0)
    Store ({S_i^j, A_i^j, m_i^j, R_{i+1}^j, S_{i+1}^j}_{j=0,...,N}, Adv_i) in M
  end
  Normalize Adv_i in M as Adv_i ← (Adv_i - mean({Adv_i})) / std({Adv_i})
  for o = 1, ..., K_opt do
    Sample a minibatch {({S_i^j, A_i^j, m_i^j, R_{i+1}^j, S_{i+1}^j}_{j=0,...,N}, Adv_i)}_{i=1,...,B} from M
    y_i ← (1 / (N + 1)) Σ_{j=0}^{N} [R_{i+1}^j + γ_C m_i^j v_ψ(S_{i+1}^j)]
    L(ψ) ← (1 / 2B) Σ_{i=1}^{B} (v_ψ(S_i^0) - y_i)^2  /* No gradient through y_i */
    L(θ) ← (1 / B) Σ_{i=1}^{B} min{ρ_i^0 Adv_i, clip(ρ_i^0, 1 - 0.2, 1 + 0.2) Adv_i}, where ρ_i^0 := π_θ(A_i^0|S_i^0) / π_θold(A_i^0|S_i^0)
    Perform one gradient update to ψ minimizing L(ψ) with Adam
    if (1 / B) Σ_{i=1}^{B} [log π_θold(A_i^0|S_i^0) - log π_θ(A_i^0|S_i^0)] < KL_target then
      Perform one gradient update to θ maximizing L(θ) with Adam
    end
  end

Algorithm 6 (DisPPO) maintains an auxiliary actor θ' and weights the two surrogate objectives by γ_C^{t_i} and 1 - γ_C^{t_i}, where t_i is the time step at which the transition was observed:

  S_0 ~ μ_0, t ← 0
  while True do
    Initialize a buffer M
    θ_old ← θ, θ' ← θ
    for i = 0, ..., K - 1 do
      A_i ~ π_θold(·|S_i), t_i ← t
      Execute A_i, get R_{i+1}, S_{i+1}
      if S_{i+1} is a terminal state then m_i ← 0, S_{i+1} ~ μ_0, t ← 0 else m_i ← 1, t ← t + 1 end
    end
    G_K ← v(S_K)
    for i = K - 1, ..., 0 do
      G_i ← R_{i+1} + γ_C m_i G_{i+1}
      Adv_i ← R_{i+1} + γ_C m_i v_ψ(S_{i+1}) - v_ψ(S_i)
      Store (S_i, A_i, G_i, Adv_i, t_i) in M
    end
    Normalize Adv_i in M as Adv_i ← (Adv_i - mean({Adv_i})) / std({Adv_i})
    for o = 1, ..., K_opt do
      Sample a minibatch {(S_i, A_i, G_i, Adv_i, t_i)}_{i=1,...,B} from M
      L(ψ) ← (1 / 2B) Σ_{i=1}^{B} (v_ψ(S_i) - G_i)^2  /* No gradient through G_i */
      L(θ, θ') ← (1 / B) Σ_{i=1}^{B} [γ_C^{t_i} min{ρ_i(θ) Adv_i, clip(ρ_i(θ), 1 - 0.2, 1 + 0.2) Adv_i} + (1 - γ_C^{t_i}) min{ρ_i(θ') Adv_i, clip(ρ_i(θ'), 1 - 0.2, 1 + 0.2) Adv_i}], where ρ_i(·) := π_(·)(A_i|S_i) / π_θold(A_i|S_i)
      Perform one gradient update to ψ minimizing L(ψ) with Adam
      if (1 / B) Σ_{i=1}^{B} [log π_θold(A_i|S_i) - log π_θ(A_i|S_i)] < KL_target then
        Perform one gradient update to θ and θ' maximizing L(θ, θ') with Adam
      end
    end
  end
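The clipped surrogate objective and advantage normalization appearing in the listings above can be written compactly. This NumPy sketch (our own naming) only computes the scalar losses; the actual implementations differentiate them with an autodiff framework:

```python
import numpy as np

def normalize_adv(adv):
    # Adv_i <- (Adv_i - mean({Adv_i})) / std({Adv_i}), as in the listings.
    return (adv - adv.mean()) / adv.std()

def ppo_losses(ratio, adv, v, returns, eps=0.2):
    """ratio: pi_theta(a|s) / pi_theta_old(a|s) for each sample.
    Returns (actor objective to maximize, critic loss to minimize)."""
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    # Pessimistic (min) combination of unclipped and clipped surrogates.
    actor_obj = np.mean(np.minimum(ratio * adv, clipped * adv))
    critic_loss = 0.5 * np.mean((v - returns) ** 2)
    return actor_obj, critic_loss

# Toy check: the second sample's ratio 0.5 with a negative advantage is
# clipped at 0.8, so the min picks the clipped (more pessimistic) term.
a_obj, c_loss = ppo_losses(ratio=np.array([1.5, 0.5]),
                           adv=np.array([1.0, -1.0]),
                           v=np.array([1.0, 1.0]),
                           returns=np.array([0.0, 2.0]))
```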

C.2 OTHER COMPLEMENTARY RESULTS

Figure 16 shows how PPO-TD-Ex (γ C = 0.995) reacts as N increases. Figure 17 shows the unnormalized representation error in the MRP experiment. Figure 18 shows the average episode length for the Ant environment in the discounted setting. For HalfCheetah, it is always 1000.



Footnotes:
1. Blackwell optimality states that, in finite MDPs, there exists a γ_0 < 1 such that for all γ ≥ γ_0, the optimal policies for the γ-discounted objective are the same.
2. Following Schulman et al. (2015a), we consider r : S → R instead of r : S × A → R for simplicity.
3. Sutton et al. (2000) do not explicitly define d^γ_π when γ = 1, which, however, can easily be deduced from Chapter 13.2 in Sutton & Barto (2018).
4. The trend that NRE decreases as α increases is merely an artifact of how we generate v_γ.
5. See Section B.1 for more details about task selection.



Figure 1: The default PPO implementation with different discount factors.

Figure 4: PPO-TD-Ex (γ C = 0.99).

Figure 6: PPO-FHTD with the first parameterization. The best H and γ C are used for each game.

Figure 8: A simple MRP.

Figure 9: Normalized representation error as a function of the discount factor. Shaded regions indicate one standard deviation.

Figure 10: Comparison between PPO and DisPPO with γ = 0.995

Figure 11: Curves without any marker are obtained in the original Ant / HalfCheetah. Diamond-marked curves are obtained in Ant / HalfCheetah with the flipped reward.

); Jiang et al. (2016); Laroche et al. (2017); Laroche & van Seijen (2018); Lehnert et al. (2018); Fedus et al. (2019); Van Seijen et al. (2019); Amit et al. (2020)), which we leave for future work. In Scenario 1, using γ C < 1 helps reduce the variance. Variance reduction in RL itself is an active research area (see, e.g., Papini et al. (2018); Xu et al. (2019); Yuan et al. (2020)). Investigating those variance reduction techniques with γ C = 1 is a possibility for future work. Recently, Bengio et al. (


Figure 12: PPO-TD (γ C = 0.99) with different learning rates. A curve labeled with (x, β) corresponds to initial learning rates for the actor and critic of α_A = x × 3 × 10^{-4} and α_C = βα_A respectively. The best learning rates for Ant (α_A = 3 × 10^{-4} and β = 3) yield reasonably good performance in all the other games except Humanoid.

      ... < KL_target then
        Perform one gradient update to θ maximizing L(θ) with Adam
      end
    end
  end

Algorithm 3: PPO-TD-Ex
  Input:
    θ, ψ: parameters of π, v
    α_A, α_C: Initial learning rates of the Adam optimizers for θ, ψ
    K, K_opt, B: rollout length, number of optimization epochs, and minibatch size
    KL_target: maximum KL divergence threshold
    N: number of extra transitions
    p, r: transition kernel and reward function of the oracle
  S_0 ~ μ_0
  while True do
    ...

Figure 13: Architectures of the algorithms

Figure 14: For PPO-C51, we set γ C = 1.

Figure 16: PPO-TD-Ex (γ C = 0.995).

Figure 17: Unnormalized representation error (RE) as a function of the discount factor. Shaded regions indicate one standard deviation. RE is computed analytically as RE(X, γ) := min_w ||Xw - v_γ||_2.
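The representation error in this caption is an ordinary least-squares residual, so it can be computed directly. A sketch, where `X` (the feature matrix) and `v` (the target value vector) are stand-ins for the quantities in the caption:

```python
import numpy as np

def representation_error(X, v):
    """RE = min_w ||X w - v||_2, computed via least squares."""
    w, *_ = np.linalg.lstsq(X, v, rcond=None)
    return np.linalg.norm(X @ w - v)

# Toy check: features span the first two coordinates only, so the best
# fit reproduces v there and the residual is the remaining component.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])
v = np.array([1.0, 2.0, 3.0])
re = representation_error(X, v)  # residual is the third coordinate, 3.0
```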

Figure 26: Comparison between PPO and DisPPO with γ = 0.995. The larger version of Figure 10.

Instead of the infinite-horizon discounted return G_t, De Asis et al. (2019) propose to consider the H-step return G_t^H.

    α_A, α_C: Initial learning rates of the Adam optimizers for θ, ψ
    K, K_opt, B: rollout length, number of optimization epochs, and minibatch size
    KL_target: maximum KL divergence threshold


Algorithm 2: PPO-TD
  Input:
    θ, ψ: parameters of π, v
    α_A, α_C: Initial learning rates of the Adam optimizers for θ, ψ
    K, K_opt, B: rollout length, number of optimization epochs, and minibatch size
    KL_target: maximum KL divergence threshold
  S_0 ~ μ_0
  while True do
    Initialize a buffer M
    θ_old ← θ
    for i = 0, ..., K - 1 do
      ...
      Store (S_i, A_i, m_i, R_{i+1}, S_{i+1}, Adv_i) in M
    end
    Normalize Adv_i in M as Adv_i ← (Adv_i - mean({Adv_i})) / std({Adv_i})
    ...
    Perform one gradient update to θ maximizing L(θ) with Adam
  end

Algorithm 4: PPO-FHTD
  Input:
    θ, ψ: parameters of π, {v_j}_{j=1,...,H}
    α_A, α_C: Initial learning rates of the Adam optimizers for θ, ψ
    K, K_opt, B: rollout length, number of optimization epochs, and minibatch size
    KL_target: maximum KL divergence threshold
  ...
  Perform one gradient update to θ maximizing L(θ) with Adam

Algorithm 5: PPO-C51
  Input:
    θ, ψ: parameters of π, {v_j}_{j=1,...,N_s}, with N_s being the number of supports and v_j being the probability of each support
    α_A, α_C: Initial learning rates of the Adam optimizers for θ, ψ
    K, K_opt, B: rollout length, number of optimization epochs, and minibatch size
    KL_target: maximum KL divergence threshold
  ...
  Perform one gradient update to θ maximizing L(θ) with Adam

Algorithm 6: DisPPO
  Input:
    θ, ψ: parameters of π, v
    α_A, α_C: Initial learning rates of the Adam optimizers for θ, ψ
    K, K_opt, B: rollout length, number of optimization epochs, and minibatch size
    KL_target: maximum KL divergence threshold
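To illustrate the critic of PPO-FHTD: fixed-horizon TD maintains H value heads, and the j-step head bootstraps from the (j − 1)-step head at the next state (De Asis et al., 2019). A tabular sketch with our own naming:

```python
import numpy as np

def fhtd_targets(v, r, s_next, m, gamma_c):
    """v: array of shape (H, num_states); v[j] estimates the (j+1)-step
    fixed-horizon return. r, s_next, m: rewards, next-state indices, and
    continuation masks (0 at terminals) for a batch of transitions.
    Returns TD targets of shape (H, batch)."""
    H = v.shape[0]
    y = np.empty((H, len(r)))
    y[0] = r  # the 1-step head bootstraps from v_0 := 0
    for j in range(1, H):
        y[j] = r + gamma_c * m * v[j - 1, s_next]
    return y

# Toy check: H = 2 heads over 3 states, one transition into state 1
# with reward 0.5 and gamma_C = 0.9. The 2-step target bootstraps from
# the 1-step head's value at the next state.
v = np.zeros((2, 3))
v[0] = np.array([1.0, 2.0, 3.0])
y = fhtd_targets(v, r=np.array([0.5]), s_next=np.array([1]),
                 m=np.array([1.0]), gamma_c=0.9)
```

Unlike a single discounted head, each head's target depends only on a shorter-horizon head, so the bootstrap never references the quantity being learned, which is the property that motivates FHTD in the first place.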

