A DEEPER LOOK AT DISCOUNTING MISMATCH IN ACTOR-CRITIC ALGORITHMS

Abstract

We investigate the discounting mismatch in actor-critic algorithm implementations from a representation learning perspective. Theoretically, actor-critic algorithms usually have discounting for both the actor and the critic, i.e., there is a γ^t term in the actor update for the transition observed at time t in a trajectory, and the critic estimates a discounted value function. Practitioners, however, usually ignore the discounting (γ^t) for the actor while still using a discounted critic. We investigate this mismatch in two scenarios. In the first scenario, we consider optimizing an undiscounted objective (γ = 1), where the γ^t term disappears naturally (1^t = 1). We propose to interpret the discounting in the critic in terms of a bias-variance-representation trade-off and provide supporting empirical results. In the second scenario, we consider optimizing a discounted objective (γ < 1) and propose to interpret the omission of the discounting in the actor update from an auxiliary task perspective, again providing supporting empirical results.

1. INTRODUCTION

Actor-critic algorithms have enjoyed great success both theoretically (Williams, 1992; Sutton et al., 2000; Konda, 2002; Schulman et al., 2015a) and empirically (Mnih et al., 2016; Silver et al., 2016; Schulman et al., 2017; OpenAI, 2018). There is, however, a longstanding gap between the theory behind actor-critic algorithms and how practitioners implement them. Let γ, γ_A, and γ_C be the discount factors for defining the objective, updating the actor, and updating the critic, respectively. Theoretically, no matter whether γ = 1 or γ < 1, we should always use γ_A = γ_C = γ (Sutton et al., 2000; Schulman et al., 2015a), or at least keep γ_A = γ_C if Blackwell optimality (Veinott, 1969; Weitzman, 2001)foot_0 is considered. Practitioners, however, usually use γ_A = 1 and γ_C < 1 in their implementations (Dhariwal et al., 2017; Caspi et al., 2017; Zhang, 2018; Kostrikov, 2018; Achiam, 2018; Liang et al., 2018; Stooke & Abbeel, 2019). Although this mismatch and its theoretical disadvantage have been recognized by Thomas (2014) and Nota & Thomas (2020), whether and why it yields benefits in practice has not been systematically studied. In this paper, we empirically investigate this mismatch from a representation learning perspective. We consider two scenarios separately.

Scenario 1: The true objective is undiscounted (γ = 1). The theory prescribes γ_A = γ_C = γ = 1. Practitioners, however, usually use γ_A = γ = 1 but γ_C < 1, introducing bias. We explain this mismatch with the following hypothesis:

Hypothesis 1. γ_C < 1 optimizes a bias-variance-representation trade-off.

It is easy to see that γ_C < 1 reduces the variance in bootstrapping targets. Beyond this, we provide empirical evidence showing that when γ_C < 1, it may become easier to find a good representation than when γ_C = 1. Consequently, although using γ_C < 1 introduces bias, it can facilitate representation learning.
For our empirical study, we make use of recently introduced techniques, such as fixed-horizon temporal difference learning (De Asis et al., 2019) and distributional reinforcement learning (Bellemare et al., 2017), to disentangle the various effects the discount factor has on the learning process.

Scenario 2: The true objective is discounted (γ < 1). Theoretically, there is a γ^t term in the actor update for a transition observed at time t in a trajectory (Sutton et al., 2000; Schulman et al., 2015a). Practitioners, however, usually ignore this term while using a discounted critic, i.e., γ_A = 1 and γ_C = γ < 1 are used. We explain this mismatch with the following hypothesis:

Hypothesis 2. Using γ_C = γ < 1 and γ_A = 1 is effectively similar to using γ_C = γ_A = γ < 1 plus an auxiliary loss that sometimes facilitates representation learning.

Our empirical study implements this auxiliary task explicitly by using an additional policy to optimize the difference between the loss with γ_A = 1 and the loss with γ_A < 1. We also design new benchmarking environments in which the sign of the reward function flips after a certain time step, so that later transitions differ from earlier ones. In that setting, γ_A = 1 becomes harmful.
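The variance claim in Hypothesis 1 can be checked directly: for i.i.d. rewards, the variance of the discounted Monte Carlo return Σ_t γ^t R_{t+1} shrinks as γ decreases. The following minimal simulation is illustrative only; the horizon, reward distribution, and episode count are arbitrary choices, not taken from the paper.

```python
import numpy as np

def return_std(gamma, horizon=200, episodes=10_000, seed=0):
    """Sample standard deviation of the discounted return G_0
    for i.i.d. N(0, 1) rewards over a fixed horizon."""
    rng = np.random.default_rng(seed)
    rewards = rng.normal(size=(episodes, horizon))  # R_1, ..., R_H per episode
    discounts = gamma ** np.arange(horizon)         # gamma^0, gamma^1, ...
    returns = rewards @ discounts                   # G_0 for each episode
    return returns.std()

# Smaller gamma_C yields lower-variance targets for the critic (at the cost of bias).
for gamma in (1.0, 0.99, 0.9):
    print(f"gamma={gamma}: std of G_0 = {return_std(gamma):.2f}")
```

For i.i.d. unit-variance rewards, the variance of G_0 is Σ_t γ^{2t}, so the printed values decrease monotonically as γ drops from 1.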

2. BACKGROUND

The state value function of a policy π is defined as v^γ_π(s) .= E[G_t | S_t = s], and the state-action value function as q^γ_π(s, a) .= E[G_t | S_t = s, A_t = a]. We consider episodic tasks, where we assume there is an absorbing state s_∞ ∈ S such that r(s_∞) = 0 and p(s_∞ | s_∞, a) = 1 for any a ∈ A. When γ < 1, v^γ_π and q^γ_π are always well defined. When γ = 1, to ensure v^γ_π and q^γ_π are well defined, we further assume a finite expected episode length. Let T^π_s be a random variable denoting the first time step at which an agent following π hits s_∞ given S_0 = s. We assume T_max .= sup_{π∈Π} max_s E[T^π_s] < ∞, where π is parameterized by θ and Π is the corresponding function class. Similar assumptions are also used in stochastic shortest path problems (e.g., Section 2.2 of Bertsekas & Tsitsiklis (1996)). In our experiments, all environments have a hard time limit of 1000 steps, i.e., T_max = 1000. This is standard practice; classic RL environments also impose an upper limit on episode length (e.g., 27k in Bellemare et al. (2013, ALE)). Following Pardo et al. (2018), we add the (normalized) time step t to the state to keep the environment Markovian. We measure the performance of a policy π with J_γ(π) .= E_{S_0∼µ_0}[v^γ_π(S_0)].

Vanilla Policy Gradient: Sutton et al. (2000) compute ∇_θ J_γ(π) as

∇_θ J_γ(π) = Σ_s d^γ_π(s) Σ_a q^γ_π(s, a) ∇_θ π(a|s),    (1)

where d^γ_π(s) .= Σ_{t=0}^∞ γ^t Pr(S_t = s | µ_0, p, π) for γ < 1 and d^γ_π(s) .= E[Σ_{t=0}^{T^π_{S_0}} Pr(S_t = s | S_0, p, π)] for γ = 1.foot_3 Note that d^γ_π remains well defined for γ = 1 when T_max < ∞. To optimize the policy performance J_γ(π), one can follow (1) and, at time step t, update θ_t as

θ_{t+1} ← θ_t + α γ_A^t q^{γ_C}_π(S_t, A_t) ∇_θ log π(A_t | S_t),    (2)

where α is a learning rate. If we replace q^{γ_C}_π with a learned value function, the update rule (2) becomes an actor-critic algorithm, where the actor refers to π and the critic refers to the learned approximation of q^{γ_C}_π. In practice, an estimate of v^{γ_C}_π instead of q^{γ_C}_π is usually learned.
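Update (2) and the common implementation shortcut can be sketched as a single gradient step; this is a schematic illustration, with `grad_log_pi` and `q_value` standing in for the score function and the learned critic (both hypothetical placeholders, not from the paper):

```python
import numpy as np

def actor_step(theta, grad_log_pi, q_value, t, alpha, gamma_A):
    """One actor update on the transition at time t, following (2):
    theta <- theta + alpha * gamma_A**t * q(S_t, A_t) * grad log pi(A_t|S_t).
    With gamma_A = 1 (the common practice), the gamma_A**t factor vanishes
    and every transition in the trajectory is weighted equally."""
    return theta + alpha * (gamma_A ** t) * q_value * grad_log_pi

theta = np.zeros(4)
g = np.array([0.5, -0.2, 0.1, 0.0])  # hypothetical grad log pi(A_t|S_t)
# Theory (gamma_A = gamma < 1): a late transition is heavily down-weighted.
print(actor_step(theta, g, q_value=1.0, t=100, alpha=0.1, gamma_A=0.99))
# Practice (gamma_A = 1): the same transition gets full weight.
print(actor_step(theta, g, q_value=1.0, t=100, alpha=0.1, gamma_A=1.0))
```

At t = 100 with γ_A = 0.99, the update is scaled by 0.99^100 ≈ 0.37, which illustrates why dropping the γ_A^t term changes the effective weighting of late transitions.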
Theoretically, we should have γ_A = γ_C = γ. Practitioners, however, usually ignore the γ_A^t term in (2) and use γ_C < γ_A = 1. What this update truly optimizes remains an open problem (Nota & Thomas, 2020).

foot_0: Blackwell optimality states that, in finite MDPs, there exists a γ_0 < 1 such that for all γ ≥ γ_0, the optimal policies for the γ-discounted objective are the same.
foot_2: Following Schulman et al. (2015a), we consider r : S → R instead of r : S × A → R for simplicity.
foot_3: Sutton et al. (2000) do not explicitly define d^γ_π when γ = 1; it can, however, be easily deduced from Chapter 13.2 of Sutton & Barto (2018).

TRPO and PPO: To improve the stability of actor-critic algorithms, Schulman et al. (2015a) propose Trust Region Policy Optimization (TRPO), based on the following performance improvement lemma:

Lemma 1. (Theorem 1 in Schulman et al. (2015a)) For γ < 1 and any two policies π and π',

J_γ(π') ≥ J_γ(π) + Σ_s d^γ_π(s) Σ_a π'(a|s) Adv^γ_π(s, a) − 4 max_{s,a} |Adv^γ_π(s, a)| γ D^max_TV(π, π')² / (1 − γ)²,

where Adv^γ_π(s, a) .= q^γ_π(s, a) − v^γ_π(s) is the advantage and D^max_TV(π, π') denotes the maximum, over states, of the total variation distance between π(·|s) and π'(·|s).
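The penalty term in a TRPO-style bound blows up as γ → 1, which is one reason such lemmas require γ < 1. A small sketch of this sensitivity, using the penalty form 4εγα²/(1 − γ)² from Theorem 1 of Schulman et al. (2015a), with ε = max_{s,a} |Adv^γ_π(s, a)| and α the maximum total variation distance (the numeric inputs are illustrative):

```python
def trpo_penalty(max_abs_adv, gamma, d_tv_max):
    """Penalty term 4 * eps * gamma * alpha**2 / (1 - gamma)**2 from the
    TRPO-style performance improvement bound, where eps = max |Adv| and
    alpha = maximum total variation distance between the two policies."""
    return 4.0 * max_abs_adv * gamma * d_tv_max ** 2 / (1.0 - gamma) ** 2

# The bound degrades rapidly as gamma approaches 1.
for gamma in (0.9, 0.99, 0.999):
    print(f"gamma={gamma}: penalty = {trpo_penalty(1.0, gamma, 0.1):.1f}")
```

With ε = 1 and α = 0.1, moving γ from 0.9 to 0.99 inflates the penalty by roughly two orders of magnitude, because of the (1 − γ)² denominator.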

Table: Roles of the different discount factors.

Markov Decision Processes: We consider an infinite-horizon MDP with a finite state space S, a finite action space A, a bounded reward function r : S → R, a transition kernel p : S × S × A → [0, 1], an initial state distribution µ_0, and a discount factor γ ∈ [0, 1].foot_2 The initial state S_0 is sampled from µ_0. At time step t, an agent in state S_t takes action A_t ∼ π(·|S_t), where π : A × S → [0, 1] is the policy it follows. The agent then receives a reward R_{t+1} .= r(S_t) and proceeds to the next state S_{t+1} ∼ p(·|S_t, A_t). The return of the policy π at time step t is defined as G_t .= Σ_{i=0}^∞ γ^i R_{t+i+1}.
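The return G_t can be computed from a sampled reward sequence via the standard backward recursion G_t = R_{t+1} + γ G_{t+1}; the helper below is an illustrative sketch, not code from the paper:

```python
def discounted_returns(rewards, gamma):
    """Compute G_t = sum_{i>=0} gamma**i * R_{t+i+1} for every t,
    backward in O(T). rewards[t] is R_{t+1}, received after leaving S_t."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g  # G_t = R_{t+1} + gamma * G_{t+1}
        returns[t] = g
    return returns

print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```

For episodic tasks with an absorbing state s_∞ and r(s_∞) = 0, truncating the sum at episode termination is exact, which is what the finite reward list encodes here.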

