A DEEPER LOOK AT DISCOUNTING MISMATCH IN ACTOR-CRITIC ALGORITHMS

Abstract

We investigate the discounting mismatch in actor-critic algorithm implementations from a representation learning perspective. Theoretically, actor-critic algorithms discount both the actor and the critic: there is a γ^t term in the actor update for the transition observed at time t in a trajectory, and the critic estimates a discounted value function. Practitioners, however, usually ignore the discounting (γ^t) for the actor while still using a discounted critic. We investigate this mismatch in two scenarios. In the first scenario, we consider optimizing an undiscounted objective (γ = 1), where the γ^t term disappears naturally (1^t = 1). We propose to interpret the discounting in the critic in terms of a bias-variance-representation trade-off and provide supporting empirical results. In the second scenario, we consider optimizing a discounted objective (γ < 1) and propose to interpret the omission of the discounting in the actor update from an auxiliary task perspective, again with supporting empirical results.

1. INTRODUCTION

Actor-critic algorithms have enjoyed great success both theoretically (Williams, 1992; Sutton et al., 2000; Konda, 2002; Schulman et al., 2015a) and empirically (Mnih et al., 2016; Silver et al., 2016; Schulman et al., 2017; OpenAI, 2018). There is, however, a longstanding gap between the theory behind actor-critic algorithms and how practitioners implement them. Let γ, γ_A, and γ_C be the discount factors for defining the objective, updating the actor, and updating the critic, respectively. Theoretically, no matter whether γ = 1 or γ < 1, we should always use γ_A = γ_C = γ (Sutton et al., 2000; Schulman et al., 2015a), or at least keep γ_A = γ_C if Blackwell optimality (Veinott, 1969; Weitzman, 2001) [1] is considered. Practitioners, however, usually use γ_A = 1 and γ_C < 1 in their implementations (Dhariwal et al., 2017; Caspi et al., 2017; Zhang, 2018; Kostrikov, 2018; Achiam, 2018; Liang et al., 2018; Stooke & Abbeel, 2019). Although this mismatch and its theoretical disadvantage have been recognized by Thomas (2014); Nota & Thomas (2020), whether and why it yields benefits in practice has not been systematically studied. In this paper, we empirically investigate this mismatch from a representation learning perspective. We consider two scenarios separately.

Scenario 1: The true objective is undiscounted (γ = 1). The theory prescribes using γ_A = γ_C = γ = 1. Practitioners, however, usually use γ_A = γ = 1 but γ_C < 1, introducing bias. We explain this mismatch with the following hypothesis:

Hypothesis 1. Using γ_C < 1 optimizes a bias-variance-representation trade-off.

It is easy to see that γ_C < 1 reduces the variance of bootstrapping targets. Beyond this, we provide empirical evidence that when γ_C < 1, it may become easier to find a good representation than when γ_C = 1. Consequently, although using γ_C < 1 introduces bias, it can facilitate representation learning.
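As a minimal sketch of the mismatch (the rollout data and variable names below are illustrative assumptions, not from the paper), the theoretically prescribed actor update weights the transition observed at time t by γ^t, whereas common implementations with γ_A = 1 drop that weight entirely:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical rollout of length T: per-step advantages A_t and
# per-step score-function gradients ∇θ log π(a_t | s_t) in R^d.
T, d = 100, 4
advantages = rng.normal(size=T)
logprob_grads = rng.normal(size=(T, d))

gamma = 0.99

# Theoretically prescribed actor gradient (Sutton et al., 2000):
# the transition at time t is weighted by γ^t.
weights_theory = gamma ** np.arange(T)
grad_theory = (weights_theory[:, None] * advantages[:, None] * logprob_grads).sum(axis=0)

# Common practice (γ_A = 1): the γ^t weighting is omitted, so late
# transitions contribute as much to the gradient as early ones.
grad_practice = (advantages[:, None] * logprob_grads).sum(axis=0)
```

When γ = 1 the two estimators coincide (1^t = 1), which is why the mismatch is invisible in the undiscounted scenario and only matters when γ < 1.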
For our empirical study, we make use of recently introduced techniques, such as fixed-horizon temporal difference learning (De Asis et al., 2019) and distributional reinforcement learning (Bellemare et al., 2017), to disentangle the various effects the discount factor has on the learning process.

Scenario 2: The true objective is discounted (γ < 1). Theoretically, there is a γ^t term for the actor update on a transition observed at time t in a trajectory (Sutton et al., 2000; Schulman



[1] Blackwell optimality states that, in finite MDPs, there exists a γ_0 < 1 such that for all γ ≥ γ_0, the optimal policies for the γ-discounted objective are the same.
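The footnote's statement can be checked on a toy example (a hypothetical two-policy MDP, not from the paper): policy A earns reward 1 immediately, policy B earns reward 2 one step later, so their discounted returns are 1 and 2γ. The γ-optimal policy is then the same for every γ above the Blackwell threshold γ_0 = 0.5:

```python
import numpy as np

def optimal_policy(gamma):
    # Discounted return of A is 1; of B is 2*gamma. Pick the larger.
    return "A" if 1.0 > 2.0 * gamma else "B"

# Sweep γ over [0, 0.99]: the optimal policy switches once, at γ_0 = 0.5,
# and stays fixed ("B") for all γ above that threshold.
gammas = np.linspace(0.0, 0.99, 100)
choices = [optimal_policy(g) for g in gammas]
```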

