CORRECTING MOMENTUM IN TEMPORAL DIFFERENCE LEARNING

Anonymous authors
Paper under double-blind review

Abstract

A common optimization tool used in deep reinforcement learning is momentum, which consists of accumulating and discounting past gradients and reapplying them at each iteration. We argue that, unlike in supervised learning, momentum in Temporal Difference (TD) learning accumulates gradients that become doubly stale: not only does the gradient of the loss change due to parameter updates, the loss itself changes due to bootstrapping. We first show that this phenomenon exists, then propose a first-order correction term to momentum. We show that this correction term improves sample efficiency in policy evaluation by correcting target value drift. An important insight of this work is that deep RL methods are not always best served by directly importing techniques from the supervised setting.

1. INTRODUCTION

Temporal Difference (TD) learning (Sutton, 1988) is a fundamental method of Reinforcement Learning (RL), which works by using estimates of future predictions to learn to make predictions about the present. To scale to large problems, TD has been used in conjunction with deep neural networks (DNNs) to achieve impressive performance (e.g. Mnih et al., 2013; Schulman et al., 2017; Hessel et al., 2018). Unfortunately, the naive application of TD to DNNs has been shown to be brittle (Machado et al., 2018; Farebrother et al., 2018; Packer et al., 2018; Witty et al., 2018), with extensions such as the n-step TD update (Fedus et al., 2020) or the TD(λ) update (Molchanov et al., 2016) only marginally improving performance and generalization capabilities when coupled with DNNs.

Part of the success of DNNs, including when applied to TD learning, is the use of adaptive or accelerated optimization methods (Hinton et al., 2012; Sutskever et al., 2013; Kingma & Ba, 2015) to find good parameters. In this work we investigate and extend the momentum algorithm (Polyak, 1964) as applied to TD learning in DNNs. While accelerated TD methods have received some attention in the literature, this is typically done in the context of linear function approximators (Baxter & Bartlett, 2001; Meyer et al., 2014; Pan et al., 2016; Gupta et al., 2019; Gupta, 2020; Sun et al., 2020), and while some studies have considered the mix of DNNs and TD (Zhang et al., 2019; Romoff et al., 2020), many are limited to a high-level analysis of hyperparameter choices for existing optimization methods (Sarigül & Avci, 2018; Andrychowicz et al., 2020); or indeed the latter are simply applied as-is to train RL agents (Mnih et al., 2013; Hessel et al., 2018). For an extended discussion of related work, we refer the reader to appendix A. As a first step in going beyond the naive use of supervised learning tools in RL, we examine momentum.
We argue that momentum, especially as it is used in conjunction with TD and DNNs, adds an additional form of bias which can be understood as the staleness of accumulated information. We quantify this bias, and propose a corrected momentum algorithm that reduces this staleness and is capable of improving performance.
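To make the staleness argument concrete, the following is a minimal sketch of classical (Polyak) momentum as it is typically applied in deep learning. The function name and hyperparameter values are illustrative, not taken from the paper: the velocity accumulates an exponentially discounted sum of past gradients, each of which was computed against parameters (and, in TD learning, against bootstrapped targets) that have since changed.

```python
import numpy as np

def momentum_step(theta, velocity, grad, lr=0.01, beta=0.9):
    """One step of classical (Polyak) momentum.

    velocity holds a discounted sum of past gradients; in TD learning
    each of those gradients was taken w.r.t. an already-outdated loss,
    since the bootstrapped target moves with the parameters.
    """
    velocity = beta * velocity + grad
    theta = theta - lr * velocity
    return theta, velocity

# Two steps with a constant gradient: the second update is larger
# because the velocity still carries the first (now stale) gradient.
theta, velocity = np.zeros(2), np.zeros(2)
grad = np.ones(2)
theta, velocity = momentum_step(theta, velocity, grad)
theta, velocity = momentum_step(theta, velocity, grad)
```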

1.1. REINFORCEMENT LEARNING AND TEMPORAL DIFFERENCE LEARNING

A Markov Decision Process (MDP) (Bellman, 1957; Sutton & Barto, 2018) M = ⟨S, A, R, P, γ⟩ consists of a state space S, an action space A, a reward function R : S → R and a transition function P(s′|s, a). RL agents usually aim to optimize the expectation of the long-term return, G(S_t) = Σ_{k=t}^∞ γ^{k−t} R(S_k), where γ ∈ [0, 1) is called the discount factor. Policies π(a|s) map states to action distributions. Value functions V^π and Q^π map states and state-action pairs to expected returns.
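The return G(S_t) defined above can be computed over a finite trajectory with a single backward pass, using the recursion G_t = R(S_t) + γ G_{t+1}. This is a standard sketch, not code from the paper; the function name is illustrative.

```python
def discounted_return(rewards, gamma=0.99):
    """Compute G(S_0) = sum_k gamma^k * R(S_k) for a finite reward
    sequence, via the backward recursion G_t = r_t + gamma * G_{t+1}."""
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
    return G

# Example: rewards [1, 1, 1] with gamma = 0.5
# G = 1 + 0.5 * (1 + 0.5 * 1) = 1.75
g = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```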

