ON THE PERFORMANCE OF TEMPORAL DIFFERENCE LEARNING WITH NEURAL NETWORKS

Abstract

Neural Temporal Difference (TD) Learning is an approximate temporal difference method for policy evaluation that uses a neural network for function approximation. Analysis of Neural TD Learning has proven to be challenging. In this paper we provide a convergence analysis of Neural TD Learning with a projection onto B(θ_0, ω), a ball of fixed radius ω around the initial point θ_0. We show an approximation bound of O(ϵ + 1/√m), where ϵ is the approximation quality of the best neural network in B(θ_0, ω) and m is the width of all hidden layers in the network.

1. INTRODUCTION

Temporal difference (TD) learning is considered to be a major milestone of reinforcement learning (RL). Proposed by Sutton (1988), TD learning uses the Bellman error, the difference between an agent's value predictions in a Markov Decision Process (MDP) and what it actually observes, to drive the process of learning an estimate of the value of every state. To deal with large state spaces, TD learning with linear function approximation was introduced in Tesauro (1995). A mathematical analysis was given in Tsitsiklis & Van Roy (1996), which shows that the process converges under appropriate assumptions on the step sizes and the sampling procedure. However, with nonlinear function approximation, TD learning is not guaranteed to converge, as observed in Tsitsiklis & Van Roy (1996) (see also Achiam et al. (2019) for a more recent treatment). Nevertheless, TD with neural network approximation, referred to as Neural TD, is used in practice despite the lack of strong theoretical guarantees.

To our knowledge, rigorous analysis has been addressed only in the three papers Cai et al. (2019); Xu & Gu (2020); Cayci et al. (2021). In Cai et al. (2019), a single-hidden-layer neural architecture was considered along with a projection onto a ball around the initial condition; approximate convergence to an approximate stationary point of a certain function related to the linearization around the initial point was proved. This result was generalized to multiple hidden layers in Xu & Gu (2020), but the generalization required projection onto a ball of radius ω ∼ m^{-1/2} around the initial point, where m is the width of the hidden layers. Because the radius of this projection goes to zero with m, it effectively fixes the neural network to a small distance from its initial condition. Both Cai et al. (2019) and Xu & Gu (2020) additionally required certain regularity conditions on the policy. Finally, Cayci et al. (2021) gave a convergence result for a single hidden layer, also with a projection onto a ball of radius ω ∼ m^{-1/2} around the initial point, but with the final objective being the representation error of the neural approximation, without any kind of linearization. This result also required a condition on the representability of the value function of the policy in terms of features from the random initialization.

In this paper, we analyze Neural TD with a projection onto B(θ_0, ω), a ball of fixed radius ω around the initial point θ_0. We show an approximation bound of O(ϵ + 1/√m), where ϵ is the approximation quality of the best neural network in B(θ_0, ω). Our result improves on previous work in that it does not require the radius ω to decay with m, makes no regularity or representability assumptions on the policy, applies to any number of hidden layers, and bounds the error of the neural approximation without any linearization around the initial condition.
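To make the setting concrete, the following sketch performs one semi-gradient TD(0) step on a one-hidden-layer value network and then projects the parameters back onto B(θ_0, ω), the ball of fixed radius ω around the initialization. All specifics here (ReLU activations, the random transition sample, the width m, and the step size) are illustrative assumptions for exposition, not the paper's exact construction:

```python
# Illustrative sketch of projected Neural TD(0); hyperparameters are
# arbitrary choices, not taken from the paper.
import numpy as np

rng = np.random.default_rng(0)

m, d = 64, 4           # hidden width m, state dimension d (assumed)
gamma, alpha = 0.9, 0.01
omega = 5.0            # fixed projection radius; does NOT shrink with m

# One-hidden-layer value network V(s; theta) with ReLU activations.
W1 = rng.normal(scale=1.0 / np.sqrt(m), size=(m, d))
w2 = rng.normal(scale=1.0 / np.sqrt(m), size=m)

def pack(W1, w2):
    """Flatten all parameters into a single vector theta."""
    return np.concatenate([W1.ravel(), w2])

theta0 = pack(W1, w2)  # the initial point theta_0

def value(s, W1, w2):
    return w2 @ np.maximum(W1 @ s, 0.0)

def grad(s, W1, w2):
    """Gradient of V(s; theta) with respect to (W1, w2)."""
    h = W1 @ s
    mask = (h > 0).astype(float)
    gW1 = np.outer(w2 * mask, s)        # dV/dW1
    gw2 = np.maximum(h, 0.0)            # dV/dw2
    return gW1, gw2

def project(W1, w2, theta0, omega):
    """Euclidean projection of theta onto B(theta_0, omega)."""
    theta = pack(W1, w2)
    diff = theta - theta0
    norm = np.linalg.norm(diff)
    if norm > omega:
        theta = theta0 + (omega / norm) * diff
    return theta[: m * d].reshape(m, d), theta[m * d:]

# One projected semi-gradient TD(0) step on a sampled transition (s, r, s').
s, s_next, r = rng.normal(size=d), rng.normal(size=d), 1.0

delta = r + gamma * value(s_next, W1, w2) - value(s, W1, w2)  # TD error
gW1, gw2 = grad(s, W1, w2)
W1 = W1 + alpha * delta * gW1
w2 = w2 + alpha * delta * gw2
W1, w2 = project(W1, w2, theta0, omega)

# The iterate never leaves the fixed-radius ball around theta_0.
assert np.linalg.norm(pack(W1, w2) - theta0) <= omega + 1e-9
```

The projection step is the operation whose radius ω the prior works above take to shrink as m^{-1/2}, and which this paper instead keeps fixed.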

