ON THE PERFORMANCE OF TEMPORAL DIFFERENCE LEARNING WITH NEURAL NETWORKS

Abstract

Neural Temporal Difference (TD) Learning is an approximate temporal difference method for policy evaluation that uses a neural network for function approximation. Analysis of Neural TD Learning has proven to be challenging. In this paper we provide a convergence analysis of Neural TD Learning with a projection onto $B(\theta_0, \omega)$, a ball of fixed radius $\omega$ around the initial point $\theta_0$. We show an approximation bound of $O(\epsilon + 1/\sqrt{m})$ where $\epsilon$ is the approximation quality of the best neural network in $B(\theta_0, \omega)$ and $m$ is the width of all hidden layers in the network.

1. INTRODUCTION

Temporal difference (TD) learning is considered to be a major milestone of reinforcement learning (RL). Proposed by Sutton (1988), TD Learning uses the Bellman error, which is the difference between an agent's predictions in a Markov Decision Process (MDP) and what it actually observes, to drive the process of learning an estimate of the value of every state. To deal with large state-spaces, TD learning with linear function approximation was introduced in Tesauro (1995). A mathematical analysis was given in Tsitsiklis & Van Roy (1996), which shows that the process converges under appropriate assumptions on the step-size and sampling procedure. However, with nonlinear function approximation, TD Learning is not guaranteed to converge, as observed in Tsitsiklis & Van Roy (1996) (see also Achiam et al. (2019) for a more recent treatment). Nevertheless, TD with neural network approximation, referred to as Neural TD, is used in practice despite the lack of strong theoretical guarantees. To our knowledge, rigorous analysis has only been given in three papers: Cai et al. (2019); Xu & Gu (2020); Cayci et al. (2021). In Cai et al. (2019), a single hidden layer neural architecture was considered along with projection onto a ball around the initial condition; approximate convergence to an approximate stationary point of a certain function related to the linearization around the initial point was proved. This result was generalized to multiple hidden layers in Xu & Gu (2020), but this generalization required projection onto a ball of radius $\omega \sim m^{-1/2}$ around the initial point, where $m$ is the width of the hidden layers. Because the radius of this projection goes to zero with $m$, this effectively fixes the neural network to a small distance from its initial condition. Both Cai et al. (2019) and Xu & Gu (2020) additionally required certain regularity conditions on the policy. Finally, Cayci et al.
(2021) gave a convergence result for a single hidden layer, also with a projection onto a ball of radius $\omega \sim m^{-1/2}$ around the initial point, but with the final objective being the representation error of the neural approximation without any kind of linearization. This result also required a condition on the representability of the value function of the policy in terms of features from random initialization. In this paper, we analyze Neural TD with a projection onto $B(\theta_0, \omega)$, a ball of fixed radius $\omega$ around the initial point $\theta_0$. We show an approximation bound of $O(\epsilon + 1/\sqrt{m})$ where $\epsilon$ is the approximation quality of the best neural network in $B(\theta_0, \omega)$. Our result improves on previous work because it does not require taking the radius $\omega$ to decay with $m$, does not make any regularity or representability assumptions on the policy, applies to any number of hidden layers, and bounds the error associated with the neural approximation without any kind of linearization around the initial condition. The main technical difference between our paper and previous works is the choice of norm for the analysis. We will describe this at a more technical level in the main body of the paper, but we use a norm introduced by Ollivier (2018), which is a convex combination of the usual $\ell_2$-norm weighted by the stationary distribution of the policy with the so-called Dirichlet semi-norm. The latter has previously been used in the convergence analysis of Markov chains (Diaconis & Saloff-Coste (1996); Levin & Peres (2017)). It was shown in Ollivier (2018) that Neural TD is exactly gradient descent on this convex combination of norms if the underlying policy is reversible. In the case where the policy is not reversible, these results were partially generalized in Liu & Olshevsky (2021), where it was shown that TD Learning with linear function approximation can be viewed as a so-called gradient splitting, a process which is analogous to gradient descent. We build heavily on that interpretation here.
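As a concrete numerical illustration of the norms discussed above, the sketch below computes the $\mu$-weighted $\ell_2$-norm and the standard Dirichlet semi-norm of a value estimate on a small Markov chain, and combines them. The chain, the value vector, and the $(1-\gamma, \gamma)$ weighting of the combination are all illustrative assumptions, not quantities taken from this paper.

```python
import numpy as np

# Illustrative 3-state Markov chain (not from the paper).
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

# Stationary distribution mu: left eigenvector of P with eigenvalue 1.
evals, evecs = np.linalg.eig(P.T)
mu = np.real(evecs[:, np.argmax(np.real(evals))])
mu = mu / mu.sum()

V = np.array([1.0, -0.5, 2.0])  # an arbitrary value-function estimate

# mu-weighted l2-norm squared: sum_s mu(s) V(s)^2.
l2_mu_sq = float(mu @ V**2)

# Dirichlet semi-norm squared, in its standard Markov-chain form:
# (1/2) * sum_{s,s'} mu(s) P(s'|s) (V(s') - V(s))^2.
diff_sq = (V[None, :] - V[:, None]) ** 2
dirichlet_sq = 0.5 * float(np.sum(mu[:, None] * P * diff_sq))

# A convex combination of the two, with an assumed (1-gamma, gamma) weighting.
gamma = 0.9
combined_sq = (1 - gamma) * l2_mu_sq + gamma * dirichlet_sq
```

Note that the Dirichlet term penalizes differences of $V$ across transitions of the chain, which is why it is only a semi-norm: it vanishes on constant vectors.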
Our main technical argument is that Neural TD approximates the gradient splitting process at each step so that despite the nonlinearity of the approximation, an improvement in approximation quality can be guaranteed unless the system is already close to the best possible approximation over the projection radius. Notably, our arguments do not imply that the neural network stays close to its initialization and our empirical simulations show a significant benefit from taking the projection radius ω not to decay with the width m.

2.1. MARKOV DECISION PROCESSES

In this section, we present key concepts from MDPs, mostly to introduce our notation. A finite discounted-reward MDP can be described by a tuple $(S, A, P_{\text{env}}, r, \gamma)$, where $S = \{s_1, s_2, \ldots, s_n\}$ is a finite state-space whose elements are vectors; $A$ is a finite action space; $P_{\text{env}} = (P_{\text{env}}(s'|s,a))_{s,s' \in S, a \in A}$ is the transition probability matrix, where $P_{\text{env}}(s'|s,a)$ is the probability of transitioning from $s$ to $s'$ after taking action $a$; $r : S \times A \to \mathbb{R}$ is the reward function; and $\gamma \in (0,1)$ is the discount factor. A policy $\pi$ in an MDP is a mapping $\pi : S \times A \to [0,1]$ such that $\sum_{a \in A} \pi(s,a) = 1$ for all $s \in S$, where $\pi(s,a)$ is the probability that the agent takes action $a$ in state $s$. We will use $n$ for the number of states, i.e., $|S| = n$. Given a policy $\pi$, we define the corresponding transition probability matrix $P = (P(s'|s))_{s,s' \in S}$ as $P(s'|s) = \sum_{a \in A} \pi(s,a) P_{\text{env}}(s'|s,a)$. We also define the state reward function as $r(s) = \sum_{a} \pi(s,a) r(s,a)$. Although $P$ and $r(s)$ depend on the policy $\pi$, throughout this paper the policy will be fixed and hence we will suppress this dependence in the notation. The stationary distribution $\mu$ corresponding to the policy $\pi$ is defined to be a nonnegative vector whose coordinates sum to one and which satisfies $\mu^T = \mu^T P$. The Perron-Frobenius theorem guarantees that such a $\mu$ exists and is unique subject to some conditions on $P$, e.g., aperiodicity and irreducibility (Gantmacher (1964)). We use $\mu(s)$ to denote each entry of $\mu$. The value function of the policy $\pi$ is defined as $V^*(s) = \mathbb{E}_{s, a \sim \pi}\left[\sum_{t=0}^{+\infty} \gamma^t r(s_t)\right]$, where $\mathbb{E}_{s, a \sim \pi}$ stands for the expectation when the starting state is $s$ and actions are taken according to policy $\pi$, and $s_t$ is the $t$'th state encountered. Note that this quantity depends on $\pi$, but once again we suppress this dependence because the policy will be fixed throughout this paper.
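The objects just defined (the induced transition matrix $P$, the state reward function $r(s)$, and the stationary distribution $\mu$) are straightforward to compute for a small MDP. The sketch below does so for a hypothetical two-state, two-action MDP; all numbers are illustrative.

```python
import numpy as np

# Hypothetical MDP with 2 states and 2 actions (all numbers illustrative).
# P_env[a, s, s2] = probability of moving from s to s2 under action a.
P_env = np.array([[[0.9, 0.1],
                   [0.4, 0.6]],
                  [[0.2, 0.8],
                   [0.7, 0.3]]])
r_sa = np.array([[1.0, 0.0],   # r_sa[s, a]: reward for taking action a in state s
                 [0.5, 2.0]])
pi = np.array([[0.5, 0.5],     # pi[s, a]: probability of action a in state s
               [0.3, 0.7]])

# Induced transition matrix: P(s2|s) = sum_a pi(s, a) * P_env(s2|s, a).
P = np.einsum('sa,asp->sp', pi, P_env)

# State reward function: r(s) = sum_a pi(s, a) * r_sa(s, a).
r = np.sum(pi * r_sa, axis=1)

# Stationary distribution: nonnegative mu with mu^T = mu^T P and sum(mu) = 1,
# i.e. a normalized left eigenvector of P for eigenvalue 1.
evals, evecs = np.linalg.eig(P.T)
mu = np.real(evecs[:, np.argmax(np.real(evals))])
mu = mu / mu.sum()
```

Here the eigenvector computation relies on $P$ being irreducible and aperiodic, matching the Perron-Frobenius conditions mentioned above.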
Moreover, note that, despite the star superscript, $V^*$ is not the optimal value function but rather the true value function corresponding to policy $\pi$. It is well known that this value function satisfies the Bellman equation $V^* = r + \gamma P V^*$, where $r$ is the vector of state rewards.
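Since the Bellman equation is linear in $V^*$, the true value function of a small MDP can be obtained by solving a linear system, which makes the fixed-point property easy to check numerically. A minimal sketch with an illustrative two-state chain:

```python
import numpy as np

gamma = 0.9
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])   # transition matrix of a fixed policy (illustrative)
r = np.array([1.0, -0.5])    # state reward vector r(s) (illustrative)

# V* = (I - gamma * P)^{-1} r; the matrix is invertible because gamma < 1
# and P is row-stochastic, so gamma * P has spectral radius gamma < 1.
V_star = np.linalg.solve(np.eye(2) - gamma * P, r)

# Check the fixed-point property of the Bellman equation.
assert np.allclose(V_star, r + gamma * P @ V_star)
```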

