ON THE PERFORMANCE OF TEMPORAL DIFFERENCE LEARNING WITH NEURAL NETWORKS

Abstract

Neural Temporal Difference (TD) Learning is an approximate temporal difference method for policy evaluation that uses a neural network for function approximation. Analysis of Neural TD Learning has proven to be challenging. In this paper we provide a convergence analysis of Neural TD Learning with a projection onto B(θ₀, ω), a ball of fixed radius ω around the initial point θ₀. We show an approximation bound of O(ϵ + 1/√m), where ϵ is the approximation quality of the best neural network in B(θ₀, ω) and m is the width of all hidden layers in the network.

1. INTRODUCTION

Temporal difference (TD) learning is considered a major milestone of reinforcement learning (RL). Proposed by Sutton (1988), TD Learning uses the Bellman error, the difference between an agent's predictions in a Markov Decision Process (MDP) and what it actually observes, to drive the process of learning an estimate of the value of every state. To deal with large state-spaces, TD learning with linear function approximation was introduced in Tesauro (1995). A mathematical analysis was given in Tsitsiklis & Van Roy (1996), which shows that the process converges under appropriate assumptions on the step-size and sampling procedure. However, with nonlinear function approximation, TD Learning is not guaranteed to converge, as observed in Tsitsiklis & Van Roy (1996) (see also Achiam et al. (2019) for a more recent treatment). Nevertheless, TD with neural network approximation, referred to as Neural TD, is used in practice despite the lack of strong theoretical guarantees. To our knowledge, rigorous analysis has only been addressed in the three papers Cai et al. (2019); Xu & Gu (2020); Cayci et al. (2021). In Cai et al. (2019), a single-hidden-layer neural architecture was considered along with projection onto a ball around the initial condition; convergence to an approximate stationary point of a certain function related to the linearization around the initial point was proved. This result was generalized to multiple hidden layers in Xu & Gu (2020), but this generalization required projection onto a ball of radius ω ∼ m^{-1/2} around the initial point, where m is the width of the hidden layers. Because the radius of this projection goes to zero with m, this effectively fixes the neural network to a small distance from its initial condition. Both Cai et al. (2019) and Xu & Gu (2020) additionally required certain regularity conditions on the policy. Finally, Cayci et al.
(2021) gave a convergence result for a single hidden layer, also with a projection onto a ball of radius ω ∼ m^{-1/2} around the initial point, but with the final objective being the representation error of the neural approximation without any kind of linearization. This result also required a condition on the representability of the value function of the policy in terms of features from the random initialization. In this paper, we analyze Neural TD with a projection onto B(θ₀, ω), a ball of fixed radius ω around the initial point θ₀. We show an approximation bound of O(ϵ + 1/√m), where ϵ is the approximation quality of the best neural network in B(θ₀, ω). Our result improves on previous work because it does not require the radius ω to decay with m, does not make any regularity or representability assumptions on the policy, applies to any number of hidden layers, and bounds the error associated with the neural approximation without any kind of linearization around the initial condition. The main technical difference between our paper and previous works is the choice of norm for the analysis. We will describe this at a more technical level in the main body of the paper, but we use a norm introduced by Ollivier (2018), which is a convex combination of the usual ℓ₂-norm weighted by the stationary distribution of the policy and the so-called Dirichlet semi-norm. The latter has previously been used in the convergence analysis of Markov chains (Diaconis & Saloff-Coste (1996); Levin & Peres (2017)). It was shown in Ollivier (2018) that Neural TD is exactly gradient descent on this convex combination of norms if the underlying policy is reversible. In the case where the policy is not reversible, these results were partially generalized in Liu & Olshevsky (2021), where it was shown that TD Learning with linear function approximation can be viewed as a so-called gradient splitting, a process which is analogous to gradient descent. We build heavily on that interpretation here.
Our main technical argument is that Neural TD approximates the gradient splitting process at each step so that despite the nonlinearity of the approximation, an improvement in approximation quality can be guaranteed unless the system is already close to the best possible approximation over the projection radius. Notably, our arguments do not imply that the neural network stays close to its initialization and our empirical simulations show a significant benefit from taking the projection radius ω not to decay with the width m.

2. PRELIMINARIES

2.1. MARKOV DECISION PROCESSES

In this section, we present key concepts from MDPs, mostly to introduce our notation. A finite discounted-reward MDP can be described by a tuple (S, A, P_env, r, γ), where S = {s₁, s₂, . . . , s_n} is a finite state-space whose elements are vectors; A is a finite action space; P_env = (P_env(s′|s, a))_{s,s′∈S, a∈A} is the transition probability matrix, where P_env(s′|s, a) is the probability of transitioning from s to s′ after taking action a; r : S × A → R is the reward function; and γ ∈ (0, 1) is the discount factor. A policy π in an MDP is a mapping π : S × A → [0, 1] such that Σ_{a∈A} π(s, a) = 1 for all s ∈ S, where π(s, a) is the probability that the agent takes action a in state s. We will use n for the number of states, i.e., |S| = n. Given a policy π, we define the corresponding transition probability matrix P = (P(s′|s))_{s,s′∈S} as P(s′|s) = Σ_{a∈A} π(s, a) P_env(s′|s, a). We also define the state reward function as r(s) = Σ_a π(s, a) r(s, a). Although P and r(s) depend on the policy π, throughout this paper the policy will be fixed and hence we will suppress this dependence in the notation. The stationary distribution µ corresponding to the policy π is defined to be a nonnegative vector with coordinates summing to one and satisfying µ^T = µ^T P. The Perron-Frobenius theorem guarantees that such a µ exists and is unique subject to some conditions on P, e.g., aperiodicity and irreducibility (Gantmacher (1964)). We use µ(s) to denote each entry of µ. The value function of the policy π is defined as V*(s) = E_{s, a∼π}[Σ_{t=0}^{+∞} γ^t r(s_t)], where E_{s, a∼π} stands for the expectation when the starting state is s and actions are taken according to policy π, and s_t is the t'th state encountered. Note that this quantity depends on π, but once again we suppress this dependence because the policy will be fixed throughout this paper.
Moreover, note that, despite the star superscript, V* is not the optimal value function but rather the true value function corresponding to policy π. It is well known that this value function satisfies the Bellman equation

V* = R + γPV*, (1)

where V* is a vector whose i'th element is V*(s_i) and R is a vector whose i'th element is r(s_i). We will further assume that rewards are bounded in [-r_max, r_max], as in the following assumption.

Assumption 2.1. For any (s, a) ∈ S × A, we have |r(s, a)| ≤ r_max.

This immediately implies that

|V*(s)| ≤ r_max/(1 - γ), ∀s ∈ S. (2)
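As a quick illustration, the Bellman equation (1) can be solved directly on a small synthetic MDP; the transition matrix and rewards below are made up purely for the example.

```python
import numpy as np

# Small synthetic 3-state MDP under a fixed policy (illustrative values).
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.4, 0.4, 0.2]])   # row-stochastic transition matrix P(s'|s)
R = np.array([1.0, -0.5, 0.3])    # state rewards r(s), bounded by r_max = 1
gamma = 0.9

# Bellman equation V* = R + gamma P V*  <=>  (I - gamma P) V* = R
V_star = np.linalg.solve(np.eye(3) - gamma * P, R)

# Sanity checks: V* is the fixed point, and |V*(s)| <= r_max / (1 - gamma).
assert np.allclose(V_star, R + gamma * P @ V_star)
assert np.all(np.abs(V_star) <= 1.0 / (1 - gamma))
```

Since I - γP is invertible for γ ∈ (0, 1), the direct solve is exact here; iterative methods are only needed when n is large.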

2.2. MARKOV CHAIN NOISE MODEL

There are two standard sampling models under which policy evaluation methods are usually analyzed. The simplest model involves i.i.d. sampling of s_t at each step from the stationary distribution µ. Alternatively, in the Markov model, the states s_t are collected from a single path of a Markov chain transitioning according to P. It is still assumed in this case that the initial distribution is µ, so that the distribution of each s_t is still µ, with the difference that the successive states are now not independent. This can always be approximately satisfied by generating a sufficiently long path from P and ignoring the initial states. Assuming that the underlying Markov chain mixes at a geometric rate is common in many analyses, e.g., Bhandari et al. (2018); Liu & Olshevsky (2021). We will also make this assumption. Formally, let P^t denote the matrix P raised to the t'th power, P^t_{s,:} the row of P^t corresponding to state s, and ||·||_TV the total variation distance.

Assumption 2.2. There exist constants C > 0 and β ∈ (0, 1) such that max_s ||P^t_{s,:} - µ||_TV ≤ Cβ^t.

This assumption guarantees "mixing": no matter the initial distribution, the state will get closer and closer to the stationary distribution µ as t increases. As pointed out in Levin & Peres (2017), this assumption always holds when the Markov chain is irreducible and aperiodic. Another useful quantity is the mixing time τ_mix(ϵ_mix), defined as the smallest integer t such that max_s ||P^t_{s,:} - µ||_TV ≤ ϵ_mix. Note that Assumption 2.2 implies τ_mix(ϵ_mix) ≤ ⌈log_β(ϵ_mix/C)⌉. For simplicity, we will write τ_mix without specifying its dependence on ϵ_mix throughout the paper.
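Assumption 2.2 can be checked numerically on a small synthetic chain (all numbers below are illustrative): compute µ from µ^T = µ^T P and watch max_s ||P^t_{s,:} - µ||_TV decay with t.

```python
import numpy as np

# Small synthetic 3-state chain (illustrative values).
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.4, 0.4, 0.2]])
n = P.shape[0]

# Stationary distribution: solve mu^T P = mu^T together with sum(mu) = 1.
A = np.vstack([P.T - np.eye(n), np.ones(n)])
b = np.concatenate([np.zeros(n), [1.0]])
mu = np.linalg.lstsq(A, b, rcond=None)[0]

def tv_to_stationary(P, mu, t):
    """max_s || P^t_{s,:} - mu ||_TV"""
    Pt = np.linalg.matrix_power(P, t)
    return 0.5 * np.abs(Pt - mu).sum(axis=1).max()

dists = [tv_to_stationary(P, mu, t) for t in range(1, 21)]

# The distance to stationarity is non-increasing, and for this chain it
# decays geometrically, consistent with Assumption 2.2.
assert all(d2 <= d1 + 1e-12 for d1, d2 in zip(dists, dists[1:]))
assert dists[-1] < 1e-8
```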

2.3. D-NORM AND DIRICHLET NORM IN MDPS

We now introduce the so-called D-norm ||·||_D and the Dirichlet semi-norm ||·||_Dir associated with a policy. While the former has long been used for the analysis of temporal difference learning, dating back to Tsitsiklis & Van Roy (1996), the latter has, to our knowledge, been introduced in the context of RL relatively recently in Ollivier (2018). Let D = diag(µ(s)) be the diagonal matrix whose elements are given by the entries of the stationary distribution µ. Given a function f : S → R, its D-norm is defined as

||f||²_D = f^T D f = Σ_{s∈S} µ(s) f(s)². (4)

The D-norm is similar to the Euclidean norm except that each entry is weighted proportionally to the stationary distribution. We also define the Dirichlet semi-norm of f:

||f||²_Dir = (1/2) Σ_{s,s′∈S} µ(s) P(s′|s) (f(s′) - f(s))². (5)

A semi-norm satisfies the triangle inequality and absolute homogeneity, like any norm, but it is not a norm because it may equal zero at a non-zero vector. Note that ||f||_Dir depends on the policy both through the stationary distribution µ(s) and through the transition matrix P. Finally, following Ollivier (2018), the weighted combination of the D-norm and the Dirichlet semi-norm is denoted by N(f) and defined as N(f) = (1 - γ)||f||²_D + γ||f||²_Dir. Note that N(f) is a valid norm: since N(f) is quadratic, we can write N(f) = f^T N f for some symmetric matrix N; examining the first term in the definition of N(f), we see that N ⪰ (1 - γ) diag(µ(s₁), . . . , µ(s_n)) ≻ 0 by irreducibility and aperiodicity.
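The two norms and their combination N(·) can be computed directly on a small example (the chain and the function f below are synthetic, chosen only for illustration):

```python
import numpy as np

gamma = 0.9
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.4, 0.4, 0.2]])   # illustrative transition matrix
n = P.shape[0]

# Stationary distribution mu (mu^T = mu^T P, entries summing to one).
A = np.vstack([P.T - np.eye(n), np.ones(n)])
mu = np.linalg.lstsq(A, np.concatenate([np.zeros(n), [1.0]]), rcond=None)[0]
D = np.diag(mu)

f = np.array([1.0, -2.0, 0.5])   # arbitrary function on the state space

d_norm_sq = f @ D @ f            # ||f||_D^2 = sum_s mu(s) f(s)^2
dir_sq = 0.5 * sum(mu[s] * P[s, sp] * (f[sp] - f[s]) ** 2
                   for s in range(n) for sp in range(n))   # ||f||_Dir^2
N_f = (1 - gamma) * d_norm_sq + gamma * dir_sq

# N(f) dominates (1 - gamma)||f||_D^2, so it vanishes only at f = 0, and it
# coincides with the quadratic form f^T D(I - gamma P) f (proved in Lemma A.1).
assert N_f >= (1 - gamma) * d_norm_sq
assert np.isclose(N_f, f @ D @ (np.eye(n) - gamma * P) @ f)
```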

2.4. NEURAL NETWORK BASED APPROXIMATION

In this section we closely follow the notation from the previous works Cai et al. (2019); Allen-Zhu et al. (2019); Liu et al. (2020); Ollivier (2018) on neural approximations. We define a multi-layer fully connected neural network by the recursion

x^(k) = (1/√m) σ(θ^(k) x^(k-1)), for k ∈ {1, . . . , K},

where σ is an activation function and the input is a state of the MDP: x^(0) ∈ S. Next, we define the output, with no activation function applied to it, as

V(s, θ) = (1/√m) b^T x^(K).

We assume that each entry of each θ^(k) is initialized from N(0, 1) and each entry of the vector b satisfies |b_r| ≤ 1, ∀r. We further assume that all the hidden layers have the same width, which we denote by m; i.e., all the matrices θ^(k) have first dimension m. Note that the total number of layers in the neural network is denoted by K. We will stack the weights of the different layers into a column vector θ consisting of the entries of the matrices θ^(1), . . . , θ^(K), with norm defined by ||θ||² = Σ_{k=1}^K ||θ^(k)||²_F, where ||·||_F is the Frobenius norm. During the training process, only the weights θ will be updated, while the weights b will be left at their initial values. This particular definition of a neural network, as well as the decision to leave b fixed, is used by many papers on both TD Learning (e.g., Xu & Gu (2020)) and deep neural network analysis (e.g., Liu et al. (2020)). Although there is no explicit bias term above, this definition does allow each layer to have its own bias. This can be achieved by setting the last entry of x^(0) to be 1 and the last row of each θ^(k) to be (0, . . . , 0, 1). We can view this neural network as mapping the parameters θ to a vector with as many entries as the number of states. Specifically, we can define the vector V(θ) whose i'th entry is [V(θ)]_i = V(s_i, θ), ∀i ∈ {1, 2, . . . , n}. Note that n in the above equation denotes the number of states in the MDP, i.e., |S|.
We remark that this vector will never actually be used in the execution of any algorithms we discuss, due to its large size, but it is still useful conceptually. The Jacobian of V(θ) is then the matrix ∇_θ V(θ) whose rows are ∇_θ V(s₁, θ), . . . , ∇_θ V(s_n, θ), where, abusing standard notation, the gradients ∇_θ V(s, θ) are defined to be row vectors. A standard assumption made to simplify the analysis is the following.

Assumption 2.3. Every state s ∈ S satisfies ||s|| ≤ 1.

We also find it helpful to make the following assumption.

Assumption 2.4. The activation function σ is l-Lipschitz and c₀-smooth.

Here, a function f is called L-Lipschitz if |f(x) - f(y)| ≤ L|x - y|, ∀x, y, and a differentiable function f : R → R is c₀-smooth if |f′(x) - f′(y)| ≤ c₀|x - y|, ∀x, y, where f′ stands for the derivative of f. The smoothness condition implies that our results below are not directly applicable to popular functions like ReLU. However, many activation functions are twice differentiable (e.g., sigmoid, tanh, arctan, softplus), and one could always use a smooth approximation to the ReLU activation (e.g., GeLU or ELU). We do need smoothness of the input-to-output map, as we will use the result from Liu et al. (2020), which shows that the neural network is O(1/√m)-smooth with respect to its parameters. The following assumption is also required to apply their result.

Assumption 2.5. For any k ∈ {1, 2, . . . , K} and i ∈ {1, 2, . . . , m}, we have |x^(k)_i| = Õ(1) at initialization. Here, x^(k)_i denotes the i'th entry of x^(k).
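A minimal sketch of the forward pass of this architecture (the activation, widths, and random seed below are illustrative choices, not prescribed by the analysis):

```python
import numpy as np

def init_params(d_in, m, K, rng):
    """Gaussian N(0,1) init: theta^(1) is m x d_in, theta^(2..K) are m x m."""
    shapes = [(m, d_in)] + [(m, m)] * (K - 1)
    return [rng.standard_normal(sh) for sh in shapes]

def value(s, theta, b, sigma=np.tanh):
    """V(s, theta) = (1/sqrt(m)) b^T x^(K), x^(k) = (1/sqrt(m)) sigma(theta^(k) x^(k-1))."""
    m = theta[0].shape[0]
    x = s
    for W in theta:
        x = sigma(W @ x) / np.sqrt(m)
    return (b @ x) / np.sqrt(m)

rng = np.random.default_rng(0)
d_in, m, K = 4, 64, 3
theta = init_params(d_in, m, K, rng)
b = rng.choice([-1.0, 1.0], size=m)            # |b_r| <= 1, held fixed in training
s = rng.standard_normal(d_in)
s /= np.linalg.norm(s)                          # ||x^(0)|| <= 1 (Assumption 2.3)
v = value(s, theta, b)
assert np.isfinite(v)
```

Only the matrices in `theta` would be trained; `b` stays at its initial value, matching the setup above.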

2.5. NEURAL TD

In this section, we introduce (projected) Neural TD learning. At each time step t, this algorithm samples a state s (from either of the two sampling models discussed earlier), generates the next state s′ in the MDP, and computes the temporal difference error

δ_t = r(s) + γV(s′, θ_t) - V(s, θ_t).

Defining g(θ_t) = ∇_θ V(s, θ_t) δ_t, projected Neural TD updates the weights as

θ_{t+1/2} = θ_t + α_t g(θ_t), θ_{t+1} = Proj(θ_{t+1/2}),

where the projection is onto a ball of radius ω around the initial condition: Proj(θ) = argmin_{x : ||x - θ₀|| ≤ ω} ||x - θ||. Projection is a common tool in neural TD and neural Q-learning used to stabilize the iterates, since divergence can occur (Achiam et al. (2019); Van Hasselt et al. (2018)). Most analyses of TD learning proceed by comparing the evolution of TD to the mean-path update, defined as θ_{t+1/2} = θ_t + α_t ḡ(θ_t), θ_{t+1} = Proj(θ_{t+1/2}), where

ḡ(θ_t) = E[g(θ_t) | θ_t] = Σ_s µ(s) ∇_θ V(s, θ_t) E_{s′|s}[r(s) + γV(s′, θ_t) - V(s, θ_t)] = ∇_θ V(θ_t)^T D(R + γPV(θ_t) - V(θ_t)).

It is convenient to rewrite ḡ(θ_t) in terms of the difference between V(θ_t) and V*, which can be done by subtracting Eq. (1), with the result being

ḡ(θ_t) = ∇_θ V(θ_t)^T D(γP - I)(V(θ_t) - V*). (7)

For simplicity, for the rest of the paper we will define Θ to be the set onto which we are projecting: Θ = B(θ₀, ω) = {θ | ||θ - θ₀|| ≤ ω}. Finally, we will say that V̂* is an ϵ-approximation of the true value function V* if

max_{s∈S} |V̂*(s) - V*(s)| ≤ ϵ, (8)

and we will denote by θ̂* a point in Θ such that V(θ̂*) = V̂* is an ϵ-approximation of V*.
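A minimal single-hidden-layer (K = 1) sketch of one projected Neural TD step, with the gradient of V written out by hand for a tanh activation; all sizes and constants below are illustrative, not the paper's experimental settings.

```python
import numpy as np

def td_step(theta, theta0, s, s_next, reward, b, alpha, gamma, omega):
    """One projected Neural TD update for a single-hidden-layer network."""
    m = theta.shape[0]
    def V(th, x):
        # (1/sqrt(m)) b^T x^(1) with x^(1) = (1/sqrt(m)) tanh(theta x)
        return b @ np.tanh(th @ x) / m
    # TD error: delta_t = r(s) + gamma V(s', theta) - V(s, theta)
    delta = reward + gamma * V(theta, s_next) - V(theta, s)
    # g(theta) = grad_theta V(s, theta) * delta_t  (tanh' = 1 - tanh^2)
    grad_V = np.outer(b * (1 - np.tanh(theta @ s) ** 2), s) / m
    theta_half = theta + alpha * delta * grad_V
    # Project back onto the ball B(theta0, omega).
    diff = theta_half - theta0
    dist = np.linalg.norm(diff)
    if dist > omega:
        theta_half = theta0 + diff * (omega / dist)
    return theta_half

rng = np.random.default_rng(1)
m, d = 32, 4
theta0 = rng.standard_normal((m, d))
b = rng.choice([-1.0, 1.0], size=m)
s, s_next = rng.standard_normal(d), rng.standard_normal(d)
theta1 = td_step(theta0.copy(), theta0, s, s_next, 0.5, b,
                 alpha=10.0, gamma=0.9, omega=1.0)
# The iterate never leaves B(theta0, omega).
assert np.linalg.norm(theta1 - theta0) <= 1.0 + 1e-9
```

The Euclidean projection onto a ball has the closed form used above: rescale the offset from θ₀ whenever its norm exceeds ω.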

3. OUR MAIN RESULT

We can now state the main contribution of this paper, which is a performance result for Neural TD. We will require an assumption to the effect that the initialization is not too large.

Assumption 3.1. For all k ∈ {1, 2, . . . , K}, ||θ^(k)_0|| ≤ O(√m).

This is a common assumption; theoretical justification can be found in Liu et al. (2020). In particular, this assumption holds with high probability for a random Gaussian initialization if, for example, K (the depth) grows slower than exponentially in m (the width). Now we are ready to state our result.

Theorem 3.1. Suppose Assumptions 2.1, 2.2, 2.3, 2.4, 2.5, and 3.1 hold, θ_t is generated by projected Neural TD with the constant step-size α_t = α, and C, β come from Assumption 2.2.

(a) Under i.i.d. sampling, we have

(1/T) Σ_{t=0}^{T-1} E[N(V(θ_t) - V(θ̂*))] ≤ ||θ₀ - θ̂*||²/(2αT) + O(ϵ + 1/√m) + O(αϵ²) + (1/(1-γ)²) O(α).

In particular, if α = T^{-1/2} and ϵ ≤ 1, we have

(1/T) Σ_{t=0}^{T-1} E[N(V(θ_t) - V(θ̂*))] ≤ ||θ₀ - θ̂*||²/(2√T) + O(ϵ + 1/√m) + (1/(1-γ)²) O(1/√T).

(b) Under Markov sampling, we have

(1/T) Σ_{t=0}^{T-1} E[N(V(θ_t) - V(θ̂*))] ≤ ||θ₀ - θ̂*||²/(2αT) + O(ϵ + 1/√m) + O(αϵ²) + (1/(1-γ)²) O(α) + O((α log(C/α)/(1-β)) ϵ²) + (1/(1-γ)²) O(α log(C/α)/(1-β)).

In particular, if α = T^{-1/2} and ϵ ≤ 1, we have

(1/T) Σ_{t=0}^{T-1} E[N(V(θ_t) - V(θ̂*))] ≤ ||θ₀ - θ̂*||²/(2√T) + O(ϵ + 1/√m) + (1/(1-γ)²) O(1/√T) + (1/(1-γ)²) O(log(C√T)/((1-β)√T)).

Figure 1: Key property of gradient splitting: h(θ) has the same inner product with a - θ as (1/2)∇f(θ).

In all O(·) notations above, we treat factors that do not depend on T, ϵ, m, α, β, θ₀ as constants. As mentioned earlier, the key distinguishing feature of this theorem is the choice of norm: the left-hand sides of all the bounds measure the difference between V(θ_t) and the best possible V(θ̂*) within B(θ₀, ω) in the N(·) norm.
We note that since, trivially, ||f||²_D ≤ N(f)/(1-γ), where γ is the discount factor, one can replace the left-hand sides of all the bounds by ||V(θ_t) - V(θ̂*)||²_D, at the cost of a factor of 1/(1-γ), to obtain results that look more similar to the previous literature on TD, which usually proceeds via an analysis in the D-norm (e.g., Tsitsiklis & Van Roy (1996)). As is common in analyses of SGD, the performance measure is the average of the performance measures from 1 to T. If a particular θ_{t′} satisfying the bounds is sought, a standard trick is to choose t′ uniformly at random from 1, . . . , T; in that case, the expectation E[N(V(θ_{t′}) - V(θ̂*))] is exactly the left-hand side of each bound in the theorem. Looking at the right-hand sides, the theorem guarantees a final error of O(ϵ + 1/√m), with a convergence rate that scales as Õ(1/√T). The difference between the two cases is that the Markov sampling case contains an additional term involving the mixing time; as a result of this extra term, the convergence time in the Markov case is worse by a factor of O(log √T). We now revisit the discussion of the novelty of this paper. First, in contrast to previous work, we do not require the projection radius ω to decay with m, nor do we restrict our analysis to the single-hidden-layer case. In our proof, the projection radius appears as a constant in the O(·) notation, which is why we need to assume it is a constant. Second, the left-hand side of the bounds is a measure of the error V(θ_t) - V(θ̂*), which can be thought of as the difference between the error of the neural network and the best possible error in B(θ₀, ω). Moreover, as already discussed, the left-hand side of the bounds above is greater than the quantity (1-γ)||V(θ_t) - V(θ̂*)||²_D = (1-γ) Σ_s µ(s)(V(s, θ_t) - V(s, θ̂*))², which is a natural measure of the average error. The point is that the network is not being linearized around the initial condition in any sense.
Finally, no additional assumptions on the policy are made here; in particular, no regularity assumptions on the policy, and no representability assumptions on its value function, are necessary. As outlined in the Introduction, out of the four potentially undesirable elements discussed here (small projection radius, linearization around the initial condition, assumptions on the policy, restriction to the single-layer case), all previous papers suffered from at least three.

4. DISTINGUISHING FEATURE OF OUR ANALYSIS

The main technical difference between our work and the previous papers is the use of the function N(·) to measure the approximation error. Here, we follow Ollivier (2018); Liu & Olshevsky (2021), which explained why this function is the "right" one for analyzing policy evaluation. In Ollivier (2018) it was shown that if the matrix P corresponds to a reversible Markov chain, then E[ḡ(θ_t)] = ∇_θ N(f) for some f. This makes Neural TD very easy to analyze, as it can be viewed as gradient descent. Unfortunately, in practice policies are almost never reversible. In Liu & Olshevsky (2021), it was shown how to further use the function N(·) to analyze TD with linear approximation when the policy is not necessarily reversible. The key idea was the notion of a gradient splitting: a linear function h(θ) is said to be a gradient splitting of a convex quadratic f(θ) minimized at θ = a if

(1/2)∇f(θ)^T(a - θ) = h(θ)^T(a - θ). (9)

In other words, h(θ) has exactly the same inner product with the "direction to the optimal solution" as the true gradient of f(θ) (up to the factor of 1/2). The significance of this is that it allows many analyses of gradient descent to be modified to analyze TD with linear approximation, since the key step in analyses of gradient descent is usually to argue that the left-hand side of Eq. (9) is negative, signifying that gradient descent "makes progress" towards the optimal solution. In Liu & Olshevsky (2021) it was shown that for TD with linear approximation, the negated mean update -ḡ(θ) is exactly a gradient splitting of N(·). We build on that idea in this paper as follows. We use recent NTK-style bounds from Liu et al. (2020) to argue that with increasing width, the neural approximation becomes more linear. For finite m, we can then modify existing techniques for analyzing gradient descent with errors in the gradient evaluations to analyze the Neural TD update, which we view as a gradient splitting with errors in the evaluations.
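For linear features V(θ) = Φθ, the splitting identity can be checked numerically: by Lemma A.1, N(Φ(θ - a)) is the quadratic form built from the symmetric part of D(I - γP), and the negated mean TD update matches the gradient's inner product with the direction to a, up to the factor 1/2. The matrices below are random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, gamma = 5, 3, 0.9

# Random irreducible chain and its stationary distribution.
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)
A = np.vstack([P.T - np.eye(n), np.ones(n)])
mu = np.linalg.lstsq(A, np.concatenate([np.zeros(n), [1.0]]), rcond=None)[0]
D = np.diag(mu)

Phi = rng.standard_normal((n, d))   # linear features: V(theta) = Phi theta
a = rng.standard_normal(d)          # minimizer (plays the role of theta*)
theta = rng.standard_normal(d)

M = D @ (np.eye(n) - gamma * P)
# f(theta) = N(Phi(theta - a)) = (theta-a)^T Phi^T sym(M) Phi (theta-a)
S = Phi.T @ (M + M.T) @ Phi / 2
grad_f = 2 * S @ (theta - a)
# Negated mean TD update: h(theta) = Phi^T D(I - gamma P) Phi (theta - a)
h = Phi.T @ M @ Phi @ (theta - a)

# Splitting identity: same inner product with (theta - a), up to factor 1/2.
assert np.isclose(h @ (theta - a), 0.5 * grad_f @ (theta - a))
```

The identity holds exactly here because a quadratic form only sees the symmetric part of M; no reversibility of P is needed.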
It should be stressed that the most interesting "regime" in which we expect these results to be applicable is when m is only moderately large. Intuitively, when m is too large, the network will be very close to linear, and the benefit over taking random linear features will be small; in terms of our bounds, in this case we would expect the ϵ term in Theorem 3.1 to be large. On the other hand, when m is too small, the error bound scaling with width as m^{-1/2} will not be very attractive. Our results are thus most useful in the "middle range", say m ∼ 100, where, depending on the context of the problem, an ∼ m^{-1/2} error is acceptable. Informally, in this regime the network is sufficiently linear to converge well, but not so linear as to lose approximation quality. Our simulations confirm this, as we verify good approximation for networks of approximately this width on several OpenAI Gym benchmarks.

5. SIMULATIONS

It might be objected that many analyses of neural network training in the large-width regime proceed by arguing that the neural network stays close to its initial point (e.g., Chizat et al. (2019); Telgarsky (2020)). If so, then the part of our results dealing with avoiding a projection radius that goes to zero with m would not make much of a difference in practice. Here, we show empirically that setting the projection radius ω to be a constant rather than ω ∼ m^{-1/2} does make a substantial difference. We run simulations on the OpenAI Gym tasks MountainCar, CartPole, and Acrobot. In each task, a trained policy is used to sample data to train a fully connected neural network for policy evaluation using projected Neural TD learning. The policy in MountainCar and CartPole is trained by Proximal Policy Optimization (PPO) (Schulman et al. (2017)), while in Acrobot it is trained by Deep Q-Learning (DQN) (Mnih et al. (2015)). Both algorithms are implemented through the Stable Baselines package (Raffin et al. (2021)). Our figures plot, among other quantities, the difference between the gradients at θ_t and at initialization (given by ||∇_θV(s_t, θ_t) - ∇_θV(s_t, θ_0)||); the subtitle of each figure indicates the task and the number of hidden layers used. It is clear from the figures that the networks we use are nonlinear. It is also clear that networks with a decaying projection radius are significantly outperformed by networks with a constant projection radius.

A PROOF FOR THE MAIN RESULT

A.1 USEFUL LEMMAS

Before we go into details, we first introduce the following lemmas. Recall that we have defined N(f) as N(f) = (1-γ)||f||²_D + γ||f||²_Dir. The first lemma states an important property of N(·).

Lemma A.1. For any function f defined on the state space S, the following equation holds: -N(f) = f^T D(γP - I)f.

Proof. We can perform the following sequence of manipulations:

||f||²_Dir = (1/2) Σ_{s,s′} µ(s)P(s′|s)[f(s) - f(s′)]²
= (1/2) Σ_s µ(s)f(s)² + (1/2) Σ_{s,s′} µ(s)P(s′|s)f(s′)² - Σ_{s,s′} µ(s)P(s′|s)f(s)f(s′)
= (1/2) Σ_s µ(s)f(s)² + (1/2) Σ_{s′} µ(s′)f(s′)² - Σ_{s,s′} µ(s)P(s′|s)f(s)f(s′)
= ||f||²_D - Σ_{s,s′} µ(s)P(s′|s)f(s)f(s′),

where the first equation uses Eq. (5), the third uses the stationarity of µ, and the fourth uses Eq. (4). Thus,

f^T D(γP - I)f = -f^T Df + γf^T D(Pf)
= -||f||²_D + γ Σ_s µ(s)f(s) Σ_{s′} P(s′|s)f(s′)
= -||f||²_D + γ(||f||²_D - ||f||²_Dir)
= -(1-γ)||f||²_D - γ||f||²_Dir = -N(f).

The following lemmas are variations of the mean-value theorem. In this lemma and below, we adopt the notation that gradients are row vectors.

Lemma A.2. (a) Let h : R → R be any differentiable function. For any x, y ∈ R, there exist λ ∈ (0, 1) and z = λx + (1-λ)y such that h(y) - h(x) = h′(z)(y - x). (b) Let ξ : R^a → R be any differentiable function. For any x, y ∈ R^a, there exist λ ∈ (0, 1) and z = λx + (1-λ)y such that ξ(y) - ξ(x) = ξ′(z)(y - x). (c) Let f : R^a → R^b be any differentiable function and e ∈ R^b be any vector. For any x, y ∈ R^a, there exist λ ∈ (0, 1) and z = λx + (1-λ)y such that e^T(f(y) - f(x)) = e^T f′(z)(y - x), where f′(z) is the Jacobian of f at z.

Proof. (a)

This is a direct result of the well-known mean value theorem (Theorem 5.10 in Rudin (1976)). (b) Define h : R → R by h(w) = ξ(x + w(y - x)). Using part (a), for any u, v ∈ R there exists λ ∈ (0, 1) such that h(v) - h(u) = h′(λu + (1-λ)v)(v - u). Letting v = 1 and u = 0, we get ξ(y) - ξ(x) = ξ′(λx + (1-λ)y)(y - x). (c) Define ξ : R^a → R by ξ = e^T f. Using part (b), for any vectors x, y ∈ R^a there exist λ ∈ (0, 1) and z = λx + (1-λ)y such that ξ(y) - ξ(x) = ξ′(z)(y - x). Notice that in this case, ξ(y) - ξ(x) = e^T(f(y) - f(x)) and ξ′(z) = e^T f′(z). So for any vectors x, y, there exists z = λx + (1-λ)y such that e^T f(y) - e^T f(x) = e^T f′(z)(y - x), which is exactly what needed to be proved.

Lemma A.3. Let f : R^a → R^b be any differentiable function. For any x, y ∈ R^a, there exist λ ∈ (0, 1) and z = λx + (1-λ)y such that ||f(y) - f(x)|| ≤ ||f′(z)|| ||y - x||.

Proof. Take e = f(y) - f(x) in Lemma A.2(c). We thus have (f(y) - f(x))^T(f(y) - f(x)) = (f(y) - f(x))^T f′(z)(y - x). Applying the Cauchy-Schwarz inequality on the right-hand side gives ||f(y) - f(x)||² ≤ ||f(y) - f(x)|| · ||f′(z)(y - x)||, and finally, using the definition of the matrix norm, ||f(y) - f(x)|| ≤ ||f′(z)(y - x)|| ≤ ||f′(z)|| · ||y - x||.

The next lemma shows how the mean-value theorem helps to analyze Neural TD Learning.

Lemma A.4. Projected Neural TD with the mean-path update can be rewritten as θ_{t+1} = Proj(θ_t + α_t(ḡ1(θ_t) + ḡ2(θ_t) + ḡ3(θ_t))), with ḡ1(θ_t), ḡ2(θ_t), ḡ3(θ_t) defined as follows:

ḡ1(θ_t) = ∇_θV(θ_mid1)^T D(γP - I)(V(θ_t) - V(θ̂*)),
ḡ2(θ_t) = (∇_θV(θ_t) - ∇_θV(θ_mid1))^T D(γP - I)(V(θ_t) - V(θ̂*)),
ḡ3(θ_t) = ∇_θV(θ_t)^T D(γP - I)(V̂* - V*),

where λ ∈ [0, 1] is a scalar and θ_mid1 = λθ_t + (1-λ)θ̂* is a vector such that

(θ_t - θ̂*)^T ∇_θV(θ_mid1)^T D(γP - I)(V(θ_t) - V(θ̂*)) = (V(θ_t) - V(θ̂*))^T D(γP - I)(V(θ_t) - V(θ̂*)).

Proof. By Eq.
(7),

ḡ(θ_t) = ∇_θV(θ_t)^T D(γP - I)(V(θ_t) - V*)
= ∇_θV(θ_t)^T D(γP - I)(V(θ_t) - V(θ̂*)) + ∇_θV(θ_t)^T D(γP - I)(V̂* - V*). (14)

Now let D(γP - I)(V(θ_t) - V(θ̂*)) be the vector e in Lemma A.2(c). There exist a scalar λ ∈ (0, 1) and a vector θ_mid1 = λθ_t + (1-λ)θ̂* such that

(θ_t - θ̂*)^T ∇_θV(θ_mid1)^T D(γP - I)(V(θ_t) - V(θ̂*)) = (V(θ_t) - V(θ̂*))^T D(γP - I)(V(θ_t) - V(θ̂*)).

This gives the reason to divide the first term of Eq. (14) into two parts as follows:

∇_θV(θ_t)^T D(γP - I)(V(θ_t) - V(θ̂*)) = ∇_θV(θ_mid1)^T D(γP - I)(V(θ_t) - V(θ̂*)) + (∇_θV(θ_t) - ∇_θV(θ_mid1))^T D(γP - I)(V(θ_t) - V(θ̂*)).

The following lemma relates quantities of the form x^T Dy to expectations under µ.

Lemma A.5. Let x, y be two vectors and x(i), y(i) denote their i'th entries, respectively. Let x̄, ȳ be two scalars such that |x(i)| ≤ x̄ and |y(i)| ≤ ȳ hold for all i. Then x^T Dy = y^T Dx ≤ x̄ȳ, x^T DPy ≤ x̄ȳ, and y^T DPx ≤ x̄ȳ.

Proof. We expand x^T Dy, x^T DPy and y^T DPx as follows:

x^T Dy = y^T Dx = Σ_i µ(s_i)x(i)y(i) ≤ Σ_i µ(s_i)x̄ȳ ≤ x̄ȳ,
x^T DPy = Σ_i µ(s_i)x(i) Σ_j P(s_j|s_i)y(j) ≤ x̄ȳ Σ_i µ(s_i) Σ_j P(s_j|s_i) ≤ x̄ȳ,
y^T DPx = Σ_i µ(s_i)y(i) Σ_j P(s_j|s_i)x(j) ≤ x̄ȳ Σ_j Σ_i µ(s_i)P(s_j|s_i) ≤ x̄ȳ.

The following lemmas establish two important properties of the neural network approximation: Lipschitzness and smoothness.

Lemma A.6. For all k ∈ {1, 2, . . . , K}, ||θ^(k)|| ≤ O(√m).

Proof. ||θ^(k)|| ≤ ||θ^(k) - θ^(k)_0|| + ||θ^(k)_0|| ≤ ω + ||θ^(k)_0|| ≤ O(√m), where the second inequality is by the projection and the last inequality uses Assumption 3.1 and the fact that ω is a constant with respect to m.

Lemma A.7. For all k ∈ {1, 2, . . . , K}, ||x^(k)|| ≤ O(√m).

Proof. From Assumption 2.3, ||x^(0)|| ≤ 1. By Lemma A.6,

||x^(1)||² = ||(1/√m)σ(θ^(1)x^(0))||² ≤ (2/m)(l²||θ^(1)||²||x^(0)||² + m|σ(0)|²) ≤ O(m),

where we used the l-Lipschitzness of σ. By induction, suppose ||x^(k)||² ≤ O(m).
By Lemma A.6,

||x^(k+1)||² = ||(1/√m)σ(θ^(k+1)x^(k))||² ≤ (2/m)(l²||θ^(k+1)||²||x^(k)||² + m|σ(0)|²) ≤ O(m).

Lemma A.8. For all k ∈ {1, 2, . . . , K}, ||∇_{x^(k-1)}x^(k)|| ≤ O(1).

Proof. Since

[∇_{x^(k-1)}x^(k)](i, j) = (1/√m) σ′(Σ_h θ^(k)(i, h)x^(k-1)(h)) θ^(k)(i, j),

we can write ∇_{x^(k-1)}x^(k) = (1/√m)Σ′θ^(k), where Σ′ is the diagonal matrix with Σ′(i, i) = σ′(Σ_h θ^(k)(i, h)x^(k-1)(h)). Hence

||∇_{x^(k-1)}x^(k)||² = sup_{||v||=1} (1/m)||Σ′θ^(k)v||² ≤ (1/m)||Σ′||² · ||θ^(k)||² ≤ O(1),

where the last inequality holds because ||Σ′|| ≤ l and by Lemma A.6.

Lemma A.9. For all k ∈ {1, 2, . . . , K}, ||∇_{θ^(k)}x^(k)|| ≤ O(1). Here, ∇_{θ^(k)}x^(k) is defined to be the matrix whose (i, (j-1)m + j′)'th entry is ∂x^(k)(i)/∂θ^(k)(j, j′).

Proof. We have

[∇_{θ^(k)}x^(k)](i, j, j′) = (1/√m) 1{i = j} σ′(Σ_h θ^(k)(i, h)x^(k-1)(h)) x^(k-1)(j′).

Writing ξ(i) = σ′(Σ_h θ^(k)(i, h)x^(k-1)(h)), this is (1/√m)1{i = j}ξ(i)x^(k-1)(j′), which implies

||∇_{θ^(k)}x^(k)||² = sup_{||V||_F=1} Σ_{i=1}^m (Σ_{j,j′} [∇_{θ^(k)}x^(k)](i, j, j′)V_{j,j′})²
= (1/m) sup_{||V||_F=1} Σ_{i=1}^m (ξ(i)[Vx^(k-1)]_i)²
= (1/m) sup_{||V||_F=1} ||Σ′Vx^(k-1)||²
≤ (1/m)||Σ′||² · ||x^(k-1)||² ≤ O(1),

where the last inequality holds because ||Σ′|| ≤ l and by Lemma A.7.

Lemma A.10. For all s ∈ S and θ, ||∇_θV(s, θ)|| ≤ O(1) with respect to m.

Proof. Since each entry of b satisfies |b_r| ≤ 1, it is easy to see that ||∇_{x^(K)}V(s, θ)|| = (1/√m)||b|| ≤ 1. By Lemma A.8, Lemma A.9, and the chain rule,

||∇_{θ^(k)}V(s, θ)|| = ||∇_{x^(K)}V(s, θ) ∇_{x^(K-1)}x^(K) · · · ∇_{x^(k)}x^(k+1) ∇_{θ^(k)}x^(k)|| ≤ O(1).

It follows that ||∇_θV(s, θ)||² = Σ_{k=1}^K ||∇_{θ^(k)}V(s, θ)||² ≤ O(1).

Lemma A.11. The Hessian matrix ∇²_θV(s, θ) has a norm of O(m^{-0.5}); in other words, ||∇²_θV(s, θ)|| ≤ O(m^{-0.5}).
This is a direct result of Theorem 3.2 in Liu et al. (2020). Notice that since we assume Assumptions 3.1 and 2.5 (which correspond to Lemma G.1 and Lemma G.4 in Liu et al. (2020)) hold with probability 1, Lemma A.11 also holds with probability 1.

The following lemma implies that the update during each step will not be too large.

Lemma A.12. We have the following bound for ||g(θ_t)||: ||g(θ_t)||² ≤ O(ϵ²) + 1/(1-γ)² O(1).

Proof. Recall that Eq.(1) tells us the property V* satisfies, and it can be rewritten as V*(s) = r(s) + γ Σ_{s′′} P(s′′|s) V*(s′′). Moreover, recall that g(θ_t) is defined to be g(θ_t) = ∇_θ V(s, θ_t)[r(s) + γV(s′, θ_t) - V(s, θ_t)]. This immediately implies that g(θ_t) is actually a random variable and implicitly relies on the state s. This allows g(θ_t) to be rewritten as

g(θ_t) = ∇_θ V(s, θ_t)[r(s) + γV(s′, θ_t) - V(s, θ_t)]
= ∇_θ V(s, θ_t)[V*(s) - V(s, θ_t) + γ Σ_{s′′} P(s′′|s)(V(s′, θ_t) - V*(s′′))]
= ∇_θ V(s, θ_t)[V(s, θ*) - V(s, θ_t) + V*(s) - V(s, θ*) + γ Σ_{s′′} P(s′′|s)(V(s′, θ_t) - V(s′, θ*) + V(s′, θ*) - V*(s′) + V*(s′) - V*(s′′))]
= ∇_θ V(s, θ_t)[f(s) + V*(s) - V(s, θ*) + γ Σ_{s′′} P(s′′|s)(-f(s′) + V(s′, θ*) - V*(s′) + V*(s′) - V*(s′′))],

where, for simplicity, we denote f(s) = V(s, θ*) - V(s, θ_t). Now, ||g(θ_t)||² can be bounded as

||g(θ_t)||² ≤ 5||∇_θ V(s, θ_t)||² [f(s)² + (γ Σ_{s′′} P(s′′|s) f(s′))² + (V*(s) - V(s, θ*))² + (γ Σ_{s′′} P(s′′|s)(V(s′, θ*) - V*(s′)))² + (γ Σ_{s′′} P(s′′|s)(V*(s′) - V*(s′′)))²].

There are five different terms, and we now bound them respectively. By Lemma A.10, which says f(s) is O(1)-Lipschitz in θ, and the fact that ||θ_t - θ*|| ≤ 2ω, we have f(s)² ≤ O(1). Using the above fact, (γ Σ_{s′′} P(s′′|s) f(s′))² = γ² f(s′)² ≤ γ² O(1). Using Eq.(8), we can derive (V*(s) - V(s, θ*))² ≤ O(ϵ²).
Using the above fact, (γ Σ_{s′′} P(s′′|s)(V(s′, θ*) - V*(s′)))² = γ²(V(s′, θ*) - V*(s′))² ≤ γ² O(ϵ²). Using Eq.(2) and Jensen's inequality,

(γ Σ_{s′′} P(s′′|s)(V*(s′) - V*(s′′)))² ≤ γ² Σ_{s′′} P(s′′|s)(2r_max/(1-γ))² = γ²/(1-γ)² O(1).

Combining the above five facts, we have the bound for ||g(θ_t)||²:

||g(θ_t)||² ≤ 5||∇_θ V(s, θ_t)||² [f(s)² + (γ Σ_{s′′} P(s′′|s) f(s′))² + (V*(s) - V(s, θ*))² + (γ Σ_{s′′} P(s′′|s)(V(s′, θ*) - V*(s′)))² + (γ Σ_{s′′} P(s′′|s)(V*(s′) - V*(s′′)))²]
≤ 5||∇_θ V(s, θ_t)||² [O(1) + γ² O(1) + O(ϵ²) + γ² O(ϵ²) + γ²/(1-γ)² O(1)]
≤ (1 + γ²) O(1 + ϵ²) + γ²/(1-γ)² O(1),   (15)

where the last inequality uses Lemma A.10. Since 0 ≤ γ ≤ 1, we can simplify the bound as ||g(θ_t)||² ≤ O(ϵ²) + 1/(1-γ)² O(1).

A.2 PROOF FOR THE MAIN RESULT

Proof. Consider Projected Neural TD Learning. It is easy to see that

||θ_{t+1} - θ*||² = ||Proj(θ_t + α_t g(θ_t)) - θ*||² ≤ ||θ_t - θ* + α_t g(θ_t)||²
= ||θ_t - θ*||² + 2α_t (θ_t - θ*)^T g(θ_t) + α_t² ||g(θ_t)||²
= ||θ_t - θ*||² + 2α_t (θ_t - θ*)^T ḡ(θ_t) + α_t² ||g(θ_t)||² + 2α_t (θ_t - θ*)^T (g(θ_t) - ḡ(θ_t)).   (16)

First, we consider 2α_t (θ_t - θ*)^T ḡ(θ_t). Lemma A.4 allows us to divide it into several parts and bound them respectively as follows: 2α_t (θ_t - θ*)^T ḡ(θ_t) = 2α_t (θ_t - θ*)^T (ḡ_1(θ_t) + ḡ_2(θ_t) + ḡ_3(θ_t)).

To bound (θ_t - θ*)^T ḡ_1(θ_t),

(θ_t - θ*)^T ḡ_1(θ_t) = (θ_t - θ*)^T ∇_θ V(θ_mid1)^T D(γP - I)(V(θ_t) - V(θ*)) = (V(θ_t) - V(θ*))^T D(γP - I)(V(θ_t) - V(θ*)) = -N(V(θ_t) - V(θ*)),

where the first equality is by Eq.(10), the second equality is by Eq.(13), and the third equality is by setting f = V(θ_t) - V(θ*) in Lemma A.1.

To bound (θ_t - θ*)^T ḡ_2(θ_t), we first notice that Lemma A.11 means V(s, θ) is O(m^{-0.5})-smooth with respect to θ. Hence, ||∇_θ V(s, θ_t) - ∇_θ V(s, θ_mid1)|| ≤ O(m^{-0.5}) · ||θ_t - θ_mid1|| = O(m^{-0.5}), where the inequality is by Lemma A.11 and the equality is because both θ_t and θ_mid1 are in B(θ_0, ω). Similarly, Lemma A.10 tells us that V(s, θ) is O(1)-Lipschitz. This means ||V(s, θ_t) - V(s, θ*)|| ≤ O(1) · ||θ_t - θ*|| = O(1), where the inequality is by Lemma A.10 and the equality is because both θ_t and θ* lie in B(θ_0, ω). These two facts imply that each entry of (∇_θ V(θ_t) - ∇_θ V(θ_mid1))(θ_t - θ*) is upper-bounded by O(m^{-0.5}) and each entry of V(θ_t) - V(θ*) is upper-bounded by O(1). With this fact,

(θ_t - θ*)^T ḡ_2(θ_t) = (θ_t - θ*)^T (∇_θ V(θ_t) - ∇_θ V(θ_mid1))^T D(γP - I)(V(θ_t) - V(θ*)) ≤ (1 + γ) O(m^{-0.5}) ≤ O(m^{-0.5}),

where the equality is by Eq.(11) and the first inequality is by setting x to be (∇_θ V(θ_t) - ∇_θ V(θ_mid1))(θ_t - θ*) and y to be V(θ_t) - V(θ*) in Lemma A.5.
To bound (θ_t - θ*)^T ḡ_3(θ_t),

(θ_t - θ*)^T ḡ_3(θ_t) = (θ_t - θ*)^T ∇_θ V(θ_t)^T D(γP - I)(V(θ*) - V*) ≤ (1 + γ) O(ϵ) ≤ O(ϵ),

where the equality is by Eq.(12) and the first inequality is by setting x to be ∇_θ V(θ_t)(θ_t - θ*) and y to be V(θ*) - V* in Lemma A.5, as each entry of ∇_θ V(θ_t)(θ_t - θ*) is bounded by O(1) using Lemma A.10 and each entry of V(θ*) - V* is bounded by ϵ using Eq.(8).

Combining the above facts,

2α_t (θ_t - θ*)^T ḡ(θ_t) ≤ -2α_t N(V(θ_t) - V(θ*)) + O(α_t (m^{-0.5} + ϵ)),

which is the bound for the first part in Eq.(16).

Second, we consider α_t² ||g(θ_t)||². By Lemma A.12, α_t² ||g(θ_t)||² ≤ O(α_t² ϵ²) + 1/(1-γ)² O(α_t²).

The above facts are all we need to establish the result when each s_t is sampled i.i.d. from μ. In summary,

2α_t (θ_t - θ*)^T ḡ(θ_t) + α_t² ||g(θ_t)||² ≤ -2α_t N(V(θ_t) - V(θ*)) + O(α_t (m^{-0.5} + ϵ)) + O(α_t² ϵ²) + 1/(1-γ)² O(α_t²),

which simply combines the bounds for 2α_t (θ_t - θ*)^T ḡ(θ_t) and α_t² ||g(θ_t)||² obtained above. Given θ_t, taking expectation on both sides of Eq.(16):

E[||θ_{t+1} - θ*||²] ≤ ||θ_t - θ*||² + 2α_t (θ_t - θ*)^T ḡ(θ_t) + E[α_t² ||g(θ_t)||²] + 2α_t (θ_t - θ*)^T E[g(θ_t) - ḡ(θ_t)]
= ||θ_t - θ*||² + 2α_t (θ_t - θ*)^T ḡ(θ_t) + E[α_t² ||g(θ_t)||²]
≤ ||θ_t - θ*||² - 2α_t N(V(θ_t) - V(θ*)) + O(α_t (m^{-0.5} + ϵ)) + O(α_t² ϵ²) + 1/(1-γ)² O(α_t²),

where the equality uses the condition that the s_t are sampled i.i.d. from μ, which leads to E[g(θ_t) - ḡ(θ_t)] = 0. From now on we fix α_t = α, and this leads to

E[||θ_{t+1} - θ*||²]/(2α) - E[||θ_t - θ*||²]/(2α) ≤ -E[N(V(θ_t) - V(θ*))] + O(m^{-0.5} + ϵ) + O(αϵ²) + 1/(1-γ)² O(α).

Telescoping the sum from 1 to T and dividing both sides by T, we establish the result in the i.i.d. case. To continue with the non-i.i.d. case, we need to consider 2α(θ_t - θ*)^T (g(θ_t) - ḡ(θ_t)) in Eq.(16), which is no longer 0.
We will use τ_mix, the mixing time defined in Eq.(3), to split it into two terms and bound them separately. This idea is from Sun et al. (2018). Our first step is to divide it into two terms:

2α(θ_t - θ*)^T (g(θ_t) - ḡ(θ_t)) = 2α(θ_t - θ_{t-τ_mix})^T (g(θ_t) - ḡ(θ_t)) + 2α(θ_{t-τ_mix} - θ*)^T (g(θ_t) - ḡ(θ_t)).   (17)

To bound 2α(θ_t - θ_{t-τ_mix})^T (g(θ_t) - ḡ(θ_t)), notice that Lemma A.12 implies ||g(θ_t)|| ≤ O(ϵ) + 1/(1-γ) O(1). As a consequence of Eq.(6), we have

||θ_t - θ_{t-τ_mix}|| ≤ ||θ_{t-1} - θ_{t-τ_mix} + αg(θ_{t-1})|| ≤ ||θ_{t-2} - θ_{t-τ_mix} + α[g(θ_{t-1}) + g(θ_{t-2})]|| ≤ ··· ≤ ||θ_{t-τ_mix} - θ_{t-τ_mix} + α[g(θ_{t-1}) + g(θ_{t-2}) + ··· + g(θ_{t-τ_mix})]|| ≤ (O(ϵ) + 1/(1-γ) O(1)) ατ_mix.

Further, the same upper bound that holds for g(θ_t) holds for ḡ(θ_t), since ḡ(θ_t) = E[g(θ_t)]. These observations imply that

2α(θ_t - θ_{t-τ_mix})^T (g(θ_t) - ḡ(θ_t)) ≤ α² τ_mix (O(ϵ²) + 1/(1-γ)² O(1)).

To bound 2α(θ_{t-τ_mix} - θ*)^T (g(θ_t) - ḡ(θ_t)), observe that, conditioned on θ_{t-τ_mix}, the quantity θ_{t-τ_mix} - θ* is deterministic, and E[g(θ_t) - ḡ(θ_t) | θ_{t-τ_mix}] can be bounded by

E[g(θ_t) - ḡ(θ_t) | θ_{t-τ_mix}] ≤ (O(ϵ) + 1/(1-γ) O(1)) max_s ||P^{τ_mix}_{s,:} - μ||_TV ≤ (O(ϵ) + 1/(1-γ) O(1)) ϵ_mix.

The first inequality follows because, conditional on s_{t-τ_mix}, the random variable s_t has the distribution of one row of P^{τ_mix}. Thus, the second term in Eq.(17) can be bounded as

E[2α(θ_{t-τ_mix} - θ*)^T (g(θ_t) - ḡ(θ_t))] = E[E[2α(θ_{t-τ_mix} - θ*)^T (g(θ_t) - ḡ(θ_t)) | θ_{t-τ_mix}]] ≤ (O(ϵ) + 1/(1-γ) O(1)) αϵ_mix.

Thus, coming back to Eq.(17), we obtain that

E[2α(θ_t - θ*)^T (g(θ_t) - ḡ(θ_t))] ≤ α² τ_mix (O(ϵ²) + 1/(1-γ)² O(1)) + (O(ϵ) + 1/(1-γ) O(1)) αϵ_mix.

Now let ϵ_mix = α. By the definition of τ_mix, τ_mix = log(α/C) / log β. Using the fact that log x ≤ x - 1 for all x > 0, we derive

log(α/C) / log β ≤ log(α/C) / (β - 1) = log(C/α) / (1 - β),

where the first inequality is because log β ≤ β - 1 < 0 and log(α/C) ≤ 0. The bound can be rewritten as

E[2α(θ_t - θ*)^T (g(θ_t) - ḡ(θ_t))] ≤ α² (log(C/α)/(1-β)) (O(ϵ²) + 1/(1-γ)² O(1)) + (O(ϵ) + 1/(1-γ) O(1)) α².
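The elementary inequality used in the mixing-time step, log(α/C)/log β ≤ log(C/α)/(1-β) for 0 < β < 1 and 0 < α ≤ C, can be verified numerically. The grid of constants below is illustrative, not taken from the paper.

```python
import math

# Sanity check of: log(alpha/C)/log(beta) <= log(C/alpha)/(1 - beta)
# for 0 < beta < 1 and 0 < alpha <= C (the tau_mix bound with eps_mix = alpha).
for C in (1.0, 2.0, 10.0):
    for beta in (0.1, 0.5, 0.9, 0.99):
        for alpha in (1e-4, 1e-2, 0.5):
            if alpha > C:
                continue
            lhs = math.log(alpha / C) / math.log(beta)   # tau_mix
            rhs = math.log(C / alpha) / (1.0 - beta)
            assert lhs <= rhs + 1e-12, (C, beta, alpha)
print("inequality holds on all test points")
```

The inequality follows because |log β| ≥ 1 - β, so dividing the (positive) quantity |log(α/C)| by |log β| can only shrink it relative to dividing by 1 - β.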
Now we are ready to consider Eq.(16). Taking expectation on both sides,

E[2α(θ_t - θ*)^T g(θ_t) + α² ||g(θ_t)||²] ≤ -2αE[N(V(θ_t) - V(θ*))] + O(α(m^{-0.5} + ϵ)) + O(α²ϵ²) + 1/(1-γ)² O(α²) + α² (log(C/α)/(1-β)) (O(ϵ²) + 1/(1-γ)² O(1)) + (O(ϵ) + 1/(1-γ) O(1)) α²
= -2αE[N(V(θ_t) - V(θ*))] + O(α(m^{-0.5} + ϵ)) + O(α²ϵ²) + 1/(1-γ)² O(α²) + O(α² (log(C/α)/(1-β)) ϵ²) + 1/(1-γ)² O(α² log(C/α)/(1-β)).

Rewriting this inequality,

E[||θ_{t+1} - θ*||²]/(2α) - E[||θ_t - θ*||²]/(2α) ≤ -E[N(V(θ_t) - V(θ*))] + O(m^{-0.5} + ϵ) + O(αϵ²) + 1/(1-γ)² O(α) + O(α (log(C/α)/(1-β)) ϵ²) + 1/(1-γ)² O(α log(C/α)/(1-β)).

Telescoping the sum from 1 to T and dividing both sides by T, we establish the result in the non-i.i.d. case.
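The algorithm analyzed above alternates a TD semi-gradient step with a Euclidean projection onto the ball B(θ_0, ω). The sketch below illustrates that structure on a toy problem; for brevity it substitutes a tabular (linear) value function for the neural network, and the 2-state chain, rewards, discount, and radius are all invented for illustration.

```python
import math, random

random.seed(0)

# Minimal sketch of the projected TD update theta <- Proj(theta + alpha * g),
# with a tabular value function V(s, theta) = theta[s] standing in for the
# neural network.  All constants here are illustrative.
gamma, omega = 0.9, 10.0
P = [[0.5, 0.5], [0.3, 0.7]]          # transition matrix of a toy 2-state chain
r = [1.0, 0.0]                        # rewards
theta0 = [0.0, 0.0]                   # initial point; projection ball B(theta0, omega)
theta = list(theta0)

def project(th, th0, radius):
    """Euclidean projection onto the ball B(th0, radius)."""
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(th, th0)))
    if d <= radius:
        return th
    scale = radius / d
    return [b + scale * (a - b) for a, b in zip(th, th0)]

s = 0
for t in range(25000):
    alpha = 0.1 if t < 5000 else 0.01                   # crude two-phase stepsize
    s_next = 0 if random.random() < P[s][0] else 1
    delta = r[s] + gamma * theta[s_next] - theta[s]     # TD error
    theta[s] += alpha * delta                            # semi-gradient step
    theta = project(theta, theta0, omega)                # projection step
    s = s_next

# For this chain the true value function solves (I - gamma*P) V = r,
# giving V approximately (4.51, 3.29).
print(theta)
```

Since the true value function here lies inside B(θ_0, ω), the projection is eventually inactive and the iterates settle near the fixed point, matching the constant-radius regime the theorem studies.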

B ANALYSIS OF UNPROJECTED TD LEARNING - PROOF OF OUR CONJECTURE FOR THE SINGLE HIDDEN LAYER CASE

B.1 RESULT WITHOUT PROJECTION

Here we prove that when the distance from the optimal solution is O(√m), unprojected TD learning converges with high probability. We begin by standardizing notation. Recall that g(θ_t) is defined by g(θ_t) = ∇_θ V(s, θ_t) δ_t, where δ_t = r(s) + γV(s′, θ_t) - V(s, θ_t). Without projection, there is only one step in the algorithm, which is θ_{t+1} = θ_t + α_t g(θ_t). We call this algorithm Non Projected Neural TD Learning. Similarly to the way we proceeded earlier in this paper, Non Projected Neural TD Learning with the mean-path update is given by θ_{t+1} = θ_t + α_t ḡ(θ_t).

We will need to make the following assumption on exact approximation, which is stronger than what we assumed in the projected case.

Assumption B.1. There exists some θ* such that V(θ*) = V*.

For simplicity of notation, we also introduce the following. Recall that the smallest singular value λ_min of a matrix A is defined as λ_min(A) = min_x ||Ax|| / ||x||. The smallest (2, D)-singular value σ^{2,D}_min of a matrix A is defined analogously: σ^{2,D}_min(A) = min_x ||Ax||_D / ||x||. For brevity, we will use σ^{2,D}_min to denote the (2, D)-singular value of ∇_θ V(θ*). We will always suppose each s_t is sampled i.i.d. from μ. We now introduce the following result.

Theorem B.1. Consider Non Projected Neural TD Learning using a one-hidden-layer neural network with a smooth and Lipschitz activation function, suppose each s_t is sampled i.i.d. from μ, and let the stepsize be α_t = 1/(λ(t+1)). Further, let A be the event

A = { sup_t X_t < [2(1-γ)(σ^{2,D}_min)²λ - 3l⁴(1+γ²)] / [4(1+γ)c_0 l] · m^{0.5} },

and assume

E[||θ_0 - θ*||²] < [2(1-γ)(σ^{2,D}_min)²λ - 3l⁴(1+γ²)]² / [64(1+γ)² c_0² l² λ²] · mδ - 2.

Then A happens with probability at least 1 - δ, and the sequence E[||θ_t - θ*||² | A] converges to 0.
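The one-hidden-layer parameterization used in this section, V(s, θ) = (1/√m) Σ_r b_r σ(θ_r^T s), satisfies the Lipschitz and smoothness bounds established in Lemma B.2 below. The sketch checks them numerically for σ = tanh, for which the Lipschitz constant is l = 1 and the smoothness constant is c_0 = max|σ′′| = 4/(3√3); the dimensions and random draws are illustrative.

```python
import math, random
random.seed(1)

# Numerical illustration of the gradient bounds for
# V(s, theta) = (1/sqrt(m)) * sum_r b_r * tanh(theta_r . s).
def grad_V(theta, b, s):
    m = len(b)
    g = []
    for r in range(m):
        z = sum(t * x for t, x in zip(theta[r], s))
        dp = 1.0 - math.tanh(z) ** 2          # sigma'(z), bounded by l = 1
        g.extend(b[r] * dp * x / math.sqrt(m) for x in s)
    return g

def norm(v):
    return math.sqrt(sum(x * x for x in v))

d, m = 4, 64
c0 = 4.0 / (3.0 * math.sqrt(3.0))             # max |tanh''|
s = [random.uniform(-0.5, 0.5) for _ in range(d)]    # guarantees ||s|| <= 1
b = [random.choice((-1.0, 1.0)) for _ in range(m)]
th1 = [[random.gauss(0, 1) for _ in range(d)] for _ in range(m)]
th2 = [[random.gauss(0, 1) for _ in range(d)] for _ in range(m)]

g1, g2 = grad_V(th1, b, s), grad_V(th2, b, s)
diff_theta = norm([x - y for r1, r2 in zip(th1, th2) for x, y in zip(r1, r2)])

# Smoothness: ||grad V(s,th1) - grad V(s,th2)|| <= c0 * m^{-1/2} * ||th1 - th2||
assert norm([a - c for a, c in zip(g1, g2)]) <= c0 / math.sqrt(m) * diff_theta + 1e-9
# Lipschitzness: ||grad V(s,th)|| <= l = 1
assert norm(g1) <= 1.0 + 1e-9
```

Both bounds hold deterministically here, mirroring the m^{-0.5} scaling of the smoothness constant that drives the analysis in this section.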

B.2 USEFUL LEMMAS

Lemma B.1. (θ_t - θ*)^T ḡ(θ_t) can be rewritten as (θ_t - θ*)^T ḡ(θ_t) = ḡ_1(θ_t) + ḡ_2(θ_t) + ḡ_3(θ_t), where ḡ_1(θ_t), ḡ_2(θ_t), ḡ_3(θ_t) are defined as follows:

ḡ_1(θ_t) = (θ_t - θ*)^T ∇_θ V(θ*)^T D(γP - I) ∇_θ V(θ*)(θ_t - θ*),   (18)
ḡ_2(θ_t) = (θ_t - θ*)^T (∇_θ V(θ_t) - ∇_θ V(θ*))^T D(γP - I)(V(θ_t) - V(θ*)),   (19)
ḡ_3(θ_t) = (θ_t - θ*)^T ∇_θ V(θ*)^T D(γP - I)(∇_θ V(θ_mid2) - ∇_θ V(θ*))(θ_t - θ*).   (20)

Here, λ_2 ∈ [0, 1] is a scalar and θ_mid2 = λ_2 θ_t + (1 - λ_2)θ* is a vector such that

(θ_t - θ*)^T ∇_θ V(θ*)^T D(γP - I) ∇_θ V(θ_mid2)(θ_t - θ*) = (θ_t - θ*)^T ∇_θ V(θ*)^T D(γP - I)(V(θ_t) - V(θ*)).

Proof. First, we divide (θ_t - θ*)^T ḡ(θ_t) into two parts:

(θ_t - θ*)^T ḡ(θ_t) = (θ_t - θ*)^T ∇_θ V(θ_t)^T D(γP - I)(V(θ_t) - V(θ*))
= (θ_t - θ*)^T ∇_θ V(θ*)^T D(γP - I)(V(θ_t) - V(θ*)) + (θ_t - θ*)^T (∇_θ V(θ_t) - ∇_θ V(θ*))^T D(γP - I)(V(θ_t) - V(θ*))
= (θ_t - θ*)^T ∇_θ V(θ*)^T D(γP - I)(V(θ_t) - V(θ*)) + ḡ_2(θ_t),

where the first equality is by the definition of ḡ(θ_t) in Eq.(7). Now let (θ_t - θ*)^T ∇_θ V(θ*)^T D(γP - I) be the vector e in Lemma A.2. There exist a scalar λ_2 ∈ [0, 1] and a vector θ_mid2 = λ_2 θ_t + (1 - λ_2)θ* such that

(θ_t - θ*)^T ∇_θ V(θ*)^T D(γP - I) ∇_θ V(θ_mid2)(θ_t - θ*) = (θ_t - θ*)^T ∇_θ V(θ*)^T D(γP - I)(V(θ_t) - V(θ*)).

Using this fact, (θ_t - θ*)^T ∇_θ V(θ*)^T D(γP - I)(V(θ_t) - V(θ*)) can be decomposed as

(θ_t - θ*)^T ∇_θ V(θ*)^T D(γP - I)(V(θ_t) - V(θ*)) = (θ_t - θ*)^T ∇_θ V(θ*)^T D(γP - I) ∇_θ V(θ*)(θ_t - θ*) + (θ_t - θ*)^T ∇_θ V(θ*)^T D(γP - I)(∇_θ V(θ_mid2) - ∇_θ V(θ*))(θ_t - θ*) = ḡ_1(θ_t) + ḡ_3(θ_t).

Lemma B.2. If the activation function is l-Lipschitz and c_0-smooth, then for any s, θ_1 and θ_2, the inequalities ||∇_θ V(s, θ_1) - ∇_θ V(s, θ_2)|| ≤ c_0 m^{-0.5} ||θ_1 - θ_2|| and ||∇_θ V(s, θ_1)|| ≤ l hold, where c_0 is a scalar that is independent of m and θ.

Proof.
Using the definition of the neural network in Section 2.4, a one-hidden-layer neural network simplifies to V(s, θ) = (1/√m) Σ_{r=1}^m b_r σ(θ_r^T s). It is easy to see that ∇_θ V(s, θ) = (1/√m)[b_1 σ′(θ_1^T s) s^T, ···, b_m σ′(θ_m^T s) s^T]^T. Suppose the activation function σ is c_0-smooth, that is, for any x and y, |σ′(x) - σ′(y)| ≤ c_0 |x - y|. This means

||∇_θ V(s, θ_1) - ∇_θ V(s, θ_2)||² = (1/m) ||[b_1(σ′(θ_1^{1T} s) - σ′(θ_2^{1T} s)) s^T, ···, b_m(σ′(θ_1^{mT} s) - σ′(θ_2^{mT} s)) s^T]||²
≤ (1/m) Σ_r ||s||² (σ′(θ_1^{rT} s) - σ′(θ_2^{rT} s))²
≤ (1/m) Σ_r ||s||² c_0² (θ_1^{rT} s - θ_2^{rT} s)²
≤ (1/m) Σ_r ||s||⁴ c_0² ||θ_1^r - θ_2^r||²
≤ (c_0²/m) ||θ_1 - θ_2||²,

which proves the first part of the lemma. For the second part, by Lipschitzness,

||∇_θ V(s, θ_1)||² = (1/m) ||[b_1 σ′(θ_1^{1T} s) s^T, ···, b_m σ′(θ_1^{mT} s) s^T]||² ≤ (1/m) Σ_r ||s||² (σ′(θ_1^{rT} s))² ≤ (1/m) Σ_r l² = l².

This establishes the second part of the lemma.

Lemma B.3. The following inequality holds: N(∇_θ V(θ*)(θ - θ*)) / ||θ - θ*||² ≥ (1 - γ)(σ^{2,D}_min)².

Proof. To begin with, the definition of N gives

N(∇_θ V(θ*)(θ - θ*)) = (1 - γ)||∇_θ V(θ*)(θ - θ*)||²_D + γ||∇_θ V(θ*)(θ - θ*)||²_Dir ≥ (1 - γ)||∇_θ V(θ*)(θ - θ*)||²_D.

Using the definition of σ^{2,D}_min,

N(∇_θ V(θ*)(θ - θ*)) / ||θ - θ*||² ≥ (1 - γ) ||∇_θ V(θ*)(θ - θ*)||²_D / ||θ - θ*||² ≥ (1 - γ)(σ^{2,D}_min)²,

which establishes the result.

Lemma B.4. If a nonnegative sequence {X_t} satisfies X_{t+1} ≤ (1 - c/(t+1)) X_t + b/(t+1)² for some b, c > 0, then the sequence {X_t} converges to 0.

Proof. Recursively applying the relation between X_{t+1} and X_t, we can derive the following:

X_t ≤ b/t² + Σ_{i=1}^{t-1} (b/i²) ∏_{j=i+1}^t (1 - c/j) + ∏_{j=1}^t (1 - c/j) X_0.   (21)

The first term in Eq.(21) clearly goes to 0 as t goes to infinity. Now let us consider the term ∏_{j=i+1}^t (1 - c/j).
Taking the logarithm converts the product into a sum:

Σ_{j=i+1}^t ln(1 - c/j) ≤ -Σ_{j=i+1}^t c/j ≤ c(ln(i+1) - ln(t+1)),

where the first inequality uses the fact that ln(1 + x) ≤ x and the second inequality uses Σ_{j=i+1}^t 1/j ≥ ln(t+1) - ln(i+1). Exponentiating, ∏_{j=i+1}^t (1 - c/j) ≤ e^{c(ln(i+1) - ln(t+1))} = (i+1)^c/(t+1)^c. By setting i = 0, this means the third term in Eq.(21) goes to 0 as t goes to infinity. Now consider Σ_{i=1}^{t-1} (b/i²) ∏_{j=i+1}^t (1 - c/j); we obtain

Σ_{i=1}^{t-1} (b/i²) ∏_{j=i+1}^t (1 - c/j) ≤ (b/(t+1)^c) Σ_{i=1}^{t-1} (i+1)^c/i² ≤ (4b/(t+1)^c) Σ_{i=1}^{t-1} (i+1)^{c-2} ≤ 4b/(t+1)^c + (4b/(t+1)^c) ∫_1^t (i+1)^{c-2} di,

where the second inequality uses the fact that (i+1)²/i² ≤ 4 for all positive integers i, and the third inequality combines the two facts Σ_{i=2}^{t-1} (i+1)^{c-2} ≤ ∫_2^t (i+1)^{c-2} di ≤ ∫_1^t (i+1)^{c-2} di given c ≥ 2, and Σ_{i=2}^{t-1} (i+1)^{c-2} ≤ ∫_1^{t-1} (i+1)^{c-2} di ≤ ∫_1^t (i+1)^{c-2} di given c ≤ 2. If c ≠ 1, then

Σ_{i=1}^{t-1} (b/i²) ∏_{j=i+1}^t (1 - c/j) ≤ 4b/(t+1)^c + (4b/(t+1)^c) · ((t+1)^{c-1} - 2^{c-1})/(c - 1).

If c = 1, then Σ_{i=1}^{t-1} (b/i²) ∏_{j=i+1}^t (1 - c/j) ≤ (b/t)(1 + ln(t+1) - ln 2).

In both cases, we can easily argue that the second term in Eq.(21) goes to 0. We have now proved that all three terms on the right hand side of Eq.(21) go to 0 as t goes to infinity, which directly implies that {X_t} converges to 0.

Turning to the update of Non Projected Neural TD Learning,

||θ_{t+1} - θ*||² = ||θ_t - θ* + α_t g(θ_t)||² = ||θ_t - θ*||² + 2α_t (θ_t - θ*)^T g(θ_t) + α_t² ||g(θ_t)||².

Given θ_t, the only randomness is from s_t. Taking expectation on both sides and using the fact that E[g(θ_t)] = ḡ(θ_t) (since we assume the s_t are sampled i.i.d. from μ), we obtain

E[||θ_{t+1} - θ*||² | θ_t] = ||θ_t - θ*||² + 2α_t (θ_t - θ*)^T ḡ(θ_t) + α_t² E[||g(θ_t)||² | θ_t].   (22)

First, we consider 2α_t (θ_t - θ*)^T ḡ(θ_t). Lemma B.1 allows us to divide it into several parts and bound them respectively: 2α_t (θ_t - θ*)^T ḡ(θ_t) = 2α_t (ḡ_1(θ_t) + ḡ_2(θ_t) + ḡ_3(θ_t)).
To bound ḡ_1(θ_t),

ḡ_1(θ_t) = (θ_t - θ*)^T ∇_θ V(θ*)^T D(γP - I) ∇_θ V(θ*)(θ_t - θ*) = -N(∇_θ V(θ*)(θ_t - θ*)) ≤ -(1 - γ)(σ^{2,D}_min)² ||θ_t - θ*||²,

where the first equality is by the definition of ḡ_1(θ_t) in Eq.(18), the second equality is by setting f = ∇_θ V(θ*)(θ_t - θ*) in Lemma A.1, and the inequality is by Lemma B.3.

To bound ḡ_2(θ_t),

ḡ_2(θ_t) = (θ_t - θ*)^T (∇_θ V(θ_t) - ∇_θ V(θ*))^T D(γP - I)(V(θ_t) - V(θ*))
≤ γ(θ_t - θ*)^T (∇_θ V(θ_t) - ∇_θ V(θ*))^T DP(V(θ_t) - V(θ*)) + (θ_t - θ*)^T (∇_θ V(θ_t) - ∇_θ V(θ*))^T D(V(θ_t) - V(θ*))
≤ (1 + γ) c_0 l m^{-0.5} ||θ_t - θ*||³,

where the equality is by the definition of ḡ_2(θ_t) in Eq.(19), the first inequality is by the triangle inequality |x + y| ≤ |x| + |y|, and the second inequality is by setting x to be (∇_θ V(θ_t) - ∇_θ V(θ*))(θ_t - θ*) and y to be V(θ_t) - V(θ*) in Lemma A.5, with each entry of x, y bounded by c_0 m^{-0.5} ||θ_t - θ*||² and l ||θ_t - θ*||, respectively, using Lemma B.2.

To bound ḡ_3(θ_t),

ḡ_3(θ_t) = (θ_t - θ*)^T ∇_θ V(θ*)^T D(γP - I)(∇_θ V(θ_mid2) - ∇_θ V(θ*))(θ_t - θ*)
≤ γ(θ_t - θ*)^T ∇_θ V(θ*)^T DP(∇_θ V(θ_mid2) - ∇_θ V(θ*))(θ_t - θ*) + (θ_t - θ*)^T ∇_θ V(θ*)^T D(∇_θ V(θ_mid2) - ∇_θ V(θ*))(θ_t - θ*)
≤ (1 + γ) c_0 l m^{-0.5} ||θ_t - θ*||³,

where the equality is by the definition of ḡ_3(θ_t) in Eq.(20), the first inequality is by the triangle inequality, and the second inequality is by setting x to be (∇_θ V(θ_mid2) - ∇_θ V(θ*))(θ_t - θ*) and y to be ∇_θ V(θ*)(θ_t - θ*) in Lemma A.5, with each entry of x, y bounded by c_0 m^{-0.5} ||θ_t - θ*||² and l ||θ_t - θ*||, respectively, using Lemma B.2.

Combining the above facts, we now have the bound for the second term on the right hand side of Eq.(22):

2α_t (θ_t - θ*)^T ḡ(θ_t) ≤ -2α_t (1 - γ)(σ^{2,D}_min)² ||θ_t - θ*||² + 4α_t (1 + γ) c_0 l m^{-0.5} ||θ_t - θ*||³.

Second, we consider E[α_t² ||g(θ_t)||² | θ_t] in Eq.(22).
For simplicity, define f(s) = V(s, θ*) - V(s, θ_t). Since we are using a one-hidden-layer neural network to approximate V(s, θ),

|f(s)|² = |V(s, θ*) - V(s, θ_t)|² = (1/m)(Σ_r b_r(σ(θ_t^r s) - σ(θ_*^r s)))² ≤ Σ_r b_r² ||σ(θ_t^r s) - σ(θ_*^r s)||² ≤ Σ_r l² ||θ_t^r s - θ_*^r s||² ≤ Σ_r l² ||θ_t^r - θ_*^r||² = l² ||θ_t - θ*||².

Recall that V(s, θ*) satisfies Eq.(1), which is V(s, θ*) = r(s) + γ Σ_{s′′} P(s′′|s) V(s′′, θ*), and that g(θ_t) is defined to be g(θ_t) = ∇_θ V(s, θ_t)[r(s) + γV(s′, θ_t) - V(s, θ_t)]. The latter immediately implies that g(θ_t) is actually a random variable and implicitly relies on the state s. Using these facts, we obtain

g(θ_t) = ∇_θ V(s, θ_t)[r(s) + γV(s′, θ_t) - V(s, θ_t)] = ∇_θ V(s, θ_t)[f(s) - γ Σ_{s′′} P(s′′|s) f(s′) + γ Σ_{s′′} P(s′′|s)(V(s′, θ*) - V(s′′, θ*))].

By Eq.(2) we can obtain the following quick result: |V(s′, θ*) - V(s′′, θ*)| ≤ 2r_max/(1 - γ). This leads to the following:

||g(θ_t)||² = ||∇_θ V(s, θ_t)[f(s) - γ Σ_{s′′} P(s′′|s) f(s′) + γ Σ_{s′′} P(s′′|s)(V(s′, θ*) - V(s′′, θ*))]||²
≤ 3||∇_θ V(s, θ_t)||² [|f(s)|² + (γ Σ_{s′′} P(s′′|s) f(s′))² + (γ Σ_{s′′} P(s′′|s)(V(s′, θ*) - V(s′′, θ*)))²]
≤ 3||∇_θ V(s, θ_t)||² [|f(s)|² + γ² Σ_{s′′} P(s′′|s)|f(s′)|² + γ² Σ_{s′′} P(s′′|s)|V(s′, θ*) - V(s′′, θ*)|²]
≤ 3(1 + γ²) l⁴ ||θ_t - θ*||² + 12γ² l² r²_max/(1 - γ)²,   (23)

where the first inequality uses the fact that (a + b + c)² ≤ 3(a² + b² + c²), the second inequality is by Jensen's inequality, and the third inequality simply uses Lemma B.2 to bound ||∇_θ V(s, θ_t)||. Further,

α_t² E[||g(θ_t)||² | θ_t] ≤ 3α_t² (1 + γ²) l⁴ ||θ_t - θ*||² + 12α_t² γ² l² r²_max/(1 - γ)².
Now let us go back to Eq.(22), where the second and third terms on the right hand side can be bounded by

2α_t (θ_t - θ*)^T ḡ(θ_t) + α_t² E[||g(θ_t)||² | θ_t]
≤ -2α_t (1 - γ)(σ^{2,D}_min)² ||θ_t - θ*||² + 4α_t (1 + γ) c_0 l m^{-0.5} ||θ_t - θ*||³ + 3α_t² (1 + γ²) l⁴ ||θ_t - θ*||² + 12α_t² γ² l² r²_max/(1 - γ)²
= -(2α_t (1 - γ)(σ^{2,D}_min)² - 3α_t² (1 + γ²) l⁴) ||θ_t - θ*||² + 4α_t (1 + γ) c_0 l m^{-0.5} ||θ_t - θ*||³ + 12α_t² γ² l² r²_max/(1 - γ)².

This means

E[||θ_{t+1} - θ*||² | θ_t] ≤ (1 - 2α_t (1 - γ)(σ^{2,D}_min)² + 3α_t² (1 + γ²) l⁴) ||θ_t - θ*||² + 4α_t (1 + γ) c_0 l m^{-0.5} ||θ_t - θ*||³ + 12α_t² γ² l² r²_max/(1 - γ)².   (24)

For simplicity, we define the following notation:

X_t = ||θ_t - θ*||, C = 12γ² l² r²_max/(λ²(1 - γ)²), α_t = 1/(λ(t + 1)),
a_t = 2α_t (1 - γ)(σ^{2,D}_min)² - 3α_t² (1 + γ²) l⁴,
b_t = 4α_t (1 + γ) c_0 l m^{-0.5},
c_t = 12α_t² γ² l² r²_max/(1 - γ)² = C/(t + 1)²,
d_t = C/t (while d_0 is defined to be d_0 = 2).

We can rewrite Eq.(24) as E[X²_{t+1} | X_t] ≤ (1 - a_t + b_t X_t) X²_t + c_t, while the condition E[||θ_0 - θ*||²] < (2(1-γ)(σ^{2,D}_min)²λ - 3l⁴(1+γ²))² / (64(1+γ)² c_0² l² λ²) · mδ - 2 is just E[X_0²] ≤ (a_0²/(4b_0²)) δ - d_0. Under this notation, an important fact is that c_t ≤ d_t - d_{t+1}. This can be shown easily, since 1/(t+1)² ≤ 1/t - 1/(t+1). We define the stopping time T = inf_t {X_t ≥ a_0/(2b_0)} and the sequence {Y_t} by

Y_t = X²_t + d_t if t ≤ T, and Y_t = Y_{t-1} if t > T.

We first claim that the sequence {Y_t} is a supermartingale. We show this as follows. Suppose T = 0; then Y_t = Y_0 for all t, and in this trivial case {Y_t} is obviously a supermartingale. Now assume T > 0. When t = 0, we have

E[X_1² | X_0] ≤ (1 - a_0/2) X_0² + c_0 ≤ X_0² + c_0 ≤ X_0² + d_0 - d_1,

which means E[Y_1 | Y_0] ≤ Y_0. For t ≤ T, by the definition of T we know that X_t ≤ a_0/(2b_0), which implies that X_t ≤ a_t/(2b_t) (this is because, by the definitions of a_t and b_t, h(t) = a_t/(2b_t) is increasing in t).
Hence, E[X²_{t+1} | X_t] ≤ (1 - a_t/2) X²_t + c_t ≤ X²_t + c_t ≤ X²_t + d_t - d_{t+1}, which means E[Y_{t+1} | Y_t] ≤ Y_t. For t > T, by the definition of {Y_t} we know that Y_t = Y_T. Combining all the above facts, we conclude that the sequence {Y_t} is a supermartingale.

Next, we claim that X_t < a_0/(2b_0) for all t holds with probability at least 1 - δ. This can be shown as follows. Let A be the event

A = {sup_t X_t < a_0/(2b_0)} = {X_0 < a_0/(2b_0), X_1 < a_0/(2b_0), ···}.

To compute P(A), notice that

P(X_0 < a_0/(2b_0), X_1 < a_0/(2b_0), ···) = P(Y_0 < a_0²/(4b_0²) + d_0, Y_1 < a_0²/(4b_0²) + d_1, ···).

We can easily check that T is also a stopping time for {Y_t}. In order to use Lemma B.5, we need to check the condition that |Y_{τ∧T}| ≤ c for some constant c; we split this into two cases below. The second inequality below uses the fact that Y_T ≥ a_0²/(4b_0²) + d_T. Hence,

P(T ≤ τ) ≤ E[Y_0] / (a_0²/(4b_0²) + d_T) = (E[X_0²] + d_0) / (a_0²/(4b_0²) + d_T) ≤ (E[X_0²] + d_0) / (a_0²/(4b_0²)) ≤ δ,

where the equality is because E[Y_0] = E[X_0²] + d_0, the second inequality is because d_t is nonnegative, and the third inequality is because of the condition E[X_0²] ≤ (a_0²/(4b_0²)) δ - d_0. This result holds for all τ, so we can let τ → ∞ and obtain P(A) ≥ 1 - δ.

Actually, using again that h(t) = a_t/(2b_t) is monotonically increasing in t, the event A implies X_t ≤ a_0/(2b_0) ≤ (1 - γ) m^{0.5} (σ^{2,D}_min)² / (2(1 + γ) c_0 l) for all t. Hence, we conclude

E[X²_{t+1} | X_t, A] ≤ (1 - (1 - γ)(σ^{2,D}_min)²/(λ(t + 1))) X²_t + 3(1 - γ)²(1 + γ²) l² (σ^{2,D}_min)² m / (8λ²(1 + γ)² c_0² (t + 1)²) + 12γ² l² r²_max / ((1 - γ)² λ² (t + 1)²).

Taking expectation on both sides, we derive

E[X²_{t+1} | A] ≤ (1 - (1 - γ)(σ^{2,D}_min)²/(λ(t + 1))) E[X²_t | A] + 3(1 - γ)²(1 + γ²) l² (σ^{2,D}_min)² m / (8λ²(1 + γ)² c_0² (t + 1)²) + 12γ² l² r²_max / ((1 - γ)² λ² (t + 1)²).

By Lemma B.4, we conclude that the sequence E[X²_t | A] converges to 0.
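The last step of the proof invokes Lemma B.4. Its conclusion can be illustrated numerically by iterating the recursion with equality; the constants b, c and the starting point X_0 below are arbitrary choices.

```python
# Numerical illustration of Lemma B.4: a nonnegative sequence with
#   X_{t+1} <= (1 - c/(t+1)) X_t + b/(t+1)^2,  b, c > 0,
# converges to 0.
def run(x0, b, c, T):
    x = x0
    for t in range(T):
        x = (1.0 - c / (t + 1)) * x + b / (t + 1) ** 2
        x = max(x, 0.0)   # keep the sequence nonnegative (c/(t+1) can exceed 1 at t = 0)
    return x

x_early = run(x0=5.0, b=2.0, c=1.5, T=100)
x_late = run(x0=5.0, b=2.0, c=1.5, T=100000)
assert x_late < x_early < 5.0
assert x_late < 1e-3
print(x_early, x_late)
```

For c > 1 the iterates decay on the order of 1/t, matching the 4b/((c-1)(t+1)) term in the proof; for c ≤ 1 the decay is the slower t^{-c} rate.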



CONCLUSIONS

We have provided an analysis of Projected Neural TD for policy evaluation. We have shown that if the projection set is a ball of constant radius B(θ_0, ω) around the initial point θ_0, the final approximation error is O(ϵ + 1/√m), where ϵ is the approximation quality of the best neural network in B(θ_0, ω) and m is the width of all hidden layers in the network. Our result improves on previous works because it does not require taking the radius ω to decay with m, does not make any regularity or representability assumptions on the policy, applies to any number of hidden layers, and bounds the error associated with the neural approximation without any kind of linearization around the initial condition. We conjecture that unprojected neural TD converges to the optimal solution with high probability provided it begins within a radius of O(√m) away from a point which exactly describes the true value function. This conjecture is proved for single hidden layer networks. Finally, we have demonstrated empirically that these changes can make a substantial difference in performance.



Figure 2: Averaged Bellman error.

Figure 3: Distance to initialization divided by the projection radius.

Figure 4: Gradient differences ||∇ θ V (s t , θ t ) -∇ θ V (s t , θ 0 )|| as a measure of nonlinearity of the neural network.

∏_{j=i+1}^t (1 - c/j) ≤ e^{c(ln(i+1) - ln(t+1))} = (i+1)^c/(t+1)^c.

Lemma B.5 (Optional Stopping Theorem). Suppose {X_t} is a supermartingale and T is a stopping time. If there exists a constant c such that |X_{τ∧T}| ≤ c holds for all τ, then we have E[X_{τ∧T}] ≤ E[X_0]. This is also called Doob's Optional Stopping Theorem; see Theorem 10.10 of Williams (1991).

B.3 PROOF OF THEOREM B.1

Consider Non Projected Neural TD Learning.

First, if τ < T, then |Y_{τ∧T}| = |Y_τ|. By the definition of T we know that |Y_τ| = X²_τ + d_τ ≤ a_0²/(4b_0²) + d_0, using the fact that {d_t} is a decreasing sequence and d_0 = 2. Second, if τ ≥ T, then |Y_{τ∧T}| = |Y_T|. Recall that the update of the algorithm is θ_{t+1} = θ_t + α_t g(θ_t), which implies ||θ_{t+1} - θ*|| ≤ ||θ_t - θ*|| + α_t ||g(θ_t)||. Letting t = T - 1 and using the X_t notation, this can be rewritten as |X_T| ≤ |X_{T-1}| + α_{T-1} ||g(θ_{T-1})||. By the definition of T, we know that |X_{T-1}| ≤ a_0/(2b_0). Moreover, Eq.(23) gives us a bound for ||g(θ_{T-1})||, which is ||g(θ_{T-1})||² ≤ 3(1 + γ²)l⁴ X²_{T-1} + 12γ²l²r²_max/(1 - γ)². Finally, α_{T-1} = 1/(λT) is obviously bounded. All these facts lead to the result that |Y_T| is bounded.

Combining the above two cases, we are now eligible to use Lemma B.5. Hence, we obtain E[Y_{τ∧T}] ≤ E[Y_0]. On the other hand, E[Y_{τ∧T}] can be expanded as

E[Y_{τ∧T}] = E[Y_{τ∧T} | T ≤ τ]P(T ≤ τ) + E[Y_{τ∧T} | T > τ]P(T > τ) = E[Y_T]P(T ≤ τ) + E[Y_τ]P(T > τ).

Combining these two facts,

E[Y_0] ≥ E[Y_T]P(T ≤ τ) ≥ (a_0²/(4b_0²) + d_T)P(T ≤ τ).

P(T > τ) = 1 - P(T ≤ τ) ≥ 1 - δ.

Assumption 2.3. Suppose for any i ∈ {1, 2, . . . , n}, ||s i || ≤ 1 where || • || stands for the l 2 -norm.


