THE IMPACT OF APPROXIMATION ERRORS ON WARM-START REINFORCEMENT LEARNING: A FINITE-TIME ANALYSIS Anonymous authors Paper under double-blind review

Abstract

Warm-Start reinforcement learning (RL), aided by a prior policy obtained from offline training, is emerging as a promising RL approach for practical applications. Recent empirical studies have demonstrated that the performance of Warm-Start RL improves quickly in some cases but stagnates in others, which calls for a fundamental understanding, especially when function approximation is used. To fill this void, we take a finite-time analysis approach to quantify the impact of approximation errors on the learning performance of Warm-Start RL. Specifically, we consider the widely used Actor-Critic (A-C) method with a prior policy. We first quantify the approximation errors in the Actor update and the Critic update, respectively. Next, we cast the Warm-Start A-C algorithm as Newton's method with perturbation, and study the impact of the approximation errors on the finite-time learning performance with inaccurate Actor/Critic updates. Under some general technical conditions, we obtain lower bounds on the sub-optimality gap of the Warm-Start A-C algorithm to quantify the impact of the bias and error propagation. We also derive upper bounds, which provide insights on achieving the desired finite-time learning performance in the Warm-Start A-C algorithm.

1. INTRODUCTION

Online reinforcement learning (RL) (Kaelbling et al., 1996; Sutton & Barto, 2018) often faces the formidable challenge of high sample complexity and intensive computational cost (Kumar et al., 2020; Xie et al., 2021), which hinders its applicability in real-world tasks. Indeed, this is the case in portfolio management (Choi et al., 2009), vehicle control (Wu et al., 2017; Shalev-Shwartz et al., 2016) and other time-sensitive settings (Li, 2017; García & Fernández, 2015). To tackle this challenge, Warm-Start RL has recently garnered much attention (Nair et al., 2020; Gelly & Silver, 2007; Uchendu et al., 2022) by enabling online policy adaptation from an initial policy pre-trained using offline data (e.g., via behavior cloning or offline RL). One main insight of Warm-Start RL is that online learning can be significantly accelerated thanks to the bootstrapping by an initial policy. Despite the encouraging empirical successes (Silver et al., 2017; 2018; Uchendu et al., 2022), a fundamental understanding of the learning performance of Warm-Start RL is lacking, especially in practical settings with function approximation by neural networks.

In this work, we focus on the widely used Actor-Critic (A-C) method (Grondman et al., 2012; Peters & Schaal, 2008), which combines the merits of both policy iteration and value iteration approaches (Sutton & Barto, 2018) and has great potential for RL applications (Uchendu et al., 2022). Notably, in the framework of abstract dynamic programming (ADP) (Bertsekas, 2022a), the policy iteration method (Sutton et al., 1999) has been studied extensively for warm-start learning under the assumption of accurate updates. In such a setting, policy iteration can be regarded as a second-order method in convex optimization (Grand-Clément, 2021) from the perspective of ADP, and can achieve a superlinear convergence rate (Santos & Rust, 2004; Puterman & Brumelle, 1979; Boyd et al., 2004).
Nevertheless, when the A-C method is implemented in practical applications, approximation errors are inevitable in the Actor/Critic updates due to many implementation issues, including function approximation using neural networks, the finite sample size, and the finite number of gradient iterations. Moreover, the error propagation from iteration to iteration may exacerbate the 'slowing down' of the convergence and have an intricate impact therein. Clearly, the (stochastic) accumulated errors may throttle the convergence rate significantly and degrade the learning performance dramatically (Fujimoto et al., 2018; Uehara et al., 2021; Dalal et al., 2020; Doan et al., 2019). Thus, it is of great importance to characterize the learning performance of Warm-Start RL in practical scenarios. The primary objective of this study is to take steps to build a fundamental understanding of the impact of the approximation errors on the finite-time sub-optimality gap for the Warm-Start A-C algorithm, i.e., whether and under what conditions online learning (e.g., A-C) can be significantly accelerated by a warm-start policy from offline RL.

To this end, we address the question in two steps:

(1) We first focus on the characterization of the approximation errors via finite-time analysis, based on which we quantify their impact on the sub-optimality gap of the A-C algorithm in Warm-Start RL. In particular, we analyze the A-C algorithm in a more realistic setting where the samples are Markovian in the rollout trajectories for the Critic update (different from the widely used i.i.d. assumption). Further, we consider that the Actor update and the Critic update take place on a single time scale, so that the time-scale decomposition is not applicable to the finite-time analysis here. We tackle these challenges using recent advances on Bernstein's Inequality for Markovian samples (Jiang et al., 2018; Fan et al., 2021b).
By delving into the coupling due to the interleaved updates of the Actor and the Critic, we provide upper bounds on the approximation errors in the Critic update and the Actor update of online exploration, respectively, from which we pinpoint the root causes of the approximation errors.

(2) We analyze the impact of the approximation errors on the finite-time learning performance of Warm-Start A-C. Based on the approximation error characterization, we treat the Warm-Start A-C algorithm as Newton's method with perturbation. For the case when the approximation errors are biased, we derive lower bounds on the sub-optimality gap, which reveal that even with a sufficiently good warm-start, the performance gap of online policy adaptation to the optimal policy is still bounded away from zero when the biases are not negligible. Further, we also derive upper bounds, which shed light on designing Warm-Start A-C to achieve the desired finite-time learning performance. Experimental results that further elucidate our findings are presented in Appendix K.

Related Work. (Warm-Start RL) AlphaZero (Silver et al., 2017) is one of the most remarkable successes in Warm-Start RL. In a line of works (Gupta et al., 2020; Ijspeert et al., 2002; Kim et al., 2013) on Warm-Start RL, the policy is initialized via behavior cloning from offline data and then fine-tuned with online reinforcement learning. A variant of this scheme is proposed in Advantage Weighted Actor Critic (Nair et al., 2020), which enables quick learning of skills across a suite of benchmark tasks. In the same spirit, Offline-Online Ensemble (Lee et al., 2022) leverages multiple Q-functions trained pessimistically offline as the initial function approximation for online learning.
Jump-start RL (Uchendu et al., 2022) utilizes a guide-policy to initialize online RL in the early phase, together with a separate online exploration-policy. The guide-policy is abandoned as the online exploration-policy improves. However, a fundamental characterization of the finite-time performance of Warm-Start RL is still lacking. Recent work (Xie et al., 2021) provides a quantitative understanding of the policy fine-tuning problem in episodic Markov Decision Processes (MDPs) and establishes a lower bound for the sample complexity, where no function approximation is used. Our work aims to take steps to quantify the impact of approximation errors on online RL when a warm-start policy is given.

(Actor-Critic as Newton's Method) The intrinsic connection between the A-C method and Newton's method can be traced back to the convergence analysis of policy iteration in MDPs with continuous action spaces (Puterman & Brumelle, 1979). The connection was later examined in a special MDP with discretized continuous state space (Santos & Rust, 2004). Recent work (Bertsekas, 2022b) points out that the success of Warm-Start RL, e.g., AlphaZero, can be attributed to the equivalence between policy iteration and Newton's method in the ADP framework, which leads to the superlinear convergence rate for online policy adaptation. Under a generalized differentiability assumption, it has also been proved theoretically that policy iteration is an instance of semi-smooth Newton-type methods for solving the Bellman equation (Gargiani et al., 2022). While some prior works (Grand-Clément, 2021) have provided theoretical investigation of the connections between policy iteration and Newton's method, the studies are carried out in the abstract dynamic programming (ADP) framework, assuming accurate updates in iterations.
Departing from the ADP framework, this work treats the A-C algorithm as Newton's method in the presence of approximation errors, and focuses on the finite-time learning performance of Warm-Start RL.

(Finite-time analysis for Actor-Critic methods) Among the existing works on the finite-time analysis of A-C methods with function approximation, Yang et al. (2019) establish global convergence for the linear quadratic regulator. Wang et al. (2020) prove convergence when both the Actor and the Critic are approximated by overparameterized neural networks. Kumar et al. (2019) consider the sample complexity under i.i.d. assumptions, where the Actor update and the Critic update can be 'decoupled'. Khodadadian et al. (2022) consider the two-timescale setting with Markovian samples. Fu et al. (2020) focus on the more general single-timescale setting but constrain the policy function approximation to the energy-based function class.

2. BACKGROUND

Markov Decision Processes. We consider an MDP defined by a tuple $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$, where $\mathcal{S} = \{1, 2, \cdots, n\}$, $n < \infty$, and $\mathcal{A} = \{1, 2, \cdots, A\}$, $A < \infty$, represent the finite state space and the finite action space, respectively. $P(s'|s,a): \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0,1]$ is the probability of the transition from state $s$ to state $s'$ by applying action $a$, and $r(s,a): \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the corresponding reward. $\gamma \in (0,1)$ is the discount factor. At each step $t$, an agent moves from the current state $s_t$ to the next state $s_{t+1}$ by taking an action $a_t$ following the policy $\pi: \mathcal{S} \to \mathcal{A}$ and receives the reward $r_t$. In Warm-Start RL, we assume that the initial policy $\pi_0$ is given, e.g., in the form of a neural network (Li, 2017), and obtained by offline training. For brevity, we write $r^{\pi} \in \mathbb{R}^n$ with $[r^{\pi}]_s = r(s, \pi(s))$ and $P^{\pi} \in \mathbb{R}^{n \times n}$ with $[P^{\pi}]_{s,s'} \triangleq P(s'|s,\pi(s))$ to denote the reward vector and the transition matrix induced by policy $\pi$. We further denote by $d^{\pi}: \mathcal{S} \to [0,1]$ and $\rho^{\pi}: \mathcal{S} \times \mathcal{A} \to [0,1]$ the stationary state distribution and the state-action transition distribution induced by policy $\pi$. We use $\rho_0$ to represent the initial state distribution, and $\|\cdot\|$ or $\|\cdot\|_2$ to represent the 2-norm.

Value Functions. For any policy $\pi$, define the value function $v^{\pi}(s): \mathcal{S} \to \mathbb{R}$ as
$$v^{\pi}(s) = E_{a_t \sim \pi(\cdot|s_t),\, s_{t+1} \sim P(\cdot|s_t,a_t)}\Big[\sum_{t=0}^{\infty} \gamma^t r_t \,\Big|\, s_0 = s\Big]$$
to measure the expected cumulative reward starting from state $s$ by following policy $\pi$. We define the Q-function $Q^{\pi}(s,a): \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ as $Q^{\pi}(s,a) = E[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s, a_0 = a]$ to represent the expected return when the action $a$ is chosen at the state $s$. By using the transition matrix and reward vector defined above, we have the compact form of the value function $v^{\pi} = (I - \gamma P^{\pi})^{-1} r^{\pi}$, where $I \in \mathbb{R}^{n \times n}$ is the identity matrix and $v^{\pi} \in \mathbb{R}^n$ is the value vector with the component-wise values $[v^{\pi}]_s \triangleq v^{\pi}(s)$, with
$$v^{\pi}(s) \triangleq E_{a \sim \pi(\cdot|s)}[Q^{\pi}(s,a)]. \tag{1}$$
The main objective is to find an optimal policy $\pi^*$ such that the value function is maximized, i.e.,
$$\max_{\pi} E_{s \sim \rho_0}[v^{\pi}(s)] \triangleq \max_{\pi} E_{s \sim \rho_0,\, a \sim \pi(\cdot|s)}[Q^{\pi}(s,a)]. \tag{2}$$
In what follows, we use both the Q-function and the value function $v(s)$ for convenience, and the relation between the two is given in Eqn. (2).

Bellman Operator. For $v \in \mathbb{R}^n$, define the Bellman evaluation operator $T^{\pi}: \mathbb{R}^n \to \mathbb{R}^n$ and the Bellman operator $T: \mathbb{R}^n \to \mathbb{R}^n$ as
$$T^{\pi}(v) = r^{\pi} + \gamma P^{\pi} v, \qquad T(v) = \max_{\pi}\{r^{\pi} + \gamma P^{\pi} v\} = \max_{\pi} T^{\pi}(v).$$
It is well known that the Bellman operator $T$ is a contraction mapping and is order-preserving. Note that the Bellman operator $T$ may not be differentiable everywhere due to the max operator, and the value $v^*$ of the optimal policy $\pi^*$ is the only fixed point of the Bellman operator $T$ (Puterman, 2014). From the definition of the Bellman evaluation operator $T^{\pi}$, $v^{\pi}$ is the fixed point of $T^{\pi}$, i.e., $v^{\pi} = T^{\pi}(v^{\pi})$.
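The two operators above can be sketched on a toy problem. The following is a minimal illustration only: the 2-state, 2-action MDP (`P`, `R`, `GAMMA`) and all function names are hypothetical and not taken from the paper. It evaluates $T^{\pi}$ and $T$, checks the sup-norm contraction property on one pair of vectors, and approximates the fixed point $v^*$ by repeated application of $T$.

```python
from typing import List

GAMMA = 0.9
# Hypothetical 2-state, 2-action MDP (assumption, not from the paper):
# P[s][a][s2] is a transition probability, R[s][a] is a reward.
P = [[[0.8, 0.2], [0.1, 0.9]],
     [[0.5, 0.5], [0.3, 0.7]]]
R = [[1.0, 0.0],
     [0.0, 2.0]]


def backup(v: List[float], s: int, a: int) -> float:
    """r(s, a) + gamma * sum_{s'} P(s'|s, a) v(s')."""
    return R[s][a] + GAMMA * sum(p * x for p, x in zip(P[s][a], v))


def bellman_eval(v: List[float], pi: List[int]) -> List[float]:
    """T^pi(v) = r^pi + gamma * P^pi v for a deterministic policy pi."""
    return [backup(v, s, pi[s]) for s in range(len(v))]


def bellman_opt(v: List[float]) -> List[float]:
    """T(v) = max_pi T^pi(v), which reduces to a per-state max over actions."""
    return [max(backup(v, s, a) for a in range(len(R[s])))
            for s in range(len(v))]


def sup_dist(u: List[float], v: List[float]) -> float:
    """Sup-norm distance between two value vectors."""
    return max(abs(x - y) for x, y in zip(u, v))


# Contraction check on one pair: ||T(u) - T(v)||_inf <= gamma * ||u - v||_inf.
u, v = [0.0, 0.0], [5.0, -3.0]
contraction_lhs = sup_dist(bellman_opt(u), bellman_opt(v))
contraction_rhs = GAMMA * sup_dist(u, v)

# Repeated application of T approaches its unique fixed point v*.
v_star = [0.0, 0.0]
for _ in range(500):
    v_star = bellman_opt(v_star)
```

Because the maximization in $T$ decouples across states, the sketch takes a per-state maximum over actions instead of enumerating policies.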

2.1. POLICY ITERATION AS NEWTON'S METHOD IN ABSTRACT DYNAMIC PROGRAMMING

Policy iteration carries out policy learning by alternating between two steps: policy improvement and policy evaluation. At time $t$, the policy evaluation step seeks to learn the value function $v^{\pi_t}$ for the current policy $\pi_t$ by solving the fixed point equation of the Bellman evaluation operator:
$$v = T^{\pi_t}(v). \tag{3}$$
Denote $v_t = v^{\pi_t}$ for simplicity. Then in the policy improvement step, a new policy $\pi_{t+1}$ is obtained by maximizing, in a greedy manner, the value function $v_t$ learnt in the policy evaluation step, i.e.,
$$\pi_{t+1} = \arg\max_{\pi} T^{\pi}(v_t). \tag{4}$$
To introduce the connection between policy iteration and Newton's method, we first define the operator $F: v \to v - T(v)$ for convenience. As in (Grand-Clément, 2021; Puterman, 2014), $F$ can be treated as the "gradient" of an unknown function. Under the assumption that $F(v)$ is differentiable at $v$, the Jacobian $J_v$ of $F$ at $v$ can be obtained as $J_v = I - \gamma P^{\pi(v)}$, where $\pi(v) \triangleq \arg\max_{\pi} T^{\pi}(v)$. Note that $J_v$ is invertible with $J_v^{-1} = \sum_{i=0}^{\infty} (\gamma P^{\pi(v)})^i$ (Puterman, 2014). Since it can be shown that $v^{\pi_{t+1}} = (I - \gamma P^{\pi_{t+1}})^{-1} r^{\pi_{t+1}} = J^{-1}_{v^{\pi_t}} r^{\pi_{t+1}}$ for the policy evaluation of $\pi_{t+1}$, we have that
$$v^{\pi_{t+1}} = v^{\pi_t} - J^{-1}_{v^{\pi_t}} J_{v^{\pi_t}} v^{\pi_t} + J^{-1}_{v^{\pi_t}} r^{\pi_{t+1}} = v^{\pi_t} - J^{-1}_{v^{\pi_t}} F(v^{\pi_t}), \tag{5}$$
where the last equality uses $F(v^{\pi_t}) = J_{v^{\pi_t}} v^{\pi_t} - r^{\pi_{t+1}}$. This indicates that the analytic representation of policy iteration in the abstract dynamic programming framework reduces to Newton's method. It is worth mentioning that the convergence behavior of policy iteration near the optimal value $v^*$ cannot be directly obtained from the results in convex optimization (Boyd et al., 2004), since the Bellman operator $T$ need not be differentiable at every value vector $v$. The full proof is included in Appendix A.
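The equivalence between one policy-iteration step and one Newton step on $F(v) = v - T(v)$ can be checked numerically. The sketch below is illustrative only: the toy MDP and the helper names (`greedy`, `jacobian`, `solve2`, `newton_step`) are assumptions introduced here, and the linear systems are solved by Cramer's rule, which only covers the 2-state case.

```python
from typing import List

GAMMA = 0.9
# Hypothetical 2-state, 2-action MDP (assumption, not from the paper).
P = [[[0.8, 0.2], [0.1, 0.9]],
     [[0.5, 0.5], [0.3, 0.7]]]
R = [[1.0, 0.0],
     [0.0, 2.0]]


def backup(v: List[float], s: int, a: int) -> float:
    """r(s, a) + gamma * sum_{s'} P(s'|s, a) v(s')."""
    return R[s][a] + GAMMA * sum(p * x for p, x in zip(P[s][a], v))


def greedy(v: List[float]) -> List[int]:
    """pi(v) = argmax_pi T^pi(v), computed state by state."""
    return [max(range(len(R[s])), key=lambda a: backup(v, s, a))
            for s in range(len(v))]


def jacobian(pi: List[int]) -> List[List[float]]:
    """J = I - gamma * P^pi."""
    n = len(pi)
    return [[(1.0 if i == j else 0.0) - GAMMA * P[i][pi[i]][j]
             for j in range(n)] for i in range(n)]


def solve2(J: List[List[float]], b: List[float]) -> List[float]:
    """Solve the 2x2 linear system J x = b by Cramer's rule."""
    det = J[0][0] * J[1][1] - J[0][1] * J[1][0]
    return [(b[0] * J[1][1] - J[0][1] * b[1]) / det,
            (J[0][0] * b[1] - b[0] * J[1][0]) / det]


def policy_value(pi: List[int]) -> List[float]:
    """Policy evaluation: v^pi = (I - gamma P^pi)^{-1} r^pi."""
    return solve2(jacobian(pi), [R[s][pi[s]] for s in range(len(pi))])


def newton_step(v: List[float]) -> List[float]:
    """v - J_v^{-1} F(v), with F(v) = v - T(v) and J_v = I - gamma P^{pi(v)}."""
    pi = greedy(v)
    f = [v[s] - backup(v, s, pi[s]) for s in range(len(v))]
    d = solve2(jacobian(pi), f)
    return [x - y for x, y in zip(v, d)]


v0 = policy_value([0, 0])                # value of the starting policy
v_policy_iter = policy_value(greedy(v0))  # improvement, then evaluation
v_newton = newton_step(v0)                # one Newton step from v0
```

In this sketch `v_policy_iter` and `v_newton` agree up to floating-point error, which is the identity in Eqn. (5) specialized to a small example.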

2.2. AN ILLUSTRATIVE EXAMPLE OF THE ERROR PROPAGATION IN ACTOR-CRITIC UPDATES

The A-C method can be viewed as a generalization of policy iteration in ADP, where the Critic update corresponds to the policy evaluation of the current policy and the Actor update performs the policy improvement. In practice, function approximation (e.g., via neural networks) is often used to approximate both the Critic and the Actor, which inevitably incurs approximation errors for the policy update and evaluation. More importantly, the approximation errors could propagate along with the iterative updates in the A-C method. We use an illustrative example to get a more concrete sense of the impact of the approximation errors on the policy update. As illustrated in Figure 1, for a given policy $\pi_t$ with the underlying true policy value $v^{\pi_t}$, we denote $\hat{v}^{\pi_t}$ as the learnt value estimation of $v^{\pi_t}$ in the Critic step. We further denote $\pi_{t+1}$ and $\tilde{\pi}_{t+1}$ as the greedy policies obtained in the Actor update Eqn. (4) by using $v^{\pi_t}$ and $\hat{v}^{\pi_t}$, respectively. Let $\hat{\pi}_{t+1}$ be the policy estimation of $\tilde{\pi}_{t+1}$ with function approximation in the Actor step. Intuitively, $\pi_{t+1}$ is the underlying true policy update from $\pi_t$ using one step of policy iteration without any error, $\tilde{\pi}_{t+1}$ is the policy update from $\pi_t$ with approximation errors in the Critic update, and $\hat{\pi}_{t+1}$ is the policy update from $\pi_t$ with approximation errors in both the Critic step and the Actor step.

To characterize the impact of the approximation errors on the policy update, i.e., the difference between $v^{\hat{\pi}_{t+1}}$ and $v^{\pi_{t+1}}$, we evaluate the Critic error, i.e., the difference between $v^{\tilde{\pi}_{t+1}}$ and $v^{\pi_{t+1}}$, and the Actor error, i.e., the difference between $v^{\hat{\pi}_{t+1}}$ and $v^{\tilde{\pi}_{t+1}}$, separately. More specifically, to quantify the Critic error, we first have the following update based on the same reasoning as in Eqn. (5):
$$v^{\tilde{\pi}_{t+1}} = v_t - \hat{J}^{-1}_{v_t}\big[v_t - (r^{\tilde{\pi}_{t+1}} + \gamma P^{\tilde{\pi}_{t+1}} v_t)\big] \triangleq v_t - \hat{J}^{-1}_{v_t}\big[v_t - \hat{T}(v_t)\big],$$
where $\hat{T}(v_t) = r^{\tilde{\pi}_{t+1}} + \gamma P^{\tilde{\pi}_{t+1}} v_t$ and $\hat{J}_{v_t} = I - \gamma P^{\tilde{\pi}_{t+1}}$.
Denote the approximation errors (random variables) in the Bellman operator $T$ and the Jacobian $J_v$ by $E_{T,t}$ and $E_{J,t}$, i.e.,
$$\hat{T}(v_t) - T(v_t) = (r^{\tilde{\pi}_{t+1}} + \gamma P^{\tilde{\pi}_{t+1}} v_t) - (r^{\pi_{t+1}} + \gamma P^{\pi_{t+1}} v_t) \triangleq E_{T,t},$$
$$\hat{J}^{-1}_{v_t} - J^{-1}_{v_t} = (I - \gamma P^{\tilde{\pi}_{t+1}})^{-1} - (I - \gamma P^{\pi_{t+1}})^{-1} \triangleq E_{J,t},$$
where it is clear that both error terms stem from the function approximation errors in the Critic update. To quantify the Actor error, we assume that $v^{\hat{\pi}_{t+1}} = v^{\tilde{\pi}_{t+1}} + E_{a,t}$, where $E_{a,t}$ is the error term. Therefore, by casting the A-C method as Newton's method with perturbation, we can characterize the approximation errors on the policy update:
$$v^{\hat{\pi}_{t+1}} = v^{\pi_{t+1}} + E_{c,t} + E_{a,t},$$
where $E_{c,t} \triangleq -E_{J,t}(v_t - T(v_t)) + (J^{-1}_{v_t} + E_{J,t}) E_{T,t}$ and $E_{a,t}$ capture the impact of the approximation errors from the Critic update step and the Actor update step, respectively. Intuitively, as illustrated in Figure 1, both errors from the previous update in the A-C method may propagate to the next update and thus affect the convergence behavior of the algorithm substantially, in contrast to idealized policy iteration without approximation errors. This phenomenon has also been observed in empirical results (Fujimoto et al., 2018; Thrun & Schwartz, 1993). In this work, we strive to systematically analyze the impact of the approximation errors, through (1) a detailed characterization of the approximation errors in the Critic update and the Actor update in Section 3 and (2) a thorough analysis of the error propagation effect and biases in Section 4. The illustration of our analysis is available in Appendix A.
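The Critic-error term can be checked numerically on a toy problem. The sketch below is an illustration under stated assumptions: the MDP, the perturbation of the value estimate, and all helper names are hypothetical. It writes `pi_true` for the greedy policy computed from the true value $v_t$ and `pi_tilde` for the greedy policy computed from a perturbed estimate, forms $E_{T,t}$ (difference of the two Bellman backups at $v_t$) and $E_{J,t}$ (difference of the two inverse Jacobians), and compares $-E_{J,t}(v_t - T(v_t)) + (J^{-1}_{v_t} + E_{J,t})E_{T,t}$ with the actual value gap between the two greedy policies.

```python
from typing import List

Vec = List[float]
Mat = List[List[float]]

GAMMA = 0.9
# Hypothetical 2-state, 2-action MDP (assumption, not from the paper).
P = [[[0.8, 0.2], [0.1, 0.9]],
     [[0.5, 0.5], [0.3, 0.7]]]
R = [[1.0, 0.0],
     [0.0, 2.0]]


def r_vec(pi: List[int]) -> Vec:
    return [R[s][pi[s]] for s in range(2)]


def p_mat(pi: List[int]) -> Mat:
    return [P[s][pi[s]] for s in range(2)]


def matvec(m: Mat, x: Vec) -> Vec:
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def inv2(m: Mat) -> Mat:
    """Inverse of a 2x2 matrix."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]


def jac_inv(pi: List[int]) -> Mat:
    """(I - gamma P^pi)^{-1}."""
    pm = p_mat(pi)
    return inv2([[(1.0 if i == j else 0.0) - GAMMA * pm[i][j]
                  for j in range(2)] for i in range(2)])


def backup_under(pi: List[int], v: Vec) -> Vec:
    """r^pi + gamma P^pi v."""
    return [r + GAMMA * x for r, x in zip(r_vec(pi), matvec(p_mat(pi), v))]


def policy_value(pi: List[int]) -> Vec:
    return matvec(jac_inv(pi), r_vec(pi))


def greedy(v: Vec) -> List[int]:
    def q(s: int, a: int) -> float:
        return R[s][a] + GAMMA * sum(p * x for p, x in zip(P[s][a], v))
    return [max(range(2), key=lambda a: q(s, a)) for s in range(2)]


v_t = policy_value([0, 0])                        # true value of pi_t
pi_true = greedy(v_t)                             # error-free greedy update
pi_tilde = greedy([v_t[0] + 6.0, v_t[1] - 6.0])   # greedy on a perturbed estimate

t_true = backup_under(pi_true, v_t)               # T(v_t)
e_t = [a - b for a, b in zip(backup_under(pi_tilde, v_t), t_true)]
j_inv, j_inv_hat = jac_inv(pi_true), jac_inv(pi_tilde)
e_j = [[j_inv_hat[i][j] - j_inv[i][j] for j in range(2)] for i in range(2)]
residual = [a - b for a, b in zip(v_t, t_true)]   # v_t - T(v_t)
j_plus_e = [[j_inv[i][j] + e_j[i][j] for j in range(2)] for i in range(2)]

# Critic-induced perturbation: -E_J (v_t - T(v_t)) + (J^{-1} + E_J) E_T.
e_c = [-a + b for a, b in zip(matvec(e_j, residual), matvec(j_plus_e, e_t))]
gap = [a - b for a, b in zip(policy_value(pi_tilde), policy_value(pi_true))]
```

In this toy run the perturbed estimate changes the greedy action in one state, and `e_c` matches `gap` up to rounding, so the value loss caused by the inaccurate Critic is fully accounted for by the two error terms.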

3. CHARACTERIZATION OF APPROXIMATION ERRORS

Actor-Critic Methods with Function Approximation. In what follows, we consider that the policy is parameterized by $\theta \in \Theta$, which in general corresponds to a non-linear function class. Following (Konda & Tsitsiklis, 1999; Peters & Schaal, 2008; Kumar et al., 2019; 2020; Santos & Rust, 2004), the Q-function is parameterized by a linear function class with base function $\phi(s,a)$ and parameter $\omega \in \Omega \subset \mathbb{R}^d$, i.e., $Q_{\omega}(s,a) = \omega^{\top}\phi(s,a)$. We note that the modeling of the Q-function via a linear value function is often used to extract insight in the A-C method. Similar to policy iteration, the update in the A-C method alternates between the following two steps.

Critic update: The Critic updates its parameter $\omega$ to evaluate the current policy $\pi_t$, e.g., by applying the $m$-step ($m \ge 1$) Bellman evaluation operator $T^{\pi}$ to the current Q-function estimator (namely, the $m$-step return), which leads to the following update rule at time step $t$:
$$Q_{t+1}(s,a) \leftarrow E_{\pi_t}\Big[(1-\gamma) \cdot \sum_{i=0}^{m-1} \gamma^i r(s_i,a_i) + \gamma^m \cdot Q_{\omega_t}(s_m,a_m) \,\Big|\, s_0 = s, a_0 = a\Big], \tag{7}$$
$$\omega_{t+1} \leftarrow \arg\min_{\omega} E_{(s,a)\sim\rho^{\pi_t}}\Big[\big(Q_{t+1}(s,a) - \omega^{\top}\phi(s,a)\big)^2\Big]. \tag{8}$$

Actor update: The Actor is updated through a greedy step to maximize the Q-function $Q_{\omega_{t+1}}$, i.e.,
$$\pi_{t+1} \leftarrow \arg\max_{\pi} E_{(s,a)\sim\rho^{\pi}}\big[Q_{\omega_{t+1}}(s,a)\big]. \tag{9}$$
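The alternating Critic and Actor steps can be sketched in an idealized tabular setting. This is a minimal illustration, not the paper's algorithm: the toy MDP is hypothetical, expectations are computed exactly rather than from samples, and the feature map is assumed to be one-hot over state-action pairs, so that the least-squares fit in Eqn. (8) reproduces the $m$-step target exactly. The names `m_step_target` and `greedy_actor` are introduced here for illustration.

```python
from typing import List

QTable = List[List[float]]

GAMMA = 0.9
# Hypothetical 2-state, 2-action MDP (assumption, not from the paper).
P = [[[0.8, 0.2], [0.1, 0.9]],
     [[0.5, 0.5], [0.3, 0.7]]]
R = [[1.0, 0.0],
     [0.0, 2.0]]
N_STATES, N_ACTIONS = 2, 2


def m_step_target(q: QTable, pi: List[int], m: int) -> QTable:
    """Expected (1-gamma) * sum_{i<m} gamma^i r_i + gamma^m q(s_m, a_m) under pi.

    Computed by m backward recursions of
    g(s, a) <- (1-gamma) r(s, a) + gamma * sum_{s'} P(s'|s, a) g(s', pi(s')).
    """
    g = [row[:] for row in q]
    for _ in range(m):
        g = [[(1.0 - GAMMA) * R[s][a]
              + GAMMA * sum(P[s][a][s2] * g[s2][pi[s2]]
                            for s2 in range(N_STATES))
              for a in range(N_ACTIONS)] for s in range(N_STATES)]
    return g


def greedy_actor(q: QTable) -> List[int]:
    """Greedy Actor step: pick the action with the largest Q-value per state."""
    return [max(range(N_ACTIONS), key=lambda a: q[s][a])
            for s in range(N_STATES)]


def max_abs_diff(a: QTable, b: QTable) -> float:
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


q0 = [[0.0, 0.0], [0.0, 0.0]]   # initial Critic estimate
pi_t = [0, 0]                   # current (e.g. warm-start) policy

# A very long rollout serves as a proxy for the m -> infinity limit.
q_limit = m_step_target(q0, pi_t, 400)
err = {m: max_abs_diff(m_step_target(q0, pi_t, m), q_limit) for m in (0, 2, 8)}

# One full A-C iteration: Critic target with m = 8, then greedy Actor step.
pi_next = greedy_actor(m_step_target(q0, pi_t, 8))
```

The dictionary `err` illustrates why a finite rollout length $m$ leaves a residual evaluation error: the distance to the long-rollout limit shrinks geometrically in $m$ but is nonzero for any finite $m$.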

3.1. APPROXIMATION ERROR IN THE CRITIC UPDATE

Solving the minimization problem in Eqn. (8) involves the expectation over the stationary state-action distribution $\rho^{\pi_t}$ induced by the current policy $\pi_t$, which can be approximated by the sample average in practice. Therefore, we consider the Critic update below based on two groups of samples, $\{(s_l,a_l)\}_{l=1}^{N}$ and $\{\tau_l\}_{l=1}^{N}$ where $\tau_l = \{s_{l,t}, a_{l,t}, r_{l,t}\}_{t=0}^{m}$, which are collected by following $\pi_t$:
$$\omega_{t+1} = \Gamma_R\Big\{\Big(\sum_{l=1}^{N} \phi(s_l,a_l)\phi(s_l,a_l)^{\top}\Big)^{-1} \cdot \sum_{l=1}^{N}\Big((1-\gamma)\sum_{i=0}^{m-1}\gamma^i r_{l,i} + \gamma^m Q_{\omega_t}(s_{l,m},a_{l,m})\Big)\phi(s_{l,m},a_{l,m})\Big\}, \tag{10}$$
where $\Gamma_R$ is the projection operator onto the Critic parameter space $\Omega$ with radius $R$ in $\mathbb{R}^d$. Since the samples in each trajectory $\tau_l$ are obtained via rollouts, in general the samples in each trajectory follow a Markovian process (Dalal et al., 2018; Kumar et al., 2019). We further assume that the samples are from the stationary distribution induced by the current policy. In what follows, we use $\hat{\omega}$ and $\omega$ to distinguish between the sample-based update and the solution of Eqn. (8), such that the approximation error in the Critic update can be quantified as $|Q_{\hat{\omega}_t} - Q_{\omega_t}|$. We first impose the following standard assumptions on the Bellman evaluation operator $T^{\pi}$, the base function $\phi$ and the MDP.

Assumption 1. For given Critic parameter $\omega$ and policy parameter $\theta$, the following condition holds: $\inf_{\bar{\omega}\in\Omega} E_{\rho^{\pi_\theta}}\big[\big((T^{\pi_\theta})^m Q_{\omega} - \bar{\omega}^{\top}\phi\big)(s,a)\big] = 0$, where $\rho^{\pi_\theta}$ is the stationary state-action transition probability induced by policy $\pi_\theta$.

Assumption 1 (Fu et al., 2020) indicates that the solution of the Critic update given in Eqn. (8) lies in the Critic parameter space $\Omega$. We note that this assumption is used for ease of exposition, and our results can be modified by incorporating an additional constant term when this assumption does not hold. The proof sketch in this case can be found in Appendix C.

Assumption 2.
The base function $\phi$ in the Critic satisfies the following two conditions: (1) $\|\phi(s,a)\|_2 \le 1$, $\forall (s,a) \in \mathcal{S} \times \mathcal{A}$; and (2) the smallest singular value of $E_{\rho}[\phi(s,a)\phi(s,a)^{\top}]$ is lower bounded by a positive constant $\sigma^*$ for any stationary state-action transition distribution $\rho$.

Assumption 2 is widely used in the A-C method to guarantee that the minimization in Eqn. (8) can be attained by a unique minimizer (Fu et al., 2020; Bhandari et al., 2018; Agarwal et al., 2021).

Assumption 3. The reward $r(s,a)$ and the MDP satisfy the following two conditions: (1) the reward is upper bounded by a positive constant $r_{\max}$ for all $(s,a) \in \mathcal{S} \times \mathcal{A}$; and (2) the stationary state-action transition matrix $P^{\pi}$ has non-zero spectral gap $1 - \lambda_{\pi} > 0$ for all $\pi$.

The first condition in Assumption 3 is often used for discounted MDPs to ensure a finite value function (Thrun & Schwartz, 1993; Fujimoto et al., 2018; Fu et al., 2020). Moreover, since the samples in the same trajectory are generally correlated, the second condition is adopted to guarantee the concentration properties of the Markov chain, which is generally true for a stationary Markov chain (Jiang et al., 2018; Ortner, 2020; Amit et al., 2020). For any $\lambda \in [0,1)$, let $\alpha_1(\lambda) = (1+\lambda)/(1-\lambda)$, and let $\alpha_2(\lambda) = 5/(1-\lambda)$ for $\lambda \in (0,1)$ with $\alpha_2(0) = 1/3$. Define
$$\bar{r}_m = \frac{\sqrt{\alpha_2^2(\max\{\lambda_{\pi_t},0\})\, r_{\max}^2 \ln^2 p - 2m\,\alpha_1(\max\{\lambda_{\pi_t},0\}) \ln p} - \alpha_2(\max\{\lambda_{\pi_t},0\}) \ln p}{m} + r_{\max}.$$
Then we have the following main result on the approximation error in the Critic update step.

Proposition 1 (Approximation Error in Critic Update). Under Assumptions 1, 2, 3, the following inequality holds with probability at least $1 - p$, for any $t > 0$ and $(s,a) \in \mathcal{S} \times \mathcal{A}$:
$$E\big[|Q_{\hat{\omega}_t}(s,a) - Q_{\omega_t}(s,a)|\big] \le \frac{4\big((1-\gamma)\bar{r}_m + \gamma^m R\big)}{\sqrt{N}(\sigma^*)^2} - \frac{2}{3N}\log\frac{p}{4d} + \sqrt{\frac{4}{9N^2}\log^2\frac{p}{4d} - \frac{2}{N}\log\frac{p}{4d}},$$
where $d$ is the dimension of the Critic parameter $\omega$ and $R$ is the radius of the Critic parameter space $\Omega$ as defined in Eqn. (10).
Proposition 1 establishes an upper bound for the approximation error in the Critic update, which encapsulates the impact of the finite sample size and the finite-step rollout with the Bellman evaluation operator $T^{\pi}$. It can be seen from Proposition 1 that in order to obtain an accurate evaluation of the policy, we can increase the sample size $N$ in the update Eqn. (10) and use more rollout steps with the Bellman evaluation operator $T^{\pi}$. We remark that Proposition 1 accounts for the correlation across samples, and we appeal to recent advances in Bernstein's Inequality for Markovian samples (Jiang et al., 2018; Fan et al., 2021b) to tackle this challenge. The proof of Proposition 1 can be found in Appendix B and Appendix C.

3.2. APPROXIMATION ERROR IN THE ACTOR UPDATE

In practice, the greedy search step for solving Eqn. (9) is generally approximated by multiple (e.g., $N_a$) steps of policy gradient. Based on the policy gradient theorem (Silver et al., 2014; Sutton et al., 1999), we have the following update at gradient step $k \in [1, N_a]$ in the $t$-th Actor update:
$$\theta_{t,1} = \theta_t, \quad \theta_{t,N_a} = \theta_{t+1}, \quad \theta_{t,k+1} = \theta_{t,k} + \alpha E_{(s,a)\sim\rho^{\pi_{\theta_{t,k}}}}\big[Q_{\omega_{t+1}}(s,a)\nabla_{\theta}\pi_{\theta_{t,k}}(a|s)\big], \tag{11}$$
where $\alpha$ is the learning rate. For simplicity, we drop the subscript $t$ in $\theta_{t,k}$ when no confusion arises and denote $\rho_k := \rho^{\pi_{\theta_k}}$. As in the Critic update, we sample a trajectory of length $l$ by following the current policy $\pi_{\theta_k}$, i.e., $\{s_1, a_1, s_2, a_2, \cdots, s_l, a_l\}$, to approximate the expectation in Eqn. (11). Then we have that
$$\theta_{k+1} = \theta_k + \alpha \frac{1}{l}\sum_{i=1}^{l}\big[Q_{\hat{\omega}_{t+1}}(s_i,a_i)\nabla_{\theta}\pi_{\theta_k}(a_i|s_i)\big] := \theta_k + \alpha(C_{k,t,1} + C_{k,t,2}) + \alpha f_{k,t}, \tag{12}$$
where $C_{k,t,1}$, $C_{k,t,2}$ and $f_{k,t}$ are defined as follows:
$$C_{k,t,1} := \frac{1}{l}\sum_{i=1}^{l}\big(Q_{\hat{\omega}_{t+1}}(s_i,a_i)\nabla_{\theta}\pi_{\theta_k}(a_i|s_i) - Q_{\omega_{t+1}}(s_i,a_i)\nabla_{\theta}\pi_{\theta_k}(a_i|s_i)\big),$$
$$C_{k,t,2} := \frac{1}{l}\sum_{i=1}^{l}\big(Q_{\omega_{t+1}}(s_i,a_i)\nabla_{\theta}\pi_{\theta_k}(a_i|s_i) - Q^{\pi_{\theta_t}}(s_i,a_i)\nabla_{\theta}\pi_{\theta_k}(a_i|s_i)\big),$$
$$f_{k,t} := \frac{1}{l}\sum_{i=1}^{l} Q^{\pi_{\theta_t}}(s_i,a_i)\nabla_{\theta}\pi_{\theta_k}(a_i|s_i).$$
Here $C_{k,t,1}$ captures the error resulting from using samples to estimate the expectation in the Critic update. Based on our result in Proposition 1, with high probability, this term goes to 0 when we have infinitely many samples or infinite rollout length $m$. Note that $(T^{\pi_{\theta_t}})^m Q_{\omega_t} = Q_{\omega_{t+1}}$ (Critic update) and $\lim_{m\to\infty}(T^{\pi_{\theta_t}})^m Q_{\omega_t} = Q^{\pi_{\theta_t}}$. The term $C_{k,t,2}$ captures the approximation error from applying the Bellman evaluation operator only a finite number ($m$) of times, and goes to 0 when $m \to \infty$. Finally, $f_{k,t}$ is an unbiased estimate of the gradient of $E_{(s,a)\sim\rho_k}[Q^{\pi_{\theta_t}}(s,a)]$, i.e., $E[f_{k,t}] = E_{(s,a)\sim\rho_k}[Q^{\pi_{\theta_t}}(s,a)\nabla_{\theta}\pi_{\theta_k}(a|s)]$. Based on Eqn.
(12), it is clear that the Actor update with the approximation error resulting from the Critic update can be viewed as a stochastic gradient update with a perturbation $C_{k,t} = C_{k,t,1} + C_{k,t,2}$. Denote $\theta^*_{t+1}$ as the solution of Eqn. (9). For ease of exposition, we use $h(\omega,\theta)$ to denote the objective function in the Actor update: $h(\omega,\theta) = E_{(s,a)\sim\rho^{\pi_\theta}}[Q_{\omega}(s,a)] = E_{s\sim d^{\pi_\theta}}[v^{\pi_\omega}(s)]$. Note that $h$ is a function of the Actor parameter $\theta$ for a given Critic parameter $\omega$. Next, we quantify the approximation error in the Actor update in terms of the gap $h(\omega_{t+1},\theta^*_{t+1}) - h(\omega_{t+1},\theta_{t+1})$. Recall that Proposition 1 gives an upper bound for the approximation error $E[|Q_{\hat{\omega}_t} - Q_{\omega_t}|]$ in the Critic update, which has a direct impact on $C_{k,t}$. Based on Proposition 1, we have the following two lemmas for the upper bounds on the bias term $b = E[C_{k,t}]$ and the error term $\beta = f_{k,t} + C_{k,t} - E[C_{k,t}]$ in the stochastic gradient update Eqn. (12), respectively. The proofs of Lemmas 1 and 2 can be found in Appendix D and E, respectively.

Lemma 1 ($\sigma^2$-bounded noise). Suppose Assumptions 1, 2, 3 hold. Then with probability at least $1 - p$, $E[\|\beta\|^2] \le \|\nabla_{\theta}h(\omega,\theta) + b\|^2 + \sigma^2$, $\forall \theta$, where $\sigma^2 \ge 0$ is a constant that depends on $p$.

Lemma 2 ($\zeta^2$-bounded bias). Suppose Assumptions 1, 2, 3 hold. Then with probability at least $1 - p$, $\|b\|^2 \le \zeta^2$, $\forall \theta$, where $\zeta^2 \ge 0$ is a constant that depends on $p$.

Denote the score function $\psi_{\theta}(a|s) := \nabla_{\theta}\pi_{\theta}(a|s)$. We impose the following standard assumptions.

Assumption 4. For any $\theta, \theta' \in \mathbb{R}^d$ and state-action pair $(s,a) \in \mathcal{S} \times \mathcal{A}$, there exist positive constants $L_{\psi}$, $C_{\psi}$ and $C_{\pi}$ such that the following holds: (1) $\|\psi_{\theta} - \psi_{\theta'}\| \le L_{\psi}\|\theta - \theta'\|$; (2) $\|\psi_{\theta}\| \le C_{\psi}$; and (3) $\|\pi_{\theta}(\cdot|s) - \pi_{\theta'}(\cdot|s)\|_{TV} \le C_{\pi}\|\theta - \theta'\|$, where $\|\cdot\|_{TV}$ is the total-variation distance.
We remark that the smoothness and boundedness of the score function, as stated in (1) and (2) of Assumption 4, are widely adopted in the literature (Xu et al., 2020a; b; Zou et al., 2019; Agarwal et al., 2020; Kumar et al., 2019), and it has been shown (Xu et al., 2020a) that (3) in Assumption 4 can be satisfied for any smooth policy with bounded action space. For the sake of tractability, we next give the following two lemmas about the smoothness of, and the Polyak-Łojasiewicz (PL) condition on, the objective function $h(\cdot,\theta)$. The proof can be found in Appendix F.

Lemma 3 ($L$-smoothness). Suppose Assumption 4 holds. Then the function $h(\cdot,\theta)$ is bounded from below by an infimum $h_{\inf} \in \mathbb{R}$, differentiable, and $\nabla h$ is $L$-Lipschitz, i.e., $\|\nabla h(\omega,\theta) - \nabla h(\omega,\theta')\| \le L\|\theta - \theta'\|$, $\forall\, \omega, \theta, \theta'$.

Lemma 4 ($\mu$-PL). If $\nabla h(\omega,\theta) \neq 0$, then we have $\frac{1}{2}\|\nabla h(\omega,\theta)\| \ge \mu(h(\omega,\theta^*) - h(\omega,\theta)) \ge 0$, $\forall\, \theta, \omega$.

Let $\alpha \le \frac{1}{2L}$. Next we present the upper bound on the approximation error in the Actor update.

Proposition 2 (Approximation Error in Actor Update). Given the updated Actor parameter $\theta_{t-1}$, the following inequality holds with probability at least $1 - p$:
$$E_{\theta}\big[\|h(\omega,\theta^*_t) - h(\omega,\theta_t)\|\big] \le (1-\alpha\mu)^{N_a}\big(h(\omega,\theta^*_t) - h(\omega,\theta_{t-1})\big) + \frac{\zeta^2 + 2\alpha L\sigma^2}{2\mu},$$
where $\sigma^2$, $\zeta^2$, $L$ and $\mu$ are defined in Lemma 1, Lemma 2, Lemma 3 and Lemma 4, respectively.

Proposition 2 reveals that, due to the bias and noise induced by the Critic approximation error, running more gradient iterations does not necessarily guarantee convergence to the optimal policy $\pi_{\theta^*_t}$: the first term vanishes as $N_a$ grows, but the second term does not depend on $N_a$. Note that the counterparts of Lemmas 1 and 2 in (Ajalloeian & Stich, 2020) are given in the form of assumptions, whereas in this work we justify that both hold with high probability and prove Proposition 2. The proof can be found in Appendix G.
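The stagnation caused by a biased gradient can be illustrated on a one-dimensional toy problem. The sketch below rests on several assumptions that are not from the paper: it uses a noise-free setting ($\sigma^2 = 0$), writes the problem as minimizing the loss $f(\theta) = \frac{\mu}{2}\theta^2$ (which is $\mu$-PL and $L$-smooth with $L = \mu$) rather than maximizing $h$, and injects a constant bias $b$ into every gradient. In that setting the gap after $k$ steps obeys $(1-\alpha\mu)^k\,\text{gap}_0 + \zeta^2/(2\mu)$ with $\zeta^2 = b^2$, and the iterates settle at a gap of $b^2/(2\mu)$ instead of zero.

```python
from typing import List


def biased_gradient_gaps(theta0: float, bias: float, alpha: float,
                         mu: float, n_steps: int) -> List[float]:
    """Run theta <- theta - alpha * (grad f(theta) + bias) on f = mu/2 * theta^2.

    Returns the optimality gaps f(theta_k) - min f for k = 0..n_steps.
    """
    theta = theta0
    gaps = [0.5 * mu * theta ** 2]
    for _ in range(n_steps):
        theta -= alpha * (mu * theta + bias)
        gaps.append(0.5 * mu * theta ** 2)
    return gaps


MU = 1.0      # PL constant of the toy loss (hypothetical value)
ALPHA = 0.5   # step size, satisfies alpha <= 1 / (2 L) with L = MU
BIAS = 0.2    # constant gradient bias, so zeta^2 = BIAS ** 2

gaps_biased = biased_gradient_gaps(3.0, BIAS, ALPHA, MU, 60)
gaps_unbiased = biased_gradient_gaps(3.0, 0.0, ALPHA, MU, 60)

# Bias-induced floor zeta^2 / (2 mu): extra iterations cannot go below it.
floor = BIAS ** 2 / (2.0 * MU)
bounds = [(1.0 - ALPHA * MU) ** k * gaps_biased[0] + floor
          for k in range(len(gaps_biased))]
```

Comparing `gaps_biased[-1]` with `gaps_unbiased[-1]` shows the qualitative message: without bias the gap keeps shrinking geometrically, while with bias it stalls near `floor` no matter how many gradient steps are taken.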

4. IMPACT OF APPROXIMATION ERRORS AND WARM-START POLICY ON FINITE-TIME LEARNING PERFORMANCE

We next quantify the impact of the approximations errors on the sub-optimality gap of the Warm-Start A-C method with inaccurate Actor/Critic updates. We first cast the A-C method as Newton's Method with perturbation, and then present both the finite-time upper bound and lower bound on the finite-time learning performance in Section 4.1. Actor-Critic Method as Newton's Method with Perturbation. As mentioned earlier, the Critic update follows Eqn. ( 10) with finite samples and finite step rollout with Bellman evaluation operator T π and the Actor update follows Eqn. ( 12). Given the policy π t at time t, we denote the resulting policy of one A-C update as πt+1 . Recall that we use π t+1 to denote the policy attained the max in T (v πt ) as illustrated in Figure 1 . Furthermore, we define the following notations for ease of our discussion: (1) Denote E v,t = v πt+1 -v πt+1 as the approximation error in the Actor update; (2) Denote E r,t = r πt+1 -r πt+1 as the error in the reward vector, which is induced by the approximation error in the Actor update step; (3) Denote E P,t = P πt+1 -P πt+1 as the error in the transition matrix P ; (4) Denote E Ĵ,t = J -1 vt -J -1 vt where J vt = I -γP πt+1 and J vt = I -γP πt+1 . Following the same line as in Section 2.2, we treat the A-C algorithm as Newton's method with perturbation E t , i.e., v πt+1 = v πt -(J -1 vt (v πt -T (v πt )) -E t ) := v πt -L(t), where L(t) is the stochastic estimator of Newton's update L(t) = J -1 vπ t v πt -T v πt , and E t = E v,t + E Ĵ,t (v πt+1 -(r πt+1 + γP πt+1 v πt+1 )) -J -1 vt (E r,t + γE P,t v πt ), which can be further decomposed into bias and Martingale difference noise as follows: B(t) ≜E[ L(t)] -L(t) = E[E t ], N (t) ≜ L(t) -E[ L(t)] = E t -E[E t ]. We have a few observations in order. It can be seen that the perturbation E t results from both Actor approximation error (e.g., E r,t , E P,t ) and Critic approximation error (e.g., E v,t ). 
More importantly, the learnt Q-function in the Critic update Eqn. (10) is biased in general due to the finite number of rollout steps $m$, which further leads to biased gradients in the Actor update Eqn. (11) (Kumar et al., 2019). Clearly, the estimation bias plays an important role in the learning performance, especially when deep neural networks are used as function approximators, which has been extensively investigated in empirical studies (Fujimoto et al., 2018; Elfwing et al., 2018; Van Hasselt et al., 2016). Next, we examine the bias $B(t)$ based on the approximation errors in the Actor/Critic updates. Combining the results in Propositions 1 and 2 on the approximation errors in the Critic/Actor updates, we have the following result on the bias $B(t)$. A full derivation is given in Appendix H.

Proposition 3 (Upper Bound on the Bias). Suppose Assumption 4 holds. Let $S_{\epsilon}(\cdot)$ be an open ball of radius $\epsilon$. There exist positive constants $L_b$ and $\epsilon$ such that, when $\theta_{t+1} \in S_{\epsilon}(\theta^*_{t+1})$, the following holds for any $t > 0$:
$$\|B(t)\| \le L_b\, E_{\theta}\big[\|h(\cdot,\theta^*_{t+1}) - h(\cdot,\theta_{t+1})\|\big].$$

4.1. FINITE-TIME LEARNING PERFORMANCE AND ERROR PROPAGATION EFFECT

Lower Bound on Performance Gap. Aiming to understand "whether online learning can be accelerated by a warm-start policy", we first derive a lower bound to quantify the impact of the bias and the error propagation. By unrolling the recursion of the Newton update (with perturbation) in Eqn. (14), we obtain the following theorem.

Theorem 1. The expected sub-optimality gap $\|E[v^* - v_{\pi_{t+1}}]\|$ satisfies

$$\|E[v^* - v_{\pi_{t+1}}]\| \ge \Big\| \gamma^{t+1} \bar P_{t+1}(v^* - v_{\pi_0}) + \sum_{i=1}^{t} \gamma^{i} \bar P_i B(t-i) + B(t) \Big\|,$$

where $\bar P_{t+1} = E\big[\prod_{i=0}^{t} P_{\pi_{t+1-i}}\big]$ and $\pi_0$ is the warm-start policy.

Error Propagation and Accumulation. It can be seen from Theorem 1 that the bias terms $\{B(t)\}$ add up over time, and the propagation effect of the bias terms is encapsulated by the last two terms on the right side of Eqn. (15). Clearly, the first term on the right side, corresponding to the impact of the warm-start policy $\pi_0$, diminishes with A-C updates. To get a more concrete sense of Theorem 1, we consider the following special settings. (1) When the bias is always positive, i.e., $B(t) > 0$ for all $t \ge 0$, the lower bound in Theorem 1 is always positive, i.e., $\|E[v^* - v_{\pi_{t+1}}]\| \ge \|B(t)\| > 0$. In this case, the sub-optimality gap remains bounded away from zero. A similar conclusion can be drawn when the bias is always negative. (2) When the bias term can be either positive or negative, the lower bound is given by Eqn. (15). In this case, the learning performance of the A-C algorithm largely depends on the behavior of the bias term. It can be seen from Theorem 1 that even when the warm-start policy is near-optimal, it is still challenging to guarantee that online fine-tuning can improve the policy if the approximation error is not handled correctly. We note that this has also been observed empirically (Nair et al., 2020; Lee et al., 2022).

Upper Bound on Performance Gap.
In order to derive the upper bound on the sub-optimality gap, we first introduce the following standard assumption on the Jacobian $J_v$, as in the analysis of policy iteration (Puterman & Brumelle, 1979; Grand-Clément, 2021).

Assumption 5. For some $q > 0$ there exists a constant $0 < L_J < +\infty$ such that

$$\|J_{v_\pi} - J_{v^*}\| \le L_J \|v_\pi - v^*\|^{q}, \quad \forall \pi,$$

and there is a constant $0 < M < +\infty$ such that $\|J^{-1}_{v_\pi}\| \le M$ for all $\pi$.

Denote $H_t := \|J^{-1}_{v_t}[J_{v_t} - J_{v^*}]\|$. By Assumption 5, $H_t$ is upper bounded as $H_t \le M L_J \|v_{\pi_t} - v^*\|^{q}$. Next, we present the finite-time upper bound in Theorem 2.

Theorem 2. Suppose Assumption 5 holds. Then for any $t > 0$,

$$\|E[v_{\pi_{t+1}} - v^*]\| \le \prod_{i=0}^{t} H_{t-i}\, \|v^* - v_{\pi_0}\| + \sum_{i=1}^{t} \Big(\prod_{j=1}^{i} H_{t-j}\Big) \|B(t-i)\| + \|B(t)\|.$$

Under what conditions can online learning be accelerated by the warm-start policy? The upper bound in Theorem 2 sheds light on the impact of the warm-start policy $\pi_0$ (the first term) and the bias $\{B(t)\}$ (the last two terms), thereby providing guidance on how to achieve the desired finite-time learning performance. Specifically, consider the case when the approximation error is unbiased. Then we have

$$\|E[v_{\pi_{t+1}} - v^*]\| \le \prod_{i=0}^{t} H_{t-i}\, \|v^* - v_{\pi_0}\|,$$

which decreases quickly if the warm-start policy $\pi_0$ is close to the optimal policy $\pi^*$. This observation corroborates recent empirical findings, where online RL is able to further improve the warm-start policy within a few adaptation steps (Silver et al., 2017; Bertsekas, 2022a; Kalashnikov et al., 2018). More generally, when the bias $B(t) \neq 0$, the upper bound hinges heavily on the biases in the approximation errors, even when the warm-start policy $\pi_0$ is close to the optimal policy. In this case, recall the upper bound on the bias $B(t)$ in Proposition 3, where we establish the connection between the bias and the approximation error.
As expected, in order to reduce the performance gap, it is essential to decrease the bias in the approximation error, which can be achieved by increasing the number of gradient steps and the sample size. The proofs of Theorems 1 and 2 are relegated to Appendices I and J, respectively. The experimental results and analysis on the Gridworld benchmark can be found in Appendix K.
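To make the role of the two ingredients in Theorem 2 concrete, the following minimal sketch iterates the scalar one-step recursion behind the bound, $e_{t+1} = H_t e_t + \|B(t)\|$ with $H_t$ replaced by its bound $M L_J e_t^{q}$ from Assumption 5. This is an illustration under simplifying assumptions, not the paper's algorithm: the gap is treated as a scalar, the function name `gap_upper_bound` and the constants `M = L_J = q = 1` are hypothetical choices, and the bias norms are supplied by hand.

```python
def gap_upper_bound(gap0, bias_norms, M=1.0, L_J=1.0, q=1.0):
    """Iterate e_{t+1} = (M * L_J * e_t**q) * e_t + ||B(t)||.

    gap0       : initial gap ||v* - v_{pi_0}|| of the warm-start policy
    bias_norms : sequence of bias magnitudes ||B(t)||, one per A-C update
    Returns the list [e_0, e_1, ..., e_T] of bound values.
    """
    gaps = [gap0]
    for b in bias_norms:
        e = gaps[-1]
        h = M * L_J * e ** q      # upper bound on H_t from Assumption 5
        gaps.append(h * e + b)    # contraction of the old gap plus fresh bias
    return gaps

# Same warm start, with and without a constant bias (illustrative numbers).
unbiased = gap_upper_bound(0.5, [0.0] * 5)
biased = gap_upper_bound(0.5, [0.05] * 5)
```

In the unbiased run the bound shrinks superlinearly (each value is the square of the previous one for these constants), whereas in the biased run it settles just above the bias level instead of going to zero, which mirrors the discussion of the last two terms in Theorem 2.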

5. CONCLUSION

In this work, we take a finite-time analysis approach to quantify the impact of approximation errors on the learning performance of the Warm-Start A-C method with a given prior policy. By delving into the intricate coupling between the updates of the Actor and the Critic, we first provide upper bounds on the approximation errors in both the Critic update and the Actor update of online adaptation, where recent advances on Bernstein's Inequality are leveraged to deal with the sample correlation therein. Based on these results, we next cast the Warm-Start A-C method as Newton's method with perturbation, which serves as the foundation for characterizing the impact of the approximation errors on the finite-time learning performance of Warm-Start A-C. In particular, we derive lower bounds on the sub-optimality gap under biased approximation errors, indicating that the performance gap can be bounded away from zero even for Warm-Start A-C with a good prior policy. We also provide upper bounds on the sub-optimality gap, which provide guidance on the design of Warm-Start RL for achieving the desired finite-time learning performance.

Policy Iteration as Newton's Method. Based on (Puterman & Brumelle, 1979; Grand-Clément, 2021), we first build the relation between policy iteration and Newton's method in the abstract dynamic programming (ADP) framework, assuming accurate updates. From the definition of the value function $v$, we have that for any policy $\pi$,

$$v_\pi = r_\pi + \gamma P_\pi v_\pi.$$

Recall the definition of the Bellman evaluation operator $T^{\pi}(\cdot)$ and the Bellman operator $T(\cdot)$:

$$T^{\pi}(v) = r_\pi + \gamma P_\pi v, \qquad T(v) = \max_\pi \{r_\pi + \gamma P_\pi v\} = \max_\pi T^{\pi}(v).$$
It follows that

$$\begin{aligned}
v_{\pi_{t+1}} &= J^{-1}_{v_{\pi_t}} r_{\pi_{t+1}} \\
&= v_{\pi_t} - v_{\pi_t} + J^{-1}_{v_{\pi_t}} r_{\pi_{t+1}} \\
&= v_{\pi_t} - J^{-1}_{v_{\pi_t}} J_{v_{\pi_t}} v_{\pi_t} + J^{-1}_{v_{\pi_t}} r_{\pi_{t+1}} \\
&= v_{\pi_t} - J^{-1}_{v_{\pi_t}} \big(-r_{\pi_{t+1}} + J_{v_{\pi_t}} v_{\pi_t}\big) \\
&= v_{\pi_t} - J^{-1}_{v_{\pi_t}} \big(-r_{\pi_{t+1}} + (I - \gamma P_{\pi_{t+1}}) v_{\pi_t}\big) \\
&= v_{\pi_t} - J^{-1}_{v_{\pi_t}} \big(v_{\pi_t} - r_{\pi_{t+1}} - \gamma P_{\pi_{t+1}} v_{\pi_t}\big) \\
&= v_{\pi_t} - J^{-1}_{v_{\pi_t}} \big(v_{\pi_t} - T(v_{\pi_t})\big), \qquad (16)
\end{aligned}$$

where $J_v = I - \gamma P_{\pi(v)}$ and $\pi(v)$ attains the max in $T(v)$. Eqn. (16) establishes a connection between policy iteration under ADP and Newton's method. Specifically, if we assume the function $F : v \mapsto v - T(v)$ is differentiable at any vector $v$ visited by policy iteration, then we have $v_{t+1} = v_t - J^{-1}_{v_t} F(v_t)$, which is exactly the update of Newton's method in convex optimization (Boyd et al., 2004). Since $F(\cdot)$ may not be differentiable at all $v$ in policy iteration, an assumption on the Lipschitzness of $v \mapsto J_v$ is commonly used to prove the convergence of policy iteration (see Assumption 5). Following the same line, we next consider the case when function approximation is used in the A-C algorithm.

A-C Updates with Function Approximation. Consider the illustrative example in Section 2.2. Next we outline the main differences between the A-C update with function approximation and policy iteration in the ADP framework, and cast A-C based policy iteration with function approximation as Newton's method with perturbation. Specifically,

$$\begin{aligned}
v_{\hat\pi_{t+1}} &= \hat J^{-1}_{v_{\pi_t}} r_{\hat\pi_{t+1}} \\
&= v_{\pi_t} - v_{\pi_t} + \hat J^{-1}_{v_{\pi_t}} r_{\hat\pi_{t+1}} \\
&= v_{\pi_t} - \hat J^{-1}_{v_{\pi_t}} \hat J_{v_{\pi_t}} v_{\pi_t} + \hat J^{-1}_{v_{\pi_t}} r_{\hat\pi_{t+1}} \\
&= v_{\pi_t} - \hat J^{-1}_{v_{\pi_t}} \big(-r_{\hat\pi_{t+1}} + \hat J_{v_{\pi_t}} v_{\pi_t}\big) \\
&= v_{\pi_t} - \hat J^{-1}_{v_{\pi_t}} \big(-r_{\hat\pi_{t+1}} + (I - \gamma P_{\hat\pi_{t+1}}) v_{\pi_t}\big) \\
&= v_{\pi_t} - \hat J^{-1}_{v_{\pi_t}} \big(v_{\pi_t} - (r_{\hat\pi_{t+1}} + \gamma P_{\hat\pi_{t+1}} v_{\pi_t})\big) \\
&\triangleq v_{\pi_t} - \hat J^{-1}_{v_{\pi_t}} \big(v_{\pi_t} - \hat T(v_{\pi_t})\big),
\end{aligned}$$

where $\hat J_{v_{\pi_t}} = I - \gamma P_{\hat\pi(v_{\pi_t})}$ and $\hat\pi(v)$ attains the max in $\hat T(v)$ (not $T(v)$), with the following two operators defined as

$$T(v_t) \triangleq r_{\pi_{t+1}} + \gamma P_{\pi_{t+1}} v_t, \qquad \hat T(v_t) \triangleq r_{\hat\pi_{t+1}} + \gamma P_{\hat\pi_{t+1}} v_t.$$

[Figure 2: Illustration of the theoretical analysis. Theorem 1: the lower bound of $E[v^* - v_{\pi_{t+1}}]$; Theorem 2: the upper bound of $E[v^* - v_{\pi_{t+1}}]$.]
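For convenience, let $E_{T,t}$ and $E_{J,t}$ denote the approximation errors in the Bellman operator $T$ and the Jacobian $J_v$, i.e.,

$$\hat T(v_t) - T(v_t) = (r_{\hat\pi_{t+1}} + \gamma P_{\hat\pi_{t+1}} v_t) - (r_{\pi_{t+1}} + \gamma P_{\pi_{t+1}} v_t) \triangleq E_{T,t},$$

$$\hat J^{-1}_{v_t} - J^{-1}_{v_t} = (I - \gamma P_{\hat\pi_{t+1}})^{-1} - (I - \gamma P_{\pi_{t+1}})^{-1} \triangleq E_{J,t},$$

and define $\hat v_{\hat\pi_{t+1}} \triangleq v_{\hat\pi_{t+1}} + E_{a,t}$, where $E_{a,t}$ captures the error induced by inaccurate policy improvement (the greedy step, e.g., Eqn. (9)) in the Actor update. Then we have that

$$\begin{aligned}
v_{\hat\pi_{t+1}} &= v_{\pi_t} - \hat J^{-1}_{v_{\pi_t}} \big(v_{\pi_t} - \hat T(v_{\pi_t})\big) \\
&= v_t - (J^{-1}_{v_t} + E_{J,t})\big(v_t - T(v_t) - E_{T,t}\big) \\
&= \underbrace{v_t - J^{-1}_{v_t}\big(v_t - T(v_t)\big)}_{\text{Exact Newton Step}} \; \underbrace{-\, E_{J,t}\big(v_t - T(v_t)\big) + (J^{-1}_{v_t} + E_{J,t}) E_{T,t}}_{\text{Perturbation}} \\
&\triangleq \underbrace{v_t - J^{-1}_{v_t}\big(v_t - T(v_t)\big)}_{\text{Exact Newton Step}} + E_t \\
&= v_{\pi_{t+1}} + E_t.
\end{aligned}$$

In a nutshell, we have that $\hat v_{\hat\pi_{t+1}} = v_{\pi_{t+1}} + E_{c,t} + E_{a,t}$, where $E_{c,t} \triangleq -E_{J,t}\big(v_t - T(v_t)\big) + (J^{-1}_{v_t} + E_{J,t}) E_{T,t}$.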
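The identity between one exact policy-iteration step and one Newton step on $F(v) = v - T(v)$ can be checked numerically. The sketch below does so on a toy 2-state, 2-action MDP; the transition probabilities, rewards, and all function names are illustrative assumptions and are not taken from the paper.

```python
GAMMA = 0.9
# Toy MDP (assumed numbers): P[s][a] is the next-state distribution,
# R[s][a] is the expected reward for taking action a in state s.
P = {0: {0: [0.9, 0.1], 1: [0.2, 0.8]},
     1: {0: [0.5, 0.5], 1: [0.1, 0.9]}}
R = {0: {0: 1.0, 1: 0.0},
     1: {0: 2.0, 1: 3.0}}

def solve2(A, b):
    """Solve the 2x2 linear system A x = b by Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - b[0] * A[1][0]) / det]

def jacobian(pi):
    """J = I - gamma * P_pi for a deterministic policy pi (list of actions)."""
    return [[(1.0 if s == s2 else 0.0) - GAMMA * P[s][pi[s]][s2]
             for s2 in range(2)] for s in range(2)]

def value(pi):
    """Exact policy evaluation: v_pi = J^{-1} r_pi."""
    return solve2(jacobian(pi), [R[s][pi[s]] for s in range(2)])

def q(s, a, v):
    """One-step backup r(s, a) + gamma * sum_s' P(s'|s, a) v(s')."""
    return R[s][a] + GAMMA * sum(p * x for p, x in zip(P[s][a], v))

def greedy(v):
    """Policy attaining the max in T(v)."""
    return [max((0, 1), key=lambda a: q(s, a, v)) for s in range(2)]

def newton_step(pi_t):
    """Return the greedy policy and v_t - J^{-1} (v_t - T(v_t))."""
    v = value(pi_t)
    pi_next = greedy(v)
    Tv = [q(s, pi_next[s], v) for s in range(2)]
    F = [v[s] - Tv[s] for s in range(2)]       # F(v) = v - T(v)
    d = solve2(jacobian(pi_next), F)           # J^{-1} F(v)
    return pi_next, [v[s] - d[s] for s in range(2)]

pi_next, v_newton = newton_step([0, 0])
v_pi_next = value(pi_next)                     # value of the improved policy
```

Here `v_newton` and `v_pi_next` agree up to floating-point error, as the derivation predicts. Replacing `jacobian(pi_next)` or `Tv` with perturbed versions would reproduce the perturbation term $E_t$.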

B BERNSTEIN'S INEQUALITY WITH GENERAL MARKOVIAN SAMPLES

In this section, we provide the proof of Bernstein's Inequality with general Markovian samples, following the proof of Theorem 2 in (Jiang et al., 2018). With a slight abuse of notation, let $\pi$ denote the stationary distribution of the Markov chain $\{X_i\}_{i\ge 1}$. We define $\pi(h) := \int h(x)\,\pi(dx)$ to be the integral of a function $h$ with respect to $\pi$. Let $L_2(\pi) = \{h : \pi(h^2) < \infty\}$ be the Hilbert space of square-integrable functions and $L_2^0(\pi) = \{h \in L_2(\pi) : \pi(h) = 0\}$ be the subspace of mean-zero functions. Let $P$ be the Markov transition matrix of its underlying (state space) graph and $P^*$ be its adjoint in the Hilbert space. Let $\lambda(P) \in [0, 1]$ be the operator norm of $P$ on $L_2^0(\pi)$ and $\lambda_r(P) \in [-1, 1]$ be the rightmost spectral value of $(P + P^*)/2$. Then the right spectral gap of $P$ is defined as $1 - \lambda_r$ (Levin & Peres, 2017). (We remark that in Assumption 3, we assume the absolute spectral gap is non-zero, which implies the right spectral gap is also non-zero. This is true since $-1 \le \lambda_r \le \lambda \le 1$.) Let $E^{h}$ denote the multiplication operator of the function $e^{h} : x \mapsto e^{h(x)}$. In the Hilbert space $L_2(\pi)$, we define the norm of a function $h$ to be $\|h\|_\pi = \sqrt{\langle h, h\rangle_\pi}$. Furthermore, we introduce the norm of a linear operator $T$ on $L_2(\pi)$ as $|||T|||_\pi = \sup\{\|Th\|_\pi : \|h\|_\pi = 1\}$. We first restate Bernstein's Inequality with general Markovian samples (Jiang et al., 2018) in the following theorem. Let $\alpha_1(\lambda) = (1 + \lambda)/(1 - \lambda)$, $\alpha_2(\lambda) = 5/(1 - \lambda)$ for $\lambda \in (0, 1)$, and $\alpha_2(0) = 1/3$.

Theorem 3 (Bernstein's Inequality with General Markovian Samples). Suppose $\{X_i\}_{i\ge 1}$ is a stationary Markov chain with invariant distribution $\pi$ and non-zero right spectral gap $1 - \lambda_r > 0$, and $f : \mathcal{X} \to [-c, c]$ is a function with $\pi(f) = 0$. Let $\sigma^2 = \pi(f^2)$. Then, for any $0 \le t < (1 - \max\{\lambda_r, 0\})/5c$ and any $\epsilon > 0$,

$$P_\pi\Big(\frac{1}{n}\sum_{i=1}^{n} f(X_i) > \epsilon\Big) \le \exp\Big(-\frac{n\epsilon^2/2}{\alpha_1(\max\{\lambda_r, 0\}) \cdot \sigma^2 + \alpha_2(\max\{\lambda_r, 0\}) \cdot c\epsilon}\Big).$$

Proof. Step 1. Establish the upper bound of $E\big[e^{t\sum_{i=1}^{n} f_i(X_i)}\big]$.
Let $\mathbf{1} : x \mapsto 1$ be the function mapping $x$ to 1 and let $\Pi$ be the projection operator onto $\mathbf{1}$, i.e., $\Pi : h \mapsto \langle h, \mathbf{1}\rangle_\pi \mathbf{1} = \pi(h)\mathbf{1}$. Define the León-Perron operator to be $\hat P_\gamma = \gamma I + (1 - \gamma)\Pi$, $\gamma \in [0, 1)$. Then we recall the following lemma (Lemma 2, (Jiang et al., 2018)) on the stationary Markov chain (Fan et al., 2021a).

Lemma 5. Let $\{X_i\}$ be a stationary Markov chain with invariant measure $\pi$ and non-zero right spectral gap $1 - \lambda_r > 0$. For any bounded function $f$ and any $t \in \mathbb{R}$,

$$E_\pi\big[e^{t\sum_{i=1}^{n} f(X_i)}\big] \le |||E^{tf/2}\, \hat P_{\max\{\lambda_r, 0\}}\, E^{tf/2}|||_\pi^{\,n}.$$

Lemma 5 indicates that it is sufficient to upper bound $E\big[e^{t\sum_{i=1}^{n} f_i(X_i)}\big]$ by upper bounding $|||E^{tf/2}\, \hat P_{\max\{\lambda_r, 0\}}\, E^{tf/2}|||_\pi^{\,n}$. To this end, we first invoke the following lemma (Lemma 6, (Jiang et al., 2018)) to construct $f_k \approx f$ such that for any $\lambda \in [0, 1)$,

$$|||E^{tf/2}\, \hat P_\lambda\, E^{tf/2}|||_\pi = \lim_{k\to\infty} |||E^{t f_k/2}\, \hat P_\lambda\, E^{t f_k/2}|||_\pi.$$

Lemma 6. Consider a function $f : x \in \mathcal{X} \mapsto [-c, c]$ such that $\pi(f) = 0$ and $\pi(f^2) = \sigma^2$. Let $\lceil\cdot\rceil$ be the ceiling function and $\tilde f_k(x) = \big\lceil \frac{f(x) + c}{c/3k} \big\rceil \times \frac{c}{3k} - c$. Let $f_k = \frac{\tilde f_k - \pi(\tilde f_k)}{1 + 1/3k}$. Then $f_k$ takes at most $6k + 1$ possible values and satisfies that, for any bounded linear operator $T$ acting on the Hilbert space $L_2(\pi)$ and any $t \in \mathbb{R}$,

$$|||E^{tf/2}\, T\, E^{tf/2}|||_\pi = \lim_{k\to\infty} |||E^{t f_k/2}\, T\, E^{t f_k/2}|||_\pi.$$

Assume that the Markov chain $\{\hat X_i\}_{i\ge 1}$, $\hat X_i \in \mathcal{X}$, is generated by the León-Perron operator $\hat P_\lambda$. It follows that $\{\hat Y_i\}_{i\ge 1} = \{f_k(\hat X_i)\}_{i\ge 1}$ is a Markov chain in the state space $\mathcal{Y} = f_k(\mathcal{X})$. We recall the following lemma (Lemma 7, (Jiang et al., 2018)) on the relation between the two Markov chains.

Lemma 7. Let $\hat P_\lambda$ be the León-Perron operator with $\lambda \in [0, 1)$ on state space $\mathcal{X}$. Let $f$ be a function on $\mathcal{X}$. On the finite state space $\mathcal{Y} = \{y \in f(\mathcal{X}) : \pi(\{x : f(x) = y\}) > 0\}$, define a transition matrix $Q_\lambda = \lambda I + (1 - \lambda)\mathbf{1}\mu^\top$, with the vector $\mu$ consisting of elements $\pi(\{x : f(x) = y\})$ for $y \in \mathcal{Y}$. Let $E^{t\mathcal{Y}}$ denote the diagonal matrix with elements $e^{ty}$, $y \in \mathcal{Y}$. Then we have

$$|||E^{tf/2}\, \hat P_\lambda\, E^{tf/2}|||_\pi = |||E^{t\mathcal{Y}/2}\, Q_\lambda\, E^{t\mathcal{Y}/2}|||_\mu.$$
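Next, we bound the term $|||E^{t\mathcal{Y}/2}\, Q_\lambda\, E^{t\mathcal{Y}/2}|||_\mu$ by expanding the largest eigenvalue of the perturbed Markov operator $E^{tf/2} P E^{tf/2}$ as a series in $t$. Specifically, we recall the following result (Lezaud, 1998).

Lemma 8. Consider a reversible, irreducible Markov chain on a finite state space $\mathcal{X}$. Let $D$ be the diagonal matrix with entries $\{f(x) : x \in \mathcal{X}\}$ and $T^{(m)} = P D^m / m!$ for any $m \ge 0$, with $D^0 = I$. Assume the invariant distribution of the Markov chain is $\pi$ and the second largest eigenvalue of the transition matrix $P$ is $\lambda_r < 1$. Let $t_0 = \big(2\,|||T^{(1)}|||_\pi (1 - \lambda_r)^{-1} + c_0\big)^{-1}$ for some $c_0$ such that $|||T^{(m)}|||_\pi \le |||T^{(1)}|||_\pi\, c_0^{m-1}$ for all $m \ge 1$. Denote the largest eigenvalue of $P E^{tf}$ by $\beta(t)$ and let $Z = (I - P + \Pi)^{-1} - \Pi$. Let $Z^{(0)} = -\Pi$, $Z^{(j)} = Z^j$ for $j \ge 1$, $\beta^{(0)} = 1$, and for $m \ge 1$ let

$$\beta^{(m)} = \sum_{p=1}^{m} \frac{-1}{p} \sum_{\substack{v_1 + \cdots + v_p = m,\ v_i \ge 1 \\ k_1 + \cdots + k_p = p-1,\ k_j \ge 0}} \mathrm{trace}\big[T^{(v_1)} Z^{(k_1)} \cdots T^{(v_p)} Z^{(k_p)}\big].$$

Then we have the following expansion of $\beta(t)$:

$$\beta(t) = \sum_{m=0}^{\infty} \beta^{(m)} t^m, \quad |t| < t_0.$$

Following the same line as in (Lezaud, 1998) (pages 854-856), denote $\sigma^2 = \|f\|_\pi^2$ and $c = c_0 \ge |||D|||_\pi$ (defined in Lemma 8). Then we have the following upper bound on $\beta(t)$:

$$\begin{aligned}
\beta(t) &= \beta^{(0)} + \beta^{(1)} t + \sum_{m=2}^{\infty} \beta^{(m)} t^m \\
&\le 1 + 0 + \sum_{m=2}^{\infty} \frac{\pi(f^m)\, t^m}{m!} + \sum_{m=2}^{\infty} \frac{\sigma^2 \lambda t}{5c}\Big(\frac{5ct}{1 - \lambda_r}\Big)^{m-1} \\
&\le \exp\Big(\sum_{m=2}^{\infty} \frac{\pi(f^m)\, t^m}{m!} + \sum_{m=2}^{\infty} \frac{\sigma^2 \lambda t}{5c}\Big(\frac{5ct}{1 - \lambda_r}\Big)^{m-1}\Big) \\
&\le \exp\Big(\frac{\sigma^2}{c^2}\big(e^{tc} - 1 - tc\big) + \frac{\sigma^2 \lambda t^2}{1 - \lambda_r - 5ct}\Big) := \exp\big(g_1(t) + g_2(t)\big). \qquad (18)
\end{aligned}$$

Now we are ready to derive the bound for the term $E\big[e^{t\sum_{i=1}^{n} f_i(X_i)}\big]$. Following Lemma 6, we consider a sequence of $f_k$ such that

$$|||E^{tf/2}\, \hat P_\lambda\, E^{tf/2}|||_\pi = \lim_{k\to\infty} |||E^{t f_k/2}\, \hat P_\lambda\, E^{t f_k/2}|||_\pi.$$

Next, we construct the finite-state-space counterpart of each pair of $E^{t f_k/2}\, \hat P_\lambda\, E^{t f_k/2}$ and $\pi$ by Lemma 7, i.e.,

$$|||E^{t f_k/2}\, \hat P_\lambda\, E^{t f_k/2}|||_\pi := |||E^{t\mathcal{Y}_k/2}\, Q_\lambda\, E^{t\mathcal{Y}_k/2}|||_{\mu_k}.$$

Let the random variable in the state space $\mathcal{Y}_k$ be $Y_k$. Then the mean and variance of $Y_k$ are

$$\sum_{y \in \mathcal{Y}_k} \pi(\{x : f_k(x) = y\})\, y = \pi(f_k) = 0 \quad \text{and} \quad \sum_{y \in \mathcal{Y}_k} \pi(\{x : f_k(x) = y\})\, y^2 = \pi(f_k^2).$$

For each k, applying Eqn.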
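(18) gives us

$$|||E^{t\mathcal{Y}_k/2}\, Q_\lambda\, E^{t\mathcal{Y}_k/2}|||_{\mu_k} \le \exp\Big(\frac{\pi(f_k^2)}{c^2}\big(e^{tc} - 1 - tc\big) + \frac{\pi(f_k^2)\, \lambda t^2}{1 - \lambda_r - 5ct}\Big).$$

Under review as a conference paper at ICLR 2023

Note that as $k \to \infty$, we have $\pi(f_k^2) \to \pi(f^2) = \sigma^2$. Then we have the upper bound for each operator norm $|||E^{tf_i/2}\, \hat P\, E^{tf_i/2}|||_\pi$, i.e., for any $\lambda \in [0, 1)$,

$$|||E^{tf/2}\, \hat P_\lambda\, E^{tf/2}|||_\pi \le \exp\big(g_1(t) + g_2(t)\big),$$

where $g_1$ and $g_2$ are defined in Eqn. (18). Consequently, we obtain the upper bound for $E\big[e^{t\sum_{i=1}^{n} f_i(X_i)}\big]$ as follows:

$$E_\pi\big[e^{t\sum_{i=1}^{n} f_i(X_i)}\big] \le \exp\Big(\frac{n\sigma^2}{c^2}\big(e^{tc} - 1 - tc\big) + \frac{n\sigma^2 \max\{\lambda_r, 0\}\, t^2}{1 - \max\{\lambda_r, 0\} - 5ct}\Big).$$

Step 2. Use a convex analysis argument to derive Bernstein's Inequality. We first restate the following lemma (Lemma 9, (Jiang et al., 2018)) on the terms $g_1$ and $g_2$.

Lemma 9. For $\lambda \in [0, 1)$, let $g_1(t) = \frac{n\sigma^2}{c^2}(e^{tc} - 1 - tc)$ and $g_2(t) = \frac{n\sigma^2 \max\{\lambda_r, 0\}\, t^2}{1 - \max\{\lambda_r, 0\} - 5ct}$. Then for any $0 \le t < (1 - \lambda)/5c$, the Fenchel conjugate $(g_1 + g_2)^*$ satisfies the following inequalities:

$$\text{if } \lambda \in (0, 1): \quad (g_1 + g_2)^*(\epsilon) := \sup_{0 \le t < (1-\lambda)/5c} \{t\epsilon - g_1(t) - g_2(t)\} \ge \frac{\epsilon^2}{2}\Big(\frac{1 + \lambda}{1 - \lambda}\sigma^2 + \frac{5c\epsilon}{1 - \lambda}\Big)^{-1},$$

$$\text{if } \lambda = 0: \quad (g_1 + g_2)^*(\epsilon) = g_1^*(\epsilon) \ge \frac{\epsilon^2}{2}\Big(\sigma^2 + \frac{c\epsilon}{3}\Big)^{-1}.$$

By the Chernoff bound, we have

$$-\log P\Big(\frac{1}{n}\sum_{i=1}^{n} f_i(X_i) > \epsilon\Big) \ge n \times \sup_{t \in \mathbb{R}} \{t\epsilon - g_1(t) - g_2(t)\}.$$

Notice that $g_1(t) = O(t^2)$ and $g_2(t) = O(t^2)$ as $t \to 0$, so for some $t > 0$ we have $t\epsilon - g_1(t) - g_2(t) > 0$. Meanwhile, when $t \le 0$, we have $t\epsilon - g_1(t) - g_2(t) \le 0$. Thus, we obtain

$$\sup\{t\epsilon - g_1(t) - g_2(t) : t > 0\} = \sup\{t\epsilon - g_1(t) - g_2(t) : t \in \mathbb{R}\} = (g_1 + g_2)^*(\epsilon).$$

Letting $\lambda = \max\{\lambda_r, 0\}$, $\alpha_1(\lambda) = (1 + \lambda)/(1 - \lambda)$, $\alpha_2(\lambda) = 5/(1 - \lambda)$ and $\alpha_2(0) = 1/3$ yields

$$P_\pi\Big(\frac{1}{n}\sum_{i=1}^{n} f(X_i) > \epsilon\Big) \le \exp\Big(-\frac{n\epsilon^2/2}{\alpha_1(\max\{\lambda_r, 0\}) \cdot \sigma^2 + \alpha_2(\max\{\lambda_r, 0\}) \cdot c\epsilon}\Big).$$

This concludes the proof.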
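To get a feel for how the bound of Theorem 3 depends on the sample size and on the spectral quantity $\lambda_r$, the following sketch simply evaluates its right-hand side. The function names and the numerical inputs are illustrative assumptions; only the formula itself comes from the theorem.

```python
import math

def alpha1(lam):
    """alpha_1(lambda) = (1 + lambda) / (1 - lambda)."""
    return (1.0 + lam) / (1.0 - lam)

def alpha2(lam):
    """alpha_2(0) = 1/3 and alpha_2(lambda) = 5 / (1 - lambda) for lambda in (0, 1)."""
    return 1.0 / 3.0 if lam == 0 else 5.0 / (1.0 - lam)

def bernstein_tail(n, eps, sigma2, c, lam_r):
    """Right-hand side of the Markovian Bernstein bound.

    n      : number of samples
    eps    : deviation threshold
    sigma2 : variance pi(f^2)
    c      : uniform bound on |f|
    lam_r  : rightmost spectral value of (P + P*) / 2, must be < 1
    """
    lam = max(lam_r, 0.0)
    denom = alpha1(lam) * sigma2 + alpha2(lam) * c * eps
    return math.exp(-(n * eps ** 2 / 2.0) / denom)

# Illustrative values: i.i.d.-like chain (lam_r = 0) vs. a slowly mixing one.
fast = bernstein_tail(n=1000, eps=0.1, sigma2=1.0, c=1.0, lam_r=0.0)
slow = bernstein_tail(n=1000, eps=0.1, sigma2=1.0, c=1.0, lam_r=0.5)
```

With these inputs, `slow` is much larger than `fast`: a smaller right spectral gap inflates both $\alpha_1$ and $\alpha_2$, so more samples are needed for the same confidence level.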

C PROOF OF PROPOSITION 1

Let $\bar\omega_{t+1} = \Gamma_R(\tilde\omega_{t+1})$, and assume $\|\phi(s, a)\| \le 1$ uniformly (see Assumption 1). Based on the approach in Appendix G.1 of (Fu et al., 2020), it suffices to upper bound $\|\omega_{t+1} - \bar\omega_{t+1}\|_2$. Observe that

$$\|\omega_{t+1} - \bar\omega_{t+1}\|_2 \le \|\hat\Phi \hat v - \Phi v\|_2 \le \|\Phi\|_2 \cdot \|\hat v - v\|_2 + \|\hat\Phi - \Phi\|_2 \cdot \|\hat v\|_2,$$

where $\hat\Phi$, $\Phi$, $\hat v$ and $v$ are given as follows:

$$\hat\Phi = \Big(\frac{1}{N}\sum_{l=1}^{N} \phi(s_l, a_l)\phi(s_l, a_l)^\top\Big)^{-1}, \qquad \Phi = \Big(E_{\rho_{t+1}}\big[\phi(s, a)\phi(s, a)^\top\big]\Big)^{-1},$$

$$\hat v = \frac{1}{N}\sum_{l=1}^{N} \Big((1 - \gamma)\sum_{i=0}^{m-1} \gamma^i r_{l,i} + \gamma^m Q_{\omega_t}(s_{l,m}, a_{l,m})\Big) \cdot \phi(s_{l,m}, a_{l,m}),$$

$$v = E_{\rho_{t+1}}\Big[\Big((1 - \gamma)\sum_{i=0}^{m-1} \gamma^i r_{l,i} + \gamma^m P^{\pi_{\theta_{t+1}}} Q_{\omega_t}(s_m, a_m)\Big) \cdot \phi(s_m, a_m)\Big].$$

Recall that the following assumptions are in place: (1) $\|\phi(s, a)\|_2 \le 1$, $\phi(s, a) \in \mathbb{R}^d$; (2) $|r(s, a)| \le r_{\max}$ and $\bar r = E_{s,a}\, r(s, a)$; (3) $\|\omega_t\|_2 \le R$; and (4) the minimum singular value of the matrix $E_{\rho_t}[\phi(s, a)\phi(s, a)^\top]$, $t \ge 1$, is uniformly lower bounded by $\sigma^*$. It can be shown that $\|\Phi\|_2 \le \frac{1}{\sigma^*}$.

Next, we derive the bound by appealing to Bernstein's Inequality with general Markovian samples. Following Theorem 2 of (Jiang et al., 2018) (the proof of Bernstein's Inequality can be found in Appendix B), let $\pi_r$ be the invariant distribution (which depends on the current policy $\pi_k$) of the stationary Markov chain $\{r_t\}_{t=1}^{m}$. Suppose that it has non-zero right spectral gap $1 - \lambda_r > 0$. Let $\sigma_r^2 = \int (r - \bar r)^2 \pi_r(dr)$. Then, for any $\epsilon > 0$,

$$P_{\pi_r}\Big(\frac{1}{m}\sum_{i=1}^{m} (r_i - \bar r) > \epsilon\Big) \le \exp\Big(-\frac{m\epsilon^2/2}{\alpha_1(\max\{\lambda_r, 0\}) \cdot \sigma_r^2 + \alpha_2(\max\{\lambda_r, 0\}) \cdot r_{\max}\epsilon}\Big),$$

where $\alpha_1(\lambda) = \frac{1 + \lambda}{1 - \lambda}$, and $\alpha_2(\lambda) = \frac{1}{3}$ if $\lambda = 0$, $\alpha_2(\lambda) = \frac{5}{1 - \lambda}$ if $\lambda \in (0, 1)$. We conclude that with probability at least $1 - p$,

$$\sum_{i=0}^{m-1} r_i \le \frac{\sqrt{\alpha_2^2(\max\{\lambda_r, 0\}) \ln^2 p - 2m\,\alpha_1(\max\{\lambda_r, 0\}) \ln p} - \alpha_2(\max\{\lambda_r, 0\}) \ln p}{m} + \bar r := \bar r_m.$$

It follows that with probability at least $1 - p$,

$$\|\hat v\|_2 \le (1 - \gamma)\bar r_m + \gamma^m R.$$

Further, note that $\|v\|_2 \le (1 - \gamma)\bar r + \gamma^m R$. Since the minimum singular value of $\hat\Phi^{-1}$ is no less than $\frac{\sigma^*}{2}$ w.h.p. when $N$ is large enough, we have that $\|\hat\Phi\|_2 \le \frac{2}{\sigma^*}$.
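For convenience, define

$$\hat X \triangleq \frac{1}{N}\sum_{l=1}^{N} \phi(s_l, a_l)\phi(s_l, a_l)^\top, \qquad X \triangleq E_{\rho_{t+1}}\big[\phi(s, a)\phi(s, a)^\top\big],$$

and define

$$Z \triangleq \hat X - X = \sum_{k=1}^{N} S_k, \qquad S_k \triangleq \frac{1}{N}\big(\phi_k\phi_k^\top - X\big),$$

where $S_k$, $k = 1, \cdots, N$, are independent. Next, we derive a uniform bound on the spectral norm of each summand:

$$\|S_k\|_2 = \frac{1}{N}\|\phi_k\phi_k^\top - X\| \le \frac{1}{N}\big(\|\phi_k\phi_k^\top\| + \|X\|\big) \le \frac{2}{N}.$$

We then bound the matrix variance statistic $V(Z)$:

$$V(Z) := \|E[Z^2]\| = \Big\|\sum_{k=1}^{N} E[S_k^2]\Big\|.$$

Note that the variance of each summand is given by

$$E[S_k^2] = \frac{1}{N^2}E\big[(\phi_k\phi_k^\top - X)^2\big] = \frac{1}{N^2}E\big[\|\phi_k\|^2 \cdot \phi\phi^\top - \phi\phi^\top X - X\phi\phi^\top + X^2\big] \preceq \frac{1}{N^2}\big[E[\phi\phi^\top] - X^2\big] \preceq \frac{1}{N^2}X.$$

Combining the above, we conclude that

$$0 \preceq \sum_{k=1}^{N} E[S_k^2] \preceq \frac{1}{N}X.$$

Observe that $\|X\| = \|E[\phi\phi^\top]\|_2 \le E[\|\phi\phi^\top\|] = E\|\phi\|^2 \le 1$. Since the variance statistic satisfies $V(Z) \le \frac{1}{N}\|X\|$, appealing to the matrix Bernstein Inequality, we have that

$$P\{\|Z\| \ge t\} \le 2d \exp\Big(\frac{-t^2/2}{\frac{1}{N}\|X\| + \frac{2t}{3N}}\Big),$$

$$E[\|Z\|] \le \sqrt{\frac{2}{N}\|X\| \log(2d)} + \frac{2}{3N}\log(2d) \le \sqrt{\frac{2}{N}\log(2d)} + \frac{2}{3N}\log(2d).$$

That is to say, with probability at least $1 - p/2$, the following holds:

$$\|\hat X - X\| \le -\frac{2}{3N}\log\frac{p}{4d} + \sqrt{\frac{4}{9N^2}\log^2\frac{p}{4d} - \frac{2}{N}\log\frac{p}{4d}}.$$

In a nutshell, we have that

$$\begin{aligned}
\|\hat\Phi - \Phi\|_2 &= \|\hat X^{-1} - X^{-1}\|_2 = \|\hat X^{-1}(\hat X - X)X^{-1}\|_2 = \|\hat\Phi(\hat X - X)\Phi\|_2 \\
&\le \frac{2}{(\sigma^*)^2}\|\hat X - X\|_2 \le \frac{4}{\sqrt{N}(\sigma^*)^2} \cdot \Big(-\frac{2}{3N}\log\frac{p}{4d} + \sqrt{\frac{4}{9N^2}\log^2\frac{p}{4d} - \frac{2}{N}\log\frac{p}{4d}}\Big).
\end{aligned}$$

Similarly, the following inequality holds with probability at least $1 - p/2$:

$$\|\hat v - v\|_2 \le -\frac{\delta_1}{3}\log\frac{p}{2(d+1)} + \sqrt{\frac{\delta_1^2}{9}\log^2\frac{p}{2(d+1)} - 2\delta_2\log\frac{p}{2(d+1)}},$$

where $d$ is the dimension of the vector $\phi$, $\delta_1 = \frac{1}{N}\big((1 - \gamma)(\bar r_m + \bar r) + 2\gamma^m R\big)$ and $\delta_2 = \|E[\hat v - v]\|_2$ satisfies

$$\delta_2 \le \frac{1}{N}\big[(1 - \gamma)\big(|\bar r_m|(|\bar r_m - \bar r| + \gamma^m R|\bar r_m - \bar r|)\big)\big] \le \frac{1 - \gamma}{N}\big[r_{\max} + \gamma^m R\big]|\bar r_m - \bar r|.$$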
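The high-probability radius for $\|\hat X - X\|$ above is the positive root obtained by setting the matrix Bernstein tail equal to $p/2$. The sketch below computes that radius and plugs it back into the tail expression as a consistency check. The function names are hypothetical, and the default `x_norm=1.0` uses the bound $\|X\| \le 1$ from the text.

```python
import math

def bernstein_radius(N, d, p, x_norm=1.0):
    """Radius t such that 2d * exp(-(t^2/2) / (x_norm/N + 2t/(3N))) = p/2."""
    log_term = math.log(p / (4.0 * d))           # negative for p < 4d
    return -(2.0 / (3.0 * N)) * log_term + math.sqrt(
        (4.0 / (9.0 * N ** 2)) * log_term ** 2 - (2.0 / N) * x_norm * log_term)

def bernstein_tail_matrix(t, N, d, x_norm=1.0):
    """Matrix Bernstein tail bound on P(||Z|| >= t)."""
    return 2.0 * d * math.exp(-(t ** 2 / 2.0) / (x_norm / N + 2.0 * t / (3.0 * N)))

# Illustrative values: N samples of d-dimensional features, failure level p.
N, d, p = 1000, 8, 0.05
radius = bernstein_radius(N, d, p)
check = bernstein_tail_matrix(radius, N, d)      # should equal p / 2
```

The radius shrinks roughly like $\sqrt{\log(d/p)/N}$ for large $N$, which is the source of the sample-size dependence in the Critic error bound.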
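Summarizing, we have that

$$\begin{aligned}
\|\omega_{t+1} - \bar\omega_{t+1}\|_2 &\le \|\Phi\|_2 \cdot \|\hat v - v\|_2 + \|\hat\Phi - \Phi\|_2 \cdot \|\hat v\|_2 \\
&\le -\frac{\delta_1}{3\sigma^*}\log\frac{p}{2(d+1)} + \sqrt{\frac{\delta_1^2}{9}\log^2\frac{p}{2(d+1)} - 2\delta_2\log\frac{p}{2(d+1)}} \\
&\quad + \frac{4\big((1 - \gamma)\bar r_m + \gamma^m R\big)}{\sqrt{N}(\sigma^*)^2}\Big(-\frac{2}{3N}\log\frac{p}{4d} + \sqrt{\frac{4}{9N^2}\log^2\frac{p}{4d} - \frac{2}{N}\log\frac{p}{4d}}\Big),
\end{aligned}$$

which indicates that with probability at least $1 - p$,

$$|Q_{\omega_{t+1}} - Q_{\bar\omega_{t+1}}| \le \frac{4\big((1 - \gamma)\bar r_m + \gamma^m R\big)}{\sqrt{N}(\sigma^*)^2}\Big(-\frac{2}{3N}\log\frac{p}{4d} + \sqrt{\frac{4}{9N^2}\log^2\frac{p}{4d} - \frac{2}{N}\log\frac{p}{4d}}\Big) \triangleq \epsilon_Q. \qquad (22)$$

Remark. Consider the case when Assumption 1 does not hold, i.e., $\inf_{\omega \in \Omega} E_{\rho_{\pi_\theta}}\big[(T^{\pi_\theta})^m Q_\omega - \tilde\omega^\top \phi(s, a)\big] = c_1$, where $c_1 > 0$ is a constant. Let $\bar\omega_{t+1} = \Gamma_R(\tilde\omega_{t+1})$, and recall that $\tilde\omega$ denotes the solution of Eqn. (8) and $\omega$ denotes the sample-based solution. Then we have $|Q_{\tilde\omega_{t+1}} - Q_{\bar\omega_{t+1}}| = c_1$. From Eqn. (22), we obtain that $|Q_{\omega_{t+1}} - Q_{\bar\omega_{t+1}}| \le \epsilon_Q$. Then the difference between the sample-based solution and the underlying true solution of Eqn. (8) is $|Q_{\omega_{t+1}} - Q_{\tilde\omega_{t+1}}| \le \epsilon_Q + c_1$. Note that when Assumption 1 holds, $Q_{\tilde\omega_{t+1}} = Q_{\bar\omega_{t+1}}$.

D PROOF OF LEMMA 1

Recall $\beta = f_{k,t} + C_{k,t} - E[C_{k,t}] - E[f_{k,t}]$. We also have the following definitions:

$$C_{k,t,1} \triangleq \frac{1}{l}\sum_{i=1}^{l}\big(Q_{\omega_{t+1}}(s_i, a_i)\nabla_\theta \pi_{\theta_k}(a_i|s_i) - Q_{\bar\omega_{t+1}}(s_i, a_i)\nabla_\theta \pi_{\theta_k}(a_i|s_i)\big),$$

$$C_{k,t,2} \triangleq \frac{1}{l}\sum_{i=1}^{l}\big(Q_{\bar\omega_{t+1}}(s_i, a_i)\nabla_\theta \pi_{\theta_k}(a_i|s_i) - Q^{\pi_{\theta_t}}(s_i, a_i)\nabla_\theta \pi_{\theta_k}(a_i|s_i)\big),$$

$$f_{k,t} \triangleq \frac{1}{l}\sum_{i=1}^{l} Q^{\pi_{\theta_t}}(s_i, a_i)\nabla_\theta \pi_{\theta_k}(a_i|s_i), \qquad C_{k,t} \triangleq C_{k,t,1} + C_{k,t,2}.$$

Next we evaluate $E\big[\|f_{k,t} + C_{k,t} - E[C_{k,t}] - E[f_{k,t}]\|^2\big]$ as follows:

$$\begin{aligned}
\|f_{k,t} + C_{k,t} - E[C_{k,t}] - E[f_{k,t}]\|^2 &= (f_{k,t} + C_{k,t})(f_{k,t} + C_{k,t})^\top + (E[C_{k,t}] + E[f_{k,t}])(E[C_{k,t}] + E[f_{k,t}])^\top \\
&\quad - 2(f_{k,t} + C_{k,t})(E[C_{k,t}] + E[f_{k,t}])^\top \\
&\le (f_{k,t} + C_{k,t})(f_{k,t} + C_{k,t})^\top + (E[C_{k,t}] + E[f_{k,t}])(E[C_{k,t}] + E[f_{k,t}])^\top. \qquad (23)
\end{aligned}$$

Note that $C_{k,t}$ and $f_{k,t}$ are both bounded, since the Q-function is bounded and $\nabla_\theta \pi_\theta(a|s)$ is bounded (see Assumption 4), i.e.,

$$\|\nabla\pi(a|s)\| \le C_\psi, \qquad \|Q(s, a)\| \le \sum_{t=0}^{\infty} \gamma^t r_{\max} = \frac{r_{\max}}{1 - \gamma}.$$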
Then we have the following bounds for $C_{k,t}$ and $f_{k,t}$:

$$\|C_{k,t}\| \le \frac{2C_\psi r_{\max}}{1 - \gamma}, \qquad \|f_{k,t}\| \le \frac{C_\psi r_{\max}}{1 - \gamma}.$$

Taking expectation over both sides of the inequality (23), we have that

$$E[\|\beta\|^2] \le 1 \cdot \|E[C_{k,t}] + E[f_{k,t}]\|^2 + E\big[(f_{k,t} + C_{k,t})(f_{k,t} + C_{k,t})^\top\big].$$

Let $M_n = 1$ and $\sigma^2 = E\big[(f_{k,t} + C_{k,t})(f_{k,t} + C_{k,t})^\top\big]$. Then we have that

$$E[\|\beta\|^2] \le M_n \cdot \|E[C_{k,t}]\| + \sigma^2,$$

where $\sigma$ depends on the probability $p$ as indicated in Eqn. (22).

E PROOF OF LEMMA 2

Recall that $b = E[C_{k,t}]$ and

$$C_{k,t} := C_{k,t,1} + C_{k,t,2} = \frac{1}{l}\sum_{i=1}^{l}\Big[\big(Q_{\omega_{t+1}}(s_i, a_i)\nabla_\theta \pi_{\theta_k}(a_i|s_i) - Q_{\bar\omega_{t+1}}(s_i, a_i)\nabla_\theta \pi_{\theta_k}(a_i|s_i)\big) + \big(Q_{\bar\omega_{t+1}}(s_i, a_i)\nabla_\theta \pi_{\theta_k}(a_i|s_i) - Q^{\pi_{\theta_t}}(s_i, a_i)\nabla_\theta \pi_{\theta_k}(a_i|s_i)\big)\Big].$$

Next, we evaluate $\|b\|^2$. Observe that (see also Appendix D) $\|C_{k,t}\| \le \frac{2C_\psi r_{\max}}{1 - \gamma}$. Let $\zeta = 2\big\|\frac{C_\psi r_{\max}}{1 - \gamma}\big\|$. Then we have

$$\|b\|^2 = \|E[C_{k,t}]\|^2 \le E[\|C_{k,t}\|^2] \le \zeta^2.$$

F PROOF OF LEMMA 3 AND LEMMA 4

• [Lemma 3] Given the Critic parameter $\omega$ in the objective function, it can be seen that $\|\nabla h(\omega, \theta) - \nabla h(\omega, \theta')\| \le Q_{\max}\|\nabla\pi_\theta - \nabla\pi_{\theta'}\|$. Since the value function is bounded (e.g., by $Q_{\max}$) and the score function $\nabla\pi_\theta$ is $L_\psi$-smooth (ref. Assumption 6), the constant in Assumption 4 can be determined as $L = Q_{\max}L_\psi$.

• [Lemma 4] Since the objective function is finite, let $h_{\max} = \max_{\theta \neq \theta^*} h(\theta, \omega)$ and $h^*_{\max} = \max_{\theta = \theta^*} h(\theta, \omega)$. In the case when the gradient is non-zero, let $g_{\min} = \min_{\theta \neq \theta^*} \nabla h$; then we can determine $\mu = \frac{g_{\min}}{h^*_{\max} - h_{\max}} \ge 0$.

G PROOF OF PROPOSITION 2

Observe that the Actor update is a stochastic gradient method (SGD) with biased gradients. For simplicity, we adopt the following notation to study the Actor update:

$$\theta_{k+1} = \theta_k + \alpha\big(\nabla h(\omega, \theta_k) + b(t) + \beta(t)\big), \qquad (24)$$

where $b(t) = E[C_{k,t}]$ is the bias, $\alpha$ is the step size, and $\beta = f_{k,t} + C_{k,t} - E[C_{k,t}] - E[f_{k,t}]$ is the zero-mean noise. Note that the objective function $h(\omega, \theta_k)$ is a function of $\theta$. Denote the optimal value (in this iteration of the Actor update) by $h(\omega, \theta^*)$. We prove the following lemma, a modified version of the descent lemma for smooth functions (cf. (Ajalloeian & Stich, 2020; Nesterov, 2003)).

Lemma 10. Suppose Assumptions 3 and 4 hold. Then, for any stepsize $\alpha \le \frac{1}{(M_n + 1)L}$, the following inequality holds with probability at least $1 - p$:

$$E\big[h(\omega, \theta_{k+1}) - h(\omega, \theta_k) \,\big|\, \theta_k\big] \le \frac{\alpha}{2}\zeta^2 + \frac{\alpha^2 L}{2}\sigma^2 - \frac{\alpha}{2}\|\nabla h(\omega, \theta_k)\|^2.$$

Observe that under the PL-condition (Assumption 4), we have the following recursion:

$$E\big[h(\omega, \theta_{k+1}) - h(\omega, \theta^*)\big] \le (1 - \alpha\mu)\,E\big[h(\omega, \theta_k) - h(\omega, \theta^*)\big] + \frac{\alpha}{2}\zeta^2 + \frac{\alpha^2 L}{2}\sigma^2. \qquad (25)$$

By applying Eqn. (25) recursively, we obtain the desired results in Proposition 2.
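For completeness, one standard way to carry out the recursive application is the geometric-series unrolling below. It is a sketch under two stated assumptions: $F_k \ge 0$ denotes the expected optimality gap of the Actor objective after $k$ inner steps (a shorthand introduced here), and $0 < \alpha\mu \le 1$. Whether Proposition 2 is stated in exactly this form is not confirmed by the text in this appendix.

```latex
% Shorthand (assumption): F_k >= 0 is the expected optimality gap after k steps,
% and c collects the bias and noise terms of the one-step recursion.
\begin{aligned}
F_{k+1} &\le (1-\alpha\mu)\,F_k + c,
\qquad c := \tfrac{\alpha}{2}\zeta^2 + \tfrac{\alpha^2 L}{2}\sigma^2, \\
F_K &\le (1-\alpha\mu)^K F_0 + c\sum_{j=0}^{K-1}(1-\alpha\mu)^j \\
    &\le (1-\alpha\mu)^K F_0 + \frac{c}{\alpha\mu}
     = (1-\alpha\mu)^K F_0 + \frac{\zeta^2}{2\mu} + \frac{\alpha L\sigma^2}{2\mu}.
\end{aligned}
```

The first term decays geometrically in the number of gradient steps $K$, the noise term can be reduced through the step size $\alpha$, and the bias term $\zeta^2/(2\mu)$ remains as a floor that more gradient steps alone do not remove.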

H PROOF OF PROPOSITION 3

We first prove the following lemma on the relation between the Actor parameter $\theta$ and the objective function $h(\omega, \theta)$.

Lemma 11. There exist a constant $L_h > 0$ and an open ball $S_\epsilon(\theta^*_t)$ such that for any $\theta_t \in S_\epsilon(\theta^*_t)$ the following holds for any $t > 0$:

$$E[\|\pi_{\theta_t} - \pi^*\|_{TV}] \le L_h\, E[h(\omega, \theta^*_t) - h(\omega, \theta_t)].$$

Proof. By Taylor's expansion, we have

$$h(\omega, \theta^*) = h(\omega, \theta_t) + \nabla h(\omega, \theta_t)(\theta^*_t - \theta_t) + o(\|\theta^*_t - \theta_t\|).$$

Since $h(\omega, \cdot)$ satisfies the Polyak-Lojasiewicz condition, it follows that $\|\nabla h(\omega, \theta)\| \ge \sqrt{2\mu(h(\omega, \theta^*) - h(\omega, \theta))} := L_g$ for all $\theta$. Note that $L_g > 0$ when $\theta \neq \theta^*$. Then we have that

$$\begin{aligned}
h(\omega, \theta^*_t) - h(\omega, \theta_t) &= |\nabla h(\omega, \theta_t)(\theta^* - \theta_t) + o(\|\theta^* - \theta_t\|)| \\
&\ge |\nabla h(\omega, \theta_t)(\theta^*_t - \theta_t)| - |o(\|\theta^* - \theta_t\|)| \\
&\ge L_g\|\theta^*_t - \theta_t\| - L_o\|\theta^*_t - \theta_t\| \\
&= (L_g - L_o)\|\theta^*_t - \theta_t\|,
\end{aligned}$$

where the last inequality uses the fact that there exists $\epsilon$ such that when $\|\theta_t - \theta^*_t\| \le \epsilon$, we have $|o(\|\theta^*_t - \theta_t\|)| \le L_o\|\theta^*_t - \theta_t\|$ with $L_o < L_g$. Taking expectation over both sides gives

$$E[h(\omega, \theta^*_t) - h(\omega, \theta_t)] \ge (L_g - L_o)E[\|\theta^*_t - \theta_t\|] \ge (L_g - L_o)\|E[\theta^*_t - \theta_t]\|.$$

Then we conclude that the parameter of interest is $L_h = \frac{C_\pi}{L_g - L_o} > 0$, where $C_\pi$ is defined in Assumption 4.

We are now ready to present the proof of Proposition 3. Based on the definitions of $E_{\hat J,t}$ and $E_{\hat T,t}$, we derive the upper bound for each term respectively.

$$E_{\hat J,t} = (I - \gamma P_{\hat\pi_{t+1}})^{-1} - (I - \gamma P_{\pi_{t+1}})^{-1} = (I - \gamma P_{\hat\pi_{t+1}})^{-1}\big[\gamma P_{\hat\pi_{t+1}} - \gamma P_{\pi_{t+1}}\big](I - \gamma P_{\pi_{t+1}})^{-1}.$$

Observe that the value function $v$ is smooth and upper bounded. We denote the smoothness parameter by $L_v$, the upper bound by $\|v\| \le V_{\max}$, and the smoothness parameter of the reward function by $L_r$. By taking the norm of both sides and applying Assumptions 3, 4 and 5, we obtain

$$\|E_{\hat J,t}\| \le M^2 L_J L_v \|\hat\pi_{t+1} - \pi_{t+1}\|_{TV}.$$

Further, observe that

$$E_{\hat T,t} = r_{\hat\pi_{t+1}} + \gamma P_{\hat\pi_{t+1}} v_{\pi_t} - (r_{\pi_{t+1}} + \gamma P_{\pi_{t+1}} v_{\pi_t}) = r_{\hat\pi_{t+1}} - r_{\pi_{t+1}} + \gamma(P_{\hat\pi_{t+1}} - P_{\pi_{t+1}})v_{\pi_t}.$$
By taking the norm of both sides and applying Assumption 5, we obtain

$$\|E_{\hat T,t}\| \le \|r_{\hat\pi_{t+1}} - r_{\pi_{t+1}}\| + \|\gamma(P_{\hat\pi_{t+1}} - P_{\pi_{t+1}})v_{\pi_t}\| \le (L_r + \gamma V_{\max})\|\hat\pi_{t+1} - \pi_{t+1}\|_{TV} := L_T^{\max}.$$

Recall that $E_t$ is given by

$$E_t = -E_{\hat J,t}\big(v_{\pi_t} - T(v_{\pi_t})\big) + J^{-1}_{v_t}E_{\hat T,t} + E_{\hat J,t}E_{\hat T,t}.$$

Taking the norm and expectation on both sides yields

$$\|E[E_t]\| \le E[\|E_t\|] = E\big[\|-E_{\hat J,t}\big(v_{\pi_t} - T(v_{\pi_t})\big) + J^{-1}_{v_t}E_{\hat T,t} + E_{\hat J,t}E_{\hat T,t}\|\big] \le L_E\, E[\|\hat\pi_{t+1} - \pi_{t+1}\|_{TV}],$$

where $L_E = (2V_{\max}K + L_T^{\max})M^2 L_v L_J + M(L_r + \gamma V_{\max}) > 0$ is a constant. Since $\pi_{t+1} = \pi^*_{t+1}$ is the greedy solution, we thereby complete the proof of Proposition 3.

I PROOF OF THEOREM 1

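Following the value function update rule, we have

$$v_{\pi_{t+1}} = v_{\pi_t} - \big(J^{-1}_{v_t}\big(v_{\pi_t} - T(v_{\pi_t})\big) + E_t\big) = v_{\pi_t} - (L(t) + E_t) := v_{\pi_t} - \hat L(t).$$

Then, the difference between $v_{\pi_{t+1}}$ and $v^*$ is given by

$$v^* - v_{\pi_{t+1}} = v^* - v_{\pi_t} + J^{-1}_{v_t}\big(v_{\pi_t} - T(v_{\pi_t})\big) + E_t. \qquad (26)$$

Observe that the following holds for any $\hat\pi_t$:

$$\big(v_{\pi_t} - T(v_{\pi_t})\big) - \underbrace{\big(v^* - T(v^*)\big)}_{=0} \ge J^{2}_{v_t}(v_{\pi_t} - v^*).$$

Recall that our decomposition of the value function update is given by

$$\hat L(t) = L(t) + \underbrace{\hat L(t) - E[\hat L(t)]}_{\text{Martingale Difference Noise: } N(t)} + \underbrace{E[\hat L(t)] - L(t)}_{\text{Bias: } B(t)}.$$

Plugging Eqn. (27) into Eqn. (26), we obtain

$$\begin{aligned}
v^* - v_{\pi_{t+1}} &= v^* - v_{\pi_t} + J^{-1}_{v_t}\big(v_{\pi_t} - T(v_{\pi_t})\big) + E_t \\
&\ge (I - J_{v_{\pi_t}})(v^* - v_{\pi_t}) + B(t) + N(t) \\
&= \gamma P_{\pi_{t+1}}(v^* - v_{\pi_t}) + B(t) + N(t).
\end{aligned}$$

Taking expectation on both sides yields

$$E[v^* - v_{\pi_{t+1}} \,|\, v_{\pi_t}] \ge \gamma P_{\pi_{t+1}}(v^* - v_{\pi_t}) + B(t).$$

Applying the above inequality recursively gives

$$\begin{aligned}
E[v^* - v_{\pi_{t+1}}] &\ge \gamma^{t+1} E\Big[\prod_{i=0}^{t} P_{\pi_{t+1-i}}\Big](v^* - v_{\pi_0}) + \sum_{i=1}^{t} \gamma^{i} E\Big[\prod_{j=0}^{i-1} P_{\pi_{t+1-j}}\Big]B(t-i) + B(t) \\
&:= \gamma^{t+1}\bar P_{t+1}(v^* - v_{\pi_0}) + \sum_{i=1}^{t} \gamma^{i}\bar P_i B(t-i) + B(t),
\end{aligned}$$

with $\bar P_{t+1} = E\big[\prod_{i=0}^{t} P_{\pi_{t+1-i}}\big]$. Taking the norm on both sides of Eqn. (28) yields the desired result.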

J PROOF OF THEOREM 2

Based on the update rule of the value function, we have

$$\begin{aligned}
v^* - v_{\pi_{t+1}} &= J^{-1}_{v_t}J_{v_t}(v^* - v_{\pi_t}) + J^{-1}_{v_t}\big(v_{\pi_t} - T(v_{\pi_t})\big) + E_t \\
&\le J^{-1}_{v_t}J_{v_t}(v^* - v_{\pi_t}) - J^{-1}_{v_t}J_{v^*}(v^* - v_{\pi_t}) + E_t \\
&= J^{-1}_{v_t}\big[J_{v_t} - J_{v^*}\big](v^* - v_{\pi_t}) + E_t,
\end{aligned}$$

which implies that

$$E[v^* - v_{\pi_{t+1}} \,|\, v_{\pi_t}] \le J^{-1}_{v_t}\big[J_{v_t} - J_{v^*}\big](v^* - v_{\pi_t}) + B(t).$$

Then, taking the norm on both sides of the inequality above gives

$$\|E[v^* - v_{\pi_{t+1}} \,|\, v_{\pi_t}]\| \le \|J^{-1}_{v_t}\big[J_{v_t} - J_{v^*}\big](v^* - v_{\pi_t}) + B(t)\|.$$

Let $H_t = \|J^{-1}_{v_t}[J_{v_t} - J_{v^*}]\|$. It follows from Assumption 5 that $H_t \le M L_J\|v_{\pi_t} - v^*\|^{q}$, where $L_J$ is defined in Assumption 5. Hence, we have that

$$\|E[v_{\pi_{t+1}} - v^*]\| \le \|J^{-1}_{v_t}\|\,\|J_{v_t} - J_{v^*}\|\,\|v_{\pi_t} - v^*\| + \|B(t)\|. \qquad (30)$$

By applying Eqn. (30) recursively, we conclude that

$$\|E[v_{\pi_{t+1}} - v^*]\| \le \prod_{i=0}^{t} H_{t-i}\,\|E[v^* - v_{\pi_0}]\| + \sum_{i=1}^{t}\Big(\prod_{j=1}^{i} H_{t-j}\Big)\|B(t-i)\| + \|B(t)\|.$$

K EXPERIMENTS

A Summary of Theoretical Results. This work aims to provide a comprehensive answer to the question of "whether and under what conditions online learning (e.g., a general algorithm like A-C) can be significantly accelerated by a warm-start policy from offline RL". Our key observations are as follows. (1) Our results in Theorem 1 and Theorem 2 point out that the warm-start policy goes hand-in-hand with the approximation error to influence the learning performance (see the table below for the summary). The intricate relationship between the warm-start policy and the approximation error can be identified by studying the structure of the bounds in Theorem 1 and Theorem 2, where the warm-start policy not only has an impact on the first term (e.g., $v^* - v_{\pi_0}$) but also on the bias propagation through $H_0 := \|J^{-1}_{v_0}[J_{v_0} - J_{v^*}]\|$. Meanwhile, the biases affect the impact of the warm-start policy (the first term in the bounds) directly through $H_t$. (2) In Theorem 1, we point out that the bias terms have a direct impact on "whether the warm-start policy is able to facilitate the online learning". For instance, "even when the warm-start policy is nearly optimal", there is still no guarantee that online fine-tuning can improve the policy much if there exist biases in the approximation errors in the online Actor and Critic updates and these biases are not dealt with properly. To clarify further, consider the case when the bias is always positive, i.e., $B(t) > 0$ for all $t \ge 0$; then the lower bound is always positive and bounded away from zero. (3) In Theorem 2, we aim to answer the question: "under what conditions can online learning be significantly accelerated by a warm-start policy?". Consider the case when the approximation error is unbiased (such that A-C can be viewed as Newton's method).
Clearly, we have

$$\|E[v_{\pi_{t+1}} - v^*]\| \le \prod_{i=0}^{t} H_{t-i}\,\|v^* - v_{\pi_0}\|,$$

which can decrease quickly as long as the warm-start policy $\pi_0$ is not far away from the optimal policy $\pi^*$ and also satisfies Assumption 5. Intuitively, this result shows that "the imperfections of the warm-start policy can be 'washed out' by the (superlinear) Newton step", and it corroborates the observation in the very recent literature (Bertsekas, 2022b). We remark that this phenomenon has not been formalized by previous works on the A-C algorithm.

| | $B(t) \to 0$ | $\|B(t)\| > 0$ |
|---|---|---|
| When the distance between $\pi_0$ and $\pi^*$ is small | The warm-start can facilitate the online convergence (Theorem 2) (Empirical studies: Silver et al. (2017; 2018)) | Biases can throttle the convergence significantly due to the accumulation effect (Theorem 1) (Empirical studies: Uchendu et al. (2022)) |
| When the distance between $\pi_0$ and $\pi^*$ is relatively large | The imperfections of the warm-start can be "washed out" by online learning (Theorem 2, Eqn. (30)) (Empirical studies: Bertsekas (2022b)) | The warm-start policy goes hand-in-hand with the approximation error to influence the learning performance |

Empirical Results. We consider experiments on the Gridworld benchmark task. In particular, we consider the following grid sizes to represent different problem complexities: 10 × 10, 15 × 15 and 20 × 20. The goal of the agent is to find a way (policy) to travel from a specified start location, e.g., the red square in Fig. 3, to an assigned target location, e.g., the red hexagram in Fig. 3, such that the (discounted) cumulative reward along the way is maximized. Specifically, the action space contains 4 discrete actions, namely up, down, left and right, which are represented as 1, 2, 3, 4 in the algorithm, respectively. The reward in the goal state is 10 and the reward in the bad state, e.g., the black cube in Fig. 3, is -6. The rest of the states yield a reward of -1. The discount factor is set to $\gamma = 0.9$.
We consider the grid with 10 rows and 10 columns such that the state space has 100 states. The transition dynamics of the environment are as follows: the agent moves to the next state following the chosen action with probability 0.7; the agent goes to the left of the desired action with probability 0.15 and to the right with probability 0.15. For each experiment, the shaded area represents one standard deviation of the average evaluation over 5 training seeds. Specifically, we consider the following A-C algorithm to solve the Gridworld benchmark task.

Critic Update: The Critic updates its value by applying the Bellman evaluation operator $T_\pi$ for m times (m ≥ 1), i.e., given policy π, at the t-th A-C update step,

$$v(t+1) = (T_\pi)^m (v(t)). \qquad (31)$$

Actor Update: The Actor updates the policy by a greedy step to maximize the learnt value v, i.e.,

$$\pi' = \arg\max_{\pi} T_\pi (v(t+1)). \qquad (32)$$

Impact of the Warm-Start Policy. We first consider the impact of the Warm-Start policy in the ideal setting, where both the Critic update and the Actor update are nearly accurate, as in ADP. In this case, we let m be large enough, e.g., m = 1000, in the Critic update Eqn. (31). As observed in Fig. 4, a 'good' Warm-Start policy can efficiently accelerate the learning process, e.g., it takes only two iterations to converge with a Warm-Start policy. Meanwhile, in all three cases, the performance gap ∥v(t) - v*∥ decays over time, which is consistent with Theorem 2. Specifically, when the Warm-Start policy is not 'good' enough (or there is no Warm-Start at all), the A-C algorithm is still able to improve the learning performance over time (see, e.g., the first term on the right side of the upper bound in Theorem 2).

Impact of the Approximation Error in the Critic Update. We evaluate the impact of the approximation error in the Critic update on the convergence behavior using two approaches. (1) First, we study the Critic update with finite-time Bellman evaluation, e.g., m = 500, 50, 20, 5. As shown in Fig. 5, the inaccurate Critic update affects the convergence behavior as expected. The case m = 5 shows that finite-time Bellman evaluation may lead to slower convergence. (2) Next, we consider the general case when there is approximation error in the Critic update. In particular, we add uniform noise e(t) to the value function with different biases, e.g., E[e(t)] = 0, 0.5, 1, -1. We also consider the case when the bias can be either +0.5 or -0.5 during the learning process, i.e., E[e(t)] = 0.5 with probability 0.5 and E[e(t)] = -0.5 with probability 0.5. The resulting convergence behavior is presented in Fig. 6. Notably, both positive and negative biases may result in an error floor that prevents the algorithm from converging to the optimum (see, e.g., the last two terms of the lower bound in Theorem 1). The experimental results in Fig. 6 corroborate our theoretical findings in Theorems 1 and 2.

Impact of the Approximation Error in the Actor Update. We investigate the learning performance of the A-C algorithm under an inaccurate Actor update. In particular, we perturb the learnt policy in Eqn. (32) as follows,

$$\text{Policy}(s) = \begin{cases} \text{Policy}(s), & \text{with probability } p, \\ \text{randi}([1, 4]), & \text{with probability } 1 - p, \end{cases}$$

where Policy(s) denotes the action the agent should take at the current state s following the learnt policy and randi([1, 4]) is a random function that chooses among the actions 1, 2, 3, 4 uniformly. Thus, with probability p, the agent chooses the action that follows the current policy, while with probability 1 - p, the agent chooses a random action. By setting different values of p, we show in Fig. 7 that the approximation error in the Actor update may significantly degrade the learning performance. Meanwhile, Fig. 7 also indicates that decreasing the bias can help improve the learning performance (see the red and green lines in Fig. 7). This observation is also consistent with our results in Theorem 1.
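The experimental loop described above (m-step Bellman evaluation in the Critic, a greedy Actor step, and the optional perturbations) can be sketched in NumPy as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names (`build_gridworld`, `critic_update`, `actor_update`, `run_ac`), the goal and bad-state placement, the wall behavior, the noise width of the uniform perturbation, and the use of the sup-norm for ∥v(t) - v*∥ are all hypothetical choices that the text does not specify.

```python
import numpy as np

# Actions: 0=up, 1=down, 2=left, 3=right (the text numbers them 1..4).
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]
LEFT_OF = [2, 3, 1, 0]   # action obtained by veering left of the chosen one
RIGHT_OF = [3, 2, 0, 1]  # action obtained by veering right of the chosen one


def build_gridworld(n=10, slip=0.15):
    """Return transitions P[a, s, s'] and state rewards r[s] for an n x n grid."""
    S = n * n
    P = np.zeros((4, S, S))
    for s in range(S):
        row, col = divmod(s, n)
        for a in range(4):
            for b, prob in ((a, 1 - 2 * slip), (LEFT_OF[a], slip), (RIGHT_OF[a], slip)):
                dr, dc = MOVES[b]
                # Assumption: moving into a wall leaves the agent in place.
                nr = min(max(row + dr, 0), n - 1)
                nc = min(max(col + dc, 0), n - 1)
                P[a, s, nr * n + nc] += prob
    r = -np.ones(S)
    r[S - 1] = 10.0   # goal in the bottom-right corner (placement is an assumption)
    r[S // 2] = -6.0  # a single bad state (placement is an assumption)
    return P, r


def critic_update(v, pi, P, r, gamma, m):
    """Apply the Bellman evaluation operator T_pi to v for m steps (Eqn. (31))."""
    P_pi = P[pi, np.arange(len(pi))]  # row s is P[pi[s], s, :]
    for _ in range(m):
        v = r + gamma * P_pi @ v
    return v


def actor_update(v, P, r, gamma, p, rng):
    """Greedy step (Eqn. (32)); each state keeps the greedy action with probability p."""
    q = r[None, :] + gamma * P @ v  # shape (4, S)
    greedy = q.argmax(axis=0)
    keep = rng.random(len(greedy)) < p
    return np.where(keep, greedy, rng.integers(0, 4, size=len(greedy)))


def optimal_value(P, r, gamma, iters=2000):
    """Reference v* computed by plain value iteration."""
    v = np.zeros(len(r))
    for _ in range(iters):
        v = (r[None, :] + gamma * P @ v).max(axis=0)
    return v


def run_ac(P, r, gamma=0.9, m=1000, p=1.0, bias=None, pi0=None, steps=20, seed=0):
    """Return the gap ||v(t) - v*|| (sup-norm) after each Critic update."""
    rng = np.random.default_rng(seed)
    S = len(r)
    v_star = optimal_value(P, r, gamma)
    # pi0=None means no warm start: begin from a uniformly random policy.
    pi = rng.integers(0, 4, size=S) if pi0 is None else np.asarray(pi0)
    v, gaps = np.zeros(S), []
    for _ in range(steps):
        v = critic_update(v, pi, P, r, gamma, m)
        if bias is not None:
            # Uniform Critic noise with mean `bias`; the width 1.0 is an assumption.
            v = v + bias + rng.uniform(-0.5, 0.5, size=S)
        gaps.append(float(np.max(np.abs(v - v_star))))
        pi = actor_update(v, P, r, gamma, p, rng)
    return gaps


if __name__ == "__main__":
    P, r = build_gridworld(10)
    print(run_ac(P, r, m=1000)[-1])            # nearly exact updates
    print(run_ac(P, r, m=1000, bias=0.5)[-1])  # biased Critic noise
```

Passing a policy through `pi0` plays the role of the warm start, while small `m`, a non-`None` `bias`, or `p < 1` correspond to the three sources of inaccuracy studied in the experiments. With exact updates the recorded gap contracts by at least a factor γ per step, which is a standard property of policy iteration.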

L OFF-POLICY A-C ALGORITHM AS NEWTON'S METHOD WITH PERTURBATION

We note that the Actor and Critic updates in Eqn. (9) and Eqn. (8) are a general template that admits both off- and on-policy methods. More specifically, denote the target policy by $\pi_{tar}$ and the behavior policy by $\pi_{bhv}$. When the off-policy method is used, the expectations in Eqn. (9) and Eqn. (8) are taken over the state-action distribution $\rho^{\pi_{bhv}}$ induced by the behavior policy; for instance, the Critic update is given by

$$\omega_{t+1} \leftarrow \arg\min_{\omega} \mathbb{E}_{(s,a)\sim\rho^{\pi_{bhv}}} \Big[ \big( Q^{\omega,\pi_{tar}}_{t+1}(s,a) - \omega^\top \phi(s,a) \big)^2 \Big].$$

This is in contrast to the updates given below when the on-policy method is used:

$$\omega_{t+1} \leftarrow \arg\min_{\omega} \mathbb{E}_{(s,a)\sim\rho^{\pi_{tar}}} \Big[ \big( Q^{\omega,\pi_{tar}}_{t+1}(s,a) - \omega^\top \phi(s,a) \big)^2 \Big], \qquad \pi_{t+1} \leftarrow \arg\max_{\pi} \mathbb{E}_{(s,a)\sim\rho^{\pi_{tar}}} \Big[ Q^{\omega_{t+1},\pi_{tar,t}}(s,a) \Big].$$

• One major challenge of the off-policy analysis lies in the fact that the behavior policy can be arbitrary (Sutton et al., 1999; Sutton & Barto, 2018) and hence it is impossible to develop a unifying framework. For example, the behavior policy can be obtained from human demonstrations (a similar idea is used in an early version of AlphaGo), derived from the target policy as in Q-learning/DQN, or derived from a previous behavior policy. Meanwhile, the key drawback of off-policy methods is that they do not interact stably with function approximation and generally have greater variance and a slower convergence rate (Sutton & Barto, 2018). In this regard, modern off-policy deep RL requires techniques such as growing batch learning, importance sampling or ensemble methods to stabilize the algorithm. Thus, for ease of exposition, we only include the on-policy analysis in our work.

• Our framework and theoretical results can be applied to the off-policy setting with an extra assumption on the behavior policy. In particular, we assume that the behavior policy is in the neighborhood of the target policy, i.e., in each Actor and Critic update step, $\|E_{bhv\text{-}tar,t}\| := \|\pi_{tar,t} - \pi_{bhv,t}\| \le C_{bt}$, where $C_{bt} \ge 0$ is a constant. In this way, we can write the A-C update in the off-policy setting as Newton's method with perturbation, i.e.,

$$v_{\pi_{tar,t+1}} = v_{\pi_{tar,t}} - \Big( J^{-1}_{v_{\pi_{tar,t}}} \big( v_{\pi_{tar,t}} - T(v_{\pi_{tar,t}}) \big) - E_t \Big),$$

where $E_t$ is the perturbation that captures the approximation errors from the Actor update, the Critic update and the behavior policy. Explicitly, the perturbation has the following form,

$$E_t = E_{v,t} + E_{\hat{J},t} \big( v_{\pi_{t+1}} - (r_{\pi_{t+1}} + \gamma P_{\pi_{t+1}} v_{\pi_{t+1}}) \big) - J^{-1}_{v_t} \big( E_{r,t} + E_{bhv\text{-}tar,t} + \gamma (E_{P,t} + E_{bhv\text{-}tar,t}) v_{\pi_t} \big).$$

Thus, the off-policy analysis is similar to the on-policy case but with the additional 'error' induced by the behavior policy.
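To make the Newton's-method view concrete, the following sketch checks numerically, on a small random MDP, that one exact policy-iteration step coincides with the Newton step $v - J_v^{-1}(v - T(v))$, where $J_v = I - \gamma P_{\pi}$ for the policy $\pi$ that is greedy with respect to $v$, and that a perturbation $E$ shifts the iterate by exactly $E$. The random MDP, its sizes and the helper names (`greedy`, `policy_value`, `newton_step`) are assumptions made for illustration; the perturbation here is an arbitrary vector rather than the structured $E_t$ above.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 6, 3, 0.9

# Hypothetical random MDP: P[a, s, s'] are transition probabilities, r[a, s] rewards.
P = rng.random((A, S, S))
P /= P.sum(axis=2, keepdims=True)
r = rng.random((A, S))
idx = np.arange(S)


def greedy(v):
    """Policy that attains the maximum in the Bellman optimality operator T(v)."""
    return (r + gamma * P @ v).argmax(axis=0)


def policy_value(pi):
    """Exact value of pi: solve (I - gamma * P_pi) v = r_pi."""
    return np.linalg.solve(np.eye(S) - gamma * P[pi, idx], r[pi, idx])


def newton_step(v, E=0.0):
    """One step v - (J^{-1}(v - T(v)) - E), with J = I - gamma * P_pi at the greedy pi."""
    pi = greedy(v)
    P_pi, r_pi = P[pi, idx], r[pi, idx]
    J = np.eye(S) - gamma * P_pi   # Jacobian of F(v) = v - T(v) at v
    Tv = r_pi + gamma * P_pi @ v   # T(v) equals T_pi(v) for the greedy pi
    return v - (np.linalg.solve(J, v - Tv) - E)


v = rng.random(S)
v_newton = newton_step(v)
v_policy_iteration = policy_value(greedy(v))
print(np.max(np.abs(v_newton - v_policy_iteration)))  # agreement up to round-off

E = 0.1 * rng.standard_normal(S)
v_perturbed = newton_step(v, E)  # equals the exact step shifted by E
```

The agreement follows from the identity $J_v v - (v - T(v)) = r_\pi$, so the unperturbed step returns $J_v^{-1} r_\pi = v_\pi$.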



We remark that our analysis framework and theoretical results can be applied to the off-policy setting with an extra assumption on the behavior policy. We include the details in Appendix L.



Figure 1: Illustration of the error propagation effect in the A-C method: the approximation errors from the Critic update (E c ) and the Actor update (E a ) are carried forward and may get amplified due to accumulation. (To distinguish the approximation errors of the Critic update from those of the Actor update, we use the tilde symbol (˜) above variables, such as the policy π and the value vector v, to represent the policy and the value vector obtained in the presence of Critic update error. We use the hat symbol (ˆ) above variables to represent the results with approximation error in the Actor update.)

(a) 10 × 10 Gridworld. (b) 15 × 15 Gridworld. (c) 15 × 15 Gridworld.

Figure 3: Gridworld benchmark with different sizes. The colors specify the 'goodness' measure of each state, i.e., darker cubes have lower v(s) values and the agent should avoid those areas. The horizontal and vertical lines in each cube point in the direction the agent should take, i.e., the policy at every state. Fig. 3(a), Fig. 3(b) and Fig. 3(c) show the learning results after 50 iterations of the A-C update.

(a) 10 × 10 Gridworld. (b) 15 × 15 Gridworld. (c) 15 × 15 Gridworld.

Figure 4: The impact of the Warm-Start policy when there are no approximation errors in the Actor update and the Critic update. The convergence behavior is shown for different initial policies, i.e., a random policy (no Warm-Start) and Warm-Start policies obtained by running the A-C algorithm for one iteration and for two iterations. The x-axis represents the A-C update step and the y-axis is the value of the norm ∥v(t) - v*∥.

(a) 10 × 10 Gridworld. (b) 15 × 15 Gridworld. (c) 20 × 20 Gridworld.

Figure 5: Learning performance vs. rollout length.

Figure 6: Illustration of the lower bound in Theorem 1.

Figure 7: Convergence behavior vs. Approximation Error in the Actor Update.


Appendix

A EXAMPLES IN SECTION 2.2

In this section, we elaborate further on the illustrative example in Section 2.2. We use the notation defined in Figure 1.

Figure 8: Illustration of the theoretical analysis, covering the lower bound and the upper bound.

In Section 3, we add Fig. 8, an illustration of the theoretical analysis, to make this work more accessible to a broader audience. In Section 4, we add Table 1 to summarize the key observations from our main results (Theorem 1 and Theorem 2).

Table 1: The learning performance given different warm-start policies and bias settings.

