BACKSTEPPING TEMPORAL DIFFERENCE LEARNING

Abstract

Off-policy learning ability is an important feature of reinforcement learning (RL) for practical applications. However, even one of the most elementary RL algorithms, temporal-difference (TD) learning, is known to suffer from divergence when the off-policy scheme is used together with linear function approximation. To overcome this divergent behavior, several off-policy TD-learning algorithms have been developed, including gradient-TD learning (GTD) and TD-learning with correction (TDC). In this work, we provide a unified view of such algorithms from a purely control-theoretic perspective and propose a new convergent algorithm. Our method relies on the backstepping technique, which is widely used in nonlinear control theory. Finally, the convergence of the proposed algorithm is experimentally verified in environments where standard TD-learning is known to be unstable.

1. INTRODUCTION

Since Mnih et al. (2015) demonstrated that deep reinforcement learning (RL) outperforms humans in several video games (Atari 2600 games), significant advances have been made in RL theory and algorithms. For instance, Van Hasselt et al. (2016); Lan et al. (2020); Chen et al. (2021) proposed variants of the so-called deep Q-network (Mnih et al., 2015) that achieve higher scores in Atari games than the original deep Q-network. An improved deep RL agent was developed in Badia et al. (2020) that exceeds average human scores across 57 Atari games. Beyond video games, Schrittwieser et al. (2020) showed that an RL agent can self-learn chess, Go, and Shogi. Furthermore, RL has shown great success in real-world applications, e.g., robotics (Kober et al., 2013), healthcare (Gottesman et al., 2019), and recommendation systems (Chen et al., 2019). Despite the practical success of deep RL, there is still a gap between theory and practice. One of the notorious phenomena is the deadly triad (Sutton & Barto, 2018), the divergence of algorithms when function approximation, off-policy learning, and bootstrapping are used together. One of the most fundamental algorithms, temporal-difference (TD) learning (Sutton, 1988), is known to diverge under the deadly triad, and several works have tried to fix this issue over the decades. In particular, the seminal works Sutton et al. (2008; 2009) introduced GTD, gradient-TD2 (GTD2), and TDC, which are off-policy and have been proved convergent with linear function approximation. More recently, Ghiassian et al. (2020) suggested a regularized version of TDC, called TD-learning with regularized correction (TDRC), and showed its favorable features under off-policy settings. Moreover, Lee et al. (2021) developed several variants of GTD based on a primal-dual formulation.
On the other hand, backstepping control (Khalil, 2015) is a popular method for designing stable controllers for nonlinear systems with special structures. The design technique offers a wide range of stable controllers and is proved to be robust under various settings. It has been used in various fields including quadrotor helicopters (Madani & Benallegue, 2006), mobile robots (Fierro & Lewis, 1997), and ship control (Fossen & Strand, 1999). Using the backstepping control technique, in this paper we develop a new convergent off-policy TD-learning algorithm which is a single time-scale algorithm. In particular, the goal of this paper is to introduce a new unifying framework for designing off-policy TD-learning algorithms under linear function approximation. The main contributions are summarized as follows:

• We propose a systematic way to generate off-policy TD-learning algorithms, including GTD2 and TDC, from a control-theoretic perspective.
• Using our framework, we derive a new TD-learning algorithm, which we call backstepping TD (BTD).
• We experimentally verify its convergence and performance under various settings, including cases where off-policy TD is known to be unstable.

Most of the previous works on off-policy TD-learning algorithms (e.g., GTD2 and TDC) are derived from an optimization perspective, starting with an objective function; convergence is then proved by establishing stability of the corresponding O.D.E. models. In this paper, we follow the reversed steps and reveal that an off-policy TD-learning algorithm (called backstepping TD) can be derived from control-theoretic motivations. In particular, we first develop stable O.D.E. models using the backstepping technique, and then recover the corresponding off-policy TD-learning algorithms. The new analysis reveals connections between off-policy TD-learning and notions in control theory, and provides additional insights on off-policy TD-learning through simple concepts in control theory.
The sound theoretical foundation established in this paper can potentially motivate further analysis and the development of new algorithms. Finally, we briefly summarize TD-learning algorithms that guarantee convergence under linear function approximation. GTD (Sutton et al., 2008), GTD2, and TDC (Sutton et al., 2009) were developed to approximate the gradient of the mean squared projected Bellman error. Later, GTD and GTD2 were discovered to solve a minimax optimization problem (Macua et al., 2014; Liu et al., 2020). This saddle-point viewpoint of GTD has led to many interesting results, including Du et al. (2017); Dai et al. (2018); Lee et al. (2021). TDRC (Ghiassian et al., 2020) adds a term similar to a regularization term to one side of the parameter update, and tries to balance the performance of TD with the stability of TDC. TDC++ (Ghiassian et al., 2020) adds a regularization term to both sides of the parameter update. Even though TDRC shows good performance, it requires an additional condition on its parameters to ensure convergence, whereas TDC++ does not.

2. PRELIMINARIES

2.1. NONLINEAR SYSTEM THEORY

Nonlinear system theory will play an important role throughout this paper. Here, we briefly review the basics of nonlinear systems. Let us consider the continuous-time nonlinear system

$\dot{x}_t = f(x_t, u_t), \quad x_0 \in \mathbb{R}^n,$  (1)

where $x_0 \in \mathbb{R}^n$ is the initial state, $t \in \mathbb{R}$, $t \geq 0$ is the time, $x_t \in \mathbb{R}^n$ is the state, $u_t \in \mathbb{R}^m$ is the control input, and $f : \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^n$ is a nonlinear mapping. An important concept in dealing with nonlinear systems is the equilibrium point. Considering the state-feedback law $u_t = \mu(x_t)$, the closed-loop system can be written as $\dot{x}_t = f(x_t, \mu(x_t)) =: \tilde{f}(x_t)$, and a point $x = x^e$ in the state space is said to be an equilibrium point of (1) if, whenever the state starts at $x^e$, it remains at $x^e$ (Khalil, 2015). For $\dot{x}_t = \tilde{f}(x_t)$, the equilibrium points are the real roots of the equation $\tilde{f}(x) = 0$. The equilibrium point $x^e$ is said to be globally asymptotically stable if for any initial state $x_0 \in \mathbb{R}^n$, $x_t \to x^e$ as $t \to \infty$.

An important control design problem is to construct a state-feedback law $u_t = \mu(x_t)$ such that the origin becomes the globally asymptotically stable equilibrium point of (1). To design a state-feedback law meeting this goal, the control Lyapunov function plays a central role, defined as follows.

Definition 2.1 (Control Lyapunov function (Sontag, 2013)). A positive definite function $V : \mathbb{R}^n \to \mathbb{R}$ is called a control Lyapunov function (CLF) if for all $x \neq 0$, there exists a corresponding control input $u \in \mathbb{R}^m$ that satisfies the inequality $\nabla_x V(x)^\top f(x, u) < 0$.

Once such a CLF is found, it guarantees the existence of a control law that stabilizes the system. Moreover, the corresponding state-feedback control law can be extracted from the CLF, e.g., $\mu(x) = \arg\min_u \nabla_x V(x)^\top f(x, u)$, provided that the minimum exists and is unique.
The concept of the control Lyapunov function will be used in the derivations of our main results. For the autonomous system $\dot{x}_t = \tilde{f}(x_t)$ and a Lyapunov function $V : \mathbb{R}^n \to \mathbb{R}$, the Lie derivative is defined as $\mathcal{L}_{\tilde{f}} V(x) := \nabla_x V(x)^\top \tilde{f}(x)$, so that $\dot{V}(x_t) = \mathcal{L}_{\tilde{f}} V(x_t)$ along the solution.
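To make these definitions concrete, the following toy sketch (our own illustration, not from the paper) numerically checks the CLF inequality for the scalar system $\dot{x} = x + u$ with the candidate $V(x) = x^2/2$ and the stabilizing law $\mu(x) = -2x$, then Euler-integrates the closed loop:

```python
import numpy as np

# Scalar system: xdot = f(x, u) = x + u (unstable when u = 0).
f = lambda x, u: x + u
# Candidate CLF V(x) = x^2/2 and state-feedback law mu(x) = -2x.
mu = lambda x: -2.0 * x

# CLF condition: grad V(x) * f(x, mu(x)) = x * (x - 2x) = -x^2 < 0 for x != 0.
for x in np.linspace(-5.0, 5.0, 101):
    if x != 0:
        assert x * f(x, mu(x)) < 0   # Lie derivative along the closed loop is negative

# Euler-integrate the closed loop xdot = -x; the origin is attracting.
x, dt = 3.0, 1e-3
for _ in range(10_000):              # integrate up to t = 10
    x += dt * f(x, mu(x))
print(abs(x))                        # far smaller than the initial |x_0| = 3
```

Since the closed loop is $\dot{x} = -x$, the state decays like $3e^{-t}$, which the simulation reproduces.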

2.2. STOCHASTIC APPROXIMATION AND O.D.E. APPROACH

Including Q-learning (Watkins & Dayan, 1992) and TD-learning (Sutton, 1988), many reinforcement learning algorithms can be considered as stochastic approximation (Robbins & Monro, 1951) schemes described by

$x_{k+1} = x_k + \alpha_k (f(x_k) + \epsilon_k),$  (2)

where $f : \mathbb{R}^n \to \mathbb{R}^n$ is a nonlinear mapping and $\epsilon_k$ is an i.i.d. noise. The Borkar and Meyn theorem (Borkar & Meyn, 2000) is a well-known method to bridge the asymptotic convergence of stochastic approximation and the stability of its corresponding O.D.E. model,

$\dot{x}_t = f(x_t), \quad x_0 \in \mathbb{R}^n,$  (3)

where $x_0 \in \mathbb{R}^n$ is the initial state and $t \in \mathbb{R}$, $t \geq 0$ is the time. The theorem states that under the conditions in Assumption 7.1 in the Appendix, global asymptotic stability of the O.D.E. (3) leads to asymptotic convergence of the stochastic approximation update (2), which is formally stated in the following lemma.

Lemma 2.1 (Borkar and Meyn theorem (Borkar & Meyn, 2000)). Suppose that Assumption 7.1 in the Appendix holds, and consider the stochastic approximation in (2). Then, for any initial $x_0 \in \mathbb{R}^n$, $\sup_{k \geq 0} \|x_k\| < \infty$ with probability one. In addition, $x_k \to x^e$ as $k \to \infty$ with probability one, where $x^e$ is the unique equilibrium point of the O.D.E. in (3).

The main idea of the Borkar and Meyn theorem is as follows: the iterates of a stochastic recursive algorithm follow the solution of the corresponding O.D.E. in the limit when the step-size satisfies the so-called Robbins-Monro condition (Robbins & Monro, 1951) in (33) in the Appendix. Therefore, by proving asymptotic stability of the O.D.E., we can establish convergence of the original algorithm. In this paper, we will use an O.D.E. model of TD-learning, which is expressed as a linear time-invariant system.
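The lemma can be illustrated with a minimal sketch (our own example; the function $f$ and noise level are chosen for illustration): the ODE $\dot{x} = -(x - 2)$ has the globally stable equilibrium $x^e = 2$, and the noisy iterates (2) with Robbins-Monro step-sizes $\alpha_k = 1/(k+1)$ track it:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stochastic approximation x_{k+1} = x_k + a_k (f(x_k) + eps_k) with
# f(x) = -(x - 2): the ODE xdot = f(x) has x^e = 2 globally asymptotically stable.
f = lambda x: -(x - 2.0)

x = 10.0
for k in range(200_000):
    alpha = 1.0 / (k + 1)            # Robbins-Monro: sum diverges, sum of squares converges
    noise = rng.normal(scale=0.5)    # i.i.d. zero-mean noise eps_k
    x += alpha * (f(x) + noise)

print(x)                             # close to the ODE equilibrium x^e = 2
```

Here the iterates reduce to a running average of noisy targets, so the noise is averaged out exactly as the theorem predicts.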

2.3. BACKSTEPPING CONTROL

This section provides the concept of backstepping control (Kokotovic, 1992; Khalil, 2015), which will be the main tool in this paper to derive TD-learning algorithms. The backstepping technique is a popular tool for generating a CLF (control Lyapunov function) for nonlinear systems with specific structures. In particular, let us start with the following general nonlinear system:

$\dot{y}_t = f(y_t) + g(y_t) x_t,$  (4)
$\dot{x}_t = u_t,$

where $y_t \in \mathbb{R}^m$, $x_t \in \mathbb{R}^m$ are the states, $u_t \in \mathbb{R}^m$ is the input, and $f : \mathbb{R}^m \to \mathbb{R}^m$ and $g : \mathbb{R}^m \to \mathbb{R}$ are continuous functions. The first system is a nonlinear system with a particular affine structure, and the second system is simply an integrator. The pair can be seen as a cascade interconnection of two systems, where the state of the second system is injected into the input of the first system. The backstepping technique gives us a systematic way to generate a CLF for such nonlinear systems provided that the first system admits a CLF on its own. To this end, suppose that the first system admits a CLF. Through the backstepping approach, designing a stable control law for the above system can be summarized in the following steps:

Step 1. Consider $x_t$ in (4) as a virtual input $\bar{x}(y_t)$ (a state-feedback controller), and consider the system $\dot{y}_t = f(y_t) + g(y_t)\bar{x}(y_t)$. Design $\bar{x}(y_t)$ such that this system admits a CLF $V$, i.e., a positive definite and radially unbounded function $V$ whose time derivative is negative definite: $\dot{V}(y_t) < 0$ for all $y_t \neq 0$.

Step 2. Denote the error between the virtual controller $\bar{x}(y_t)$ and the state variable $x_t$ as $z_t := x_t - \bar{x}(y_t)$. Rewrite the original O.D.E. in (4) in the new variables $(y_t, z_t)$:

$\frac{d}{dt}\begin{bmatrix} y_t \\ z_t \end{bmatrix} = \begin{bmatrix} f(y_t) + g(y_t)\bar{x}(y_t) + g(y_t) z_t \\ u_t - \dot{\bar{x}}(y_t) \end{bmatrix}.$

Step 3. Design the control input $u_t$ such that the above system is stable.
One popular choice is the CLF $V_c(y_t, z_t) := V(y_t) + \|z_t\|^2/2$, where $V(y_t)$ is defined in Step 1; then choose $u_t$ such that the time derivative of $V_c(y_t, z_t)$ is negative definite. A simple example of designing a stabilizing control law via the backstepping technique is given in Appendix Section 7.3.
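The three steps can be sketched numerically on a scalar cascade (our own illustrative example, in the spirit of the textbook ones): $\dot{y} = y^2 + x$, $\dot{x} = u$, with virtual controller $\bar{x}(y) = -y^2 - y$ (so Step 1 gives $\dot{y} = -y$ and $V(y) = y^2/2$), and $u$ chosen in Step 3 so that $\dot{V}_c = -y^2 - z^2$:

```python
import numpy as np

# Cascade (4): ydot = f(y) + g(y) x with f(y) = y^2, g(y) = 1, and xdot = u.
# Step 1: virtual controller xbar(y) = -y^2 - y gives ydot = -y, CLF V(y) = y^2/2.
xbar = lambda y: -y**2 - y

# Steps 2-3: with z = x - xbar(y) and Vc = y^2/2 + z^2/2, the choice below
# yields Vc_dot = -y^2 - z^2 (negative definite).
def u(y, z):
    ydot = -y + z                       # closed-loop ydot in (y, z) coordinates
    dxbar = (-2.0 * y - 1.0) * ydot     # d/dt xbar(y) by the chain rule
    return dxbar - y - z                # cancels cross terms in Vc_dot

# Euler simulation from a nonzero initial condition.
y, x, dt = 1.0, 2.0, 1e-3
for _ in range(20_000):                 # integrate up to t = 20
    z = x - xbar(y)
    dy = y**2 + x
    dx = u(y, z)
    y += dt * dy
    x += dt * dx
print(abs(y), abs(x - xbar(y)))         # both near zero
```

In the $(y, z)$ coordinates the closed loop is the linear system $\dot{y} = -y + z$, $\dot{z} = -y - z$, a stable spiral, so both the state and the tracking error vanish.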

2.4. MARKOV DECISION PROCESS

In this paper, we consider a Markov decision process (MDP) characterized by the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \gamma, r)$, where $\mathcal{S} := \{1, 2, \ldots, |\mathcal{S}|\}$ is the finite state space, $|\mathcal{S}|$ denotes the size of $\mathcal{S}$, $\mathcal{A} := \{1, 2, \ldots, |\mathcal{A}|\}$ is the finite action space, $|\mathcal{A}|$ is the size of $\mathcal{A}$, $\gamma \in (0, 1)$ is the discount factor, $\mathcal{P} : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$ denotes the Markov transition kernel, and $r : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ is the reward function. In particular, if an agent at state $s \in \mathcal{S}$ takes action $a \in \mathcal{A}$, then the current state transits to the next state $s' \in \mathcal{S}$ with probability $\mathcal{P}(s, a, s')$, and the agent receives reward $r(s, a, s')$. Each element of the state-to-state transition matrix under policy $\pi$, denoted by $P^\pi \in \mathbb{R}^{|\mathcal{S}| \times |\mathcal{S}|}$, is $[P^\pi]_{ij} := \sum_{a \in \mathcal{A}} \pi(a|i)\mathcal{P}(i, a, j)$, $1 \leq i, j \leq |\mathcal{S}|$, where $[P^\pi]_{ij}$ is the element in the $i$-th row and $j$-th column of $P^\pi$. Moreover, the stationary state distribution induced by policy $\mu$ is denoted by $d^\mu : \mathcal{S} \to [0, 1]$, i.e., $(d^\mu)^\top P^\mu = (d^\mu)^\top$. With the above setup, we define the matrix notations

$D^\mu := \mathrm{diag}(d^\mu(1), \ldots, d^\mu(|\mathcal{S}|)) \in \mathbb{R}^{|\mathcal{S}| \times |\mathcal{S}|}, \qquad R^\pi := \begin{bmatrix} \mathbb{E}_{a \sim \pi}[r(s, a, s') \mid s = 1] \\ \vdots \\ \mathbb{E}_{a \sim \pi}[r(s, a, s') \mid s = |\mathcal{S}|] \end{bmatrix} \in \mathbb{R}^{|\mathcal{S}|},$

where $D^\mu$ is the diagonal matrix of the state distribution induced by the behavior policy $\mu$, and each element of $R^\pi$ is the expected reward under policy $\pi$ at the corresponding state. The policy evaluation problem aims to approximate the value function at state $s \in \mathcal{S}$, $v^\pi(s) := \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^k r(S_k, A_k, S_{k+1}) \mid S_0 = s, \pi\right]$, where the trajectory is generated under policy $\pi : \mathcal{S} \times \mathcal{A} \to [0, 1]$. In this paper, we consider linear function approximation of the value function $v^\pi(s)$. In particular, we parameterize the value function as $\phi(s)^\top \xi$, where $\phi : \mathcal{S} \to \mathbb{R}^n$ is a pre-selected feature vector with $\phi(s) := [\phi_1(s) \cdots \phi_n(s)]^\top$, $\phi_1, \ldots, \phi_n : \mathcal{S} \to \mathbb{R}$ are feature functions, and $\xi \in \mathbb{R}^n$ is the learning parameter.
The goal of the policy evaluation problem is then to approximate the value function $v^\pi(s)$ using this linear parameterization, i.e., $\phi(s)^\top \xi \approx v^\pi(s)$. Moreover, using the feature matrix $\Phi := [\phi(1), \phi(2), \cdots, \phi(|\mathcal{S}|)]^\top \in \mathbb{R}^{|\mathcal{S}| \times n}$, the linear parameterization can be written in vector form as $\Phi \xi$. We also assume that $\Phi$ has full column rank throughout the paper, which is a standard assumption (Sutton et al., 2008; 2009; Ghiassian et al., 2020; Lee et al., 2021).
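As a small sketch of these objects (features and sizes are our own illustrative choices), the feature matrix stacks one feature vector per state, the full-column-rank assumption is easy to check, and a parameter vector $\xi$ induces the approximate value vector $\Phi\xi$:

```python
import numpy as np

n_states, n_feat = 5, 3
rng = np.random.default_rng(0)

# Feature matrix Phi in R^{|S| x n}: row s is phi(s)^T (illustrative random features).
Phi = rng.normal(size=(n_states, n_feat))
assert np.linalg.matrix_rank(Phi) == n_feat  # standard full-column-rank assumption

# A parameter vector xi gives the approximate value function as the vector Phi @ xi,
# i.e., v_hat[s] = phi(s)^T xi approximates v_pi(s).
xi = np.ones(n_feat)
v_hat = Phi @ xi
print(v_hat.shape)                           # one approximate value per state
```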

2.5. TEMPORAL DIFFERENCE LEARNING

This section provides a brief background on TD-learning (Sutton, 1988). Suppose that we have access to stochastic samples of the state $s_k$ from the stationary state distribution induced by the behavior policy $\mu$, i.e., $s_k \sim d^\mu(\cdot)$, and that the action is chosen under the behavior policy $\mu$, i.e., $a_k \sim \mu(\cdot|s_k)$. Then, we observe the next state $s'_k \sim \mathcal{P}(s_k, a_k, \cdot)$ and receive the reward $r_k := r(s_k, a_k, s'_k)$. Using the simplified notations $\phi_k := \phi(s_k)$ and $\phi'_k := \phi(s'_k)$ for the feature vectors, the TD-learning update at time step $k$ with linear function approximation can be expressed as $\xi_{k+1} = \xi_k + \alpha_k \rho_k \delta_k(\xi_k)\phi_k$, where $\alpha_k > 0$ is the step-size, $\delta_k(\xi_k) := r_k + \gamma \phi_k'^\top \xi_k - \phi_k^\top \xi_k$ is the temporal-difference error (TD-error), and $\rho_k := \rho(s_k, a_k) = \frac{\pi(a_k|s_k)}{\mu(a_k|s_k)}$ is the importance sampling ratio (Precup et al., 2001). The importance sampling ratio reweights the TD-error to handle the mismatch between the behavior policy $\mu$ and the target policy $\pi$. It is known that TD-learning with linear function approximation under the off-policy scheme does not guarantee convergence in general. The above stochastic approximation aims to find the fixed point of the projected Bellman equation, which, after some manipulations, can be expressed as

$\Phi^\top D^\mu \Phi \xi^* - \gamma \Phi^\top D^\mu P^\pi \Phi \xi^* = \Phi^\top D^\mu R^\pi.$  (5)

To simplify the expressions, let us introduce the notations

$A := \mathbb{E}_{s \sim d^\mu, s' \sim P^\pi(\cdot|s)}[\phi(s)(\phi(s) - \gamma\phi(s'))^\top] = \Phi^\top D^\mu \Phi - \gamma \Phi^\top D^\mu P^\pi \Phi \in \mathbb{R}^{n \times n},$
$b := \mathbb{E}_{s \sim d^\mu, a \sim \pi(\cdot|s), s' \sim \mathcal{P}(\cdot|s,a)}[r(s, a, s')\phi(s)] = \Phi^\top D^\mu R^\pi \in \mathbb{R}^{n}.$

Even though an arbitrary state distribution could be used, for simplicity we assume the stationary distribution of $\mu$. Now, we can rewrite (5) compactly as $A\xi^* = b$. (6) The corresponding O.D.E. for TD-learning can be written as $\dot{\xi}_t = -A\xi_t + b$, $\xi_0 \in \mathbb{R}^n$. Using the coordinate transform $x_k := \xi_k - \xi^*$, we obtain the O.D.E.
$\dot{x}_t = -A x_t, \quad x_0 \in \mathbb{R}^n,$

whose origin is a globally asymptotically stable equilibrium point if $\rho(s, a) = \frac{\pi(a|s)}{\mu(a|s)} = 1$ for all $(s, a) \in \mathcal{S} \times \mathcal{A}$. Throughout the paper, we will use the vector $x_k := \xi_k - \xi^*$ to represent the coordinate transform of $\xi_k$ to the origin, and will use $\xi_t$ and $x_t$ to denote the corresponding continuous-time counterparts of $\xi_k$ and $x_k$, respectively.
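The off-policy instability can be made concrete with the classic two-state counterexample in the style of Tsitsiklis and Van Roy (our own illustrative instance, not the paper's benchmark): features $\phi(1) = 1$, $\phi(2) = 2$, a target policy that always moves to state 2, and a uniform behavior state distribution. Then $A$ is a negative scalar, so $\dot{x} = -Ax$ is unstable:

```python
import numpy as np

gamma = 0.99
Phi = np.array([[1.0], [2.0]])           # phi(1) = 1, phi(2) = 2
P_pi = np.array([[0.0, 1.0],             # target policy: always transition to state 2
                 [0.0, 1.0]])
d_mu = np.array([0.5, 0.5])              # behavior visits both states uniformly
D_mu = np.diag(d_mu)

# A = Phi^T D_mu (I - gamma P_pi) Phi
A = Phi.T @ D_mu @ (np.eye(2) - gamma * P_pi) @ Phi
print(A)                                  # negative scalar: -A has a positive eigenvalue

# Euler-integrate xdot = -A x: the TD iterate diverges in this off-policy setup.
x, dt = np.array([1.0]), 0.01
for _ in range(1000):
    x = x + dt * (-A @ x)
print(float(abs(x[0])))                   # grows instead of shrinking
```

On-policy (with $D^\mu$ the stationary distribution of $P^\pi$), $A$ would be positive definite and the same O.D.E. would be stable.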

2.6. GRADIENT TEMPORAL DIFFERENCE LEARNING

To fix the instability issue of off-policy TD-learning under linear function approximation, Sutton et al. (2008) and Sutton et al. (2009) introduced various stable off-policy TD-learning algorithms, called GTD (gradient TD-learning), GTD2, and TDC (TD-learning with correction). The idea behind these algorithms is to minimize the mean-square projected Bellman error (MSPBE)

$\min_{\xi \in \mathbb{R}^n} \frac{1}{2}\left\|\Phi^\top D^\mu (R^\pi + \gamma P^\pi \Phi\xi - \Phi\xi)\right\|^2_{(\Phi^\top D^\mu \Phi)^{-1}},$

where $\|x\|_D := \sqrt{x^\top D x}$, and the global minimizer of the MSPBE corresponds to the solution of (6). The core idea of these algorithms is to introduce an additional variable $\lambda_k \in \mathbb{R}^n$ so as to approximate the stochastic gradient descent method for the MSPBE objective. In particular, the GTD2 update can be written as

$\lambda_{k+1} = \lambda_k + \alpha_k (-\phi_k^\top \lambda_k + \rho_k \delta_k(\xi_k))\phi_k,$
$\xi_{k+1} = \xi_k + \alpha_k (\phi_k^\top \lambda_k \phi_k - \gamma\rho_k \phi_k^\top \lambda_k \phi'_k).$

We use $\lambda_t$ to denote the continuous-time counterpart of $\lambda_k$. Since the fixed point for $\lambda_k$ is zero, it does not require a coordinate transformation. GTD2 is a single time-scale algorithm because it uses a single step-size $\alpha_k$. The corresponding O.D.E. is expressed as

$\dot{\lambda}_t = -C\lambda_t - A x_t, \qquad \dot{x}_t = A^\top \lambda_t,$

where $C := \mathbb{E}_{s \sim d^\mu}[\phi(s)\phi(s)^\top] = \Phi^\top D^\mu \Phi \in \mathbb{R}^{n \times n}$. Similarly, the TDC update can be written as

$\lambda_{k+1} = \lambda_k + \beta_k (-\phi_k^\top \lambda_k + \rho_k \delta_k(\xi_k))\phi_k,$  (7)
$\xi_{k+1} = \xi_k + \alpha_k (-\gamma\rho_k \phi_k^\top \lambda_k \phi'_k + \rho_k \delta_k(\xi_k)\phi_k),$  (8)

where the step-sizes $\alpha_k$ and $\beta_k$ satisfy $\alpha_k/\beta_k \to 0$ as $k \to \infty$ (so that $\xi_k$ evolves on the slower time-scale) together with the Robbins-Monro step-size condition (Robbins & Monro, 1951), making TDC a two time-scale algorithm. Most of the previous works on off-policy TD-learning algorithms (e.g., GTD2 and TDC) first start with an objective function and then derive the algorithms from an optimization perspective. The convergence is then proved using the corresponding O.D.E. models and the stability theory of linear time-invariant systems. A natural question is: can we derive off-policy TD-learning algorithms following the reversed steps? In other words, can we develop a stable O.D.E.
model first using tools from control theory, and then recover the corresponding off-policy TD-learning algorithms? In this paper, we reveal that a class of off-policy TD-learning algorithms can be derived based on purely control-theoretic motivations following such a reversed process. By doing so, this work provides additional insights into off-policy TD-learning algorithms and a sound theoretical foundation for the further development of new algorithms. Designing stabilizing control laws for continuous-time nonlinear systems has been studied successfully over the past decades (Khalil, 2015). One such technique, the so-called backstepping, is a popular controller design method in the nonlinear control literature (Khalil, 2015). With the help of the backstepping method, we design stabilizing control laws for continuous-time systems, derive the corresponding off-policy TD-learning algorithms, and show that they are convergent via the Borkar and Meyn theorem (Borkar & Meyn, 2000) in Lemma 2.1. The procedure is briefly summarized in the following steps:

Step 1) Choose an appropriate continuous-time dynamic model such that (a) the TD fixed point $\xi^*$ in (6) can be recovered via its equilibrium point, and (b) the corresponding stochastic approximation algorithm is implementable using only transitions of the MDP and accessible data.

Step 2) Using the backstepping method, design a control input to stabilize the dynamic model chosen in Step 1).
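Before turning to the new design, the GTD2 update above can be exercised on the same two-state off-policy counterexample where plain TD diverges (our own illustrative setup: $\phi(1) = 1$, $\phi(2) = 2$, the target policy always jumps to state 2, zero rewards, so $\xi^* = 0$, and $\rho = 1$ because next states are sampled from $P^\pi$ directly):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, alpha = 0.99, 0.02

# Two-state counterexample: behavior state distribution is uniform; all rewards
# are zero, so the TD fixed point is xi* = 0.
phi = {1: 1.0, 2: 2.0}

lam, xi = 0.0, 1.0                     # lambda_k and xi_k (scalars here, n = 1)
for _ in range(200_000):
    s = rng.choice([1, 2])             # s ~ d_mu
    f, f_next = phi[s], phi[2]         # next state is always state 2 under pi
    delta = 0.0 + gamma * f_next * xi - f * xi   # TD error with r = 0
    # GTD2 update with rho = 1:
    lam += alpha * (-f * lam + delta) * f
    xi += alpha * (f * lam * f - gamma * f * lam * f_next)

print(abs(xi))                          # near the TD fixed point xi* = 0
```

The averaged dynamics here are exactly the GTD2 O.D.E. $\dot{\lambda} = -C\lambda - Ax$, $\dot{x} = A^\top\lambda$, which is stable even though $A < 0$.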

3.1. BACKSTEPPING TD

Now, we introduce a new off-policy TD-learning algorithm, which we call backstepping TD (BTD). First, we will develop a stabilizing control law for the following continuous-time system:

$\dot{\lambda}_t = (-C + \eta A)\lambda_t - A x_t,$  (9)
$\dot{x}_t = u_t.$  (10)

The idea stems from finding a control system to which we can easily apply the backstepping technique. In detail, the backstepping technique can be applied to two interconnected systems where one subsystem, namely (4), can be stabilized with $x_t$ in (4) as a control input. Therefore, our first aim is to find such a system. To this end, a natural choice of O.D.E. to solve the TD problem is $\dot{\lambda}_t = -A\lambda_t$, which is, however, unstable in the off-policy case. Therefore, we consider the modified O.D.E. $\dot{\lambda}_t = (-C + \eta A)\lambda_t - A x_t$, where $x_t$ is the control input, the term $-C$ (a negative definite matrix) is introduced to stabilize the system, and $\eta$ is introduced to provide additional degrees of freedom in the design. The constructed system can be stabilized through the state-feedback controller $x_t = \eta\lambda_t$ and admits the simple control Lyapunov function $V(\lambda) = \|\lambda\|^2$. Moreover, $A$ should be included in the right-hand side in order to implement the corresponding algorithm without knowing the solution, because $x_k = \xi_k - \xi^*$, and $\xi^*$ must be removed using $A\xi^* = b$ in the final step. Simply setting $x_t = \eta\lambda_t$ cancels $\eta A$ on the right-hand side, and the O.D.E. becomes $\dot{\lambda}_t = -C\lambda_t$. Therefore, as mentioned before, we can apply the backstepping technique by adding an additional dynamic controller. As the next step, the backstepping technique is applied, and one needs to observe what the final form of the control system would be. In summary, if $f(\lambda_t)$ consists of a combination of $A$ and $-C$ (not necessarily $-C$; it may be $-I$), it is a reasonable candidate for applying the backstepping technique: cancelling $A$ with the virtual input leaves only $-C$, which guarantees stability by its negative definiteness.
Therefore, (9) and (10) constitute a reasonable candidate dynamics for applying the backstepping technique. In particular, our aim is to design an appropriate control input $u_t$ for the above system such that the origin is the unique asymptotically stable equilibrium point, i.e., $(\lambda_t, x_t) \to (0, 0)$ as $t \to \infty$ for any $(\lambda_0, x_0) \in \mathbb{R}^n \times \mathbb{R}^n$. The overall procedure is depicted in Figure 1 in the Appendix, and we show how to choose the control input $u_t$ in the following lemma.

Lemma 3.1. Consider the O.D.E. in (9) and (10). If we choose the control input $u_t := (A^\top + \eta^2 A - \eta C)\lambda_t - \eta A x_t$, then the above O.D.E. has a globally asymptotically stable origin, i.e., $(\lambda_t, x_t) \to (0, 0)$ as $t \to \infty$ for any $(\lambda_0, x_0) \in \mathbb{R}^n \times \mathbb{R}^n$.

Proof sketch. The proof follows the steps of the backstepping scheme in Section 2.3. First, substituting $x_t$ in (9) with a virtual controller $\bar{x}(\lambda_t)$, we design a control law $\bar{x}(\lambda_t)$ that stabilizes the virtual system

$\dot{\lambda}_t = (-C + \eta A)\lambda_t - A\bar{x}(\lambda_t).$  (11)

One natural choice of the virtual controller is $\bar{x}(\lambda_t) = \eta\lambda_t$. Plugging it into (11) leads to $\dot{\lambda}_t = -C\lambda_t$, and we can verify the global asymptotic stability of this system with the Lyapunov function

$V(\lambda_t) := \frac{\|\lambda_t\|_2^2}{2}.$  (12)

We now consider the original O.D.E. in (9) and (10). Simple algebraic manipulations yield $\dot{\lambda}_t = -C\lambda_t - A(x_t - \eta\lambda_t)$, $\dot{x}_t = u_t$. The error between $x_t$ and the virtual controller $\bar{x}(\lambda_t)$ is expressed by the new variable $z_t := x_t - \bar{x}(\lambda_t) = x_t - \eta\lambda_t$. Rewriting the O.D.E. in (9) and (10) in the $(\lambda_t, z_t)$ coordinates, we have

$\dot{\lambda}_t = -C\lambda_t - A z_t,$  (13)
$\dot{z}_t = u_t + \eta C\lambda_t + \eta A z_t.$

To prove the global asymptotic stability of the above system, consider the function $V_c(\lambda_t, z_t) := V(\lambda_t) + \frac{1}{2}\|z_t\|_2^2$, where $V(\lambda_t)$ is defined in (12). Taking $u_t = A^\top \lambda_t - \eta C\lambda_t - \eta A z_t$, we can apply LaSalle's invariance principle in Lemma 7.1. The full proof is in Appendix Section 7.4.1.
Using the relation $z_t = x_t - \eta\lambda_t$, the control input in the original coordinates $(\lambda_t, x_t)$ can be written as $u_t = A^\top\lambda_t - \eta C\lambda_t - \eta A z_t = (A^\top + \eta^2 A - \eta C)\lambda_t - \eta A x_t$. Plugging this input into the open-loop system in (9) and (10), the closed-loop system in the original coordinates $(\lambda_t, x_t)$ can be written as

$\dot{\lambda}_t = (-C + \eta A)\lambda_t - A x_t,$  (14)
$\dot{x}_t = (A^\top + \eta^2 A - \eta C)\lambda_t - \eta A x_t,$  (15)

whose origin is also globally asymptotically stable according to Lemma 3.1. Recovering $\xi_t$ from $x_t = \xi_t - \xi^*$, we have

$\frac{d}{dt}\begin{bmatrix} \lambda_t \\ \xi_t \end{bmatrix} = \begin{bmatrix} -C + \eta A & -A \\ A^\top + \eta^2 A - \eta C & -\eta A \end{bmatrix}\begin{bmatrix} \lambda_t \\ \xi_t \end{bmatrix} + \begin{bmatrix} b \\ \eta b \end{bmatrix}.$

The corresponding stochastic approximation of this O.D.E. becomes

$\lambda_{k+1} = \lambda_k + \alpha_k\big(((-1 + \eta)\phi_k^\top\lambda_k - \eta\gamma\rho_k\phi_k'^\top\lambda_k)\phi_k + \rho_k\delta_k(\xi_k)\phi_k\big),$  (16)
$\xi_{k+1} = \xi_k + \alpha_k\big(((-\eta + \eta^2)\phi_k^\top\lambda_k - \eta^2\gamma\rho_k\phi_k'^\top\lambda_k)\phi_k + \eta\rho_k\delta_k(\xi_k)\phi_k + (\phi_k^\top\lambda_k\phi_k - \gamma\rho_k\phi_k^\top\lambda_k\phi'_k)\big).$  (17)

The equilibrium point of the above O.D.E. is $(0, \xi^*)$. Hence, we only need to transform the coordinate of $\xi_t$ to $x_t = \xi_t - \xi^*$, which results in the O.D.E. in (14) and (15). With the above result, we are now ready to prove the convergence of Algorithm 1. The proof follows from the Borkar and Meyn theorem in Lemma 2.1, of which the application details can be found in Sutton et al. (2009).

Theorem 3.1. Under the step-size condition (33), with Algorithm 1 in the Appendix, $\xi_k \to \xi^*$ as $k \to \infty$ with probability one, where $\xi^*$ is the fixed point of (6).

Proof. The proof is done by checking Assumption 7.1 in the Appendix.

Remark 3.1. Theorem 3.1 does not require any condition on $\eta$. Therefore, we can set $\eta = 0$, which recovers GTD2 developed in Sutton et al. (2009).
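The BTD update (16)-(17) can be sketched on the same two-state off-policy counterexample used earlier (our own illustrative instance with zero rewards, $\rho = 1$, and $\xi^* = 0$; here $\eta = 0.5$ is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, alpha, eta = 0.99, 0.02, 0.5

# Two-state off-policy counterexample: phi(1)=1, phi(2)=2, the target policy
# always moves to state 2, rewards are zero, so the TD fixed point is xi* = 0.
phi = {1: 1.0, 2: 2.0}

lam, xi = 0.0, 1.0
for _ in range(200_000):
    s = rng.choice([1, 2])                  # s ~ d_mu (uniform behavior distribution)
    f, f_next = phi[s], phi[2]
    delta = gamma * f_next * xi - f * xi    # TD error with r = 0, rho = 1
    flam = f * lam                          # phi_k^T lambda_k
    flam_next = f_next * lam                # phi_k'^T lambda_k
    # BTD update (16)-(17):
    lam_new = lam + alpha * (((-1 + eta) * flam - eta * gamma * flam_next) * f
                             + delta * f)
    xi_new = xi + alpha * (((-eta + eta**2) * flam - eta**2 * gamma * flam_next) * f
                           + eta * delta * f
                           + (flam * f - gamma * flam * f_next))
    lam, xi = lam_new, xi_new

print(abs(xi))   # BTD drives xi toward the TD fixed point xi* = 0
```

In expectation these updates reproduce exactly the closed-loop O.D.E. (14)-(15), whose origin is globally asymptotically stable by Lemma 3.1.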

3.2. RECOVERING SINGLE TIME-SCALE TDC

In this section, we derive a single time-scale version of TDC (Sutton et al., 2009) through the backstepping design of the previous section. TDC was originally developed as a two time-scale algorithm in Sutton et al. (2009). Even though the two time-scale method provides theoretical guarantees for a larger class of algorithms, the single time-scale scheme is simpler to use in practice and empirically shows faster convergence. Subsequently, Maei (2011) provided a single time-scale version of TDC by multiplying a large enough constant $\eta > 0$ to the faster time-scale part (7), which leads to

$\lambda_{k+1} = \lambda_k + \alpha_k\eta(-\phi_k^\top\lambda_k + \rho_k\delta_k(\xi_k))\phi_k,$  (18)
$\xi_{k+1} = \xi_k + \alpha_k(-\gamma\rho_k\phi_k^\top\lambda_k\phi'_k + \rho_k\delta_k(\xi_k)\phi_k),$  (19)

where

$\eta > \max\{0, -\lambda_{\min}(C^{-1}(A + A^\top)/2)\}.$  (20)

Here, we derive another version of single time-scale TDC by multiplying a constant $\beta$ to the slower time-scale part in (8), which results in

$\lambda_{k+1} = \lambda_k + \alpha_k(-\phi_k^\top\lambda_k + \rho_k\delta_k(\xi_k))\phi_k,$  (21)
$\xi_{k+1} = \xi_k + \alpha_k\beta(-\gamma\rho_k\phi_k^\top\lambda_k\phi'_k + \rho_k\delta_k(\xi_k)\phi_k),$  (22)

where $\beta$ satisfies

$0 < \beta < -\frac{\lambda_{\min}(C)}{\lambda_{\min}(A)} \text{ if } \lambda_{\min}(A) < 0, \text{ and } \beta > 0 \text{ otherwise}.$  (23)

We can derive the above algorithm following similar steps as in Section 3.1. Let us first consider the following dynamic model:

$\dot{\lambda}_t = -C\lambda_t - A x_t,$  (24)
$\dot{x}_t = u_t.$  (25)

Using the backstepping technique, we can prove that the above system admits the origin as a globally asymptotically stable equilibrium point with the control input $u_t := \beta(A^\top - C)\lambda_t - \beta A x_t$, which is shown in the following lemma:

Lemma 3.2. Consider the O.D.E. in (24) and (25). Suppose that we choose the control input $u_t := \beta(A^\top - C)\lambda_t - \beta A x_t$,

We will call Algorithm 4 (in the Appendix) TDC-slow, and the single time-scale version of TDC suggested by Maei (2011) TDC-fast. Besides multiplying a constant that reflects the two time-scale property, TDC can be made into a single time-scale algorithm in yet another manner, which we call single time-scale TDC2, while the original version in Maei (2011) will be called single time-scale TDC.
The derivation is given in Appendix Section 7.5, and the performance of these versions of TDC is evaluated in Appendix Section 7.9.1. Even though no single algorithm outperforms the others, TDC-slow and TDC2 show better performance in general.
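The single time-scale TDC-fast update (18)-(19) can be sketched on the two-state off-policy counterexample used earlier (our own illustrative instance; population matrices are computed only to verify condition (20), and $\eta = 1$ is an arbitrary admissible choice):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, alpha = 0.99, 0.02

# Two-state counterexample again: phi(1)=1, phi(2)=2, next state always 2, r = 0.
phi = {1: 1.0, 2: 2.0}

# Population matrices, used only to check the condition (20) on eta.
Phi = np.array([[1.0], [2.0]])
D = np.diag([0.5, 0.5])
P = np.array([[0.0, 1.0], [0.0, 1.0]])
A = Phi.T @ D @ (np.eye(2) - gamma * P) @ Phi
C = Phi.T @ D @ Phi
eta_min = max(0.0, -np.min(np.linalg.eigvals(np.linalg.inv(C) @ (A + A.T) / 2).real))
eta = 1.0
assert eta > eta_min                        # condition (20) holds

lam, xi = 0.0, 1.0
for _ in range(200_000):
    s = rng.choice([1, 2])
    f, f_next = phi[s], phi[2]
    delta = gamma * f_next * xi - f * xi    # rho = 1, r = 0
    # TDC-fast update (18)-(19):
    lam_new = lam + alpha * eta * (-f * lam + delta) * f
    xi_new = xi + alpha * (-gamma * f * lam * f_next + delta * f)
    lam, xi = lam_new, xi_new

print(abs(xi))   # converges although plain off-policy TD diverges here
```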

3.3. GENERALIZING TDC++

This section provides generalized versions of TDC++ (Ghiassian et al., 2020), which is itself a variant of TDC. With an additional regularization term $-\beta\lambda_k$ on both updates of TDC in (7) and (8), the TDC++ update is written as follows:

$\lambda_{k+1} = \lambda_k + \alpha_k\eta\big((-\phi_k^\top\lambda_k + \rho_k\delta_k(\xi_k))\phi_k - \beta\lambda_k\big),$  (26)
$\xi_{k+1} = \xi_k + \alpha_k\big(-\gamma\rho_k\phi_k^\top\lambda_k\phi'_k - \beta\lambda_k + \rho_k\delta_k(\xi_k)\phi_k\big),$  (27)

where $\eta > 0$ satisfies (20) and $\beta > 0$ is a new parameter. Note that TDC++ can be viewed as a variant of TDC obtained by adding the term $-\beta\lambda_k$ to each update, which can be seen as a regularization term; letting $\beta = 0$ recovers the original TDC. In this paper, we show that our controller design leads to the following update:

$\lambda_{k+1} = \lambda_k + \alpha_k\eta\big((-\phi_k^\top\lambda_k + \rho_k\delta_k(\xi_k))\phi_k - \beta\lambda_k\big),$  (28)
$\xi_{k+1} = \xi_k + \alpha_k\big(-\gamma\rho_k\phi_k^\top\lambda_k\phi'_k + (1 - \eta)\phi_k^\top\lambda_k\phi_k - \kappa\eta\lambda_k + \eta\rho_k\delta_k(\xi_k)\phi_k\big),$  (29)

where $\kappa$ and $\beta$ are new parameters; when $\kappa = \beta/\eta$, the regularization term coincides with that of TDC++. The difference from the original TDC++ can be seen in the corresponding O.D.E. forms. The O.D.E. for (26) and (27) (original TDC++) can be expressed as

$\frac{d}{dt}\begin{bmatrix} \lambda_t \\ x_t \end{bmatrix} = \begin{bmatrix} -\eta(C + \beta I) & -\eta A \\ A^\top - C - \beta I & -A \end{bmatrix}\begin{bmatrix} \lambda_t \\ x_t \end{bmatrix}.$

Meanwhile, the O.D.E. corresponding to (28) and (29) (new TDC++) becomes

$\frac{d}{dt}\begin{bmatrix} \lambda_t \\ x_t \end{bmatrix} = \begin{bmatrix} -\eta(C + \beta I) & -\eta A \\ A^\top - \eta(C + \kappa I) & -\eta A \end{bmatrix}\begin{bmatrix} \lambda_t \\ x_t \end{bmatrix}.$

We experiment with different values of $\kappa$ and $\eta$ to examine the behavior of the new TDC++; in general, smaller $\kappa$ leads to better performance. The results are given in Appendix Section 7.9.

Lemma 3.3. Consider the following O.D.E.:

$\dot{\lambda}_t = -\eta(C + \beta I)\lambda_t - \eta A x_t,$  (30)
$\dot{x}_t = u_t.$

Suppose that we choose the control input $u_t := (A^\top - \eta(C + \kappa I))\lambda_t - \eta A x_t$, and assume that $\eta > 0$ and that $\beta$ and $\kappa$ satisfy $\beta + \kappa\lambda_{\min}(A) > \lambda_{\min}(C)$. Then, the above O.D.E. has a globally asymptotically stable origin, i.e., $(\lambda_t, x_t) \to (0, 0)$ as $t \to \infty$.

The proof is given in Appendix Section 7.4.3. With Lemma 2.1, we can prove the convergence of the stochastic updates (28) and (29), whose pseudo-code is given in Algorithm 5 in the Appendix.

Theorem 3.3. Consider Algorithm 5 in the Appendix.
Under the step-size condition (33), and if $\eta$ satisfies (20), then $\xi_k \to \xi^*$ as $k \to \infty$ with probability one, where $\xi^*$ is the TD fixed point in (6).

Remark 3.2. We can replace the regularization term with nonlinear terms satisfying certain conditions. The details are given in Appendix Section 7.6.

4. EXPERIMENTS

We verify the performance and convergence of the proposed BTD on standard benchmarks for evaluating off-policy TD-learning algorithms, including the Baird environment (Baird, 1995), RandomWalk (Sutton et al., 2009) with different features, and the Boyan chain (Boyan, 2002). Details of the environments are given in Appendix Section 7.7. In the experiments, we examine how BTD behaves under different coefficients $\eta \in \{-0.5, -0.25, 0, 0.25, 0.5\}$. We measure the root mean-squared projected Bellman error (RMSPBE) as the performance metric, and all results are averaged over 100 runs. From Table 1, $\eta = 0.5$ shows the best performance in all environments except Baird, where $\eta = 0$ (corresponding to GTD2) performs best. There are two aspects to the role of $\eta$. First, it can be viewed as a parameter that mitigates the instability coming from the matrix $-A$ in (9); for example, a smaller $\eta$ can stabilize the system. However, as a trade-off, if $\eta$ is too small, then the update rate may also be too small, and the overall convergence can be slower. Furthermore, $\eta$ also controls the effect of $-C$ in (13) on the BTD update rules, where $-C$ corresponds to the $(-\eta + \eta^2)\phi_k^\top\lambda_k\phi_k$ term in (17). Note that the role of $\eta$ in the final BTD update rule (17) differs from its role in (9). In particular, $\eta = 1/2$ maximizes the effect of $-C$ in (17), and from Table 1, it leads to reasonably good performance in most domains. Another natural choice is to multiply $\eta$ to $C$ instead of $A$. However, in that case we would need the additional constraint $\eta > 0$, whereas in the current BTD, convergence is guaranteed for all $\eta \in \mathbb{R}$. Finally, we note that simply multiplying $C$ by a large positive constant does not lead to good results in general, because it may increase the variance and destabilize the algorithm. The overall results are given in Appendix Section 7.8.

5. CONCLUSION

In this work, we have proposed a new framework to design off-policy TD-learning algorithms from a control-theoretic viewpoint. A future research direction is to extend the framework to the nonlinear function approximation setting.



and that $\beta$ satisfies condition (23). Then, the above O.D.E. has a globally asymptotically stable origin, i.e., $(\lambda_t, x_t) \to (0, 0)$ as $t \to \infty$. The proof of Lemma 3.2 is given in Appendix Section 7.4.2. By the Borkar and Meyn theorem in Lemma 2.1, we can readily prove the convergence of Algorithm 2 in the Appendix, which uses the stochastic recursive updates (21) and (22): under the step-size condition (33), and if $\beta$ satisfies (23), then $\xi_k \to \xi^*$ as $k \to \infty$ with probability one, where $\xi^*$ is the fixed point of (6).

Table 1: RMSPBE of Backstepping TD (step-size = 0.01) for $\eta \in \{-0.5, -0.25, 0, 0.25, 0.5\}$, averaged over 100 runs (mean ± standard deviation):

Environment | η = −0.5 | η = −0.25 | η = 0 | η = 0.25 | η = 0.5
Boyan | 1.51 ± 0.66 | 1.481 ± 0.656 | 1.452 ± 0.647 | 1.428 ± 0.64 | 1.408 ± 0.635
Dependent | 0.11 ± 0.19 | 0.097 ± 0.163 | 0.086 ± 0.142 | 0.079 ± 0.128 | 0.076 ± 0.122
Inverted | 0.21 ± 0.25 | 0.173 ± 0.218 | 0.151 ± 0.193 | 0.139 ± 0.177 | 0.136 ± 0.172
Tabular | 0.17 ± 0.28 | 0.147 ± 0.238 | 0.133 ± 0.208 | 0.124 ± 0.191 | 0.122 ± 0.188
Baird | 0.1 ± 0.64 | 0.09 ± 0.629 | 0.085 ± 0.625 | 0.087 ± 0.628 | 0.092 ± 0.637

6. ACKNOWLEDGEMENTS

This work was supported by the National Research Foundation under Grant NRF-2021R1F1A1061613, the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2022-0-00469), and the BK21 FOUR program of the Ministry of Education (Republic of Korea). (Corresponding author: Donghwan Lee.)

