BACKSTEPPING TEMPORAL DIFFERENCE LEARNING

Abstract

Off-policy learning ability is an important feature of reinforcement learning (RL) for practical applications. However, even one of the most elementary RL algorithms, temporal-difference (TD) learning, is known to suffer from divergence when the off-policy scheme is used together with linear function approximation. To overcome this divergent behavior, several off-policy TD-learning algorithms, including gradient-TD learning (GTD) and TD-learning with correction (TDC), have been developed. In this work, we provide a unified view of such algorithms from a purely control-theoretic perspective and propose a new convergent algorithm. Our method relies on the backstepping technique, which is widely used in nonlinear control theory. Finally, the convergence of the proposed algorithm is experimentally verified in environments where standard TD-learning is known to be unstable.

1. INTRODUCTION

Since Mnih et al. (2015), which demonstrated that deep reinforcement learning (RL) outperforms humans in several video games (Atari 2600 games), significant advances have been made in RL theory and algorithms. For instance, Van Hasselt et al. (2016); Lan et al. (2020); Chen et al. (2021) proposed variants of the so-called deep Q-network (Mnih et al., 2015) that achieve higher scores in Atari games than the original deep Q-network. An improved deep RL agent was developed in Badia et al. (2020) that performs better than the average human across 57 Atari games. Beyond video games, Schrittwieser et al. (2020) have shown that an RL agent can self-learn chess, Go, and shogi. Furthermore, RL has shown great success in real-world applications, e.g., robotics (Kober et al., 2013), healthcare (Gottesman et al., 2019), and recommendation systems (Chen et al., 2019). Despite the practical success of deep RL, there is still a gap between theory and practice. One of the most notorious phenomena is the deadly triad (Sutton & Barto, 2018), the divergence of an algorithm when function approximation, off-policy learning, and bootstrapping are used together. One of the most fundamental algorithms, the so-called temporal-difference (TD) learning (Sutton, 1988), is known to diverge under the deadly triad, and several works have tried to fix this issue over the decades. In particular, the seminal works Sutton et al. (2008; 2009) introduced the so-called GTD, gradient-TD2 (GTD2), and TDC, which are off-policy and have been proved to be convergent with linear function approximation. More recently, Ghiassian et al. (2020) suggested a regularized version of TDC, called TD-learning with regularized correction (TDRC), and showed its favorable features under off-policy settings. Moreover, Lee et al. (2021) developed several variants of GTD based on a primal-dual formulation.
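To make the divergence under the deadly triad concrete, the following minimal sketch (our own illustration, not from this paper) reproduces the classic "theta → 2·theta" counterexample often attributed to Tsitsiklis and Van Roy and discussed in Sutton & Barto (2018): two states with linear features 1 and 2, where the behavior policy only ever shows the transition from the first state to the second.

```python
# Minimal illustration of off-policy semi-gradient TD(0) divergence with
# linear function approximation. Features: phi(s1) = 1, phi(s2) = 2.
# The behavior policy only samples the transition s1 -> s2 with reward 0,
# so each update multiplies theta by 1 + alpha * (2*gamma - 1) > 1.
gamma = 0.99   # discount factor
alpha = 0.1    # step size
theta = 1.0    # single linear weight

for step in range(100):
    phi_s, phi_next, reward = 1.0, 2.0, 0.0
    td_error = reward + gamma * theta * phi_next - theta * phi_s
    theta += alpha * td_error * phi_s   # semi-gradient TD(0) update

print(theta)  # grows geometrically, far above its initial value of 1.0
```

After 100 updates the weight has grown by roughly four orders of magnitude, even though the true value function (all rewards are zero) is representable exactly by theta = 0.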
On the other hand, backstepping control (Khalil, 2015) is a popular method for designing stable controllers for nonlinear systems with special structures. The design technique offers a wide range of stable controllers and is proved to be robust under various settings. It has been used in various fields, including quadrotor helicopters (Madani & Benallegue, 2006), mobile robots (Fierro & Lewis, 1997), and ship control (Fossen & Strand, 1999). Using the backstepping control technique, in this paper we develop a new convergent off-policy TD-learning method, which is a single time-scale algorithm. In particular, the goal of this paper is to introduce a new unifying framework for designing off-policy TD-learning algorithms under linear function approximation. The main contributions are summarized as follows:
• We propose a systematic way to generate off-policy TD-learning algorithms, including GTD2 and TDC, from a control-theoretic perspective.
• Using our framework, we derive a new TD-learning algorithm, which we call backstepping TD (BTD).
• We experimentally verify its convergence and performance under various settings, including those where off-policy TD is known to be unstable.
In particular, most of the previous works on off-policy TD-learning algorithms (e.g., GTD2 and TDC) are derived from an optimization perspective, starting with an objective function, and their convergence is then proved by establishing the stability of the corresponding O.D.E. models. In this paper, we follow the reversed steps, and reveal that an off-policy TD-learning algorithm (called backstepping TD) can be derived based on control-theoretic motivations. In particular, we first develop stable O.D.E. models using the backstepping technique, and then recover the corresponding off-policy TD-learning algorithms. The new analysis reveals connections between off-policy TD-learning and notions in control theory, and provides additional insights into off-policy TD-learning using simple concepts from control theory.
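The O.D.E. viewpoint mentioned above can be sketched numerically. In the standard analysis (our own illustration, not this paper's notation), the mean O.D.E. of linear TD(0) is θ̇_t = b − Aθ_t with A = Φ^⊤ D (I − γP) Φ, where P is the target policy's transition matrix and D = diag(d) holds the state distribution d induced by the behavior policy; the O.D.E. (and hence TD) is stable precisely when every eigenvalue of A has a positive real part. The toy two-state chain below shows how an off-policy distribution flips the sign of A:

```python
import numpy as np

# Mean-O.D.E. matrix of linear TD(0): A = Phi^T D (I - gamma*P) Phi.
# Stability of d/dt theta = b - A*theta requires A to be positive stable.
gamma = 0.99
P = np.array([[0.0, 1.0],    # s1 -> s2 deterministically (target policy)
              [0.0, 1.0]])   # s2 -> s2 (self-loop)
Phi = np.array([[1.0],       # phi(s1) = 1
                [2.0]])      # phi(s2) = 2

def td_matrix(d):
    """A for a given state-visitation distribution d."""
    return Phi.T @ np.diag(d) @ (np.eye(2) - gamma * P) @ Phi

A_off = td_matrix(np.array([0.5, 0.5]))  # behavior policy visits s1 often
A_on = td_matrix(np.array([0.0, 1.0]))   # stationary distribution of P

print(A_off.item())  # -0.47: negative, so the off-policy O.D.E. is unstable
print(A_on.item())   #  0.04: positive, so the on-policy O.D.E. is stable
```

Algorithms such as GTD2 and TDC can be read as modifying this O.D.E. so that the resulting system matrix is stable regardless of the behavior distribution, which is the design problem the backstepping construction addresses directly.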
The sound theoretical foundation established in this paper can potentially motivate further analyses and the development of new algorithms. Finally, we briefly summarize TD-learning algorithms that guarantee convergence under linear function approximation. GTD (Sutton et al., 2008), GTD2, and TDC (Sutton et al., 2009) were developed to approximate the gradient of the mean squared projected Bellman error. Later, GTD and GTD2 were discovered to solve a minimax optimization problem (Macua et al., 2014; Liu et al., 2020). Such a saddle-point viewpoint of GTD has led to many interesting results, including Du et al. (2017); Dai et al. (2018); Lee et al. (2021). TDRC (Ghiassian et al., 2020) adds a term similar to a regularization term to one side of the parameter update, and tries to balance the performance of TD and the stability of TDC. TDC++ (Ghiassian et al., 2020) also adds an additional regularization term, but on both sides of the parameter update. Even though TDRC shows good performance, it requires an additional condition on its parameters to ensure convergence, whereas TDC++ does not.

2. PRELIMINARIES

2.1 NONLINEAR SYSTEM THEORY

Nonlinear system theory will play an important role throughout this paper. Here, we briefly review the basics of nonlinear systems. Let us consider the continuous-time nonlinear system

ẋ_t = f(x_t, u_t), x_0 ∈ R^n, (1)

where x_0 ∈ R^n is the initial state, t ∈ R, t ≥ 0 is the time, x_t ∈ R^n is the state, u_t ∈ R^m is the control input, and f : R^n × R^m → R^n is a nonlinear mapping. An important concept in dealing with nonlinear systems is the equilibrium point. Considering the state-feedback law u_t = μ(x_t), the closed-loop system can be written as ẋ_t = f(x_t, μ(x_t)) =: f̄(x_t), and a point x = x_e in the state space is said to be an equilibrium point of (1) if it has the property that whenever the state of the system starts at x_e, it will remain at x_e (Khalil, 2015). For ẋ_t = f̄(x_t), the equilibrium points are the real roots of the equation f̄(x) = 0. The equilibrium point x_e is said to be globally asymptotically stable if for any initial state x_0 ∈ R^n, x_t → x_e as t → ∞. An important control design problem is to construct a state-feedback law u_t = μ(x_t) such that the origin becomes the globally asymptotically stable equilibrium point of (1). To design a state-feedback law that meets such a goal, the control Lyapunov function plays a central role, which is defined as follows.

Definition 2.1 (Control Lyapunov function (Sontag, 2013)). A positive definite function V : R^n → R is called a control Lyapunov function (CLF) if for all x ≠ 0, there exists a corresponding control input u ∈ R^m that satisfies the inequality ∇_x V(x)^⊤ f(x, u) < 0.

Once such a CLF is found, it guarantees that there exists a control law that stabilizes the system. Moreover, the corresponding state-feedback control law can be extracted from the CLF, e.g., μ(x) = argmin_u ∇_x V(x)^⊤ f(x, u), provided that the minimum exists and is unique. The concept of the control Lyapunov function will be used in the derivations of our main results. For the autonomous
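As a concrete instance of a CLF (our own toy example, not from this paper): for the scalar system ẋ = f(x, u) = x + u, the function V(x) = x²/2 is a CLF, since the feedback law u = μ(x) = −2x yields ∇_x V(x) f(x, μ(x)) = x(x − 2x) = −x² < 0 for all x ≠ 0, so the origin is globally asymptotically stable under this law. The sketch below checks this numerically with forward-Euler integration:

```python
# Toy CLF example: dx/dt = x + u, V(x) = x^2/2, feedback law u = -2x.
# The closed-loop system is dx/dt = -x, so every trajectory decays to 0.
def simulate(x0, dt=1e-3, T=10.0):
    """Forward-Euler integration of the closed-loop system from x0."""
    x = x0
    for _ in range(int(T / dt)):
        u = -2.0 * x         # state-feedback law extracted from the CLF
        x += dt * (x + u)    # Euler step of dx/dt = x + u
    return x

print(simulate(5.0))   # close to 0
print(simulate(-3.0))  # close to 0
```

Note that the uncontrolled system ẋ = x is unstable; the CLF certifies that a stabilizing input exists and directly suggests one.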

