ON THE FAST CONVERGENCE OF UNSTABLE REINFORCEMENT LEARNING PROBLEMS

Abstract

For many reinforcement learning applications, the system is assumed to be inherently stable, with bounded reward, state, and action spaces. These are key requirements for the optimization convergence of classical reinforcement learning objectives with discount factors. Unfortunately, these assumptions do not hold for many real-world problems, such as an unstable linear-quadratic regulator (LQR). In this work, we propose new methods to stabilize and speed up the convergence of unstable reinforcement learning problems under policy gradient methods. We provide theoretical insights into the efficiency of our methods. In practice, our methods achieve good experimental results over multiple examples where the vanilla methods mostly fail to converge due to system instability.

1. INTRODUCTION

Reinforcement learning (RL), powered by the generalization ability of machine learning structures, has been fairly successful in classical control tasks (Hafner & Riedmiller, 2011; Lillicrap et al., 2016) and in problems like Atari (Mnih et al., 2013) and Go (Silver et al., 2016). RL aims to train a policy to achieve maximum reward or minimum cost, which is similar to control-theoretic approaches to designing a controller. However, unlike classical optimal control, which requires full knowledge of the transition dynamics, RL can learn the optimal policy directly from past data by solving an optimization problem without knowledge of the underlying dynamics. One of the mainstream methods to solve an RL problem is policy optimization via gradient descent. However, the convergence of policy optimization relies heavily on an unapparent yet critical assumption about the system dynamics itself: stability. In addition, convergence of policy optimization requires the Lipschitz property of the cost function and its gradient. In fact, in many existing RL benchmarks, such as OpenAI's classical control environments, the states, actions, and costs are clipped so that the policy never reaches extreme conditions and costs/states remain bounded, in order to reduce the error derivatives (Mnih et al., 2015). Unfortunately, similar formulations are not directly applicable to unstable systems, such as an LQR with an unstable state matrix (i.e., the spectral radius of the state matrix is outside the unit circle), where standard policy gradient based methods are likely to fail.

Motivated by the above issue, in this paper we aim to enable and speed up the convergence of policy gradient methods for unstable RL problems. We propose a logarithmic mapping method on loss functions, supported by rigorous theoretical proofs and experimental results. The key contributions are summarized as follows.
• We formally define the unstable RL problem in the scope of "input-to-output" stability, with the input actions leading to a temporally growing effect on the cost output. This is the first time the convergence issue of unstable RL problems has been studied; we demonstrate that a major issue for policy gradient methods on unstable RL problems is the slow convergence rate, which is due to the large spectral radius of the Hessian matrix.
• We propose a simple yet effective logarithmic mapping to alleviate this issue and speed up convergence. We provide both theoretical analysis and experimental results to support the contribution of a faster convergence rate. Notably, our finite-horizon problem setup does not require the boundedness assumption on the cost function. The experiments range from LQR examples to customized nonlinear cases with neural-network-based policies.
• We provide an efficient method to find a better initialization of the control policy by optimizing it over the spectral norm of the controlled system. We use it as a fast pre-processing step to effectively save computation cost and allow a larger learning rate for fast convergence.
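To make the logarithmic mapping idea above concrete, the sketch below contrasts the gradient of a raw finite-horizon quadratic cost with the gradient of a log-mapped surrogate on an unstable toy system. The specific mapping log(1 + C), the system matrices, and all numeric values here are our own illustrative assumptions, not the paper's exact construction; the point is only the chain-rule effect grad log(1 + C) = grad C / (1 + C), which damps the enormous gradients produced by diverging rollouts.

```python
import numpy as np

def rollout_cost(K, A, B, Q, R, x0, T):
    """Finite-horizon quadratic cost under the feedback law u_t = -K x_t."""
    cost, x = 0.0, x0
    for _ in range(T + 1):
        u = -K @ x
        cost += x @ Q @ x + u @ R @ u
        x = A @ x + B @ u
    return cost

def num_grad(f, K, eps=1e-5):
    """Central-difference gradient of a scalar objective with respect to K."""
    g = np.zeros_like(K)
    for i in range(K.shape[0]):
        for j in range(K.shape[1]):
            E = np.zeros_like(K)
            E[i, j] = eps
            g[i, j] = (f(K + E) - f(K - E)) / (2 * eps)
    return g

# Illustrative unstable system (spectral radius of A is 1.2 > 1)
A = np.array([[1.2, 0.5], [0.0, 0.9]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
x0 = np.array([1.0, 1.0])
K = np.zeros((1, 2))          # uncontrolled start: the rollout diverges

raw = lambda K: rollout_cost(K, A, B, Q, R, x0, 20)
logc = lambda K: np.log1p(raw(K))   # hypothetical log(1 + C) surrogate

# The log surrogate's gradient is the raw gradient scaled by 1/(1 + C),
# so it stays moderate even when the raw cost blows up.
print(np.linalg.norm(num_grad(raw, K)))
print(np.linalg.norm(num_grad(logc, K)))
```

On this example the raw cost at K = 0 is in the thousands, so the log-mapped gradient is several orders of magnitude smaller than the raw one, which is what allows a larger, stable learning rate.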

2. BACKGROUND AND RELATED WORK

2.1. UNSTABLE DYNAMICAL SYSTEM

For many real-world unstable dynamical systems with limited explicit knowledge, system identification (SYSID) (Mettler et al., 1999; Ananth & Chidambaram, 1999; Bond & Daniel, 2008) followed by control design (Jordan & Jacobs, 1990; Arora et al., 2011) is a two-step approach to maneuver such a system: the SYSID step targets learning the system, and the control step designs a controller to stabilize the system and minimize the cost. While most existing works focus on the system instability itself, few recognize the optimization issues that arise from the diverging nature of an unstable system. Shahab & Doraiswami (2009) pointed out that the aforementioned vanilla approach may fail at system identification due to the rich sampling space required by an unstable system, and thus proposed a closed-loop algorithm to accommodate prior information in the identification process. Nar et al. (2020) noticed the imbalanced sample influence of unstable systems and used a time-weighted loss to alleviate the effect. With the growing popularity of data-driven approaches in control/RL problems, the optimization issues from unstable systems naturally extend to many trending methods such as policy gradient. To the best of our knowledge, this work is the first attempt to investigate the convergence issues of policy gradient algorithms on unstable systems and to propose practical methods to alleviate them.

2.2. LQR PROBLEM

For a discrete-time linear system, the state equation is represented by:

x_{t+1} = A x_t + B u_t   (1)

where x_t ∈ R^n and u_t ∈ R^m denote the system state and control action at time step t, and A ∈ R^{n×n} and B ∈ R^{n×m} are the system transition matrices. The feedback gain is parameterized by the matrix K ∈ R^{m×n} with u_t = -K x_t. The intermediate cost is a quadratic form in the state x_t and control u_t, where Q ∈ R^{n×n} and R ∈ R^{m×m} are given positive definite matrices parameterizing the quadratic cost. The optimal control problem can be formulated as minimizing the following objective over K:

C_{K,T} ≜ E_{x_0 ∼ D} [ Σ_{t=0}^{T} ( x_t^⊤ Q x_t + u_t^⊤ R u_t ) ],

where x_t = (A - BK)^t x_0 by plugging in u_t = -K x_t, and D is the distribution of the initial state. If T → ∞, the problem is called infinite-horizon LQR; otherwise, when T is a finite positive integer, it is called finite-horizon LQR. According to Nise (2020), a T-time-step system is controllable if we can reach any target state x* from any initial state x_0. The necessary and sufficient condition for controllability is that the controllability matrix is full rank, i.e.,

rank([A^{T-1} B, A^{T-2} B, ⋯, A B, B]) = n.
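The finite-horizon cost and the controllability rank test above translate directly into a short numerical check. This is a minimal sketch with illustrative matrices of our own choosing (not from the paper), evaluating C_{K,T} for a single initial state rather than the expectation over D:

```python
import numpy as np

def lqr_cost(A, B, K, Q, R, x0, T):
    """Finite-horizon LQR cost C_{K,T} for one initial state x0,
    simulating x_{t+1} = A x_t + B u_t with u_t = -K x_t."""
    cost, x = 0.0, x0
    for _ in range(T + 1):
        u = -K @ x
        cost += x @ Q @ x + u @ R @ u
        x = A @ x + B @ u
    return cost

def is_controllable(A, B, T=None):
    """Rank test on the controllability matrix [A^{T-1}B, ..., AB, B]."""
    n = A.shape[0]
    T = n if T is None else T
    blocks = [np.linalg.matrix_power(A, k) @ B for k in range(T)]
    return np.linalg.matrix_rank(np.hstack(blocks)) == n

# Toy 2-state, 1-input system; A has spectral radius 1.2 (unstable)
A = np.array([[1.2, 0.5], [0.0, 0.9]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
K = np.array([[0.5, 1.0]])    # a stabilizing gain for this toy system

print(is_controllable(A, B))
print(lqr_cost(A, B, K, Q, R, np.array([1.0, 1.0]), T=20))
```

With this particular K the closed-loop matrix A - BK has spectral radius below one, so the accumulated cost stays finite even though the open-loop A is unstable.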



By an unstable LQR we mean that the state transition matrix A in Equation (1) has a spectral radius outside the unit circle. For the rest of this paper, we will use "minimizing cost" as the objective, since this is closely related to optimal control and optimization. In this paper, "stability" denotes "input-to-output" stability, where a small perturbation of the input signal (control action/state perturbation) will not lead to a large deviation in the system output cost. We use model-free methods in this paper; therefore, the system output target is the cost function. For the formal stability definition, please refer to Section 3.1.
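The spectral-radius criterion above is easy to check and visualize numerically. The sketch below, with an illustrative diagonal A of our own choosing, verifies that the radius exceeds one and that the uncontrolled state norm grows geometrically as a result:

```python
import numpy as np

def spectral_radius(M):
    """Largest eigenvalue magnitude of a square matrix."""
    return max(abs(np.linalg.eigvals(M)))

A = np.array([[1.1, 0.0], [0.0, 0.5]])  # illustrative unstable state matrix
print(spectral_radius(A))               # 1.1 > 1, so the system is unstable

# Under x_{t+1} = A x_t (no control input), the state norm diverges
# at the geometric rate set by the spectral radius.
x = np.array([1.0, 1.0])
norms = []
for _ in range(30):
    x = A @ x
    norms.append(np.linalg.norm(x))
print(norms[-1])                        # roughly 1.1**30, about 17.4
```

The stable mode (eigenvalue 0.5) decays away, so after 30 steps the norm is dominated by the 1.1 eigenvalue; this unbounded growth is exactly what breaks the bounded-cost assumptions discussed above.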

