SYSTEM IDENTIFICATION AS A REINFORCEMENT LEARNING PROBLEM

Abstract

System identification, also known as learning forward models, has a long tradition in both science and engineering across different fields. In particular, it is a recurring theme in Reinforcement Learning research, where forward models approximate the state transition function of a Markov Decision Process by learning a mapping from the current state and action to the next state. This problem is commonly defined and solved as a Supervised Learning problem. However, several difficulties appear due to inherent complexities of the dynamics to be learned, for example, delayed effects, high non-linearity, non-stationarity, partial observability, and error accumulation when using compounded predictions (i.e., predictions based on past predictions) over large time horizons. In this paper, we elaborate on why and how this problem can be framed naturally and soundly as a Reinforcement Learning problem, and present experimental results that demonstrate that RL is a promising technique for learning forward models.

1. INTRODUCTION

One of the most distinguishing features of Reinforcement Learning (RL, Sutton & Barto, 1998) is the direct trial-and-error interaction between an agent (learner) and its environment (real world, plant, game, etc.), through which the agent learns the consequences of its actions and adapts its behavior to optimize a goal in the long run. However, in real-life applications, and specifically in industrial settings, there are many mission-critical assets where RL cannot be applied in its canonical "online" form due to security or operational risks, i.e., the risk of unsafe exploratory actions or of interrupting asset operations. Among recent lines of research in this field, "batch" and "offline" RL formulations (Lange et al., 2012; Fujimoto et al., 2019; Levine et al., 2020) break down many of these barriers, leading to successful applications to such problems. Our goal is to perform offline RL for these critical real-life situations (see dreaming or imagination, Ha & Schmidhuber, 2018; Hafner et al., 2020); however, our method is to apply online RL two times, (RL • RL)(data): first, to learn a good enough forward model, which is the focus of the present paper, and second, to learn a policy using that forward model as the environment. We seek to avoid the well-known negative results reported by Fujimoto et al. (2019), while leveraging the exploration ability of RL to learn good policies. Learning forward models has been an active area of research, with abundant contributions on the application of Machine Learning (ML) (see for instance Werbos, 1989; Fu & Li, 2013; Zhang, 2014; Abdufattokhov & Muhiddinov, 2019; Roehrl et al., 2020). In particular, it is a recurring topic of research within RL (Sutton, 1991; Sutton & Barto, 1998; Polydoros & Nalpantidis, 2017; Moerland et al., 2020), where forward models usually represent the transition function s_{t+1} = T(s_t, a_t) of some Markov Decision Process (MDP).
We denote an MDP as a tuple M = (S, A, T, R), where S is the state space, A is the action space, T is the transition function, and R is the reward function. Thus, s_{t+1} = T(s_t, a_t) represents the immediate state after the evolution of the system, starting at time t with state s_t and conditioned on an action a_t, and T is defined by the mapping function s_{t+1} = f(s_t, a_t). Learning a forward model is a task commonly defined as a Supervised Learning problem in a direct way (Jordan & Rumelhart, 1992; Moerland et al., 2020), with the set of observations X = {(s_t, a_t), ...}, labels y = {s_{t+1}, ...}, and a loss, e.g., L = ||f(s_t, a_t) - s_{t+1}||. However, this approach faces several challenges. Here, we propose that this problem has a more complete and natural definition as an RL problem, and show positive experimental results.
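The supervised formulation above can be sketched in a few lines. The following is a minimal illustration, not the method of this paper: it assumes hypothetical linear dynamics s_{t+1} = A s_t + B a_t (chosen only to generate data), fits f by least squares on the dataset X = {(s_t, a_t), ...} with labels y = {s_{t+1}, ...}, and then rolls the model out on its own predictions, the compounded-prediction setting discussed in the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear dynamics s_{t+1} = A @ s_t + B @ a_t, used only to
# generate transitions; the learner sees (s_t, a_t, s_{t+1}) tuples.
A = np.array([[0.9, 0.1],
              [0.0, 0.95]])
B = np.array([[0.0],
              [0.5]])

# Dataset X = {(s_t, a_t), ...} and labels y = {s_{t+1}, ...}.
S = rng.normal(size=(500, 2))   # states s_t
U = rng.normal(size=(500, 1))   # actions a_t
Y = S @ A.T + U @ B.T           # next states s_{t+1}

# Fit f(s_t, a_t) = [s_t; a_t] @ W by least squares, i.e. minimise
# L = ||f(s_t, a_t) - s_{t+1}|| over the dataset.
X = np.hstack([S, U])
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

one_step_mse = float(np.mean((X @ W - Y) ** 2))

# Compounded prediction: feed the model its own output for 20 steps and
# compare against the true trajectory under the same action sequence.
s_true = rng.normal(size=2)
s_pred = s_true.copy()
for a in rng.normal(size=(20, 1)):
    s_true = A @ s_true + B @ a
    s_pred = np.concatenate([s_pred, a]) @ W
rollout_err = float(np.linalg.norm(s_pred - s_true))

print(f"one-step MSE: {one_step_mse:.2e}, 20-step rollout error: {rollout_err:.2e}")
```

Because the illustrative dynamics are exactly linear, both errors are near machine precision here; with non-linear, non-stationary, or partially observed dynamics, the one-step loss no longer controls the compounded rollout error, which is precisely the difficulty the supervised formulation faces.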

