SYSTEM IDENTIFICATION AS A REINFORCEMENT LEARNING PROBLEM

Abstract

System identification, also known as learning forward models, has a long tradition in science and engineering across many fields. In particular, it is a recurring theme in Reinforcement Learning research, where forward models approximate the state transition function of a Markov Decision Process by learning a mapping from the current state and action to the next state. This problem is commonly defined and solved as a Supervised Learning problem. However, several difficulties appear due to inherent complexities of the dynamics to be learned, for example, delayed effects, high non-linearity, non-stationarity, partial observability, and error accumulation when using compounded predictions (i.e., predictions based on past predictions) over large time horizons. In this paper, we elaborate on why and how this problem fits naturally and soundly as a Reinforcement Learning problem, and present experimental results that demonstrate RL is a promising technique to learn forward models.

1. INTRODUCTION

One of the most distinguishing features of Reinforcement Learning (RL; Sutton & Barto, 1998) is the direct trial-and-error interaction between an agent (learner) and its environment (real world, plant, game, etc.), through which the agent learns the consequences of its actions and thus adapts its behavior to optimize a goal in the long run. However, in real-life applications, and specifically in industrial settings, there are many mission-critical assets where RL cannot be applied in its canonical "online" form due to security or operational risks, i.e., the risk of unsafe exploratory actions or the interruption of asset operations. Among recent lines of research in this field, "batch" and "offline" RL formulations (Lange et al., 2012; Fujimoto et al., 2019; Levine et al., 2020) break down many of these barriers, enabling successful application to such problems. Our goal is to perform offline RL for these critical real-life situations (see dreaming or imagination, Ha & Schmidhuber, 2018; Hafner et al., 2020); however, our method is to apply online RL twice, (RL • RL)(data): first, to learn a good enough forward model, which is the focus of the present paper, and second, to learn a policy using that forward model as the environment. We seek to avoid the well-known negative results reported by Fujimoto et al. (2019), while leveraging the exploration ability of RL to learn good policies. Learning forward models has been an active area of research, with abundant contributions on the application of Machine Learning (ML) (see, for instance, Werbos, 1989; Fu & Li, 2013; Zhang, 2014; Abdufattokhov & Muhiddinov, 2019; Roehrl et al., 2020). In particular, it is a recurring topic of research within RL (Sutton, 1991; Sutton & Barto, 1998; Polydoros & Nalpantidis, 2017; Moerland et al., 2020), where forward models usually represent the transition function s_{t+1} = T(s_t, a_t) of some Markov Decision Process (MDP).
We denote an MDP as a tuple M = (S, A, T, R), where S is the state space, A is the action space, T is the transition function, and R is the reward function; thus, s_{t+1} = T(s_t, a_t) represents the immediate state after the evolution of the system, starting at time t in state s_t and conditioned by an action a_t, and T is realized by a mapping function s_{t+1} = f(s_t, a_t). Learning a forward model is a task commonly defined as a Supervised Learning problem in a direct way (Jordan & Rumelhart, 1992; Moerland et al., 2020), given a set of observations X = {(s_t, a_t), ...}, labels y = {s_{t+1}, ...}, and a loss, e.g., L = ||f(s_t, a_t) - s_{t+1}||. However, this approach faces several challenges. Here, we propose that this problem has a more complete and natural definition as an RL problem, and show positive experimental results.
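The standard supervised formulation above can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: it fits a linear forward model f(s_t, a_t) = s_{t+1} by least squares on a toy, noiseless linear system (a stand-in for any regressor); all names (A_true, B_true, forward_model) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset of transitions (s_t, a_t, s_{t+1}) from an unknown linear system.
A_true = np.array([[0.9, 0.1], [0.0, 0.95]])
B_true = np.array([[0.0], [0.5]])
S = rng.normal(size=(500, 2))          # states s_t
U = rng.normal(size=(500, 1))          # actions a_t
S_next = S @ A_true.T + U @ B_true.T   # labels s_{t+1}

# Supervised learning: regress s_{t+1} on observations (s_t, a_t),
# minimizing the loss L = ||f(s_t, a_t) - s_{t+1}|| over the dataset.
X = np.hstack([S, U])
W, *_ = np.linalg.lstsq(X, S_next, rcond=None)

def forward_model(s, a):
    """One-step prediction s_hat_{t+1} = f(s_t, a_t)."""
    return np.concatenate([s, a]) @ W

err = np.linalg.norm(forward_model(S[0], U[0]) - S_next[0])
```

Because the toy system is noiseless and linear, the least-squares fit recovers the dynamics and the one-step error is essentially zero; the difficulties discussed next arise when such a model is rolled out over many steps.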

2. MOTIVATION AND PROBLEM DEFINITION

Why learn forward models with RL? The domains, tasks, and problems to which forward modeling is being applied are of increasing complexity, including time-delayed dynamical effects, a high degree of non-linearity, partial observability (POMDPs), and the compounding-error issue: a compounded rollout (i.e., iteratively generating samples based on past generated samples) induces a compounding error, since "when an imperfect model is used to generate sample rollouts, its errors tend to compound - a flawed sample is given as input to the model, which causes more errors, and so on" (Talvitie, 2014). This situation has raised the need for additional techniques that adapt the Supervised Learning framework to this increasing complexity, for instance: rollout testing for robustness beyond single-step learning, loss accumulation over rollouts for large-horizon prediction, recurrent networks, frame stacking, or neural Turing machines for partial observability, curriculum learning over increasing horizons, data augmentation to aid learning symmetries in the data, ensembles of stochastic neural networks to increase prediction accuracy and reduce bias, and a set of works focused on studying the compounding error (cf. Oh et al., 2015; Talvitie, 2017; Silver et al., 2017; Fleming, 2018; Asadi et al., 2018; 2019; Xiao et al., 2019; Lambert et al., 2021; 2022). On the other hand, RL has intrinsic features that provide a natural way to deal with many of these complexities: rollout learning by working with episodic tasks, minimization of the compounding error by optimizing for the long run through Bellman's equation, continuous learning from new experience without requiring full retraining, and, by design, suitability for stochastic scenarios and situations of partial observability. Does RL come with an unreasonable extra search cost?
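The compounding-error phenomenon described by Talvitie (2014) is easy to reproduce numerically. The following hypothetical sketch (not from the paper) compares a slightly imperfect one-step model evaluated on true states against the same model rolled out on its own predictions; the dynamics and error magnitudes are illustrative.

```python
# A slightly imperfect one-step model, rolled out on its own
# predictions, drifts far from the true trajectory even though its
# one-step error stays small.
def true_step(s):
    return 0.99 * s     # true (stable, linear) dynamics; actions omitted for brevity

def learned_step(s):
    return 0.999 * s    # learned model with a small one-step error

s_true, s_model = 1.0, 1.0
one_step_errors, rollout_errors = [], []
for t in range(50):
    # One-step error: model applied to the TRUE current state.
    one_step_errors.append(abs(learned_step(s_true) - true_step(s_true)))
    # Compounded rollout: model applied to its OWN previous prediction.
    s_true = true_step(s_true)
    s_model = learned_step(s_model)
    rollout_errors.append(abs(s_model - s_true))
```

After 50 steps the compounded rollout error exceeds the largest one-step error by more than an order of magnitude, which is precisely the credit-assignment structure that the RL formulation exploits.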
Solving a regression problem with RL incurs an extra cost due to the required exploration, i.e., searching for a point y_i ∈ y (a label) that is in fact already known. Why, then, solve this problem with RL? Is the cost unreasonable? Might we now have two problems instead of one? Suppose that we run compounded rollouts of a certain length h (time horizon), while minimizing the errors between each predicted point x_t = f(s_t, a_t) and its corresponding true target (label) y_t ∈ y. Since the predictions in the compounded rollout trajectories are sequentially dependent (by composition), we face a temporal credit assignment problem, i.e., the root cause of the compounding error. The RL framework deals with this problem naturally through Bellman's optimality criterion (Bellman, 1957). Thus, in this problem setting, we shall not assume that we are given "true targets". Moreover, since we rely on composition, in the Supervised Learning setting we need the corresponding observation points x_t for the predicted labels ŷ_t = f(x_t); however, these are not in the dataset, since the x_t are themselves predictions. Finally, this extra search cost allows learning a Q-function that explicitly estimates the expected prediction error the model will commit. Forward model learning as an RL problem: Learning forward models with RL can be achieved directly by translating the regression problem into an RL problem, where the observation (state) is formed by the current state of the system and the previously observed action, the actions of the agent represent predictions of the next state, and the reward signal is the negated total/cumulative prediction error. More formally, given an MDP M := (S, A, T, R), the forward learning problem is defined as the MDP M_F := (S_F, A_F, (D|O)_F, L_F), where D|O refers to a time series of transitions stored in a dataset (D) or observed (O) from a real-world process, and L_F = ||s_{t+1} - ŝ_{t+1}|| is the per-step prediction error, whose negation serves as the reward signal.
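The M_F construction described above can be sketched as a gym-style environment replayed over a stored dataset D: the agent's "action" is its prediction of the next state, and the reward is the negated prediction error. This is a minimal illustrative sketch under those assumptions, not the paper's implementation; the class and variable names (ForwardModelEnv, transitions) are hypothetical.

```python
import numpy as np

class ForwardModelEnv:
    """RL environment for forward-model learning over a dataset D of transitions."""

    def __init__(self, transitions):
        # transitions: list of (s_t, a_t, s_{t+1}) tuples from the dataset D
        self.transitions = transitions
        self.t = 0

    def reset(self):
        self.t = 0
        s, a, _ = self.transitions[self.t]
        # Observation: current system state plus the observed action.
        return np.concatenate([s, a])

    def step(self, prediction):
        # The agent's action is its prediction s_hat_{t+1}.
        _, a, s_next = self.transitions[self.t]
        reward = -np.linalg.norm(s_next - prediction)  # negated error L_F
        self.t += 1
        done = self.t >= len(self.transitions)
        if done:
            obs = np.concatenate([s_next, np.zeros_like(a)])
        else:
            s, a, _ = self.transitions[self.t]
            obs = np.concatenate([s, a])
        return obs, reward, done, {}
```

A perfect prediction yields a reward of zero, the maximum attainable, so maximizing return over an episode is equivalent to minimizing the cumulative prediction error along the rollout.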



Figure 1: The typical RL setting (left). RL flow for learning a forward model (right).

