CONTROL-AWARE REPRESENTATIONS FOR MODEL-BASED REINFORCEMENT LEARNING

Abstract

A major challenge in modern reinforcement learning (RL) is efficient control of dynamical systems from high-dimensional sensory observations. Learning controllable embedding (LCE) is a promising approach that addresses this challenge by embedding the observations into a lower-dimensional latent space, estimating the latent dynamics, and utilizing it to perform control in the latent space. Two important questions in this area are how to learn a representation that is amenable to the control problem at hand, and how to achieve an end-to-end framework for representation learning and control. In this paper, we take a few steps towards addressing these questions. We first formulate an LCE model to learn representations that are suitable to be used by a policy iteration style algorithm in the latent space. We call this model control-aware representation learning (CARL). We derive a loss function and three implementations for CARL. In the offline implementation, we replace the locally-linear control algorithm (e.g., iLQR) used by the existing LCE methods with an RL algorithm, namely model-based soft actor-critic, and show that it results in significant improvement. In online CARL, we interleave representation learning and control, and demonstrate further gain in performance. Finally, we propose value-guided CARL, a variation in which we optimize a weighted version of the CARL loss function, where the weights depend on the TD-error of the current policy. We evaluate the proposed algorithms by extensive experiments on benchmark tasks and compare them with several LCE baselines.

1. INTRODUCTION

Control of non-linear dynamical systems is a key problem in control theory. Many methods have been developed with different levels of success in different classes of such problems. The majority of these methods assume that a model of the system is known and its underlying state is low-dimensional and observable. These requirements limit the usage of these techniques in controlling dynamical systems from high-dimensional raw sensory data (e.g., images), where the system dynamics is unknown, a scenario often seen in modern reinforcement learning (RL). Recent years have witnessed a rapid development of a large arsenal of model-free RL algorithms, such as DQN (Mnih et al., 2013), TRPO (Schulman et al., 2015), PPO (Schulman et al., 2017), and SAC (Haarnoja et al., 2018), with impressive success in solving high-dimensional control problems. However, most of this success has been limited to simulated environments (e.g., computer games), mainly due to the fact that these algorithms often require a large number of samples from the environment. This restricts their applicability in real-world physical systems, for which data collection is often a difficult process. On the other hand, model-based RL algorithms, such as PILCO (Deisenroth & Rasmussen, 2011), MBPO (Janner et al., 2019), and Visual Foresight (Ebert et al., 2018), despite their success, still face difficulties in learning a model (dynamics) in a high-dimensional (pixel) space. To address the problems faced by model-free and model-based RL algorithms in solving high-dimensional control problems, a class of algorithms has been developed, whose main idea is to first learn a low-dimensional latent (embedding) space and a latent model (dynamics), and then use this model to control the system in the latent space.
This class has been referred to as learning controllable embedding (LCE) and includes algorithms such as E2C (Watter et al., 2015), RCE (Banijamali et al., 2018), SOLAR (Zhang et al., 2019), PCC (Levine et al., 2020), Dreamer (Hafner et al., 2020a;b), PC3 (Shu et al., 2020), and SLAC (Lee et al., 2020). The following two properties are extremely important in designing LCE models and algorithms. First, to learn a representation that is the most suitable for the control problem at hand. This suggests incorporating the control algorithm in the process of learning the representation. This view of learning control-aware representations is aligned with the value-aware and policy-aware model learning frameworks, VAML (Farahmand, 2018) and PAML (Abachi et al., 2020), that have been recently proposed in model-based RL. Second, to interleave representation learning and control, and to update them both using a unifying objective function. This allows for an end-to-end framework for representation learning and control. LCE methods, such as SOLAR, Dreamer, and SLAC, have taken steps towards the second objective by performing representation learning and control in an online fashion. This is in contrast to offline methods like E2C, RCE, PCC, and PC3 that learn a representation once and then use it in the entire control process. On the other hand, methods like PCC and PC3 address the first objective by adding a term to their representation learning loss function that accounts for the curvature of the latent dynamics. This term regularizes the representation towards smoother latent dynamics, which are suitable for the locally-linear controllers, e.g., iLQR (Li & Todorov, 2004), used by these methods.

In this paper, we take a few steps towards the above two objectives. We first formulate an LCE model to learn representations that are suitable to be used by a policy iteration (PI) style algorithm in the latent space.
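The curvature regularization idea mentioned above can be illustrated with a small, self-contained sketch: perturb a latent transition function around a point and measure the residual between the true step and a first-order (locally-linear) prediction. This is only a finite-difference illustration of the principle, not the exact PCC/PC3 term; the function names and sampling scheme are our own.

```python
import numpy as np

def curvature_penalty(f, z, u, eps=0.1, n_samples=8, rng=None):
    """Estimate how far latent dynamics f(z, u) deviates from local linearity.

    Perturbs (z, u) with small Gaussian noise and compares the true step
    f(z + dz, u + du) against a first-order prediction obtained by central
    differences along the perturbation. A linear f yields (numerically)
    zero penalty; the penalty grows with the curvature of f.
    """
    rng = rng or np.random.default_rng(0)
    f0 = f(z, u)
    penalty = 0.0
    for _ in range(n_samples):
        dz = eps * rng.standard_normal(z.shape[0])
        du = eps * rng.standard_normal(u.shape[0])
        # central-difference approximation of the directional derivative of f
        jvp = (f(z + dz, u + du) - f(z - dz, u - du)) / 2.0
        # second-order residual: exactly 0 for affine dynamics
        penalty += np.sum((f(z + dz, u + du) - f0 - jvp) ** 2)
    return penalty / n_samples
```

Adding such a term to the representation loss pushes the encoder towards latent spaces in which the dynamics is close to linear, which is what iLQR-style controllers assume.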
We call this model control-aware representation learning (CARL) and derive a loss function for it that exhibits a close connection to the prediction, consistency, and curvature (PCC) principle for representation learning (Levine et al., 2020). We derive three implementations of CARL: offline, online, and value-guided. Similar to offline LCE methods, such as E2C, RCE, PCC, and PC3, in offline CARL, we first learn a representation and then use it in the entire control process. However, in offline CARL, we replace the locally-linear control algorithm (e.g., iLQR) used by these LCE methods with a PI-style (actor-critic) RL algorithm. Our choice of RL algorithm is the model-based implementation of soft actor-critic (SAC) (Haarnoja et al., 2018). Our experiments show significant performance improvement by replacing iLQR with SAC. Online CARL is an iterative algorithm in which at each iteration, we first learn a latent representation by minimizing the CARL loss, and then perform several policy updates using SAC in this latent space. Our experiments with online CARL show further performance gain over its offline version. Finally, in value-guided CARL (V-CARL), we optimize a weighted version of the CARL loss function, in which the weights depend on the TD-error of the current policy. This further incorporates the control algorithm into the representation learning process. We evaluate the proposed algorithms by extensive experiments on benchmark tasks and compare them with several LCE baselines: PCC, SOLAR, and Dreamer.
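The V-CARL idea of weighting the representation loss by the TD-error can be sketched as follows. This is a minimal illustration under our own assumptions (exponential weights with a temperature, normalized to a distribution); the exact weighting scheme used by V-CARL is derived later in the paper and may differ.

```python
import numpy as np

def td_error(r, gamma, v, v_next):
    """One-step TD-error of the current policy: delta = r + gamma*V(z') - V(z)."""
    return r + gamma * v_next - v

def td_weighted_loss(per_sample_losses, td_errors, temperature=1.0):
    """Weight per-sample representation losses by the magnitude of the TD-error.

    Samples where the critic is most surprised (large |delta|) receive larger
    weight, focusing representation learning on the regions of the latent
    space that matter most for the current policy's value estimates.
    """
    w = np.exp(np.abs(td_errors) / temperature)
    w = w / w.sum()  # normalize to a distribution over the batch
    return float(np.sum(w * per_sample_losses))
```

When all TD-errors are equal, this reduces to the unweighted (mean) CARL loss, so the weighting only reshapes the objective where the current policy's value function is inaccurate.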

2. PROBLEM FORMULATION

We are interested in learning control policies for non-linear dynamical systems, where the states s ∈ S ⊆ ℝ^{n_s} are not fully observed and we only have access to their high-dimensional observations x ∈ X ⊆ ℝ^{n_x}, n_x ≫ n_s. This scenario captures many practical applications in which we interact with a system only through high-dimensional sensory signals, such as image and audio. We assume that the observations x have been selected such that we can model the system in the observation space using a Markov decision process (MDP)¹ M_X = ⟨X, A, r, P, γ⟩, where X and A are the observation and action spaces; r : X × A → ℝ is the reward function with maximum value R_max, defined by the designer of the system to achieve the control objective;² P : X × A → P(X) is the unknown transition kernel; and γ ∈ (0, 1) is the discount factor. Our goal is to find a mapping from observations to control signals, µ : X → P(A), with maximum expected return, i.e., J(µ) = E[∑_{t=0}^{∞} γ^t r(x_t, a_t) | P, µ]. Since the observations x are high-dimensional and the observation dynamics P is unknown, solving the control problem in the observation space may not be efficient. As discussed in Section 1, the class of learning controllable embedding (LCE) algorithms addresses this by learning a low-dimensional latent (embedding) space Z ⊆ ℝ^{n_z}, n_z ≪ n_x, together with a latent dynamics, and controlling the system there. The main idea behind LCE is to learn an encoder E : X → P(Z), a latent space dynamics F : Z × A → P(Z), and a decoder D : Z → P(X),³ such that a good or optimal controller (policy) in Z performs well in the observation space X. This means that if we model the control problem in Z as an MDP M_Z = ⟨Z, A, r, F, γ⟩ and solve it using a model-based RL algorithm to obtain a policy π : Z → P(A), the image of π back in the observation space, i.e.,



¹ A method to ensure observations are Markovian is to buffer them for several time steps (Mnih et al., 2013).
² For example, in a goal tracking problem in which the agent (robot) aims at finding the shortest path to reach the observation goal x_g (the observation corresponding to the goal state s_g), we may define the reward for each observation x as the negative of its distance to x_g, i.e., −‖x − x_g‖².
³ Some recent LCE models, such as PC3 (Shu et al., 2020), advocate latent models without a decoder. Although we are aware of the merits of such an approach, we use a decoder in the models proposed in this paper.
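The three LCE components and the goal-tracking reward of footnote 2 can be made concrete with toy, deterministic stand-ins. Real implementations would be neural networks outputting distribution parameters; the linear maps, dimensions (n_x = 16, n_z = 2, n_a = 1), and the linear policy below are all hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the three LCE components.
W_e = rng.standard_normal((2, 16)) * 0.1    # encoder  E: X -> Z
A, B = np.eye(2) * 0.95, np.ones((2, 1))    # dynamics F: Z x A -> Z
W_d = rng.standard_normal((16, 2)) * 0.1    # decoder  D: Z -> X

encode = lambda x: W_e @ x
latent_step = lambda z, a: A @ z + B @ a
decode = lambda z: W_d @ z

def goal_reward(x, x_goal):
    """Footnote-2 style reward: negative squared distance to the goal observation."""
    return -float(np.sum((x - x_goal) ** 2))

# A latent policy pi: Z -> A lifts to the observation space as pi(E(x)):
pi = lambda z: np.array([-0.5 * z.sum()])   # hypothetical linear latent policy
x = rng.standard_normal(16)                  # a raw observation
a = pi(encode(x))                            # act on the embedded observation
```

The last two lines show the "image of π back in the observation space" informally: the observation-space controller is the composition of the encoder and the latent policy.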

