CONTROL-AWARE REPRESENTATIONS FOR MODEL-BASED REINFORCEMENT LEARNING

Abstract

A major challenge in modern reinforcement learning (RL) is efficient control of dynamical systems from high-dimensional sensory observations. Learning controllable embedding (LCE) is a promising approach that addresses this challenge by embedding the observations into a lower-dimensional latent space, estimating the latent dynamics, and utilizing it to perform control in the latent space. Two important questions in this area are how to learn a representation that is amenable to the control problem at hand, and how to achieve an end-to-end framework for representation learning and control. In this paper, we take a few steps towards addressing these questions. We first formulate an LCE model to learn representations that are suitable to be used by a policy iteration style algorithm in the latent space. We call this model control-aware representation learning (CARL). We derive a loss function and three implementations for CARL. In the offline implementation, we replace the locally-linear control algorithm (e.g., iLQR) used by the existing LCE methods with an RL algorithm, namely model-based soft actor-critic, and show that it results in significant improvement. In online CARL, we interleave representation learning and control, and demonstrate further gain in performance. Finally, we propose value-guided CARL, a variation in which we optimize a weighted version of the CARL loss function, where the weights depend on the TD-error of the current policy. We evaluate the proposed algorithms by extensive experiments on benchmark tasks and compare them with several LCE baselines.

1. INTRODUCTION

Control of non-linear dynamical systems is a key problem in control theory. Many methods have been developed with different levels of success in different classes of such problems. The majority of these methods assume that a model of the system is known and its underlying state is low-dimensional and observable.
These requirements limit the usage of these techniques in controlling dynamical systems from high-dimensional raw sensory data (e.g., images), where the system dynamics is unknown, a scenario often seen in modern reinforcement learning (RL). Recent years have witnessed a rapid development of a large arsenal of model-free RL algorithms, such as DQN (Mnih et al., 2013), TRPO (Schulman et al., 2015), PPO (Schulman et al., 2017), and SAC (Haarnoja et al., 2018), with impressive success in solving high-dimensional control problems. However, most of this success has been limited to simulated environments (e.g., computer games), mainly because these algorithms often require a large number of samples from the environment. This restricts their applicability in real-world physical systems, for which data collection is often a difficult process. On the other hand, model-based RL algorithms, such as PILCO (Deisenroth & Rasmussen, 2011), MBPO (Janner et al., 2019), and Visual Foresight (Ebert et al., 2018), despite their success, still face difficulties in learning a model (dynamics) in a high-dimensional (pixel) space. To address the problems faced by model-free and model-based RL algorithms in solving high-dimensional control problems, a class of algorithms has been developed, whose main idea is to first learn a low-dimensional latent (embedding) space and a latent model (dynamics), and then use this model to control the system in the latent space. This class has been referred to as learning controllable embedding (LCE) and includes algorithms, such as E2C (Watter et al.



process of learning representation. This view of learning control-aware representations is aligned with the value-aware and policy-aware model learning frameworks, VAML (Farahmand, 2018) and PAML (Abachi et al., 2020), that have been recently proposed in model-based RL. Second, to interleave the representation learning and control, and to update them both, using a unifying objective function. This allows for an end-to-end framework for representation learning and control. LCE methods, such as SOLAR, Dreamer, and SLAC, have taken steps towards the second objective by performing representation learning and control in an online fashion. This is in contrast to offline methods like E2C, RCE, PCC, and PC3 that learn a representation once and then use it in the entire control process. On the other hand, methods like PCC and PC3 address the first objective by adding a term to their representation learning loss function that accounts for the curvature of the latent dynamics. This term regularizes the representation towards smoother latent dynamics, which are suitable for the locally-linear controllers, e.g., iLQR (Li & Todorov, 2004), used by these methods. In this paper, we take a few steps towards the above two objectives. We first formulate an LCE model to learn representations that are suitable to be used by a policy iteration (PI) style algorithm in the latent space. We call this model control-aware representation learning (CARL) and derive a loss function for it that exhibits a close connection to the prediction, consistency, and curvature (PCC) principle for representation learning (Levine et al., 2020). We derive three implementations of CARL: offline, online, and value-guided. Similar to offline LCE methods, such as E2C, RCE, PCC, and PC3, in offline CARL, we first learn a representation and then use it in the entire control process.
However, in offline CARL, we replace the locally-linear control algorithm (e.g., iLQR) used by these LCE methods with a PI-style (actor-critic) RL algorithm. Our choice of RL algorithm is the model-based implementation of soft actor-critic (SAC) (Haarnoja et al., 2018). Our experiments show significant performance improvement by replacing iLQR with SAC. Online CARL is an iterative algorithm in which at each iteration, we first learn a latent representation by minimizing the CARL loss, and then perform several policy updates using SAC in this latent space. Our experiments with online CARL show further performance gain over its offline version. Finally, in value-guided CARL (V-CARL), we optimize a weighted version of the CARL loss function, in which the weights depend on the TD-error of the current policy. This helps to further incorporate the control algorithm in the representation learning process. We evaluate the proposed algorithms by extensive experiments on benchmark tasks and compare them with several LCE baselines: PCC, SOLAR, and Dreamer.

2. PROBLEM FORMULATION

We are interested in learning control policies for non-linear dynamical systems, where the states $s \in \mathcal{S} \subseteq \mathbb{R}^{n_s}$ are not fully observed and we only have access to their high-dimensional observations $x \in \mathcal{X} \subseteq \mathbb{R}^{n_x}$, $n_x \gg n_s$. This scenario captures many practical applications in which we interact with a system only through high-dimensional sensory signals, such as images and audio. We assume that the observations $x$ have been selected such that we can model the system in the observation space using a Markov decision process (MDP) (footnote 1) $\mathcal{M}_{\mathcal{X}} = \langle \mathcal{X}, \mathcal{A}, r, P, \gamma \rangle$, where $\mathcal{X}$ and $\mathcal{A}$ are observation and action spaces; $r : \mathcal{X} \times \mathcal{A} \to \mathbb{R}$ is the reward function with maximum value $R_{\max}$, defined by the designer of the system to achieve the control objective (footnote 2); $P : \mathcal{X} \times \mathcal{A} \to \mathbb{P}(\mathcal{X})$ is the unknown transition kernel; and $\gamma \in (0, 1)$ is the discount factor. Our goal is to find a mapping from observations to control signals, $\mu : \mathcal{X} \to \mathbb{P}(\mathcal{A})$, with maximum expected return, i.e., $J(\mu) = \mathbb{E}\big[\sum_{t=0}^{\infty} \gamma^t r(x_t, a_t) \mid P, \mu\big]$. Since the observations $x$ are high-dimensional and the observation dynamics $P$ is unknown, solving the control problem in the observation space may not be efficient. As discussed in Section 1, the class of learning controllable embedding (LCE) algorithms addresses this by learning a low-dimensional latent (embedding) space $\mathcal{Z} \subseteq \mathbb{R}^{n_z}$, $n_z \ll n_x$, together with a latent dynamics, and controlling the system there. The main idea behind LCE is to learn an encoder $E : \mathcal{X} \to \mathbb{P}(\mathcal{Z})$, a latent space dynamics $F : \mathcal{Z} \times \mathcal{A} \to \mathbb{P}(\mathcal{Z})$, and a decoder $D : \mathcal{Z} \to \mathbb{P}(\mathcal{X})$ (footnote 3), such that a good or optimal controller (policy) in $\mathcal{Z}$ performs well in the observation space $\mathcal{X}$.
This means that if we model the control problem in $\mathcal{Z}$ as an MDP $\mathcal{M}_{\mathcal{Z}} = \langle \mathcal{Z}, \mathcal{A}, \bar{r}, F, \gamma \rangle$ and solve it using a model-based RL algorithm to obtain a policy $\pi : \mathcal{Z} \to \mathbb{P}(\mathcal{A})$, the image of $\pi$ back in the observation space, i.e., $(\pi \circ E)(a|x) = \int_z dE(z|x)\,\pi(a|z)$, should have high expected return. Thus, the loss function to learn $\mathcal{Z}$ and $(E, F, D)$ from observations $\{(x_t, a_t, r_t, x_{t+1})\}$ should be designed to comply with this goal. This is why in this paper, we propose an LCE framework that tries to incorporate the control algorithm used in the latent space in the representation learning process. We call this model control-aware representation learning (CARL). In CARL, we set the class of control (RL) algorithms used in the latent space to approximate policy iteration (PI), and more specifically to soft actor-critic (SAC) (Haarnoja et al., 2018). Before describing CARL in detail in the following sections, we present a number of useful definitions and notations here.

Algorithm 1 Latent Space Learning with Policy Iteration (LSLPI)
1: Inputs: $E^{(0)}, F^{(0)}, D^{(0)}$;
2: Initialization: $\mu^{(0)}$ = random policy; $\mathcal{D} \leftarrow$ samples generated from $\mu^{(0)}$;
3: for $i = 0, 1, \ldots$ do
4:   Compute $\pi^{(i)}$ as the projection of $\mu^{(i)}$ in the latent space w.r.t. $D_{KL}(\pi \circ E \,\|\, \mu)$;   # $\mu^{(i)} \approx \pi^{(i)} \circ E^{(i)}$
5:   Compute the value function of $\pi^{(i)}$ and set $V^{(i)} = V_{\pi^{(i)}}$;   # policy evaluation (critic)
6:   Compute the greedy policy w.r.t. $V^{(i)}$ and set $\pi_+^{(i)} = G[V^{(i)}]$;   # policy improvement (actor)
7:   Set $\mu^{(i+1)} = \pi_+^{(i)} \circ E^{(i)}$;   # project the improved policy $\pi_+^{(i)}$ back into the observation space
8:   Learn $(E^{(i+1)}, F^{(i+1)}, D^{(i+1)}, \bar{r}^{(i+1)})$ from $\mathcal{D}$, $\pi^{(i)}$, and $\pi_+^{(i)}$;   # representation learning
9:   Generate samples $\mathcal{D}^{(i+1)} = \{(x_t, a_t, r_t, x_{t+1})\}_{t=1}^{n}$ from $\mu^{(i+1)}$; $\mathcal{D} \leftarrow \mathcal{D} \cup \mathcal{D}^{(i+1)}$;
10: end for
For any policy $\mu$ in $\mathcal{X}$, we define its value function $U_\mu$ and Bellman operator $T_\mu$ as

$$U_\mu(x) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t r_\mu(x_t) \,\Big|\, P_\mu, x_0 = x\Big], \qquad T_\mu[U](x) = \mathbb{E}_{x' \sim P_\mu(\cdot|x)}\big[r_\mu(x) + \gamma U(x')\big],$$

for all $x \in \mathcal{X}$ and $U : \mathcal{X} \to \mathbb{R}$, where $r_\mu(x) = \int_a d\mu(a|x)\, r(x, a)$ and $P_\mu(x'|x) = \int_a d\mu(a|x)\, P(x'|x, a)$ are the reward function and dynamics induced by $\mu$. Similarly, for any policy $\pi$ in $\mathcal{Z}$, we define its induced reward function and dynamics as $\bar{r}_\pi(z) = \int_a d\pi(a|z)\, \bar{r}(z, a)$ and $F_\pi(z'|z) = \int_a d\pi(a|z)\, F(z'|z, a)$. We also define its value function $V_\pi$ and Bellman operator $T_\pi$ as

$$V_\pi(z) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t \bar{r}_\pi(z_t) \,\Big|\, F_\pi, z_0 = z\Big], \qquad T_\pi[V](z) = \mathbb{E}_{z' \sim F_\pi(\cdot|z)}\big[\bar{r}_\pi(z) + \gamma V(z')\big]. \tag{2}$$

For any policy $\pi$ and value function $V$ in the latent space $\mathcal{Z}$, we denote by $\pi \circ E$ and $V \circ E$ their images in the observation space $\mathcal{X}$, given encoder $E$, and define them as

$$(\pi \circ E)(a|x) = \int_z dE(z|x)\, \pi(a|z), \qquad (V \circ E)(x) = \int_z dE(z|x)\, V(z).$$
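To make the composition operators concrete, the following is a minimal sketch for the special case of finite observation, latent, and action sets, where the integrals over $z$ reduce to matrix products. The tabular setting, the function names, and the toy numbers are assumptions made purely for illustration; the paper itself works with continuous spaces and learned networks.

```python
import numpy as np

def compose_policy(E, pi):
    """Image of a latent policy in observation space.

    E[x, z]  = E(z|x), a row-stochastic encoder table (assumed finite).
    pi[z, a] = pi(a|z), a row-stochastic latent policy table.
    Returns (pi o E)[x, a] = sum_z E(z|x) * pi(a|z).
    """
    return E @ pi

def compose_value(E, V):
    """Image of a latent value function: (V o E)(x) = sum_z E(z|x) * V(z)."""
    return E @ V

# Hypothetical toy instance: 3 observations, 2 latent states, 2 actions.
E = np.array([[0.9, 0.1],
              [0.2, 0.8],
              [0.5, 0.5]])
pi = np.array([[0.7, 0.3],
               [0.1, 0.9]])
V = np.array([1.0, 3.0])

mu = compose_policy(E, pi)    # each row is a distribution over actions
U_img = compose_value(E, V)   # one scalar per observation
print(mu)
print(U_img)
```

Because both `E` and `pi` are row-stochastic, every row of `mu` is again a valid action distribution, which is what allows $\pi \circ E$ to be used directly as a policy in $\mathcal{X}$.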

3. CARL MODEL: A CONTROL PERSPECTIVE

In this section, we formulate our LCE model, which we refer to as control-aware representation learning (CARL). As described in Section 2, CARL is a model for learning a low-dimensional latent space $\mathcal{Z}$ and the latent dynamics, from data generated in the observation space $\mathcal{X}$, such that this representation is suitable to be used by a policy iteration (PI) style algorithm in $\mathcal{Z}$. In order to derive the loss function used by CARL to learn $\mathcal{Z}$ and its dynamics, i.e., $(E, F, D, \bar{r})$, we first describe how the representation learning can be interleaved with PI in $\mathcal{Z}$. Algorithm 1 contains the pseudo-code of the resulting algorithm, which we refer to as latent space learning with policy iteration (LSLPI). Each iteration $i$ of LSLPI starts with a policy $\mu^{(i)}$ in the observation space $\mathcal{X}$, which is the mapping of the improved policy in $\mathcal{Z}$ in iteration $i-1$, i.e., $\pi_+^{(i-1)}$, back in $\mathcal{X}$ through the encoder $E^{(i-1)}$ (Lines 6 and 7). We then compute $\pi^{(i)}$, the current policy in $\mathcal{Z}$, as the image of $\mu^{(i)}$ in $\mathcal{Z}$ through the encoder $E^{(i)}$ (Line 4). Note that $E^{(i)}$ is the encoder learned at the end of iteration $i-1$ (Line 8). We then use the latent space dynamics $F^{(i)}$ learned at the end of iteration $i-1$ (Line 8), and first compute the value function of $\pi^{(i)}$ in the policy evaluation or critic step, i.e., $V^{(i)} = V_{\pi^{(i)}}$ (Line 5), and then use $V^{(i)}$ to compute the improved policy $\pi_+^{(i)}$ as the greedy policy w.r.t. $V^{(i)}$, i.e., $\pi_+^{(i)} = G[V^{(i)}]$, in the policy improvement or actor step (Line 6). Using the samples in the buffer $\mathcal{D}$, together with the current policies in $\mathcal{Z}$, i.e., $\pi^{(i)}$ and $\pi_+^{(i)}$, we learn the new representation $(E^{(i+1)}, F^{(i+1)}, D^{(i+1)}, \bar{r}^{(i+1)})$ (Line 8). Finally, we generate samples $\mathcal{D}^{(i+1)}$ by following $\mu^{(i+1)}$, the image of the improved policy $\pi_+^{(i)}$ back in $\mathcal{X}$ using the old encoder $E^{(i)}$ (Line 7), and add them to the buffer $\mathcal{D}$ (Line 9), and the algorithm iterates. It is important to note that both critic and actor operate in the low-dimensional latent space $\mathcal{Z}$.
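The critic and actor steps of LSLPI (Lines 5 and 6 of Algorithm 1) can be sketched for a small tabular latent MDP as follows. This is only an illustration under strong assumptions: the latent model `F` and reward `r` are taken as given and exact, the greedy operator is a hard argmax rather than the soft update used by SAC, and the representation learning and data collection steps (Lines 7 to 9) are omitted. All names and numbers are hypothetical.

```python
import numpy as np

def evaluate(F, r, pi, gamma):
    """Critic: solve V = r_pi + gamma * F_pi @ V exactly.

    F[z, a, y] = F(y|z, a), r[z, a] = latent reward, pi[z, a] = pi(a|z).
    """
    r_pi = (pi * r).sum(axis=1)             # reward induced by pi
    F_pi = np.einsum("za,zay->zy", pi, F)   # dynamics induced by pi
    return np.linalg.solve(np.eye(r_pi.size) - gamma * F_pi, r_pi)

def greedy(F, r, V, gamma):
    """Actor: deterministic greedy policy w.r.t. V (hard argmax, an assumption)."""
    Q = r + gamma * (F @ V)                 # Q[z, a]
    pi_plus = np.zeros_like(r)
    pi_plus[np.arange(r.shape[0]), Q.argmax(axis=1)] = 1.0
    return pi_plus

# Toy latent MDP: 2 states, 2 actions; only state 1 is rewarding.
gamma = 0.9
F = np.zeros((2, 2, 2))
F[0, 0, 0] = F[1, 0, 1] = 1.0   # action 0: stay in the current state
F[0, 1, 1] = F[1, 1, 0] = 1.0   # action 1: switch to the other state
r = np.array([[0.0, 0.0],
              [1.0, 1.0]])
pi = np.full((2, 2), 0.5)        # current policy: uniformly random

V = evaluate(F, r, pi, gamma)            # policy evaluation
pi_plus = greedy(F, r, V, gamma)         # policy improvement
V_plus = evaluate(F, r, pi_plus, gamma)  # value of the improved policy
print(V, V_plus)
```

In this toy case the improved policy moves to the rewarding state and stays there, so `V_plus` dominates `V` in every latent state. Theorem 1 below concerns when such an improvement in $\mathcal{Z}$ carries over to $\mathcal{X}$.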
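LSLPI is a PI algorithm in $\mathcal{Z}$. However, what is desired is that it also acts as a PI algorithm in $\mathcal{X}$, i.e., that it results in (monotonic) policy improvement in $\mathcal{X}$: $U_{\mu^{(i+1)}} \geq U_{\mu^{(i)}}$. Therefore, we define the representation learning loss function for CARL such that it ensures LSLPI also results in policy improvement in $\mathcal{X}$. The following theorem, whose proof is reported in Appendix A, shows the relationship between the value functions of two consecutive policies generated by LSLPI in $\mathcal{X}$.

Theorem 1. Let $\mu$, $\mu_+$, $\pi$, $\pi_+$, and $(E, F, D, \bar{r})$ be the policies $\mu^{(i)}$, $\mu^{(i+1)}$, $\pi^{(i)}$, $\pi_+^{(i)}$, and the learned latent representation $(E^{(i+1)}, F^{(i+1)}, D^{(i+1)}, \bar{r}^{(i+1)})$ at iteration $i$ of the LSLPI algorithm (Algorithm 1). Then, the following holds for the value functions of $\mu$ and $\mu_+$:

$$U_{\mu_+}(x) \geq U_\mu(x) - \bigg( \frac{1}{1-\gamma} \sum_{\pi \in \{\pi, \pi_+\}} \mathbb{E}_{d^\gamma_{\pi \circ E}}\big[\Delta(E, F, D, \bar{r}, \pi, \cdot) \mid x_0 = x\big] + \frac{\sqrt{2}\,\gamma R_{\max}}{1-\gamma} \cdot \mathbb{E}_{d^\gamma_{\pi \circ E}}\big[\underbrace{D_{KL}\big((\pi \circ E)(\cdot|\cdot) \,\|\, \mu(\cdot|\cdot)\big)}_{L_{reg}(E, \mu, \pi, \cdot)} \mid x_0 = x\big] \bigg), \tag{4}$$

for all $x \in \mathcal{X}$, where $d^\gamma_{\pi \circ E}(x'|x_0) = (1-\gamma) \cdot \sum_{t=0}^{\infty} \gamma^t\, \mathbb{P}(x_t = x' \mid x_0; \pi \circ E)$ is the $\gamma$-stationary distribution induced by policy $\pi \circ E$, and the error term $\Delta$ for a policy $\pi$ is given by

$$\Delta(E, F, D, \bar{r}, \pi, x) = \frac{R_{\max}}{1-\gamma} \underbrace{\Big(-\frac{1}{2} \int_z dE(z|x) \log D(x|z)\Big)}_{(\mathrm{I}) = L_{ed}(E, D, x)} + 2 \underbrace{\Big| r_{\pi \circ E}(x) - \int_z dE(z|x)\, \bar{r}_\pi(z) \Big|}_{(\mathrm{II}) = L_r(E, \bar{r}, \pi, x)} + \frac{\gamma R_{\max}}{\sqrt{2}(1-\gamma)} \Big( \underbrace{D_{KL}\big(P_{\pi \circ E}(\cdot|x) \,\|\, (D \circ F_\pi \circ E)(\cdot|x)\big)}_{(\mathrm{III}) = L_p(E, F, D, \pi, x)} + \underbrace{D_{KL}\big((E \circ P_{\pi \circ E})(\cdot|x) \,\|\, (F_\pi \circ E)(\cdot|x)\big)}_{(\mathrm{IV})} \Big). \tag{5}$$

It is easy to see that LSLPI guarantees (policy) improvement in $\mathcal{X}$ if the terms in the parentheses on the RHS of (4) are zero. We now describe these terms. The last term on the RHS of (4) is the KL between $\pi^{(i)} \circ E$ and $\mu^{(i)}$. This term can be seen as a regularizer to keep the new encoder $E$ close to the old one $E^{(i)}$.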
The four terms in (5) are: (I) the encoding-decoding error, which ensures $x \approx (D \circ E)(x)$; (II) the error that measures the mismatch between the reward of taking an action according to policy $\pi \circ E$ at $x \in \mathcal{X}$, and the reward of taking an action according to policy $\pi$ at the image of $x$ in $\mathcal{Z}$ under $E$; (III) the error in predicting the next observation through paths in $\mathcal{X}$ and $\mathcal{Z}$, i.e., the error between the two next observations reached along the paths shown in Fig. 1(a); and (IV) the error in predicting the next latent state through paths in $\mathcal{X}$ and $\mathcal{Z}$, i.e., the error between the two next latent states reached along the paths shown in Fig. 1(b).

Representation Learning in CARL

Theorem 1 provides us with a recipe (loss function) to learn the latent space $\mathcal{Z}$ and $(E, F, D, \bar{r})$. In CARL, we propose to learn a representation for which the terms in the parentheses on the RHS of (4) are small. As mentioned earlier, the second term, $L_{reg}(E, \mu, \pi, x)$, can be considered as a regularizer to keep the new encoder $E$ close to the old one $E^-$, when the policy $\mu$ is given by $\pi \circ E^-$. Term (I) minimizes the reconstruction error between encoder and decoder, which is standard for training auto-encoders (Kingma & Welling, 2013). Term (II), which measures the mismatch between rewards, can be kept small, or even zero, if the designer of the system selects the rewards in a compatible way (footnote 4). Although CARL allows us to learn a reward function in the latent space, similar to several other LCE works (Watter et al., 2015; Banijamali et al., 2018; Levine et al., 2020; Shu et al., 2020), in this paper, we assume that a compatible latent reward function is given. Terms (III) and (IV) are the equivalent of the prediction and consistency terms in PCC (Levine et al., 2020) for a particular latent space policy $\pi$. Since PCC has been designed for an offline setting (i.e., one-shot representation learning and control), its prediction and consistency terms are independent of a particular policy and are defined for state-action pairs.
In contrast, CARL is designed for an online setting (i.e., interleaving representation learning and control), and thus, its loss function at each iteration depends on the current latent space policies $\pi$ and $\pi_+$. As we will see in Section 4, in our offline implementation of CARL, these two terms are similar to the prediction and consistency terms in PCC. Note that (IV) is slightly different from the consistency term in PCC. However, if we upper-bound it using Jensen's inequality,

$$(\mathrm{IV}) \leq L_c(E, F, \pi, x) := \int_{x' \in \mathcal{X}} dP_{\pi \circ E}(x'|x) \cdot D_{KL}\big(E(\cdot|x') \,\|\, (F_\pi \circ E)(\cdot|x)\big),$$

the resulting loss, $L_c(E, F, \pi, x)$, is similar to the consistency term in PCC. Similar to PCC, we also add a curvature loss to the loss function of CARL to encourage a smoother latent space dynamics $F_\pi$. Putting all these terms together, we obtain the following loss function for CARL:

$$\min_{E, F, D} \sum_{x \sim \mathcal{D}} \lambda_{ed} L_{ed}(E, D, x) + \lambda_p L_p(E, F, D, \pi, x) + \lambda_c L_c(E, F, \pi, x) + \lambda_{cur} L_{cur}(F, \pi, x) + \lambda_{reg} L_{reg}(E, \mu, \pi, x), \tag{6}$$

where $(\lambda_{ed}, \lambda_p, \lambda_c, \lambda_{cur}, \lambda_{reg})$ are hyper-parameters (footnote 5) of the algorithm, $(L_{ed}, L_p)$ are the encoding-decoding and prediction losses defined in (5), $L_c$ is the consistency loss defined above,

$$L_{cur} = \mathbb{E}_{x, u}\Big[\mathbb{E}\big[\big\| f_Z(z + \epsilon_z, u + \epsilon_u) - f_Z(z, u) - \big(\nabla_z f_Z(z, u) \cdot \epsilon_z + \nabla_u f_Z(z, u) \cdot \epsilon_u\big) \big\|_2^2 \;\big|\; E\big]\Big]$$

is the curvature loss that regulates the second derivative of $f_Z$, the mean of the latent dynamics $F$, in which $\epsilon_z, \epsilon_u$ are standard Gaussian noise, and $L_{reg}$ is the regularizer that ensures the new encoder remains close to the old one.
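The curvature loss can be read as the squared error of a first-order Taylor expansion of the latent dynamics mean under random perturbations. The sketch below estimates it by Monte Carlo for a single latent state-action pair. It is an illustration only: the Jacobian is approximated by central finite differences instead of automatic differentiation, `noise_std` is a hypothetical knob (1.0 corresponds to the standard Gaussian noise mentioned above), and the two toy dynamics functions are invented for the example.

```python
import numpy as np

def jacobian(f, v, h=1e-5):
    """Central finite-difference Jacobian of f at v, shape [out_dim, in_dim]."""
    v = np.asarray(v, dtype=float)
    cols = []
    for i in range(v.size):
        step = np.zeros_like(v)
        step[i] = h
        cols.append((f(v + step) - f(v - step)) / (2.0 * h))
    return np.stack(cols, axis=1)

def curvature_loss(f_Z, z, u, n_samples=256, noise_std=1.0, seed=0):
    """Monte Carlo estimate of E || f(z+e_z, u+e_u) - f(z,u) - J @ e ||^2."""
    rng = np.random.default_rng(seed)
    zu = np.concatenate([z, u])
    f = lambda v: f_Z(v[:z.size], v[z.size:])   # f on the stacked vector (z, u)
    base, J = f(zu), jacobian(f, zu)
    total = 0.0
    for _ in range(n_samples):
        eps = noise_std * rng.standard_normal(zu.size)
        residual = f(zu + eps) - base - J @ eps  # Taylor remainder
        total += residual @ residual
    return total / n_samples

# Hypothetical latent dynamics means: one linear, one with a nonlinear term.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.5]])
linear = lambda z, u: A @ z + B @ u
nonlinear = lambda z, u: A @ z + B @ u + np.sin(z)

z0, u0 = np.array([0.3, -0.2]), np.array([0.1])
loss_lin = curvature_loss(linear, z0, u0)
loss_non = curvature_loss(nonlinear, z0, u0)
print(loss_lin, loss_non)
```

For the linear model the Taylor remainder vanishes up to floating-point error, while the nonlinear model incurs a strictly positive penalty, which is the behavior the curvature term is meant to encourage.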

4. DIFFERENT IMPLEMENTATIONS OF CARL

The CARL loss function in (6) introduces an optimization problem that takes a policy $\pi$ in $\mathcal{Z}$ as input and learns a representation suitable for its evaluation and improvement. To optimize this loss in practice, similar to the PCC model (Levine et al., 2020), we define $\widehat{P} = D \circ F_\pi \circ E$ as a latent variable model that is factorized as $\widehat{P}(x_{t+1}, z_t, \hat{z}_{t+1} \mid x_t, \pi) = \widehat{P}(z_t \mid x_t)\, \widehat{P}(\hat{z}_{t+1} \mid z_t, \pi)\, \widehat{P}(x_{t+1} \mid \hat{z}_{t+1})$, and use a variational approximation to the intractable negative log-likelihood of the loss terms in (6). The variational bounds for these terms can be obtained similar to Eqs. 6 and 7 in Levine et al. (2020). Below we describe three instantiations of the CARL model in practice. Implementation details can be found in Algorithm 2 in Appendix D. Although CARL is compatible with most PI-style (actor-critic) RL algorithms, we choose soft actor-critic (SAC) (Haarnoja et al., 2018) as its control algorithm. Since most actor-critic algorithms are based on first-order gradient updates, as discussed in Section 3, we regularize the curvature of the latent dynamics $F$ (see Eqs. 8 and 9 in Levine et al., 2020) in CARL to improve its empirical stability and performance in policy learning.

1. Offline CARL. We first implement CARL in an offline setting, where we generate a (relatively) large batch of observation samples $\{(x_t, a_t, r_t, x_{t+1})\}_{t=1}^{N}$ using an exploratory (e.g., random) policy. We then use this batch to optimize the CARL loss function (6) via the variational approximation scheme described above, and learn a latent representation $\mathcal{Z}$ and $(E, F, D)$. Finally, we solve the decision problem in $\mathcal{Z}$ using a model-based RL algorithm, which in our case is model-based SAC (footnote 6). The learned policy $\pi^*$ in $\mathcal{Z}$ is then used to control the system from observations as $a_t \sim (\pi^* \circ E)(\cdot \mid x_t)$.
This is the setting that has been used in several recent LCE works, such as E2C (Watter et al., 2015), RCE (Banijamali et al., 2018), PCC (Levine et al., 2020), and PC3 (Shu et al., 2020). Our offline implementation differs from those in that 1) we replace their locally-linear control algorithm, namely iterative LQR (iLQR) (Li & Todorov, 2004), with model-based SAC, which results in significant performance improvement, as shown in Section 5, and 2) we optimize the CARL loss function, which, despite its close connection, is still different from the one used by PCC. The CARL loss function presented in Section 3 has been designed for an online setting in which, at each iteration, it takes a policy as input and learns a representation that is suitable for evaluating and improving this policy. However, in the offline setting, the learned representation should be good for any policy generated in the course of running the PI-style control algorithm. Therefore, we marginalize out the policy from the (online) CARL loss function and use the RHS of the following corollary (proof in Appendix B) to construct the CARL loss function used in our offline experiments.

Corollary 2. Let $\mu$ and $\mu_+$ be two consecutive policies in $\mathcal{X}$ generated by a PI-style control algorithm in the latent space constructed by $(E, F, D, \bar{r})$. Then, the following holds for the value functions of $\mu$ and $\mu_+$, where $\Delta$ is defined by (5) (modulo replacing the sampled action $a \sim \pi \circ E$ with the action $a$):

$$U_{\mu_+}(x) \geq U_\mu(x) - \frac{2}{1-\gamma} \cdot \max_{x \in \mathcal{X},\, a \in \mathcal{A}} \Delta(E, F, D, \bar{r}, a, x), \qquad \forall x \in \mathcal{X}.$$

2. Online CARL. In the online implementation of CARL, at each iteration $i$, the current policy $\pi^{(i)}$ is the improved policy of the last iteration, $\pi_+^{(i-1)}$.
We first generate a relatively (to offline CARL) small batch of samples using the image of the current policy in $\mathcal{X}$, i.e., $\mu^{(i)} = \pi^{(i)} \circ E^{(i-1)}$, and then learn a representation $(E^{(i)}, F^{(i)}, D^{(i)})$ suitable for evaluating and improving the image of $\mu^{(i)}$ in $\mathcal{Z}$ under the new encoder $E^{(i)}$. This means that with the new representation, the current policy, which was the image of $\mu^{(i)}$ in $\mathcal{Z}$ under $E^{(i-1)}$, should be replaced by its image $\pi^{(i)}$ under the new encoder, i.e., $\pi^{(i)} \circ E^{(i)} \approx \mu^{(i)}$. In online CARL, we address this by a policy distillation step in which we minimize the following loss (footnote 7):

$$\pi^{(i)} \in \arg\min_{\pi} \sum_{x \sim \mathcal{D}} D_{KL}\big((\pi \circ E^{(i)})(\cdot|x) \,\|\, (\pi_+^{(i-1)} \circ E^{(i-1)})(\cdot|x)\big).$$

After the current policy $\pi^{(i)}$ is set, we perform multiple steps of (model-based) SAC in $\mathcal{Z}$ using the current model, $(F^{(i)}, \bar{r}^{(i)})$, and then send the resulting policy $\pi_+^{(i)}$ to the next iteration.

3. Value-Guided CARL (V-CARL). While Theorem 1 shows that minimizing the loss in (6) guarantees performance improvement, this loss does not contain any information about the performance of the current policy $\mu$, and thus, the LCE model trained with this loss may have low accuracy in regions of the latent space that are crucial for learning good RL policies. In V-CARL, we tackle this issue by modifying the loss function so that the resulting LCE model is more accurate in regions with higher anticipated future returns. To derive the V-CARL loss function, we use the variational model-based policy optimization (VMBPO) framework by Chow et al. (2020), in which the optimal dynamics for model-based RL can be expressed in closed form as

$$P^*(x'|x, a) = P(x'|x, a) \cdot \exp\Big(\frac{\tau}{\gamma}\big(r(x, a) + \gamma \widetilde{U}_\mu(x') - \widetilde{W}_\mu(x, a)\big)\Big),$$

where $\widetilde{U}_\mu(x) := \frac{1}{\tau} \log \mathbb{E}\big[\exp\big(\tau \sum_{t=0}^{\infty} \gamma^t r_{\mu,t}\big) \mid P_\mu, x_0 = x\big]$ and $\widetilde{W}_\mu(x, a) := r(x, a) + \frac{\gamma}{\tau} \log \mathbb{E}_{x' \sim P(\cdot|x, a)}\big[\exp(\tau \widetilde{U}_\mu(x'))\big]$ are the optimistic value and action-value functions (footnote 8) of policy $\mu$, and $\tau > 0$ is a temperature parameter.
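The closed-form twisted dynamics can be checked numerically on a toy example with a finite set of next observations. The sketch below assumes the optimistic values of the next observations are already available as a table `U_next`; the function names and all numbers are hypothetical. Substituting the definition of the optimistic action-value into the exponent shows that the weight equals $\exp(\tau \widetilde{U}_\mu(x')) / \mathbb{E}_{x' \sim P}[\exp(\tau \widetilde{U}_\mu(x'))]$, so the re-weighted dynamics is again a probability distribution.

```python
import numpy as np

def optimistic_backup(U_next, p_next, tau):
    """Exponential utility (1/tau) * log E_{x'~p}[exp(tau * U(x'))]."""
    return np.log(np.sum(p_next * np.exp(tau * U_next))) / tau

def twisted_dynamics(p_next, U_next, r, gamma, tau):
    """P*(x'|x,a) = P(x'|x,a) * exp((tau/gamma) * w), with w the TD term."""
    W = r + gamma * optimistic_backup(U_next, p_next, tau)  # optimistic action value
    w = r + gamma * U_next - W                              # one value per next observation
    return p_next * np.exp(tau / gamma * w)

# Toy setup: one (x, a) pair with three possible next observations.
p_next = np.array([0.5, 0.3, 0.2])   # nominal dynamics P(.|x, a)
U_next = np.array([0.0, 1.0, 2.0])   # assumed optimistic values of next observations
r, gamma = 0.5, 0.9

p_star = twisted_dynamics(p_next, U_next, r, gamma, tau=1.0)
small_tau = optimistic_backup(U_next, p_next, tau=1e-3)
plain_mean = float(p_next @ U_next)
print(p_star, p_star.sum(), small_tau, plain_mean)
```

The twisted distribution shifts probability mass towards next observations with higher value, and for a small temperature the optimistic backup is close to the ordinary expectation, in line with the small-$\tau$ approximation used in V-CARL.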
Note that in the VMBPO framework, the optimal dynamics $P^*$ is value-aware, because it re-weighs $P$ with an exponential-twisting weight $\exp\big(\frac{\tau}{\gamma} w(x, a, x')\big)$, where $w(x, a, x') := r(x, a) + \gamma \widetilde{U}_\mu(x') - \widetilde{W}_\mu(x, a)$ is the temporal difference (TD) error. In V-CARL, we use the VMBPO framework to modify the CARL prediction loss $L_p(E, F, D, \pi, x)$. Since the regularizer loss $L_{reg}(E, \mu, \pi, x)$ in CARL forces policies $\pi \circ E$ and $\mu$ to be close to each other, we may replace the transition dynamics $P_{\pi \circ E}$ with $P_\mu$ in $L_p$. This makes minimizing $L_p$ equivalent to maximizing the log-likelihood $\int_{x'} dP_\mu(x'|x) \cdot \log (D \circ F_\pi \circ E)(x'|x)$. Finally, we replace $P_\mu$ with $P^*_\mu$ in this log-likelihood and obtain

$$\int_a d\mu(a|x) \int_{x'} dP(x'|x, a) \cdot \exp\Big(\frac{\tau}{\gamma} \cdot w(x, a, x')\Big) \cdot \log (D \circ F_\pi \circ E)(x'|x),$$

which is a weighted (by the exponential TD $w(x, a, x')$) log-likelihood function (w.r.t. $P$). Note that this weight depends on the optimistic value functions $\widetilde{U}_\mu$ and $\widetilde{W}_\mu$. When $\tau > 0$ is small (see Appendix C for more details), these value functions can be approximated by their standard counterparts, i.e., $\widetilde{U}_\mu(x) \approx U_\mu(x)$ and $\widetilde{W}_\mu(x, a) \approx W_\mu(x, a) := r(x, a) + \int_{x'} dP(x'|x, a)\, U_\mu(x')$, which can be further approximated by their latent-space counterparts, i.e., $U_\mu(x) \approx (V_\pi \circ E)(x)$ and $W_\mu(x, a) \approx (Q_\pi \circ E)(x, a)$, according to Lemma 5 in Appendix A.1. Since the latent reward function $\bar{r}$ is defined such that $r(x, a) \approx (\bar{r} \circ E)(x, a)$, we may write the TD-error $w(x, a, x')$ in terms of the encoder $E$ and the latent value functions as $w(x, a, x') := \int_{z, z'} dE(z|x) \cdot dE(z'|x') \cdot \big(\bar{r}(z, a) - Q_\pi(z, a) + \gamma V_\pi(z')\big)$.

Dreamer

As described in Section 2, most LCE algorithms, including E2C, PCC, and CARL variants, assume the observation space $\mathcal{X}$ is selected such that the system is Markovian there. In contrast, Dreamer does not make this assumption and has been designed for a more general class of control problems that can be modeled as POMDPs. Thus, it is expected to be less sample-efficient than CARL (i.e., to require more samples to achieve the same performance) when the system is Markov in the observation space. Moreover, CARL and other LCE methods define the reward as the negative distance to the goal in the latent space.
This cannot be done in Dreamer, where the encoder is an RNN that takes an entire observation trajectory as input. To address this, we propose two methods to train Dreamer's reward function in the latent space, which we refer to as Dreamer Pixel and Dreamer Oracle. While Dreamer Pixel uses the negative distance to the goal in the observation space $\mathcal{X}$ as the signal to train the reward function, Dreamer Oracle uses the negative distance in the (unobserved) underlying state space $\mathcal{S}$. Thus, it is fairer to compare the CARL algorithms with Dreamer Pixel than with Dreamer Oracle, which has the advantage of having access to the underlying state space (see Appendix F.6 for more details). As expected, our results show that although both Dreamer implementations learn reasonably-performing policies for most tasks (except Planar), they require 2 to 100 times more samples to achieve the same performance as the CARL algorithms. We report longer (more samples) experiments with Dreamer on all tasks in Appendix F.6 (Fig. 12).

Results with Environment-biased Sampling

In the previous experiments, all the online LCE algorithms are warm-started with data collected by a uniformly random policy over the entire environment. With sufficient data, the latent dynamics is accurate enough for control on most parts of the state space; therefore, we do not observe a significant difference between online CARL and V-CARL. To further illustrate the advantage of V-CARL over online CARL, we modify the experimental setting by gathering initial samples only from a specific region of the environment (see Appendix E.1 for more details). Fig. 3 shows the learning curves of online CARL and V-CARL in this case. As expected, with biased data, both algorithms experience a certain level of performance degradation, yet V-CARL clearly outperforms online CARL. This supports our conjecture that control-aware LCE models are more robust to the initial data distribution and superior in policy optimization.

6. CONCLUSIONS

In this paper, we argued for incorporating control in the representation learning process and for the interaction between control and representation learning in learning controllable embedding (LCE) algorithms. We proposed an LCE model called control-aware representation learning (CARL) that learns representations suitable for policy iteration (PI) style control algorithms. We proposed three implementations of CARL that combine representation learning with model-based soft actor-critic (SAC), as the controller, in offline and online fashions. In the third implementation, called value-guided CARL, we further included the control process in representation learning by optimizing a weighted version of the CARL loss function, in which the weights depend on the TD-error of the current policy. We evaluated the proposed algorithms on benchmark tasks and compared them with several LCE baselines. The experiments show the importance of SAC as the controller and of the online implementation. Future directions include 1) investigating other PI-style algorithms in place of SAC, 2) developing LCE models suitable for value iteration style algorithms, and 3) identifying other forms of bias for learning an effective embedding and latent dynamics.



1 A method to ensure observations are Markovian is to buffer them for several time steps (Mnih et al., 2013).
2 For example, in a goal tracking problem in which the agent (robot) aims at finding the shortest path to reach the observation goal $x_g$ (the observation corresponding to the goal state $s_g$), we may define the reward for each observation $x$ as the negative of its distance to $x_g$, i.e., $-\|x - x_g\|_2$.
3 Some recent LCE models, such as PC3 (Shu et al., 2020), are advocating latent models without a decoder. Although we are aware of the merits of such an approach, we use a decoder in the models proposed in this paper.
4 For example, in goal-based RL problems, a compatible reward function can be the one that measures the negative distance between a latent state and the image of the goal in the latent space.
5 Theorem 1 provides a high-level guideline for selecting the hyper-parameters of the loss function: $\lambda_{ed} = 2R_{\max}/(1-\gamma)^2$, $\lambda_c = \lambda_p = \sqrt{2}\,\gamma R_{\max}/(1-\gamma)^2$, and $\lambda_{reg} = \sqrt{2}\,\gamma R_{\max}/(1-\gamma)$.
6 By model-based SAC, we refer to learning a latent policy with SAC using synthetic trajectories generated by unrolling the learned latent dynamics model $F$, similar to the MBPO algorithm (Janner et al., 2019).
7 Our experiments reported in Appendix F.1 show that adding distillation improves the performance in online CARL. Thus, all our results for online CARL and V-CARL, unless mentioned, are with policy distillation.
8 We refer to $\widetilde{U}_\mu$ as the optimistic value function (Ruszczyński & Shapiro, 2006), because it models the right tail of the return via the exponential utility $\rho_\tau(U(\cdot) \mid x, a) = \frac{1}{\tau} \log \mathbb{E}_{x' \sim P(\cdot|x, a)}[\exp(\tau \cdot U(x'))]$.
9 We did not include E2C and RCE in our experiments, because Levine et al. (2020) have previously shown that PCC outperforms them.



Figure 1: (a) Paths from the current observation x to the next one, (left) in X and (right) through Z. (b) Paths from the current observation x to the next latent state, (left) through X followed by encoding and (right) starting with encoding and then through Z.

Figure 2: Training curves of offline CARL, online CARL, V-CARL, and two implementations of Dreamer. The shaded region represents mean ± standard error.

Figure 3: Training curves of Online CARL and V-CARL with environment-biased initial samples.

Pπ•E with Pµ in Lp. This makes minimizing Lp equivalent to maximizing the log-likelihood x dPµ(x |x) • log(D • Fπ • E)(x |x). Finally, we replace Pµ with P * µ in this log-likelihood and obtain a dµ(a|x) x dP (x |x, a) • exp( τ γ



5. EXPERIMENTAL RESULTS

In this section, we experiment with the following continuous control domains: (i) Planar System, (ii) Inverted Pendulum (Swingup), (iii) Cartpole, and (iv) Three-link Manipulator (3-Pole), and compare the performance of our CARL algorithms with three LCE baselines: PCC (Levine et al., 2020), SOLAR (Zhang et al., 2019), SLAC (Lee et al., 2020), and two implementations of Dreamer (Hafner et al., 2020a) (described below) (footnote 9). These tasks have underlying start and goal states that are not observable; instead, the algorithms only have access to the start and goal observations. We report the detailed setup of the experiments in Appendix E, in particular, the description of the domains in Appendix E.1 and the implementation of the algorithms in Appendix E.3.


To evaluate the performance of the algorithms, similar to Levine et al. (2020), we report the percentage of time spent in the goal. The initial policy that is used for data generation is uniformly random (see Appendix E.2 for more details). To measure performance reproducibility for each experiment, we (i) train 25 models, and (ii) perform 10 control tasks for each model. For SOLAR, due to its high computation cost, we only train and evaluate 10 different models. Besides the average results, we also report the results from the best LCE models, averaged over the 10 control tasks.
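The reporting protocol above can be sketched as follows. The array layout, the function names, and in particular the way the standard error is pooled over all model-task pairs are assumptions made for illustration; the paper does not spell out these details here.

```python
import numpy as np

def percent_time_in_goal(in_goal):
    """in_goal: boolean array [n_models, n_tasks, horizon], True when the
    system is inside the goal region at that step (hypothetical layout)."""
    return 100.0 * in_goal.mean(axis=-1)        # shape [n_models, n_tasks]

def summarize(scores):
    """Mean and standard error over all models and tasks, plus the score of
    the best model averaged over its control tasks."""
    flat = scores.ravel()
    mean = flat.mean()
    stderr = flat.std(ddof=1) / np.sqrt(flat.size)   # pooled over pairs (assumption)
    best = scores.mean(axis=1).max()                 # best model, averaged over tasks
    return mean, stderr, best

# Tiny deterministic example: 2 models, 2 tasks, 4 time steps each.
in_goal = np.zeros((2, 2, 4), dtype=bool)
in_goal[0] = True                                    # model 0 is always in the goal
scores = percent_time_in_goal(in_goal)
mean, stderr, best = summarize(scores)
print(mean, stderr, best)
```

In the setting described above, `in_goal` would have 25 models (10 for SOLAR) and 10 control tasks per model.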

General Results

Table 1 shows the means and standard errors of the percentage of time spent in the goal, averaged over all models and control tasks, and averaged over all control tasks for the best model. To compare data efficiency, we also report the number of samples required to train the latent space and controller in each algorithm. We also show the training curves (performance vs. number of samples) of the algorithms in Fig. 2. We report more experiments and ablation studies in Appendix F. Below, we summarize our main observations from the experiments. First, offline CARL, which uses model-based SAC as its control algorithm, achieves significantly better performance than PCC, which uses iLQR, in all tasks. This can be attributed to SAC being more robust and effective in non-(locally)-linear environments. We report a more detailed comparison between PCC and offline CARL in Appendix F.3, where we explicitly compare their control performance and latent representation maps. Second, in all tasks, online CARL is more data-efficient than its offline counterpart, i.e., it achieves similar or better performance with fewer samples. In particular, online CARL is notably superior in Planar, Cartpole, and Swingup, in which it achieves similar performance to offline CARL with 2, 2.5, and 4 times fewer samples, respectively (see Fig. 2). In Appendix F.3, we show how the latent representation of online CARL progressively improves through the iterations of the algorithm (in particular, see Fig. 11). Third, in the simpler tasks (Planar, Swingup, Cartpole), V-CARL performs even better than online CARL. This corroborates our hypothesis that CARL can achieve extra improvement when its LCE model is more accurate in the regions of the latent space with higher temporal difference (regions with higher anticipated future return). In 3-Pole, the performance of V-CARL is worse than that of online CARL.
This is likely due to the instability in representation learning resulting from sample variance amplification by the exponential-TD weight. Fourth, SOLAR requires significantly more samples to learn a reasonable latent space for control, and with limited data it fails to converge to a good policy. Even with the fine-tuned latent space from Zhang et al. (2019), its performance is not competitive with that of the CARL variants and Dreamer. We report more experiments with SOLAR in Appendix F.5, in which we show that SOLAR can perform better, especially in Planar, when we fix the start and goal locations. However, the improved performance is still not competitive with that of CARL and Dreamer. Fifth, we include an ablation study in Appendix F.2 to demonstrate how each term of the CARL loss function impacts policy learning. It shows the importance of the prediction and consistency terms, without which the resulting algorithms struggle, and the (relatively) minor role of the curvature and encoder-decoder terms in the performance of the algorithms.

