SIMPLIFYING MODEL-BASED RL: LEARNING REPRESENTATIONS, LATENT-SPACE MODELS, AND POLICIES WITH ONE OBJECTIVE

Abstract

While reinforcement learning (RL) methods that learn an internal model of the environment have the potential to be more sample efficient than their model-free counterparts, learning to model raw observations from high-dimensional sensors can be challenging. Prior work has addressed this challenge by learning low-dimensional representations of observations through auxiliary objectives, such as reconstruction or value prediction. However, the alignment between these auxiliary objectives and the RL objective is often unclear. In this work, we propose a single objective which jointly optimizes a latent-space model and policy to achieve high returns while remaining self-consistent. This objective is a lower bound on expected returns. Unlike prior bounds for model-based RL, which constrain policy exploration or provide model guarantees, our bound is directly on the overall RL objective. We demonstrate that the resulting algorithm matches or improves the sample efficiency of the best prior model-based and model-free RL methods. While sample-efficient methods are typically computationally demanding, our method attains the performance of SAC in about 50% less wall-clock time.

1. INTRODUCTION

While RL algorithms that learn an internal model of the world can learn more quickly than their model-free counterparts (Hafner et al., 2018; Janner et al., 2019), figuring out exactly what these models should predict has remained an open problem: the real world and even realistic simulators are too complex to model accurately. Although model errors may be rare under the training distribution, a learned RL agent will often seek out the states where an otherwise accurate model makes mistakes (Jafferjee et al., 2020). Simply training the model with maximum likelihood will not, in general, produce a model that is good for model-based RL (MBRL). The discrepancy between the policy objective and the model objective is called the objective mismatch problem (Lambert et al., 2020), and it remains an active area of research. The objective mismatch problem is especially important in settings with high-dimensional observations, which are challenging to predict with high fidelity. Prior model-based methods have coped with the difficulty of modeling high-dimensional observations by learning the dynamics of a compact representation of observations, rather than the dynamics of the raw observations. Depending on their learning objective, these representations might still be hard to predict or might not contain task-relevant information. Moreover, the accuracy of prediction depends not just on the model's parameters, but also on the states visited by the policy. Hence, another way of reducing prediction errors is to optimize the policy to avoid transitions where the model is inaccurate, while still achieving high returns. In the end, we want to train the model, representations, and policy to be self-consistent: the policy should only visit states where the model is accurate, and the representation should encode information that is both task-relevant and predictable. Can we design a model-based RL algorithm that automatically learns compact yet sufficient representations for model-based reasoning?
Prior work has also proposed training the model to produce high-return policies (Eysenbach et al., 2021a; Amos et al., 2018; Nikishin et al., 2021). However, effectively addressing the objective mismatch problem for latent-space models remains an open problem. Our method makes progress on this problem by proposing a single objective to be used for jointly optimizing the model, policy, and representation. Because all components are optimized with the same objective, updates to the representations make the policy better (on this objective), as do updates to the model. While prior theoretical work on latent-space models has proposed bounds on the exploratory behavior of the policy (Misra et al., 2020) and on learning compressed latent representations (Efroni et al., 2021), our analysis lifts some of their assumptions (e.g., removing the block-MDP assumption) and bounds the overall RL objective in a model-based setting.

3. A UNIFIED OBJECTIVE FOR LATENT-SPACE MODEL-BASED RL

We first introduce notation, provide a high-level outline of the objective, and then derive it. Sec. 4 will discuss a practical algorithm based on this objective.

3.1. PRELIMINARIES

The agent interacts with a Markov decision process (MDP) defined by states s_t, actions a_t, an initial state distribution p_0(s), a dynamics function p(s_{t+1} | s_t, a_t), a positive reward function r(s_t, a_t) >= 0, and a discount factor γ ∈ [0, 1). The RL objective is to learn a policy π(a_t | s_t) that maximizes the discounted sum of expected rewards within an infinite-horizon episode:

max_π E_{a_t ∼ π(· | s_t), s_{t+1} ∼ p(· | s_t, a_t)} [ (1 − γ) Σ_{t=0}^∞ γ^t r(s_t, a_t) ].    (1)

The factor of (1 − γ) does not change the optimal policy, but simplifies the analysis (Janner et al., 2020; Zahavy et al., 2021). We consider policies that are factored into two parts: an observation encoder e_φ(z_t | s_t) and a representation-conditioned policy π_φ(a_t | z_t). Our analysis considers infinite-length trajectories τ, which include the actions a_t, the observations s_t, and the corresponding observation representations z_t: τ ≜ (s_0, a_0, z_0, s_1, a_1, z_1, ...). To simplify notation, we write the discounted sum of rewards as R(τ) ≜ (1 − γ) Σ_{t=0}^∞ γ^t r(s_t, a_t). Lastly, we define the Q-function of a policy parameterized by φ as Q(s_t, a_t) = E_{τ ∼ π_φ, e_φ, p} [ R(τ) | s_t, a_t ].
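To make the (1 − γ) normalization in Eq. 1 concrete, here is a minimal sketch (ours, not the paper's code) showing that the normalized return of a constant reward stream is a weighted average that approaches that reward:

```python
def discounted_return(rewards, gamma):
    """(1 - gamma)-normalized discounted return, as in Eq. 1."""
    return (1 - gamma) * sum(gamma ** t * r for t, r in enumerate(rewards))

# For a constant reward r over T steps, this equals r * (1 - gamma^T),
# so as T grows the normalized return approaches r itself.
```

For example, `discounted_return([2.0] * 2000, 0.9)` is essentially 2.0, since the weights (1 − γ)γ^t sum to 1 over an infinite horizon.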

3.2. METHOD OVERVIEW

Figure 2: Aligned Latent Models (ALM) performs model-based RL by jointly optimizing the policy, the latent-space model, and the representations produced by the encoder using the same objective: maximize predicted rewards while minimizing the errors in the predicted representations. This objective corresponds to RL with an augmented reward function r̃. ALM estimates this objective without predicting high-dimensional observations s_{t+1}.

Our method consists of three components, shown in Fig. 2. The first component is an encoder e_φ(z_t | s_t), which takes as input a high-dimensional observation s_t and produces a compact representation z_t. This representation should be as compact as possible, while retaining the bits needed for selecting good actions and for predicting the Q-function. The second component is a dynamics model over representations, m_φ(z_{t+1} | z_t, a_t), which takes as input the representation of the current observation and the action, and predicts the representation of the next observation. The third component is a policy π_φ(a_t | z_t), which takes representations as inputs and chooses an action. This policy is optimized to select actions that maximize rewards, while also keeping the agent in states where the dynamics model is accurate.
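The interplay of the three components can be sketched with toy stand-ins (the paper's versions are stochastic neural networks; every function body below is a hypothetical placeholder, chosen only to make the data flow concrete):

```python
class Encoder:
    """e_phi(z_t | s_t): maps a high-dim observation to a compact representation."""
    def __call__(self, s):
        # hypothetical compression: summarize the observation by two statistics
        return (sum(s) / len(s), max(s))

class LatentModel:
    """m_phi(z_{t+1} | z_t, a_t): predicts the next representation."""
    def __call__(self, z, a):
        mean, mx = z
        return (mean + 0.1 * a, mx)  # toy linear dynamics in latent space

class Policy:
    """pi_phi(a_t | z_t): chooses an action from a representation."""
    def __call__(self, z):
        return -z[0]  # toy feedback controller

enc, model, policy = Encoder(), LatentModel(), Policy()
z = enc([1.0, 2.0, 3.0])   # encode the current observation
a = policy(z)              # act from the representation, never the raw state
z_next = model(z, a)       # imagine the next representation
```

Note that after the initial encoding, planning proceeds entirely in latent space: the policy and the model only ever consume representations.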

3.3. DERIVING THE OBJECTIVE

To derive our objective, we build on prior work (Toussaint, 2009; Kappen et al., 2012) and view the RL objective as a latent-variable problem, where the return R(τ) is the likelihood and the trajectory τ is the latent variable. Unlike prior work, we include representations in this trajectory, in addition to the raw states, a difference which allows our method to learn good representations for MBRL. We can write the RL objective (Eq. 1) in terms of trajectories as E_{p_φ(τ)}[R(τ)] by defining the distribution over trajectories as

p_φ(τ) ≜ p_0(s_0) ∏_{t=0}^∞ e_φ(z_t | s_t) π_φ(a_t | z_t) p(s_{t+1} | s_t, a_t).    (2)

Estimating and optimizing this objective directly is challenging because drawing samples from p_φ(τ) requires interacting with the environment, an expensive operation. What we would like to do instead is estimate this same expectation via trajectories sampled from a different distribution, q(τ). We can estimate a lower bound on the expected-return objective using samples from this other distribution via the standard evidence lower bound (Jordan et al., 1999):

log E_{p(τ)}[R(τ)] ≥ E_{q(τ)}[ log R(τ) + log p(τ) − log q(τ) ].    (3)

This lower bound resolves a first problem, allowing us to estimate (a bound on) the expected return by drawing samples from the learned model, rather than from the true environment. However, learning a distribution over trajectories is difficult, due both to the challenge of modeling high-dimensional observations and to potential compounding errors during sampling that can cause the policy to incorrectly predict high returns. We resolve these issues by sampling only compact representations of observations, instead of the observations themselves. By learning to predict the rewards and Q-values as a function of these representations, we are able to estimate this lower bound without sampling high-dimensional observations.
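The evidence lower bound in Eq. 3 can be checked numerically on a toy discrete "trajectory space" (three trajectories with made-up probabilities and positive returns; the numbers are illustrative, not from the paper):

```python
import math

# log E_p[R] >= E_q[log R + log p - log q], for any q covering p's support.
p = [0.5, 0.3, 0.2]   # true trajectory distribution p(tau)
q = [0.4, 0.4, 0.2]   # learned proposal q(tau)
R = [1.0, 2.0, 0.5]   # positive returns R(tau)

lhs = math.log(sum(pi * Ri for pi, Ri in zip(p, R)))          # log E_p[R]
rhs = sum(qi * (math.log(Ri) + math.log(pi) - math.log(qi))   # ELBO under q
          for qi, pi, Ri in zip(q, p, R))
assert rhs <= lhs  # the bound holds for any choice of q
```

The bound becomes tight exactly when q(τ) ∝ p(τ)R(τ), which foreshadows the "optimistic" optimal dynamics derived in Appendix A.4.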
Further, we carefully parameterize the learned distribution q(τ) to support an arbitrary length of model rollouts (K), which allows us to estimate the lower bound accurately:

q^K_φ(τ) = p_0(s_0) e_φ(z_0 | s_0) π_φ(a_0 | z_0) ∏_{t=1}^K p(s_t | s_{t−1}, a_{t−1}) m_φ(z_t | z_{t−1}, a_{t−1}) π_φ(a_t | z_t).

While it may seem strange that the future representations sampled from q^K_φ(τ) are independent of states, this is an important design choice. It allows us to estimate the lower bound for any policy, using only samples from the latent-space model, without access to high-dimensional states from the environment's dynamics function. Combining the lower bound (Eq. 3) with this choice of parameterization, we obtain the following objective for model-based RL:

L^K_φ ≜ E_{q^K_φ(τ)} [ Σ_{t=0}^{K−1} γ^t r̃(s_t, a_t, s_{t+1}) + γ^K log Q(s_K, a_K) ],
where r̃(s_t, a_t, s_{t+1}) = (1 − γ) log r(s_t, a_t)   (a)
                             + log e_φ(z_{t+1} | s_{t+1}) − log m_φ(z_{t+1} | z_t, a_t).   (b)

This objective is an evidence lower bound on the RL objective (see the proof in Appendix A.2).

Theorem 3.1. For any representation e_φ(z_t | s_t), latent-space model m_φ(z_{t+1} | z_t, a_t), policy π_φ(a_t | z_t), and K ∈ N, the ALM objective L^K_φ corresponds to a lower bound on the expected-return objective:

(1 / (1 − γ)) exp(L^K_φ) ≤ E_{p_φ(τ)} [ Σ_t γ^t r(s_t, a_t) ].

Here, we provide intuition for our objective and relate it to objectives in prior work. We start by looking at the augmented reward. The first term (a: extrinsic term) in this augmented reward function is the log of the true rewards, which is analogous to maximizing the true reward function in the real environment, albeit on a different scale. The second term (b: intrinsic term), i.e., the negative KL divergence between the latent-space model and the encoder, is reminiscent of prior methods (Goyal et al., 2019; Eysenbach et al., 2021b; Bharadhwaj et al., 2021; Rakelly et al., 2021) that regularize the encoder against a prior, to limit the number of bits used from the observations.
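As a concrete illustration of the augmented reward r̃, suppose (purely as an assumption for this sketch) that the encoder and the latent model at one step are 1-d Gaussians, e_φ(z' | s') = N(0, 1) and m_φ(z' | z, a) = N(0.2, 1). Then terms (a) and (b) can be evaluated directly:

```python
import math

def log_normal(x, mu, sigma):
    """Log-density of N(mu, sigma^2) at x."""
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))

def augmented_reward(r, z_next, gamma=0.99):
    """r~ from the text, for the hypothetical Gaussians above."""
    extrinsic = (1 - gamma) * math.log(r)            # term (a)
    intrinsic = (log_normal(z_next, 0.0, 1.0)        # term (b):
                 - log_normal(z_next, 0.2, 1.0))     #   log e - log m
    return extrinsic + intrinsic
```

The intrinsic term rewards the policy for visiting transitions where the model agrees with the encoder: it is positive when the sampled z' is more likely under the encoder than under the model, and negative otherwise.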
Taken together, all the components are trained to make the model self-consistent with the policy and representation. The ALM objective can be optimized with any RL algorithm; we present an implementation based on DDPG (Lillicrap et al., 2015).

Algorithm 1 ALM
3: Select action a_n ∼ π_φ(· | e_φ(s_n)) using the current policy.
4: Execute action a_n and observe reward r_n and next state s_{n+1}.
5: Store transition (s_n, a_n, r_n, s_{n+1}) in B.
6: Sample length-K sequences {(s_i, a_i, s_{i+1})}_{i=t}^{t+K−1} ∼ B.
7: Compute the objective L^K_{e_φ,m_φ}({(s_i, a_i, s_{i+1})}_{i=t}^{t+K−1}) using the sampled sequence (Eq. 7).
8: Update the encoder and model by gradient ascent on L^K_{e_φ,m_φ}.
9: Compute the objective L^K_{π_φ}(s_t) using on-policy model-based trajectories (Eq. 8).
10: Update the policy by gradient ascent on L^K_{π_φ}.
11: Update the classifier, Q-function, and reward model using the losses L_{C_θ}, L_{Q_θ}, L_{r_θ} (Eqs. 9, 10, 11).

The last part of our objective is the length of rollouts (K) for which the model is used. Our objective is directly applicable to all prior model-based RL algorithms that use the model for only a fixed number of rollout steps, rather than the entire horizon, such as SVG-style updates (Amos et al., 2020; Heess et al., 2015) and trajectory optimization (Tedrake, 2022). Larger values of K correspond to looser bounds (see Appendix A.6). Although this suggests that a model-free estimate is the tightest, a Q-function learned using function approximation with TD estimates (Thrun & Schwartz, 1993) is biased and difficult to learn. A larger value of K decreases this bias by reducing the dependency on a learned Q-function. While the lower bound in Theorem 3.1 is not tight, we can include a learnable discount factor such that it becomes tight (see Appendix A.5). In our experiments, we find that the objective in Theorem 3.1 is still a good estimate of the true expected returns. In Appendix A.4, we derive a closed form for the optimal latent dynamics and show that they are biased towards high-return trajectories: they reweight the true probabilities of trajectories by their rewards. We also derive a lower bound for the model-based offline RL setting, obtaining a similar objective with an additional behavior cloning term (Appendix A.7).

4. A PRACTICAL ALGORITHM

We now describe a practical method to jointly train the policy, model, and encoder using the lower bound L^K_φ. We call the resulting algorithm Aligned Latent Models (ALM), because joint optimization means that the objectives for the model, policy, and encoder are the same; they are aligned. For training the encoder and model (latent-space learning phase), q^K_φ(τ) is unrolled using actions from a replay buffer, whereas for training the policy (planning phase), q^K_φ(τ) is unrolled using actions imagined from the latest policy. To estimate our objective using just representations, our method also learns to predict the reward and Q-function from the learned representations z_t using real data only (see Appendix C for details). Algorithm 1 provides pseudocode.

Maximizing the objective with respect to the encoder and latent-space model. To train the encoder and the latent-space model, we estimate the objective L^K_φ using K-length sequences of transitions {(s_i, a_i, s_{i+1})}_{i=t}^{t+K−1} sampled from the replay buffer:

L^K_{e_φ,m_φ}({(s_i, a_i, s_{i+1})}_{i=t}^{t+K−1}) = E_{z_t ∼ e_φ(· | s_t), z_{i+1} ∼ m_φ(· | z_i, a_i)} [ Σ_{i=t}^{t+K−1} γ^{i−t} ( r_θ(z_i, a_i) − KL( m_φ(z_{i+1} | z_i, a_i) ∥ e_{φ_targ}(z_{i+1} | s_{i+1}) ) ) + γ^K Q_θ(z_{t+K}, π(z_{t+K})) ].    (7)

To optimize this objective, we sample an initial representation from the encoder and roll out the latent-space model using action sequences taken in the real environment (see Fig. 2). We find that using a target encoder e_{φ_targ}(z_t | s_t) to calculate the KL consistency term leads to stable learning.

Maximizing the objective with respect to the policy. The latent-space model allows us to evaluate the objective for the current policy by generating on-policy trajectories. Starting from a sampled state s_t, the latent-space model is recurrently unrolled using actions from the current policy to generate a K-length trajectory of representations and actions (z_{t:t+K}, a_{t:t+K}).
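The KL consistency term in Eq. 7 has a closed form if one assumes, as is common for latent-variable models, that both distributions are diagonal Gaussians. A per-dimension sketch (our helper, not the paper's code):

```python
import math

def kl_gauss(mu_q, sigma_q, mu_p, sigma_p):
    """KL( N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2) ) for 1-d Gaussians.

    With q = m_phi(z' | z, a) and p = e_targ(z' | s'), this is one
    per-dimension term of the consistency penalty in Eq. 7.
    """
    return (math.log(sigma_p / sigma_q)
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2 * sigma_p ** 2)
            - 0.5)
```

The KL is zero exactly when the model's prediction matches the target encoder's distribution, so minimizing it pushes the latent dynamics toward self-consistency.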
Calculating the intrinsic term of the augmented reward, log e_φ(z_{t+1} | s_{t+1}) − log m_φ(z_{t+1} | z_t, a_t), for on-policy actions is challenging, as we do not have access to the next high-dimensional state s_{t+1}. Following prior work (Eysenbach et al., 2020; 2021a), we note that a classifier trained to differentiate representations sampled from the encoder e_φ(z_{t+1} | s_{t+1}) versus the latent-space model m_φ(z_{t+1} | z_t, a_t) can be used to estimate the difference between the log-likelihoods under them, which is exactly the intrinsic term (see Appendix C for details):

log e_φ(z_{t+1} | s_{t+1}) − log m_φ(z_{t+1} | z_t, a_t) ≈ log [ C_θ(z_{t+1}, a_t, z_t) / (1 − C_θ(z_{t+1}, a_t, z_t)) ].

Here, C_θ(z_{t+1}, a_t, z_t) ∈ [0, 1] is the learned classifier's prediction of the probability that z_{t+1} was sampled from the encoder conditioned on the next state, after starting at the (z_t, a_t) pair. We train the latent policy by recurrently backpropagating stochastic gradients (Heess et al., 2015; Amos et al., 2020; Hafner et al., 2019) of our objective evaluated on this trajectory:

L^K_{π_φ}(s_t) = E_{q^K_φ(z_{t:t+K}, a_{t:t+K} | s_t)} [ Σ_{i=t}^{t+K−1} γ^{i−t} ( r_θ(z_i, a_i) + c · log [ C_θ(z_{i+1}, a_i, z_i) / (1 − C_θ(z_{i+1}, a_i, z_i)) ] ) + γ^K Q_θ(z_{t+K}, π(z_{t+K})) ].    (8)

The classifier is trained via the standard cross-entropy loss:

L_{C_θ} = log C_θ(z_{t+1}, z_t, a_t) + log(1 − C_θ(ẑ_{t+1}, z_t, a_t)),    (9)

where z_{t+1} ∼ e_φ(· | s_{t+1}) is the real next representation and ẑ_{t+1} ∼ m_φ(· | z_t, a_t) is the imagined next representation, both from the same starting pair (z_t, a_t).

Differences between theory and experiments. While our theory suggests a coefficient c = 1, we use c = 0.1 in our experiments because it slightly improves the results. We provide an ablation showing that ALM performs well across different values of c (Figure 16). We also omit the log of the true rewards in both Eq. 7 and Eq. 8; we show that this change is equivalent to the first one for all practical purposes (see Appendix A.8). Nevertheless, these changes mean that the objective we use in practice is not guaranteed to be a lower bound.
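The classifier trick above rests on a standard density-ratio identity: for the Bayes-optimal classifier C*(z') = e(z') / (e(z') + m(z')), the logit log(C*/(1 − C*)) equals log e(z') − log m(z') exactly. A numeric check with two hypothetical 1-d Gaussian marginals (illustrative densities, not the paper's learned networks):

```python
import math

def density_e(z):   # hypothetical encoder marginal: N(0, 1)
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)

def density_m(z):   # hypothetical model marginal: N(0.5, 1)
    return math.exp(-0.5 * (z - 0.5) ** 2) / math.sqrt(2 * math.pi)

def bayes_classifier(z):
    """Bayes-optimal probability that z came from the encoder."""
    return density_e(z) / (density_e(z) + density_m(z))

z = 0.3
C = bayes_classifier(z)
logit = math.log(C / (1 - C))
log_ratio = math.log(density_e(z)) - math.log(density_m(z))
assert abs(logit - log_ratio) < 1e-12  # logit recovers the intrinsic reward
```

In practice the learned C_θ only approximates this optimum, which is one reason the practical objective is no longer a guaranteed lower bound.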

5. EXPERIMENTS

Our experiments focus on whether jointly optimizing the model, representation, and policy yields benefits relative to prior methods that use different objectives for different components. We use SAC-SVG (Amos et al., 2020) as the main baseline, as it structurally resembles our method but uses different objectives and architectures. While design choices like ensembling (MBPO, REDQ) are orthogonal to our paper's contribution, we nonetheless show that ALM achieves sample efficiency similar to MBPO and REDQ without requiring ensembling; as a consequence, it achieves good performance in ∼6× less wall-clock time. Additional experiments analyze the Q-values, ablate components of our method, and visualize the learned representations and model. All plots and tables show the mean and standard deviation across five random seeds. Where possible, we used hyperparameters from prior work; for example, our network dimensions were taken from Amos et al. (2020). Additional implementation details and hyperparameters are in Appendix D, and a summary of successful and failed experiments is in Appendix G. We have released the code. Baselines. We provide a quick conceptual comparison to the baselines in Table 2. In Sec. 5.1, we compare with the most similar prior methods, SAC-SVG (Amos et al., 2020) and SVG (Heess et al., 2015). Like ALM, these methods use a learned model to perform SVG-style actor updates. SAC-SVG also maintains a hidden representation, using a GRU, to make recurrent predictions in observation space. While SAC-SVG learns the model and representations using a reconstruction objective, ALM trains these components with the same objective as the policy. The dynamics model of SAC-SVG bears a resemblance to Dreamer-v2 (Hafner et al., 2020) and other prior work that targets image-based tasks (our experiments target state-based tasks).
SAC-SVG reports better results than many prior model-based methods (POPLIN-P (Wang & Ba, 2019), SLBO (Luo et al., 2018), ME-TRPO (Kurutach et al., 2018)). In Sec. 5.2, we focus on prior methods that use ensembles. MBPO (Janner et al., 2019) is a model-based method that uses an ensemble of dynamics models for both actor updates and critic updates. REDQ (Chen et al., 2021) is a model-free method that achieves sample efficiency on par with model-based methods through the use of ensembles of Q-functions. We also compare to TD3 (Fujimoto et al., 2018) and SAC (Haarnoja et al., 2018); while typically not sample efficient, these methods can achieve good performance asymptotically.

5.1. IS THE ALM OBJECTIVE USEFUL?

We start by comparing ALM with the baselines on the locomotion benchmark proposed by Wang et al. (2019). In this benchmark, methods are evaluated based on the policy return after training for 2e5 environment steps. The TruncatedAnt-v2 and TruncatedHumanoid-v2 tasks included in this benchmark are easier than the standard Ant-v2 and Humanoid-v2 tasks, which prior model-based methods struggle to solve (Janner et al., 2019; Chua et al., 2018; Amos et al., 2020; Shen et al., 2020; Rajeswaran et al., 2020; Feinberg et al., 2018; Buckman et al., 2018). The results, shown in Table 1, show that ALM achieves better performance than the prior methods on 4/5 tasks. The results for SAC-SVG are taken from Amos et al. (2020) and the rest are from Wang et al. (2019). Because SAC-SVG is structurally similar to ALM, the better results from ALM highlight the importance of training the representations and the model using the same objective as the policy.

5.2. CAN ALM ACHIEVE GOOD PERFORMANCE WITHOUT ENSEMBLES?

Prior methods such as MBPO and REDQ use ensembles to achieve state-of-the-art sample efficiency at the cost of long training times. We hypothesize that the self-consistency property of ALM makes the latent dynamics simpler, allowing it to achieve good sample efficiency without the use of ensembles. Our next experiment studies whether ALM can achieve the benefits of ensembles without the computational costs. As shown in Figure 3, ALM matches the sample efficiency of REDQ and MBPO, but requires ∼6× less wall-clock time to train. Note that MBPO fails to solve the highest-dimensional tasks, Humanoid-v2 and Ant-v2 (with 376- and 111-dimensional observation spaces, respectively). We optimized the official REDQ code to parallelize ensemble training, increasing its training speed by 3×; even so, it remains 2× slower than our method (which does not use such parallelization).

5.3. WHY DOES ALM WORK?

To better understand why ALM achieves high sample efficiency without ensembles, we analyzed the Q-values, ran ablation experiments, and visualized the learned representations. Analyzing the Q-values. One way of interpreting ALM is that it uses a model and an augmented reward function to obtain better estimates of the Q-values, which are used to train the policy. In contrast, REDQ uses the minimum over a random subset of its Q-function ensemble, and SAC-AVG (a baseline from the REDQ paper (Chen et al., 2021)) uses the average of its Q-function ensemble to obtain a low-variance estimate of the Q-values. While ensembles can be an effective way to improve the estimates of neural networks (Garipov et al., 2018; Abdar et al., 2021), we hypothesize that our latent-space model may be a more effective approach in the RL setting because it incorporates the dynamics, while also coming at a much lower computational cost. To test this hypothesis, we measure the bias of the Q-values, as well as the standard deviation of that bias, following the protocol of Chen et al. (2021) and Fujimoto et al. (2018); see Appendix E and Figure 4 for details. The positive bias tells us whether the Q-values overestimate the true returns, while the standard deviation of the bias is more relevant for the purpose of selecting actions. We see from Fig. 4 that the standard deviation of the bias is lower for ALM than for REDQ and SAC-AVG, suggesting that the actions maximizing our objective are similar to the actions that maximize true returns. Ablation experiments. In our first ablation experiment, we compare ALM to ablations that separately remove the KL term and the value term from the encoder objective (Eq. 7), and remove the classifier term from the policy objective (Eq. 8). As shown in Fig. 5a, the KL term, which is a purely self-supervised objective (Grill et al., 2020), is crucial for achieving good performance.
The classifier term stabilizes learning (especially on Ant and Walker), while the value term has little effect. We hypothesize that the value term may not be necessary because its effect, driving exploration, may already be provided by Q-value overestimation (a common problem for RL algorithms (Sutton & Barto, 2018; Fujimoto et al., 2018)). A second ablation experiment (Fig. 5b) shows the performance of ALM for different numbers of unrolling steps (K). A third ablation studies ALM(3) with the TD3 actor loss for training the policy; this ablation investigates whether the representations learned by ALM(3) are beneficial for model-free RL. Prior work (Gupta et al., 2017; Eysenbach et al., 2021b; Zhang et al., 2020) has shown that representation learning can facilitate properties like exploration, generalization, and transfer. In Fig. 5c, we find that the end-to-end representation learning of ALM(3) achieves high returns faster than standard model-free RL.

6. CONCLUSION

This paper introduced ALM, an objective for model-based RL that jointly optimizes representations, latent-space models, and policies, all using the same objective. This objective mends the objective mismatch problem and results in a method where the representations, model, and policy all cooperate to maximize the expected returns. Our experiments demonstrate the benefits of such joint optimization: it achieves better performance than baselines that use separate objectives, and it attains the benefits of ensembles without their computational costs. At a high level, our end-to-end method is reminiscent of the success of deep supervised learning. Deep learning methods promise to learn representations in an end-to-end fashion, allowing researchers to avoid manual feature design. Similarly, our algorithm suggests that algorithmic components themselves, like representations and models, can be learned in an end-to-end fashion to optimize the desired objective. Limitations and future work. The main limitation of our practical method is complexity: while simpler than prior model-based methods, it has more moving parts than model-free algorithms. One direction for future work is an on-policy version of ALM, in which Eq. 8 could be computed and optimized simultaneously with the encoder, model, and policy, without using a classifier. Another limitation is that, in our experiments, the value term in Eq. 7 does not yield any benefit. An interesting direction is to investigate the reason behind this, or to find tasks where an optimistic latent space is beneficial. While our theoretical analysis takes an important step towards constructing lower bounds for model-based RL, it leaves many questions open, such as accounting for function approximation and exploration.
Nonetheless, we believe that our proposed objective and method are not only practically useful, but may provide a template for designing even better model-based methods with learned representations.

Outline of Appendices.

In Appendix A, we include all the proofs. In Appendix B, we compare the components used by ALM and the baselines. Appendix C includes additional learning details, and Appendix D contains implementation details. Appendix E provides details about additional experiments, and Appendix G summarizes experiments that we tried but that did not help. Lastly, Appendix F compares our objective to the MnM objective from prior work.

A PROOFS

A.1 HELPER LEMMAS

Lemma A.1. Let P_K(H) be a truncated geometric distribution:

P_K(H) = (1 − γ) γ^H   if H ∈ [0, K − 1],
         γ^K            if H = K,
         0               if H > K.

Given a discount factor γ ∈ (0, 1) and a random variable x_t, we have the following identity:

E_{P_K(H)} [ Σ_{t=0}^H x_t ]
  = Σ_{H=0}^K P_K(H) Σ_{t=0}^H x_t
  = (1 − γ) Σ_{H=0}^{K−1} γ^H Σ_{t=0}^H x_t + γ^K Σ_{t=0}^K x_t
  = x_0 ( (1 − γ)(1 + γ + ... + γ^{K−1}) + γ^K ) + x_1 ( (1 − γ)(γ + γ^2 + ... + γ^{K−1}) + γ^K ) + ... + x_K γ^K
  = x_0 ( (1 − γ) (1 − γ^K)/(1 − γ) + γ^K ) + x_1 ( (1 − γ) γ(1 − γ^{K−1})/(1 − γ) + γ^K ) + ... + x_K γ^K
  = Σ_{t=0}^K γ^t x_t.

Lemma A.2. Let P_K(H) be a truncated geometric distribution and p_φ(τ | H) be the distribution over (H + 1)-length trajectories:

p_φ(τ | H) = p_0(s_0) e_φ(z_0 | s_0) π_φ(a_0 | z_0) ∏_{t=1}^H p(s_t | s_{t−1}, a_{t−1}) e_φ(z_t | s_t) π_φ(a_t | z_t).

Then the RL objective can be rewritten in the following way:

E_{p_φ(τ)} [ (1 − γ) Σ_t γ^t r(s_t, a_t) ]
  = E_{p^K_φ(τ)} [ (1 − γ) Σ_{t=0}^{K−1} γ^t r(s_t, a_t) + γ^K Q(s_K, a_K) ]
  = E_{P_K(H)} E_{p_φ(τ | H)} [ 1{H ≤ K − 1} r(s_H, a_H) + 1{H = K} Q(s_H, a_H) ],

where the Q-function is defined as Q(s_H, a_H) = E_{p_φ(τ)} [ (1 − γ) Σ_{t=0}^∞ γ^t r(s_{t+H}, a_{t+H}) ]. This lemma lets us interpret the discounting in RL as sampling from a truncated geometric distribution over future time steps.

Lemma A.3. Let p(x) be a distribution over R^n. The following optimization problem can be solved analytically using the method of Lagrange multipliers:

max_{p(x)} E_{p(x)} [ f(x) − log p(x) ]   such that   ∫ p(x) dx = 1.

We start by writing the Lagrangian for this problem, differentiating it, and setting the derivative to zero:

L(p(x), λ) (a)= E_{p(x)} [ f(x) − log p(x) ] − λ ( ∫ p(x) dx − 1 )
∇_{p(x)} L = f(x) − log p(x) − 1 − λ
0 = f(x) − log p(x) − 1 − λ
p*(x) (b)= e^{f(x) − 1 − λ}.

We find the value of λ by substituting (b) into the equality constraint:

∫ e^{f(x) − 1 − λ} dx = 1
e^{1 + λ} (c)= ∫ e^{f(x)} dx.

Substituting (c) into (b) to remove the dual variable, we obtain

p*(x) (d)= e^{f(x)} / ∫ e^{f(x)} dx.
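The identity in Lemma A.1 can be verified numerically for arbitrary values of x_t (the numbers below are illustrative):

```python
# Check: E_{P_K(H)}[ sum_{t=0}^H x_t ] = sum_{t=0}^K gamma^t x_t,
# where P_K is the truncated geometric distribution of Lemma A.1.
gamma, K = 0.9, 5
x = [1.0, -2.0, 0.5, 3.0, 0.0, 1.5]   # arbitrary x_0, ..., x_K

P = [(1 - gamma) * gamma ** H for H in range(K)] + [gamma ** K]
assert abs(sum(P) - 1.0) < 1e-12      # P_K is a valid distribution

lhs = sum(P[H] * sum(x[: H + 1]) for H in range(K + 1))
rhs = sum(gamma ** t * xt for t, xt in enumerate(x))
assert abs(lhs - rhs) < 1e-12
```

Intuitively, the coefficient of x_t on the left is P(H >= t) = γ^t, which is exactly the discount weight on the right.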
A.2 A LOWER BOUND FOR K-STEP LATENT-SPACE ROLLOUTS

In this section we present the proof of Theorem 3.1. We restate the theorem for clarity.

Theorem 3.1. For any representation e_φ(z_t | s_t), latent-space model m_φ(z_{t+1} | z_t, a_t), policy π_φ(a_t | z_t), and K ∈ N, the ALM objective L^K_φ corresponds to a lower bound on the expected-return objective:

(1 / (1 − γ)) exp(L^K_φ) ≤ E_{p_φ(τ)} [ Σ_t γ^t r(s_t, a_t) ].

Note that scaling the rewards by a constant factor (1 − γ) does not change the RL problem, and finding a policy that maximizes the log of the expected returns is the same as finding a policy that maximizes the expected returns, because log is monotonic. We want to estimate the RL objective with trajectories sampled from a different distribution q_φ(τ), which leads to an algorithm that avoids sampling high-dimensional observations:

q_φ(τ) = p_0(s_0) e_φ(z_0 | s_0) ∏_{t=0}^∞ p(s_{t+1} | s_t, a_t) m_φ(z_{t+1} | z_t, a_t) π_φ(a_t | z_t).

When used as trajectory generative models, p_φ(τ) predicts representations z_t using the current state s_t, whereas q_φ(τ) predicts z_t by unrolling the learned latent model recurrently on z_{t−1} (and a_{t−1}). Similar to variational inference, p_φ(τ) can be interpreted as the posterior and q_φ(τ) as the recurrent prior. Since longer recurrent predictions of a learned model diverge significantly from the true ones, we parameterize q to support an arbitrary length of model rollouts (K) during planning:

q^K_φ(τ) = p_0(s_0) e_φ(z_0 | s_0) π_φ(a_0 | z_0) ∏_{t=1}^K p(s_t | s_{t−1}, a_{t−1}) m_φ(z_t | z_{t−1}, a_{t−1}) π_φ(a_t | z_t).

Proof. We derive a lower bound on the RL objective for K-step latent rollouts.
log E_{p_φ(τ)} [ (1 − γ) Σ_t γ^t r(s_t, a_t) ]
(a)= log E_{p^K_φ(τ)} [ (1 − γ) Σ_{t=0}^{K−1} γ^t r(s_t, a_t) + γ^K Q(s_K, a_K) ]
(b)= log E_{P_K(H)} E_{p_φ(τ | H)} [ Ψ ],   where Ψ ≜ 1{H ≤ K − 1} r(s_H, a_H) + 1{H = K} Q(s_H, a_H)
(c)= log ∫∫ P_K(H) p_φ(τ | H) Ψ dτ dH
(d)= log ∫∫ P_K(H) q_φ(τ | H) ( p_φ(τ | H) / q_φ(τ | H) ) Ψ dτ dH
(e)≥ ∫∫ P_K(H) q_φ(τ | H) log [ ( p_φ(τ | H) / q_φ(τ | H) ) Ψ ] dτ dH
(f)= ∫∫ P_K(H) q_φ(τ | H = ∞) [ Σ_{t=0}^{H−1} ( log e_φ(z_{t+1} | s_{t+1}) − log m_φ(z_{t+1} | z_t, a_t) ) + log Ψ ] dτ dH
(g)= ∫ q_φ(τ) ∫ P_K(H) [ Σ_{t=0}^{H−1} ( log e_φ(z_{t+1} | s_{t+1}) − log m_φ(z_{t+1} | z_t, a_t) ) + log Ψ ] dH dτ
(h)= ∫ q_φ(τ) [ Σ_{t=0}^{K−1} γ^t ( log e_φ(z_{t+1} | s_{t+1}) − log m_φ(z_{t+1} | z_t, a_t) + (1 − γ) log r(s_t, a_t) ) + γ^K log Q(s_K, a_K) ] dτ
(i)= E_{q^K_φ(τ)} [ Σ_{t=0}^{K−1} γ^t ( log e_φ(z_{t+1} | s_{t+1}) − log m_φ(z_{t+1} | z_t, a_t) + (1 − γ) log r(s_t, a_t) ) + γ^K log Q(s_K, a_K) ].

In (a), we start with the K-step version of the RL objective. In (b), we use Lemma A.2. In (d), we multiply and divide by q_φ(τ | H). In (e), we apply Jensen's inequality. In (f), since all the terms inside the summation depend only on the first H steps of the trajectory, we change the integration from H-length to infinite-length trajectories. In (g), we change the order of integration. In (h), we use Lemma A.1 together with the fact that E_{P_K(H)} [ log( 1{H ≤ K − 1} r(s_H, a_H) + 1{H = K} Q(s_H, a_H) ) ] = Σ_{t=0}^{K−1} (1 − γ) γ^t log r(s_t, a_t) + γ^K log Q(s_K, a_K).
$$\mathrm{RL}_k = (1-\gamma)\sum_{t=0}^{k-1} \gamma^t r(s_t, a_t) + \gamma^k Q(s_k, a_k).$$
Like prior work (Schulman et al., 2015; Hafner et al., 2019), we write the RL objective as a λ-weighted average of k-step returns over different horizons (different values of $k$), which substantially reduces the variance of the estimated returns at the cost of some bias:
$$\mathbb{E}_{p_\phi(\tau)}\Big[(1-\gamma)\sum_t \gamma^t r(s_t, a_t)\Big] = \mathbb{E}_{p_\phi(\tau)}\Big[(1-\lambda)\sum_{k=1}^{\infty} \lambda^{k-1}\, \mathrm{RL}_k\Big].$$
Our K-step lower bound $\mathcal{L}_\phi^K$ involves rolling out the model on-policy for $K$ time-steps. A larger value of $K$ reduces the dependence on the learned Q-function, and hence reduces the estimation bias at the cost of higher variance.

Lemma A.4. A λ-weighted average of our lower bounds over different horizons (different values of $K$) is itself a lower bound on the RL objective:
$$\log \mathbb{E}_{p_\phi(\tau)}\Big[(1-\gamma)\sum_t \gamma^t r(s_t, a_t)\Big] \overset{(a)}{=} \log \mathbb{E}_{p_\phi(\tau)}\Big[(1-\lambda)\sum_{k=1}^{\infty} \lambda^{k-1}\, \mathrm{RL}_k\Big]$$
$$\overset{(b)}{=} \log (1-\lambda)\sum_{k=1}^{\infty} \lambda^{k-1}\, \mathbb{E}_{p_\phi(\tau)}\big[\mathrm{RL}_k\big] \overset{(c)}{=} \log \mathbb{E}_{P_{\mathrm{Geom}}(k)}\Big[\mathbb{E}_{p_\phi(\tau)}\big[\mathrm{RL}_{k+1}\big]\Big]$$
$$\overset{(d)}{\ge} \mathbb{E}_{P_{\mathrm{Geom}}(k)}\Big[\log \mathbb{E}_{p_\phi(\tau)}\big[\mathrm{RL}_{k+1}\big]\Big] \overset{(e)}{=} (1-\lambda)\sum_{k=1}^{\infty} \lambda^{k-1} \log \mathbb{E}_{p_\phi(\tau)}\big[\mathrm{RL}_k\big] \overset{(f)}{\ge} (1-\lambda)\sum_{k=1}^{\infty} \lambda^{k-1}\, \mathcal{L}_\phi^k.$$
In (a), we use the λ-weighted estimate of the RL objective. In (b), we use linearity of expectation. In (c), we note that the coefficient of $\mathrm{RL}_k$ is the probability of $k-1$ under a geometric distribution with parameter $1-\lambda$, so we rewrite the sum as an expectation over that distribution. In (d), we use Jensen's inequality. In (e), we write out the probabilities of the geometric distribution. In (f), we apply Theorem 3.1 to every value of $k$.
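As a concrete illustration of how these quantities are computed, the sketch below evaluates the K-step bound of Theorem 3.1 from the per-step log-probabilities of a single latent rollout, then combines bounds for different horizons with the λ-weights above (truncated to K terms and renormalized, as in the exponentially-weighted average used by our implementation). The function names and scalar inputs are ours, purely for illustration, not from the released code.

```python
import math

def alm_k_step_bound(log_e, log_m, log_r, log_q_final, gamma):
    """Monte-Carlo estimate of the K-step lower bound for one latent rollout.

    log_e[t]:    log e_phi(z_{t+1} | s_{t+1})   (encoder / posterior)
    log_m[t]:    log m_phi(z_{t+1} | z_t, a_t)  (latent-space model / prior)
    log_r[t]:    log r(s_t, a_t)
    log_q_final: log Q(s_K, a_K)
    """
    K = len(log_r)
    total = 0.0
    for t in range(K):
        total += gamma ** t * (log_e[t] - log_m[t])   # consistency (KL) term
        total += (1 - gamma) * gamma ** t * log_r[t]  # reward term
    total += gamma ** K * log_q_final                 # terminal value term
    return total

def lambda_weighted_bound(bounds, lam):
    """lambda-weighted average of the bounds L_1..L_K, with weights
    (1 - lam) * lam^(k-1), truncated to K terms and renormalized."""
    weights = [(1 - lam) * lam ** k for k in range(len(bounds))]
    z = sum(weights)  # renormalize the truncated geometric weights
    return sum(w * b for w, b in zip(weights, bounds)) / z
```

When the model matches the encoder exactly (`log_e == log_m`), the consistency terms vanish and the bound reduces to the discounted log-reward and log-value terms.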

A.4 TIGHTENING THE K-STEP LOWER BOUND USING THE OPTIMAL DISCOUNT DISTRIBUTION AND THE OPTIMAL LATENT DYNAMICS

We define $\gamma_{\phi,K}(H)$ to be a learned discount distribution over the first $K+1$ timesteps $\{0, 1, \ldots, K\}$. We use this learned discount distribution when using data from imaginary rollouts. We start by lower bounding the RL objective and derive the optimal discount distribution $\gamma^*_{\phi,K}(H)$ and the optimal latent-dynamics distribution $q^*(\tau \mid H)$ that maximize this lower bound.
$$\overset{(a)}{=} \log \mathbb{E}_{P_K(H)}\Big[\mathbb{E}_{p_\phi(\tau \mid H)}\big[\underbrace{\mathbb{1}\{H \le K-1\}\, r(s_H, a_H) + \mathbb{1}\{H = K\}\, Q(s_H, a_H)}_{\Psi}\big]\Big]$$
$$\overset{(b)}{=} \log \iint \frac{\gamma_{\phi,K}(H)\, q_\phi(\tau \mid H)}{\gamma_{\phi,K}(H)\, q_\phi(\tau \mid H)}\, P_K(H)\, p_\phi(\tau \mid H)\, \Psi \; d\tau\, dH$$
$$\overset{(c)}{\ge} \iint \gamma_{\phi,K}(H)\, q_\phi(\tau \mid H) \log \frac{P_K(H)\, p_\phi(\tau \mid H)\, \Psi}{\gamma_{\phi,K}(H)\, q_\phi(\tau \mid H)} \; d\tau\, dH$$
Given a horizon $H \in \{0, \ldots, K\}$, the optimal dynamics $q^*(\tau \mid H)$ can be calculated analytically:
$$q^*(\tau \mid H) \overset{(d)}{=} \frac{p_\phi(\tau \mid H)\, \Psi}{\int p_\phi(\tau \mid H)\, \Psi \, d\tau}.$$
We substitute this value of $q^*$ into equation (c):
$$\overset{(e)}{=} \iint \gamma_{\phi,K}(H)\, \frac{p_\phi(\tau \mid H)\, \Psi}{\int p_\phi(\tau \mid H)\, \Psi \, d\tau} \log \frac{P_K(H)\, p_\phi(\tau \mid H)\, \Psi \int p_\phi(\tau \mid H)\, \Psi \, d\tau}{\gamma_{\phi,K}(H)\, p_\phi(\tau \mid H)\, \Psi} \; d\tau\, dH$$
$$\overset{(f)}{=} \int \gamma_{\phi,K}(H) \log\Big(\frac{P_K(H)}{\gamma_{\phi,K}(H)} \int p_\phi(\tau \mid H)\, \Psi \, d\tau\Big) dH.$$
We can now calculate the optimal discount distribution $\gamma^*_{\phi,K}(H)$ analytically:
$$\gamma^*_{\phi,K}(H) \overset{(g)}{=} \frac{P_K(H) \int p_\phi(\tau \mid H)\, \Psi \, d\tau}{\sum_{H'=0}^{K} P_K(H') \int p_\phi(\tau \mid H')\, \Psi \, d\tau}.$$
In (a), we rewrite the RL objective using Lemma A.2. In (b), we multiply and divide by $\gamma_{\phi,K}(H)\, q_\phi(\tau \mid H)$. In (c), we use Jensen's inequality. In (d) and (g), we use the method of Lagrange multipliers to derive the optimal distributions; we use Lemma A.3 for this result.

Writing out the optimal latent-dynamics distribution in terms of rewards and the Q-function: the optimal latent dynamics are non-Markovian and do not match the MDP dynamics, but are optimistic towards high-return trajectories.
$$q^*(\tau \mid H) = \begin{cases} \dfrac{p_\phi(\tau \mid H)\, r(s_H, a_H)}{\mathbb{E}_{p_\phi}[r(s_H, a_H)]} & H \in \{0, \ldots, K-1\} \\[6pt] \dfrac{p_\phi(\tau \mid H)\, Q(s_H, a_H)}{\mathbb{E}_{p_\phi}[Q(s_H, a_H)]} & H = K. \end{cases}$$
Writing out the optimal discount distribution: we start from equation (f) of the previous proof (Appendix A.4), which is a lower bound on the RL objective.
To derive (f), we had already substituted the optimal latent dynamics. We now substitute the value of $\gamma^*_{\phi,K}(H)$ and verify that the lower bound (f) recovers the original objective, showing that the bound is tight. Written out in terms of rewards and the Q-function, the optimal discount distribution is:
$$\gamma^*_{\phi,K}(H) = \begin{cases} \dfrac{(1-\gamma)\gamma^H\, \mathbb{E}_{p_\phi}[r(s_H, a_H)]}{\mathbb{E}_{p_\phi}[Q(s_0, a_0)]} & H \in \{0, \ldots, K-1\} \\[6pt] \dfrac{\gamma^K\, \mathbb{E}_{p_\phi}[Q(s_H, a_H)]}{\mathbb{E}_{p_\phi}[Q(s_0, a_0)]} & H = K \\[6pt] 0 & H > K. \end{cases}$$
$$\overset{(f)}{=} \int \gamma_{\phi,K}(H) \log\Big(\frac{P_K(H)}{\gamma_{\phi,K}(H)} \int p_\phi(\tau \mid H)\, \Psi \, d\tau\Big) dH$$
$$\overset{(h)}{=} \int \frac{P_K(H) \int p_\phi(\tau \mid H)\, \Psi \, d\tau}{\sum_{H'=0}^{K} P_K(H') \int p_\phi(\tau \mid H')\, \Psi \, d\tau} \log\Big(\sum_{H'=0}^{K} P_K(H') \int p_\phi(\tau \mid H')\, \Psi \, d\tau\Big) dH$$
$$\overset{(i)}{=} \log\Big(\sum_{H=0}^{K} P_K(H) \int p_\phi(\tau \mid H)\, \Psi \, d\tau\Big)$$
$$\overset{(k)}{=} \log \mathbb{E}_{p_\phi(\tau)}\Big[(1-\gamma)\sum_t \gamma^t r(s_t, a_t)\Big].$$
In (h), we substitute the value of $\gamma^*_{\phi,K}(H)$ and cancel out common terms. In (i), since the remaining log term does not depend on $H$, we pull it out of the integral; the remaining integral equals 1. For (k), we use the result from Lemma A.2. Hence we have proved that the bound becomes tight when using the optimal discount and trajectory distributions.
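To make the optimal discount distribution concrete, the following toy sketch (ours, with the $(1-\gamma)$-scaled rewards used throughout the derivation) computes $\gamma^*_{\phi,K}(H) \propto P_K(H)\,Z(H)$ from estimates of $\mathbb{E}_{p}[r(s_H, a_H)]$ and $\mathbb{E}_{p}[Q(s_K, a_K)]$; the normalizer recovers $\mathbb{E}_{p}[Q(s_0, a_0)]$ when $Q$ is the true value function.

```python
def optimal_discount(gamma, K, expected_r, expected_q_K):
    """gamma*_{phi,K}(H) proportional to P_K(H) * Z(H), where
    P_K(H) = (1 - gamma) * gamma^H for H <= K-1, and gamma^K for H = K;
    Z(H)   = E_p[r(s_H, a_H)] for H <= K-1, and E_p[Q(s_K, a_K)] for H = K.
    Returns the normalized distribution and its normalizer."""
    unnorm = [(1 - gamma) * gamma ** H * expected_r[H] for H in range(K)]
    unnorm.append(gamma ** K * expected_q_K)
    z = sum(unnorm)  # equals E_p[Q(s_0, a_0)] for the true Q-function
    return [u / z for u in unnorm], z
```

For a constant reward $\bar r$ with the corresponding scaled value $Q = \bar r$, the normalizer is exactly $\bar r$, which is the tightness argument above in miniature.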
Similar to Theorem 3.1, we want to find an encoder $e_\phi(z_t \mid s_t)$, a policy $\pi_\phi(a_t \mid z_t)$, and a model $m_\phi(z_{t+1} \mid z_t, a_t)$ to maximize the RL objective:
$$\max_\phi\; \log \mathbb{E}_{p_{\phi,b}(\tau)}\Big[(1-\gamma)\sum_t \gamma^t r(s_t, a_t)\Big]$$
$$\overset{(a)}{=} \log \mathbb{E}_{p_{\phi,b}^K(\tau)}\Big[(1-\gamma)\sum_{t=0}^{K-1} \gamma^t r(s_t, a_t) + \gamma^K Q(s_K, a_K)\Big]$$
$$\overset{(c)}{=} \log \iint P_K(H)\, p_{\phi,b}(\tau \mid H)\, \Psi \; d\tau\, dH$$
$$\overset{(d)}{\ge} \int P_K(H) \log \int q_\phi(\tau \mid H)\, \frac{p_{\phi,b}(\tau \mid H)}{q_\phi(\tau \mid H)}\, \Psi \; d\tau\, dH$$
$$\overset{(e)}{\ge} \iint P_K(H)\, q_\phi(\tau \mid H) \log\Big(\frac{p_{\phi,b}(\tau \mid H)}{q_\phi(\tau \mid H)}\, \Psi\Big)\, d\tau\, dH$$
$$\overset{(f)}{=} \iint P_K(H)\, q_\phi(\tau \mid H{=}\infty) \Big[\sum_{t=0}^{H-1} \log\frac{e_\phi(z_{t+1} \mid s_{t+1})\, \pi_b(a_{t+1} \mid s_{t+1})}{m_\phi(z_{t+1} \mid z_t, a_t)\, \pi_\phi(a_{t+1} \mid z_{t+1})} + \log \Psi\Big] d\tau\, dH$$
$$\overset{(g)}{=} \iint q_\phi(\tau)\, P_K(H) \Big[\sum_{t=0}^{H-1} \log\frac{e_\phi(z_{t+1} \mid s_{t+1})\, \pi_b(a_{t+1} \mid s_{t+1})}{m_\phi(z_{t+1} \mid z_t, a_t)\, \pi_\phi(a_{t+1} \mid z_{t+1})} + \log \Psi\Big] dH\, d\tau$$
$$\overset{(h)}{=} \int q_\phi(\tau) \Big[\sum_{t=0}^{K-1} \gamma^t \log\frac{e_\phi(z_{t+1} \mid s_{t+1})\, \pi_b(a_{t+1} \mid s_{t+1})}{m_\phi(z_{t+1} \mid z_t, a_t)\, \pi_\phi(a_{t+1} \mid z_{t+1})} + (1-\gamma)\gamma^t \log r(s_t, a_t) + \gamma^K \log Q(s_K, a_K)\Big] d\tau$$
$$\overset{(i)}{=} \mathbb{E}_{q_\phi^K(\tau)}\Big[\sum_{t=0}^{K-1} \underbrace{\gamma^t \log\frac{e_\phi(z_{t+1} \mid s_{t+1})}{m_\phi(z_{t+1} \mid z_t, a_t)}}_{\text{KL consistency term}} + \underbrace{\gamma^t \log\frac{\pi_b(a_{t+1} \mid s_{t+1})}{\pi_\phi(a_{t+1} \mid z_{t+1})}}_{\text{behavior cloning term}} + (1-\gamma)\gamma^t \log r(s_t, a_t) + \gamma^K \log Q(s_K, a_K)\Big].$$
The only major difference between this proof and the proof of Theorem 3.1 (Appendix A.2) is step (f), where canceling out the common terms of $p(\tau)$ and $q(\tau)$ leaves an additional behavior cloning term. This derivation theoretically backs the behavior cloning terms used by prior representation learning methods (Abdolmaleki et al., 2018; Peters et al., 2010) in the offline RL setting.

A.8 OMISSION OF THE LOG OF REWARDS AND Q-FUNCTION IS EQUIVALENT TO SCALING THE KL TERM IN EQUATION 8

While our main results omit the logarithmic transformation of the rewards and Q-function, in this section we show that this omission is approximately equivalent to scaling the KL coefficient in Eq. 8. Using these insights, we applied ALM, with the log of rewards, to a transformed MDP with a shifted reward function.
A large enough constant $a$ is added to all original rewards (which does not change the optimal policy, assuming there are no terminal states): $r_{\text{new}} = r + a$. For large $a$, the shifted-and-scaled log of this new reward approximately recovers the original reward:
$$a\big(\log(r + a) - \log(a)\big) \approx r,$$
since $\log(r + a) = \log(a) + \log(1 + r/a) \approx \log(a) + r/a$. The additive $\log(a)$ term can be ignored because it does not contribute to the optimization. We plot both $y = r$ and $y = a(\log(r + a) - \log(a))$ in Figure 6 to show that they are very similar for commonly used reward values. The scaling constant $a$ can be interpreted as the weight of the log of rewards relative to the KL term in the ALM objective; changing this value is therefore approximately equivalent to scaling the KL coefficient. The results, shown in Fig. 12, show that this version of ALM, which includes the logarithmic transformation, performs on par with the version of ALM without the logarithm. This shows that we can add the logarithm back to ALM without hurting performance. For both Figure 6 and Figure 12, we used $a = 10000$.

The ALM objective using the transformed reward:
$$\overset{(a)}{=} \mathbb{E}_{q_\phi^K(\tau)}\Big[\sum_{t=0}^{K-1} \gamma^t \big(\log e_\phi(z_{t+1} \mid s_{t+1}) - \log m_\phi(z_{t+1} \mid z_t, a_t)\big) + (1-\gamma)\gamma^t \log r_{\text{new}}(s_t, a_t) + \gamma^K \log Q_{\text{new}}(s_K, a_K)\Big]$$
$$\overset{(b)}{\approx} \mathbb{E}_{q_\phi^K(\tau)}\Big[\sum_{t=0}^{K-1} \gamma^t \big(\log e_\phi(z_{t+1} \mid s_{t+1}) - \log m_\phi(z_{t+1} \mid z_t, a_t)\big) + (1-\gamma)\gamma^t \Big(\frac{r(s_t, a_t)}{a} + \log a\Big) + \gamma^K \Big(\frac{Q(s_K, a_K)}{a} + \log a\Big)\Big]$$
$$\overset{(c)}{=} \mathbb{E}_{q_\phi^K(\tau)}\Big[\sum_{t=0}^{K-1} \gamma^t \big(\log e_\phi(z_{t+1} \mid s_{t+1}) - \log m_\phi(z_{t+1} \mid z_t, a_t)\big) + (1-\gamma)\gamma^t\, \frac{r(s_t, a_t)}{a} + \gamma^K\, \frac{Q(s_K, a_K)}{a}\Big] + \log a$$
$$\overset{(d)}{=} \frac{1}{a}\, \mathbb{E}_{q_\phi^K(\tau)}\Big[\sum_{t=0}^{K-1} a\,\gamma^t \big(\log e_\phi(z_{t+1} \mid s_{t+1}) - \log m_\phi(z_{t+1} \mid z_t, a_t)\big) + (1-\gamma)\gamma^t\, r(s_t, a_t) + \gamma^K Q(s_K, a_K)\Big] + \log a.$$
In (a), we write the ALM objective using the new reward function. In (b), we use the approximations $\log r_{\text{new}}(s_t, a_t) \approx r(s_t, a_t)/a + \log(a)$ and $\log Q_{\text{new}}(s_K, a_K) \approx Q(s_K, a_K)/a + \log(a)$, valid for a large enough constant $a$ (see the explanation above). In (c), we collect the constant terms, which sum to $\log a$ and do not affect the optimization.
In (d), we note that scaling the rewards and Q-function by $1/a$ is equivalent (up to an overall constant factor on the objective) to scaling the KL term by the large constant $a$.

Figure 6: Analyzing the logarithmic reward transformation. While our theoretical derivation motivates a logarithmic transformation of the returns, here we show that appropriately scaling and shifting the rewards makes that logarithmic transformation have almost no effect. This shows that omitting the logarithmic transformation can be thought of as re-weighting the two components of our augmented reward function (Eq. 6).
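The approximation above is easy to check numerically. This small sketch (ours) compares $r$ with $a(\log(r + a) - \log(a))$ for $a = 10000$, the value used in our figures.

```python
import math

def shifted_log_reward(r, a=10000.0):
    """a * (log(r + a) - log(a)) ~= r for |r| << a, since
    log(r + a) = log(a) + log(1 + r/a) ~= log(a) + r/a."""
    return a * (math.log(r + a) - math.log(a))
```

For rewards of magnitude up to roughly 100 and $a = 10000$, the two curves agree to within about 0.5, consistent with Figure 6.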

B COMPARISON TO BASELINES

Here we provide a brief conceptual comparison to the baselines, in terms of the number of gradient updates per environment step, whether ensembles are used for the dynamics model or Q-function, and the objectives used for the representations, model, and policy.

The Q-function $Q_\theta(z_t, a_t)$ is trained with the standard TD loss on real transitions:
$$\mathcal{L}_{Q_\theta}\big(z_t \sim e_\phi(\cdot \mid s_t),\, a_t,\, r_t,\, z_{t+1} \sim e_\phi(\cdot \mid s_{t+1})\big) = \big(Q_\theta(z_t, a_t) - (r_t + \gamma\, Q_{\theta_{\text{targ}}}(z_{t+1}, \pi(z_{t+1})))\big)^2. \quad (10)$$
The TD target is computed using a target Q-function (Fujimoto et al., 2018). We learn the reward function $r_\theta(z_t, a_t)$ using data $(s_t, a_t, r_t)$ from the real environment only, by minimizing the mean squared error between the true and predicted rewards:
$$\mathcal{L}_{r_\theta}\big(z_t \sim e_\phi(\cdot \mid s_t),\, a_t,\, r_t\big) = \big(r_\theta(z_t, a_t) - r_t\big)^2.$$
Unlike prior representation learning methods (Zhang et al., 2020), we do not use the reward or Q-function training signals to train the encoder. The encoder, model, and policy are optimized using the principled joint objective only.
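A minimal sketch of the TD loss in Eq. 10 and the reward loss, with the networks replaced by stand-in callables. The names and scalar inputs are illustrative only, not from our released code.

```python
def td_loss(q, q_targ, policy, encoder, transition, gamma=0.99):
    """Squared TD error computed on latent representations (cf. Eq. 10).
    transition = (s, a, r, s_next), sampled from the real replay buffer;
    q_targ is the frozen target Q-function (Fujimoto et al., 2018)."""
    s, a, r, s_next = transition
    z = encoder(s)            # z_t ~ e_phi(. | s_t)
    z_next = encoder(s_next)  # z_{t+1} ~ e_phi(. | s_{t+1})
    target = r + gamma * q_targ(z_next, policy(z_next))
    return (q(z, a) - target) ** 2

def reward_loss(r_model, encoder, s, a, r):
    """Squared error between the predicted and true reward."""
    return (r_model(encoder(s), a) - r) ** 2
```

Note that the encoder appears only inside the loss inputs: in ALM, gradients from these losses are not used to train the encoder.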

D IMPLEMENTATION DETAILS

We implement ALM using DDPG (Lillicrap et al., 2015) as the base algorithm. Following prior SVG methods (Amos et al., 2020), we parameterize the encoder, model, policy, reward, classifier, and Q-function as 2-layer neural networks, all with 512 hidden units except the model, which has 1024 hidden units. The model and the encoder output a multivariate Gaussian distribution over the latent space with diagonal covariance. Like prior work (Hansen et al., 2022; Yarats et al., 2021), we apply layer normalization (Ba et al., 2016) to the value function and rewards. Similar to prior work (Schulman et al., 2015; Hafner et al., 2019), we reduce the variance of the policy objective in Equation 8 by computing an exponentially-weighted average of the objective for rollouts of length 1 to K. This average is also a lower bound (Appendix A.3). To train the policy, reward, classifier, and Q-function, we use representations sampled from the target encoder. For exploration, we add Gaussian noise to the actions, with a linear schedule for the standard deviation (Yarats et al., 2021). All hyperparameters are listed in Table 3. A brief summary of all the neural networks, their loss functions, and the inputs to their loss functions is given in Table 4.
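For concreteness, a linearly annealed exploration-noise schedule can be sketched as below. The start/end values and decay horizon here are placeholders, not our tuned hyperparameters (see Table 3 for those).

```python
def noise_std(step, start_std=1.0, final_std=0.1, decay_steps=100_000):
    """Linearly anneal the std of the additive Gaussian exploration noise
    from start_std to final_std over decay_steps environment steps,
    then hold it constant at final_std."""
    frac = min(step / decay_steps, 1.0)
    return start_std + frac * (final_std - start_std)
```

At action-selection time, the sampled action would be perturbed by `random.gauss(0.0, noise_std(step))` and clipped to the action bounds.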

E ADDITIONAL EXPERIMENTS

Analyzing the learned representations. Because the ALM objective optimizes the encoder and model to be self-consistent, we expect the ALM dynamics model to remain accurate for longer rollouts than alternative methods. We test this hypothesis using an optimal trajectory from the HalfCheetah-v2 task. Starting with the representation of the initial state, we autoregressively unroll the learned model, comparing each prediction $z_t$ to the true representation (obtained by applying the encoder to the observation $s_t$). Fig. 7 (left) visualizes the first coordinate of the representation, and shows that the model learned via ALM remains accurate for ∼20 steps. The ablation of ALM that removes the KL term diverges after just two steps.

Bias and variance of the lower bound. We recreate the experimental setup of Chen et al. (2021) to evaluate the mean and standard deviation of the bias between the Monte-Carlo returns and the estimated returns (the lower bound $\mathcal{L}_\phi^K(s, a)$ in our experiments). Similar to Chen et al. (2021), we use the TruncatedAnt-v2 environment.

Figure 11: ALM is robust across different values of the coefficient for the KL term in Equation 8. In our implementation, we deviate from the value 1 because it leads to a relatively gradual increase in returns on some environments. This is expected, because a higher coefficient on the KL term leads to higher compression (Eysenbach et al., 2021b). We note that prior work on variational inference (Wenzel et al., 2020) also finds that scaling the KL term can improve results.
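The rollout-accuracy analysis above can be sketched as follows; `encoder` and `model` are toy callables standing in for the trained networks, and the scalar latent is purely illustrative.

```python
def rollout_errors(encoder, model, states, actions):
    """Unroll the latent model autoregressively along a real trajectory and
    compare each prediction to the encoder's representation of the true
    next state. states must have length len(actions) + 1."""
    z_pred = encoder(states[0])
    errors = []
    for t, a in enumerate(actions):
        z_pred = model(z_pred, a)        # autoregressive latent prediction
        z_true = encoder(states[t + 1])  # "ground truth" representation
        errors.append(abs(z_pred - z_true))
    return errors
```

A self-consistent model keeps the error flat, while a slightly-off model compounds its error with rollout length; this compounding is the divergence visible in Fig. 7.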

HalfCheetah-v2

Figure 12: Using the logarithm does not actually hurt performance. Our theory suggests that we should take the logarithm of the reward function and Q-function. Naïvely implemented, this logarithmic transformation (pink) performs much worse than omitting the transformation (green). We also see that using the log of rewards for training only the encoder and model does not affect performance (blue). We hypothesize that the nonlinearity of log(x) over typical reward values makes the Q-values similar for different actions. However, by transforming the reward function (which does not change the optimization problem), we are able to include the theoretically-suggested logarithm while retaining high performance (red). See Appendix A.8 for more details.

The main difference between ALM and a prior joint optimization method (MnM) is that ALM learns the encoder. Replacing that learned encoder with an identity function yields a method that resembles MnM, and performs much worse. This result supports our claim that RL methods that use latent-space models can significantly outperform state-space models.

F COMPARISON TO PRIOR WORK ON LOWER BOUNDS FOR RL (MNM)

Our approach of deriving an evidence lower bound on the RL objective is similar to prior work (Eysenbach et al., 2021a). In this section we briefly describe the connection between our method and Eysenbach et al. (2021a). The lower bound presented in Eysenbach et al. (2021a) is a special case of the lower bound in Theorem 3.1: by taking the limit $K \to \infty$ and assuming an identity function for the encoder, we exactly recover the bound of Eysenbach et al. (2021a). By using a bottleneck policy (a policy with an encoder), ALM(K) learns to represent observations according to their importance for the control problem, rather than trying to reconstruct every observation feature with high fidelity.
This is supported by the fact that ALM solves Humanoid-v2 and Ant-v2, which were not solvable by prior methods like MBPO and MnM. By using the model for only $K$ steps, we gain a parameter that explicitly controls the trade-off between planning bias and Q-learning bias. The policy therefore incurs the penalty (intrinsic reward term) only for the first $K$ steps, rather than over the entire horizon (Eysenbach et al., 2021a), which could otherwise lead to lower returns on the training tasks (Eysenbach et al., 2021b).

G FAILED EXPERIMENTS

Experiments that we tried and that did not help substantially:
• Value expansion: training the critic using data generated from model rollouts. The added complexity did not add much benefit.
• Warm-up steps: training the policy using only real data for a fixed number of time-steps at the start of training.
• Horizon scheduling: scheduling the sequence length from 1 to K at the start of training.
• Exponential discounting: down-scaling the learning rate of future time-steps using a temporal discount factor, to avoid exploding or vanishing gradients.

Experiments that we tried and found to help:
• Target encoder: using a target encoder for the KL term in Eq. 7 helped reduce the variance of episode returns.
• ELU activation: switching from ReLU to ELU activations for all networks resulted in more stable and sample-efficient performance across all tasks.



Project website with code: https://alignedlatentmodels.github.io/. In our code, we do not use the γ discounting for Equation 7.



Figure 1: (Left) Most model-based RL methods learn the representations, latent-space model, and policy using three different objectives. (Right) We derive a single objective for all three components, which is a lower bound on expected returns. Based on this objective, we develop a practical deep RL algorithm.

Figure 3: Good performance without ensembles. Our method (ALM) can (Top) match the sample complexity of ensembling-based methods (MBPO, REDQ) while (Bottom) requiring less runtime. Compared to MBPO, ALM takes ∼ 10× less time per environment step. See Appendix Fig. 9 for results on other environments.

Figure 4: Analyzing Q-values. See text for details.

(a) Terms in the objective. (b) Sequence length. (c) Model free RL.

Figure 5: Ablation experiments. (Left) Comparison of ALM(3) with ablations that remove the value term for the encoder, the KL term for the encoder, and the classifier-based rewards for the policy. The results reflect the importance of the temporal consistency terms, especially for training the encoder. (Center) Comparison of ALM(K) for different values of K, and the SAC baseline. Architectures that support larger values of K could yield further improvements in performance. (Right) The representation learning objective of ALM(3) leads to higher sample efficiency for model-free RL. To ensure the validity of these results, we implemented TD3 ("TD3 (ours)"), which uses the same architecture, exploration, and learning parameters as our method. Ablation results for other environments can be found in Fig. 10a, 10b, 10c.



Figure 7: Analyzing the learned representations. (Left-top) The ground-truth representations are obtained by applying the respective trained encoders to the same optimal trajectory. (Left-bottom) Without the KL term, the learned representations are degenerate, i.e., they collapse to the same value for different states. (Right) The KL term in the ALM objective trains the model to reduce the future K-step prediction errors. The latent-space model is able to accurately approximate the true representations for up to ∼20 rollout steps.

Figure 8: Bias and variance of the lower bound. In accordance with our theory, the joint objective is a biased estimate of the true returns. The standard deviation is uniform throughout training and consistently lower than that of REDQ, which could explain the sample efficiency of ALM(3).

Figure 10: Additional ablation experiments.

Figure 13: Comparison with an MnM ablation. The main difference between ALM and a prior joint optimization method (MnM) is that ALM learns the encoder.

Figure 14: Asymptotic performance. Even after 1 million environment steps, ALM still outperforms the SAC baseline.

1: Initialize the encoder e_φ(z_t | s_t), model m_φ(z_{t+1} | z_t, a_t), policy π_φ(a_t | z_t), classifier C_θ(z_{t+1}, a_t, z_t), reward r_θ(z_t, a_t), Q-function Q_θ(z_t, a_t), and replay buffer B.
2: for n = 1, …, N do

On the model-based benchmark from Wang et al. (2019), ALM outperforms model-based and model-free methods on 4/5 tasks, often by a wide margin. We report the mean and std. dev. across 5 random seeds. We use T-Humanoid-v2 and T-Ant-v2 to refer to the respective truncated environments from Wang et al. (2019).

Table showing conceptual comparisons of ALM to baselines.

Estimating the Q-function and reward function. Unlike most MBRL algorithms, the Q-function Q_θ(z_t, a_t) is learned using transitions (s_t, a_t, r_t, s_{t+1}) from the real environment only, using the standard TD loss:

• In Figure 20, we show the average final episodic returns achieved by ALM for different coefficients for the KL term in Equation 8.

Where possible, we have tried to use the same hyperparameters and architectures as SAC-SVG, including the same update-to-data-collection ratio, batch size, and rollout length. Both methods use the same policy improvement technique: stochastic value gradients. The two exceptions are decisions that simplify our method: unlike SAC-SVG, we use a feedforward dynamics model instead of an RNN, and simple random noise instead of a more complex entropy schedule for exploration. To test whether the soft actor-critic entropy bonus used in SAC-SVG could be a confounding factor causing SAC-SVG to perform worse than ALM, we compare against a version of ALM that uses a soft actor-critic entropy bonus like SAC-SVG.


Acknowledgments. The authors thank Shubham Tulsiani for helpful discussions throughout the project and feedback on the paper draft. We thank Melissa Ding, Jamie D Gregory, Midhun Sreekumar, Srikanth Vidapanakal, Ola Electric and CMU SCS for helping to set up the compute necessary for running the experiments. We thank Xinyue Chen and Brandon Amos for answering questions about the baselines used in the paper, and Nicklas Hansen for helpful discussions on model-based RL.

A.6 TIGHTNESS OF LOWER BOUND WITH LENGTH OF ROLLOUTS K

Proof. It suffices to prove that $\mathcal{L}_\phi^K \ge \mathcal{L}_\phi^{K+1}$ for all $K \in \mathbb{N}$.

In the offline RL setting (Levine et al., 2020; Prudencio et al., 2022), we have access to a static dataset of trajectories from the environment, collected by one or more unknown policies. We derive a lower bound similar to Theorem 3.1. The main difference is that in Theorem 3.1, once the policy was updated (φ_t → φ_{t+1}), we were able to collect new data using it, while in offline RL we have to re-use the same static data. Similar to Theorem 3.1, we define a distribution over trajectories p(τ) such that the offline dataset consists of trajectories sampled from this true distribution. We do not assume access to the data-collection policies; rather, π_b(a_t | s_t) is the behavior cloning policy obtained from the offline dataset. We want to estimate the RL objective with trajectories sampled from a different distribution q_φ(τ), which leads to an algorithm that avoids sampling high-dimensional observations.

Classifier: C_θ(z, a, z′) ∈ (0, 1), cross-entropy loss (Eq. 9).

Additional Ablations.

• In Table 5, we train ALM using the soft actor-critic entropy bonus and compare it with SAC-SVG.
• In Figure 11, we show that ALM is robust to a range of coefficients for the KL term in Equation 8.
• In Figure 12, we incorporate the logarithmic transformation of the reward function based on Appendix A.8 and show that it does not hurt performance.
• In Figure 13, we compare ALM to a version of MnM (Eysenbach et al., 2021a) to show that learning the encoder is necessary for good performance on high-dimensional tasks.
• In Figure 14, we show that ALM achieves higher asymptotic returns when compared to SAC.
• In Figure 15, we compare ALM to a version of ALM that uses a reconstruction loss to learn the encoder.
• In Figure 16, we show that ALM works well even when using a linear classifier.
• In Figure 17, we add noise to the MDP dynamics to show the performance of ALM under varying aleatoric uncertainty.
• In Figure 18 and Figure 19, we compare ALM to a version of ALM that additionally optimizes the encoder and the latent-space model to predict the value function.

