LATENT OFFLINE DISTRIBUTIONAL ACTOR-CRITIC

Abstract

Offline reinforcement learning (RL) has emerged as a promising paradigm for real-world applications, since it aims to train policies directly from datasets of past interactions with the environment. In the past few years, algorithms have been introduced to learn policies from high-dimensional observational states in offline settings. The general idea of these methods is to encode the environment into a smaller latent space and to train policies on top of this smaller representation. In this paper, we extend this general method to stochastic environments (i.e. environments where the reward function is stochastic) and consider a risk measure instead of the classical expected return. First, we show that, under some assumptions, minimizing a risk measure in the latent space is equivalent to minimizing it in the natural space. Based on this result, we present Latent Offline Distributional Actor-Critic (LODAC), an algorithm which is able to train policies in high-dimensional stochastic and offline settings to minimize a given risk measure. Empirically, we show that using LODAC to minimize Conditional Value-at-Risk (CVaR) outperforms previous methods in terms of CVaR and return on stochastic environments.

1. INTRODUCTION

In many contexts, human decisions are stored and build up interesting datasets. With the successes of modern machine learning tools comes the hope of exploiting them to build useful decision helpers. To achieve this, we could use an imitation learning approach (Hussein et al., 2017). But, in this case, we will at best be as good as humans. Moreover, the performance of this approach depends heavily on the quality of the training dataset. In this work we would like to avoid these behaviours and thus, we consider another framework: reinforcement learning (RL). In recent years, RL has achieved impressive results in a number of challenging areas, including games (Silver et al., 2016; 2018), robotic control (Gu et al., 2017; Haarnoja et al., 2018) or even healthcare (Shortreed et al., 2011; Wang et al., 2018). In particular, offline RL seems to be really interesting for real-world applications, since its goal is to train agents from a dataset of past interactions with the environment (Deisenroth et al., 2013). With the digitization of society, more and more features could be used to represent the environment. Unfortunately, classical RL algorithms are not able to work with high-dimensional states. Obviously, we could manually choose a feature subset. However, this choice is not straightforward and it could have a huge impact on performance. Therefore, it may be more practical to use RL algorithms capable of learning from high-dimensional states. It is common to evaluate RL algorithms on deterministic (in the sense that the reward function is deterministic) environments, such as the DeepMind Control suite (Tassa et al., 2018). However, in a lot of real-world applications, environments are not deterministic but stochastic. Therefore, it might be important to develop algorithms which are able to train policies in these settings.
The motivation of this paper is to provide a method for training policies in an offline setting in a high-dimensional stochastic environment. We present Latent Offline Distributional Actor-Critic (LODAC), an algorithm which is able to train policies in a high-dimensional, stochastic environment in an offline setting. The main idea is to learn a smaller representation and to train the agent directly in this latent space. But instead of considering the expected return, we take into account a risk measure. First, under some hypotheses, we show that minimizing this risk measure in the latent space is equivalent to minimizing the risk measure directly in the natural space. This theoretical result provides a natural framework: train a latent variable model to encode the natural space into a latent space and then train a policy on top of this latent space using a risk-sensitive RL algorithm. In the experimental part, we evaluate our algorithm on high-dimensional stochastic datasets. To the best of our knowledge, we are the first to propose an algorithm to train policies in high-dimensional, stochastic and offline settings.

2. RELATED WORK

Before going further into our work, we present some related work. Offline RL. Offline RL (Levine et al., 2020) is a particular approach to RL where the goal is to learn policies directly from past interactions with the environment. This is a promising framework for real-world applications since it allows the deployment of already trained policies. Thus, it is not really surprising that offline RL has received a lot of attention in recent years (Wiering & Van Otterlo, 2012; Levine et al., 2020; Yang et al., 2021; Chen et al., 2021b; Liu et al., 2021; Yu et al., 2021a; Wang et al., 2021). One of the main problems in offline RL is that the Q-function is too optimistic for out-of-distribution (OOD) state-action pairs for which observational data is limited (Kumar et al., 2019). Different approaches have been introduced to deal with this problem. For example, there are algorithms that extend the importance sampling method (Nachum et al., 2019; Liu et al., 2019) or the dynamic programming approach (Fujimoto et al., 2019; Kumar et al., 2019). Other authors build a conservative estimation of the Q-function for OOD state-action pairs (Kumar et al., 2020; Yu et al., 2021b). Finally, it is also feasible to extend the model-based approach (Rafailov et al., 2021; Argenson & Dulac-Arnold, 2020). Learning with high-dimensional states. In recent years, algorithms have been proposed to train policies directly from high-dimensional states, like images (Lange & Riedmiller, 2010; Levine et al., 2016; Finn & Levine, 2017; Ha & Schmidhuber, 2018; Chen et al., 2021a). Previous work has observed that learning a good representation of the observation is a key point for this type of problem (Shelhamer et al., 2016). In some works, authors use data augmentation techniques to learn the best representation possible (Kostrikov et al., 2020; Laskin et al., 2020; Kipf et al., 2019).
Moreover, it has been intensively studied how high-dimensional states should be encoded into the latent space (Nair et al., 2018; Gelada et al., 2019; Watter et al., 2015; Finn et al., 2016). A common approach is then to train policies using a classical RL algorithm, like Soft Actor-Critic (SAC) (Haarnoja et al., 2018), on top of this latent representation (Han et al., 2019; Lee et al., 2020). Furthermore, it is also possible to plan in this latent space to improve performance (Hafner et al., 2019b;a). Risk-sensitive RL. Risk-sensitive RL is a particular approach to safe RL (Garcıa & Fernández, 2015). In safe RL, policies are trained to maximize performance while respecting some safety constraints during training and/or deployment. In risk-sensitive RL, instead of maximizing the expectation of the cumulative rewards, we are more interested in minimizing a measure of the risk induced by the cumulative rewards. Risk-sensitive RL has attracted some attention in the last few years (Fei et al., 2021; Zhang et al., 2021). Depending on the context, we might consider different risk measures, like Exponential Utility (Rabin, 2013), Cumulative Prospect Theory (Tversky & Kahneman, 1992) or Conditional Value-at-Risk (CVaR) (Rockafellar & Uryasev, 2002). Conditional Value-at-Risk has strong theoretical properties and is quite intuitive (Sarykalin et al., 2008; Artzner et al., 1999). Therefore, CVaR is really popular and has been extensively studied in the context of RL (Chow & Ghavamzadeh, 2014; Chow et al., 2015; Singh et al., 2020; Urpí et al., 2021; Ma et al., 2021; 2020; Ying et al., 2021). Previous work suggests that taking into account Conditional Value-at-Risk instead of the classical expectation could prevent the gap of performance between simulation and real-world application (Pinto et al., 2017).

3. PRELIMINARIES

In this section, we introduce notations and recall concepts we will use later. Coherent Risk Measure. Let (Ω, F, P) be a probability space and L² := L²(Ω, F, P). A functional R : L² → (−∞, +∞] is called a coherent risk measure (Rockafellar, 2007) if

1. R(C) = C for all constants C.
2. R((1 − λ)X + λX′) ≤ (1 − λ)R(X) + λR(X′) for λ ∈ (0, 1).
3. R(X) ≤ R(X′) when X ≤ X′.
4. R(X) ≤ 0 when ‖X_k − X‖₂ → 0 with R(X_k) ≤ 0.
5. R(λX) = λR(X) for λ > 0.

Coherent risk measures have some interesting properties. In particular, a risk measure is coherent if and only if there is a risk envelope U such that

R(X) = sup_{δ ∈ U} E[δX]    (1)

(Artzner et al., 1999; Delbaen, 2002; Rockafellar et al., 2002). A risk envelope is a nonempty, closed, convex subset of P := {δ ∈ L² | δ ≥ 0, E[δ] = 1}. The definition of a risk measure R might depend on a probability distribution p. In some cases, it may be useful to specify which distribution we are working with; thus, we sometimes use the notation R_p. There exist many different coherent risk measures, for example the Wang risk measure (Wang, 2000), the entropic Value-at-Risk (Ahmadi-Javid, 2011) or Conditional Value-at-Risk (Rockafellar et al., 2000; Rockafellar & Uryasev, 2002). Conditional Value-at-Risk (CVaR_α) with probability level α ∈ (0, 1) is defined as

CVaR_α(X) := min_{t ∈ R} { t + 1/(1 − α) E[max{0, X − t}] }

Moreover, the risk envelope associated with CVaR_α can be written as U = {δ ∈ P | 0 ≤ δ ≤ 1/α} (Rockafellar et al., 2006; 2002). This rigorous definition may not be really intuitive but, roughly speaking, CVaR_α is the expectation of X in the conditional distribution of its upper tail, and thus corresponds to the (1 − α) worst case. In this paper, we follow the classical definition of Conditional Value-at-Risk presented in the risk measure literature. In particular, X should be interpreted as a loss. Offline RL. We consider a Markov Decision Process (MDP) (S, A, p, r, µ₀, γ), where S is the environment space, A the action space, r the reward distribution (r_t ∼ r(·|s_t, a_t)), p the transition probability distribution (s_{t+1} ∼ p(·|s_t, a_t)), µ₀ the initial state distribution and γ ∈ (0, 1) the discount factor. To simplify notation, we define p(s₀) := µ₀(s₀) and write the MDP (S, A, p, r, γ).
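The Rockafellar-Uryasev formula for CVaR_α above lends itself to a direct empirical estimate, since the minimum over t is attained at the α-quantile of the sample. The following is a minimal numpy sketch (the function name and the toy sample are ours, not from the paper):

```python
import numpy as np

def cvar(samples, alpha):
    """Empirical CVaR_alpha of a loss sample, via the Rockafellar-Uryasev
    formula CVaR_a(X) = min_t { t + E[max(0, X - t)] / (1 - a) }.
    The minimum is attained at t = the alpha-quantile of X (its VaR)."""
    t = np.quantile(samples, alpha)
    return t + np.mean(np.maximum(samples - t, 0.0)) / (1.0 - alpha)

losses = np.array([1.0, 2.0, 3.0, 4.0])
# With alpha = 0.5, CVaR is the mean of the worst half: (3 + 4) / 2 = 3.5
```

With the loss convention used in the paper, larger values of X are worse, so the estimate averages the upper tail of the sample.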
Actions are taken following a policy π which depends on the environment state (i.e. a_t ∼ π(·|s_t)). A sequence on the MDP (S, A, p, r, γ), τ = (s₀, a₀, r₀, ..., s_H), with s_i ∈ S and a_i ∈ A, is called a trajectory. A trajectory of fixed length H ∈ N is called an episode. Given a policy π, a rollout from a state-action pair (s, a) ∈ S × A is a random sequence {(s₀, a₀, r₀), (s₁, a₁, r₁), ...} where s₀ = s, a₀ = a, s_{t+1} ∼ p(·|s_t, a_t), r_t ∼ r(·|s_t, a_t) and a_t ∼ π(·|s_t). Given a policy π and a fixed length H, we have a trajectory distribution given by

p_π(τ) = p(s₀) ∏_{t=0}^{H−1} π(a_t|s_t) p(s_{t+1}|s_t, a_t) r(r_t|s_t, a_t)

The goal of classical risk-neutral RL algorithms is to find a policy which maximizes the expected discounted return E_π[Σ_{t=0}^H γ^t r_t], where H might be infinite. Equivalently, we can look for the policy which maximizes the Q-function, defined as Q^π : S × A → R, Q^π(s_t, a_t) := E_π[Σ_{t′=t}^H γ^{t′−t} r_{t′}]. Instead of this classical objective function, other choices are possible, for example the maximum entropy RL objective E_π[Σ_{t=0}^H r_t + H(π(·|s_t))], where H denotes the entropy. This objective has an interesting connection with variational inference (Ziebart, 2010; Levine, 2018) and has shown impressive results in recent years (Haarnoja et al., 2018; Lee et al., 2019; Rafailov et al., 2021). In offline RL, we have access to a fixed dataset D = {(s_t, a_t, r_t, s_{t+1})}, where s_t ∈ S, a_t ∈ A, r_t ∼ r(·|s_t, a_t) and s_{t+1} ∼ p(·|s_t, a_t), and we aim to learn policies without interaction with the environment. Such a dataset comes with the empirical behaviour policy

π_β(a|s) := Σ_{(s_t, a_t) ∈ D} 1_{(s_t = s, a_t = a)} / Σ_{s_t ∈ D} 1_{s_t = s}

Latent variable model. There are different methods to learn directly from high-dimensional states (Kostrikov et al., 2020; Hafner et al., 2019a;b).
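The empirical behaviour policy π_β defined above is simply a ratio of visit counts. A toy sketch, assuming hashable state and action representations (the dataset layout and function name below are ours, for illustration only):

```python
from collections import Counter

def behaviour_policy(dataset):
    """Empirical behaviour policy pi_beta(a|s) of an offline dataset:
    the fraction of visits to state s in which action a was taken.
    `dataset` is a list of (s, a, r, s_next) transitions."""
    sa_counts = Counter((s, a) for s, a, _, _ in dataset)
    s_counts = Counter(s for s, _, _, _ in dataset)
    return lambda a, s: sa_counts[(s, a)] / s_counts[s]

D = [("s0", "left", 0.0, "s1"), ("s0", "right", 1.0, "s1"),
     ("s0", "left", 0.0, "s1")]
pi_beta = behaviour_policy(D)
# pi_beta("left", "s0") = 2/3, pi_beta("right", "s0") = 1/3
```

In practice states are continuous or image-valued, so π_β is never materialized this way; the sketch only illustrates the counting definition.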
In this work, we build on top of the framework presented in Stochastic Latent Actor-Critic (SLAC) (Lee et al., 2019). The main idea is to train a latent variable model which encodes the natural MDP (S, A, p, r, γ) into a latent space (Z, A, q, r, γ) and to train policies directly in this space. To achieve this, the variational distribution q(z_{1:H}, a_{t+1:H}|s_{1:t}, a_{1:t}) is factorized into a product of inference terms q(z_{i+1}|z_i, s_{i+1}, a_i), latent dynamics terms q(z_{i+1}|z_i, a_i) and policy terms π(a_i|s_{1:i}, a_{1:i−1}) as follows

q(z_{1:H}, a_{t+1:H}|s_{1:t}, a_{1:t}) = ∏_{i=0}^{t} q(z_{i+1}|z_i, s_{i+1}, a_i) ∏_{i=t+1}^{H−1} q(z_{i+1}|z_i, a_i) ∏_{i=t+1}^{H−1} π(a_i|s_{1:i}, a_{1:i−1})

Using this factorization, the evidence lower bound (ELBO) (Odaibo, 2019) and an interesting theoretical approach (Levine, 2018), the following objective function for the latent variable model is derived

E_{z_{1:t}, a_{t+1:H} ∼ q} [ Σ_{i=0}^{t} log D(s_{i+1}|z_{i+1}) − D_KL(q(z_{i+1}|s_{i+1}, z_i, a_i) ‖ q(z_{i+1}|z_i, a_i)) ]

where D_KL is the Kullback-Leibler divergence and D a decoder. Distributional RL. The goal of distributional RL is to learn the distribution of the discounted cumulative negative rewards Z^π := −Σ_{t=0}^H γ^t r_t; Z^π is a random variable. A classical approach in distributional RL is to learn Z^π implicitly through its quantile function F^{-1}_{Z(s,a)} : [0, 1] → R, defined as F^{-1}_{Z(s,a)}(y) := inf{x ∈ R | y ≤ F_{Z(s,a)}(x)}, where F_{Z(s,a)} is the cumulative distribution function of the random variable Z(s, a). A model Q_θ(η, s, a) is used to approximate F^{-1}_{Z(s,a)}(η). For (s, a, r, s′) ∼ D, a′ ∼ π(·|s′) and η, η′ ∼ Uniform[0, 1], we define δ := r + γQ_θ(η′, s′, a′) − Q_θ(η, s, a). Q_θ is trained using the τ-Huber quantile regression loss at threshold k (Huber, 1992)

L_k(δ, η) := |η − 1_{δ<0}| (δ²/2k)       if |δ| < k
             |η − 1_{δ<0}| (|δ| − k/2)   otherwise    (4)
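The quantile regression loss above can be written compactly with numpy. This is a hedged sketch of the formula, not the authors' implementation:

```python
import numpy as np

def huber_quantile_loss(delta, eta, k=1.0):
    """Quantile Huber loss at threshold k: quadratic for |delta| < k,
    linear beyond, asymmetrically weighted by |eta - 1_{delta < 0}|
    so that the minimizer is the eta-quantile of the TD target."""
    delta = np.asarray(delta, dtype=float)
    weight = np.abs(eta - (delta < 0).astype(float))
    quadratic = (delta ** 2) / (2.0 * k)
    linear = np.abs(delta) - k / 2.0
    return weight * np.where(np.abs(delta) < k, quadratic, linear)
```

For example, an over-estimation error (δ < 0) at a high quantile level η is down-weighted by |η − 1| and so penalized less than an under-estimation error of the same magnitude.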
With this quantile function, different risk measures can be computed, like Cumulative Probability Weighting (CPW) (Tversky & Kahneman, 1992), the Wang measure (Wang, 2000) or Conditional Value-at-Risk. For example, the following equation (Acerbi, 2002) is used to compute CVaR_α

CVaR_α(Z^π) = 1/(1 − α) ∫_α^1 F^{-1}_{Z(s,a)}(τ) dτ

It is also possible to extend distributional RL to an offline setting. For example, O-RAAC (Urpí et al., 2021), following previous work (Fujimoto et al., 2019), decomposes the actor into two different components: an imitation actor and a perturbation model. CODAC (Ma et al., 2021) extends DSAC (Duan et al., 2021) to an offline setting. More precisely, Q_θ(η, s, a) is trained using the following loss function

α [ E_{η∼U} E_{s∼D} log Σ_a exp(Q_θ(η, s, a)) − E_{(s,a)∼D}[Q_θ(η, s, a)] ] + L_k(δ, η′)

where U = Uniform[0, 1]. The first term, based on previous work (Kumar et al., 2020), is introduced to prevent too optimistic estimations for OOD state-action pairs. The second term, L_k(δ, η′), is the classical objective function used to train the Q-function in a distributional setting.
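The CVaR integral above suggests a simple Monte Carlo estimator: average the quantile function over τ ~ Uniform[α, 1]. A sketch, where the closed-form check uses a Uniform[0, 1] toy variable for which F^{-1}(τ) = τ and hence CVaR_α = (1 + α)/2 (the function name and interface are ours):

```python
import numpy as np

def cvar_from_quantile(quantile_fn, alpha, n=10_000, rng=None):
    """Estimate CVaR_a(Z) = 1/(1-a) * int_a^1 F_Z^{-1}(tau) dtau by
    averaging the quantile function over tau ~ Uniform[alpha, 1].
    `quantile_fn` plays the role of a learned critic Q_theta(eta, s, a)
    with the state-action pair fixed."""
    rng = rng or np.random.default_rng(0)
    taus = rng.uniform(alpha, 1.0, size=n)
    return quantile_fn(taus).mean()

# Uniform[0, 1] loss: cvar_from_quantile(lambda t: t, 0.7) is close to 0.85
```

This is exactly how a distributional critic can be turned into a CVaR estimate at policy-improvement time: sample quantile levels in the tail and average the predicted quantiles.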

4. THEORETICAL CONSIDERATIONS

In this paper, we aim to train policies in high-dimensional stochastic and offline settings. From a practical point of view, our idea is quite straightforward. Since training directly in the high-dimensional space fails, and following previous work (Rafailov et al., 2021; Lee et al., 2019), we encode our high-dimensional states into a more compact representation using φ : S → Z, where Z = φ(S). Using φ, we build an MDP on the latent space and train policies directly on top of this space. However, from a theoretical point of view, this general idea is not really clear. Is this latent MDP always well defined? If we find a policy which minimizes a risk measure in the latent space, will it also minimize the risk measure in the natural space? The first goal of this section is to rigorously construct an MDP in the latent space. Then, we show that for particular coherent risk measures and under some hypotheses, minimizing the risk measure in the latent space and in the natural space is equivalent. In particular, we show that this is the case for Conditional Value-at-Risk.

4.1. THEORETICAL RESULTS

First, we make the following assumptions:

1. ∀a ∈ A, we have r(·|s, a) = r(·|s′, a) if φ(s) = φ(s′).
2. Denoting P(·|s, a) the transition probability measure, ∀a ∈ A and every measurable B ⊆ Z, P(φ^{-1}(B)|s, a) = P(φ^{-1}(B)|s′, a) if φ(s) = φ(s′).

Under these assumptions, φ induces an MDP (Z, A, q, r′, γ), where the reward distribution r′ is defined as r′(·|z, a) := r(·|s, a) for any s ∈ φ^{-1}(z), and the transition distribution q as q(B|z, a) := P(φ^{-1}(B)|s, a) for any s ∈ φ^{-1}(z). r′ is well defined by surjectivity of φ and by assumption (1); q is well defined by assumption (2). Obviously, for a given policy π′ on the MDP (Z, A, q, r′, γ) and fixed length H, we have a trajectory distribution

q_{π′}(τ′) = q(z₀) ∏_{t=0}^{H−1} π′(a_t|z_t) q(z_{t+1}|z_t, a_t) r′(r_t|z_t, a_t)

Given a policy π and a fixed length H, we denote Ω the set of all trajectories on the MDP (S, A, p, r, γ), F the σ-algebra generated by these trajectories and P_π the probability measure with probability distribution p_π. Following the same idea, and given a policy π′, we denote Ω′ the set of trajectories of length H on the MDP (Z, A, q, r′), F′ the σ-algebra generated by these trajectories and Q_{π′} the probability measure with probability distribution q_{π′}. (Ω, F, P_π) and (Ω′, F′, Q_{π′}) are probability spaces. φ : S → Z induces a map

Ω → Ω′, (s₀, a₀, r₀, ..., s_H) ↦ (φ(s₀), a₀, r₀, ..., φ(s_H))

With a slight abuse of notation, we also write this map φ. We denote Π the set of all policies π on the MDP (S, A, p, r) which satisfy π(a|s) = π(a|s′) if φ(s) = φ(s′). If π ∈ Π, then π induces a policy π′ on the MDP (Z, A, q, r′) by taking π′(a|z) := π(a|s), where s is any element of φ^{-1}(z). For any π ∈ Π we denote π′ the associated policy on the MDP (Z, A, q, r′) as defined above. Furthermore, we note Π′ := {π′ | π ∈ Π}. Conversely, if π′ ∈ Π′, a policy π ∈ Π can be defined by π(a|s) := π′(a|φ(s)). Finally, X denotes a random variable on Ω and X′ a random variable on Ω′. With all these notations, we can introduce our first result. Lemma 4.1.1. Let π ∈ Π. Then, Q_{π′} is the image probability measure of P_π under φ. This result is not really surprising, but it has an interesting implication. Indeed, it implies that if X′ ∘ φ = X, then E_{q_{π′}}[X′] = E_{p_π}[X].
And thus, we get the following result. Proposition 4.1.3. Let R be a coherent risk measure. Suppose that X′ ∘ φ = X and U = U′ ∘ φ. Then, if a policy π′ satisfies π′ = argmin_{π′∈Π′} R(X′), its associated policy π verifies π = argmin_{π∈Π} R(X). This result points out the role of the risk envelope in this equivalence. Moreover, the conclusion of the last proposition also holds whenever sup_{δ′∈U′} E_{q_{π′}}[δ′X′] = sup_{δ∈U} E_{p_π}[δX]. Fortunately, this is the case for Conditional Value-at-Risk. Lemma 4.1.4. Let U, U′ be the risk envelopes associated to CVaR_α(X) and CVaR_α(X′) respectively. Then, we have

sup_{δ′∈U′} E_{q_{π′}}[δ′X′] = sup_{δ∈U} E_{p_π}[δX]

Therefore, we can deduce the following result. Corollary 4.1.5. Suppose that X′ ∘ φ = X. Then, if a policy π′ satisfies π′ = argmin_{π′∈Π′} CVaR_α(X′), its associated policy π verifies π = argmin_{π∈Π} CVaR_α(X). In particular, for X(τ) := Σ_{t=0}^H −γ^t r_t − βH(π(·|s_t)) and X′(τ′) := Σ_{t=0}^H −γ^t r_t − βH(π′(·|z_t)) (with β ∈ R), we have

(X′ ∘ φ)(τ) = Σ_{t=0}^H −γ^t r_t − βH(π′(·|φ(s_t))) = Σ_{t=0}^H −γ^t r_t − βH(π(·|s_t)) = X(τ)

Thus, we obtain the following result. Corollary 4.1.6. Let X : Ω → R and X′ : Ω′ → R be defined as X(τ) := Σ_{t=0}^H −γ^t r_t − βH(π(·|s_t)) and X′(τ′) := Σ_{t=0}^H −γ^t r_t − βH(π′(·|z_t)). Then, if a policy π′ satisfies π′ = argmin_{π′∈Π′} CVaR_α(X′), its associated policy π verifies π = argmin_{π∈Π} CVaR_α(X).

4.2. DISCUSSION

The results presented in the last section provide a theoretical equivalence between minimizing the risk measure in the latent space and in the natural space. However, to obtain this guarantee we need to make some assumptions. Assumptions (1) and (2) guarantee that the reward distribution r(·|s_t, a_t) and the probability measure P(·|s_t, a_t) are insensitive to the choice of representative in φ^{-1}(z_t). Moreover, π ∈ Π ensures that the policy distribution is likewise stable within φ^{-1}(z_t). Roughly speaking, these assumptions guarantee that we do not lose information when encoding s_t into φ(s_t), in terms of reward distribution, transition probability measure and optimal policy. These theoretical considerations point out that, to learn a good latent representation of the natural MDP, we should consider all components of the MDP and not only focus on the environment space.

5. LATENT OFFLINE DISTRIBUTIONAL ACTOR-CRITIC

The theoretical results presented above justify our natural idea: encode the natural environment space S into a smaller representation Z and then use a risk-sensitive offline RL algorithm to learn a policy on top of this space. This is the general idea of LODAC.

5.1. PRACTICAL IMPLEMENTATIONS OF LODAC

In this section, we present the practical implementation of LODAC. First, we need to learn a latent variable model. To train such a model, and similarly to Rafailov et al. (2021), the following objective function is used

E_{q_θ} [ Σ_{t=0}^{H−1} log D(s_{t+1}|z_{t+1}) − D_KL(φ_θ(z_{t+1}|s_{t+1}, z_t, a_t) ‖ q_θ(z_{t+1}|z_t, a_t)) ]    (7)

Then, the dataset D is encoded into the latent space and stored in a replay buffer B_latent. More precisely, B_latent contains transitions of the form (z_{1:H}, r_{1:H}, a_{1:H}) where z_{1:H} ∼ φ_θ(·|s_{1:H}, a_{1:H−1}) and (s_{1:H}, r_{1:H−1}, a_{1:H−1}) ∼ D. After that, we introduce a latent buffer B_synthetic, which contains rollout transitions performed using the policy π_θ, the latent model q_θ and a reward estimator r_θ. The actor π_θ and the critic Q_θ are trained on B := B_synthetic ∪ B_latent. To achieve this, we could use any offline risk-sensitive RL algorithm, but based on our empirical results we follow Ma et al. (2021). Thus, the critic Q_θ(η, z_t, a_t) is used to approximate the inverse quantile function. Q_θ is iteratively chosen to minimize

α [ E_{η∼U} E_{z∼B} log Σ_a exp(Q_θ(η, z, a)) − E_{(z,a)∼B}[Q_θ(η, z, a)] ] + L_k(δ, η′)

where U = Uniform[0, 1] and δ = r_t + γQ_θ(η′, z_{t+1}, a_{t+1}) − Q_θ(η, z_t, a_t), with (z_t, a_t, r_t) ∼ B, a_{t+1} ∼ π_θ(·|z_{t+1}) and L_k the τ-Huber quantile regression loss as defined in (4). The actor is trained to minimize the Conditional Value-at-Risk of the negative cumulative reward, which can be computed using the formula

CVaR_α(Z^π) = 1/(1 − α) ∫_α^1 F^{-1}_{Z(s,a)}(τ) dτ

Following previous works (Rafailov et al., 2021; Yu et al., 2021b), batches of data mixed in equal proportion from B_latent and B_synthetic are used to train π_θ and Q_θ. A summary of LODAC can be found in Appendix D.
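The critic objective above combines a conservative log-sum-exp penalty with the quantile TD loss. A schematic numpy version, assuming a discrete grid of candidate actions so that the log-sum-exp over actions is tractable (the function and argument names are ours, and the exponentiation is the naive, non-stabilized form):

```python
import numpy as np

def conservative_critic_loss(q_all_actions, q_data, td_loss, alpha_pen):
    """Schematic conservative critic loss.
    q_all_actions: array (batch, n_actions), Q_theta(eta, z, a) on an
        action grid (used for the log-sum-exp over actions);
    q_data: array (batch,), Q_theta(eta, z, a) on dataset actions;
    td_loss: precomputed quantile Huber loss L_k(delta, eta');
    alpha_pen: weight of the conservative penalty."""
    logsumexp = np.log(np.exp(q_all_actions).sum(axis=1))  # per latent state
    gap = logsumexp.mean() - q_data.mean()  # penalizes OOD over-estimation
    return alpha_pen * gap + td_loss
```

Pushing down the log-sum-exp while pushing up the dataset values shrinks Q-estimates exactly where the data gives no support, which is the conservatism mechanism borrowed from Kumar et al. (2020) and Ma et al. (2021).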

5.2. EXPERIMENTAL SETUP

In this section, we evaluate the performance of LODAC. First, our method is compared with LOMPO (Rafailov et al., 2021), an offline high-dimensional risk-free algorithm. Then, we build a version of LODAC where the actor and the critic are trained using O-RAAC (Urpí et al., 2021); we denote it LODAC-O. Moreover, it is also possible to use a risk-free offline RL algorithm; thus, a risk-free policy is also trained in the latent space using COMBO (Yu et al., 2021b). These algorithms are evaluated on the standard walker walk task from the DeepMind Control suite (Tassa et al., 2018), but here we learn directly from pixels. As is standard practice, each action is repeated twice on the ground environment and episodes of length 1000 are used. These algorithms are tested on three different datasets: expert, medium and expert-replay. Each dataset consists of 100K transition steps. For the stochastic environment, we transform the reward using the following formula

r_t ∼ r(s_t, a_t) − λ 1_{r(s_t, a_t) > r̄} B_{p₀}

where r is the classical reward function, r̄ and λ are hyperparameters and B_{p₀} is a Bernoulli distribution with parameter p₀. In our experiments, we choose λ = 8 and p₀ = 0.1. Different values of r̄ are used for each dataset, such that about half of the states satisfy r(s_t, a_t) > r̄. More details regarding the construction of these datasets can be found in Appendix B. The algorithms have been tested using the following procedure. First of all, to reduce computation time and for a more accurate comparison, we use the same latent variable model for all algorithms. We evaluate each algorithm using 100 episodes, reporting the mean and CVaR_α of the returns. LODAC and LODAC-O are trained to minimize CVaR_{0.7}; LOMPO and COMBO are trained to maximize the return. We run 4 different random seeds. As introduced in Fu et al. (2020), we use the normalized score to compare our algorithms.
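The reward perturbation above is easy to reproduce. A sketch with the paper's λ = 8 and p₀ = 0.1 as defaults (the function name and array interface are ours, not from the paper):

```python
import numpy as np

def stochastic_reward(r, r_bar, lam=8.0, p0=0.1, rng=None):
    """Stochastic reward transformation: with probability p0, rewards
    above the threshold r_bar are penalized by lam. r_bar is chosen per
    dataset so that about half the transitions exceed it."""
    rng = rng or np.random.default_rng(0)
    bernoulli = rng.random(np.shape(r)) < p0  # B_{p0} samples
    return np.asarray(r, dtype=float) - lam * (np.asarray(r) > r_bar) * bernoulli
```

The transformation only affects high-reward transitions, so a risk-neutral agent chasing the mean return is exactly the one exposed to the rare penalty.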
More precisely, a score of 0 corresponds to a fully random policy and a score of 100 corresponds to an expert policy on the deterministic task. However, as suggested in Agarwal et al. (2021), instead of taking the mean of the results, we consider the interquartile mean (IQM).
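The interquartile mean is the mean of the scores lying between the first and third quartiles. A minimal sketch using a simple index-based truncation (an illustration of the idea, not the exact stratified-bootstrap estimator of Agarwal et al. (2021)):

```python
import numpy as np

def interquartile_mean(scores):
    """Interquartile mean (IQM): the mean of the middle 50% of runs,
    which is more robust to outlier seeds than the plain mean."""
    scores = np.sort(np.asarray(scores, dtype=float).ravel())
    n = len(scores)
    return scores[n // 4 : n - n // 4].mean()

# interquartile_mean([1, 2, 3, 4, 100]) keeps [2, 3, 4] and returns 3.0
```

With only a handful of seeds per configuration, a single lucky or unlucky run can dominate the mean; the IQM discards the top and bottom quarters before averaging.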

5.3. RESULTS DISCUSSION

In this section, we discuss the results of our experiments. Our experimental results can be read in Table 1 for the stochastic environment and in Table 2 for the deterministic environment. We bold the highest score across all methods. Complete results of all the different tests can be found in Appendix E. Stochastic environment. The first general observation is that LODAC and LODAC-O generally outperform risk-free algorithms. The only exception is the expert-replay dataset, where LODAC-O provides worse results. This is not really surprising, since actors trained with O-RAAC contain an imitation component, and the imitation agent obviously obtains really poor performance on this dataset. Then, it can be noticed that LODAC provides really interesting results. Indeed, it significantly outperforms risk-free algorithms in terms of CVaR_{0.7} and return on all datasets. Moreover, it provides better results than LODAC-O on the medium and expert-replay datasets, while achieving comparable results on the expert dataset. A final observation is that risk-free RL policies generally provide bad performance on this stochastic environment. Deterministic environment. First, a drop of performance between the deterministic and the stochastic environment can be noticed. This is not really surprising, since stochastic environments are more challenging than deterministic ones. However, this modification seems to affect risk-free algorithms more than LODAC-O or LODAC. Indeed, we get a difference of more than 27% with LOMPO and COMBO on the medium and expert-replay datasets in terms of return between the deterministic and the stochastic environment. This difference even reaches 42% for COMBO on the expert-replay dataset. For LODAC-O on the same tasks, we obtain a deterioration of less than 17%. In contrast to the previous risk-free methods and, to a lesser degree, LODAC-O, adding stochasticity to the dataset does not seem to have a big impact on the performance of LODAC.
Indeed, with this algorithm we observe a drop of performance of less than 9%. For the medium dataset, a difference of only 5.23% can even be noticed.

6. CONCLUSION

While offline RL appears to be an interesting paradigm for real-world applications, some of these real-world applications are high-dimensional and stochastic. However, current high-dimensional offline RL algorithms are trained and tested in deterministic environments. Our empirical results suggest that adding stochasticity to the training dataset significantly decreases the performance of high-dimensional risk-free offline RL algorithms. Based on this observation, we developed LODAC, which can be used to train policies in high-dimensional stochastic and offline settings. Our theoretical considerations in Section 4 show that our algorithm relies on a strong theoretical foundation. In contrast to previous risk-free methods, adding stochasticity to the dataset does not seem to have a big impact on the performance of LODAC. Finally, using LODAC to minimize Conditional Value-at-Risk empirically outperforms previous algorithms in terms of CVaR and return on stochastic high-dimensional environments.

A PROOFS

A.1 LEMMA 4.1.1

Lemma A.1.1. Let π ∈ Π. Then, Q_{π′} is the image probability measure of P_π under φ.

Proof. Let B′ := Z₀ × A₀ × R₀ × ... × Z_H ∈ F′. We have

Q_{π′}(B′) = ∫_{B′} q_{π′}(τ′) dτ′
= ∫_{Z₀} q(z₀) ∫_{A₀×R₀×...×Z_H} π′(a₀|z₀) r′(r₀|z₀, a₀) ... q(z_H|z_{H−1}, a_{H−1}) da₀ dr₀ ... dz_H dz₀
= ∫_{φ^{-1}(Z₀)} p(s₀) ∫_{A₀×R₀×...×Z_H} π′(a₀|φ(s₀)) r′(r₀|φ(s₀), a₀) ... q(z_H|z_{H−1}, a_{H−1}) da₀ dr₀ ... dz_H ds₀
= ∫_{φ^{-1}(Z₀)} p(s₀) ∫_{A₀×R₀×...×Z_H} π(a₀|s₀) r(r₀|s₀, a₀) ... q(z_H|z_{H−1}, a_{H−1}) da₀ dr₀ ... dz_H ds₀
= ∫_{φ^{-1}(Z₀)×A₀×R₀} p(s₀) π(a₀|s₀) r(r₀|s₀, a₀) ∫_{Z₁×A₁×...×Z_H} q(z₁|φ(s₀), a₀) π′(a₁|z₁) ... q(z_H|z_{H−1}, a_{H−1}) dz₁ ... dz_H ds₀ da₀ dr₀
= ∫_{φ^{-1}(Z₀)×A₀×R₀×φ^{-1}(Z₁)} p(s₀) π(a₀|s₀) r(r₀|s₀, a₀) p(s₁|s₀, a₀) ∫_{A₁×...×Z_H} π′(a₁|φ(s₁)) ... q(z_H|z_{H−1}, a_{H−1}) da₁ ... dz_H ds₀ da₀ dr₀ ds₁
= ...
= ∫_{φ^{-1}(B′)} p_π(τ) dτ = P_π(φ^{-1}(B′))

A.2 PROPOSITION 4.1.3

Proposition A.2.1. Let R be a coherent risk measure. Suppose that X′ ∘ φ = X and U = U′ ∘ φ. Then, if a policy π′ satisfies π′ = argmin_{π′∈Π′} R(X′), its associated policy π verifies π = argmin_{π∈Π} R(X).

Proof. Since U = U′ ∘ φ, we have sup_{δ′∈U′} E_{q_{π′}}[δ′X′] = sup_{δ∈U} E_{p_π}[δX]. Thus

R_{q_{π′}}(X′) := sup_{δ′∈U′} E_{q_{π′}}[δ′X′] = sup_{δ∈U} E_{p_π}[δX] = R_{p_π}(X)

Now, let π′ = argmin_{π′∈Π′} R_{q_{π′}}(X′) and π the associated policy on S. By contradiction, suppose there exists π₁ with R_{p_{π₁}}(X) < R_{p_π}(X). But in this case, by the above observation, we would have R_{q_{π′₁}}(X′) < R_{q_{π′}}(X′).

A.3 LEMMA 4.1.4

Lemma A.3.1. Let U, U′ be the risk envelopes associated to CVaR_α(X) and CVaR_α(X′) respectively. Then, we have sup_{δ′∈U′} E_{q_{π′}}[δ′X′] = sup_{δ∈U} E_{p_π}[δX].

Proof. Recall that the risk envelope of CVaR_α takes the form U = {δ | 0 ≤ δ ≤ 1/α, E[δ] = 1}.

• Let δ′ ∈ U′. We define δ := δ′ ∘ φ. By construction, we have 0 ≤ δ ≤ 1/α, E_{p_π}[δ] = E_{q_{π′}}[δ′] = 1 and E_{q_{π′}}[δ′X′] = E_{p_π}[δX].
• Let δ ∈ U. By definition, δp_π is a density function.
Let P̃_π be its associated probability measure and let Q̃_{π′} be the image probability measure of P̃_π under φ. Remark that Q̃_{π′} is absolutely continuous with respect to Q_{π′}. Indeed, if A ∈ F′ satisfies Q_{π′}(A) = 0, we have

Q̃_{π′}(A) = P̃_π(φ^{-1}(A)) = ∫_{φ^{-1}(A)} δ(τ) p_π(τ) dτ ≤ (1/α) P_π(φ^{-1}(A)) = (1/α) Q_{π′}(A) = 0

Thus, by the Radon-Nikodym theorem, there exists δ′ ≥ 0 such that for all B ∈ F′

Q̃_{π′}(B) = ∫_B δ′ dQ_{π′} = ∫_B δ′(τ′) q_{π′}(τ′) dτ′

We will show that δ′ ∈ U′. We define C := {τ′ ∈ Ω′ | δ′(τ′) > 1/α}. By contradiction, suppose that Q_{π′}(C) > 0. On the one hand, we have

Q̃_{π′}(C) = ∫_C δ′(τ′) q_{π′}(τ′) dτ′ > (1/α) ∫_C q_{π′}(τ′) dτ′ = (1/α) Q_{π′}(C)

On the other hand, we have

Q̃_{π′}(C) = P̃_π(φ^{-1}(C)) = ∫_{φ^{-1}(C)} δ(τ) p_π(τ) dτ ≤ (1/α) P_π(φ^{-1}(C)) = (1/α) Q_{π′}(C)

Thus, we must have Q_{π′}(C) = 0, so 0 ≤ δ′ ≤ 1/α a.e. and therefore δ′ ∈ U′. And since δ′q_{π′} is the density function of the image probability measure of P̃_π, we obtain E_{δp_π}[X] = E_{δ′q_{π′}}[X′].

B DATASETS

In this section, we describe more precisely how we built our training datasets.

• Expert. For the expert dataset, actions are chosen according to an expert policy trained online, in a risk-free environment, using SAC for 500K training steps. For this training, the states are the classical states provided by the DeepMind Control suite.
• Medium. In this dataset, actions are chosen according to a policy trained using the same method as above, but here the training was stopped when it reached about half the performance of the expert policy.
• Expert replay. The expert-replay dataset consists of episodes sampled from the expert policy during its training.

The setup presented above builds the deterministic datasets. For the stochastic environment, we transform the reward using the following formula

r_t ∼ r(s_t, a_t) − λ 1_{r(s_t, a_t) > r̄} B_{p₀}

where r is the classical reward function, r̄ and λ are hyperparameters and B_{p₀} is a Bernoulli distribution with parameter p₀. In our experiments, we choose λ = 8 and p₀ = 0.1. Different values of r̄ are used for each dataset, such that about half of the states satisfy r(s_t, a_t) > r̄.

C IMPLEMENTATIONS DETAILS

In this section, we present some details related to our implementation.

Latent variable model. Following previous works (Rafailov et al., 2021; Yu et al., 2021b), our latent variable model contains the following components:
$$\begin{aligned} \text{Image encoder:}&\quad h_t = E_\theta(s_t)\\ \text{Inference model:}&\quad z_t \sim \phi_\theta(\cdot \mid h_t, z_{t-1}, a_{t-1})\\ \text{Latent transition model:}&\quad z_t \sim q_\theta(\cdot \mid z_{t-1}, a_{t-1})\\ \text{Reward estimator:}&\quad r_t \sim r_\theta(\cdot \mid z_t)\\ \text{Image decoder:}&\quad s_t \sim D_\theta(\cdot \mid z_t) \end{aligned}$$
The image encoder $E_\theta$ is a classical convolutional neural network. The inference and latent transition models are implemented as a Recurrent State Space Model (RSSM) (Hafner et al., 2019b). More precisely, a latent state $z_t$ has two components, $z_t = [d_t, x_t]$, where $d_t$ is the deterministic part and $x_t$ the stochastic part. $d_t$ is computed using a GRU cell, $d_t = f_\theta(z_{t-1}, a_{t-1})$, and the stochastic component $x_t$ is computed using $x_t \sim \phi_\theta(\cdot \mid h_t, d_t)$ for the inference model and $x_t \sim q_\theta(\cdot \mid d_t, x_{t-1}, a_{t-1})$ for the latent transition model. Both $q_\theta$ and the reward estimator are implemented as MLPs. Finally, the decoder $D_\theta$ is a deconvolutional neural network.

Actor-critic. First, recall that our initial critic loss function is
$$\max_{\alpha \ge 0}\; \alpha\, \mathbb E_{\eta \sim U}\Big[ \mathbb E_{z \sim \mathcal B}\big[ \log \textstyle\sum_a \exp(Q_\theta(\eta, z, a)) \big] - \mathbb E_{(z,a) \sim \mathcal B}\big[ Q_\theta(\eta, z, a) \big] \Big] + L_k(\delta, \eta').$$
Remark that if this difference (scaled by $\omega$) is smaller than the parameter $\zeta$, then $\alpha$ will be set to $0$ and only the term $L_k(\delta, \eta')$ will be taken into account. Moreover, since the action space $\mathcal A$ is continuous, the computation of $\log \sum_a \exp(Q_\theta(\eta, z, a))$ is intractable. To avoid this problem, as introduced in Kumar et al. (2020), we use a sampling-based approximation of this term. Then, we use an implicit quantile network (IQN) (Dabney et al., 2018) to represent our critic $Q_\theta(\eta, z, a)$, and the actor $\pi_\theta(\cdot \mid z)$ consists of a simple MLP which is trained to minimize
$$\mathrm{CVaR}_\alpha(Z^\pi) = \frac{1}{1-\alpha} \int_\alpha^1 Q_\theta(\eta, z, a)\, d\eta.$$
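The actor objective above can be estimated by sampling $\eta \sim \mathrm{Unif}(\alpha, 1)$, since the $\frac{1}{1-\alpha}$ factor then cancels against the sampling density. A minimal sketch, with an empirical quantile function standing in for the learned critic $Q_\theta$ (NumPy and the Gaussian return distribution are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.7

# Empirical return distribution standing in for the quantile outputs
# of the learned critic Q_theta(eta, z, a).
returns = rng.normal(size=100_000)

def q(eta):                      # eta-quantile of the return distribution
    return np.quantile(returns, eta)

# Monte Carlo estimate of CVaR_alpha(Z) = 1/(1-alpha) * int_alpha^1 q(eta) d eta:
# with eta ~ Unif(alpha, 1), the 1/(1-alpha) factor cancels and the
# objective is a plain average of sampled quantile values.
etas = rng.uniform(alpha, 1.0, size=4096)
cvar_mc = q(etas).mean()

# Direct mean of the upper (1-alpha) tail for comparison.
tail = returns[returns > np.quantile(returns, alpha)]
print(abs(cvar_mc - tail.mean()) < 0.05)
```

This is why an IQN critic is convenient here: the same network that is trained on quantile regression targets directly provides the $q(\eta)$ samples needed for the actor's CVaR gradient.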

D MAIN ALGORITHM

The full implementation of LODAC can be read in Algorithm 1.

Algorithm 1: LODAC
Input: dataset D, model train steps, initial latent steps, number of iterations, number of actor-critic steps, rollout length L
1: for model train steps do
2: Sample a batch of sequences (s_{1:H}, a_{1:H-1}, r_{1:H-1}) from D and train the variational latent model using equation (7) and the reward estimator r_θ.
Sample a random action a_{h-1} and z_h ∼ q_θ(·|z_{h-1}, a_{h-1}); estimate the reward r_h ∼ r_θ.
13: Add (z_{h-1}, a_{h-1}, z_h, r_h) to B_synthetic.

E COMPLETE EXPERIMENTAL RESULTS

In this section, we present our complete empirical results. More precisely, Table 4 reports all results for the stochastic environment, and Table 3 all results for the deterministic environment.



Corollary 4.1.2. Suppose $X' \circ \varphi = X$. Then, if a policy $\pi'$ satisfies $\pi' = \arg\min_{\pi' \in \Pi'} \mathbb E_{q_{\pi'}}[X']$, its associated policy $\pi$ satisfies $\pi = \arg\min_{\pi \in \Pi} \mathbb E_{p_\pi}[X]$.

For a coherent risk measure $R$, we write $U$ the risk envelope associated to $R$ in $\Omega$ and $U'$ the corresponding risk envelope in $\Omega'$. If for all $\delta \in U$ there exists $\delta' \in U'$ such that $\delta = \delta' \circ \varphi$, and for all $\delta' \in U'$ there exists $\delta \in U$ with $\delta = \delta' \circ \varphi$ almost everywhere, we write $U = U' \circ \varphi$.

Proposition 4.1.3. Let $R$ be a coherent risk measure. Suppose that $X' \circ \varphi = X$ and $U = U' \circ \varphi$. Then, if a policy $\pi'$ satisfies $\pi' = \arg\min_{\pi' \in \Pi'} R(X')$, its associated policy $\pi$ verifies $\pi = \arg\min_{\pi \in \Pi} R(X)$.
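A quick numerical illustration of the premise of Corollary 4.1.2: when the cost factors through the encoder, expectations in the natural and latent spaces coincide by the image-measure (change-of-variables) identity, so the two minimization problems share the same objective values. The map and cost below are arbitrary stand-ins, not the paper's encoder.

```python
import numpy as np

rng = np.random.default_rng(0)

phi = lambda tau: np.tanh(tau)        # stand-in "encoder" from Omega to Omega'
X_latent = lambda z: z ** 2           # cost X' defined on the latent space
X = lambda tau: X_latent(phi(tau))    # natural-space cost, X = X' o phi

tau = rng.normal(size=200_000)        # samples from p_pi on Omega
z = phi(tau)                          # corresponding samples from the image measure

# E_{p_pi}[X] and E_{q_pi}[X'] agree: minimizing one minimizes the other.
print(abs(X(tau).mean() - X_latent(z).mean()) < 1e-12)  # True
```

The agreement is exact here because the latent samples are the pushforward of the natural ones; Proposition 4.1.3 extends the same argument from expectations to coherent risk measures via the matching risk envelopes.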

$$\int_{\varphi^{-1}(Z_0) \times \cdots \times \varphi^{-1}(Z_H)} p(s_0)\, \pi(a_0 \mid s_0)\, r(r_0 \mid a_0, s_0) \cdots p(s_H \mid s_{H-1}, a_{H-1})\, ds_0 \cdots ds_H,$$
where $\varphi^{-1}(Z_0) \times \cdots \times \varphi^{-1}(Z_H) = \varphi^{-1}(B')$.

But, following Kumar et al. (2020) and Ma et al. (2021), we add two parameters $\zeta, \omega \in \mathbb R_{>0}$ to the last equation:
$$\max_{\alpha \ge 0}\; \alpha\, \mathbb E_{\eta \sim U}\Big[ \omega \Big( \mathbb E_{z \sim \mathcal B}\big[ \log \textstyle\sum_a \exp(Q_\theta(\eta, z, a)) \big] - \mathbb E_{(z,a) \sim \mathcal B}\big[ Q_\theta(\eta, z, a) \big] \Big) - \zeta \Big] + L_k(\delta, \eta').$$
$\zeta$ thresholds the difference between $\mathbb E_{(z,a) \sim \mathcal B}[Q_\theta(\eta, z, a)]$ and the regularizer $\mathbb E_{z \sim \mathcal B}[\log \sum_a \exp(Q_\theta(\eta, z, a))]$. The parameter $\omega$ scales this difference.

the $M$ actions are sampled from $\mathrm{Unif}(\mathcal A)$, and we choose $M = 10$.
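The intractable $\log \sum_a \exp(Q_\theta(\eta, z, a))$ term can be estimated by importance sampling with $M$ actions drawn from $\mathrm{Unif}(\mathcal A)$. A minimal sketch on a toy one-dimensional action space; the quadratic $Q$ and the pure-uniform proposal are illustrative assumptions (Kumar et al. (2020) also mix in samples from the current policy).

```python
import numpy as np

rng = np.random.default_rng(0)
M = 10                                   # number of sampled actions, as in the paper

def Q(a):
    """Toy stand-in for Q_theta(eta, z, .) on a 1-D action space."""
    return -2.0 * (a - 0.3) ** 2

# Action space A = [-1, 1]; the Unif(A) proposal has density 1/(hi - lo).
lo, hi = -1.0, 1.0
dens = 1.0 / (hi - lo)

def lse_estimate(m):
    """Importance-sampling estimate of log int_A exp(Q(a)) da with a ~ Unif(A)."""
    a = rng.uniform(lo, hi, size=m)
    return np.log(np.mean(np.exp(Q(a)) / dens))

# Ground truth by midpoint-rule quadrature.
n = 100_000
dx = (hi - lo) / n
grid = lo + dx * (np.arange(n) + 0.5)
truth = np.log(np.sum(np.exp(Q(grid))) * dx)

est_m10 = lse_estimate(M)          # noisy estimate, as used per gradient step
est_big = lse_estimate(1_000_000)  # consistency check: converges to the truth
print(abs(est_big - truth) < 0.01)
```

With only $M = 10$ samples each estimate is noisy, but the noise averages out over many stochastic gradient steps, which is why such a small $M$ is workable in practice.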

of sequences (s_{1:H}, a_{1:H-1}, r_{1:H-1}) from D.
6: Sample z_{1:H} from the latent model and add the transitions (z_{1:H}, a_{1:H-1}, r_{1:H-1}) to B_latent.
7: end
8: for initial synthetic latent steps do
9: Sample a batch of sequences (s_{1:H}, a_{1:H-1}, r_{1:H-1}) from D.
10: Sample a set of latent states S ∼ φ_θ(·|s_{1:H}, a_{1:H-1}) from the trained latent model.
11: for s_0 ∈ S do
12: for h ∈ {1 : L} do

B_latent ∪ B_synthetic.
20: Train the critic to minimize (8).
21: Train the actor to minimize CVaR_α(Z^π), which can be computed using Q_θ(η, z, a) and equation (9).

of sequences (s_{1:H}, a_{1:H-1}, r_{1:H-1}) from D.
25: Sample a set of latent states S ∼ φ_θ(·|s_{1:H}, a_{1:H-1}) from the trained latent model.
26: for s_0 ∈ S do
27: for h ∈ {1 : L} do
Sample action a_{h-1} ∼ π_θ(·|z_{h-1}) and z_h ∼ q_θ(·|z_{h-1}, a_{h-1}); estimate the reward r_h ∼ r_θ.
28: Add (z_{h-1}, a_{h-1}, z_h, r_h) to B_synthetic.
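The synthetic latent rollout loop above can be sketched as follows. All components here (the tanh dynamics, the random policy, and the linear reward head) are toy stand-ins for the trained networks $q_\theta$, $\pi_\theta$, and $r_\theta$, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
L, z_dim, a_dim = 5, 8, 2               # rollout length and toy dimensions

# Toy stand-ins for the trained components (assumptions, not the paper's nets):
W_dyn = rng.normal(scale=0.3, size=(z_dim, z_dim + a_dim))
w_rew = rng.normal(size=z_dim)

def latent_transition(z, a):            # q_theta(.|z_{h-1}, a_{h-1}), here Gaussian
    mean = np.tanh(W_dyn @ np.concatenate([z, a]))
    return mean + 0.01 * rng.normal(size=z_dim)

def policy(z):                          # pi_theta(.|z_{h-1}), here a random policy
    return np.tanh(rng.normal(size=a_dim))

def reward_model(z):                    # r_theta, here a linear head
    return float(w_rew @ z)

# Synthetic rollout: start from an encoded latent state and unroll the
# latent model, storing transitions in B_synthetic for actor-critic training.
B_synthetic = []
z = rng.normal(size=z_dim)              # z_0 ~ phi_theta(.|s_{1:H}, a_{1:H-1}), here random
for h in range(1, L + 1):
    a = policy(z)
    z_next = latent_transition(z, a)
    r = reward_model(z_next)
    B_synthetic.append((z, a, z_next, r))
    z = z_next

print(len(B_synthetic))                 # prints 5: one transition per step
```

Because the rollout never touches the image decoder, generating synthetic transitions stays cheap even for high-dimensional observations; only the low-dimensional latent dynamics are unrolled.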

Performance for the stochastic offline high-dimensional walker walk task on expert, medium and expert replay datasets. We compare the mean and the CVaR 0.7 of the returns. LODAC outperforms other algorithms on the medium and expert replay datasets while achieving comparable results on the expert dataset. We bold the highest score across all methods.

Performance for the deterministic offline high-dimensional walker walk task on expert, medium and expert replay datasets. We compare the mean and the CVaR 0.7 of the returns. We bold the highest score across all methods.

