LATENT OFFLINE DISTRIBUTIONAL ACTOR-CRITIC

Abstract

Offline reinforcement learning (RL) has emerged as a promising paradigm for real-world applications, since it aims to train policies directly from datasets of past interactions with the environment. In the past few years, algorithms have been introduced to learn policies from high-dimensional observational states in offline settings. The general idea of these methods is to encode the environment into a smaller latent space and to train policies on top of this smaller representation. In this paper, we extend this general method to stochastic environments (i.e. environments where the reward function is stochastic) and consider a risk measure instead of the classical expected return. First, we show that under some assumptions it is equivalent to minimize a risk measure in the latent space and in the natural space. Based on this result, we present Latent Offline Distributional Actor-Critic (LODAC), an algorithm which is able to train policies in high-dimensional, stochastic and offline settings to minimize a given risk measure. Empirically, we show that using LODAC to minimize Conditional Value-at-Risk (CVaR) outperforms previous methods in terms of CVaR and return on stochastic environments.

In this section, we introduce notations and recall concepts we will use later. Coherent Risk Measure. Let (Ω, F, P) be a probability space and L^2 := L^2(Ω, F, P). A functional R : L^2 → (-∞, +∞] is called a coherent risk measure (Rockafellar, 2007) if 1. R(C) = C for all constants C.

1. INTRODUCTION

In many contexts, human decisions are stored and build interesting datasets. With the successes of modern machine learning tools comes the hope of exploiting them to build useful decision helpers. To achieve this, we could use an imitation learning approach (Hussein et al., 2017). But in this case, we will at best be as good as humans. Moreover, the performance of this approach depends heavily on the quality of the training dataset. In this work we would like to avoid these behaviours and thus we consider another framework: reinforcement learning (RL). In recent years, RL has achieved impressive results in a number of challenging areas, including games (Silver et al., 2016; 2018), robotic control (Gu et al., 2017; Haarnoja et al., 2018) and even healthcare (Shortreed et al., 2011; Wang et al., 2018). In particular, offline RL seems really interesting for real-world applications, since its goal is to train agents from a dataset of past interactions with the environment (Deisenroth et al., 2013).

With the digitization of society, more and more features could be used to represent the environment. Unfortunately, classical RL algorithms are not able to work with high-dimensional states. Obviously, we could manually choose a feature subset. However, this choice is not straightforward and it could have a huge impact on performance. Therefore, it may be more practical to use RL algorithms capable of learning from high-dimensional states. It is common to evaluate RL algorithms on deterministic (in the sense that the reward function is deterministic) environments such as the DeepMind Control Suite (Tassa et al., 2018). However, in many real-world applications, environments are not deterministic but stochastic. Therefore, it is important to develop algorithms which are able to train policies in these settings.
The motivation of this paper is to provide a method for training policies in an offline setting and in a high-dimensional stochastic environment. We present Latent Offline Distributional Actor-Critic (LODAC), an algorithm which is able to train policies in a high-dimensional, stochastic environment in an offline setting. The main idea is to learn a smaller representation of the state space and to train the agent directly in this latent space. But instead of considering the expected return, we take into account a risk measure. First, under some assumptions, we show that minimizing this risk measure in the latent space is equivalent to minimizing the risk measure directly in the natural state space. This theoretical result provides a natural framework: train a latent variable model to encode the natural space into a latent space, and then train a policy on top of this latent space using a risk-sensitive RL algorithm. In the experimental part, we evaluate our algorithm on high-dimensional stochastic datasets. To the best of our knowledge, we are the first to propose an algorithm to train policies in high-dimensional, stochastic and offline settings.
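The two-stage pipeline sketched above (encode high-dimensional states into a small latent space, then act through a policy trained on top of it) can be illustrated schematically. The classes, names and linear maps below are purely illustrative stand-ins of our own; a real implementation would use a learned latent variable model and a risk-sensitive actor-critic, not these fixed random projections.

```python
import numpy as np

rng = np.random.default_rng(0)

class Encoder:
    """Stand-in for a learned latent-variable model (e.g. a VAE encoder)."""
    def __init__(self, state_dim, latent_dim):
        # Fixed random projection, for illustration only.
        self.W = rng.normal(size=(latent_dim, state_dim)) / np.sqrt(state_dim)

    def __call__(self, state):
        return np.tanh(self.W @ state)  # deterministic latent code

class LatentPolicy:
    """Stand-in for an actor trained in the latent space."""
    def __init__(self, latent_dim, action_dim):
        self.W = rng.normal(size=(action_dim, latent_dim)) / np.sqrt(latent_dim)

    def __call__(self, z):
        return np.tanh(self.W @ z)  # actions bounded in [-1, 1]

state_dim, latent_dim, action_dim = 4096, 32, 6  # e.g. image features -> small latent
encoder = Encoder(state_dim, latent_dim)
policy = LatentPolicy(latent_dim, action_dim)

s = rng.normal(size=state_dim)  # a high-dimensional observation
a = policy(encoder(s))          # the policy never sees the raw state
print(a.shape)                  # (6,)
```

The point of the sketch is only the data flow: the policy consumes the 32-dimensional latent code, never the 4096-dimensional observation.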

2. RELATED WORK

Before going further into our work, we present some related work.

Offline RL. Offline RL (Levine et al., 2020) is a particular approach to RL where the goal is to learn policies directly from past interactions with the environment. This is a promising framework for real-world applications, since it allows the deployment of already trained policies. Thus, it is not really surprising that offline RL has received a lot of attention in recent years (Wiering & Van Otterlo, 2012; Levine et al., 2020; Yang et al., 2021; Chen et al., 2021b; Liu et al., 2021; Yu et al., 2021a; Wang et al., 2021). One of the main problems in offline RL is that the Q-function is too optimistic on out-of-distribution (OOD) state-action pairs, for which observational data is limited (Kumar et al., 2019). Different approaches have been introduced to deal with this problem. For example, some algorithms extend the importance sampling method (Nachum et al., 2019; Liu et al., 2019) or the dynamic programming approach (Fujimoto et al., 2019; Kumar et al., 2019). Other authors build a conservative estimate of the Q-function for OOD state-action pairs (Kumar et al., 2020; Yu et al., 2021b). Finally, it is also possible to extend the model-based approach (Rafailov et al., 2021; Argenson & Dulac-Arnold, 2020).

Learning with high-dimensional states. In previous years, algorithms have been proposed to train policies directly from high-dimensional states, like images (Lange & Riedmiller, 2010; Levine et al., 2016; Finn & Levine, 2017; Ha & Schmidhuber, 2018; Chen et al., 2021a). Previous work has observed that learning a good representation of the observation is a key point for this type of problem (Shelhamer et al., 2016). In some works, authors use data augmentation techniques to learn the best representation possible (Kostrikov et al., 2020; Laskin et al., 2020; Kipf et al., 2019).
How high-dimensional states should be encoded into a latent space has been intensively studied (Nair et al., 2018; Gelada et al., 2019; Watter et al., 2015; Finn et al., 2016). A common approach is then to train policies using a classical RL algorithm, like Soft Actor-Critic (SAC) (Haarnoja et al., 2018), on top of this latent representation (Han et al., 2019; Lee et al., 2020). Furthermore, it is also possible to plan in this latent space to improve performance (Hafner et al., 2019b;a).

Risk-sensitive RL. Risk-sensitive RL is a particular approach to safe RL (Garcıa & Fernández, 2015). In safe RL, policies are trained to maximize performance while respecting some safety constraints during training and/or deployment. In risk-sensitive RL, instead of maximizing the expectation of the cumulative rewards, we are interested in minimizing a measure of the risk induced by the cumulative rewards. Risk-sensitive RL has attracted some attention in the last few years (Fei et al., 2021; Zhang et al., 2021). Depending on the context, we might consider different risk measures, like Exponential Utility (Rabin, 2013), Cumulative Prospect Theory (Tversky & Kahneman, 1992) or Conditional Value-at-Risk (CVaR) (Rockafellar & Uryasev, 2002). Conditional Value-at-Risk has strong theoretical properties and is quite intuitive (Sarykalin et al., 2008; Artzner et al., 1999). Therefore, CVaR is really popular and has been extensively studied in the context of RL (Chow & Ghavamzadeh, 2014; Chow et al., 2015; Singh et al., 2020; Urpí et al., 2021; Ma et al., 2021; 2020; Ying et al., 2021). Previous work suggests that taking into account Conditional Value-at-Risk instead of the classical expectation could reduce the performance gap between simulation and real-world applications (Pinto et al., 2017).
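To make CVaR concrete: CVaR at level α is the expected loss in the worst (1 − α) tail of the loss distribution, i.e. the mean of the losses at or above the Value-at-Risk (the α-quantile). A minimal empirical estimator can be sketched as follows; the function name and toy data are ours, not from any of the cited works.

```python
import numpy as np

def empirical_cvar(losses, alpha=0.95):
    """Empirical CVaR_alpha: mean of the losses at or above the
    empirical alpha-quantile (the Value-at-Risk)."""
    losses = np.asarray(losses, dtype=float)
    var = np.quantile(losses, alpha)     # Value-at-Risk at level alpha
    return losses[losses >= var].mean()  # average over the worst tail

# Toy example: because CVaR averages only the bad tail, it exceeds the
# plain mean for any non-degenerate loss distribution.
rng = np.random.default_rng(0)
sample = rng.normal(size=10_000)
print(sample.mean())                # close to 0
print(empirical_cvar(sample, 0.95)) # roughly 2.06 for a standard normal
```

Note that the estimator satisfies the constancy axiom recalled earlier: for a constant loss C, the quantile and the tail mean both equal C, so CVaR(C) = C.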

