LATENT OFFLINE DISTRIBUTIONAL ACTOR-CRITIC

Abstract

Offline reinforcement learning (RL) has emerged as a promising paradigm for real-world applications, since it aims to train policies directly from datasets of past interactions with the environment. In the past few years, algorithms have been introduced to learn policies from high-dimensional observational states in an offline setting. The general idea of these methods is to encode the environment into a smaller latent space and to train policies on top of this latent representation. In this paper, we extend this general approach to stochastic environments (i.e., environments where the reward function is stochastic) and consider a risk measure instead of the classical expected return. First, we show that, under some assumptions, minimizing a risk measure in the latent space is equivalent to minimizing it in the natural space. Based on this result, we present Latent Offline Distributional Actor-Critic (LODAC), an algorithm able to train policies in high-dimensional, stochastic, and offline settings to minimize a given risk measure. Empirically, we show that using LODAC to minimize the Conditional Value-at-Risk (CVaR) outperforms previous methods in terms of CVaR and return on stochastic environments.

1. INTRODUCTION

In many contexts, human decisions are stored and form interesting datasets. With the successes of modern machine learning tools comes the hope of exploiting them to build useful decision helpers. To achieve this, we could use an imitation learning approach (Hussein et al., 2017). But in this case, we will at best be as good as humans. Moreover, the performance of this approach depends heavily on the quality of the training dataset. In this work, we would like to avoid these behaviours and thus consider another framework: reinforcement learning (RL). In recent years, RL has achieved impressive results in a number of challenging areas, including games (Silver et al., 2016; 2018), robotic control (Gu et al., 2017; Haarnoja et al., 2018) and even healthcare (Shortreed et al., 2011; Wang et al., 2018). In particular, offline RL seems especially interesting for real-world applications, since its goal is to train agents from a dataset of past interactions with the environment (Deisenroth et al., 2013).

With the digitization of society, more and more features could be used to represent the environment. Unfortunately, classical RL algorithms are not able to work with high-dimensional states. Obviously, we could manually choose a feature subset. However, this choice is not straightforward and it could have a huge impact on performance. Therefore, it may be more practical to use RL algorithms capable of learning from high-dimensional states. It is common to evaluate RL algorithms on deterministic (in the sense that the reward function is deterministic) environments such as the DeepMind Control Suite (Tassa et al., 2018). However, in many real-world applications, environments are not deterministic but stochastic. Therefore, it is important to develop algorithms which are able to train policies in these settings.

The motivation of this paper is to provide a method for training policies in an offline setting on high-dimensional stochastic environments. We present Latent Offline Distributional Actor-Critic (LODAC), an algorithm which is able to train policies in a high-dimensional, stochastic environment in an offline setting. The main idea is to learn a smaller latent representation and to train the agent directly in this latent space. But instead of considering the expected return, we take into account a risk measure. First, under some assumptions, we show that minimizing this risk measure in the latent space is equivalent to minimizing the risk measure directly in the natural state space. This theoretical result provides a natural framework: train a latent variable model to encode the natural space into a latent space and then train a policy on top of this latent representation.
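For reference, we recall one standard formulation of the Conditional Value-at-Risk; the sign and level conventions below (a loss variable $Z$ and level $\alpha \in (0,1)$) are only an assumption here, and the precise convention used by our method is fixed in the later sections:
$$
\mathrm{VaR}_{\alpha}(Z) = \inf \{ z \in \mathbb{R} : \mathbb{P}(Z \le z) \ge \alpha \},
\qquad
\mathrm{CVaR}_{\alpha}(Z) = \frac{1}{1-\alpha} \int_{\alpha}^{1} \mathrm{VaR}_{u}(Z) \, \mathrm{d}u,
$$
where, for a continuous distribution, $\mathrm{CVaR}_{\alpha}(Z) = \mathbb{E}\left[ Z \mid Z \ge \mathrm{VaR}_{\alpha}(Z) \right]$, i.e. the expected loss in the worst $(1-\alpha)$-fraction of outcomes.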

