RISK-AVERSE OFFLINE REINFORCEMENT LEARNING

Abstract

Training Reinforcement Learning (RL) agents online in high-stakes applications is often prohibitive due to the risk associated with exploration. Thus, the agent can only use data previously collected by safe policies. While previous work considers optimizing the average performance using offline data, we focus on optimizing a risk-averse criterion. In particular, we present the Offline Risk-Averse Actor-Critic (O-RAAC), a model-free RL algorithm that is able to learn risk-averse policies in a fully offline setting. We show that O-RAAC learns policies with higher risk-averse performance than risk-neutral approaches in different robot control tasks. Furthermore, considering risk-averse criteria guarantees distributional robustness of the average performance with respect to particular distribution shifts. We demonstrate empirically that in the presence of natural distribution shifts, O-RAAC learns policies with good average performance.

1. INTRODUCTION

In high-stakes applications, the deployment of high-performing Reinforcement Learning (RL) agents is limited by prohibitively large costs at early exploration stages (Dulac-Arnold et al., 2019). To address this issue, the offline (or batch) RL setting considers learning a policy from a limited batch of pre-collected data. However, high-stakes decision-making is typically also risk-averse: we assign more weight to adverse events than to positive ones (Pratt, 1978). Although several algorithms for risk-sensitive RL exist (Howard & Matheson, 1972; Mihatsch & Neuneier, 2002), none of them addresses the offline setting. On the other hand, existing offline RL algorithms consider the average performance criterion and are risk-neutral (Ernst et al., 2005; Lange et al., 2012).

Main contributions. We present the first approach towards learning a risk-averse RL policy for high-stakes applications using only offline data: the Offline Risk-Averse Actor-Critic (O-RAAC). The algorithm has three components: a distributional critic that learns the full value distribution (Section 3.1), a risk-averse actor that optimizes a risk-averse criterion (Section 3.2), and an imitation learner implemented with a variational auto-encoder (VAE) that reduces the bootstrapping error due to the offline nature of the algorithm (Section 3.3). In Figure 1, we show how these components interact with each other. Finally, in Section 4 we demonstrate the empirical performance of O-RAAC. Our implementation is freely available on GitHub: https://github.com/nuria95/O-RAAC.
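To make the distributional-critic component concrete, the following is a minimal sketch of the quantile Huber loss used in quantile-regression distributional RL (Dabney et al.'s QR-DQN family). This is an illustration of the general technique, not the paper's exact critic: the function name, the fixed quantile midpoints, and the `kappa` threshold are our assumptions for the sketch.

```python
import numpy as np

def quantile_huber_loss(pred_quantiles, target_samples, kappa=1.0):
    """Quantile Huber loss for training a quantile-based distributional critic.

    pred_quantiles: shape (N,), the critic's N quantile estimates of the return.
    target_samples: shape (M,), samples of the Bellman target distribution.
    kappa: threshold between the quadratic and linear Huber regimes (assumed).
    """
    pred_quantiles = np.asarray(pred_quantiles, dtype=float)
    target_samples = np.asarray(target_samples, dtype=float)
    n = len(pred_quantiles)
    # Fixed quantile midpoints tau_i = (2i - 1) / (2N).
    taus = (np.arange(n) + 0.5) / n
    # Pairwise TD errors u_ij = target_j - pred_i, shape (N, M).
    u = target_samples[None, :] - pred_quantiles[:, None]
    # Huber penalty: quadratic near zero, linear in the tails.
    huber = np.where(np.abs(u) <= kappa,
                     0.5 * u ** 2,
                     kappa * (np.abs(u) - 0.5 * kappa))
    # Asymmetric quantile weighting |tau_i - 1{u_ij < 0}|.
    weight = np.abs(taus[:, None] - (u < 0).astype(float))
    return float(np.mean(weight * huber))
```

Minimizing this loss drives each `pred_quantiles[i]` towards the corresponding quantile of the target distribution, so the critic captures the whole return distribution rather than only its mean.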

1.1. RELATED WORK

Risk-Averse RL. The most common risk-averse measure in the literature is the Conditional Value-at-Risk (CVaR) (Rockafellar & Uryasev, 2002), which belongs to the family of coherent risk measures (Artzner et al., 1999); we focus mainly on these risk measures. Nevertheless, other risk criteria such as Cumulative Prospect Theory (Tversky & Kahneman, 1992) or Exponential Utility (Rabin, 2013) can also be used with the algorithm we propose. In the context of RL, Petrik & Subramanian (2012); Chow & Ghavamzadeh (2014); Chow et al. (2015) propose dynamic programming algorithms for optimizing the CVaR of the return distribution with known tabular Markov Decision Processes (MDPs). For unknown models, Morimura et al. (2010) propose a SARSA algorithm for CVaR optimization, but it is limited to the on-policy setting and small action spaces. To scale to larger systems, Tamar et al. (2012; 2015) propose on-policy Actor-Critic algorithms for coherent risk measures. However, they are extremely sample inefficient due to sample discarding to compute
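For reference, CVaR at level alpha is the expected return conditioned on being in the worst alpha-fraction of outcomes. A minimal sketch of the standard empirical estimator (the function name and interface are ours, for illustration):

```python
import numpy as np

def cvar(returns, alpha=0.1):
    """Empirical CVaR_alpha: the mean of the worst alpha-fraction of returns.

    returns: array of sampled episode returns.
    alpha: tail level in (0, 1]; smaller alpha means more risk-averse.
    """
    returns = np.sort(np.asarray(returns, dtype=float))
    # Number of samples in the alpha-tail (at least one).
    k = max(1, int(np.ceil(alpha * len(returns))))
    return float(returns[:k].mean())
```

For example, with returns `[1, 2, ..., 10]` and `alpha = 0.2`, the estimator averages the two worst returns, giving 1.5, whereas the risk-neutral mean is 5.5; a CVaR-optimizing agent therefore prefers policies whose worst-case outcomes are less severe, even at some cost in average performance.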

