RISK-AVERSE OFFLINE REINFORCEMENT LEARNING

Abstract

Training Reinforcement Learning (RL) agents online in high-stakes applications is often prohibitive due to the risk associated with exploration. Thus, the agent can only use data previously collected by safe policies. While previous work considers optimizing the average performance using offline data, we focus on optimizing a risk-averse criterion. In particular, we present the Offline Risk-Averse Actor-Critic (O-RAAC), a model-free RL algorithm that learns risk-averse policies in a fully offline setting. We show that O-RAAC learns policies with higher risk-averse performance than risk-neutral approaches in different robot control tasks. Furthermore, considering risk-averse criteria guarantees distributional robustness of the average performance with respect to particular distribution shifts. We demonstrate empirically that, in the presence of natural distribution shifts, O-RAAC learns policies with good average performance.

1. INTRODUCTION

In high-stakes applications, the deployment of highly-performing Reinforcement Learning (RL) agents is limited by prohibitively large costs at early exploration stages (Dulac-Arnold et al., 2019). To address this issue, the offline (or batch) RL setting considers learning a policy from a limited batch of pre-collected data. However, high-stakes decision-making is typically also risk-averse: we assign more weight to adverse events than to positive ones (Pratt, 1978). Although several algorithms for risk-sensitive RL exist (Howard & Matheson, 1972; Mihatsch & Neuneier, 2002), none of them addresses the offline setting. On the other hand, existing offline RL algorithms consider the average performance criterion and are risk-neutral (Ernst et al., 2005; Lange et al., 2012).

Main contributions We present the first approach towards learning a risk-averse RL policy for high-stakes applications using only offline data: the Offline Risk-Averse Actor-Critic (O-RAAC). The algorithm has three components: a distributional critic that learns the full value distribution (Section 3.1), a risk-averse actor that optimizes a risk-averse criterion (Section 3.2), and an imitation learner implemented with a variational auto-encoder (VAE) that reduces the bootstrapping error due to the offline nature of the algorithm (Section 3.3). In Figure 1, we show how these components interact with each other. Finally, in Section 4 we demonstrate the empirical performance of O-RAAC. Our implementation is freely available on GitHub: https://github.com/nuria95/O-RAAC.
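As a rough illustration of how the imitation learner and the risk-averse actor combine, the following PyTorch sketch shows a perturbation-style actor: a generative model of the behavior policy (the VAE decoder) proposes an action, and a learned network adds a small bounded perturbation to it. The layer sizes, the perturbation scale `phi`, and the class name are illustrative assumptions, not the exact architecture of the paper.

```python
import torch
import torch.nn as nn

class PerturbedActor(nn.Module):
    """Hypothetical sketch: behavior-policy action plus a bounded perturbation.

    The perturbation is limited to a fraction `phi` of the action range, so the
    final action stays close to the support of the data-collection policy.
    """

    def __init__(self, state_dim, action_dim, max_action=1.0, phi=0.05):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
            nn.Linear(64, action_dim), nn.Tanh(),
        )
        self.max_action, self.phi = max_action, phi

    def forward(self, state, vae_action):
        # Perturbation bounded to +/- phi * max_action by the Tanh output layer.
        xi = self.phi * self.max_action * self.net(
            torch.cat([state, vae_action], dim=-1))
        # Clamp so the perturbed action remains a valid action.
        return (vae_action + xi).clamp(-self.max_action, self.max_action)
```

Keeping the final action near the behavior policy's support is what limits the bootstrapping error described in Section 1.1.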

1.1. RELATED WORK

Risk-Averse RL The most common risk-averse measure in the literature is the Conditional Value-at-Risk (CVaR) (Rockafellar & Uryasev, 2002), which belongs to the family of coherent risk measures (Artzner et al., 1999), and we focus mainly on these risk measures. Nevertheless, other risk criteria, such as Cumulative Prospect Theory (Tversky & Kahneman, 1992) or Exponential Utility (Rabin, 2013), can also be used with the algorithm we propose. In the context of RL, Petrik & Subramanian (2012; 2020) propose an off-policy algorithm that approximates the return distribution with a Gaussian distribution and learns its moments using the Bellman equations for the mean and the variance of the distribution. Instead, we learn the full return distribution without making the Gaussianity assumption (Bellemare et al., 2017). Perhaps most closely related is the work of Singh et al. (2020), who also consider a distributional critic, but their algorithm is limited to the CVaR and does not address the offline RL setting. Furthermore, they use a sample-based distributional critic, which makes the computation of the CVaR inefficient. Instead, we modify Implicit Quantile Networks (Dabney et al., 2018) to compute different risk criteria efficiently. Although Dabney et al. (2018) already investigated risk-related criteria, their scope is limited to discrete action spaces (e.g., the Atari domain) in an off-policy setting, whereas we consider continuous actions in an offline setting.

Offline RL The biggest challenge in offline RL is the bootstrapping error: the Q-function is evaluated at state-action pairs where there is little or no data, and these errors are propagated through the Bellman equation (Kumar et al., 2019). In turn, a policy optimized with offline data induces a state-action distribution that is shifted from the original data (Ross et al., 2011). To address this, Fujimoto et al. (2019) propose to express the actor as the sum of an imitation-learning component and a perturbation model that controls the deviation from the behavior policy. Other approaches to control the difference between the data-collection policy and the optimized policy include regularizing the learned policy towards the behavior policy using the MMD distance (Kumar et al., 2019) or f-divergences (Wu et al., 2020; Jaques et al., 2019), or using the behavior policy as a prior (Siegel et al., 2020). An alternative strategy in offline RL is to be pessimistic with respect to the epistemic uncertainty that arises due to data scarcity. Yu et al. (2020) take a model-based approach and penalize the per-step rewards with the epistemic uncertainty of their dynamics model. Using a model-free approach, Kumar et al. (2020) and Buckman et al. (2020) propose to learn a lower bound of the Q-function by using an uncertainty estimate as a penalty in the Bellman target. Our work uses ideas from both strategies to address the offline risk-averse problem. First, we use an imitation learner to control the bootstrapping error. Second, by considering a risk-averse criterion, we also optimize over a pessimistic distribution compatible with the empirical distribution in the data set. The connections between risk-aversion and distributional robustness are well studied in supervised learning (Shapiro et al., 2014; Namkoong & Duchi, 2017; Curi et al., 2020; Levy et al., 2020) and in reinforcement learning (Chow et al., 2015; Pan et al., 2019).
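The IQN-style trick mentioned above can be sketched in a few lines: to estimate CVaR_α, sample the quantile levels τ from U(0, α) instead of U(0, 1) and average the corresponding quantiles. In the sketch below, the closed-form quantile function of a standard normal stands in for a learned quantile critic; this stand-in (and the function names) are purely illustrative assumptions.

```python
import random
from statistics import NormalDist

def cvar_from_quantile_fn(quantile_fn, alpha, n_samples=100_000, rng=None):
    """Estimate CVaR_alpha by averaging quantiles at levels tau ~ U(0, alpha).

    With tau ~ U(0, 1) the same average estimates the mean; restricting tau to
    (0, alpha) yields the expected return in the worst alpha-fraction of
    outcomes, i.e. the CVaR at level alpha.
    """
    rng = rng or random.Random(0)
    taus = [rng.uniform(0.0, alpha) for _ in range(n_samples)]
    return sum(quantile_fn(t) for t in taus) / n_samples

# Toy "critic": quantile function of a standard normal return distribution.
z_quantile = NormalDist(mu=0.0, sigma=1.0).inv_cdf

cvar_25 = cvar_from_quantile_fn(z_quantile, alpha=0.25)
mean_est = cvar_from_quantile_fn(z_quantile, alpha=1.0)  # tau ~ U(0, 1): mean
```

Because the critic is queried only at the sampled quantile levels, the same network serves both risk-neutral and risk-averse objectives, which is what makes this approach efficient compared to a sample-based critic.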

2. PROBLEM STATEMENT

We consider a Markov Decision Process (MDP) with possibly continuous states s ∈ S, possibly continuous actions a ∈ A, transition kernel P(•|s, a), reward kernel R(•|s, a), and discount factor γ. We denote by π a stationary policy, i.e., a mapping from states to distributions over actions. We denote the (random) discounted return of policy π from a state-action pair (s, a) by

Z^π(s, a) = Σ_{t=0}^∞ γ^t r_t,   r_t ∼ R(•|s_t, a_t),   s_{t+1} ∼ P(•|s_t, a_t),   a_t ∼ π(•|s_t),   s_0 = s, a_0 = a.
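For a single sampled trajectory, the discounted return can be accumulated backwards without computing powers of γ explicitly; a minimal sketch:

```python
def discounted_return(rewards, gamma):
    """Compute sum_t gamma^t * r_t for one sampled trajectory.

    Iterating backwards turns the sum into the recursion
    G_t = r_t + gamma * G_{t+1}.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

For example, `discounted_return([1.0, 1.0, 1.0], 0.5)` evaluates to 1 + 0.5 + 0.25 = 1.75. Repeating this over many sampled trajectories yields samples of Z^π(s, a), whose distribution the critic models.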



Figure 1: Visualization of the algorithm components. Solid lines indicate the forward flow of data, whereas dashed lines indicate the backward flow of gradients. Data is stored in the fixed buffer. The VAE, in blue, learns a generative model of the behavior policy. The actor, in green, perturbs the VAE and outputs a policy. The critic learns the Z-value distribution of the policy. The actor optimizes a risk-averse distortion of the Z-value distribution, which we denote by D_Z. On the right, we show a typical probability density function of Z learned by the critic in red. In dashed black, we indicate the expected value of Z, which a risk-neutral actor intends to maximize. Instead, a risk-averse actor intends to maximize a distortion D_Z, shown in dashed green. In this particular visualization, we show the ubiquitous Conditional Value-at-Risk (CVaR).
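For concreteness, on an empirical sample of returns the CVaR shown in the figure reduces to the mean of the worst α-fraction of outcomes. The sketch below uses made-up numbers; it also hints at the distributional-robustness reading from the introduction, since this tail mean equals the most pessimistic re-weighting of the sample whose density ratio is bounded by 1/α.

```python
def empirical_cvar(returns, alpha):
    """Mean of the worst alpha-fraction of sampled returns."""
    k = max(1, int(alpha * len(returns)))  # number of tail samples kept
    worst = sorted(returns)[:k]            # the k lowest returns
    return sum(worst) / k

returns = [8.0, 1.0, 5.0, 2.0, 7.0, 3.0, 6.0, 4.0]  # illustrative sample
cvar_25 = empirical_cvar(returns, alpha=0.25)  # mean of the 2 worst returns
mean_ret = sum(returns) / len(returns)
```

A risk-neutral actor compares policies by `mean_ret`, while a CVaR-averse actor compares them by `cvar_25`, which is why the two dashed lines in the figure generally differ.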

