HOW TO ENABLE UNCERTAINTY ESTIMATION IN PROXIMAL POLICY OPTIMIZATION

Abstract

While deep reinforcement learning (RL) agents have showcased strong results across many domains, a major concern is their inherent opaqueness and the safety of such systems in real-world use cases. To address these issues, we need agents that can quantify their uncertainty and detect out-of-distribution (OOD) states. Existing uncertainty estimation techniques, like Monte-Carlo Dropout or Deep Ensembles, have not been extended to on-policy deep RL. We posit that this is due to two reasons: first, concepts like uncertainty and OOD states are not well defined for RL compared to supervised learning, especially for on-policy methods; second, available implementations of and comparative studies on uncertainty estimation methods in RL have been limited. To fill the first gap, we propose definitions of uncertainty and OOD for Actor-Critic RL algorithms, namely, proximal policy optimization (PPO), and present potentially applicable measures. In particular, we discuss the concepts of value and policy uncertainty. The second problem is addressed by implementing different uncertainty estimation methods and comparing them across a number of environments. The OOD detection performance is evaluated via a custom evaluation benchmark of in-distribution (ID) and OOD states for various RL environments. We identify a trade-off between reward and OOD detection performance. To overcome this, we formulate a bi-objective meta optimization problem in which we simultaneously optimize hyperparameters for reward and OOD detection performance. Our experiments show that the recently proposed method of Masksembles (Durasov et al. (2021)) strikes a favourable balance among the compared methods, enabling high-quality uncertainty estimation and OOD detection while matching the reward performance of the original RL agents.

1. INTRODUCTION

Agents trained via deep RL have achieved several high-profile successes over the last years, e.g., in domains like game playing (Badia et al. (2020)), robotics (Gu et al. (2017)), or navigation (Kahn et al. (2017)). However, generalization and robustness to different conditions and tasks remain a considerable challenge (Kirk et al. (2021); Lütjens et al. (2020)). Without specialized training procedures, e.g., large-scale multi-task learning, RL agents only learn a very specific task under given environment conditions (OEL Team (2021)). Moreover, even slight variations in the conditions compared to the training environment can lead to severe failure of the agent (Lütjens et al. (2019); Dulac-Arnold et al. (2019)). This is especially relevant for potential safety-critical applications. Methods that uncover an agent's uncertainty can help to combat some of the most severe consequences of incorrect decision-making. Agents that can indicate high uncertainty and, therefore, e.g., query for human intervention or fall back on a robust baseline policy (e.g., a hard-coded one) could vastly improve the reliability of systems deploying RL-trained decision-making. Despite this, there is almost a complete lack of studies on uncertainty estimation for actor-critic algorithms, even though actor-critic algorithms are widely used in practice, e.g., in continuous control applications. Additionally, work specifically exploring OOD detection has mostly focused on very simple environments and neural network architectures (Mohammed & Valdenegro-Toro (2021); Sedlmeier et al. (2019)). This limits the transferability of results to more commonly used research environments. In this paper, we bridge this research gap through a systematic study of in-distribution (ID) and out-of-distribution (OOD) detection for on-policy actor-critic RL across a set of more complex environments.
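The safeguard described above, deferring to a human or a baseline policy when uncertainty is high, can be phrased as a simple decision rule. The following is an illustrative sketch only; the function name and the scalar threshold are our own assumptions, not part of the paper:

```python
from typing import Callable, TypeVar

Action = TypeVar("Action")

def gated_action(
    policy_action: Action,
    uncertainty: float,
    fallback: Callable[[], Action],
    threshold: float,
) -> Action:
    """Return the agent's action unless its uncertainty estimate exceeds a
    calibrated threshold, in which case defer to a safe fallback (e.g., a
    hard-coded baseline policy or a query for human intervention)."""
    if uncertainty > threshold:
        return fallback()
    return policy_action
```

How to calibrate the threshold is exactly what a benchmark of ID vs. OOD states (as proposed in this paper) makes possible.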
We implement and evaluate a set of uncertainty estimation methods for Proximal Policy Optimization (PPO) (Schulman et al. (2017)), a versatile on-policy actor-critic algorithm that is popular in research due to its simplicity and high performance. We compare a series of established uncertainty estimation methods in this setting. Specifically, we introduce the recently proposed method of Masksembles (Durasov et al. (2021)) to the domain of RL. For multi-sample uncertainty estimation methods like Ensembles or Dropout, we identify a trade-off in on-policy RL: during training, "closeness" to the on-policy setting is required, i.e., models should train on their own recent trajectory data. However, for OOD detection, a sufficiently diverse set of sub-models is required, which leads to the training data becoming off-policy. Existing methods like Monte Carlo Dropout (MC Dropout), Monte Carlo DropConnect (MC DropConnect), or Ensembles therefore struggle to achieve either good reward or good OOD detection performance. The recently proposed method of Masksembles can smoothly interpolate between different degrees of inter-model correlation via its hyperparameters. We propose a bi-objective meta optimization to find the ideal configuration and show that Masksembles can produce competitive results for OOD detection while maintaining good reward performance. The contributions of this paper are three-fold: (1) We examine the concept of uncertainty for on-policy actor-critic RL, discussing ID and OOD settings; to this end, we define value- and policy-uncertainty, as well as a multiplicative measure (Sec. 3). (2) We present different methods to enable uncertainty estimation in PPO; as part of this, we establish Masksembles as a robust method in the domain of RL, particularly for on-policy actor-critic RL (Sec. 4). (3) We present an OOD detection benchmark to evaluate the quality of uncertainty estimation for a series of MuJoCo and Atari environments (Sec. 5), and we evaluate the presented methods and highlight key results of the comparison (Sec. 6). The full source code of the implementations, fully compatible with StableBaselines3 (Raffin et al. (2021)), will be released as open-source.
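To make the interpolation property concrete, the following numpy sketch illustrates the core idea of fixed, partially overlapping binary masks that define correlated sub-models. This is our own toy simplification, not the authors' architecture: the mean readout is arbitrary, and `keep_prob` merely stands in for Masksembles' scale hyperparameter:

```python
import numpy as np

def make_masks(n_masks: int, n_features: int, keep_prob: float, seed: int = 0) -> np.ndarray:
    """Draw fixed binary masks once, before training.
    Low keep_prob -> nearly disjoint masks (ensemble-like sub-models);
    keep_prob near 1 -> heavily overlapping masks (strongly correlated sub-models)."""
    rng = np.random.default_rng(seed)
    return (rng.random((n_masks, n_features)) < keep_prob).astype(float)

def sub_model_predictions(x: np.ndarray, W: np.ndarray, masks: np.ndarray) -> np.ndarray:
    """One prediction per fixed mask: a shared ReLU layer whose features are
    masked differently for each sub-model, followed by a toy mean readout."""
    h = np.maximum(x @ W, 0.0)        # shared hidden features
    return (h * masks).mean(axis=1)   # shape: (n_masks,)

def predictive_uncertainty(x, W, masks):
    """Disagreement across sub-models as a proxy for epistemic uncertainty."""
    preds = sub_model_predictions(x, W, masks)
    return preds.mean(), preds.std()
```

With `keep_prob = 1` every mask is all-ones, all sub-models coincide, and the spread collapses to zero; decreasing `keep_prob` decorrelates the sub-models and recovers ensemble-like disagreement, which is the knob the bi-objective meta optimization tunes.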

2. RELATED WORK

Defining uncertainty in RL. Defining uncertainty for RL is less well studied than in the supervised setting (Abdar et al. (2021)). Existing work has studied aleatoric and epistemic uncertainty in reinforcement learning (Clements et al. (2019); Charpentier et al. (2022)), e.g., disentangling the two types of uncertainty. Here, we are interested in OOD detection during inference, which is achieved by capturing epistemic uncertainty. However, as we do not directly attempt to disentangle both sources, our proposed uncertainty measures can also be interpreted as estimating total uncertainty.

Methods for uncertainty estimation in RL. Different methods have been proposed in supervised learning settings to tackle uncertainty estimation and OOD detection. Bayesian deep learning methods (Kendall & Gal (2017); Wang & Yeung (2020)) provide robust uncertainty estimation rooted in a deep theoretical background. However, Bayesian deep learning methods often come with a considerable overhead in complexity, computation, and sample inefficiency compared to non-Bayesian methods, which has slowed their adoption in deep learning in general and in RL in particular. Deep ensemble models have proven a reliable and scalable method for uncertainty estimation and OOD detection in supervised deep learning (Lakshminarayanan et al. (2017); Fort et al. (2019)). An obvious drawback of ensemble methods is the need for a number of independently trained models, which drives up computational cost and inference time. MC Dropout has been proposed as a more resource-effective method for single-model uncertainty estimation (Gal & Ghahramani (2016)). As a disadvantage, enabling dropout during inference can lead to lower model performance overall (Durasov et al. (2021)). Although there has been work on using uncertainty estimation in RL, it has mostly been limited to simple bandit settings (Auer (2003)).
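As a concrete illustration of the multi-sample idea behind MC Dropout, the sketch below keeps dropout active at inference time and reads the spread over several stochastic forward passes as an uncertainty estimate. This is a minimal sketch with a toy one-layer model, not a full implementation, and the inverted-dropout rescaling shown is one common convention:

```python
import numpy as np

def mc_dropout_predict(x, W, p_drop, n_passes=32, seed=0):
    """Run n_passes stochastic forward passes with dropout left ON at
    inference; the std over passes serves as an (epistemic) uncertainty
    estimate, while the mean serves as the prediction."""
    rng = np.random.default_rng(seed)
    h = np.maximum(x @ W, 0.0)            # deterministic hidden features
    keep = 1.0 - p_drop
    preds = []
    for _ in range(n_passes):
        mask = rng.random(h.shape) < keep               # fresh dropout mask per pass
        preds.append((h * mask).sum() / max(keep, 1e-12))  # inverted-dropout rescaling
    preds = np.array(preds)
    return preds.mean(), preds.std()
```

With `p_drop = 0` all passes coincide and the spread is exactly zero, recovering a deterministic network; larger dropout rates trade prediction quality for a more informative spread, which is the performance drawback noted above.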



In response, uncertainty estimation has been recognized as an important open issue in deep learning (Abdar et al. (2021); Malinin & Gales (2018)) and has been investigated in domains such as computer vision (Lakshminarayanan et al. (2017); Durasov et al. (2021)) and natural language processing (Xiao & Wang (2018); He et al. (2020)). The topic of uncertainty estimation for deep RL has also gained traction (Clements et al. (2019)). While some work has explored OOD detection for Q-learning-based algorithms (Chen et al. (2017; 2021); Lee et al.; Lu & Van Roy (2017)) and model-based RL (Kahn et al. (2017)), or applied uncertainty estimates in training routines to improve agent performance, e.g., by boosting exploration in uncertain states (Wu et al. (2021); Chen et al. (2017); Osband et al. (2016; 2018)), work explicitly investigating OOD detection for RL has been limited (Mohammed & Valdenegro-Toro (2021); Sedlmeier et al. (2019)). Out of the existing work, most methods have been applied almost exclusively to Q-learning (Mohammed & Valdenegro-Toro (2021); Sedlmeier et al. (2019)).

