HOW TO ENABLE UNCERTAINTY ESTIMATION IN PROXIMAL POLICY OPTIMIZATION

Abstract

While deep reinforcement learning (RL) agents have showcased strong results across many domains, a major concern is their inherent opaqueness and the safety of such systems in real-world use cases. To resolve these issues, we need agents that can quantify their uncertainty and detect out-of-distribution (OOD) states. Existing uncertainty estimation techniques, like Monte-Carlo Dropout or Deep Ensembles, have not been extended to on-policy deep RL. We posit that this is due to two reasons: first, concepts like uncertainty and OOD states are not as well defined as in supervised learning, especially for on-policy RL methods; second, available implementations and comparative studies of uncertainty estimation methods in RL have been limited. To fill the first gap, we propose definitions of uncertainty and OOD for Actor-Critic RL algorithms, namely proximal policy optimization (PPO), and present potentially applicable measures. In particular, we discuss the concepts of value and policy uncertainty. The second problem is addressed by implementing different uncertainty estimation methods and comparing them across a number of environments. OOD detection performance is evaluated via a custom benchmark of in-distribution (ID) and OOD states for various RL environments. We identify a trade-off between reward and OOD detection performance. To overcome it, we formulate a bi-objective meta-optimization problem in which we simultaneously optimize hyperparameters for reward and OOD detection performance. Our experiments show that the recently proposed method of Masksembles (Durasov et al. (2021)) strikes a favourable balance among the methods, enabling high-quality uncertainty estimation and OOD detection while matching the reward performance of original RL agents.
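To illustrate the core idea behind Masksembles referenced above, the following is a minimal numpy sketch, not the authors' implementation: a fixed set of K binary dropout masks is sampled once, each forward pass applies one of these fixed masks, and the spread across the K resulting predictions serves as an uncertainty score. All names and the toy value head are hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_masks(n_masks, width, keep_frac):
    # Fixed binary masks with a target keep fraction; the overlap between
    # masks controls where the method sits between MC Dropout (high
    # overlap) and a full deep ensemble (disjoint subnetworks).
    n_keep = int(width * keep_frac)
    masks = np.zeros((n_masks, width))
    for m in masks:
        m[rng.choice(width, size=n_keep, replace=False)] = 1.0
    return masks

# Hypothetical tiny value head (weights are random placeholders).
W1 = rng.normal(size=(8, 32))
W2 = rng.normal(size=(32, 1))
MASKS = make_masks(n_masks=4, width=32, keep_frac=0.5)

def value_masksembles(state):
    """One deterministic forward pass per fixed mask; the standard
    deviation across the K members is the uncertainty / OOD score."""
    preds = []
    for mask in MASKS:
        h = np.maximum(state @ W1, 0.0) * mask  # ReLU hidden layer, masked
        preds.append((h @ W2).item())
    preds = np.asarray(preds)
    return preds.mean(), preds.std()

state = rng.normal(size=8)
v_mean, v_unc = value_masksembles(state)
```

Because the masks are fixed, each member is deterministic at inference time, which keeps the per-state cost at K forward passes rather than the many stochastic samples MC Dropout needs.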



Deep RL agents have achieved several high-profile successes over the last years, e.g., in domains like game playing (Badia et al. (2020)), robotics (Gu et al. (2017)), or navigation (Kahn et al. (2017)). However, generalization and robustness to different conditions and tasks remain a considerable challenge (Kirk et al. (2021); Lütjens et al. (2020)). Without specialized training procedures, e.g., large-scale multitask learning, RL agents only learn a very specific task under the given environment conditions (OEL Team (2021)). Moreover, even slight variations in the conditions compared to the training environment can lead to severe failure of the agent (Lütjens et al. (2019); Dulac-Arnold et al.). This is especially relevant for potential safety-critical applications. Methods to uncover an agent's uncertainty can help to combat some of the most severe consequences of incorrect decision-making. Agents that can indicate high uncertainty and therefore, e.g., query for human intervention or fall back to a robust baseline policy (e.g., a hard-coded one) could vastly improve the reliability of systems deploying RL-trained decision-making.

In response, uncertainty estimation has been recognized as an important open issue in deep learning (Abdar et al. (2021); Malinin & Gales (2018)) and has been investigated in domains such as computer vision (Lakshminarayanan et al. (2017); Durasov et al. (2021)) and natural language processing (Xiao & Wang (2018); He et al. (2020)). The topic of uncertainty estimation for deep RL has also gained traction (Clements et al. (2019)). While some work has explored OOD detection for Q-learning-based algorithms (Chen et al. (2017; 2021); Lee et al. (2021)), there is almost a complete lack of studies on uncertainty estimation for actor-critic algorithms, despite the fact that such algorithms are widely used in practice, e.g., in continuous control applications. Additionally, work specifically


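As a concrete illustration of the kind of uncertainty estimation method discussed above, the following is a minimal numpy sketch of MC Dropout applied to a value head (an assumed toy network, not any implementation from the paper): dropout stays active at inference time, and the standard deviation over repeated stochastic forward passes serves as an epistemic-uncertainty proxy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny value network: one hidden layer, random placeholder weights.
W1 = rng.normal(size=(8, 16))
W2 = rng.normal(size=(16, 1))

def value_mc_dropout(state, n_samples=100, p_drop=0.1):
    """MC Dropout: keep dropout active at inference and aggregate
    stochastic forward passes; the std over samples approximates
    the epistemic uncertainty of the value estimate."""
    preds = []
    for _ in range(n_samples):
        h = np.maximum(state @ W1, 0.0)        # ReLU hidden layer
        mask = rng.random(h.shape) >= p_drop   # fresh Bernoulli dropout mask
        h = h * mask / (1.0 - p_drop)          # inverted-dropout scaling
        preds.append((h @ W2).item())
    preds = np.asarray(preds)
    return preds.mean(), preds.std()

state = rng.normal(size=8)
v_mean, v_unc = value_mc_dropout(state)
```

A state far from the training distribution would typically produce a larger spread across samples, which is what makes such a score usable for OOD detection; the cost is the many forward passes per state.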