USER-INTERACTIVE OFFLINE REINFORCEMENT LEARNING

Abstract

Offline reinforcement learning algorithms are still not trusted in practice due to the risk that the learned policy performs worse than the original policy that generated the dataset, or behaves in unexpected ways that are unfamiliar to the user. At the same time, offline RL algorithms are unable to tune their most important hyperparameter: the proximity of the learned policy to the original policy. We propose an algorithm that allows the user to tune this hyperparameter at runtime, thereby addressing both of the above-mentioned issues simultaneously. This allows users to start with the original behavior, grant successively greater deviation, and stop at any time if the policy deteriorates or its behavior strays too far from the familiar one.

1. INTRODUCTION

Recently, offline reinforcement learning (RL) methods have shown that it is possible to learn effective policies from a static, pre-collected dataset instead of directly interacting with the environment (Laroche et al., 2019; Fujimoto et al., 2019; Yu et al., 2020; Swazinna et al., 2021b). Since direct interaction is usually very costly in practice, these techniques have removed a major obstacle on the path to applying reinforcement learning to real-world problems. A major issue these algorithms still face is tuning their most important hyperparameter: the proximity of the learned policy to the original policy. Virtually all algorithms for the offline setting have such a hyperparameter, and it is inherently hard to tune, since no interaction with the real environment is permitted until final deployment. Practitioners thus risk being overly conservative (resulting in no improvement) or overly progressive (risking worse-performing policies) in their choice.

Additionally, one of the arguably largest obstacles to deploying RL-trained policies in most industrial control problems is that (offline) RL algorithms ignore the presence of domain experts, who can be seen as the users of the final product: the policy. Instead, most algorithms today can be seen as trying to make human practitioners obsolete. We argue that it is important to provide these users with a utility, something that makes them want to use RL solutions. Other research fields, such as machine learning for medical diagnoses, have already established the idea that domain experts are crucially important to solve the task and complement human users in various ways (Babbar et al., 2022; Cai et al., 2019; De-Arteaga et al., 2021; Fard & Pineau, 2011; Tang et al., 2020). We see our work in line with these and other researchers (Shneiderman, 2020; Schmidt et al., 2021), who suggest that the next generation of AI systems needs to adopt a user-centered approach and develop systems that behave more like intelligent tools, combining high levels of human control with high levels of automation. We seek to develop an offline RL method that does just that. Furthermore, we see giving control to the user as a requirement that may be enforced much more strictly as regulations on AI systems tighten: the EU's high-level expert group on AI has already recognized "human autonomy and oversight" as a key requirement for trustworthy AI in its Ethics Guidelines for Trustworthy AI (Smuha, 2019). In the future, solutions found with RL might thus be required by law to exhibit features that enable more human control.

In this paper, we thus propose a simple method that provides users with more control over how an offline RL policy behaves after deployment. The algorithm we develop trains a conditional policy that, after training, can adapt the trade-off between proximity to the data-generating policy on the one hand and estimated performance on the other.
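To make this concrete, below is a minimal sketch (in PyTorch) of one way such a user-tunable conditional policy could be set up. The network architecture, the random sampling of the trade-off parameter lam during training, and the helpers behavior (e.g., a behavior-cloned policy) and return_estimate (e.g., a return prediction from a learned dynamics model) are illustrative assumptions, not the authors' implementation.

    # Sketch only: a policy conditioned on a user-chosen trade-off parameter.
    import torch
    import torch.nn as nn

    class ConditionalPolicy(nn.Module):
        """Deterministic policy taking the trade-off parameter as extra input."""
        def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim + 1, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, action_dim), nn.Tanh(),  # actions in [-1, 1]
            )

        def forward(self, state: torch.Tensor, lam: torch.Tensor) -> torch.Tensor:
            # lam in [0, 1]: 0 -> stay with the data-generating behavior,
            # 1 -> purely maximize the estimated return.
            return self.net(torch.cat([state, lam], dim=-1))

    def training_loss(policy, behavior, return_estimate, states):
        """Sample a trade-off per state so one policy covers the whole spectrum."""
        lam = torch.rand(states.shape[0], 1)
        actions = policy(states, lam)
        # Proximity penalty: squared distance to the original (behavior) policy.
        proximity = ((actions - behavior(states)) ** 2).mean(dim=-1, keepdim=True)
        # Weighted objective: small lam emphasizes proximity, large lam performance.
        return ((1.0 - lam) * proximity - lam * return_estimate(states, actions)).mean()

At deployment, the user would query the same trained network with lam = 0 to reproduce the familiar original behavior, then raise lam in small steps, rolling back as soon as performance degrades or the behavior becomes too unfamiliar.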

