USER-INTERACTIVE OFFLINE REINFORCEMENT LEARNING

Abstract

Offline reinforcement learning algorithms are still not trusted in practice due to the risk that the learned policy performs worse than the original policy that generated the dataset, or behaves in unexpected ways that are unfamiliar to the user. At the same time, offline RL algorithms offer no way to tune their most important hyperparameter - the proximity of the learned policy to the original policy. We propose an algorithm that allows the user to tune this hyperparameter at runtime, thereby addressing both of the above-mentioned issues simultaneously. This allows users to start with the original behavior and grant successively greater deviation, as well as to stop at any time when the policy deteriorates or its behavior moves too far from the familiar one.

1. INTRODUCTION

Recently, offline reinforcement learning (RL) methods have shown that it is possible to learn effective policies from a static, pre-collected dataset instead of directly interacting with the environment (Laroche et al., 2019; Fujimoto et al., 2019; Yu et al., 2020; Swazinna et al., 2021b). Since direct interaction is usually very costly in practice, these techniques have removed a large obstacle on the path to applying reinforcement learning to real-world problems. A major issue that these algorithms still face is tuning their most important hyperparameter: the proximity to the original policy. Virtually all algorithms tackling the offline setting have such a hyperparameter, and it is inherently hard to tune, since no interaction with the real environment is permitted until final deployment. Practitioners thus risk being overly conservative (resulting in no improvement) or overly progressive (risking worse-performing policies) in their choice.

Additionally, one of the arguably largest obstacles on the path to deployment of RL-trained policies in most industrial control problems is that (offline) RL algorithms ignore the presence of domain experts, who can be seen as the users of the final product - the policy. Instead, most algorithms today can be seen as trying to make human practitioners obsolete. We argue that it is important to provide these users with a utility - something that makes them want to use RL solutions. Other research fields, such as machine learning for medical diagnoses, have already established the idea that domain experts are crucially important to solve the task and that automated systems should complement human users in various ways (Cai et al., 2019; De-Arteaga et al., 2021; Fard & Pineau, 2011; Tang et al., 2021; Babbar et al., 2020). We see our work in line with these and other researchers (Shneiderman, 2020; Schmidt et al., 2021), who suggest that the next generation of AI systems needs to adopt a user-centered approach and develop systems that behave more like intelligent tools, combining both high levels of human control and high levels of automation. We seek to develop an offline RL method that does just that. Furthermore, we see giving control to the user as a requirement that may in the future be enforced much more strongly as regulations regarding AI systems become stricter: the EU's high-level expert group on AI has already recognized "human autonomy and oversight" as a key requirement for trustworthy AI in its Ethics Guidelines for Trustworthy AI (Smuha, 2019). In the future, solutions found with RL might thus be required by law to exhibit features that enable more human control.

In this paper, we therefore propose a simple method that provides users with more control over how an offline RL policy will behave after deployment. The algorithm we develop trains a conditional policy that can, after training, adapt the trade-off between proximity to the data-generating policy on the one hand and estimated performance on the other. Close proximity to a known solution naturally facilitates trust, enabling conservative users to choose behavior they are more inclined to confidently deploy. That way, users benefit from the automation provided by offline RL (they need not handcraft controllers or even interactively choose actions), yet still remain in control, since they can, e.g., move the policy to a more conservative or a more liberal trade-off. We show how such an algorithm can be designed, compare its performance with a variety of offline RL baselines, and show that a user can achieve state-of-the-art performance with it.
Furthermore, we show that our method has advantages over simpler approaches such as training many policies with diverse hyperparameters. Finally, since we train a policy conditioned on one of the most important hyperparameters in offline RL, we show how a user could potentially use it to tune this hyperparameter. In many of our evaluations this works almost regret-free, since we observe that performance is mostly a smooth function of the hyperparameter.
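To make the core mechanism concrete, the sketch below shows one way such a trade-off-conditioned policy could be trained. It is only a minimal illustration under our own assumptions: the network sizes, the uniform sampling of the trade-off parameter lambda during training, and the stand-ins for the behavior model and the return estimate are hypothetical and not the implementation described later in the paper.

```python
# Minimal sketch of training a trade-off-conditioned offline policy (illustrative only).
import torch
import torch.nn as nn

state_dim, action_dim = 8, 2

# Policy receives the state plus the user-controlled trade-off parameter lambda.
policy = nn.Sequential(nn.Linear(state_dim + 1, 64), nn.ReLU(), nn.Linear(64, action_dim))
# Stand-in for a behavior-cloned model of the original (data-generating) policy.
behavior = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, action_dim))
# Stand-in for a differentiable return estimate, e.g. rollouts through a learned dynamics model.
return_estimate = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))

optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)

for step in range(1000):
    states = torch.randn(256, state_dim)  # placeholder for a batch of states from the offline dataset
    lam = torch.rand(256, 1)              # sample the trade-off uniformly so the policy learns all values
    actions = policy(torch.cat([states, lam], dim=-1))
    with torch.no_grad():
        behavior_actions = behavior(states)  # what the original policy would have done
    proximity = ((actions - behavior_actions) ** 2).mean(dim=-1, keepdim=True)
    performance = return_estimate(torch.cat([states, actions], dim=-1))
    # lambda = 0: reproduce the original behavior; lambda = 1: purely maximize estimated return.
    loss = ((1.0 - lam) * proximity - lam * performance).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

At deployment, the user would then simply choose lambda as an additional policy input, starting at 0 (the familiar original behavior) and increasing it gradually, stopping as soon as the behavior becomes too unfamiliar or performance degrades.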

2. RELATED WORK

Offline RL. Recently, a plethora of methods has been published that learn policies from static datasets. Early works, such as FQI and NFQ (Ernst et al., 2005; Riedmiller, 2005), were termed batch rather than offline since they did not explicitly address the issue that the data collection cannot be influenced. Instead, similarly to other batch methods (Depeweg et al., 2016; Hein et al., 2018; Kaiser et al., 2020), they assumed a uniform random data collection that made generalization to the real environment simpler. Among the first to explicitly address the limitations of the offline setting under unknown data collection were SPIBB(-DQN) (Laroche et al., 2019) in the discrete and BCQ (Fujimoto et al., 2019) in the continuous action case. Many works with different focuses followed: some treat discrete MDPs and come with bounds on performance that hold at least with a certain probability (Thomas et al., 2015; Nadjahi et al., 2019), but many more focus on the continuous setting: EMaQ, BEAR, BRAC, ABM, various DICE-based methods, REM, PEBL, PSEC-TD-0, CQL, IQL, BAIL, CRR, COIL, O-RAAC, OPAL, TD3+BC, and RvS (Ghasemipour et al., 2021; Kumar et al., 2019; Wu et al., 2019; Siegel et al., 2020; Nachum et al., 2019; Zhang et al., 2020; Agarwal et al., 2020; Smit et al., 2021; Pavse et al., 2020; Kumar et al., 2020; Kostrikov et al., 2021; Chen et al., 2019; Wang et al., 2020; Liu et al., 2021; Urpí et al., 2021; Ajay et al., 2020; Brandfonbrener et al., 2021; Emmons et al., 2021) are just a few of the model-free methods proposed over the last few years. Additionally, many model-based as well as hybrid approaches have been proposed, such as MOPO, MOReL, MOOSE, COMBO, RAMBO, and WSBC (Yu et al., 2020; Kidambi et al., 2020; Swazinna et al., 2021b; Yu et al., 2021; Rigter et al., 2022; Swazinna et al., 2021a). Even approaches that train policies purely by supervised learning, conditioning on performance, have been proposed (Peng et al., 2019; Emmons et al., 2021; Chen et al., 2021). Model-based algorithms more often rely on model uncertainty, while model-free methods use a more direct behavior regularization approach (sketched below).

Offline policy evaluation or offline hyperparameter selection is concerned with evaluating (or at least ranking) the policies found by an offline RL algorithm, in order to either pick the best-performing one or to tune hyperparameters. Often, dynamics models are used to evaluate policies found by model-free algorithms, but model-free evaluation methods exist as well (Hans et al., 2011; Paine et al., 2020; Konyushova et al., 2021; Zhang et al., 2021b; Fu et al., 2021). Unfortunately, but also intuitively, this problem is rather hard: if a method were found that assesses policy performance more accurately than the mechanism the offline algorithm uses for training, it should arguably be used for training instead. Also, the general dilemma of not knowing in which parts of the state-action space we know enough to optimize behavior always seems to remain. Works such as Zhang et al. (2021a); Lu et al. (2021) become applicable if limited online evaluations are allowed, making hyperparameter tuning much more viable.
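To make this distinction, and the proximity hyperparameter it revolves around, concrete, the two families can be summarized roughly as follows; the notation is ours and serves only as an illustration, not as the formulation of any single cited method. Model-based methods typically penalize the reward of rollouts through a learned dynamics model with an uncertainty estimate, while model-free methods typically regularize the policy towards the behavior policy \beta that generated the data:

\tilde{r}(s,a) = \hat{r}(s,a) - \alpha \, u(s,a) \qquad \text{(model-based: uncertainty-penalized reward)}

\max_\pi \; \mathbb{E}_{s \sim \mathcal{D}}\big[ Q(s, \pi(s)) \big] - \alpha \, D\big(\pi(\cdot \mid s), \beta(\cdot \mid s)\big) \qquad \text{(model-free: behavior regularization)}

Here u(s,a) is an uncertainty estimate of the learned model, D is some divergence or distance between policies, and \alpha plays in both cases the role of the proximity hyperparameter that is so hard to choose without online interaction.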
Offline RL with online adaptation. Other works propose an online learning phase that follows after offline learning has concluded. In the most basic form, Kurenkov & Kolesnikov (2021) introduce an online evaluation budget that lets them find the best set of hyperparameters for an offline RL algorithm given limited online evaluation resources. In an effort to minimize such a budget, Yang et al. (2022) train a set of policies spanning a diverse set of uncertainty-performance trade-offs. Ma et al. (2021) propose a conservative adaptive penalty that penalizes unknown behavior more during the beginning and less during the end of training, leading to safer policies during training. In Pong et al.

