DRIFT DETECTION IN EPISODIC DATA: DETECT WHEN YOUR AGENT STARTS FALTERING

Anonymous

Abstract

Detecting deterioration of agent performance in dynamic environments is challenging due to the non-i.i.d. nature of the observed performance. We consider an episodic framework, where the objective is to detect when an agent begins to falter. We devise a hypothesis testing procedure for non-i.i.d. rewards, which is optimal under certain conditions. To apply the procedure sequentially in an online manner, we also suggest a novel Bootstrap mechanism for False Alarm Rate control (BFAR). We demonstrate our procedure in problems where the rewards are neither independent, nor identically distributed, nor normally distributed. The statistical power of the new testing procedure is shown to outperform that of alternative tests, often by orders of magnitude, for a variety of environment modifications (which cause deterioration in agent performance). Our detection method is entirely external to the agent and, in particular, does not require model-based learning. Furthermore, it can be applied to detect changes or drifts in any episodic signal.

1. INTRODUCTION

Reinforcement learning (RL) algorithms have recently demonstrated impressive success in a variety of sequential decision-making problems (Badia et al., 2020; Hessel et al., 2018). While most RL works focus on the maximization of rewards under various conditions, a key issue in real-world RL tasks is the safety and reliability of the system (Dulac-Arnold et al., 2019; Chan et al., 2020), arising in both offline and online settings. In offline settings, comparing agent performance across different environments is important for generalization (e.g., in sim-to-real and transfer learning). The comparison may indicate the difficulty of the problem or help to select the right learning algorithm. Uncertainty estimation, which could help to address this challenge, is currently considered a hard problem in RL, in particular for model-free methods (Yu et al., 2020). In online settings, where a fixed, already-trained agent runs continuously, its performance may be affected (gradually or abruptly) by changes in the controlled system or its surroundings, or by reaching new states beyond those explored during training. Some works address the robustness of the agent to such changes (Lecarpentier & Rachelson, 2019; Lee et al., 2020). However, noticing the changes may be equally important, as it allows us to fall back to manual control, send the agent to re-train, guide diagnosis, or even bring the agent to a halt. This is particularly critical in real-world problems such as health care and autonomous driving (Zhao et al., 2019), where agents are required to be fixed and stable: interventions in the policy are often limited or forbidden (Matsushima et al., 2020), but any performance degradation should be detected as soon as possible.

Many sequential statistical tests exist for the detection of mean degradation in a random process.
However, common methods (Page, 1954; Lan, 1994; Harel et al., 2014) assume independent and identically distributed (i.i.d.) samples, while in RL the feedback from the environment is usually both highly correlated over consecutive time-steps and varying over the lifetime of the task (Korenkevych et al., 2019). This is demonstrated in Fig. 1. A possible solution is to apply statistical tests to large blocks of time-steps assumed to be i.i.d. (Ditzler et al., 2015). Since many RL applications consist of repeating episodes, such a block partition can be applied in a natural way (Colas et al., 2019). However, this approach requires complete episodes for change detection, while a faster response is often required. Furthermore, naively applying a statistical test to the accumulated feedback (e.g., sum of rewards) from complete episodes ignores the dependencies within the episodes and may miss vital information, leading to highly sub-optimal tests.

In this work, we devise an optimal test for the detection of degradation of the rewards in an episodic RL task (or in any other episodic signal), based on the covariance structure within the episodes. Even in the absence of the assumptions that guarantee its optimality, the test is still asymptotically superior to the common approach of comparing the means (Colas et al., 2019). The test can detect changes and drifts in both the offline and the online settings defined above. In addition, for the online setting, we suggest a novel Bootstrap mechanism to control the False Alarm Rate (BFAR) through adjustment of test thresholds in sequential tests of episodic signals. The suggested procedures rely on the ability to estimate the correlations within the episodes, e.g., through a "reference dataset" of episodes.
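To make the baseline concrete, the mean-comparison approach collapses each episode into a single return and applies a standard two-sample test. The sketch below is a hypothetical minimal implementation (using a Welch-style z-statistic with a normal approximation, not any specific test from the cited works); it also makes explicit why this baseline discards all within-episode structure.

```python
from math import erf, sqrt

import numpy as np


def episode_mean_test(ref_episodes, new_episodes, alpha=0.05):
    """Naive degradation test: compare mean episode returns.

    Each input is an array of shape (n_episodes, episode_len).
    All within-episode structure is collapsed into a single
    return per episode, so correlations between time-steps
    are ignored -- the weakness discussed in the text.
    """
    ref_ret = ref_episodes.sum(axis=1)   # one return per reference episode
    new_ret = new_episodes.sum(axis=1)   # one return per test episode
    # Welch-style z-statistic for the one-sided alternative
    # "new mean return < reference mean return"
    se = sqrt(ref_ret.var(ddof=1) / len(ref_ret)
              + new_ret.var(ddof=1) / len(new_ret))
    z = (new_ret.mean() - ref_ret.mean()) / se
    # One-sided normal approximation: small p-value iff z is very negative
    p_value = 0.5 * (1.0 + erf(z / sqrt(2.0)))
    return z, p_value, p_value < alpha
```

A test of this form only reacts once complete episodes with degraded total return accumulate; a drop concentrated in a few time-steps of each episode dilutes into the episode-level variance.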
Since the test is applied directly to the rewards, it is model-free in the following senses: the underlying process is not assumed to be known, to be Markov, or to be observable at all (as opposed to other works, e.g., Banerjee et al. (2016)), and we require no knowledge about the process or the running policy. Furthermore, as the rewards are simply treated as episodic time-series, the test can be similarly applied to detect changes in any episodic signal.

We demonstrate the new procedures in the environments of Pendulum (OpenAI), HalfCheetah and Humanoid (MuJoCo; Todorov et al., 2012). BFAR is shown to successfully control the false alarm rate. The covariance-based degradation test detects degradation faster and more often than three alternative tests, in certain cases by orders of magnitude.

Section 3 formulates the offline setup (individual tests) and the online setup (sequential tests). Section 4 introduces the model of an episodic signal and derives an optimal test for degradation in such a signal. Section 5 shows how to adjust the test for online settings and control the false alarm rate. Section 6 describes the experiments, Section 7 discusses related works, and Section 8 summarizes.

To the best of our knowledge, we are the first to exploit the covariance between rewards in the post-training phase to test for changes in RL-based systems. The contributions of this paper are (i) a new framework for model-free statistical tests on episodic (non-i.i.d.) data with trusted reference episodes; (ii) an optimal test (under certain conditions) for degradation in episodic data; and (iii) a novel bootstrap mechanism that controls the false alarm rate of sequential tests on episodic data.
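The general idea behind bootstrap-based false-alarm-rate control can be illustrated with a toy sketch (this is not the paper's BFAR algorithm, only a generic version of the underlying principle): resample whole reference episodes to simulate no-change runs, and set the sequential-test threshold so that an entire run of repeated tests raises a false alarm with the desired probability.

```python
import numpy as np


def bootstrap_threshold(ref_returns, horizon, alpha=0.05,
                        n_boot=2000, seed=0):
    """Calibrate a sequential-test threshold by bootstrap.

    Generic illustration only (not the paper's exact BFAR mechanism):
    resample whole reference episode-returns to simulate `n_boot`
    no-change runs of `horizon` episodes, track each run's running
    test statistic, and pick the threshold as the alpha-quantile of
    the per-run extremes, so that a full no-change run raises a
    false alarm with probability ~alpha.
    """
    rng = np.random.default_rng(seed)
    mu, sd = ref_returns.mean(), ref_returns.std(ddof=1)
    extremes = np.empty(n_boot)
    t = np.arange(1, horizon + 1)
    for b in range(n_boot):
        # Simulated no-change run: episodes resampled with replacement.
        run = rng.choice(ref_returns, size=horizon, replace=True)
        # Running statistic: standardized cumulative mean; its most
        # negative value over the run is the run's "extreme".
        stat = (np.cumsum(run) / t - mu) / (sd / np.sqrt(t))
        extremes[b] = stat.min()
    # Alarm whenever the running statistic drops below this threshold.
    return np.quantile(extremes, alpha)
```

Because the same statistic is tested after every episode, the calibrated threshold is necessarily stricter than the pointwise single-test threshold; the bootstrap accounts for this multiplicity without requiring distributional assumptions on the returns.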

2. PRELIMINARIES

Reinforcement learning and episodic framework: A Reinforcement Learning (RL) problem is usually modeled as a decision process, where a learning agent has to repeatedly make decisions that affect its future states and rewards. The process is often organized as a finite sequence of time-steps (an episode) that repeats multiple times in different variants, e.g., with different initial states. Common examples are board and video games (Brockman et al., 2016), as well as more realistic problems such as repeating drives in autonomous driving tasks. Once the agent is fixed (which is the case in the scope of this work), the rewards of the decision process essentially reduce to a (decision-free) random process $\{X_t\}_{t=1}^n$, which can be defined by its PDF $f_{\{X_t\}_{t=1}^n}: \mathbb{R}^n \to [0,\infty)$. The $\{X_t\}$ usually depend on each other: even in the popular Markov Decision Process (Bellman, 1957), where the dependence goes only a single step back, long-term correlations may still carry information if the states are not observable by the agent.

Hypothesis tests: Consider a parametric probability function $p(X|\theta)$ describing a random process, and consider two different hypotheses $H_0, H_A$ determining the value (simple hypothesis) or allowed values (complex hypothesis) of $\theta$. When designing a test to decide between the hypotheses, the basic metrics for the test's efficacy are its significance $P(\text{not reject } H_0 \mid H_0) = 1-\alpha$ and its power $P(\text{reject } H_0 \mid H_A) = \beta$. A statistical hypothesis test with significance $1-\alpha$ and power $\beta$ is said to be optimal if any test with at least as high significance $1-\alpha' \ge 1-\alpha$ has smaller power $\beta' \le \beta$. The likelihood of the hypothesis $H: \theta \in \Theta$ given data $X$ is defined as $L(H|X) = \sup_{\theta \in \Theta} p(X|\theta)$. According to the Neyman-Pearson Lemma (Neyman et al., 1933), a threshold-test on the likelihood ratio $LR(H_0, H_A|X) = L(H_0|X)/L(H_A|X)$ is optimal.
In a threshold-test, the threshold is uniquely determined by the desired significance level $\alpha$, though it is often difficult to compute given $\alpha$.
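As a minimal illustration of the Neyman-Pearson construction, consider the toy case of i.i.d. Gaussian samples with known variance and two simple hypotheses on the mean (this example and its function names are ours, not the paper's): the log likelihood ratio has a closed form, and the rejection threshold for a given $\alpha$ can be calibrated by Monte Carlo simulation under $H_0$ when no closed-form quantile is available.

```python
import numpy as np


def log_lr(x, mu0=0.0, mu1=-1.0, sigma=1.0):
    """Log likelihood ratio log L(H0|x) - log L(HA|x) for i.i.d.
    Gaussian samples with known variance, under the simple
    hypotheses H0: mu = mu0 vs HA: mu = mu1 (toy case only)."""
    # For Gaussians the log-LR reduces to a linear function of sum(x):
    # sum_i [ (x_i - mu1)^2 - (x_i - mu0)^2 ] / (2 sigma^2)
    n = len(x)
    return ((mu1**2 - mu0**2) * n / 2 + (mu0 - mu1) * np.sum(x)) / sigma**2


def calibrate_threshold(n, alpha=0.05, n_sim=20000, seed=0):
    """Monte Carlo threshold calibration: reject H0 when the log-LR
    falls below the alpha-quantile of its H0 distribution, giving
    significance ~(1 - alpha) as in the Neyman-Pearson construction."""
    rng = np.random.default_rng(seed)
    sims = np.array([log_lr(rng.normal(0.0, 1.0, n)) for _ in range(n_sim)])
    return np.quantile(sims, alpha)
```

A small likelihood ratio means the data are better explained by $H_A$ than by $H_0$, so the test rejects $H_0$ below the threshold; the Monte Carlo step stands in for the often intractable analytic computation of the threshold mentioned above.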

