CRYSTALBOX: EFFICIENT MODEL-AGNOSTIC EXPLANATIONS FOR DEEP RL CONTROLLERS

Abstract

Practical adoption of Reinforcement Learning (RL) controllers is hindered by a lack of explainability. In particular, in input-driven environments such as computer systems, where the state dynamics are affected by external processes, explainability can serve as a key step towards increased real-world deployment of RL controllers. In this work, we propose CrystalBox, a novel framework for generating black-box post-hoc explanations for RL controllers in input-driven environments. CrystalBox is built on the principle of separating policy learning from explanation computation. Because the explanations are generated entirely outside the training loop, CrystalBox generalizes to a large family of input-driven RL controllers. To generate explanations, CrystalBox combines the natural decomposability of reward functions in systems environments with the explanatory power of decomposed returns. CrystalBox predicts these decomposed future returns using on-policy Q-function approximations. Our design leverages two complementary approaches for this computation: sampling-based and learning-based methods. We evaluate CrystalBox with RL controllers in real-world settings and demonstrate that it generates high-fidelity explanations.

1. INTRODUCTION

Deep Reinforcement Learning (DRL) based solutions outperform manually designed heuristics in many computer systems and networking problems in lab settings. DRL agents have been successful in a wide variety of areas, such as Adaptive Bitrate Streaming (Mao et al., 2017), congestion control (Jay et al., 2019), cluster scheduling (Mao et al., 2019b), and network traffic optimization (Chen et al., 2018). However, because DRL agents choose their actions in a black-box manner, systems operators are reluctant to deploy them in real-world systems (Meng et al., 2020). Hence, as with many ML algorithms, the lack of explainability and interpretability of RL agents has triggered a quest for eXplainable Reinforcement Learning (XRL) algorithms and techniques.

There are two major research directions in the explainability of deep RL. The first line of work, which can be described as feature-based methods, transfers established XAI results developed for supervised learning algorithms to deep RL settings. These methods focus on tailoring commonly used post-hoc explainers for classification and regression tasks, such as saliency maps (Zahavy et al., 2016; Iyer et al., 2018; Greydanus et al., 2018; Puri et al., 2019) or model distillation (Bastani et al., 2018; Verma et al., 2018; Zhang et al., 2020). While such adapted techniques work well for some RL applications, it is becoming apparent that these types of explanations are not sufficient to explain the behavior of complex agents in many real-world settings (Puiutta & Veith, 2020; Madumal et al., 2020). For example, the inherently time-dependent character of RL's decision-making process cannot be easily captured by feature-based methods. In the second line of work, XRL techniques help the user understand the agent's dynamic behavior (Yau et al., 2020; Cruz et al., 2021; Juozapaitis et al., 2019).
The main underlying idea of this class of XRL methods is to reveal to the user how the agent 'views the future', as most algorithms compute an explanation using various forms of the agent's future beliefs, such as future rewards or goals. For example, Juozapaitis et al. (2019) proposed to modify a DQN agent to decompose its Q-function into interpretable components. van der Waa et al. (2018) introduced the concept of explaining two actions by contrasting their future consequences.

In this work, we present CrystalBox, a novel framework for extracting post-hoc black-box explanations. CrystalBox is designed to work with input-driven RL environments, a rich class of RL environments that includes systems and networking domains. Input-driven environments have two distinctive characteristics compared to standard RL settings: they operate over input data traces (where a trace can be a sequence of network condition measurements), and they often have a decomposable reward function. Traces are difficult to model and make both policy learning and explainability more challenging: learning a self-explainable policy can lead to significant performance degradation. Hence, we build CrystalBox on the principle of separation between policy learning and explanation computation. Our next key observation is that, thanks to the decomposable reward property, we can adapt the idea of decomposed returns (Anderson et al., 2019) as the basis for explanations. Below, we summarize our main contributions.

1. We propose the first post-hoc black-box explanation framework for input-driven RL environments.
2. We demonstrate that decomposed return-based explanations (Anderson et al., 2019) are a good fit for input-driven RL environments and propose a novel method for generating decomposed future returns using an on-policy Q-function.
3. We design two complementary approaches to compute on-policy Q-function approximations outside of the RL agent's training loop: sampling-based and learning-based methods.
4. We implement CrystalBox and evaluate it on input-driven RL environments. We demonstrate that CrystalBox produces high-fidelity explanations in real-world settings.
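To make the sampling-based approach concrete, the sketch below estimates per-component returns by Monte Carlo rollouts: take the queried action once, then follow the agent's own (fixed) policy, accumulating each reward component separately. The `ToyEnv` class, its `clone_from` interface, and the per-component reward dictionary returned by `step` are hypothetical stand-ins for illustration, not CrystalBox's actual API.

```python
class ToyEnv:
    """Deterministic stand-in for an input-driven simulator.

    Illustrative only: a real environment would replay a network trace.
    """
    def __init__(self, t=0):
        self.t = t

    def clone_from(self, state):
        # Restart a simulator copy at the given state (hypothetical API).
        return ToyEnv(t=state)

    def step(self, action):
        self.t += 1
        # The reward is returned per component rather than as one scalar.
        return self.t, {"quality": 1.0, "stall": -0.5}, False


def decomposed_q_estimate(env, policy, state, action,
                          n_rollouts=8, horizon=10, gamma=0.99):
    """Monte Carlo estimate of per-component returns Q_c(s, a):
    take `action` at `state`, then follow the agent's own policy."""
    totals = {}
    for _ in range(n_rollouts):
        sim = env.clone_from(state)
        a, discount, acc = action, 1.0, {}
        for _ in range(horizon):
            s_next, components, done = sim.step(a)
            for name, r in components.items():
                acc[name] = acc.get(name, 0.0) + discount * r
            if done:
                break
            discount *= gamma
            a = policy(s_next)  # on-policy continuation
        for name, v in acc.items():
            totals.setdefault(name, []).append(v)
    # Average each component's discounted return over the rollouts.
    return {name: sum(v) / len(v) for name, v in totals.items()}
```

Because the rollouts query the policy only as a black box, this computation stays entirely outside the training loop, matching the separation principle above.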

2. SYSTEMS ENVIRONMENTS

Systems environments are a rich class of environments that represent dynamics in computer systems, which are fundamentally different from those of traditional RL environments. We provide an overview of our representative examples, Adaptive Bitrate Streaming and Congestion Control, as well as various other systems environments. For a thorough discussion of these environments, please see Appendix A.2. In this section, we highlight the characteristics that we leverage in our explainer: the decomposability of reward functions and the notion of traces in these settings.

Adaptive Bitrate Streaming (ABR)

In adaptive video streaming, there are two communicating entities: a client, such as a Netflix subscriber, who is streaming a video over the Internet from a server, such as a Netflix server. The video is typically divided into small, seconds-long chunks and encoded, in advance, at various discrete bitrates. The goal of the ABR controller is to maximize the client's Quality of Experience (QoE) by choosing the most suitable bitrate for the next video chunk based on the network conditions. The controller ensures that the client receives an uninterrupted, high-quality video stream while minimizing abrupt changes in video quality and stalling. QoE in this setting is typically defined as a linear combination that rewards higher quality and penalizes both quality changes and stalling (Mok et al., 2011). Note that network conditions are non-deterministic and constitute the main source of uncertainty in this setting. For example, the time taken to send a chunk depends on the network throughput. These network conditions define the trace in ABR: concretely, a trace is a sequence of network throughput values over time. Thus, an ABR environment is modeled using network traces that represent network conditions.

Congestion Control (CC)

Congestion control protocols running on end-user devices are responsible for adaptively determining the most suitable transmission rate for data transfers over a shared, dynamic network. When a user transmits data at a rate that the network cannot support, the user experiences high queuing delays and packet losses. Deep RL-based solutions have shown superior performance in this setting (Jay et al., 2019; Abbasloo et al., 2020). As in the ABR environment, traces in this setting also constitute a time series of throughput values. The reward function in congestion control incentivizes higher sending rates and penalizes delay and loss.

Other Systems Environments

Deep RL offers high performance in cluster scheduling (Mao et al., 2019b), network planning (Zhu et al., 2021), network traffic engineering (Chen et al., 2018), database query optimization (Marcus et al., 2019), and several other systems control problems.
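The decomposable reward functions described above for ABR and CC can be sketched as follows. The component names and coefficient values here are illustrative placeholders, not the exact formulations used by deployed controllers.

```python
def abr_reward_components(bitrate, prev_bitrate, stall_time,
                          smooth_penalty=1.0, stall_penalty=4.3):
    """Per-chunk QoE split into components: reward quality, penalize
    quality changes and stalling. Coefficients are illustrative."""
    return {
        "quality": bitrate,
        "smoothness": -smooth_penalty * abs(bitrate - prev_bitrate),
        "stalling": -stall_penalty * stall_time,
    }

def cc_reward_components(throughput, delay, loss_rate,
                         delay_penalty=0.5, loss_penalty=10.0):
    """Per-step congestion-control reward: incentivize the sending rate,
    penalize delay and loss. Coefficients are illustrative."""
    return {
        "throughput": throughput,
        "delay": -delay_penalty * delay,
        "loss": -loss_penalty * loss_rate,
    }

def total_reward(components):
    """The scalar reward the agent trains on is the sum of components."""
    return sum(components.values())
```

Because the scalar reward is a sum of interpretable components, the future return decomposes the same way, component by component, which is exactly the property CrystalBox exploits to attribute an action's value to quality, smoothness, stalling, and so on.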

