CAUSAL INFERENCE Q-NETWORK: TOWARD RESILIENT REINFORCEMENT LEARNING

Abstract

Deep reinforcement learning (DRL) has demonstrated impressive performance in various gaming simulators and real-world applications. In practice, however, a DRL agent may receive faulty observations caused by abrupt interferences such as blackouts, frozen screens, and adversarial perturbations. Designing a resilient DRL algorithm for these rare but mission-critical and safety-critical scenarios is an important yet challenging task. In this paper, we consider a resilient DRL framework with observational interferences. Under this framework, we discuss the importance of causal relations and propose a causal-inference-based DRL algorithm called the causal inference Q-network (CIQ). We evaluate the performance of CIQ in several benchmark DRL environments with different types of interferences. Our experimental results show that the proposed CIQ method achieves higher performance and greater resilience against observational interferences.

1. INTRODUCTION

Deep reinforcement learning (DRL) methods have shown enhanced performance and gained widespread application (Mnih et al., 2015; 2016; Ecoffet et al., 2019; Silver et al., 2017; Mao et al., 2017), including robot learning (Gu et al., 2017) and navigation systems (Tai et al., 2017; Nagabandi et al., 2018). However, most successful demonstrations of these DRL methods are trained and deployed under well-controlled conditions. In contrast, real-world use cases often encounter inevitable observational uncertainty (Grigorescu et al., 2020; Hafner et al., 2018; Moreno et al., 2018) from an external attacker (Huang et al., 2017) or a noisy sensor (Fortunato et al., 2018; Lee et al., 2018). For example, an online video game may suffer sudden blackouts or frame skipping due to network instability, and a driver on the road may be temporarily blinded when facing the sun. Such abrupt interference with the observation can cause serious issues for DRL algorithms. Unlike machine learning tasks that involve only a single decision at a time (e.g., image classification), an RL agent has to deal with a dynamic (Schmidhuber, 1992) and encoded state (Schmidhuber, 1991; Kaelbling et al., 1998) and to anticipate future rewards. DRL-based systems are therefore likely to propagate and even amplify risks (e.g., delays and noisy pulsed signals in sensor fusion (Yurtsever et al., 2020; Johansen et al., 2015)) induced by uncertain interference. In this paper, we investigate the ability of an RL agent to withstand unforeseen, rare, adversarial, and potentially catastrophic interferences, and to recover and adapt by improving itself in reaction to these events. We consider a resilient RL framework with observational interferences: at each time step, the agent's observation is subject to a given type of sudden interference with a predefined probability. Whether or not an observation has been interfered with is referred to as the interference label.
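The interference setting above can be sketched as an observation wrapper. This is a minimal illustration, not the paper's implementation: the class name, interface, and default noise parameters are our own choices. With a predefined probability, the observation is replaced by an interfered version, and the binary interference label records whether this happened.

```python
import numpy as np

class InterferenceWrapper:
    """Illustrative wrapper that corrupts observations with probability p
    and emits the binary interference label (hypothetical interface)."""

    def __init__(self, p=0.1, kind="gaussian", sigma=0.5, seed=0):
        self.p, self.kind, self.sigma = p, kind, sigma
        self.rng = np.random.default_rng(seed)
        self.last_clean = None  # previous clean frame, for the frozen-screen case

    def interfere(self, obs):
        """Return (possibly corrupted observation, interference label)."""
        label = self.rng.random() < self.p
        out = obs
        if label:
            if self.kind == "blackout":
                out = np.zeros_like(obs)            # observation goes dark
            elif self.kind == "gaussian":
                out = obs + self.rng.normal(0.0, self.sigma, size=obs.shape)
            elif self.kind == "frozen" and self.last_clean is not None:
                out = self.last_clean               # repeat the previous frame
        self.last_clean = obs
        return out, int(label)
```

During training the label is given to the agent alongside the observation; at test time only the (possibly corrupted) observation is exposed. An adversarial-attack variant would replace the Gaussian branch with a gradient-based perturbation of the agent's Q-network.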
Specifically, to train a resilient agent, we provide the agent with the interference labels during training. For instance, the labels could be produced by a noise generator that records, as a binary causation label, whether the agent observes an intervened state at each moment. Treating the labels as interventions on the environment, the RL agent learns to predict the binary causation label and to embed a latent state into its model. When the trained agent is deployed in the field (i.e., the testing phase), however, the agent only receives the interfered observations, is agnostic to the interference labels, and needs to act resiliently against the interference. To be resilient against interference, the agent must diagnose its observations and make correct inferences about the reward information; that is, it has to reason about what leads to the desired rewards despite the irrelevant intermittent interference. To equip an RL agent with this reasoning capability, we exploit the causal inference framework. Intuitively, a causal inference model for observational interference uses an unobserved confounder (Pearl, 2009; 2019; 1995b; Saunders et al., 2018; Bareinboim et al., 2015) to capture the effect of the interference on the reward collected from the environment. When such a confounder is available, the RL agent can focus on the confounder for relevant reward information and make the best decision. As illustrated in Figure 1, we propose a causal-inference-based DRL algorithm termed the causal inference Q-network (CIQ). During training, when the interference labels are available, the CIQ agent implicitly learns a causal inference model by embedding the confounder into a latent state; at the same time, it trains a Q-network on the latent state for decision making. At test time, the CIQ agent uses the learned model to estimate the confounding latent state and the interference label.
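The architecture just described can be sketched in a few lines of numpy. This is a toy forward pass with random weights, not the paper's network: the layer sizes, history length, and names are illustrative assumptions. An encoder maps the (possibly interfered) observation to a latent state, one head predicts the binary interference label, and a Q-head acts on a state built from the recent history of latents.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

class CIQSketch:
    """Toy numpy sketch of the CIQ idea (sizes and names are illustrative):
    encoder -> latent (confounder) state z_t, a sigmoid head predicting the
    interference label, and a Q-head over a history of latents."""

    def __init__(self, obs_dim=4, latent_dim=8, n_actions=2, history=2):
        self.history = history
        self.W_enc = rng.normal(0, 0.1, (obs_dim, latent_dim))
        self.w_label = rng.normal(0, 0.1, latent_dim)            # interference head
        self.W_q = rng.normal(0, 0.1, (history * latent_dim, n_actions))
        self.latents = []

    def step(self, obs):
        z = relu(obs @ self.W_enc)                                # latent state
        p_interf = 1.0 / (1.0 + np.exp(-(z @ self.w_label)))      # label estimate
        self.latents.append(z)
        hist = self.latents[-self.history:]
        while len(hist) < self.history:                           # pad early steps
            hist = [np.zeros_like(z)] + hist
        causal_state = np.concatenate(hist)   # causal inference state from history
        q_values = causal_state @ self.W_q
        return q_values, p_interf
```

During training, one would combine a TD loss on `q_values` with a binary cross-entropy loss between `p_interf` and the given interference label; at test time the label head's output substitutes for the unavailable label.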
The history of latent states is combined into a causal inference state, which captures the information the Q-network needs to collect rewards in the environment despite the observational interference. In this paper, we evaluate the performance of our method in four environments: 1) Cartpole-v0, a continuous control environment (Brockman et al., 2016); 2) the 3D graphical Banana Collector (Juliani et al., 2018); 3) LunarLander-v2, a Box2D environment (Brockman et al., 2016); and 4) pixel Cartpole, visual learning from the pixel inputs of Cartpole. For each environment, we consider four types of interference: (a) blackout, (b) Gaussian noise, (c) frozen screen, and (d) adversarial attack. In the testing phase, which mimics the practical scenario where the agent may receive interfered observations but is unaware of the true interference labels (i.e., whether interference happens or not), the results show that our CIQ method performs better and is more resilient against all four types of interference. Furthermore, to benchmark the level of resilience of different RL models, we propose a new robustness measure, called CLEVER-Q, for evaluating the robustness of Q-network-based RL algorithms. The idea is to compute a lower bound on the observation noise level such that the greedy action of the Q-network remains the same under any noise below this bound. According to this robustness analysis, our CIQ algorithm indeed achieves higher CLEVER-Q scores than the baseline methods. The main contributions of this paper are 1) a framework for evaluating the resilience of DRL methods under abrupt observational interferences; 2) the proposed CIQ architecture and algorithm for training a resilient DRL agent; and 3) an extreme-value-theory-based robustness metric (CLEVER-Q) for quantifying the resilience of Q-network-based RL algorithms.
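A CLEVER-style score of this kind can be sketched as follows. The sketch is simplified relative to the full method: CLEVER fits a reverse Weibull distribution to per-batch maxima of gradient norms to estimate the local Lipschitz constant, whereas here we simply take the empirical maximum; the function names, sampling scheme, and finite-difference gradients are our own assumptions. The returned value is a heuristic lower bound on the perturbation norm that leaves the greedy action unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)

def clever_q(q_fn, x, radius=0.5, n_batches=20, batch_size=50, eps=1e-4):
    """Simplified CLEVER-Q-style score for a Q-network q_fn at observation x.
    Full method: fit a reverse Weibull to batch maxima (extreme value theory);
    here we use the empirical max of sampled gradient norms instead."""
    q0 = q_fn(x)
    a_star = int(np.argmax(q0))
    # margin between the greedy action's value and the runner-up's
    gap = q0[a_star] - np.max(np.delete(q0, a_star))

    def margin(xp):
        q = q_fn(xp)
        return q[a_star] - np.max(np.delete(q, a_star))

    def grad_norm(xp):
        # finite-difference gradient of the margin function at xp
        g = np.zeros_like(xp)
        for i in range(xp.size):
            e = np.zeros_like(xp)
            e[i] = eps
            g[i] = (margin(xp + e) - margin(xp - e)) / (2 * eps)
        return np.linalg.norm(g)

    batch_maxima = []
    for _ in range(n_batches):
        samples = x + rng.uniform(-radius, radius, (batch_size, x.size))
        batch_maxima.append(max(grad_norm(s) for s in samples))
    lipschitz_est = max(batch_maxima)   # stand-in for the EVT estimate
    # any perturbation smaller than gap / L cannot flip the greedy action
    return min(gap / lipschitz_est, radius)
```

For a linear Q-function the score is exact: with `q_fn = lambda s: s @ np.eye(2)` at `x = [1, 0]`, the margin gap is 1 and the gradient norm is the square root of 2, so the bound is 1 over the square root of 2.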

2. RELATED WORKS

Causal Inference for Reinforcement Learning: Causal inference (Greenland et al., 1999; Pearl, 2009; Pearl et al., 2016; Pearl, 2019; Robins et al., 1995) has been used to empower learning under noisy observations and to improve the interpretability of deep learning models (Shalit et al., 2017; Louizos et al., 2017), with further efforts (Jaber et al., 2019; Forney et al., 2017; Bareinboim et al., 2015) on causal online learning and bandit methods. Defining causation and applying the causal inference framework to DRL remains relatively unexplored. Recent works (Lu et al., 2018;



Figure 1: (a) The proposed causal inference Q-network (CIQ) training and testing framework, where the latent state is an unobserved (hidden) confounder; (b) a 3D navigation task, Banana Collector (Juliani et al., 2018); and (c) a video game, LunarLander (Brockman et al., 2016).

