CAUSAL INFERENCE Q-NETWORK: TOWARD RESILIENT REINFORCEMENT LEARNING

Abstract

Deep reinforcement learning (DRL) has demonstrated impressive performance in various gaming simulators and real-world applications. In practice, however, a DRL agent may receive faulty observations caused by abrupt interferences such as black-outs, frozen screens, and adversarial perturbations. Designing a resilient DRL algorithm against these rare but mission-critical and safety-critical scenarios is an important yet challenging task. In this paper, we consider a resilient DRL framework with observational interferences. Under this framework, we discuss the importance of the causal relation and propose a causal inference based DRL algorithm called causal inference Q-network (CIQ). We evaluate the performance of CIQ in several benchmark DRL environments with different types of interference. Our experimental results show that the proposed CIQ method achieves higher performance and is more resilient against observational interferences.

1. INTRODUCTION

Deep reinforcement learning (DRL) methods have shown enhanced performance and gained widespread applications (Mnih et al., 2015; 2016; Ecoffet et al., 2019; Silver et al., 2017; Mao et al., 2017), including improved robot learning (Gu et al., 2017) and navigation systems (Tai et al., 2017; Nagabandi et al., 2018). However, most successful demonstrations of these DRL methods are trained and deployed under well-controlled situations. In contrast, real-world use cases often encounter inevitable observational uncertainty (Grigorescu et al., 2020; Hafner et al., 2018; Moreno et al., 2018) from an external attacker (Huang et al., 2017) or a noisy sensor (Fortunato et al., 2018; Lee et al., 2018). For example, playing online video games may involve sudden black-outs or frame skips due to network instability, and driving on the road may involve temporary blindness when facing the sun. Such abrupt interference on the observation can cause serious issues for DRL algorithms. Unlike other machine learning tasks that involve only a single mission at a time (e.g., image classification), an RL agent has to deal with a dynamic (Schmidhuber, 1992) and encoded state (Schmidhuber, 1991; Kaelbling et al., 1998) and to anticipate future rewards. Therefore, DRL-based systems are likely to propagate and even enlarge risks (e.g., delays and noisy pulsed signals in sensor fusion (Yurtsever et al., 2020; Johansen et al., 2015)) induced by the uncertain interference. In this paper, we investigate the resilience of an RL agent: its ability to withstand unforeseen, rare, adversarial, and potentially catastrophic interferences, and to recover and adapt by improving itself in reaction to these events. We consider a resilient RL framework with observational interferences. At each time step, the agent's observation is subjected to a type of sudden interference with a predefined probability. Whether or not an observation has been interfered with is referred to as the interference label.
Specifically, to train a resilient agent, we provide the agent with the interference labels during training. For instance, the labels could be derived from a noise generator that records, as a binary causation label, whether the agent observes an intervened state at the moment. By applying the labels as an intervention into the environment, the RL agent is asked to learn the binary causation label and embed a latent state into its model. However, when the trained agent is deployed in the field (i.e., the testing phase), the agent only receives the interfered observations; it is agnostic to the interference labels and needs to act resiliently against the interference. For an RL agent to be resilient against interference, the agent needs to diagnose its observations to make correct inferences about the reward information. To achieve this, the RL agent has to reason about what leads to desired rewards despite the irrelevant, intermittent interference. To equip an RL agent with this reasoning capability, we exploit the causal inference framework. Intuitively, a causal inference model for observation interference uses an unobserved confounder (Pearl, 2009; 2019; 1995b; Saunders et al., 2018; Bareinboim et al., 2015) to capture the effect of the interference on the reward collected from the environment. When such a confounder is available, the RL agent can focus on the confounder for relevant reward information and make the best decision. As illustrated in Figure 1, we propose a causal inference based DRL algorithm termed causal inference Q-network (CIQ). During training, when the interference labels are available, the CIQ agent implicitly learns a causal inference model by embedding the confounder into a latent state. At the same time, the CIQ agent also trains a Q-network on the latent state for decision making.
Then at testing, the CIQ agent makes use of the learned model to estimate the confounding latent state and the interference label. The history of latent states is combined into a causal inference state, which captures the relevant information for the Q-network to collect rewards in the environment despite the observational interference. In this paper, we evaluate the performance of our method in four environments: 1) Cartpole-v0, a continuous control environment (Brockman et al., 2016); 2) the 3D graphical Banana Collector (Juliani et al., 2018); 3) LunarLander-v2 (Brockman et al., 2016); and 4) pixel Cartpole, visual learning from the pixel inputs of Cartpole. For each of the environments, we consider four types of interference: (a) black-out, (b) Gaussian noise, (c) frozen screen, and (d) adversarial attack. The testing phase mimics the practical scenario in which the agent may receive interfered observations but is unaware of the true interference labels (i.e., whether interference happens or not); the results show that our CIQ method performs better and is more resilient against all four types of interference. Furthermore, to benchmark the level of resilience of different RL models, we propose a new robustness measure, called CLEVER-Q, to evaluate the robustness of Q-network based RL algorithms. The idea is to compute a lower bound on the observation noise level such that the greedy action from the Q-network will remain the same against any noise below the lower bound. According to this robustness analysis, our CIQ algorithm indeed achieves higher CLEVER-Q scores than the baseline methods.
The main contributions of this paper include 1) a framework to evaluate the resilience of DRL methods under abrupt observational interferences; 2) the proposed CIQ architecture and algorithm toward training a resilient DRL agent; and 3) an extreme-value-theory based robustness metric (CLEVER-Q) for quantifying the resilience of Q-network based RL algorithms.

2. RELATED WORKS

Causal Inference: Recent works (e.g., Tennenholtz et al., 2019) study causation in RL by defining the action as one kind of intervention and calculating the treatment effects on the action. In contrast, we introduce causation into DRL by applying extra noisy and uncertain interventions. Different from the aforementioned approaches, we leverage the causal effect of observational interferences on states, and design an end-to-end structure for learning a causal-observational representation that evaluates treatment effects on rewards.

Adversarial Perturbation: An intensifying challenge for deep neural network based systems is adversarial perturbation that induces incorrect decisions. Many gradient-based noise-generating methods (Goodfellow et al., 2015; Huang et al., 2017) have been developed to cause misclassification or to mislead an agent's output action. For example, when a DRL model plays Atari games, an adversarial attacker (Lin et al., 2017; Yang et al., 2020) can inject a timely and barely detectable noise to maximize the prediction loss of a Q-network and cause massively degraded performance.

Partially Observable Markov Decision Processes (POMDPs): Our resilient RL framework can be viewed as a POMDP with interfered observations. Belief-state methods are available for simple POMDP problems (e.g., the plan graph and the tiger problem (Kaelbling et al., 1998)), but no provably efficient algorithm is available for general POMDP settings (Papadimitriou & Tsitsiklis, 1987; Gregor et al., 2018). Recently, Igl et al.
(2018) have proposed a DRL approach for POMDPs by combining a variational autoencoder and policy-based learning, but such methods do not consider the interference labels available during training in our resilient RL framework.

Safe Reinforcement Learning: Safe reinforcement learning (SRL) (Garcia & Fernández, 2012) seeks to learn a policy that maximizes the expected return while satisfying specific safety constraints. Previous approaches to SRL include reward shaping (Saunders et al., 2018), noisy training (Fortunato et al., 2018), shielding-based SRL (Alshiekh et al., 2018), and policy optimization with confident lower-bound constraints (Thomas et al., 2015). However, finding these policies in the first place may require resetting the model each time and can be computationally challenging. Our proposed resilient RL framework can be viewed as an approach to achieve SRL (Alshiekh et al., 2018), but we focus on gaining resilience against abrupt observational interferences. Another key difference between our framework and other SRL schemes is the novelty of proactively using available interference labels during training, which allows our agent to learn a causal inference model and make safer decisions.

3. RESILIENT REINFORCEMENT LEARNING

In this section, we formally introduce our resilient RL framework and provide an extreme-value-theory based metric called CLEVER-Q for measuring the robustness of DQN-based methods. We consider a sequential decision-making problem where an agent interacts with an environment. At each time t, the agent gets an observation x_t, e.g., a frame in a video environment. As in many RL domains (e.g., Atari games), we view s_t = (x_{t-M+1}, . . . , x_t) as the state of the environment, where M is the fixed length of the observation history. Given a stochastic policy π, the agent chooses an action a_t ∼ π(s_t) from a discrete action space based on the observed state and receives a reward r_t from the environment. For a policy π, define the Q-function Q^π(s, a) = E[ Σ_{t=0}^∞ γ^t r_t | s_0 = s, a_0 = a, π ], where γ ∈ (0, 1) is the discount factor. The agent's goal is to find the optimal policy π* that achieves the optimal Q-function given by Q*(s, a) = max_π Q^π(s, a).

3.1. RESILIENCE BASED ON AN INTERVENTIONAL PERSPECTIVE

We consider a resilient RL framework where the observations are subject to interference (as illustrated in Fig. 1(a)), modeled as an empirical process in Rubin's Causal Model (RCM) (Kennedy, 2016; Holland, 1988; Balke & Pearl, 1997; Robins et al., 2003) for causal inference. Given a type of interference I, the agent's observation becomes

x̃_t = F_I(x_t, i_t) = i_t × I(x_t) + (1 − i_t) × x_t,   (1)

where i_t ∈ {0, 1} is the label indicating whether the observation is interfered at time t or not (under the potential outcome estimation (Rubin, 1974)), and I(x_t) is the interfered observation. The interfered state is then given by s̃_t = (x̃_{t−M+1}, . . . , x̃_t). We assume that the interference labels i_t follow an i.i.d. Bernoulli process with a fixed interference probability p_I as a noise level. For example, when p_I equals 10%, each observational state has a 10% chance to be intervened by a perturbation. The agent now needs to choose its actions a_t ∼ π(s̃_t) based on the interfered state. The resilient RL objective for the agent is to find a policy π that maximizes rewards in this environment under observational interference. In this work, we consider four types of interference, described below.

Gaussian Noise. Gaussian noise or white noise is a common interference to sensory data (Osband et al., 2019; Yurtsever et al., 2020). The interfered observation becomes I(x_t) = x_t + n_t with a zero-mean Gaussian noise n_t. The noise variance is set to the variance of all recorded states.

Adversarial Observation. Following the standard adversarial RL attack setting, we use the fast gradient sign method (FGSM) (Goodfellow et al., 2015) to generate adversarial patterns against the DQN prediction loss (Huang et al., 2017). The adversarial observation is given by I(x_t) = x_t + ε × sign(∇_{x_t} Q(x_t, y; θ)), where y is the optimal output action obtained by weighting over all possible actions.

Observation Black-Out.
Hardware issues can affect an entire sensor network: a sensing over-shoot in the background (Yurtsever et al., 2020) can zero out the observation, I(x_t) = 0 (Yan et al., 2016). This perturbation is realistic, since overheated hardware can lose the observational information of its sensors.

Frozen Frame. Lagging and frozen frames (Kalashnikov et al., 2018) often come from a limited-bandwidth data communication bottleneck. A frozen frame is given by I(x_t) = x_{t−1}. If the perturbation is constantly present, the observation remains the first frozen frame from the moment the perturbation happened.
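The four interference types above can be summarized in a small sketch. The following is a minimal NumPy illustration of Eq. (1) and the i.i.d. Bernoulli label process; the function names, the fixed noise scale `sigma`, the FGSM step size `eps`, and the externally supplied gradient are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def interfere(x_t, x_prev, i_t, kind, rng, sigma=1.0, eps=0.01, grad=None):
    # Eq. (1): x~_t = i_t * I(x_t) + (1 - i_t) * x_t
    if i_t == 0:
        return x_t
    if kind == "gaussian":      # white noise with a fixed variance
        return x_t + rng.normal(0.0, sigma, size=x_t.shape)
    if kind == "adversarial":   # FGSM-style step along the sign of a supplied gradient
        return x_t + eps * np.sign(grad)
    if kind == "blackout":      # sensor black-out: observation zeroed, I(x_t) = 0
        return np.zeros_like(x_t)
    if kind == "frozen":        # frozen frame: previous observation repeated
        return x_prev
    raise ValueError(kind)

def sample_labels(T, p_I, rng):
    # i.i.d. Bernoulli(p_I) interference labels i_t
    return (rng.random(T) < p_I).astype(int)
```

With `i_t = 0` the clean observation passes through unchanged, matching the second term of Eq. (1).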

3.2. CLEVER-Q: A ROBUSTNESS EVALUATION METRIC FOR Q-NETWORKS

Here we provide a comprehensive score (CLEVER-Q) for evaluating the robustness of a Q-network model by extending the CLEVER robustness score (Weng et al., 2018), designed for classification tasks, to Q-network based DRL tasks. Consider an ℓ_p-norm bounded (p ≥ 1) perturbation δ to the state s_t. We first derive a lower bound β_L on the minimal perturbation to s_t that alters the action with the top Q-value, i.e., the greedy action. For a given s_t and a Q-network, this lower bound β_L provides a robustness guarantee that the greedy action at s_t will be the same as that of any perturbed state s_t + δ, as long as the perturbation level ||δ||_p ≤ β_L. Therefore, the larger the value of β_L, the more resilience of the Q-network against perturbations can be guaranteed. Our CLEVER-Q score uses extreme value theory to evaluate the lower bound β_L as a robustness metric for benchmarking different Q-network models. The proof of Theorem 1 is available in appendix B.1.

Theorem 1. Consider a Q-network Q(s, a) and a state s_t. Let A* = arg max_a Q(s_t, a) be the set of greedy (best) actions having the highest Q-value at s_t according to the Q-network. Define g_a(s_t) = Q(s_t, A*) − Q(s_t, a) for every action a, where Q(s_t, A*) denotes the best Q-value at s_t. Assume g_a(s_t) is locally Lipschitz continuous with its local Lipschitz constant denoted by L_q^a, where 1/p + 1/q = 1 and p ≥ 1. Define the lower bound

β_L = min_{a ∉ A*} g_a(s_t) / L_q^a.   (2)

Then for any δ such that ||δ||_p ≤ β_L, we have arg max_a Q(s_t, a) = arg max_a Q(s_t + δ, a).
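As a concrete illustration, the bound of Eq. (2) can be estimated by sampling gradient norms inside the ℓ_p ball, in the spirit of Weng et al. (2018). This is a sketch, not the paper's implementation: `q_fn` and `grad_fn` are hypothetical callables (the Q-network and the gradient of g_a), and simple uniform-in-ball sampling stands in for the full extreme-value fitting used in CLEVER.

```python
import numpy as np

def clever_q(q_fn, grad_fn, s, radius, n_samples, p, rng):
    """Estimate the CLEVER-Q lower bound beta_L of Eq. (2).

    grad_fn(s, a) is assumed to return the gradient of
    g_a(s) = Q(s, a*) - Q(s, a) at the point s.
    """
    q_vals = np.asarray(q_fn(s), dtype=float)
    a_star = int(np.argmax(q_vals))
    # dual norm exponent: 1/p + 1/q = 1
    if p == 1:
        q_ord = np.inf
    elif np.isinf(p):
        q_ord = 1
    else:
        q_ord = p / (p - 1.0)
    beta = np.inf
    for a in range(len(q_vals)):
        if a == a_star:
            continue
        g_a = q_vals[a_star] - q_vals[a]
        # estimate the local Lipschitz constant L_q^a as the max gradient
        # q-norm over random points in the l_p ball around s
        L = 1e-12
        for _ in range(n_samples):
            u = rng.normal(size=np.shape(s))
            u = u / np.linalg.norm(u) * radius * rng.random()
            L = max(L, np.linalg.norm(grad_fn(s + u, a), ord=q_ord))
        beta = min(beta, g_a / L)
    return beta
```

For a linear Q-network the gradient of g_a is constant, so the sampled estimate coincides with the exact bound, which makes linear models a convenient sanity check.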

3.3. CAUSAL GRAPHICAL MODEL AND THE BENEFITS ON LEARNING PERFORMANCE

The causal relation of observation, reward, and interference can be described by a causal graphical model (CGM), shown in Figure 2. We use z_t to denote the latent state, which can be viewed as a confounder in causal inference. Formally, we define z_t = h(x_t, i_t) to be the hidden confounder, where h is a function that compresses (x_t, i_t) into a confounder such that the CGM holds. It is clear from Eq. (1) and the MDP definition that the CGM holds with h being the identity function, i.e., z_t = (x_t, i_t). We assume that there exists some unknown compression function h such that z_t is low-dimensional. Similar to Louizos et al. (2017), we aim to learn to predict this low-dimensional hidden confounder with a neural network.

Table 1: Causal hierarchy (Pearl, 2009; 2019; Bareinboim et al., 2020).

According to the CGM, different training settings correspond to different levels of Pearl's causal hierarchy (Bareinboim et al., 2020; Shpitser & Pearl, 2008; Pearl, 2009), as shown in Table 1. If only the observations are available, the training process corresponds to Level I of the causal hierarchy, which associates the outcome r_t with the input observation x_t directly through P(r_t | x_t). Regular DQN and other algorithms trained only on observations belong to this association level. On the other hand, when the interference type I and the interference labels i_t are available during training, the learning problem is elevated to Level II of the causal hierarchy. In particular, the interference model of Eq. (1) can be viewed as the intervention logic, with the interference label i_t being the treatment information. With this information, we can describe the causal inference problem by P(r_t | do(x̃_t), i_t) = P(r_t | F_I(x_t, i_t) = x̃_t, i_t) with the do-operator (Pearl, 2019) at the intervention level of the causal hierarchy.
Based on the causal hierarchy theorem (Pearl, 2009), we can answer causal questions at the higher Level II given the interference type I and the interference labels i_t in the learning process. We provide an example to analytically demonstrate the learning advantage of having the interference labels during training. Consider an environment of i.i.d. Bernoulli states with P(x_t = 1) = P(x_t = 0) = 0.5 and two actions 0 and 1. Taking action a_t = 0 yields no reward. When a_t = 1, the agent pays one unit for a chance to win a two-unit reward with probability q_x at state x_t = x ∈ {0, 1}. Therefore, P(r_t = 1 | x_t = x, a_t = 1) = q_x and P(r_t = −1 | x_t = x, a_t = 1) = 1 − q_x. This simple environment is a contextual bandit problem where the optimal policy is to pick a_t = 1 at state x_t = x if q_x > 0.5, and a_t = 0 if q_x ≤ 0.5. If the goal is to find an approximately optimal policy, the agent should take action a_t = 1 during training to learn the probabilities q_0 and q_1. Suppose the environment is subjected to observation black-out (x̃_t = 0) with p_I = 0.2 when x_t = 1, and no interference when x_t = 0. Assume q_0 = (3 − q_1)/5. Then we have P(r_t = 1 | x̃_t = 1, a_t = 1) = q_1, and P(r_t = 1 | x̃_t = 0, a_t = 1) = q_0 P(x_t = 0 | x̃_t = 0) + q_1 P(x_t = 1 | x̃_t = 0) = 0.5. If the agent only has the interfered observation x̃_t, the samples with x̃_t = 0 are irrelevant to learning q_1, because given x̃_t = 0 rewards occur randomly with probability one half. Therefore, the sample complexity bound is proportional to 1/P(x̃_t = 1), because only samples with x̃_t = 1 are relevant. On the other hand, if the agent has access to the labels i_t during training, then even when it observes x̃_t = 0, the agent can infer whether x_t = 1 by checking whether i_t = 1.
Therefore, the causal relation allows the agent to learn q_1 by utilizing all samples with x_t = 1, and the sample complexity bound is proportional to 1/P(x_t = 1) = 2, which is a 20% reduction from 1/P(x̃_t = 1) = 2.5 when the labels are not available. Note that z_t = (x_t, i_t) is a latent state for this example, and the latent state and its causal relations are very important for improving learning performance. Following the above discussion, we provide the interference type I and the interference labels i_t to efficiently train a resilient RL agent with the CGM; however, in the actual testing environment, the agent only has access to the interfered observations x̃_t. The CGM allows the agent to infer the latent state z_t and utilize its causal relations with observation, interference, and reward to learn resilient behaviors. We next show how we parameterize this model with a novel deep neural network.
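The sample-complexity argument above can be checked numerically. The sketch below simulates the black-out bandit and counts the fraction of samples that identify x_t = 1 with and without the labels i_t; the resulting fractions of about 0.4 and 0.5 reproduce the sample-complexity constants 2.5 and 2. The parameter values are illustrative.

```python
import numpy as np

def usable_fractions(T=100_000, p_I=0.2, seed=1):
    """Fraction of samples that identify x_t = 1, with vs. without labels i_t."""
    rng = np.random.default_rng(seed)
    x = (rng.random(T) < 0.5).astype(int)               # P(x_t = 1) = 0.5
    i = ((x == 1) & (rng.random(T) < p_I)).astype(int)  # black-out only when x_t = 1
    x_tilde = np.where(i == 1, 0, x)                    # interfered observation

    # Without labels: only samples with x~_t = 1 surely have x_t = 1.
    no_label = (x_tilde == 1).mean()                    # close to 0.5 * 0.8 = 0.4
    # With labels: x~_t = 1, or x~_t = 0 with i_t = 1, both imply x_t = 1.
    with_label = ((x_tilde == 1) | (i == 1)).mean()     # close to P(x_t = 1) = 0.5
    return no_label, with_label
```

The label-aware condition recovers exactly the event {x_t = 1}, so all relevant samples become usable, as the analysis claims.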

3.4. CAUSAL INFERENCE Q-NETWORK

Based on the causal inference graphical model, we propose a causal inference Q-network, referred to as CIQ, that is able to map the interfered observation x̃_t into a latent state z̃_t, make proper inferences about the interference condition i_t, and adjust the policy based on the estimated interference. We approximate the latent state by a neural network z̃_t = f_1(x̃_t; θ_1). From the latent state, we generate the estimated interference label ĩ_t ∼ p(ĩ_t | z̃_t) = f_I(z̃_t; φ). We denote s_t^CI = (z̃_{t−M+1}, ĩ_{t−M+1}, . . . , z̃_t, ĩ_t) as the causal inference state. As discussed in the previous subsection, the causal inference state acts as a confounder between the interference and the reward. Therefore, instead of using the interfered state s̃_t, the causal inference state s_t^CI contains more relevant information for the agent to maximize rewards. Using the causal inference state helps the agent focus on meaningful and informative details even under interference. With the causal inference state s_t^CI, the output of the Q-network Q(s_t^CI; θ) is switched between two neural networks f_2(s_t^CI; θ_2) and f_3(s_t^CI; θ_3) by the interference label. Such a switching mechanism prevents our network from over-generalizing the causal inference state. During training, switching between the two neural networks is determined by the training interference label i_t^train. We assume that the true interference label is available in the training phase, so i_t^train = i_t. At testing, when i_t is not available, we use the predicted interference label ĩ_t as the switch to decide which of the two neural networks to use. The design intuition of the inference mechanism is based on the potential outcome estimation theory (Rubin, 1974; Imbens & Rubin, 2010; Pearl, 2009) in RCM and the modeling of the interference scenario as described in Eq. (1).
Intuitively, the switching mechanism (counterfactual inference) from RCM can be considered a method to disentangle a single deep network into two non-parameter-sharing networks to improve model generalization under uncertainty. It has shown many advantages for representation learning in regression tasks (Shalit et al., 2017; Louizos et al., 2017). We provide more implementation details in appendix C.1. All the neural networks f_1, f_2, f_3, f_I have two fully connected layers, with each layer followed by the ReLU activation except for the last layer in f_2, f_3, and f_I. The overall CIQ model is shown in Figure 2, and θ = (θ_1, θ_2, θ_3, φ) denotes all its parameters. Note that, as is common practice for discrete action spaces, the Q-network output Q(s_t^CI; θ) is an A-dimensional vector, where A is the size of the action space and each dimension represents the value of taking the corresponding action. Finally, we train the CIQ model end-to-end by the DQN algorithm with an additional loss for predicting the interference label. The overall CIQ objective function is defined as:

L_CIQ(θ_1, θ_2, θ_3, φ) = i_t^train · L_DQN(θ_1, θ_2, φ) + (1 − i_t^train) · L_DQN(θ_1, θ_3, φ)
− λ · ( i_t^train log p(ĩ_t | z̃_t; θ_1, φ) + (1 − i_t^train) log(1 − p(ĩ_t | z̃_t; θ_1, φ)) ),

where λ is a scaling constant, set to 1 for simplicity. The entire CIQ training procedure is described in Algorithm 1. Owing to the design of the causal inference state and the switching mechanism, we will show that CIQ behaves resiliently against the observational interferences.
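To make the data flow concrete, here is a minimal forward-pass sketch of the CIQ wiring in NumPy. The layer sizes, single-layer maps, and random initialization are illustrative assumptions (the paper uses two fully connected ReLU layers per module); the sketch only shows the f_1 / f_I / f_2 / f_3 structure and the label-based switching between the two Q-heads.

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

class CIQSketch:
    """Forward-pass sketch of CIQ; sizes and initialization are illustrative."""

    def __init__(self, obs_dim, latent_dim, n_actions, rng):
        g = lambda *shape: rng.normal(0.0, 0.1, size=shape)
        self.W1 = g(latent_dim, obs_dim)   # f_1: observation -> latent z~_t
        self.wI = g(latent_dim)            # f_I: latent -> P(i~_t = 1)
        self.W2 = g(n_actions, latent_dim) # f_2: Q-head used for clean states
        self.W3 = g(n_actions, latent_dim) # f_3: Q-head used for interfered states

    def forward(self, x_t, i_t=None):
        z = relu(self.W1 @ x_t)            # latent state z~_t
        p_i = sigmoid(self.wI @ z)         # predicted interference probability
        # training: switch on the true label i_t; testing: on i~_t = 1[p_i > 0.5]
        label = i_t if i_t is not None else int(p_i > 0.5)
        q = (self.W3 if label else self.W2) @ z
        return q, p_i, label
```

During training one would call `forward(x, i_t=true_label)` so that the true label selects the head, while at testing `forward(x)` thresholds the predicted probability, mirroring the switching mechanism described above.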

4. EXPERIMENTS

4.1. ENVIRONMENTS FOR DQNS

Our testing platforms are based on (a) OpenAI Gym (Brockman et al., 2016), (b) Unity-3D environments (Juliani et al., 2018), (c) a 2D gaming environment (Brockman et al., 2016), and (d) visual learning from the pixel inputs of Cartpole. Our test environments cover major application scenarios and feature discrete actions for training DQN agents with the CLEVER-Q analysis.

Vector Cartpole: Cartpole (Sutton et al., 1998) is a classical continuous control problem. The environment is manipulated by adding a force of +1 or −1 to a moving cart. A pendulum starts upright, and the goal is to balance it and prevent it from falling over. We use Cartpole-v0 from Gym (Brockman et al., 2016) with a target reward of 195.0 to solve the environment. The observational vector state consists of four physical parameters: the cart's position and velocity and the pole's angle and angular velocity.

Banana Collector: The Banana Collector shown in Figure 1(b) is one of the Unity baselines (Juliani et al., 2018), rendered by a 3D engine. Different from the MuJoCo (Todorov et al., 2012) simulators with continuous actions, the Banana Collector is controlled by four discrete actions corresponding to moving directions. The state space has 37 dimensions, including velocity and a ray-based perception of objects around the agent. The target reward is 12.0 points, earned by collecting correct bananas (+1).

Lunar Lander: Similar to the Atari gaming environments, LunarLander-v2 (Figure 1(c)) is a discrete-action environment from OpenAI Gym (Brockman et al., 2016). The state is an eight-dimensional vector that records the lander's position, velocity, angle, and angular velocity. The episode finishes if the lander crashes or comes to rest, receiving a reward of −100 or +100, with a target reward of 200. Firing the ejector costs −0.3 each frame, with +10 for each ground contact.
Pixel Cartpole: To further evaluate our models, we conduct experiments with pixel inputs in the Cartpole environment as a visual learning task. The size of the input state is 400 × 600 pixels. We use a max-pooling and a convolution layer to extract states as network inputs. The environment includes two discrete actions {left, right} and is otherwise identical to the vector version of Cartpole-v0.
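The 400 × 600 frame must be downsampled before entering the network; a non-overlapping max-pooling step like the one below is a plausible sketch of that preprocessing (the kernel size k is an assumption, as the exact configuration is not specified here).

```python
import numpy as np

def max_pool2d(img, k):
    # non-overlapping k x k max-pooling; crops the frame to a multiple of k
    h, w = (img.shape[0] // k) * k, (img.shape[1] // k) * k
    v = img[:h, :w].reshape(h // k, k, w // k, k)
    return v.max(axis=(1, 3))
```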

4.2. BASELINE METHODS

In the experiments, we compare our CIQ algorithm with two sets of DRL baselines to demonstrate the resilience of the proposed method. We ensure that all the models have the same number of parameters (9.7 million), with careful fine-tuning to avoid model capacity issues.

Pure DQN: We use DQN as a baseline in our experiments. The DQN agent is trained and tested on the interfered state s̃_t. We also evaluate common DQN improvements in appendix C.1 and find that the improvements have no significant effect against interference.

DQN with an interference classifier (DQN-CF): In the resilient reinforcement learning framework, the agent is given the true interference label i_t^train at training. Therefore, we provide this additional information to the DQN agent for a fair comparison. During training, the interfered state s̃_t is concatenated with the true label i_t^train as the input to the DQN agent. Since the true label is not available at testing, we train an additional binary classifier (CF) for the DQN agent. The classifier is trained to predict the interference label, and this predicted label is concatenated with the interfered state as the input to the DQN agent during testing.

DQN with safe actions (DQN-SA): Inspired by shielding-based safe RL (Alshiekh et al., 2018), we consider a DQN baseline with safe actions (SA). The DQN-SA agent applies the DQN action if there is no interference. However, if the current observation is interfered, it chooses the action used for the last uninterfered observation as the safe action. This action-holding method is also a typical control approach when observations are missing (Franklin et al., 1998). As with DQN-CF, a binary classifier for interference is trained to provide predicted labels at testing.

Figure 3: Performance of DQNs under potential (20%) adversarial and black-out interference.
DVRLQ and DVRLQ-CF: Motivated by deep variational RL (DVRL) (Igl et al., 2018), we provide a version of DVRL as a POMDP baseline. We call this baseline DVRLQ because we replace the A2C loss with the DQN loss. As with DQN-CF, we also consider a baseline of DVRLQ with a classifier, referred to as DVRLQ-CF, for a fair comparison using the interference labels.
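The action-holding rule of the DQN-SA baseline described above can be written in a few lines. This is a sketch under the assumption that the (predicted) interference label gates the choice; the function returns both the action taken and the updated "last safe" action.

```python
def dqn_sa_step(dqn_action, pred_label, last_safe):
    """DQN-SA rule: hold the last uninterfered action when interference is predicted."""
    if pred_label:
        # interference predicted: repeat the safe action, keep it as the safe action
        return last_safe, last_safe
    # no interference: take the DQN action and remember it as the new safe action
    return dqn_action, dqn_action
```

Over an episode, the second return value is threaded through as `last_safe` for the next step.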

4.3. RESILIENT RL ON AVERAGE RETURNS

We run performance evaluations with six different interference probabilities (p_I in Sec. 3.1): {0%, 10%, 20%, 30%, 40%, 50%}. We train each agent 50 times and show its standard deviation in lighter colors. Each agent is trained until the target score (shown as the dashed black line) is reached or until 400 episodes. We show the average returns for p_I = 20% under adversarial perturbation and black-out in Figure 3 and report the rest of the results in appendix C.1. CIQ (green) clearly outperforms all the baselines under all types of interference, validating the effectiveness of CIQ in learning to infer and in gaining resilience against a wide range of observational interferences. Pure DQN (yellow) cannot handle the interference at the 20% noise level. DQN-CF (orange) and DQN-SA (brown) are competitive in some environments against certain interferences but perform poorly in others. DVRLQ (blue) and DVRLQ-CF (purple) cannot achieve the target reward in most experiments, which may suggest the inefficiency of applying a general POMDP approach in a framework with a specific structure of observational interference.

4.4. ROBUSTNESS METRICS BASED ON RECORDED STATES

We evaluate the robustness of DQN and CIQ with the proposed CLEVER-Q metric. To keep the test states consistent among different types and levels of interference, we record the interfered states, S_N = I(S_C), together with their clean states, S_C. We then calculate the average CLEVER-Q for DQN and CIQ based on the clean states S_C using Eq. (2), over 50 experiments for each agent. We also consider a retrospective robustness metric, the action correction rate (AC-Rate). Motivated by previous off-policy and error correction studies (Dulac-Arnold et al., 2012; Harutyunyan et al., 2016; Lin et al., 2017), the AC-Rate is defined as the action matching rate R_Act = (1/T) Σ_{t=0}^{T−1} 1{a_t = a*_t} over an episode of length T. Here a_t denotes the action taken by the agent with interfered observations S_N, and a*_t is the action the agent would take if the clean states S_C were observed instead. The roles of CLEVER-Q and AC-Rate are complementary as robustness metrics: CLEVER-Q measures sensitivity in terms of the margin (minimum perturbation) required for a given state to change the original action, while AC-Rate measures utility in terms of action consistency. Together, they provide a comprehensive resilience assessment. We also conduct the following analyses to better understand our CIQ model, including environments with a dynamic noise level; due to the space limit, see appendices C to E for details.

Treatment effect analysis: We provide a treatment effect analysis for each kind of interference to statistically verify the CGM, with the lowest errors on average treatment effect refutation, in appendix D.

Ablation studies: We conduct ablation studies by comparing several CIQ variants, each omitting a certain CIQ component. The results verify the importance of the proposed CIQ architecture (appendix E).
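The AC-Rate defined above is straightforward to compute from two recorded action traces; a minimal sketch:

```python
import numpy as np

def ac_rate(actions_interfered, actions_clean):
    # R_Act = (1/T) * sum_t 1{a_t = a*_t} over an episode of length T
    a = np.asarray(actions_interfered)
    b = np.asarray(actions_clean)
    return float(np.mean(a == b))
```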

Test on different noise levels:

We train CIQ under one noise level and test on another. The results show that the difference in noise level does not much affect performance (appendix C.6).

Neural saliency map: We apply the perturbation-based saliency map for DRL (Greydanus et al., 2018) in appendix E.4 to visualize the saliency centers and actions of CIQ and the baseline agents.

Transferability in robustness: Based on CIQ, we study how well robustness transfers between training and testing environments with different interference. We evaluate two general settings: (i) the same interference type but different noise levels (appendix C.6) and (ii) different interference types (appendix E.5).

Multiple interference types:

We also provide a generalized version of CIQ that deals with multiple interference types in the training and testing environments. The generalized CIQ is equipped with a common encoder and individual interference decoders to study multi-module conditional inference, as discussed in appendix E.6.

5. CONCLUSION

Our experiments suggest that, although some DRL-based algorithms can achieve high scores under normal conditions, their performance can be severely degraded in the presence of interference. To be resilient against interference, we propose CIQ, a novel causal-inference-driven DRL algorithm. Evaluated on a wide range of environments and multiple types of interference, CIQ shows consistently superior performance over several RL baseline methods. We also validate the improved resilience of CIQ with the CLEVER-Q and AC-Rate metrics, and we will open-source our code.

B.1 PROOF OF THEOREM 1

For a given s_t and a Q-network, the lower bound β_L provides a robustness guarantee that the greedy action at s_t will be the same as that of any perturbed state s_t + δ, as long as the perturbation level ||δ||_p ≤ β_L. Therefore, the larger the value of β_L, the more resilience of the Q-network against perturbations can be guaranteed. Our CLEVER-Q score uses extreme value theory to evaluate the lower bound β_L as a robustness metric for benchmarking different Q-network models.

Theorem 2. Consider a Q-network Q(s, a) and a state s_t. Let A* = arg max_a Q(s_t, a) be the set of greedy (best) actions having the highest Q-value at s_t according to the Q-network. Define g_a(s_t) = Q(s_t, A*) − Q(s_t, a) for every action a, where Q(s_t, A*) denotes the best Q-value at s_t. Assume g_a(s_t) is locally Lipschitz continuous³ with its local Lipschitz constant denoted by L_q^a, where 1/p + 1/q = 1 and p ≥ 1. Define the lower bound β_L = min_{a ∉ A*} g_a(s_t)/L_q^a. Then for any δ such that ||δ||_p ≤ β_L, arg max_a Q(s_t, a) = arg max_a Q(s_t + δ, a).

Proof. Because g_a(s_t) is locally Lipschitz continuous, by Hölder's inequality we have |g_a(x) − g_a(y)| ≤ L_q^a ||x − y||_p for any x, y within the ℓ_p ball of radius R_p centered at s_t. Now let x = s_t and y = s_t + δ, where δ is some perturbation.
Then

g_a(s_t) - L_q^a ||δ||_p ≤ g_a(s_t + δ) ≤ g_a(s_t) + L_q^a ||δ||_p.    (6)

Note that if g_a(s_t + δ) ≥ 0, then A* still remains the top Q-value action set at state s_t + δ. Moreover, g_a(s_t) - L_q^a ||δ||_p ≥ 0 implies g_a(s_t + δ) ≥ 0. Therefore,

||δ||_p ≤ g_a(s_t)/L_q^a    (7)

provides a robustness guarantee that ensures Q(s_t + δ, A*) ≥ Q(s_t + δ, a) for any δ satisfying Eq. (7). Finally, to guarantee Q(s_t + δ, A*) ≥ Q(s_t + δ, a) for every action a ∉ A*, it suffices to take the minimum of the bound in Eq. (7) over all actions other than those in A*, which gives the lower bound

β_L = min_{a ∉ A*} g_a(s_t)/L_q^a.    (8)

^3 Here locally Lipschitz continuous means g_a(s_t) is Lipschitz continuous within the p-norm ball centered at s_t with radius R_p. We follow the same definition as in Weng et al. (2018).

For computing β_L, while the numerator is easy to obtain, the local Lipschitz constant L_q^a cannot be computed directly. In our implementation, using the fact that L_q^a equals the local maximum gradient norm (in q-norm), we apply the sampling technique based on extreme value theory proposed in Weng et al. (2018) to estimate L_q^a.
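As a concrete illustration of the bound above, the following sketch computes β_L for a toy *linear* Q-network, where the local Lipschitz constant of g_a is exactly ||W[a*] - W[a]||_q; for a deep Q-network, the sampling loop would instead estimate L_q^a from gradient norms via extreme value theory (Weng et al., 2018). All weights and names here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Toy linear Q-network Q(s, a) = W[a] @ s. For this model the gradient of
# g_a(s) = Q(s, a*) - Q(s, a) is constant, so its local Lipschitz constant
# in q-norm is exactly ||W[a*] - W[a]||_q.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))        # 4 actions, 3-dim state (made-up weights)
s = rng.normal(size=3)

q = W @ s
a_star = int(np.argmax(q))

p = 2.0
qn = 2.0                           # 1/p + 1/q = 1 -> q = 2 when p = 2

def lipschitz_estimate(a, n_samples=256):
    """Estimate L_q^a as the max gradient norm of g_a over sampled points.

    For the linear toy model the gradient is constant, so this 'sampling'
    is trivial; for a deep Q-network one would backprop at sampled points.
    """
    best = 0.0
    for _ in range(n_samples):
        grad = W[a_star] - W[a]
        best = max(best, float(np.linalg.norm(grad, ord=qn)))
    return best

# beta_L = min over non-greedy actions of g_a(s_t) / L_q^a  (Eq. 8)
bounds = [(q[a_star] - q[a]) / lipschitz_estimate(a)
          for a in range(W.shape[0]) if a != a_star]
beta_L = min(bounds)

# any perturbation with ||delta||_p < beta_L cannot change the greedy action
delta = rng.normal(size=3)
delta = 0.99 * beta_L * delta / np.linalg.norm(delta, ord=p)
assert int(np.argmax(W @ (s + delta))) == a_star
```

The final assertion checks the certified-radius property that Theorem 2 guarantees: inside the β_L ball, the greedy action is unchanged.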

B.2 ADDITIONAL ROBUSTNESS MEASUREMENTS

Following the discussion in Section 4.4 of the main content, we provide more experimental results for the CLEVER-Q measurement and use the action correction rate (AC-Rate) mentioned in the main content as a reference. We also conduct an experiment on a DQN extension of TD-VAE Gregor et al. (2018), but even after careful fine-tuning, the performance of this extension is lower than that of V-CF in all metrics. We also find that the DQNs benefit from a jointly trained interference classifier, as shown in Tables S1 and S2.

C.1 BACKGROUND AND TRAINING SETTING

To scale to high-dimensional problems, one can use a parameterized deep neural network Q(s, a; θ) to approximate the Q-function; the network Q(s, a; θ) is referred to as the deep Q-network (DQN). The DQN algorithm Mnih et al. (2015) updates the parameter θ according to the loss function

L_DQN(θ) = E_{(s_t, a_t, r_t, s_{t+1}) ∼ D} [(y_t^DQN - Q(s_t, a_t; θ))^2],

where the transitions (s_t, a_t, r_t, s_{t+1}) are uniformly sampled from the replay buffer D of previously observed transitions, and y_t^DQN = r_t + γ max_a Q(s_{t+1}, a; θ⁻) is the DQN target, with θ⁻ being the target network parameter periodically updated from θ.

We utilize the Unity Machine Learning Agents Toolkit Juliani et al. (2018), an open-source and reproducible 3D rendering environment, for the Banana Collector task. Reproducible source code that renders the collector agent is provided in the supplementary material for both Linux and Windows. A reward of +1 is provided for collecting a yellow banana, and a reward of -1 is provided for collecting a blue banana. We use a six-layer deep network, which includes an input layer, three 64-unit fully-connected ReLU hidden layers, a soft-attention layer Rao et al. (2017) (for all DQNs), and an output layer (2 dimensions). The input layer is [37 × 4], composed of the observation dimension (37) and a stacked input of 4 consecutive observations. We design a replay buffer with a memory of 100,000 and a mini-batch size of 32; the discount factor γ is 0.99; the τ for the soft update of the target parameters is 10^-3; the learning rate for Adam Kingma & Ba (2014) optimization is 5 × 10^-4; the weight-decay regularization term is 1 × 10^-4; the importance-sampling exponent α is 0.6; and the prioritization exponent is 0.4. We train each model 1,000 times for each case and report the mean of the average final performance (averaged over all interference types) in Table S6.
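The DQN target and loss above can be sketched numerically; the toy linear Q-network and the mini replay buffer below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Sketch of the DQN loss L_DQN(theta) using a toy linear Q-network
# Q(s, a; theta) = theta[a] @ s, with a small uniformly sampled batch
# standing in for the replay buffer D.
rng = np.random.default_rng(1)
n_actions, state_dim, gamma = 2, 4, 0.99
theta = rng.normal(size=(n_actions, state_dim))     # online parameters
theta_minus = theta.copy()                          # target parameters

def dqn_loss(batch):
    """Mean squared TD error over a mini-batch of (s, a, r, s') tuples."""
    loss = 0.0
    for s, a, r, s_next in batch:
        y = r + gamma * np.max(theta_minus @ s_next)  # DQN target y_t
        loss += (y - theta[a] @ s) ** 2
    return loss / len(batch)

# a tiny uniformly sampled "replay buffer" of transitions
buffer = [(rng.normal(size=state_dim), int(rng.integers(n_actions)),
           float(rng.normal()), rng.normal(size=state_dim))
          for _ in range(32)]
loss = dqn_loss(buffer)   # scalar to be minimized w.r.t. theta
```

In practice the gradient of this loss flows only through Q(s_t, a_t; θ), while θ⁻ is held fixed and refreshed periodically, exactly as the text describes.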
We report DVRLQ-CF, which attains the higher performance among DVRLQ and DVRLQ-SA, to compare with DQN-CF and DDQN-CF. CIQ still attains the best overall performance compared to the other baselines. Interestingly, learning the interference improves general performance when combined with the joint-training frameworks.

C.4 ENV 3: LUNAR LANDER ENVIRONMENT

LunarLander-v2 Brockman et al. (2016) is one of the more challenging environments with discrete actions. Its observation dimension is 8, and the input stacks 10 consecutive observations. The objective of the game is to navigate the lunar lander spaceship to a targeted landing spot without a collision. Four discrete actions control the engines: do nothing, fire the left orientation engine, fire the main engine, and fire the right orientation engine. The dimension of the input layer is [8 × 10]. We design a 7-layer neural network for this task, which includes 1 input layer, 2 fully-connected 32-unit ReLU layers, 2 fully-connected 64-unit ReLU layers, 1 soft-attention layer Rao et al. (2017) (for all DQNs), and 1 output layer (4 dimensions). The replay buffer size is 500,000; the mini-batch size is 64; the discount factor is 0.99; the τ for the soft update of target parameters is 10^-3; the learning rate is 5 × 10^-4; the minimal step for resetting the memory buffer is 50. We train each model 1,000 times for each case and report the mean of the average final performance (averaged over all interference types) in Table S8.

C.5 ENV 4: PIXEL CARTPOLE ENVIRONMENT

The raw observation is an RGB frame of size [400, 600, 3]. We first resize the original frame into a single gray-scale channel of size [100, 150] using the RGB2GRAY conversion in OpenCV.
The implementation details are shown in "pixel_tool.py" and "cartpole_pixel.py", which can be found in the submitted supplementary code. We then stack 4 consecutive gray-scale frames as the input. We design a 7-layer DQN model: an input layer; a first hidden layer that convolves 32 filters of an [8 × 8] kernel with stride 4; a second hidden layer that convolves 64 filters of a [4 × 4] kernel with stride 2; a third, fully-connected layer with 128 units; fourth and fifth fully-connected layers with 64 units; a soft-attention layer Rao et al. (2017) (for all DQNs); and an output layer (2 dimensions). The replay buffer size is 500,000; the mini-batch size is 32; the discount factor is 0.99; the τ for the soft update of target parameters is 10^-3; the learning rate is 5 × 10^-4; the minimal step for resetting the memory buffer is 1,000. We train each model 1,000 times for each case and report the mean of the average final performance (averaged over all interference types) in Table S9.
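The resize-to-grayscale-and-stack preprocessing above can be sketched as follows. To keep the example self-contained we use NumPy only; the luminance weights and nearest-neighbor resize are simplified stand-ins for the OpenCV RGB2GRAY and resize calls used in the paper's code.

```python
import numpy as np

def to_gray(frame_rgb):
    """[H, W, 3] RGB frame -> [H, W] luminance (approximate RGB2GRAY)."""
    return frame_rgb @ np.array([0.299, 0.587, 0.114])

def resize_nn(img, out_h, out_w):
    """Nearest-neighbor resize, a crude substitute for cv2.resize."""
    h, w = img.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return img[rows][:, cols]

def preprocess(frames_rgb):
    """Stack 4 consecutive [400, 600, 3] frames into a [4, 100, 150] input."""
    return np.stack([resize_nn(to_gray(f), 100, 150) for f in frames_rgb])

frames = [np.zeros((400, 600, 3)) for _ in range(4)]
assert preprocess(frames).shape == (4, 100, 150)
```

The stacked [4, 100, 150] tensor then feeds the convolutional input layer described above.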

D.1 AVERAGE TREATMENT EFFECT UNDER INTERVENTION

We refine a Q-network with discrete actions for estimating treatment effects based on Theorem 1 in Louizos et al. (2017). In particular, the individual treatment effect (ITE) can be defined as the difference between the two potential outcomes of a Q-network, and the average treatment effect (ATE) is the expected value of the potential outcomes over the subjects. In a binary treatment setting, for a Q-value function Q_t(s_t) and the interfered state I(s_t), the ITE and ATE are calculated by:

Q_t^ITE = Q_t(s_t)(1 - p_t) + Q_t(I(s_t)) p_t,    (13)

ATE = (1/T) Σ_{t=1}^{T} [ E[Q_t^ITE(I(s_t))] - E[Q_t^ITE(s_t)] ],    (14)

where p_t is the interference label estimated by the agent and T is the total number of time steps in each episode. As expected, we find that CIQ indeed attains a better ATE, and its significance is supported by the refuting tests based on T_c, T_p, and T_s. To evaluate the causal effect, we follow a standard refuting setting Rothman & Greenland (2005); Pearl et al. (2016); Pearl (1995b) with the causal graphical model in Fig. 3 of the main content and run three major tests, reported in Tab. 13. The statistical tests were conducted with DoWhy Sharma et al. (2019), which has been submitted as supplementary material. (We intend to open source the code for reproducibility.) Pearl (1995a) introduces the "do-operator" to study this problem under intervention. The do symbol removes the treatment tr, which equals the interference I in Eq. (1) of the main content, from the given mechanism and sets it to a specific value by some external intervention. The notation P(r_t | do(tr)) denotes the probability of reward r_t with possible interventions on the treatment at time t. Following Pearl's back-door adjustment formula Pearl (2009) and the causal graphical model in Figure 2 of the main content, it is proved in Louizos et al. (2017) that the causal effect for a given binary treatment tr (denoted as the binary interference label i_t in Eq.
(1) of the main content), a series of proxy variables X = (x_1, ..., x_T) ≡ S = (s_1, ..., s_T), with s_t as in Eq. (1) of the main content, the accumulated reward R = Σ_{t=1}^{T} r_t, and a confounding variable Z can be evaluated by (similarly for tr = 0):

p(R | S, do(tr = 1)) = ∫_Z p(R | S, do(tr = 1), Z) p(Z | S, do(tr = 1)) dZ
                  (i) = ∫_Z p(R | S, tr = 1, Z) p(Z | S) dZ,    (15)

where equality (i) follows from the rules of do-calculus Pearl (1995a); Pearl et al. (2016) applied to the causal graph in Figure 1 of the main content. We extend Eq. (15) to an individual-outcome study with DQNs, following Theorem 1 of Louizos et al. (2017) and Chapter 3.2 of Pearl (2009).
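The quantities above can be illustrated numerically. The sketch below computes a p_t-weighted ITE in the style of Eq. (13), an episode-level ATE in the spirit of Eq. (14), and the back-door adjustment of Eq. (15) with a *discrete* confounder Z so the integral becomes a sum. All numbers are made up for illustration, and the exact reading of Q_t^ITE evaluated at I(s_t) versus s_t is an assumption, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(2)
T = 5
q_clean = rng.normal(size=T)       # stands in for Q_t(s_t)
q_intf = q_clean - 0.3             # Q_t(I(s_t)): interference lowers Q here
p = rng.uniform(size=T)            # estimated interference labels p_t

def q_ite(q_at_x, q_at_ix, p_t):
    """Eq. (13): Q_t^ITE = Q_t(x)(1 - p_t) + Q_t(I(x)) p_t."""
    return q_at_x * (1 - p_t) + q_at_ix * p_t

# Eq. (14): average over the episode of the interfered-vs-clean difference
ate_rl = np.mean(q_ite(q_intf, q_clean, p) - q_ite(q_clean, q_intf, p))

# Eq. (15) with discrete Z: p(R | S, do(tr)) = sum_Z p(R | S, tr, Z) p(Z | S)
p_z = np.array([0.7, 0.3])                 # p(Z | S) for Z in {0, 1}
p_r_given = np.array([[0.2, 0.4],          # p(R = high | S, tr = 0, Z)
                      [0.6, 0.9]])         # p(R = high | S, tr = 1, Z)

def p_r_do(tr):
    return float(np.dot(p_r_given[tr], p_z))

causal_effect = p_r_do(1) - p_r_do(0)      # interventional contrast
```

Note that the back-door sum weights Z by p(Z | S), not by p(Z | S, tr); this is precisely what distinguishes the interventional quantity p(R | S, do(tr)) from the observational conditional p(R | S, tr).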

D.2 REFUTATION TEST:

A sampling plan collects samples referred to as subgroups (i = 1, ..., k). The common cause variation (T-c) is denoted as σ_c, an estimate of the common cause variation within the subgroups in terms of the standard deviation of the within-subgroup variation: σ_c ≈ Σ_{i=1}^{k} s_i / k, where k denotes the number of subgroups and s_i the standard deviation within subgroup i. We introduce an intervention error rate n, the probability of feeding an erroneous interference label (e.g., feeding i_t = 0 even under interference, with probability n); the results are shown in Table 12. The test (T-p) of replacing the treatment with a random (placebo) variable is conducted by modifying the graphical relationship in the proposed probabilistic model in Fig. 3 of the main content: the newly assigned variable follows the placebo node but with values sampled from a random Gaussian distribution. The test of removing a random subset of data (T-r) randomly splits and samples the subset values to calculate an average treatment value in the proposed graphical model. We use the official DoWhy implementation, which includes: (1) confounders' effect on treatment: how the simulated confounder affects the value of the treatment; (2) confounders' effect on outcome: how the simulated confounder affects the value of the outcome; (3) effect strength on treatment: a parameter for the strength of the simulated confounder's effect on the treatment; and (4) effect strength on outcome: a parameter for the strength of the simulated confounder's effect on the outcome. Through Eqs. (9) to (10) and the corresponding correct action rate in the main content, we can interpret a deep Q-network by estimating the average treatment effect (ATE) of each noisy and adversarial observation. ATE Louizos et al. (2017); Shalit et al. (2017) is defined as the expected value of the potential outcomes (e.g., disease) over the subjects (e.g., clinical features).
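The within-subgroup common-cause-variation estimate σ_c described above can be sketched directly; the synthetic data below is purely illustrative.

```python
import numpy as np

# sigma_c ~= (1/k) * sum_i s_i: the average of the per-subgroup standard
# deviations s_i over k subgroups, as in the T-c refutation test above.
rng = np.random.default_rng(3)
k = 5
subgroups = [rng.normal(loc=0.0, scale=1.0, size=20) for _ in range(k)]

s = [np.std(g, ddof=1) for g in subgroups]   # within-subgroup std s_i
sigma_c = sum(s) / k                         # common cause variation estimate
```

Because each subgroup is drawn with unit scale here, σ_c lands near 1; in the refutation test it serves as the baseline against which the simulated common cause is compared.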
For example, in navigation environments, we could rank the harmfulness of each noisy observation against the Q-network of an autonomous driving agent. We also conduct a parameter study on the average returns of different DQN-based models, including DQN, Double DQN (DDQN), DDQN with dueling, CIQ, DQN with a classifier (DQN-CF), DDQN with a classifier (DDQN-CF), DQN with a variational autoencoder Kingma & Welling (2013) (DQN-VAE), NoisyNet, and using the latent input of a causal effect variational autoencoder for Q-network prediction (CEVAE-Q). Overall, CEVAE-Q is the largest model used in our experiments, with 14.4M parameters (in Env 1). CIQ keeps a roughly similar parameter count of 9.7M compared with DDQN, DDQN_d, and NoisyNet. Our ablation study in Tab. 14 indicates that the advantages of CIQ are not owing to extensive model capacity, judging by the parameter sizes. CIQ attains benchmark results in our resilient reinforcement learning setting compared to the other DQN models.

CEVAE-Q Network: TARNet (Shalit et al., 2017; Louizos et al., 2017) is a major class of neural network architectures for estimating the outcomes of a binary treatment on linear data (e.g., clinical reports). Our proposed CIQ uses an end-to-end approach to learn the interventional (causal) features. We provide another baseline that uses the latent features from a causal effect variational autoencoder (Louizos et al., 2017) (CEVAE) as state inputs, following the loss function in (Louizos et al., 2017).



Though such a design may raise concerns about over-parameterization, our ablation study shows that we can achieve better results with almost the same number of parameters.

Source: https://github.com/Unity-Technologies/ml-agents
Source: github.com/microsoft/dowhy/causal_refuters



Figure 1: Frameworks of: (a) the proposed causal inference Q-network (CIQ) training and test framework, where the latent state is an unobserved (hidden) confounder; (b) a 3D navigation task, Banana Collector (Juliani et al., 2018); and (c) a video game, LunarLander (Brockman et al., 2016).

(a) Causal graphical model (CGM). (b) CIQ architecture. The notation i_t^train denotes the interference label available during training, whereas ĩ_t is sampled during inference since i_t is unknown.

Following the refutation experiment in the CEVAE paper, we conduct the experiments shown in Tabs. S12 and S13 with 10% to 50% intervention noise on the binary treatment labels. The results in Tab. S12 show that the proposed CIQ maintains a lower error rate compared with the benchmark methods, including logistic regression and CEVAE (refer to Fig. 4(b) in Louizos et al. (2017)).

Figure 4: Perturbation-based saliency map on Pixel Cartpole under adversarial perturbation: (a) DQN, (b) CIQ, (c) DQN-CF, and (d) DVRLQ-CF. The black arrows are the correct actions and the blue arrows are the agents' actions. The neural saliency of CIQ corresponds to more correct actions with respect to the ground-truth actions.

E.2 LATENT REPRESENTATIONS

We conduct an ablation study by comparing other latent representation methods to the proposed CIQ model.

DQN with a variational autoencoder (DQN-VAE): To learn important features from observations, many recent works leverage deep variational inference to access latent states for feeding into a DQN. We provide a baseline that trains a variational autoencoder (VAE) built upon the DQN baseline, denoted as DQN-VAE. The DQN-VAE baseline aims to recover a potentially noisy state and feed the bottleneck latent features into the Q-network.


Algorithm 1 CIQ Training
1: Inputs: Agent, NoisyEnv, Oracle, max_step, NoisyEnv_test, target, eval_steps
2: Initialize: t = 0, score = 0, s_t = NoisyEnv.reset()
3: while t < max_step and score < target do
       Agent.learn(s_t, a_t, r_t, s_{t+1}, i_t)
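The loop structure of Algorithm 1 can be sketched as runnable Python. The NoisyEnv and Agent classes below are minimal stand-ins whose interfaces and internals are assumptions for illustration, not the paper's implementation; the oracle simply reveals the ground-truth interference label i_t used for CIQ's joint training.

```python
import random

class NoisyEnv:
    """Toy environment that corrupts observations with probability p."""
    def __init__(self, p=0.2):
        self.p, self.t, self.i_t = p, 0, False
    def reset(self):
        self.t = 0
        return self._observe()
    def _observe(self):
        self.i_t = random.random() < self.p      # interference label
        return 0.0 if self.i_t else 1.0          # "blackout" if interfered
    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 1 else 0.0
        return self._observe(), reward, self.t >= 10

class Agent:
    def act(self, s):
        return 1                                 # placeholder policy
    def learn(self, s, a, r, s_next, i_t):
        pass                                     # update Q-net + classifier

random.seed(0)
env, agent = NoisyEnv(), Agent()
t, score, target, max_step = 0, 0.0, 50.0, 1000
s = env.reset()
while t < max_step and score < target:           # Algorithm 1, line 3
    a = agent.act(s)
    s_next, r, done = env.step(a)
    i_t = env.i_t                                # oracle label at train time
    agent.learn(s, a, r, s_next, i_t)            # Algorithm 1, learn step
    score += r
    s = env.reset() if done else s_next
    t += 1
```

The key difference from plain DQN training is the extra i_t passed into `learn`, which supervises the interference classifier jointly with the Q-loss.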

The table reports the two robustness metrics for DQN and CIQ under two types of interference. CIQ attains higher scores than DQN in both CLEVER-Q and AC-Rate, reflecting CIQ's better resilience in these evaluations. We provide more robustness measurements in Appendices B.2 and E.

AC-Rate and CLEVER-Q robustness analysis under Gaussian (l2-norm) and adversarial (l∞-norm) perturbations in the vector Cartpole environment.

The table reports the two robustness metrics for DQN, CIQ, DQN-CF (a DQN jointly trained with an interference classifier, denoted as Q-CF), DQN-SA (a DQN jointly trained with safe-action replay, denoted as Q-SA), DVRLQ (a deep variational reinforcement learning framework Igl et al. (2018) with a DQN loss, denoted as V-Q), and DVRLQ-CF (a V-Q jointly trained with an interference classifier, denoted as V-CF) under two types of L_n-norm Weng et al. (2018) interference. CIQ attains higher scores than DQN in both CLEVER-Q and AC-Rate, reflecting CIQ's better resilience in these evaluations. The return performance of each agent is shown in Table S2. We observe that the variational auto-encoding methods, including V-Q and V-CF, attain lower CLEVER-Q, AC-Rate, and average returns in Tables S1 and S2. Following previous studies Van Hoof et al. (2016), the reason may be the difficulty Van Hasselt et al. (2016) of estimation when considering temporal information, and that the varying state is hard to disentangle with a single network in counterfactual learning Shalit et al. (2017).



Performance on return in the clean setting and five different noise levels in Env 1, evaluated by an average over runs under uncertain perturbations including Gaussian, adversarial, blackout, and frozen frame. All DQN models solve the environment with over 195.0 average return in a clean state (i.e., no noise).

Ablation study on the parameters of the different DQN models used in our experiments, trained under four different types of noisy environments (P = 20%): blackout, adversarial, Gaussian, and frozen frame, for Env 1 as reported in the main content. The minimal parameter count of each model is denoted as Para. in Tab. 14.

Performance on return in the clean setting and five different noise levels in Env 2, evaluated by an average over runs under uncertain perturbations including Gaussian, adversarial, blackout, and frozen frame. All DQN models solve the environment with over 12.0 average return in a clean state (i.e., no noise).

Ablation study on the parameters of the different DQN models used in our experiments, trained under four different types of noisy environments (P = 20%): blackout, adversarial, Gaussian, and frozen frame, for Env 2 as reported in the main content. The minimal parameter count of each model is denoted as Para. in Tab. 14.

Env 3 is a challenging task because the agent often receives negative rewards during training. We thus consider a non-stationary noise level sampled from a cosine wave in a narrow range of [0%, 20%] every ten steps. The results suggest CIQ can still solve the environment before the noise level rises above 30%. For the various noisy tests, CIQ attains the best performance, over 200.0, among the DQN algorithms (we skip the table since only CIQ and DQN-CF solved the environment over 200.0 when training with adversarial and blackout interference).

Performance on average return in the clean setting and five different noise levels in Env 3, evaluated by an average over runs under uncertain perturbations including Gaussian, adversarial, blackout, and frozen frame. All DQN models solve the environment with over 200.0 average return given clean state inputs (i.e., no noise).

Performance on average return in the clean setting and five different noise levels in Env 4, evaluated by an average over runs under uncertain perturbations including Gaussian, adversarial, blackout, and frozen frame. Only the selected DQN models below solve the environment with over 195.0 average return given clean state inputs (i.e., no noise).

Ablation study on the parameters of the different DQN models used in our experiments, trained under four different types of noisy environments (P = 20%): blackout, adversarial, Gaussian, and frozen frame, for Env 4 as reported in the main content. The minimal parameter count of each model is denoted as Para. in Tab. 14.

Each case trains with train% noise and then tests with test% noise. The results shown in Table S11 are similar to the cases with the same training and testing noise level. We observe that CIQ has the capability of learning transferable Q-value estimation, attaining a succeeding score of 195.00 at the noise level 30 ± 10%. Meanwhile, the other DQN methods, including DDQN-CF, DVRLQ-CF, and DDQN-SA, show a general performance decay in tests on different noise levels. This result may be limited by the generalization power and the challenges Bengio (2013); Higgins et al. (2018) of disentangling unseen states with a single parameterized deep network.

Stability test of the proposed CIQ (Train Noise-Level, Test Noise-Level).

In a causal learning setting, evaluating treatment effects and conducting statistical refuting experiments are essential to support the underlying causal graphical model. Through the resilient reinforcement learning framework, we can interpret a DQN by estimating the average treatment effect (ATE) of each noisy and adversarial observation. We first define how to calculate a treatment effect in the resilient RL setting and then conduct statistical refuting tests, including the random common cause variable test (T_c), replacing the treatment with a random (placebo) variable (T_p), and removing a random subset of data (T_s). The open-source causal inference package DoWhy Sharma et al. (2019) is used for analysis.

Absolute error of the ATE estimate; a lower value indicates more stable causal inference under perturbation of the logic direction, with P_I = 10% and n = the error rate of intervention on the binary label.

Validation of the causal effect by three causal refuting tests. The causal effect estimate is tested by the random common cause variable test (T-c), replacing the treatment with a random (placebo) variable (T-p; lower is better), and removing a random subset of data (T-r). The adversarial attack setting shows the strongest effect in most tests.

E.1 THE NUMBER OF MODEL PARAMETERS

Ablation study on the parameters of the different DQN models used in our experiments in Env 1, Env 2, Env 3, and Env 4. The minimal parameter count of each model is denoted as Para. in Table 10.


Double DQN (DDQN) Van Hasselt et al. (2016) further improves the performance by modifying the target to y_t^DDQN = r_t + γ Q(s_{t+1}, argmax_a Q(s_{t+1}, a; θ); θ⁻). Prioritized replay is another DQN improvement, which samples transitions (s_t, a_t, r_t, s_{t+1}) from the replay buffer with probabilities p_t proportional to their temporal difference (TD) error: p_t ∝ |y_t^DDQN - Q(s_t, a_t; θ)|^α, where α is a hyperparameter.

We use PyTorch 1.2 to implement both the DQN and causal inference Q (CIQ) networks in our experiments. Our code can be found in the supplementary material. We use Nvidia GeForce RTX 2080 Ti GPUs with CUDA 10.0 for our experiments. We use the quantile Huber loss (Dabney et al., 2018) L_κ with κ = 1 for the DQN models in Sup-Eq. 10, which allows less dramatic changes than the Huber loss. The quantile Huber loss (Dabney et al., 2018) is the asymmetric variant of the Huber loss for quantile τ ∈ [0, 1] in Sup-Eq. 9. After the maximum update step of the temporal loss u in Sup-Eq. 9, we synchronize θ⁻_i with θ_i following the implementation from the OpenAI baselines Dhariwal et al. (2017) in Sup-Eq. 11. We use the soft update Fox et al. (2015) to update the DQN target network as in Sup-Eq. 12, where θ_target and θ_local represent the two neural networks in DQN, and τ is the soft update parameter depending on the task.

For each environment, in addition to the 5 baselines described in Section 4, we use a four-layer neural network, which includes an input layer, two 32-unit wide ReLU hidden layers, and an output layer (2 dimensions). The observation dimension of Cartpole-v1 Brockman et al. (2016) is 4, and the input stacks 4 consecutive observations. The dimension of the input layer is [4 × 4].
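The DDQN target, the prioritized-replay probability, and the soft target update described above can be sketched together; the toy linear Q-network and the hyperparameter values are illustrative assumptions.

```python
import numpy as np

# Toy linear Q-network Q(s, a; theta) = theta[a] @ s.
rng = np.random.default_rng(4)
n_actions, state_dim, gamma, tau = 2, 4, 0.99, 5e-3
theta = rng.normal(size=(n_actions, state_dim))   # local (online) network
theta_target = theta.copy()                       # target network

def ddqn_target(r, s_next):
    """y = r + gamma * Q(s', argmax_a Q(s', a; theta); theta_target)."""
    a_sel = int(np.argmax(theta @ s_next))        # action picked by online net
    return r + gamma * (theta_target[a_sel] @ s_next)

def priority(td_error, alpha=0.6):
    """Prioritized replay: sampling probability p_t proportional to |TD|^alpha."""
    return abs(td_error) ** alpha

def soft_update():
    """theta_target <- tau * theta_local + (1 - tau) * theta_target."""
    global theta_target
    theta_target = tau * theta + (1 - tau) * theta_target

y = ddqn_target(1.0, rng.normal(size=state_dim))
soft_update()
```

Using the online network to *select* the action and the target network to *evaluate* it is what distinguishes the DDQN target from the plain DQN target and reduces over-estimation bias.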
We design a replay buffer with a memory of 100,000 and a mini-batch size of 32; the discount factor γ is set to 0.99; the τ for the soft update of the target parameters is 5 × 10^-3; the learning rate for Adam Kingma & Ba (2014) optimization is 5 × 10^-4; the weight-decay regularization term is 1 × 10^-4; the importance-sampling exponent α is 0.6; and the prioritization exponent is 0.4. We train each model 1,000 times for each case and report the mean of the average final performance (averaged over all interference types) in Table S4. Env 1 is often recognized as one of the simplest environments for DQN training. However, we observe a stability issue in attaining rewards over 190.0, where most DQN models attain a score over 100.0 at a 10% noise level. CIQ performs best with competitive results, without adverse effects from over-parameterization.

To get the causal latent model in the Q-network, we approximate the posterior distribution by a neural network z_t ∼ p(z_t | x_t) = φ(x_t; θ_1). Then we train this neural network, CEVAE-Q, by variational inference using the generative model. We conduct 10,000 experiments with fine-tuning on DQN-VAE and CEVAE-Q. The results in Table 15 show that the latent representation learned by CIQ provides better resilience than the other representations. To study the importance of specific components in CIQ, we conducted additional ablation studies and constructed new baseline models, shown in Table 16, tested in Env 1 (Cartpole). Baseline 1 (B1): CIQ without the concatenation of ĩ_t in S_CI. This comparison shows the importance of using both the predicted confounder z_t and the predicted label ĩ_t. B1 uses label prediction to help the latent representation but does not use the predicted labels in decision-making. The structure is motivated by a task-specific DQN network (using depth-only information from a maze environment) from a previous study Mirowski et al. (2016).
Baseline 2 (B2): CIQ without the θ_3 network (for testing θ_3's importance). The structure is motivated by the meta-inference reinforcement learning approach proposed by Humplik et al. (2019). Baseline 3 (B3): CIQ without providing the ground-truth i_t for training, for testing the importance of the interference loss and joint loss propagation. The superior performance of CIQ validates that the proposed components are indeed crucial, as discussed in Section 3 of the main content. The setting used for Table 16 is the same as the setting for the third column (noise level = 20%) of Table 5 and the third column (noise level = 20%) of Table 15, tested in Env 1 (Cartpole).

E.4 PERTURBATION-BASED NEURAL SALIENCY FOR DQN AGENTS

To better understand our CIQ model, we use a benchmark saliency method for DQN agents, the perturbation-based saliency map (Greydanus et al., 2018), to visualize the salient pixels that are sensitive to the loss function of the trained DQNs. We conduct a case study on an input frame under adversarial perturbation, as shown in Fig. 4. We evaluate the DQN agents, including DQN, CIQ, DQN-CF, and DVRLQ-CF, and record the weighted center of each neural saliency map; the saliency pixels of CIQ respond to the ground-truth actions more frequently (96.2%) than those of the other DQN methods.
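A perturbation-based saliency map in the spirit of Greydanus et al. (2018) can be sketched as follows: the saliency of a pixel location is measured by how much the agent's output changes when the frame is locally blurred there. The "Q-network" and the crude mean-patch blur below are made-up stand-ins for illustration, not the trained models from the paper.

```python
import numpy as np

rng = np.random.default_rng(5)
W = rng.normal(size=(2, 10 * 10))                 # toy "Q-network" weights

def q_values(frame):
    return W @ frame.ravel()

def blur_patch(frame, i, j, radius=2):
    """Replace a square patch around (i, j) with its mean (a crude blur)."""
    out = frame.copy()
    lo_i, hi_i = max(0, i - radius), min(frame.shape[0], i + radius + 1)
    lo_j, hi_j = max(0, j - radius), min(frame.shape[1], j + radius + 1)
    out[lo_i:hi_i, lo_j:hi_j] = out[lo_i:hi_i, lo_j:hi_j].mean()
    return out

def saliency_map(frame, stride=2):
    """S(i, j) = 0.5 * ||Q(frame) - Q(blurred frame)||^2."""
    base = q_values(frame)
    sal = np.zeros(frame.shape)
    for i in range(0, frame.shape[0], stride):
        for j in range(0, frame.shape[1], stride):
            diff = base - q_values(blur_patch(frame, i, j))
            sal[i, j] = 0.5 * float(diff @ diff)
    return sal

frame = rng.normal(size=(10, 10))
sal = saliency_map(frame)
```

High-saliency locations are those where removing local information most changes the Q-values; the weighted center of this map is what we compare against the ground-truth actions above.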

E.5 ROBUSTNESS TRANSFERABILITY AMONG DIFFERENT INTERFERENCE TYPES

We conduct additional experiments to study the robustness transferability of DQN and CIQ when training and testing under different kinds of interference in Env 1. Note that both architectures solve a clean environment successfully (over 195.0). The reported numbers are averaged over 20 independent runs for each condition. As shown in Tables 17 and 18, CIQ agents consistently attain significant performance improvements compared with DQN agents, especially between the Gaussian and adversarial perturbations. In particular, CIQ succeeded in solving the environment 12 times out of 20 independent runs, with an average score of 165.2, in the Gaussian (train) to adversarial (test) adaptation. Interestingly, augmenting with adversarial perturbation does not always guarantee the best policy transfer when testing in the Blackout and Frozen conditions, showing slightly lower performance compared with training on Gaussian interference. The reason could be attributed to recent findings that adversarial training can undermine model generalization (Raghunathan et al., 2019; Su et al., 2018).

E.6 EXTENSION TO MULTIPLE INTERFERENCE TYPES

Here we show how the proposed CIQ model can be extended from the architecture shown in Figure 2 to the multi-interference (MI) setting. The design intuition is based on two-step inference: a common encoder first infers whether an observation is clean or noisy, followed by an individual decoder tied to each interference type, which infers the noise type and activates the corresponding Q-network (named θ_4). Note that the two-step inference mechanism follows the RCM as two sequential potential-outcome estimation models (Rubin, 1974; Imbens & Rubin, 2010), where the interfered observation x_t is determined by two labels i_{1,t} and i_{2,t} according to

