DAYLIGHT: ASSESSING GENERALIZATION SKILLS OF DEEP REINFORCEMENT LEARNING AGENTS

Abstract

Deep reinforcement learning algorithms have recently achieved significant success in learning high-performing policies from purely visual observations. The ability to perform end-to-end learning from raw, high-dimensional input alone has led to deep reinforcement learning algorithms being deployed in a variety of fields. Thus, understanding and improving the ability of deep reinforcement learning agents to generalize to unseen data distributions is of critical importance. Much recent work has focused on assessing the generalization of deep reinforcement learning agents by introducing specifically crafted adversarial perturbations to their inputs. In this paper, we propose another approach that we call daylight: a framework to assess the generalization skills of trained deep reinforcement learning agents. Rather than focusing on worst-case analysis of distribution shift, our approach is based on black-box perturbations that correspond to semantically meaningful changes to the environment or the agent's visual observation system, ranging from brightness changes to compression artifacts. We demonstrate that even the smallest changes in the environment cause the performance of the agents to degrade significantly in various games from the Atari environment, despite these perturbations having orders of magnitude lower perceptual similarity distance than state-of-the-art adversarial attacks. We show that our framework captures a diverse set of bands in the Fourier spectrum, giving a better overall understanding of the agent's generalization capabilities. We believe our work can be a crucial step towards building resilient and generalizable deep reinforcement learning agents.

1. INTRODUCTION

Following the initial work of Mnih et al. (2015), the use of DNNs as function approximators in reinforcement learning has led to a dramatic increase in the capabilities of RL agents Schulman et al. (2017); Lillicrap et al. (2015). In particular, these developments allow for the direct learning of strong policies from raw, high-dimensional inputs such as visual observations. With the successes of these new methods come new challenges regarding the robustness and generalization capabilities of deep RL agents. One line of research has focused on the high sensitivity of deep neural networks to imperceptible, adversarial perturbations to visual inputs, first in the setting of image classification Szegedy et al. (2014); Goodfellow et al. (2015) and more recently for deep reinforcement learning Huang et al. (2017); Kos & Song (2017). Since one of the main reasons for the success and popularity of deep RL is the ability to learn directly from visual observations alone, this non-robustness to small adversarial perturbations is a serious concern Chokshi (2020); Vlasic & Boudette (2016); Kunkle (2018). However, existing adversarial formulations for deep reinforcement learning require high computational effort to produce the perturbations, knowledge of the network used to train the agent, knowledge of the environment, and real-time access to and manipulation of the agent's state observations. In this paper, we propose a more realistic scenario where we do not have access to any of the above, and the adversary essentially consists of realistic changes in the natural environment or in the agent's observation system. For instance, if our deep reinforcement learning agent is operating a self-driving car, one could plausibly expect changes in daylight levels, shifts in angle due to terrain, fog on the camera lens, or compression artifacts from the camera processor.
We believe that our proposed framework is semantically more meaningful than arbitrary ℓp-norm bounded pixel perturbations. Prior work on image classification Dodge & Karam (2016) showed that image quality distortions can reduce the accuracy of DNN classifiers. Moreover, recent work by Ford et al. (2019) showed that while adversarial training for image classifiers reduced their vulnerability to perturbations corresponding to high frequencies in the Fourier domain, it actually made the models more vulnerable to low frequency perturbations, including fog and contrast changes. Therefore, it is important to investigate model robustness and generalization throughout different bands of the frequency domain. We believe that being able to accurately assess the generalization capabilities of deep reinforcement learning agents is an initial step towards building robust and reliable agents. For these reasons, in this work we investigate the robustness of trained deep reinforcement learning agents and make the following contributions:

• We propose a realistic threat model called daylight and a generalization framework for deep reinforcement learning agents that aims to assess the robustness of the agents to basic environmental and observational changes.

• We run multiple experiments in various games of the Atari environment to demonstrate the degradation in performance of deep reinforcement learning agents.

• We compare our threat model with the state-of-the-art adversarial method based on ℓp-norm changes, and we show that our daylight framework achieves competitive, and almost always larger, impact with lower perceptual similarity distance.

• We evaluate the daylight framework in the time domain and show that several works based on the timing perspective of adversarial formulations can be revisited within our daylight framework.

• Finally, we investigate the frequency domain of our framework and state-of-the-art targeted attacks. We show that our framework captures different bands of the frequency spectrum, thus yielding a better estimate of model robustness.

2.1. PRELIMINARIES

In this paper we consider Markov Decision Processes (MDPs) given by a tuple (S, A, P, r, γ, s_0). The reinforcement learning agent interacts with the MDP by observing states s ∈ S and taking actions a ∈ A. The probability of transitioning to state s′ when the agent takes action a in state s is determined by the transition probability function P : S × A × S → R. The reward received by the agent when taking action a in state s is given by the reward function r : S × A → R. The goal of the agent is to learn a policy π_θ : S × A → R that takes actions maximizing the cumulative discounted reward ∑_{t=0}^{T−1} γ^t r(s_t, a_t). Here s_0 is the initial state of the agent, and γ is the discount factor. For deep Q-networks (DQN) the optimal policy is determined by learning the state-action value function Q(s, a). For a state s we use F(s) to denote its 2D discrete Fourier transform. Szegedy et al. (2014) proposed to create adversarial perturbations by minimizing the distance between the original image and the adversarially produced image, using box-constrained L-BFGS to solve this optimization problem. Goodfellow et al. (2015) introduced the fast gradient method (FGM),
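The discounted objective above is straightforward to compute; a minimal illustrative snippet (not from the paper):

```python
def discounted_return(rewards, gamma):
    """Cumulative discounted reward: sum_{t=0}^{T-1} gamma^t * r(s_t, a_t)."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))
```

For example, three rewards of 1 with γ = 0.5 give 1 + 0.5 + 0.25 = 1.75.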

2.2. CRAFTING ADVERSARIAL PERTURBATIONS

x_adv = x + ε · ∇_x J(x, y) / ‖∇_x J(x, y)‖_p,

for crafting adversarial examples in image classification, where x is the input, y is the output label, J(x, y) is the cost function used to train the neural network, and ε is the perturbation budget. Carlini & Wagner (2017) introduced targeted attacks in the image classification domain based on minimizing the distance between the adversarial image and the original image while targeting a particular label. In the deep reinforcement learning domain the Carlini & Wagner (2017) formulation is

min_{s_adv ∈ S} ‖s_adv − s‖_p subject to a*(s) ≠ a*(s_adv),

where s is the unperturbed input, s_adv is the adversarially perturbed input, a*(s) is the action taken in the unperturbed state, and a*(s_adv) is the action taken in the adversarial state. This formulation minimizes the distance to the original state, constrained to states leading to sub-optimal actions as determined by the Q-network. In contrast to adversarial attacks, our proposed threat model requires no information about the cost function used to train the network, the Q-network of the trained agent, or the visited states themselves.
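The FGM update can be sketched as follows, assuming the gradient ∇_x J(x, y) has already been computed elsewhere (e.g. by automatic differentiation); `fgm_perturb` and `eps` are illustrative names, not from the original papers:

```python
import numpy as np

def fgm_perturb(x, grad, eps, p=2):
    """x_adv = x + eps * grad / ||grad||_p; p=np.inf yields the sign-based variant."""
    if np.isinf(p):
        return x + eps * np.sign(grad)
    norm = np.linalg.norm(grad.ravel(), ord=p)
    return x + eps * grad / max(norm, 1e-12)  # guard against a zero gradient
```

With p = 2 the perturbation has exactly ε Euclidean length, which is what the ℓp-norm comparisons in this paper control for.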

2.3. ADVERSARIAL APPROACH IN DEEP REINFORCEMENT LEARNING

The first adversarial attacks on deep reinforcement learning policies were introduced by Huang et al. (2017); Kos & Song (2017). To quantify how visible our perturbations are, we measure the perceptual distance between two images with the Learned Perceptual Image Patch Similarity (LPIPS) metric of Zhang et al. (2018), computed from the activations of a deep network (Simonyan & Zisserman, 2015) without calibration. We compare the distance between adversarial states s_adv and the original states s with the LPIPS metric, and refer to it as P_similarity throughout the paper. P_similarity(s, s_adv) returns the distance between s and s_adv based on network activations; Zhang et al. (2018) show that P_similarity is a reliable approximation of human perception.

2.5. IMPACT

We define the normalized impact of an adversary on the agent as

I = (Score_max − Score_adv) / (Score_max − Score_min).    (2)

Score_max is the score at the end of the episode achieved by an agent that takes the action maximizing its Q(s, a) function in every visited state, and Score_min is the score at the end of the episode achieved by an agent that takes the action minimizing its Q(s, a) function in every visited state. Score_adv is the score at the end of the episode achieved by the agent that takes the action maximizing Q(s_adv, a) under the influence of the adversary in every visited state. We report results on a normalized scale because we observed that the agent can collect stochastic rewards even when it chooses the score-minimizing action in every visited state.
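The impact definition translates directly into code; a small helper under the stated definitions (the function name is illustrative):

```python
def impact(score_adv, score_max, score_min):
    """Normalized impact I = (Score_max - Score_adv) / (Score_max - Score_min)."""
    return (score_max - score_adv) / (score_max - score_min)
```

An adversary that drags the agent all the way down to Score_min yields I = 1, while an adversary with no effect yields I = 0.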

3. DAYLIGHT: A GENERALIZATION TESTING FRAMEWORK

In our generalization framework we propose a baseline to test the robustness of trained deep reinforcement learning agents with respect to several realistic failures of the agent's observations. This is in contrast to prior work that assumed a strong adversary with prior access to the training details of the agent's neural network, real-time access to the agent's perception system, and computationally demanding adversarial formulations used to compute simultaneous perturbations. In our model we consider the environment itself as an adversary, and we examine several environmental changes: changes in the brightness of the environment, blurring of the observation, rotation of the observation, perspective transformation, shifting, and compression artifacts. These changes can be easily linked to naturally occurring changes in the environment. For instance, for a self-driving car a brightness change can be linked to the time of day, or the appearance of reflective objects or shadows. Rotation, perspective transformation, and shifting can be linked to driving on a road with varied terrain. Blurring can be linked to a rainy day, foggy weather, or a fogged-up camera lens. In the remainder of this section we compare the impact values and perceptual similarities of the daylight framework with the state-of-the-art targeted adversarial attack proposed by Carlini & Wagner (2017).

3.1. BRIGHTNESS AND CONTRAST

The first component of our framework tests the trained agents at different brightness and contrast levels using a linear brightness and contrast transformation, s_adv(i, j) = s(i, j) · α + β, where s(i, j) is the ij-th pixel of state s, and α and β are the linear contrast and brightness parameters. In Table 1 we show the impacts and perceptual similarity distances with the corresponding α, β values. In all of the games except BankHeist, the brightness and contrast change results in higher impact. The perceptual similarity distance of brightness and contrast is lower in every game. In Figure 3 we show the corresponding states s and s_adv for all components of the daylight framework.
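The linear transformation above can be sketched in a few lines of NumPy; clipping to the 8-bit pixel range is our assumption, since the paper does not state how out-of-range values are handled:

```python
import numpy as np

def brightness_contrast(state, alpha, beta):
    """s_adv(i, j) = s(i, j) * alpha + beta, clipped to the valid pixel range."""
    return np.clip(state.astype(np.float32) * alpha + beta, 0, 255)
```

Here alpha > 1 stretches the contrast and beta > 0 brightens the frame.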

3.2. BLURRING

The second component in our daylight framework is blurring. Median blurring is a nonlinear noise removal technique that replaces the original pixel value with the median pixel value of its neighbouring pixels. A kernel size k means that the median is computed over a k × k neighborhood of the original pixel. In Table 2 we show the impact values and perceptual similarity distances with the corresponding kernel size. Only in BankHeist do we observe that the impacts and perceptual similarity distances are comparable. For the rest of the games, impact is higher and perceptual similarity distance is lower for blurring.
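A naive NumPy version of median blurring, for exposition only (libraries such as OpenCV or SciPy provide optimized equivalents; edge padding is our assumption):

```python
import numpy as np

def median_blur(img, k):
    """Replace each pixel by the median over its k x k neighborhood (edge-padded)."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.empty_like(img, dtype=np.float64)
    h, w = img.shape
    for i in range(h):
        for j in range(w):
            out[i, j] = np.median(padded[i:i + k, j:j + k])
    return out
```

Because the median ignores outliers, a single bright pixel (e.g. a small sprite) can vanish entirely, which is exactly the kind of information loss that hurts the agent.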

3.3. ROTATION

The next component in our daylight framework is rotation. In Table 3 we show impact values and perceptual similarity distances with the corresponding rotation angle. In all of the games except Pong, rotation results in higher impact and orders of magnitude lower perceptual similarity distance. In Pong the impact is comparable and the perceptual similarity distance is lower by a factor of 2.
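Rotation about the image center can be sketched with nearest-neighbor inverse mapping; this is a simplification for exposition, since the paper does not specify the interpolation scheme and library routines would normally be used:

```python
import numpy as np

def rotate_nn(img, angle_deg):
    """Rotate an image about its center by angle_deg, nearest-neighbor sampling."""
    h, w = img.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    theta = np.deg2rad(angle_deg)
    cos_t, sin_t = np.cos(theta), np.sin(theta)
    out = np.zeros_like(img)
    for i in range(h):
        for j in range(w):
            # inverse-map each output pixel back into the source image
            y = cos_t * (i - cy) + sin_t * (j - cx) + cy
            x = -sin_t * (i - cy) + cos_t * (j - cx) + cx
            yi, xi = int(np.rint(y)), int(np.rint(x))
            if 0 <= yi < h and 0 <= xi < w:
                out[i, j] = img[yi, xi]
    return out
```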

3.4. SHIFTING

The next component in our daylight framework is shifting. Shifting an image moves the elements of the image matrix along a dimension by some number of elements. In this subsection we shift the inputs in the x or y direction by as few pixels as possible. We use [t_i, t_j] to denote the distance shifted, where t_i is in the x direction and t_j is in the y direction. In Table 4 we show the impact values and perceptual similarity distances for both Carlini & Wagner (2017) and shifting. For all of the games, shifting yields higher impact and lower perceptual similarity distance.
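Shifting by [t_i, t_j] can be sketched as below; zero-filling the exposed border is our assumption, since the paper does not state how the vacated pixels are filled:

```python
import numpy as np

def shift(img, ti, tj):
    """Shift img by ti pixels along x (columns) and tj along y (rows), zero-filled."""
    out = np.zeros_like(img)
    h, w = img.shape
    src = img[max(0, -tj):h - max(0, tj), max(0, -ti):w - max(0, ti)]
    out[max(0, tj):h - max(0, -tj), max(0, ti):w - max(0, -ti)] = src
    return out
```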

3.5. COMPRESSION ARTIFACTS

In this section we look at JPEG compression artifacts caused by the discrete cosine transform (DCT), which result in the loss of high frequency components (ringing and blocking). In Table 5 we show the impact values and perceptual similarities of Carlini & Wagner (2017) and compression artifacts (CA). Only in Pong and Riverraid do we observe a lower impact than Carlini & Wagner (2017), while the perceptual similarity distance is orders of magnitude smaller for compression artifacts. In BankHeist the impact is comparable, and in the rest of the games compression artifacts result in higher impact and lower perceptual similarity distance compared to Carlini & Wagner (2017).
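The mechanism behind these artifacts can be illustrated by quantizing the DCT coefficients of an 8×8 block; this sketch omits the rest of the JPEG pipeline (chroma subsampling, zig-zag scan, entropy coding), and the uniform quantization step `q` stands in for JPEG's quantization tables:

```python
import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II basis matrix (rows index frequency, columns space)."""
    k = np.arange(n)
    C = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    C[0, :] *= 1 / np.sqrt(2)
    return C * np.sqrt(2 / n)

def jpeg_like(block, q):
    """Quantize a block's DCT coefficients with step q, then reconstruct."""
    C = dct_matrix(block.shape[0])
    coeffs = C @ block @ C.T
    coeffs = np.round(coeffs / q) * q  # uniform quantization discards fine detail
    return C.T @ coeffs @ C
```

Larger q zeroes out more high-frequency coefficients, producing the ringing and blocking described above.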

3.6. PERSPECTIVE TRANSFORMATION

The final component of our daylight framework is perspective transformation. Given four points in the plane defining a convex quadrangle, there is a unique perspective transformation mapping the corners of the square image frame to these four points (see Equation 5 and Equation 6). We define the norm of a perspective transformation as the maximum distance that one of the corners of the square moves under this mapping. In Table 6 we show impact values and perceptual similarity distances with respect to perspective norms. For all the games we observe that perspective transformation yields higher impact and lower perceptual similarity distance. In Figure 2 we show the Fourier spectrum of the original state s and the perturbed states s_adv from the daylight framework and Carlini & Wagner (2017). In Figure 1 we show the power spectral density of the original state compared to several perturbations from our proposed daylight framework and Carlini & Wagner (2017). In Figure 1 we observe that while Carlini & Wagner (2017) increases the higher frequencies, compression artifacts decrease the magnitude of the higher frequency band. On the other hand, brightness and contrast decreases the magnitude of the low frequency band, and shifting increases the midband. Blurring decreases the midband and high frequencies together, and perspective transformation decreases the low frequencies and high frequencies while increasing the midband. We believe capturing the susceptibilities towards perturbations in different bands of the frequency domain is a key step towards building robust agents.
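The perspective transformation described at the start of this subsection can be recovered from the four corner correspondences with the standard direct linear transform; the helper names below are illustrative, not code from the paper:

```python
import numpy as np

def homography_from_corners(src, dst):
    """Solve for the 3x3 perspective matrix M (normalized so M[2,2]=1)
    mapping each src corner (x, y) to the corresponding dst corner (u, v)."""
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y]); b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y]); b.append(v)
    m = np.linalg.solve(np.array(A, float), np.array(b, float))
    return np.append(m, 1.0).reshape(3, 3)

def perspective_norm(src, dst):
    """Maximum distance any corner moves, as used to parameterize the transform."""
    return max(np.hypot(u - x, v - y) for (x, y), (u, v) in zip(src, dst))
```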

5. TIMING PERSPECTIVE

In the previous sections we tested our agents' generalization capabilities in modified environments with our initial threat model, where the modification was applied to every state the agent visited, for both Carlini & Wagner (2017) and our daylight framework. In this section we examine the effect of both adversarial models when the perturbations are applied to only a small fraction of states. For this purpose we introduce the adversarial state s_adv at randomly sampled time steps: the agent observes s_adv with probability p, and the original state s with probability 1 − p. We use n_{s_adv} to denote the number of states where the agent observed s_adv instead of the original state s, and n_s to denote the total number of states visited by the agent in the given episode. We use e_adv to denote the fraction n_{s_adv}/n_s of adversarial perturbations per episode. In Table 7 we show the attack impacts of Carlini & Wagner (2017) and the daylight framework with the corresponding adversarial observation probability p, averaged over 10 random episodes. See Appendix A.4 for more details. Even for low p values our proposed daylight framework obtains higher impact. Thus, to capture a broader view of the robustness of the agent, the prior work on the timing perspective by Sun et al. (2020); Lin et al. (2017), based on worst-case distributional shift, can be revisited with our daylight framework.
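The sampling scheme can be sketched as follows (illustrative names; the paper's exact sampling code is not given):

```python
import numpy as np

def maybe_perturb(state, perturb, p, rng):
    """With probability p the agent observes the perturbed state instead of s."""
    return perturb(state) if rng.random() < p else state

def episode_fraction(n_states, p, rng):
    """Simulate which observations are perturbed; returns e_adv = n_{s_adv}/n_s."""
    flags = rng.random(n_states) < p
    return flags.sum() / n_states
```

As noted in Appendix A.4, the realized fraction e_adv fluctuates around p from episode to episode.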

6. EXPERIMENTS

In our experiments we trained our agents with DDQN Wang et al. (2016) in the OpenAI Gym Brockman et al. (2016) Atari environment Bellemare et al. (2013). We test trained agents from several Atari environments, with results averaged over 10 episodes. In Figure 3 we show the original states and the environmental modifications. Interestingly, we found that a majority of the games show high robustness against rotation. On the other hand, shifting and perspective transformation can reach a higher impact level than the state-of-the-art targeted attack while not being recognizable by human perception. We observed that in some games, such as Pong and Riverraid, brightness and contrast require radical changes to cause the agent to fail, while for others the required change is imperceptible. We also observed that for games like Pong, which is relatively simple compared to other games in the Atari environment, the threshold values for the environmental modification are higher; as game complexity increases, the environment modification thresholds decrease drastically. We expect this issue to become more important as deep reinforcement learning agents are deployed in more complex and realistic scenarios.

7. CONCLUSION

In this paper we studied a realistic threat model based on basic environmental changes and proposed a framework called daylight to assess the generalization capabilities of deep reinforcement learning agents. We compared our daylight framework with state-of-the-art adversarial attacks in the Atari environment. We demonstrated that our framework achieves higher impact on agent performance with lower perceptual similarity distance, without requiring access to the agent's training details, real-time access to the agent's memory and perception system, or computationally demanding adversarial formulations to compute simultaneous perturbations. We investigated perturbations in the time domain and showed that studies based on imperceptible perturbations can be revisited within the daylight framework. Finally, we showed that each component of our framework captures distinct bands in the frequency domain, resulting in a better estimate of the generalization capabilities of trained agents. We believe our framework can be instrumental towards the generalization and robustification of deep reinforcement learning algorithms.

A APPENDIX

A.1 RAW SCORE RESULTS

In this section we provide the raw scores of the agents under the observation modifications from both the state-of-the-art adversarial attack and the Daylight framework with components: Brightness & Contrast, Blurring, Rotation, Shifting, Compression Artifacts, and Perspective Transform. Note that the Daylight hyperparameters refer to [α, β] for brightness and contrast, the kernel size for blurring, the rotation degree for rotation, [t_i, t_j] for shifting, and the perspective norm for perspective transformation.

Shifting and compression artifacts have nearly maximal impact on the performance of the agent trained with A3C, while the other perturbations all have impact at least 0.9. Note that we did not change the Daylight hyperparameters, to allow a direct comparison between the A3C agent and the DDQN agent. Therefore, although impact is slightly lower for brightness & contrast for A3C than for DDQN, it is possible that choosing different values of α and β while minimizing the perceptual similarity P_similarity could still result in a higher impact for an agent trained with A3C.

A.4 TIMING PERSPECTIVE

Note that the fraction e_adv can differ slightly from p due to random fluctuations; we therefore also report these values in Table 7. Note that e_adv varies between games, because each game has a different minimum threshold for e_adv needed to achieve stable impact across episodes.



Framework website: https://daylightframework.github.io



4. FREQUENCY DOMAIN ANALYSIS

Yin et al. (2019) showed that DNN models are robust to certain bands of perturbations in the frequency domain. Furthermore, they showed that adversarial training shifts the vulnerability from high frequency noise towards low frequency noise. Moreover, Yin et al. (2019) claim that a framework that aims to measure robustness and generalization needs to firmly encapsulate different directions of the spectrum in the frequency domain. In this section we show that the daylight framework indeed captures a broader set of directions in the frequency domain.
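One way to summarize which bands a perturbation affects is a radially averaged power spectrum of F(s); below is a sketch of such a diagnostic (the binning choices are our own, not the paper's):

```python
import numpy as np

def radial_power_spectrum(img, n_bins=16):
    """Average |F(s)|^2 over radial frequency bins, from low to high band."""
    F = np.fft.fftshift(np.fft.fft2(img))
    power = np.abs(F) ** 2
    h, w = img.shape
    yy, xx = np.indices((h, w))
    r = np.hypot(yy - h // 2, xx - w // 2)   # radial distance from DC
    bins = np.linspace(0, r.max() + 1e-9, n_bins + 1)
    idx = np.digitize(r.ravel(), bins) - 1
    sums = np.bincount(idx, weights=power.ravel(), minlength=n_bins)
    counts = np.maximum(np.bincount(idx, minlength=n_bins), 1)
    return sums / counts
```

Comparing this curve for s and s_adv shows whether a perturbation lives in the low, mid, or high band, mirroring the Figure 1 analysis.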

Figure 1: Riverraid power spectrum change with various perturbations: Carlini & Wagner, compression artifacts, brightness and contrast, perspective transformation, shifting, rotation.

Figure 2: Rows: F(s) for BankHeist, F(s) for Riverraid. Columns: original state, Carlini & Wagner, brightness and contrast, blurring, rotation, shifting, perspective transformation, compression artifacts.

Figure 3: Original frame and environmental modifications. Columns: original frame, shifting, rotation, perspective transformation, blurring, compression artifacts, brightness and contrast. Rows: BankHeist, TimePilot, Pong, JamesBond, Riverraid.



Table 2: Impacts of Carlini & Wagner (2017) and blurring with corresponding perceptual similarity P_similarity, and the kernel size.

Table 3: Impacts of Carlini & Wagner (2017) and rotation with corresponding perceptual similarity P_similarity, and the rotation angle (RD denotes rotation degree).

Table 4: Impacts of Carlini & Wagner (2017) and shifting with corresponding perceptual similarity P_similarity, and the shift [t_i, t_j].

Table 5: Impacts of Carlini & Wagner (2017) and compression artifacts (CA) with corresponding perceptual similarity P_similarity.

Table 6: Impacts of Carlini & Wagner (2017) and perspective transformation (PT) with corresponding perceptual similarity P_similarity, and the perspective norm.

Table 7: Impact comparison for varying adversarial observation probability p.

Raw Scores of Carlini & Wagner (2017) (C&W) and brightness & contrast (B&C) with corresponding perceptual similarity P similarity , and the [α, β].

Raw Scores of Carlini & Wagner (2017) (C&W) and blurring with corresponding perceptual similarity P_similarity, and the kernel size.

Raw Scores of Carlini & Wagner (2017) (C&W) and rotation with corresponding perceptual similarity P_similarity, and the rotation angle (RD denotes rotation degree).

Raw Scores of Carlini & Wagner (2017) (C&W) and shifting with corresponding perceptual similarity P similarity , and the shifting [t i , t j ].

Generalized Impacts of Carlini & Wagner (2017) (C&W) and rotation with corresponding perceptual similarity P_similarity, and the rotation angle (RD denotes rotation degree).

Generalized Impacts of Carlini & Wagner (2017) (C&W) and shifting with corresponding perceptual similarity P_similarity, and the shift [t_i, t_j].

Generalized Impacts of Carlini & Wagner (2017) (C&W) and compression artifacts (CA) with corresponding perceptual similarity P_similarity.

Generalized Impacts of Carlini & Wagner (2017) (C&W) and perspective transformation (PT) with corresponding perceptual similarity P_similarity, and the perspective norm.

In this section we investigate policy gradient methods. In particular, Table 20 shows the raw scores and generalized impacts I_general of the agent trained with A3C under the Daylight framework with the following observation modifications: brightness & contrast, blurring, rotation, shifting, compression artifacts, and perspective transform. In Table 20 the exact same hyperparameters have been used as stated in Table 1 through Table 6.

Raw Scores and generalized impacts of the agent trained with the A3C algorithm in the Pong environment and evaluated with the Daylight framework: brightness & contrast, blurring, rotation, shifting, compression artifacts (CA), and perspective transform (PT).

Impact comparison with the fraction of adversarial observations per episode e_adv.

s_adv(i, j) = s( (M_11 s_i + M_12 s_j + M_13) / (M_31 s_i + M_32 s_j + M_33), (M_21 s_i + M_22 s_j + M_23) / (M_31 s_i + M_32 s_j + M_33) )    (6)


For the scope of the paper we used the Impact definition in Equation 2 when comparing our proposed Daylight framework to the state-of-the-art targeted attacks. For a more generalized comparison between different algorithms and different games, one can use Score_clean, the score at the end of the episode of the agent without any modification to the agent's observation system, and Score_fixed_min, a fixed minimum score for a given game. Thus, we define the generalized impact as

I_general = (Score_clean − Score_adv) / (Score_clean − Score_fixed_min).

From Table 14 through Table 19 we set Score_fixed_min to 0 for BankHeist, 0 for JamesBond, −21 for Pong, 0 for Riverraid, and 0 for TimePilot.

