CROSS-STATE SELF-CONSTRAINT FOR FEATURE GENERALIZATION IN DEEP REINFORCEMENT LEARNING

Abstract

Representation learning from visual input is an important yet challenging task for deep reinforcement learning (RL). The feature space learned from visual input not only dominates the agent's ability to generalize to new environments but also affects data efficiency during training. To help the RL agent learn more general and discriminative representations across states, we present cross-state self-constraint (CSSC), a novel technique that regularizes the representation feature space by comparing representation similarity across different pairs of states. Based on the implicit feedback between state and action in the agent's experience, this constraint reinforces general feature recognition during learning and thus enhances generalization to unseen environments. We test the proposed method on the OpenAI ProcGen benchmark and observe significant improvement in generalization performance on most ProcGen games.

1. INTRODUCTION

Deep reinforcement learning has achieved tremendous success in mastering video games (Mnih et al., 2015) and the game of Go (Silver et al., 2017). When training agents with deep reinforcement learning algorithms, we usually assume that the agent can extract appropriate and effective features from different states and act accordingly. However, as a growing body of work has pointed out (Zhang et al., 2018; Song et al., 2019; Dabney et al., 2020), even well-trained RL agents that learn from visual input tend to memorize spurious patterns rather than understand the essential, generic features of a given state. For example, an agent might pay more attention to changes in the irrelevant background than to obstacles or enemies (Song et al., 2019). To improve generalization in new environments, various regularization methods such as dropout (Farebrother et al., 2018) and data augmentation (Laskin et al., 2020) have been proposed and tested in combination with reinforcement learning. Conventional methods like dropout and batch normalization have proven effective in supervised learning, and for self-supervised settings like RL we see multiple related applications across various environments. Data augmentations such as random crop (Laskin et al., 2020) and random convolution (Lee et al., 2019) have also been proposed recently and provide considerable generalization gains on unseen levels of several tested environments (Tassa et al., 2018; Cobbe et al., 2018; Cobbe et al., 2020). The agent acts on multiple augmented views of the same input and learns from this prior-injected data.
However, modifying state information (injecting priors into the data) may be risky or even detrimental for representation learning because vital features may be altered or lost (e.g., flipping the state image might change the corresponding behavioral meaning, and cropping the input image might discard critical features such as an enemy's position in the game). To avoid losing informative features of the visual input, we take a different approach. Human learners rarely depend on multiple augmented views of the same input to discriminate important from fictitious features; instead, they try to recognize general patterns across multiple states and act accordingly. In other words, if the same action (or behavior) is taken by a well-trained agent in two different states, we can infer that the agent has perceived similar feature patterns in those states. For example, if a car stops for ninety seconds at two different intersections, we would guess that it was stopped by a red light at both places (Figure 1b). From here we get an intuition about the relation between action (or behavior) and the representation feature space learned by the agent: for an RL agent acting rationally across a sequence of states, its behavior should be similar when it perceives similar critical patterns and different when it perceives different patterns. Based on this intuition, we design a novel constraint that performs regularization directly in the representation feature space of the learning agent. Our hypothesis for this constraint is simple: states in which a rational agent behaves identically should share more representational resemblance than states in which it behaves differently. We test this constraint in combination with Rainbow (Hessel et al., 2017), the state-of-the-art Q-learning method that combines numerous improvements to the original DQN model (Mnih et al., 2015).
Inspired by the pair-wise structure used in BPR-opt (Rendle et al., 2012), we design the self-constraint based on the agent's behavior and utilize implicit feedback between positive and negative state pairs in the replay buffer. Notably, no change to the underlying RL algorithm or model is needed to adopt the proposed method, and it can be applied to other models with minimal effort (Figure 1a). To measure the improvement in generalization to unseen environment levels, we test CSSC across 16 games of the OpenAI ProcGen benchmark (Cobbe et al., 2020) and observe significant improvement in generalization for most ProcGen games compared with the base Rainbow model. We highlight the main contributions of CSSC as follows:

• Directly optimizes feature generalization across various input states

• Requires no additional modification of the input data or the base model

2. RELATED WORK

2.1. REPRESENTATION LEARNING

Representation learning has been a vital part of deep RL algorithms. Even though the main goal of RL is to find the optimal value function, McCallum & Ballard (1996) and Li et al. (2006) show that a representation specialized to this function may not be suitable for the sequence of value functions leading to it. On the other hand, Dabney et al. (2020) argue that the path toward the optimal value function might be hindered by overfitting the representation to any intermediate value function during training. One popular and effective way to address this challenge is the use of auxiliary tasks (Jaderberg et al., 2016; Bellemare et al., 2019). Dabney et al. (2020) also propose new auxiliary tasks by considering the value-improvement path holistically.

2.2.1. ADDING STOCHASTICITY TO RL

Conventional practices like stochastic policies (Hausknecht & Stone, 2015), random starts (Mnih et al., 2015), sticky actions (Hausknecht & Stone, 2015), and frame skipping are widely used in popular tasks like Atari (Machado et al., 2017). By adding stochasticity to the environments during training and testing, we prevent simple algorithms like trajectory tree (Kearns et al., 1999) and brute (Machado et al., 2017) from optimizing over open-loop sequences of actions without even considering the input states. However, as Zhang et al. (2018) have pointed out, injecting stochasticity into maze-like environments does not necessarily prevent deep RL agents from overfitting.

2.2.2. DATA AUGMENTATION

Kostrikov et al. (2020) use augmented data and weighted Q-functions to achieve state-of-the-art data efficiency on DMControl (Tassa et al., 2018). Laskin et al. (2020) investigate ten different data augmentations for RL and point out that random crop is the most effective on DMControl (Tassa et al., 2018).

3. BACKGROUND

CSSC is a general framework of cross-state representation regularization for RL. In principle, one can apply CSSC to other variants of DQN or to policy-based models for discrete action-space environments. In this work we pick Rainbow (Hessel et al., 2017) and PPO (Schulman et al., 2017) as our base models to show that CSSC is compatible with the original Nature DQN (Mnih et al., 2015), its multiple improvements, and policy-based algorithms. In the following subsections we review Rainbow DQN and introduce the concept of implicit feedback in Bayesian Personalized Ranking (BPR) proposed by Rendle et al. (2012).

3.1. RAINBOW

In combination with a convolutional neural network as the encoder for visual input, Deep Q-Network (Mnih et al., 2015) demonstrates that a neural network can serve as a function approximator mapping raw pixels to Q-values. Since then, multiple improvements such as double Q-learning (van Hasselt et al., 2015), the dueling network architecture (Wang et al., 2015), prioritized experience replay (Schaul et al., 2015), and noisy networks (Fortunato et al., 2017) have been proposed. In addition, Bellemare et al. (2017) propose predicting a distribution over a set of possible value supports through the C51 algorithm. By combining all of the above techniques into a single off-policy algorithm, Rainbow DQN showcases state-of-the-art sample efficiency on the Atari benchmark. The resulting loss function for Rainbow DQN is:

L_Rainbow = D_KL(Φ_z d_t^(n) || d_t)    (1)

where Φ_z is the projection onto the fixed support z, and d_t^(n) and d_t are the n-step target return distribution and the model-predicted return distribution, respectively.
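To make eq. (1) concrete, below is a minimal NumPy sketch of the categorical projection Φ_z from C51 and the per-sample KL loss. The function names and the dense per-atom loop are illustrative only, not the paper's implementation; a practical version would vectorize the projection.

```python
import numpy as np

def project_distribution(next_probs, rewards, dones, z, gamma_n):
    """Phi_z: project the Bellman-updated support r + gamma^n * z
    back onto the fixed support z by splitting mass between the
    two nearest atoms (Bellemare et al., 2017)."""
    batch, n_atoms = next_probs.shape
    v_min, v_max = z[0], z[-1]
    dz = z[1] - z[0]
    # Bellman update of every atom, clipped to the support range
    tz = np.clip(rewards[:, None] + gamma_n * (1.0 - dones[:, None]) * z[None, :],
                 v_min, v_max)
    b = (tz - v_min) / dz                 # fractional atom index
    lo = np.floor(b).astype(int)
    hi = np.ceil(b).astype(int)
    proj = np.zeros_like(next_probs)
    for i in range(batch):
        for j in range(n_atoms):
            # distribute probability mass to the two neighboring atoms
            proj[i, lo[i, j]] += next_probs[i, j] * (hi[i, j] - b[i, j])
            proj[i, hi[i, j]] += next_probs[i, j] * (b[i, j] - lo[i, j])
            if lo[i, j] == hi[i, j]:      # b landed exactly on an atom
                proj[i, lo[i, j]] += next_probs[i, j]
    return proj

def rainbow_loss(proj_target, pred_probs, eps=1e-8):
    """Per-sample KL(Phi_z d_t^(n) || d_t) as in eq. (1)."""
    return np.sum(proj_target * (np.log(proj_target + eps)
                                 - np.log(pred_probs + eps)), axis=1)
```

When a transition is terminal (`dones = 1`), the whole update collapses onto the clipped reward atom, which the test below exercises.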

3.2. BPR-OPT

As opposed to explicit feedback like user ratings, implicit feedback in recommendation systems focuses on interactions such as clicks or views between users and items. In an implicit-feedback system only positive observations are available, and non-observed user-item pairs (e.g., an item a user has not yet viewed) are treated as negative observations. Instead of predicting ratings, BPR-opt (Rendle et al., 2012) directly optimizes ranking based on the implicit feedback between users and items. BPR extends user preference from observed interaction pairs to non-observed data by ranking positive observations above negative observations across the training data. The maximum posterior estimator for personalized ranking optimization (BPR-opt) can be written as:

ln p(Θ | >_u) = Σ_{(u,i,j)∈D_s} ln σ(x̂_uij) − λ_Θ ||Θ||²    (2)

where Θ represents the parameter vector of the base model, D_s represents the batch of samples from the training data, p(Θ | >_u) is the posterior probability conditioned on the latent preference structure >_u for user u, λ_Θ is the model-specific regularization parameter, σ is the sigmoid function, and x̂_uij(Θ) is a real-valued function of the model Θ that captures the preference relationship among user u, item i, and item j. Rendle et al. (2012) use x̂_uij := x̂_ui − x̂_uj to indicate that user u prefers item i over item j.
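As a toy illustration of eq. (2), the sketch below evaluates the BPR-opt objective under the common matrix-factorization assumption x̂_ui = ⟨user u, item i⟩. The function and variable names are ours, not from Rendle et al. (2012).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bpr_opt(user_emb, item_emb, triples, lam=0.01):
    """Eq. (2): sum of ln sigma(x_uij) over (user u, preferred item i,
    non-observed item j) triples, minus an L2 penalty on the parameters."""
    total = 0.0
    for u, i, j in triples:
        x_ui = user_emb[u] @ item_emb[i]   # predicted preference of u for i
        x_uj = user_emb[u] @ item_emb[j]
        total += np.log(sigmoid(x_ui - x_uj))   # x_uij := x_ui - x_uj
    reg = lam * (np.sum(user_emb ** 2) + np.sum(item_emb ** 2))
    return total - reg
```

An embedding that ranks the observed item above the non-observed one yields a higher (less negative) objective, which is exactly what maximizing eq. (2) encourages.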

4. METHOD

The main idea of CSSC is to extend similarity ranking across state representations based on our hypothesis: representations motivating identical behavior should share more similarity than those motivating different behavior. We name the method a "self-constraint" to emphasize that state pairing is decided by the agent's behavior rather than pre-defined fixed labels. In the following subsections we introduce: (i) the definition of the behavior of an agent at a given state, (ii) the implicit feedback between state and behavior, and (iii) the cross-state self-constraint as an auxiliary loss for representation regularization.

4.1. BEHAVIOR DEFINITION

We describe a typical Markov decision process (MDP) as (X, A, R, P, γ) with state space X, action space A, reward function R, state-transition function P, and discount factor γ. The agent takes an action a_i at state x_i and is then transitioned to x_{i+1} by the environment with step reward r_i. We define the behavior set B_i^n = {(a_i, a_{i+1}, ..., a_{i+n-1}) | a ∈ A} as the set of action series of length n taken by the agent starting from state x_i. For b_i^n ∈ B_i^n and b_j^n ∈ B_j^n, b_i^n = b_j^n only if a_{i+p} = a_{j+p} for all 0 ≤ p ≤ n−1. With this definition in mind, if the agent conducts the behavior b_i^2 = (left, fire) at state x_i, it moves left in state x_i and fires in the next state x_{i+1}. In the following we use the terms unigram, bigram, and trigram for behaviors of length one, two, and three, respectively.
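The behavior definition above can be sketched in a few lines of Python; the function name and action labels are illustrative.

```python
from typing import List, Tuple

def behavior(actions: List[str], i: int, n: int) -> Tuple[str, ...]:
    """b_i^n: the length-n series of actions the agent takes starting at x_i."""
    assert i + n <= len(actions), "trajectory too short for an n-gram at index i"
    return tuple(actions[i:i + n])

# Two behaviors are equal only if every action matches position-wise.
traj = ["left", "fire", "right", "left", "fire"]
b0 = behavior(traj, 0, 2)   # the bigram taken at x_0
b3 = behavior(traj, 3, 2)   # the bigram taken at x_3
b1 = behavior(traj, 1, 2)   # the bigram taken at x_1
```

Here b0 and b3 are both (left, fire), so states x_0 and x_3 share the same bigram behavior, while x_1 does not.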

4.2. IMPLICIT FEEDBACK IN REINFORCEMENT LEARNING

In the traditional deep Q-learning scenario, each state representation is highly correlated with its neighboring states through the Bellman equation during training. However, observational overfitting can still happen as long as the representations of neighboring states can fit a specific sub-optimal version of the Q-function indefinitely, as noted by Dabney et al. (2020). Beyond the base RL algorithm, we can view the interaction between state and behavior as an implicit-feedback system by regarding states sharing the same behavior as positive observations. The general features among these positive observations are reinforced when we compare their similarity with negative observations, i.e., states with different behaviors. Through this "ranking" procedure, the pair-wise relationship is extended across non-neighboring states, thereby preventing observational overfitting to a specific Q-function.
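The implicit-feedback view can be sketched as bucketing sampled states by their behavior: states in the same bucket are positive observations of each other, and states in different buckets serve as negatives, mirroring observed versus non-observed user-item pairs in BPR. The helper name and batch layout below are illustrative.

```python
from collections import defaultdict

def split_observations(transitions, n=1):
    """Group sampled states by their length-n behavior (action n-gram).
    States sharing a bucket form positive pairs; states in different
    buckets form negative pairs for the ranking constraint."""
    buckets = defaultdict(list)
    for state, actions in transitions:
        if len(actions) >= n:   # skip transitions too short for an n-gram
            buckets[tuple(actions[:n])].append(state)
    return buckets

# a toy batch of (state, subsequent actions) pairs
batch = [("s0", ["left"]), ("s1", ["left"]), ("s2", ["fire"])]
groups = split_observations(batch, n=1)
# (s0, s1) is a positive pair; (s0, s2) and (s1, s2) are negative pairs
```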

4.3. IMPLEMENTATION OF CROSS-STATE SELF-CONSTRAINT IN COMBINATION WITH RAINBOW AND PPO

For each sampled state triple (x_p, x_q, x_r) ∈ X with behaviors (b_p^n, b_q^n, b_r^n) ∈ B^n satisfying b_p^n = b_q^n ≠ b_r^n, we decompose the estimator x̂_pqr and define it as:

x̂_pqr := x̂_pq − x̂_pr = e_θ(x_p) · e_θ(x_q) − e_θ(x_p) · e_θ(x_r)    (3)

where e_θ is the encoder function with weights θ that maps the pixel input to a 1-D feature vector serving as the state representation. We directly take the inner product of e_θ(x_p) and e_θ(x_q) to compute the representation similarity x̂_pq between x_p and x_q. Note that the encoder is the same part of the base model used for feature extraction from visual input; for details of the neural network structures used in this paper, please refer to Figures 7 and 8. To sample state triples without modifying the base DQN or PPO algorithm, we collect them from the same batch of transitions used to calculate the Bellman loss or policy loss. For policy-based algorithms and DQN without prioritized replay (Schaul et al., 2015), we pair up state triples for each transition in the sample and calculate the CSSC loss as:

L_CSSC = −(1 / N_{D_s}) Σ_{(p,q,r)∈D_s} ln σ(x̂_pqr)    (4)

where D_s represents the sampled batch of size N_{D_s} from the replay buffer and σ is the sigmoid function. We then train the base algorithm in combination with the auxiliary CSSC loss:

L_total = L_base + β_cssc · L_CSSC    (5)

where β_cssc is a hyperparameter controlling the contribution of CSSC during training. For pseudocode of PPO with unigram CSSC, please refer to Algorithm 2 in the appendix. In the case of Rainbow with CSSC, we design the loss function in combination with the importance-sampling weights W_IS as:

L_total = (L_Rainbow ⊕ β_cssc · L_CSSC) ⊙ W_IS    (6)

where ⊕ and ⊙ are element-wise addition and multiplication, respectively. For pseudocode of Rainbow with unigram CSSC, please refer to Algorithm 1 in the appendix.
We find that setting β_cssc to 0.01 works well in most cases, so we use this setting for all experiments conducted in the next section.
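A minimal NumPy sketch of eqs. (3) to (5) follows: for each anchor state in a batch, it samples a positive partner with the same behavior and a negative one with a different behavior, then averages −ln σ(x̂_pqr). The function names are illustrative, and unlike the real method, which backpropagates through e_θ, this sketch only evaluates the loss on fixed feature vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

def cssc_loss(feats, behaviors, rng=rng):
    """Eq. (4) over triples sampled within one batch:
    x_pqr = e(x_p).e(x_q) - e(x_p).e(x_r)  (eq. (3)), loss = -ln sigma(x_pqr)."""
    losses = []
    for p in range(len(feats)):
        pos = [i for i in range(len(feats)) if i != p and behaviors[i] == behaviors[p]]
        neg = [i for i in range(len(feats)) if behaviors[i] != behaviors[p]]
        if not pos or not neg:
            continue                       # anchor has no valid triple in this batch
        q = pos[rng.integers(len(pos))]    # positive: same behavior
        r = neg[rng.integers(len(neg))]    # negative: different behavior
        x_pqr = feats[p] @ feats[q] - feats[p] @ feats[r]
        losses.append(-np.log(1.0 / (1.0 + np.exp(-x_pqr))))
    return float(np.mean(losses)) if losses else 0.0

def total_loss(base_loss, cssc, beta_cssc=0.01):
    """Eq. (5): auxiliary CSSC term added to the base RL loss."""
    return base_loss + beta_cssc * cssc
```

When features of same-behavior states are already aligned, the loss is near zero; when same-behavior states point in different directions, the loss grows, which is the gradient signal that pulls them together during training.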

5. EXPERIMENT

Our primary goal for CSSC is to enhance the generalization of RL algorithms to unseen levels that share a similar mechanism. OpenAI ProcGen (Cobbe et al., 2020) presents a collection of game-like environments where the training and testing environments differ in visual appearance and structure. We therefore evaluate CSSC in four ways: (i) generalization improvement on 12 easy-mode games and 8 hard-mode games with Rainbow on OpenAI ProcGen; (ii) generalization improvement on 16 easy-mode games with PPO on OpenAI ProcGen; (iii) performance improvement on 38 games with Rainbow on Gym Atari; and (iv) visualization of the representation space learned with CSSC.

5.2. GENERALIZATION ON PROCGEN IN EASY MODE WITH RAINBOW

We normalize the episodic return of each game by the constants from Cobbe et al. (2020) and show the results in Figure 2a and Figure 12a, respectively. We also display the learning curves of bigram-CSSC in Figure 3 and unigram-CSSC in Figure 13. We summarize the main findings below:

• In supervised learning, imposing regularization or constraints on a model usually improves its testing performance on unseen data at the expense of training performance. However, the proposed self-constraint improves both training and testing performance across most of the 12 games tested.

• For games like Starpilot, Chaser, and Bigfish, the testing performance is even better than the training performance of vanilla Rainbow.

• For the mean normalized score in Table 1, bigram-CSSC and unigram-CSSC bring 85% and 64% improvement, respectively, on testing performance compared with the base Rainbow model.

5.3. GENERALIZATION ON PROCGEN IN HARD MODE WITH RAINBOW

We normalize the episodic return of each game by the constants from Cobbe et al. (2020) and show the results in Figure 2b. We summarize the main findings below:

• As shown in Figure 4, CSSC substantially improves both training and testing performance in Bigfish, Dodgeball, Starpilot, Fruitbot, and Bossfight.

• In particular, we see a nearly 4x performance jump in Bigfish at 30M timesteps compared with the base Rainbow model. The mean episodic return of bigram-CSSC at 30M timesteps is even higher than that of PPO at 200M timesteps, as shown in Figure 4 of Cobbe et al. (2020).

• In the case of Heist (a puzzle-solving task in a maze-like layout), Coinrun (a 2-D scrolling platform game), and Plunder (a challenging shooting game), the gain from CSSC is less obvious because these games require careful manipulation and planning. Adding a constraint that modifies the state representation directly could be risky or even detrimental to the learning process of the base RL algorithm.

5.4. GENERALIZATION ON PROCGEN IN EASY MODE WITH PPO

We test CSSC with PPO on all 16 games of the ProcGen benchmark in easy mode. We train for 25 million timesteps across 3 seeds on 200 training levels and evaluate the generalization improvement on the full distribution of testing levels across all 16 ProcGen games. The normalized learning curves in Figure 5a show that CSSC helps reduce the gap between training and testing performance. For detailed final scores and learning curves, please refer to Table 7 and Figures 14 and 15 in the appendix.

5.5. PERFORMANCE ON GYM ATARI WITH RAINBOW

We take the Gym Atari benchmark (Brockman et al., 2016) as our second set of environments to measure the effectiveness of CSSC. Even though this classic benchmark does not explicitly split testing levels from training levels, we still see significant improvement on 23 out of 38 games tested, as shown in Figure 5b. For training hyperparameters and the scoreboard on the Gym Atari benchmark, please refer to Tables 3 and 8 in the appendix.

5.6. VISUALIZATION OF REPRESENTATION EMBEDDING

To give a more tangible explanation of the effectiveness of CSSC, we plot the representation distribution in the embedding feature space across 4096 states from the replay buffer. We first perform dimensionality reduction on these state embeddings using principal component analysis (PCA) and label each embedding point with the index of the corresponding action. Here we show the embedding spaces of both the Rainbow and bigram-CSSC models trained to play the Bigfish game. In Figure 6 we can see that states sharing the same action are more clustered under bigram-CSSC than under the vanilla Rainbow model. We believe this is a possible reason behind the significant improvement in generalization shown in Figure 4. For representation visualizations in other ProcGen games, please refer to A.4.
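The visualization pipeline can be sketched with an SVD-based PCA in NumPy (a stand-in for whatever PCA implementation is used; the random features below merely stand in for actual e_θ(x) embeddings extracted from the replay buffer):

```python
import numpy as np

def pca_2d(embeddings):
    """Project state embeddings (N x D) onto their top-2 principal components."""
    centered = embeddings - embeddings.mean(axis=0)
    # SVD of the centered data; rows of vt are the principal axes,
    # ordered by decreasing explained variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

rng = np.random.default_rng(0)
states = rng.normal(size=(4096, 512))   # placeholder for e_theta(x) features
coords = pca_2d(states)                 # (4096, 2) points to scatter-plot
```

Coloring each 2-D point by the greedy action then reveals whether same-action states cluster, as in Figure 6.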

6. CONCLUSION

In this paper we propose a novel regularization of state representation learning based on the connection between agent behavior and visual input. Our hypothesis derives from the observation that the behavior of a rational agent bears a certain relationship to general cross-state features or patterns. We see significant improvement in generalization brought by the proposed cross-state self-constraint (CSSC) on most games of the OpenAI ProcGen benchmark. The connection between behavior and perceived visual input can be considered a kind of "motivation" that acts as a decisive factor behind the learned policy. It is worth further study to better understand the derivation and transformation of the agent's "motivation" during the Markov decision process, and we believe the concept proposed in this work will facilitate more research in this direction.



Figure 1: Model structure and CSSC concept


Figure 2: Mean normalized score curve on OpenAI ProcGen environment. We normalized the episodic return of each game by the constants from Cobbe et al. (2020) and report the mean score. Every curve is smoothed with an exponential moving average of 0.95 to improve readability.

Figure 3: Learning Curve of Rainbow with bigram CSSC in 12 ProcGen Games. We report the raw episodic return for both training and testing. All final scores are listed in Table 5. Every curve is smoothed with an exponential moving average of 0.95 to improve readability.

Figure 4: Learning Curve of Rainbow with bigram CSSC in 8 ProcGen Games. We report the mean raw episodic return for training and testing. Every mean return is shown across 3 seeds and all final scores are listed in Table 6. Every curve is smoothed with an exponential moving average of 0.95 to improve readability.

Figure 5: Left: Mean normalized score curve on OpenAI ProcGen environment. We normalized the episodic return of each game by the constants from Cobbe et al. (2020) and report the mean score. Every curve is averaged across 3 seeds and is smoothed with an exponential moving average of 0.95 to improve readability. Right: the performance improvement of bigram CSSC with Rainbow on 38 games from the Gym Atari benchmark. All scores are recorded using the same random seed.

Figure 6: Representation embedding projected to 2D space by PCA. We use the same 4096 state frames to extract representation using the bigram-CSSC and vanilla Rainbow model and display the distribution across all state representation.

Figure 9: Dodgeball: representation embedding projected to 2D space by PCA


Figure 12: Mean normalized score on OpenAI ProcGen environment in easy mode. Every curve is smoothed with an exponential moving average of 0.95 to improve readability.

Figure 13: Learning Curve of Rainbow with unigram CSSC in 12 ProcGen Games. We report the raw episodic return for both training and testing. Every curve is smoothed with an exponential moving average of 0.95 to improve readability.

Figure 14: Learning Curves of PPO in 16 ProcGen Games on training levels. Every curve is averaged across 3 seeds and smoothed with an exponential moving average of 0.95 to improve readability.

Figure 15: Learning Curves of PPO with CSSC in 16 ProcGen Games on testing levels. Every curve is averaged across 3 seeds and smoothed with an exponential moving average of 0.95 to improve readability.


Mean normalized score of Rainbow in OpenAI ProcGen Easy Mode



Hyperparameters of Rainbow for OpenAI ProcGen

Hyperparameters of PPO for OpenAI ProcGen

Rainbow: Easy mode scores evaluated after 25M timesteps of training on 1 seed.

Rainbow: Hard mode scores evaluated after 30M timesteps of training on 3 seeds.

PPO: Easy mode scores evaluated after 25M timesteps of training on 3 seeds.

Algorithm 1: Rainbow with unigram CSSC
1: Input: minibatch size k, multi-step return n, replay period K, buffer size N, exponents α and β, discount factor γ, CSSC coefficient β_cssc, total timesteps T
2: Initialize replay memory H = ∅, feature extractor θ, return-distribution predictor Θ, p_1 = 1
3: Observe state s_0 and choose action a_0 ∼ π_θ(s_0)
4: for t = 1 to T do
5:   Store transition j_t = (s_t, a_t, r_t, d_t, s_{t+n}) in H with maximal priority p_t = max_{i<t} p_i
6:   if t mod K ≡ 0 then
7:     Sample a batch of transitions J = [j_0, j_1, ..., j_k | j ∼ P(j) = p_j^α / Σ_i p_i^α]
8:     Compute importance-sampling weights W = [w_0, w_1, ..., w_k | w_j = (N · P(j))^{−β} / max_i w_i]
9:     Predict state feature vectors e_θ(s_j), e_θ(s_{j+n}) and return distributions d_j = Θ(e_θ(s_j)), d_j^{(n)} from Θ(e_θ(s_{j+n})) for each transition in the batch
10:    Compute L_Rainbow = [l_0, l_1, ..., l_k | l_j = D_KL(Φ_z d_j^{(n)} || d_j)] using the distribution projection function Φ_z
11:    Update transition priority p_j ← l_j for each transition in the batch
12:    Sample positive pair (s_p, s_q), p ≠ q, with j_p, j_q ∈ J and behavior b_p^1 = b_q^1 for each transition j_p in J
13:    Sample negative pair (s_p, s_r), p ≠ r, with j_p, j_r ∈ J and behavior b_p^1 ≠ b_r^1 for each transition j_p in J

Gym Atari scores evaluated after 50M timesteps of training on 1 seed.

A APPENDIX

A.1 NEURAL NETWORK ARCHITECTURE

Algorithm 2: PPO with unigram CSSC (fragment)
5:  Compute advantage estimates Â_t using the current value function V_φ
6:  for each batch J in H do
9:    Predict state feature vectors e_θ(s_j), log-probabilities, and values using π_θ and V_φ for the transitions in the batch
12:   Compute L_policy as the mean policy loss
13:   Sample positive pair (s_p, s_q), p ≠ q, with j_p, j_q ∈ J and behavior b_p^1 = b_q^1 for each transition j_p in J
14:   Sample negative pair (s_p, s_r), p ≠ r, with j_p, j_r ∈ J and behavior b_p^1 ≠ b_r^1 for each transition j_p in J
15:   Compute L_CSSC = [x̂_{p_0 q_0 r_0}, ..., x̂_{p_k q_k r_k} | x̂_{p_j q_j r_j} := −ln σ(e_θ(s_{p_j}) · e_θ(s_{q_j}) − e_θ(s_{p_j}) · e_θ(s_{r_j}))]

