CROSS-STATE SELF-CONSTRAINT FOR FEATURE GENERALIZATION IN DEEP REINFORCEMENT LEARNING

Abstract

Representation learning on visualized input is an important yet challenging task for deep reinforcement learning (RL). The feature space learned from visualized input not only dominates the agent's generalization ability in new environments but also affects data efficiency during training. To help the RL agent learn more general and discriminative representations across various states, we present cross-state self-constraint (CSSC), a novel technique that regularizes the representation feature space by comparing representation similarity across different pairs of states. Based on the implicit feedback between state and action in the agent's experience, this constraint reinforces general feature recognition during the learning process and thus enhances generalization to unseen environments. We test our proposed method on the OpenAI ProcGen benchmark and observe significant improvement in generalization performance across most of the ProcGen games.

1. INTRODUCTION

Deep reinforcement learning has achieved tremendous success in mastering video games (Mnih et al., 2015) and the game of Go (Silver et al., 2017). When training agents with deep reinforcement learning algorithms, we usually assume that the agent can extract appropriate and effective features from different states and take actions accordingly. However, as a growing number of research works have pointed out (Zhang et al. (2018); Song et al. (2019); Dabney et al. (2020)), even well-trained RL agents that learn from visualized input tend to memorize spurious patterns rather than understand the essential, generic features of a given state. For example, an agent might pay more attention to changes in the irrelevant background than to obstacles or enemies (Song et al., 2019). To improve generalization in new environments, various regularization methods such as dropout (Farebrother et al., 2018) and data augmentation (Laskin et al., 2020) have been proposed and tested in combination with reinforcement learning. Conventional methods like dropout and batch normalization have been proven effective in supervised learning, and for self-supervised learning settings like RL we see multiple related applications across various environments. Data augmentations such as random crop (Laskin et al., 2020) or random convolution (Lee et al., 2019) have also been proposed recently and provide considerable generalization improvements on unseen levels of various tested environments (Tassa et al. (2018); Cobbe et al. (2018); Cobbe et al. (2020)). In these approaches, the agent acts on multiple augmented views of the same input and learns from the prior-injected data. However, modifying state information (injecting priors into the data) may be risky or even detrimental for representation learning, because vital features may be altered or lost (e.g., flipping the state image might change its behavioral meaning, and cropping the input image might lose critical features such as an enemy's position in the game).

To avoid losing informative features of the visualized input, we choose a different approach. As human learners, we rarely depend on multiple augmented views of the same input to discriminate important from fictitious features. Instead, human learners try to recognize general patterns across multiple states and act accordingly. In other words, if a well-trained agent has performed the same action (or behavior) in two different states, we would infer that the agent has recognized similar feature patterns in those states. For example, if a car stops for ninety seconds at two different intersections, we would guess that the car was stopped by a red light at both places (Figure 1b).
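The intuition above — states in which the agent takes the same action should map to similar representations — can be sketched as a pairwise penalty on the encoder's feature space. The following is a minimal, illustrative sketch, not the paper's exact loss: the function name `cssc_penalty`, the choice of cosine similarity, and the specific attract/repel terms are our assumptions for illustration.

```python
import numpy as np

def cssc_penalty(features, actions):
    """Hypothetical cross-state self-constraint penalty (illustrative only).

    features: (N, D) array of state representations from the encoder.
    actions:  (N,) array of actions the agent took in those states.

    Pairs of states sharing the same action are pulled toward high
    cosine similarity; pairs with different actions are pushed apart.
    """
    # L2-normalize so the dot product equals cosine similarity.
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = normed @ normed.T                           # (N, N) pairwise similarity
    same_action = (actions[:, None] == actions[None, :]).astype(float)
    off_diag = 1.0 - np.eye(len(actions))             # ignore self-pairs

    # Attract: penalize dissimilarity for same-action pairs.
    pos = (1.0 - sim) * same_action * off_diag
    # Repel: penalize (positive) similarity for different-action pairs.
    neg = np.maximum(sim, 0.0) * (1.0 - same_action)

    return (pos + neg).sum() / off_diag.sum()         # mean over ordered pairs
```

In practice a term like this would be added, with a weighting coefficient, to the usual policy-gradient objective, so the encoder is trained jointly by the RL loss and the cross-state constraint.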

