TEMPORAL CHANGE SENSITIVE REPRESENTATION FOR REINFORCEMENT LEARNING

Abstract

Image-based deep reinforcement learning has improved greatly in recent years by combining state-of-the-art reinforcement learning algorithms with self-supervised representation learning. However, these self-supervised representation learning algorithms are designed to preserve global visual information, and may therefore miss changes in the observation that are important for performing the task, as illustrated in Figure 1. To resolve this problem, self-supervised representation learning specifically designed to preserve task-relevant information is necessary. Following this idea, we introduce Temporal Change Sensitive Representation (TCSR), a self-supervised auxiliary loss designed for reinforcement learning algorithms that learn a latent dynamics model. TCSR enforces the latent state representation of the agent to put more emphasis on the parts of the observation that could change in the future. Our method achieves state-of-the-art performance on the Atari 100K benchmark.



Figure 1: The ground-truth observations compared with images reconstructed from the latent state representations predicted by TCSR and EfficientZero. TCSR not only predicts the movement of enemies in the short term (marked in the yellow box) but also predicts exactly when and where the UFO will release a new enemy up to the end of the planning horizon (marked in the red box). EfficientZero fails to predict both of these changes. This shows that TCSR is more sensitive to changes in the latent state representation; these changes include, but are not limited to, the position, appearance, and disappearance of task-related objects, as shown in this figure.

1. INTRODUCTION

Deep reinforcement learning has achieved much success in solving image-based tasks over the last several years. A critical step in solving image-based tasks is learning a good representation of the image input. One of the biggest challenges in learning a good representation for reinforcement learning is that the reward is sparse (Shelhamer et al., 2016), which cannot generate enough training signal for the representation network. To resolve this problem, a self-supervised representation learning loss is often added to facilitate training.

There are many different approaches to image-based reinforcement learning. Most of them combine state-of-the-art model-based or model-free backbones such as SAC (Haarnoja et al., 2018), Rainbow (Hessel et al., 2018), and MuZero (Schrittwieser et al., 2020) with self-supervised representation learning algorithms to boost the training of the representation. Among these methods, SPR (Schwarzer et al., 2020) and EfficientZero (Ye et al., 2021) are the state-of-the-art model-free and model-based methods on the Atari 100K benchmark; combined, they achieve the best score in 21 out of 26 Atari 100K games. Both train a dynamics model to predict future latent states from an initial latent state computed by the image encoder, and both the image encoder and the dynamics model are trained using a SimSiam (Chen & He, 2020) loss between the predicted latent state and the latent state computed directly from the future observation. However, most representation learning algorithms used in reinforcement learning do not emphasize changes in visual information, while creatures, including humans, are innately sensitive to such changes. An important part of the visual system is the middle temporal visual area (MT) (Von Bonin & Bailey, 1947), where visual information is integrated and differentiated to capture the movement of objects (Allman et al., 1985). The ability to capture changes in visual information helps creatures catch prey or escape predators (Maturana et al., 1960; Suzuki et al., 2019). To help reinforcement learning agents acquire such an ability, we propose Temporal Change Sensitive Representation (TCSR), a self-supervised auxiliary loss specifically designed for reinforcement learning methods that have a latent dynamics model.
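The consistency loss used by SPR and EfficientZero can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes latent states are plain vectors and uses a negative-cosine-similarity objective in the spirit of SimSiam, omitting the projection/prediction heads and stop-gradient machinery of the real networks; all function names here are hypothetical.

```python
import numpy as np

def neg_cosine(a, b):
    """Negative cosine similarity between two latent vectors."""
    return -float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def latent_consistency_loss(unrolled_latents, target_latents):
    """SimSiam-style consistency: each latent state unrolled by the
    dynamics model at step t+k is pulled toward the target latent
    obtained by encoding the real future observation o_{t+k}.
    (Targets would be treated as stop-gradient in practice.)"""
    losses = [neg_cosine(p, t) for p, t in zip(unrolled_latents, target_latents)]
    return sum(losses) / len(losses)
```

When the unrolled latents match the targets exactly, the loss reaches its minimum of -1 per step.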
TCSR enforces the difference between two consecutive unrolled latent states to match the difference between two target latent states generated from two consecutive observations with the same augmentation. TCSR uses EfficientZero (Ye et al., 2021) as its backbone and inherits most of its hyper-parameters. On the Atari 100K benchmark, TCSR surpasses EfficientZero in 19 out of 26 games (as shown in Figure 2) and achieves new state-of-the-art performance.
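The idea of matching consecutive latent-state differences can be sketched as below. This is a hedged illustration under simplifying assumptions, not the paper's exact formulation: latents are plain vectors, and a negative cosine similarity between the temporal differences stands in for whatever distance the full method uses; the function name is hypothetical.

```python
import numpy as np

def tcsr_loss(unrolled_latents, target_latents):
    """Temporal-change consistency: the change between consecutive
    unrolled latent states (from the dynamics model) is aligned with
    the change between consecutive target latent states (from encoding
    consecutive observations under the same augmentation)."""
    total, steps = 0.0, 0
    for k in range(len(unrolled_latents) - 1):
        d_pred = unrolled_latents[k + 1] - unrolled_latents[k]    # predicted change
        d_target = target_latents[k + 1] - target_latents[k]      # observed change
        total += -float(np.dot(d_pred, d_target)
                        / (np.linalg.norm(d_pred) * np.linalg.norm(d_target) + 1e-8))
        steps += 1
    return total / steps
```

Because the loss acts on differences rather than on the latent states themselves, it directly penalizes failure to represent what changes between frames, which is exactly the information a plain per-step consistency loss can under-weight.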

Figure 2: Improvement in human-normalized score from adding TCSR as an extra self-supervised representation learning auxiliary loss on the EfficientZero backbone. TCSR surpasses EfficientZero in 19 out of 26 games in the Atari 100K benchmark.

