PRETRAINING THE VISION TRANSFORMER USING SELF-SUPERVISED METHODS FOR VISION-BASED DEEP REINFORCEMENT LEARNING

Abstract

The Vision Transformer architecture has proven competitive in computer vision (CV), where it has dethroned convolution-based networks in several benchmarks. Nevertheless, Convolutional Neural Networks (CNN) remain the preferred architecture for the representation module in Reinforcement Learning. In this work, we study pretraining a Vision Transformer using several state-of-the-art self-supervised methods and assess the data-efficiency gains from this training framework. We propose a new self-supervised learning method called TOV-VICReg that extends VICReg to better capture temporal relations between observations by adding a temporal order verification task. Furthermore, we evaluate the resulting encoders on Atari games in a sample-efficiency regime. Our results show that the Vision Transformer, when pretrained with TOV-VICReg, outperforms the other self-supervised methods but still struggles to overcome a CNN. Nevertheless, it outperforms a CNN in two of the ten games used in our 100k-steps evaluation. Ultimately, we believe that such approaches in Deep Reinforcement Learning (DRL) might be the key to achieving new levels of performance, as seen in natural language processing and computer vision.

1. INTRODUCTION

Despite the successes of deep reinforcement learning agents in the last decade, they still require a large amount of data or interactions to learn good policies. This data inefficiency makes current methods difficult to apply to environments where interactions are expensive or data is scarce, which is the case in many real-world applications. In partially observable environments, where the agent does not have full access to the current state, this problem becomes even more prominent, since the agent needs to learn not only the state-to-action mapping but also a state representation function that tries to be informative about the state given an observation. In contrast, humans learning a new task already have a well-developed visual system and a good model of the world, components that allow them to easily learn new tasks. Previous works have tried to tackle the sample-inefficiency problem by using auxiliary learning tasks (Schwarzer et al., 2021b; Stooke et al., 2021; Guo et al., 2020) that help the network's encoder learn good representations of the observations given by the environment. These tasks can be supervised or unsupervised and can take place during a pretraining phase or during the reinforcement learning (RL) phase, in a joint-learning or decoupled-learning scheme. In recent years, self-supervised learning has proven very useful in computer vision; the increasing interest in this area has produced new and improved methods that train a network to learn important features from the data using only the data itself as supervision. A common approach to evaluating such methods is to train a network composed of the pretrained encoder, with its parameters frozen, paired with a linear layer on popular datasets, like ImageNet.
These evaluations have shown that such methods achieve high scores on different benchmarks, which demonstrates how well current state-of-the-art methods encode useful information from the given images without being task-specific. Additionally, it has been shown that pretraining a network using self-supervised (or unsupervised) learning adds robustness to the network and gives it better generalization capabilities (Erhan et al., 2010). Recently, a new architecture for vision-based tasks called the Vision Transformer (ViT) (Dosovitskiy et al., 2020) has shown impressive results in several benchmarks without using any convolutions. This architecture has much weaker inductive biases than a CNN, which can result in lower data efficiency. However, the Vision Transformer, unlike a CNN, can capture relations between parts of an image (patches) that are far apart from each other, thus deriving global information that can help the model perform better on certain tasks. Furthermore, when the model is pretrained, using supervised or self-supervised learning, it manages to surpass the best convolution-based models in terms of task performance. Nonetheless, despite these successes in computer vision, such results are yet to be seen in reinforcement learning. Motivated by the potential of the Vision Transformer, in particular when paired with a pretraining phase, and by the increasing interest in self-supervised tasks for DRL, we study pretraining a ViT using state-of-the-art (SOTA) self-supervised learning methods and use it as the representation module in a deep RL algorithm. Consequently, we propose extending VICReg (Variance-Invariance-Covariance Regularization) (Bardes et al., 2022) with a temporal order verification task (Misra et al., 2016) to help the model better capture the temporal relations between consecutive observations. We name this approach Temporal Order Verification-VICReg, or TOV-VICReg for short.
While we could have adapted any of the other methods, we opted for VICReg due to its computational performance, simplicity, and good results in early experiments and in metrics such as the ones presented in Section 7. After our empirical results in the Atari games, we present a small study of the pretrained encoders using several metrics to understand whether they suffer from any representational collapse, and we also analyse the learned representations using similarity matrices and attention maps. Our main contributions are:

• We propose a new self-supervised learning method which extends VICReg to capture the temporal relations between consecutive frames through a temporal order verification task (Section 4).

• We pretrain a Vision Transformer using several SOTA self-supervised methods and our proposed method, and study them through metrics (Section 7), visualizations (Section 8), and fine-tuning in reinforcement learning (Section 6), where we show that the temporal relations learned by the model pretrained with our method contribute to a relevant increase in data efficiency.
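To make the combination concrete, the following is a minimal PyTorch sketch of a TOV-VICReg-style objective: the three VICReg regularization terms (invariance, variance, covariance) computed on the embeddings of two consecutive frames, plus a binary cross-entropy term for a temporal-order-verification head. The function names, default loss weights, and the exact form of the order head are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def vicreg_loss(z_a, z_b, sim_w=25.0, var_w=25.0, cov_w=1.0, eps=1e-4):
    """VICReg terms (Bardes et al., 2022): invariance, variance, covariance."""
    n, d = z_a.shape
    # Invariance: mean-squared error between the two embeddings.
    inv = F.mse_loss(z_a, z_b)
    # Variance: hinge that keeps each embedding dimension's std above 1.
    std_a = torch.sqrt(z_a.var(dim=0) + eps)
    std_b = torch.sqrt(z_b.var(dim=0) + eps)
    var = F.relu(1.0 - std_a).mean() + F.relu(1.0 - std_b).mean()
    # Covariance: push off-diagonal covariances toward zero to decorrelate dims.
    def cov_term(z):
        z = z - z.mean(dim=0)
        cov = (z.T @ z) / (n - 1)
        off_diag = cov - torch.diag(torch.diag(cov))
        return off_diag.pow(2).sum() / d
    cov = cov_term(z_a) + cov_term(z_b)
    return sim_w * inv + var_w * var + cov_w * cov

def tov_vicreg_loss(z_t, z_tp1, order_logits, order_labels, tov_w=1.0):
    """Hypothetical TOV-VICReg objective: VICReg on embeddings of consecutive
    frames plus a binary temporal-order-verification term (Misra et al., 2016),
    where order_labels mark whether a frame sequence is in the correct order."""
    tov = F.binary_cross_entropy_with_logits(order_logits, order_labels)
    return vicreg_loss(z_t, z_tp1) + tov_w * tov
```

In this sketch, `z_t` and `z_tp1` would be the encoder's embeddings of frames at times t and t+1, and `order_logits` the output of a small classification head fed with (possibly shuffled) frame embeddings.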

2. RELATED WORK

Pretraining representations Previous work, similarly to our approach, has explored pretraining representations using self-supervised methods, which led to large data-efficiency improvements in the fine-tuning phase (Schwarzer et al., 2021b; Zhan et al., 2020). SPR (Schwarzer et al., 2021a) uses an auxiliary task that trains the encoder, followed by an RNN, to predict the encoder's representation k steps into the future. PSEs (Agarwal et al., 2021a) combine a policy similarity metric (PSM), which measures the similarity of states in terms of the policy's behaviour in those states, with a contrastive task for the embeddings (CME) that helps to learn more robust representations. PBL (Guo et al., 2020) learns representations through an interdependence between an encoder, trained to be informative about the history that led to an observation, and an RNN, trained to predict the representations of future observations. Proto-RL (Yarats et al., 2021) learns representations together with a task-agnostic exploration policy during a pretraining phase; such pretrained representations achieve similar or superior results in evaluation tasks, like AtariARI (Anand et al., 2020). Others have pretrained representations using RL algorithms, like DQN, and transferred those learned representations to a new learning task (Wang et al., 2022).
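As an illustration of the latent future-prediction auxiliary task described above, the sketch below rolls an encoder representation forward k steps with a GRU cell and scores the rollout against the representations of future frames with a cosine loss. The class and function names, the choice of GRU cell, and the cosine loss are simplifying assumptions rather than SPR's exact architecture.

```python
import torch
import torch.nn as nn

class LatentPredictor(nn.Module):
    """Illustrative sketch: roll an encoder's representation forward in
    latent space and train the rollout to match the encoder's
    representations of future frames."""
    def __init__(self, repr_dim=64):
        super().__init__()
        self.rnn = nn.GRUCell(repr_dim, repr_dim)

    def rollout(self, z0, k):
        # Predict k future latent states, starting from the current one.
        preds, h = [], z0
        for _ in range(k):
            h = self.rnn(h, h)
            preds.append(h)
        return torch.stack(preds, dim=1)  # (batch, k, repr_dim)

def prediction_loss(pred, target):
    # Cosine-similarity loss between predicted and target latents.
    return 1.0 - nn.functional.cosine_similarity(pred, target, dim=-1).mean()
```

At training time, `target` would be the (stop-gradient or momentum-encoder) representations of the k observed future frames, so the encoder is pushed to produce predictable, temporally consistent latents.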

