R-LATTE: VISUAL CONTROL VIA DEEP REINFORCEMENT LEARNING WITH ATTENTION NETWORK

Abstract

Attention mechanisms are generic inductive biases that have played a critical role in improving the state of the art in supervised learning, unsupervised pre-training, and generative modeling across multiple domains, including vision, language, and speech. However, they remain relatively under-explored in the neural network architectures typically used for reinforcement learning (RL) from high-dimensional inputs such as pixels. In this paper, we propose and study the effectiveness of augmenting the convolutional encoder of an RL agent with a simple attention module. Through experiments on the widely benchmarked DeepMind Control Suite environments, we demonstrate that our proposed module can (i) extract interpretable task-relevant information, such as agent locations and movements, without the need for data augmentations or contrastive losses; and (ii) significantly improve the sample-efficiency and final performance of the agents. We hope our simple and effective approach will serve as a strong baseline for future research incorporating attention mechanisms in reinforcement learning and control.

1. INTRODUCTION

Attention plays a crucial role in human cognitive processing: a commonly accepted hypothesis is that humans rely on a combination of top-down selection (e.g., zooming into certain parts of our perceptual field) and bottom-up disruption (e.g., being distracted by a novel stimulus) to prioritize the information most likely to matter for survival or task completion (Corbetta & Shulman, 2002; Kastner & Ungerleider, 2000). In machine learning, attention mechanisms for neural networks have been studied for various purposes in multiple domains, such as computer vision (Xu et al., 2015; Zhang et al., 2019a), natural language processing (Vaswani et al., 2017; Brown et al., 2020), and speech recognition (Chorowski et al., 2015; Bahdanau et al., 2016). For example, Zagoruyko & Komodakis (2017) proposed transferring knowledge between two neural networks by aligning the activation (attention) maps of one network with the other. Springenberg et al. (2014) and Selvaraju et al. (2017) proposed gradient-based methods to extract attention maps from neural networks for interpretability. Attention is also behind the success of Transformers (Vaswani et al., 2017), which use a self-attention mechanism to model dependencies in long language sequences. However, attention mechanisms have received relatively little attention in deep reinforcement learning (RL), even though this generic inductive bias has the potential to improve the sample-efficiency of agents, especially in pixel-based environments. More frequently explored directions in RL from pixels are unsupervised/self-supervised learning (Oord et al., 2018; Srinivas et al., 2020; Lee et al., 2020; Stooke et al., 2020; Kipf et al., 2020). 1 Another promising direction has focused on latent variable modeling (Watter et al., 2015; Zhang et al., 2019b; Hafner et al., 2020; Sekar et al., 2020; Watters et al., 2019): Hafner et al. (2019) proposed leveraging a world model in latent space for planning, and Hafner et al. (2020) utilized a latent dynamics model to generate synthetic roll-outs. However, such representation learning methods may incur expensive back-and-forth costs (e.g., hyperparameter tuning). This motivates our search for a more effective neural network architecture for the convolutional encoder of an RL agent.
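To make the idea concrete, the sketch below shows one common way to add a self-attention block over the spatial positions of a convolutional feature map, in the style of the self-attention modules cited above (e.g., Zhang et al., 2019a). This is a minimal NumPy illustration, not the paper's actual module: the projection matrices `wq`, `wk`, `wv`, the residual gain `gamma`, and all shapes are hypothetical choices for exposition.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_self_attention(feat, wq, wk, wv, gamma=0.1):
    """Self-attention over the spatial positions of a conv feature map.

    feat: (C, H, W) feature map from the convolutional encoder.
    wq, wk, wv: (C, C) projection matrices (learned in practice).
    Returns the residually attended feature map and the (H*W, H*W)
    attention weights, which can be visualized for interpretability.
    """
    C, H, W = feat.shape
    x = feat.reshape(C, H * W)                       # flatten spatial grid
    q, k, v = wq @ x, wk @ x, wv @ x                 # queries/keys/values
    attn = softmax(q.T @ k / np.sqrt(C), axis=-1)    # (N, N); rows sum to 1
    out = v @ attn.T                                 # re-weight values per position
    return feat + gamma * out.reshape(C, H, W), attn

# Toy usage: an 8-channel 4x4 feature map with random projections.
rng = np.random.default_rng(0)
feat = rng.normal(size=(8, 4, 4))
wq, wk, wv = (0.1 * rng.normal(size=(8, 8)) for _ in range(3))
attended, attn = spatial_self_attention(feat, wq, wk, wv)
```

Because the output shape matches the input, such a block can be dropped between convolutional layers of an existing encoder without changing the rest of the agent.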

1: Jaderberg et al. (2017) introduced unsupervised auxiliary tasks, such as the Pixel Control task; Srinivas et al. (2020) applied contrastive learning for data-efficient RL; and Stooke et al. (2020) further improved these gains using temporal contrast.
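For reference, the activation-based attention maps mentioned in the introduction (Zagoruyko & Komodakis, 2017) are cheap to compute from an existing encoder. The sketch below is one common formulation, not necessarily the exact one used in that work: sum the p-th power of absolute channel activations at each spatial position, then normalize.

```python
import numpy as np

def activation_attention_map(feat, p=2):
    """Spatial attention map from conv activations, in the spirit of
    Zagoruyko & Komodakis (2017): sum |activation|**p over channels,
    then normalize the map to unit L2 norm.

    feat: (C, H, W) activations of a conv layer.  Returns an (H, W) map.
    """
    amap = (np.abs(feat) ** p).sum(axis=0)   # collapse channels -> (H, W)
    return amap / np.linalg.norm(amap)       # scale-invariant normalization

# Toy usage on random activations.
amap = activation_attention_map(np.random.default_rng(1).normal(size=(16, 6, 6)))
```

Maps like this are what the interpretability claims in the abstract refer to: high values indicate spatial locations the encoder responds to most strongly.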

